no code implementations • 30 Apr 2024 • Yunhao Ge, Xiaohui Zeng, Jacob Samuel Huffman, Tsung-Yi Lin, Ming-Yu Liu, Yin Cui
VFC consists of three steps: 1) proposal, where image-to-text captioning models propose multiple initial captions; 2) verification, where a large language model (LLM) utilizes tools such as object detection and VQA models to fact-check proposed captions; 3) captioning, where an LLM generates the final caption by summarizing caption proposals and the fact check verification results.
1 code implementation • 6 Jul 2023 • Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, Mikhail Sirotenko, Huisheng Wang, Florian Schroff, Hartwig Adam, Ming-Hsuan Yang, Ting Liu, Boqing Gong
We evaluate existing foundation models video understanding capabilities using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods tailoring a foundation model (FM) for a downstream task.
no code implementations • NeurIPS 2023 • Hassan Akbari, Dan Kondratyuk, Yin Cui, Rachel Hornung, Huisheng Wang, Hartwig Adam
We conduct extensive empirical studies and reveal the following key insights: 1) Performing gradient descent updates by alternating on diverse modalities, loss functions, and tasks, with varying input resolutions, efficiently improves the model.
Ranked #1 on Zero-Shot Action Recognition on Kinetics (using extra training data)
no code implementations • 29 Mar 2023 • Guan Zhe Hong, Yin Cui, Ariel Fuxman, Stanley H. Chan, Enming Luo
Furthermore, we perform comprehensive experiments using the label hierarchies of iNaturalist 2021 and observe that the following conditions, in addition to proper choice of label granularity, enable the transfer to work well in practice: 1) the pretraining dataset needs to have a meaningful label hierarchy, and 2) the pretraining and target label functions need to align well.
1 code implementation • ICCV 2023 • Long Zhao, Liangzhe Yuan, Boqing Gong, Yin Cui, Florian Schroff, Ming-Hsuan Yang, Hartwig Adam, Ting Liu
To address this challenge, we propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models (VLMs).
Human-Object Interaction Detection Relationship Detection +2
no code implementations • 13 Feb 2023 • James Urquhart Allingham, Jie Ren, Michael W Dusenberry, Xiuye Gu, Yin Cui, Dustin Tran, Jeremiah Zhe Liu, Balaji Lakshminarayanan
In particular, we ask "Given a large pool of prompts, can we automatically score the prompts and ensemble those that are most suitable for a particular downstream dataset, without needing access to labeled validation data?".
no code implementations • CVPR 2023 • Hong-You Chen, Yandong Li, Yin Cui, Mingda Zhang, Wei-Lun Chao, Li Zhang
We study the problem of how to train a "personalization-friendly" model such that given only the task descriptions, the model can be adapted to different end-users' needs, e. g., for accurately classifying different subsets of objects.
1 code implementation • 20 Oct 2022 • Digbalay Bose, Rajat Hebbar, Krishna Somandepalli, Haoyang Zhang, Yin Cui, Kree Cole-McLaughlin, Huisheng Wang, Shrikanth Narayanan
Longform media such as movies have complex narrative structures, with events spanning a rich variety of ambient visual scenes.
1 code implementation • 30 Sep 2022 • Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, Anelia Angelova
We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models.
no code implementations • 15 Jul 2022 • Rui Qian, Yeqing Li, Zheng Xu, Ming-Hsuan Yang, Serge Belongie, Yin Cui
Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is becoming a promising paradigm for open-vocabulary visual recognition.
Ranked #1 on Zero-Shot Action Recognition on HMDB51
1 code implementation • ICLR 2022 • Juntang Zhuang, Boqing Gong, Liangzhe Yuan, Yin Cui, Hartwig Adam, Nicha Dvornek, Sekhar Tatikonda, James Duncan, Ting Liu
Instead, we define a \textit{surrogate gap}, a measure equivalent to the dominant eigenvalue of Hessian at a local minimum when the radius of the neighborhood (to derive the perturbed loss) is small.
1 code implementation • 22 Dec 2021 • Golnaz Ghiasi, Xiuye Gu, Yin Cui, Tsung-Yi Lin
We propose OpenSeg to address the above issue while still making use of scalable image-level supervision of captions.
no code implementations • 14 Dec 2021 • Qing Li, Boqing Gong, Yin Cui, Dan Kondratyuk, Xianzhi Du, Ming-Hsuan Yang, Matthew Brown
The experiments show that the resultant unified foundation transformer works surprisingly well on both the vision-only and text-only tasks, and the proposed knowledge distillation and gradient masking strategy can effectively lift the performance to approach the level of separately-trained models.
1 code implementation • CVPR 2022 • Liangzhe Yuan, Rui Qian, Yin Cui, Boqing Gong, Florian Schroff, Ming-Hsuan Yang, Hartwig Adam, Ting Liu
Modern self-supervised learning algorithms typically enforce persistency of instance representations across views.
no code implementations • 8 Dec 2021 • Rui Qian, Yeqing Li, Liangzhe Yuan, Boqing Gong, Ting Liu, Matthew Brown, Serge Belongie, Ming-Hsuan Yang, Hartwig Adam, Yin Cui
The training objective consists of two parts: a fine-grained temporal learning objective to maximize the similarity between corresponding temporal embeddings in the short clip and the long clip, and a persistent temporal learning objective to pull together global embeddings of the two clips.
5 code implementations • 3 Sep 2021 • Xianzhi Du, Yeqing Li, Yin Cui, Rui Qian, Jing Li, Irwan Bello
A recent work from Bello shows that training and scaling strategies may be more significant than model architectures for visual recognition.
Ranked #40 on Action Classification on Kinetics-600
no code implementations • 17 Aug 2021 • Chun-Han Yao, Boqing Gong, Yin Cui, Hang Qi, Yukun Zhu, Ming-Hsuan Yang
We further take the server-client and inter-client domain shifts into account and pose a domain adaptation problem with one source (centralized server data) and multiple targets (distributed client data).
2 code implementations • 25 Jun 2021 • Boyi Li, Yin Cui, Tsung-Yi Lin, Serge Belongie
In this paper, we propose and explore the problem of image translation for data augmentation.
no code implementations • 18 Jun 2021 • Marco Fornoni, Chaochao Yan, Liangchen Luo, Kimberly Wilber, Alex Stark, Yin Cui, Boqing Gong, Andrew Howard
When interacting with objects through cameras, or pictures, users often have a specific intent.
4 code implementations • ICLR 2022 • Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui
On COCO, ViLD outperforms the previous state-of-the-art by 4. 8 on novel AP and 11. 4 on overall AP.
Ranked #2 on Open Vocabulary Object Detection on Objects365
2 code implementations • NeurIPS 2021 • Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong
We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.
Ranked #3 on Zero-Shot Video Retrieval on YouCook2 (text-to-video Mean Rank metric)
5 code implementations • CVPR 2021 • Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D. Cubuk, Quoc V. Le, Barret Zoph
Our baseline model outperforms the LVIS 2020 Challenge winning entry by +3. 6 mask AP on rare categories.
Ranked #1 on Object Detection on PASCAL VOC 2007
no code implementations • ECCV 2020 • Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Yin Cui, Mingxing Tan, Quoc Le, Xiaodan Song
Furthermore, SpineNet is built with a uniform resource distribution over operations.
1 code implementation • 9 Sep 2020 • Yi Zhou, Yin Cui, Xiaoke Xu, Jidong Suo, Xiaoming Liu
It is challenging to detect small-floating object in the sea clutter for a surface radar.
4 code implementations • CVPR 2021 • Rui Qian, Tianjian Meng, Boqing Gong, Ming-Hsuan Yang, Huisheng Wang, Serge Belongie, Yin Cui
Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space, while clips from different videos are pushed away.
Ranked #1 on Self-Supervised Action Recognition on Kinetics-600
2 code implementations • NeurIPS 2020 • Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin D. Cubuk, Quoc V. Le
For example, on the COCO object detection dataset, pre-training benefits when we use one fifth of the labeled data, and hurts accuracy when we use all labeled data.
Ranked #1 on Semantic Segmentation on PASCAL VOC 2012 val
5 code implementations • ECCV 2020 • Menglin Jia, Mengyun Shi, Mikhail Sirotenko, Yin Cui, Claire Cardie, Bharath Hariharan, Hartwig Adam, Serge Belongie
In this work we explore the task of instance segmentation with attribute localization, which unifies instance segmentation (detect and segment each object instance) and fine-grained visual attribute categorization (recognize one or multiple attributes).
1 code implementation • 21 Dec 2019 • Yin Cui, Zeqi Gu, Dhruv Mahajan, Laurens van der Maaten, Serge Belongie, Ser-Nam Lim
We also investigate the interplay between dataset granularity with a variety of factors and find that fine-grained datasets are more difficult to learn from, more difficult to transfer to, more difficult to perform few-shot learning with, and more vulnerable to adversarial attacks.
13 code implementations • CVPR 2020 • Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V. Le, Xiaodan Song
We propose SpineNet, a backbone with scale-permuted intermediate features and cross-scale connections that is learned on an object detection task by Neural Architecture Search.
Ranked #9 on Image Classification on iNaturalist
1 code implementation • 13 Jun 2019 • Sheng Guo, Weilin Huang, Xiao Zhang, Prasanna Srikhanta, Yin Cui, Yuan Li, Matthew R. Scott, Hartwig Adam, Serge Belongie
The dataset is constructed from over one million fashion images with a label space that includes 8 groups of 228 fine-grained attributes in total.
8 code implementations • CVPR 2019 • Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang song, Serge Belongie
We design a re-weighting scheme that uses the effective number of samples for each class to re-balance the loss, thereby yielding a class-balanced loss.
Ranked #2 on Long-tail Learning on EGTEA
1 code implementation • ECCV 2018 • Guandao Yang, Yin Cui, Serge Belongie, Bharath Hariharan
It is expensive to label images with 3D structure or precise camera pose.
1 code implementation • CVPR 2018 • Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, Serge Belongie
To address these two challenges, we propose a novel learning based discriminative evaluation metric that is directly trained to distinguish between human and machine-generated captions.
1 code implementation • CVPR 2018 • Yin Cui, Yang song, Chen Sun, Andrew Howard, Serge Belongie
We propose a measure to estimate domain similarity via Earth Mover's Distance and demonstrate that transfer learning benefits from pre-training on a source domain that is similar to the target domain by this measure.
Ranked #29 on Fine-Grained Image Classification on CUB-200-2011
Fine-Grained Image Classification Fine-Grained Visual Categorization +1
19 code implementations • CVPR 2018 • Grant Van Horn, Oisin Mac Aodha, Yang song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, Serge Belongie
Existing image classification datasets used in computer vision tend to have a uniform distribution of images across object categories.
Ranked #8 on Image Classification on iNaturalist
no code implementations • CVPR 2017 • Yin Cui, Feng Zhou, Jiang Wang, Xiao Liu, Yuanqing Lin, Serge Belongie
We demonstrate how to approximate kernels such as Gaussian RBF up to a given order using compact explicit feature maps in a parameter-free manner.
2 code implementations • WWW 2017 • Cheng-Kang Hsieh, Longqi Yang, Yin Cui, Tsung-Yi Lin, Serge Belongie, Deborah Estrin
Metric learning algorithms produce distance metrics that capture the important relationships among data.
Ranked #1 on Recommendation Systems on MovieLens 20M (Recall@100 metric)
no code implementations • CVPR 2016 • Yin Cui, Feng Zhou, Yuanqing Lin, Serge Belongie
To demonstrate the effectiveness of the proposed framework, we bootstrap a fine-grained flower dataset with 620 categories from Instagram images.
no code implementations • CVPR 2015 • Tsung-Yi Lin, Yin Cui, Serge Belongie, James Hays
Most approaches predict the location of a query image by matching to ground-level images with known locations (e. g., street-view data).
no code implementations • 29 Mar 2014 • Yin Cui, Dong Liu, Jiawei Chen, Shih-Fu Chang
In this paper, we propose to build Concept Bank, the largest concept library consisting of 4, 876 concepts specifically designed to cover 631 real-world events.