Search Results for author: Yin Cui

Found 40 papers, 25 papers with code

Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation

no code implementations • 30 Apr 2024 • Yunhao Ge, Xiaohui Zeng, Jacob Samuel Huffman, Tsung-Yi Lin, Ming-Yu Liu, Yin Cui

VFC consists of three steps: 1) proposal, where image-to-text captioning models propose multiple initial captions; 2) verification, where a large language model (LLM) utilizes tools such as object detection and VQA models to fact-check proposed captions; 3) captioning, where an LLM generates the final caption by summarizing caption proposals and the fact check verification results.

Caption Generation Hallucination +7

VideoGLUE: Video General Understanding Evaluation of Foundation Models

1 code implementation • 6 Jul 2023 • Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, Mikhail Sirotenko, Huisheng Wang, Florian Schroff, Hartwig Adam, Ming-Hsuan Yang, Ting Liu, Boqing Gong

We evaluate existing foundation models' video understanding capabilities using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods tailoring a foundation model (FM) for a downstream task.

Action Recognition Temporal Localization +1

Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception

no code implementations • NeurIPS 2023 • Hassan Akbari, Dan Kondratyuk, Yin Cui, Rachel Hornung, Huisheng Wang, Hartwig Adam

We conduct extensive empirical studies and reveal the following key insights: 1) Performing gradient descent updates by alternating on diverse modalities, loss functions, and tasks, with varying input resolutions, efficiently improves the model.

Ranked #1 on Zero-Shot Action Recognition on Kinetics (using extra training data)

Classification Image Classification +7

Towards Understanding the Effect of Pretraining Label Granularity

no code implementations • 29 Mar 2023 • Guan Zhe Hong, Yin Cui, Ariel Fuxman, Stanley H. Chan, Enming Luo

Furthermore, we perform comprehensive experiments using the label hierarchies of iNaturalist 2021 and observe that the following conditions, in addition to proper choice of label granularity, enable the transfer to work well in practice: 1) the pretraining dataset needs to have a meaningful label hierarchy, and 2) the pretraining and target label functions need to align well.

Image Classification Transfer Learning

Unified Visual Relationship Detection with Vision and Language Models

1 code implementation • ICCV 2023 • Long Zhao, Liangzhe Yuan, Boqing Gong, Yin Cui, Florian Schroff, Ming-Hsuan Yang, Hartwig Adam, Ting Liu

To address this challenge, we propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models (VLMs).

Human-Object Interaction Detection Relationship Detection +2

A Simple Zero-shot Prompt Weighting Technique to Improve Prompt Ensembling in Text-Image Models

no code implementations • 13 Feb 2023 • James Urquhart Allingham, Jie Ren, Michael W Dusenberry, Xiuye Gu, Yin Cui, Dustin Tran, Jeremiah Zhe Liu, Balaji Lakshminarayanan

In particular, we ask "Given a large pool of prompts, can we automatically score the prompts and ensemble those that are most suitable for a particular downstream dataset, without needing access to labeled validation data?".

Prompt Engineering Zero-Shot Learning

Train-Once-for-All Personalization

no code implementations • CVPR 2023 • Hong-You Chen, Yandong Li, Yin Cui, Mingda Zhang, Wei-Lun Chao, Li Zhang

We study the problem of how to train a "personalization-friendly" model such that given only the task descriptions, the model can be adapted to different end-users' needs, e.g., for accurately classifying different subsets of objects.

MovieCLIP: Visual Scene Recognition in Movies

1 code implementation • 20 Oct 2022 • Digbalay Bose, Rajat Hebbar, Krishna Somandepalli, Haoyang Zhang, Yin Cui, Kree Cole-McLaughlin, Huisheng Wang, Shrikanth Narayanan

Longform media such as movies have complex narrative structures, with events spanning a rich variety of ambient visual scenes.

Genre classification Scene Recognition

F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models

1 code implementation • 30 Sep 2022 • Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, Anelia Angelova

We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models.

Knowledge Distillation object-detection +1

Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models

no code implementations • 15 Jul 2022 • Rui Qian, Yeqing Li, Zheng Xu, Ming-Hsuan Yang, Serge Belongie, Yin Cui

Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is becoming a promising paradigm for open-vocabulary visual recognition.

Ranked #1 on Zero-Shot Action Recognition on HMDB51

Optical Flow Estimation Video Classification +1

Surrogate Gap Minimization Improves Sharpness-Aware Training

1 code implementation • ICLR 2022 • Juntang Zhuang, Boqing Gong, Liangzhe Yuan, Yin Cui, Hartwig Adam, Nicha Dvornek, Sekhar Tatikonda, James Duncan, Ting Liu

Instead, we define a "surrogate gap", a measure equivalent to the dominant eigenvalue of Hessian at a local minimum when the radius of the neighborhood (to derive the perturbed loss) is small.

Scaling Open-Vocabulary Image Segmentation with Image-Level Labels

1 code implementation • 22 Dec 2021 • Golnaz Ghiasi, Xiuye Gu, Yin Cui, Tsung-Yi Lin

We propose OpenSeg to address the above issue while still making use of scalable image-level supervision of captions.

Image Segmentation Segmentation +1

Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text

no code implementations • 14 Dec 2021 • Qing Li, Boqing Gong, Yin Cui, Dan Kondratyuk, Xianzhi Du, Ming-Hsuan Yang, Matthew Brown

The experiments show that the resultant unified foundation transformer works surprisingly well on both the vision-only and text-only tasks, and the proposed knowledge distillation and gradient masking strategy can effectively lift the performance to approach the level of separately-trained models.

Image Classification Knowledge Distillation +1

Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision

1 code implementation • CVPR 2022 • Liangzhe Yuan, Rui Qian, Yin Cui, Boqing Gong, Florian Schroff, Ming-Hsuan Yang, Hartwig Adam, Ting Liu

Modern self-supervised learning algorithms typically enforce persistency of instance representations across views.

Action Recognition Contrastive Learning +4

Exploring Temporal Granularity in Self-Supervised Video Representation Learning

no code implementations • 8 Dec 2021 • Rui Qian, Yeqing Li, Liangzhe Yuan, Boqing Gong, Ting Liu, Matthew Brown, Serge Belongie, Ming-Hsuan Yang, Hartwig Adam, Yin Cui

The training objective consists of two parts: a fine-grained temporal learning objective to maximize the similarity between corresponding temporal embeddings in the short clip and the long clip, and a persistent temporal learning objective to pull together global embeddings of the two clips.

Representation Learning Self-Supervised Learning

Revisiting 3D ResNets for Video Recognition

5 code implementations • 3 Sep 2021 • Xianzhi Du, Yeqing Li, Yin Cui, Rui Qian, Jing Li, Irwan Bello

A recent work from Bello shows that training and scaling strategies may be more significant than model architectures for visual recognition.

Ranked #40 on Action Classification on Kinetics-600

Action Classification Contrastive Learning +1

Federated Multi-Target Domain Adaptation

no code implementations • 17 Aug 2021 • Chun-Han Yao, Boqing Gong, Yin Cui, Hang Qi, Yukun Zhu, Ming-Hsuan Yang

We further take the server-client and inter-client domain shifts into account and pose a domain adaptation problem with one source (centralized server data) and multiple targets (distributed client data).

Domain Adaptation Federated Learning +3

SITTA: Single Image Texture Translation for Data Augmentation

2 code implementations • 25 Jun 2021 • Boyi Li, Yin Cui, Tsung-Yi Lin, Serge Belongie

In this paper, we propose and explore the problem of image translation for data augmentation.

Data Augmentation Few-Shot Image Classification +2

Bridging the Gap Between Object Detection and User Intent via Query-Modulation

no code implementations • 18 Jun 2021 • Marco Fornoni, Chaochao Yan, Liangchen Luo, Kimberly Wilber, Alex Stark, Yin Cui, Boqing Gong, Andrew Howard

When interacting with objects through cameras, or pictures, users often have a specific intent.

Object object-detection +2

Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

4 code implementations • ICLR 2022 • Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui

On COCO, ViLD outperforms the previous state-of-the-art by 4.8 on novel AP and 11.4 on overall AP.

Ranked #2 on Open Vocabulary Object Detection on Objects365

Image Classification Knowledge Distillation +4

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

2 code implementations • NeurIPS 2021 • Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong

We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.

Ranked #3 on Zero-Shot Video Retrieval on YouCook2 (text-to-video Mean Rank metric)

Action Classification Action Recognition In Videos +9

Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation

5 code implementations • CVPR 2021 • Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D. Cubuk, Quoc V. Le, Barret Zoph

Our baseline model outperforms the LVIS 2020 Challenge winning entry by +3.6 mask AP on rare categories.

Ranked #1 on Object Detection on PASCAL VOC 2007

Image Augmentation Instance Segmentation +3

Efficient Scale-Permuted Backbone with Learned Resource Distribution

no code implementations • ECCV 2020 • Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Yin Cui, Mingxing Tan, Quoc Le, Xiaodan Song

Furthermore, SpineNet is built with a uniform resource distribution over operations.

General Classification Image Classification +3

Small-floating Target Detection in Sea Clutter via Visual Feature Classifying in the Time-Doppler Spectra

1 code implementation • 9 Sep 2020 • Yi Zhou, Yin Cui, Xiaoke Xu, Jidong Suo, Xiaoming Liu

It is challenging for a surface radar to detect small floating objects in sea clutter.

Spatiotemporal Contrastive Video Representation Learning

4 code implementations • CVPR 2021 • Rui Qian, Tianjian Meng, Boqing Gong, Ming-Hsuan Yang, Huisheng Wang, Serge Belongie, Yin Cui

Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space, while clips from different videos are pushed away.

Ranked #1 on Self-Supervised Action Recognition on Kinetics-600

Contrastive Learning Data Augmentation +4
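
The contrastive objective described in this entry can be sketched as an InfoNCE-style loss. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of cosine similarity, the temperature of 0.1, and the toy 2-D embeddings are all hypothetical choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor embedding (illustrative sketch).

    `positive` stands in for a clip from the same video as `anchor`;
    `negatives` stand in for clips from different videos.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability assigned to the positive pair.
    return log_denominator - logits[0]


# Toy embeddings (assumed values): the loss is lower when the positive
# lies close to the anchor than when it is orthogonal to it.
anchor = [1.0, 0.0]
loss_close = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
loss_far = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
```

Minimizing this quantity pulls the positive pair together and pushes the negatives away, which is the behavior the summary describes.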

Rethinking Pre-training and Self-training

2 code implementations • NeurIPS 2020 • Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin D. Cubuk, Quoc V. Le

For example, on the COCO object detection dataset, pre-training benefits when we use one fifth of the labeled data, and hurts accuracy when we use all labeled data.

Ranked #1 on Semantic Segmentation on PASCAL VOC 2012 val

Data Augmentation Object +4

Fashionpedia: Ontology, Segmentation, and an Attribute Localization Dataset

5 code implementations • ECCV 2020 • Menglin Jia, Mengyun Shi, Mikhail Sirotenko, Yin Cui, Claire Cardie, Bharath Hariharan, Hartwig Adam, Serge Belongie

In this work we explore the task of instance segmentation with attribute localization, which unifies instance segmentation (detect and segment each object instance) and fine-grained visual attribute categorization (recognize one or multiple attributes).

Attribute Fine-Grained Visual Categorization +5

Measuring Dataset Granularity

1 code implementation • 21 Dec 2019 • Yin Cui, Zeqi Gu, Dhruv Mahajan, Laurens van der Maaten, Serge Belongie, Ser-Nam Lim

We also investigate the interplay between dataset granularity with a variety of factors and find that fine-grained datasets are more difficult to learn from, more difficult to transfer to, more difficult to perform few-shot learning with, and more vulnerable to adversarial attacks.

Clustering Few-Shot Learning

SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization

13 code implementations • CVPR 2020 • Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V. Le, Xiaodan Song

We propose SpineNet, a backbone with scale-permuted intermediate features and cross-scale connections that is learned on an object detection task by Neural Architecture Search.

Ranked #9 on Image Classification on iNaturalist

Decoder General Classification +6

The iMaterialist Fashion Attribute Dataset

1 code implementation • 13 Jun 2019 • Sheng Guo, Weilin Huang, Xiao Zhang, Prasanna Srikhanta, Yin Cui, Yuan Li, Matthew R. Scott, Hartwig Adam, Serge Belongie

The dataset is constructed from over one million fashion images with a label space that includes 8 groups of 228 fine-grained attributes in total.

Attribute General Classification +2

Class-Balanced Loss Based on Effective Number of Samples

8 code implementations • CVPR 2019 • Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, Serge Belongie

We design a re-weighting scheme that uses the effective number of samples for each class to re-balance the loss, thereby yielding a class-balanced loss.

Ranked #2 on Long-tail Learning on EGTEA

Image Classification Long-tail Learning
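
The re-weighting idea in this entry can be sketched in a few lines. The summary above does not give the formula; the sketch assumes the commonly cited form of the effective number, (1 - beta^n) / (1 - beta), with per-class weights proportional to its inverse. The function names, the default beta of 0.999, the normalization so that weights sum to the number of classes, and the example counts are all assumptions for illustration.

```python
def effective_number(n, beta):
    """Effective number of samples for a class with n samples (n >= 1).

    Assumed form: (1 - beta**n) / (1 - beta), with 0 <= beta < 1.
    """
    return (1.0 - beta ** n) / (1.0 - beta)


def class_balanced_weights(counts, beta=0.999):
    """Per-class loss weights inversely proportional to the effective number.

    Weights are rescaled to sum to the number of classes; this
    normalization is an assumption of the sketch.
    """
    raw = [1.0 / effective_number(n, beta) for n in counts]
    scale = len(counts) / sum(raw)
    return [w * scale for w in raw]


# Hypothetical long-tailed class counts: rarer classes get larger weights.
weights = class_balanced_weights([5000, 500, 50])
```

Each weight would multiply the loss term of samples from the corresponding class. With beta = 0 every class gets the same weight, and as beta approaches 1 the weights approach inverse class frequency.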

Learning Single-View 3D Reconstruction with Limited Pose Supervision

1 code implementation • ECCV 2018 • Guandao Yang, Yin Cui, Serge Belongie, Bharath Hariharan

It is expensive to label images with 3D structure or precise camera pose.

3D Reconstruction Single-View 3D Reconstruction +1

Learning to Evaluate Image Captioning

1 code implementation • CVPR 2018 • Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, Serge Belongie

To address these two challenges, we propose a novel learning based discriminative evaluation metric that is directly trained to distinguish between human and machine-generated captions.

8k Data Augmentation +2

Large Scale Fine-Grained Categorization and Domain-Specific Transfer Learning

1 code implementation • CVPR 2018 • Yin Cui, Yang Song, Chen Sun, Andrew Howard, Serge Belongie

We propose a measure to estimate domain similarity via Earth Mover's Distance and demonstrate that transfer learning benefits from pre-training on a source domain that is similar to the target domain by this measure.

Ranked #29 on Fine-Grained Image Classification on CUB-200-2011

Fine-Grained Image Classification Fine-Grained Visual Categorization +1

The iNaturalist Species Classification and Detection Dataset

19 code implementations • CVPR 2018 • Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, Serge Belongie

Existing image classification datasets used in computer vision tend to have a uniform distribution of images across object categories.

Ranked #8 on Image Classification on iNaturalist

General Classification Image Classification

Kernel Pooling for Convolutional Neural Networks

no code implementations • CVPR 2017 • Yin Cui, Feng Zhou, Jiang Wang, Xiao Liu, Yuanqing Lin, Serge Belongie

We demonstrate how to approximate kernels such as Gaussian RBF up to a given order using compact explicit feature maps in a parameter-free manner.

Face Recognition Fine-Grained Visual Categorization +2

Collaborative Metric Learning

2 code implementations • WWW 2017 • Cheng-Kang Hsieh, Longqi Yang, Yin Cui, Tsung-Yi Lin, Serge Belongie, Deborah Estrin

Metric learning algorithms produce distance metrics that capture the important relationships among data.

Ranked #1 on Recommendation Systems on MovieLens 20M (Recall@100 metric)

Collaborative Filtering Metric Learning +1

Fine-grained Categorization and Dataset Bootstrapping using Deep Metric Learning with Humans in the Loop

no code implementations • CVPR 2016 • Yin Cui, Feng Zhou, Yuanqing Lin, Serge Belongie

To demonstrate the effectiveness of the proposed framework, we bootstrap a fine-grained flower dataset with 620 categories from Instagram images.

Fine-Grained Visual Categorization Metric Learning

Learning Deep Representations for Ground-to-Aerial Geolocalization

no code implementations • CVPR 2015 • Tsung-Yi Lin, Yin Cui, Serge Belongie, James Hays

Most approaches predict the location of a query image by matching to ground-level images with known locations (e.g., street-view data).

Face Verification

Building A Large Concept Bank for Representing Events in Video

no code implementations • 29 Mar 2014 • Yin Cui, Dong Liu, Jiawei Chen, Shih-Fu Chang

In this paper, we propose to build Concept Bank, the largest concept library consisting of 4,876 concepts specifically designed to cover 631 real-world events.

Event Detection Retrieval
