TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Fine-Grained Image Classification	CUB-200-2011	MP-MGVC	Accuracy	91.8%	# 8
Fine-Grained Image Classification	NABirds	MP-FGVC	Accuracy	91.0%	# 6
Fine-Grained Image Classification	Stanford Dogs	MP-FGVC	Accuracy	91.0%	# 12

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/delving-into-multimodal-prompting-for-fine/fine-grained-image-classification-on-nabirds)](https://paperswithcode.com/sota/fine-grained-image-classification-on-nabirds?p=delving-into-multimodal-prompting-for-fine)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/delving-into-multimodal-prompting-for-fine/fine-grained-image-classification-on-cub-200)](https://paperswithcode.com/sota/fine-grained-image-classification-on-cub-200?p=delving-into-multimodal-prompting-for-fine)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/delving-into-multimodal-prompting-for-fine/fine-grained-image-classification-on-stanford-1)](https://paperswithcode.com/sota/fine-grained-image-classification-on-stanford-1?p=delving-into-multimodal-prompting-for-fine)`

Delving into Multimodal Prompting for Fine-grained Visual Classification

16 Sep 2023 · Xin Jiang, Hao Tang, Junyao Gao, Xiaoyu Du, Shengfeng He, Zechao Li ·

Fine-grained visual classification (FGVC) involves categorizing fine subdivisions within a broader category, which poses challenges due to subtle inter-class discrepancies and large intra-class variations. However, prevailing approaches primarily focus on uni-modal visual concepts. Recent advancements in pre-trained vision-language models have demonstrated remarkable performance in various high-level vision tasks, yet the applicability of such models to FGVC tasks remains uncertain. In this paper, we aim to fully exploit the capabilities of cross-modal description to tackle FGVC tasks and propose a novel multimodal prompting solution, denoted as MP-FGVC, based on the contrastive language-image pertaining (CLIP) model. Our MP-FGVC comprises a multimodal prompts scheme and a multimodal adaptation scheme. The former includes Subcategory-specific Vision Prompt (SsVP) and Discrepancy-aware Text Prompt (DaTP), which explicitly highlights the subcategory-specific discrepancies from the perspectives of both vision and language. The latter aligns the vision and text prompting elements in a common semantic space, facilitating cross-modal collaborative reasoning through a Vision-Language Fusion Module (VLFM) for further improvement on FGVC. Moreover, we tailor a two-stage optimization strategy for MP-FGVC to fully leverage the pre-trained CLIP model and expedite efficient adaptation for FGVC. Extensive experiments conducted on four FGVC datasets demonstrate the effectiveness of our MP-FGVC.

PDF Abstract

Code

Add Remove Mark official

No code implementations yet. Submit your code now

Tasks

Add Remove

Classification

Fine-Grained Image Classification

Datasets

CUB-200-2011

NABirds

Stanford Dogs

Results from the Paper

Edit

Ranked #6 on Fine-Grained Image Classification on NABirds

Get a GitHub badge

Task	Dataset	Model	Metric Name	Metric Value	Global Rank	Benchmark
Fine-Grained Image Classification	CUB-200-2011	MP-MGVC	Accuracy	91.8%	# 8	Compare
Fine-Grained Image Classification	NABirds	MP-FGVC	Accuracy	91.0%	# 6	Compare
Fine-Grained Image Classification	Stanford Dogs	MP-FGVC	Accuracy	91.0%	# 12	Compare

Methods

Add Remove

CLIP • Focus

Edit Social Preview

Delving into Multimodal Prompting for Fine-grained Visual Classification

Code Edit Add Remove Mark official

Tasks Edit Add Remove

Datasets Edit

Results from the Paper Edit

Methods Edit Add Remove

Code

Add Remove Mark official

Tasks

Add Remove

Datasets

Results from the Paper

Edit

Methods

Add Remove