Direction-Oriented Visual-semantic Embedding Model for Remote Sensing Image-text Retrieval

12 Oct 2023 · Qing Ma, Jiancheng Pan, Cong Bai

Image-text retrieval has developed rapidly in recent years. However, it remains a challenge in remote sensing due to visual-semantic imbalance, which leads to incorrect matching of non-semantic visual and textual features. To solve this problem, we propose a novel Direction-Oriented Visual-semantic Embedding Model (DOVE) to mine the relationship between vision and language. Our key idea is to guide the visual and textual representations in the latent space, directing them as close as possible to a redundancy-free regional visual representation. Concretely, a Regional-Oriented Attention Module (ROAM) adaptively adjusts the distance between the final visual and textual embeddings in the latent semantic space, oriented by regional visual features. Meanwhile, a lightweight Digging Text Genome Assistant (DTGA) is designed to expand the range of tractable textual representation and enhance global word-level semantic connections using fewer attention operations. Finally, we exploit a global visual-semantic constraint to reduce single visual dependency and to serve as an external constraint on the final visual and textual representations. The effectiveness and superiority of our method are verified by extensive experiments, including parameter evaluation, quantitative comparison, ablation studies, and visual analysis, on two benchmark datasets, RSICD and RSITMD.
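
For intuition, below is a minimal, illustrative sketch (not the authors' released code) of what a direction-oriented objective could look like: regional visual features are attention-pooled into a redundancy-reduced anchor, and both the image and text embeddings are pulled toward it. The function names (roam_pool, direction_loss), the hinge formulation, and the margin value are assumptions for illustration only.

```python
# Illustrative sketch only: one way to realize a "direction-oriented" embedding
# objective, assuming L2-normalized regional visual features (B, R, D), a pooled
# image embedding (B, D), and a text embedding (B, D) are available.
import torch
import torch.nn.functional as F


def roam_pool(regions: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
    """Attention-pool regional features (B, R, D) with a query embedding (B, D)."""
    attn = torch.softmax((regions @ query.unsqueeze(-1)).squeeze(-1), dim=-1)  # (B, R)
    return F.normalize((attn.unsqueeze(-1) * regions).sum(dim=1), dim=-1)      # (B, D)


def direction_loss(img_emb, txt_emb, regions, margin=0.2):
    """Hinge loss that pulls both modalities toward a region-oriented anchor."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    anchor = roam_pool(regions, txt_emb)        # regional representation oriented by the text
    pos_i = (img_emb * anchor).sum(-1)          # image-anchor cosine similarity
    pos_t = (txt_emb * anchor).sum(-1)          # text-anchor cosine similarity
    # encourage both similarities to exceed the margin
    return (F.relu(margin - pos_i) + F.relu(margin - pos_t)).mean()


if __name__ == "__main__":
    B, R, D = 4, 36, 512
    loss = direction_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, R, D))
    print(loss.item())
```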


Datasets

RSICD, RSITMD
Results from the Paper


Task                   Dataset  Model  Metric             Value    Global Rank
Cross-Modal Retrieval  RSICD    DOVE   Mean Recall        22.72%   # 6
Cross-Modal Retrieval  RSICD    DOVE   Image-to-text R@1   8.66%   # 6
Cross-Modal Retrieval  RSICD    DOVE   Text-to-image R@1   6.04%   # 6
Cross-Modal Retrieval  RSITMD   DOVE   Mean Recall        37.73%   # 6
Cross-Modal Retrieval  RSITMD   DOVE   Image-to-text R@1  16.81%   # 6
Cross-Modal Retrieval  RSITMD   DOVE   Text-to-image R@1  12.20%   # 5
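
Mean Recall in these benchmarks conventionally denotes the average of R@1, R@5, and R@10 over both retrieval directions. A minimal sketch of that convention follows; the recall values in the usage line are dummy placeholders, not results reported in the paper.

```python
# Mean Recall (mR) as conventionally reported for RSICD/RSITMD:
# the average of R@1/R@5/R@10 for image-to-text and text-to-image retrieval.
def mean_recall(i2t: dict, t2i: dict) -> float:
    keys = ("R@1", "R@5", "R@10")
    return sum(i2t[k] + t2i[k] for k in keys) / (2 * len(keys))


# Dummy placeholder values, for illustration only.
print(mean_recall({"R@1": 10.0, "R@5": 25.0, "R@10": 40.0},
                  {"R@1": 8.0, "R@5": 22.0, "R@10": 35.0}))
```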

Methods


Regional-Oriented Attention Module (ROAM), Digging Text Genome Assistant (DTGA), global visual-semantic constraint