1 code implementation • 21 Nov 2021 • Keng Ji Chow, Samson Tan, Min-Yen Kan
Furthermore, existing V+L benchmarks often report a single global accuracy score over the entire dataset, making it difficult to pinpoint the specific reasoning tasks at which models succeed or fail.