no code implementations • 28 Nov 2023 • Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, Furu Wei
The diffusion model has proven to be a powerful generative model in recent years, yet generating visual text remains a challenge for it.
no code implementations • 20 Sep 2023 • Tengchao Lv, Yupan Huang, Jingye Chen, Lei Cui, Shuming Ma, Yaoyao Chang, Shaohan Huang, Wenhui Wang, Li Dong, Weiyao Luo, Shaoxiang Wu, Guoxin Wang, Cha Zhang, Furu Wei
We present Kosmos-2.5, a multimodal literate model for machine reading of text-intensive images.
1 code implementation • 31 Aug 2023 • Yupan Huang, Zaiqiao Meng, Fangyu Liu, Yixuan Su, Nigel Collier, Yutong Lu
Our experiments validate the effectiveness of SparklesChat in understanding and reasoning across multiple images and dialogue turns.
no code implementations • NeurIPS 2023 • Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, Furu Wei
Diffusion models have gained increasing attention for their impressive generation abilities but currently struggle with rendering accurate and coherent text.
2 code implementations • 18 Apr 2022 • Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei
In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking.
Ranked #1 on Key Information Extraction on EPHOIE
1 code implementation • 19 Oct 2021 • Yupan Huang, Bei Liu, Jianlong Fu, Yutong Lu
In this work, we demonstrate such an AI creation system to produce both diverse captions and rich images.
1 code implementation • 19 Oct 2021 • Yupan Huang, Hongwei Xue, Bei Liu, Yutong Lu
We adopt Transformer as our unified architecture for its strong performance and task-agnostic design.
no code implementations • NeurIPS 2021 • Hongwei Xue, Yupan Huang, Bei Liu, Houwen Peng, Jianlong Fu, Houqiang Li, Jiebo Luo
To tackle this, we propose a fully Transformer visual embedding for VLP to better learn visual relation and further promote inter-modal alignment.
3 code implementations • CVPR 2021 • Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, Jianlong Fu
As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics from paired natural languages.
Ranked #5 on Visual Entailment on SNLI-VE val
no code implementations • 24 Apr 2020 • Xingbo Liu, Xiushan Nie, Qi Dai, Yupan Huang, Yilong Yin
Due to the compelling efficiency in retrieval and storage, similarity-preserving hashing has been widely applied to approximate nearest neighbor search in large-scale image retrieval.
1 code implementation • 16 Apr 2019 • Yupan Huang, Qi Dai, Yutong Lu
Each branch produces a set of action anchor layers by applying deconvolution to the feature maps of the main stream.
Ranked #26 on Temporal Action Localization on THUMOS’14