1 code implementation • 30 Apr 2024 • Yuliang Liu, Mingxin Huang, Hao Yan, Linger Deng, Weijia Wu, Hao Lu, Chunhua Shen, Lianwen Jin, Xiang Bai
Specifically, we propose a Prompt Queries Generation Module and a Tasks-aware Adapter that effectively convert the original single-task model into a multi-task model suitable for both image and video scenarios with minimal additional parameters.
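The adapter idea can be sketched in a minimal form. Below is a generic bottleneck adapter (down-project, nonlinearity, up-project, residual) in plain Python; the actual Prompt Queries Generation Module and Tasks-aware Adapter designs are not detailed in this snippet, so treat this only as the common pattern such parameter-efficient modules build on.

```python
def matmul(x, w):
    # naive matrix multiply for illustration: x is (rows x d), w is (d x k)
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def bottleneck_adapter(x, w_down, w_up):
    """Generic bottleneck adapter: down-project, ReLU, up-project, residual.
    (Hypothetical sketch; not the paper's Tasks-aware Adapter design.)"""
    h = [[max(v, 0.0) for v in row] for row in matmul(x, w_down)]  # down + ReLU
    out = matmul(h, w_up)                                          # up-projection
    return [[a + b for a, b in zip(xr, orow)] for xr, orow in zip(x, out)]

x = [[1.0, 2.0, 3.0, 4.0]]                                  # one token, dim 4
w_down = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.0], [0.0, 0.1]]   # 4 -> 2 bottleneck
w_up = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]         # 2 -> 4
print(bottleneck_adapter(x, w_down, w_up))
```

Only the small `w_down`/`w_up` matrices are trained, which is how adapters keep the added parameter count minimal.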
2 code implementations • 12 Mar 2024 • Weijia Wu, Zhuang Li, YuChao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, Di Zhang
We introduce DragAnything, which utilizes an entity representation to achieve motion control for any object in controllable video generation.
1 code implementation • 25 Feb 2024 • Luoming Zhang, Yefei He, Wen Fei, Zhenyu Lou, Weijia Wu, YangWei Ying, Hong Zhou
Our framework outperforms previous methods by approximately 1% for 8-bit PTQ and 2% for 6-bit PTQ, showcasing its superior performance.
1 code implementation • 31 Jan 2024 • Yuzhong Zhao, Yue Liu, Zonghao Guo, Weijia Wu, Chen Gong, Fang Wan, Qixiang Ye
The multimodal model is constrained to generate captions within a few sub-spaces containing the control words, which increases the chance of generating less frequent captions and alleviates the caption degeneration issue.
Ranked #1 on Dense Captioning on Visual Genome
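The control-word idea can be illustrated with a toy sketch. In the actual method the decoder's search space is constrained during generation; the `constrained_captions` helper below is hypothetical and only mimics the effect post hoc, by keeping candidate captions that contain every control word.

```python
def constrained_captions(candidates, control_words):
    """Keep only captions whose words include every control word,
    mimicking decoding restricted to sub-spaces containing those words.
    (Hypothetical helper; the real method constrains the decoder itself.)"""
    return [c for c in candidates
            if all(w in c.split() for w in control_words)]

candidates = [
    "a man riding a horse",
    "a man in a straw hat riding a chestnut horse",
    "a person outdoors",
]
print(constrained_captions(candidates, ["straw", "chestnut"]))
# ['a man in a straw hat riding a chestnut horse']
```

Forcing rarer control words into the output steers generation away from the most frequent generic captions.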
1 code implementation • 29 Nov 2023 • Weijia Wu, Yuzhong Zhao, Zhuang Li, Lianlei Shan, Hong Zhou, Mike Zheng Shou
Image segmentation based on continual learning exhibits a critical drop in performance, mainly due to catastrophic forgetting and background shift, as models are required to incorporate new classes continually.
1 code implementation • 24 Nov 2023 • Weijia Wu, Zhuang Li, Yefei He, Mike Zheng Shou, Chunhua Shen, Lele Cheng, Yan Li, Tingting Gao, Di Zhang, Zhongyuan Wang
In this paper, we introduce an information-enriched diffusion model for the paragraph-to-image generation task, termed ParaDiffusion, which explores transferring the extensive semantic comprehension capabilities of large language models to image generation.
1 code implementation • 12 Oct 2023 • Rui Zhao, YuChao Gu, Jay Zhangjie Wu, David Junhao Zhang, Jiawei Liu, Weijia Wu, Jussi Keppo, Mike Zheng Shou
Given a set of video clips of the same motion concept, the task of Motion Customization is to adapt existing text-to-video diffusion models to generate videos with this motion.
no code implementations • 7 Oct 2023 • Luoming Zhang, Wen Fei, Weijia Wu, Yefei He, Zhenyu Lou, Hong Zhou
Fine-grained quantization incurs smaller quantization loss and consequently achieves superior performance.
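A quick numerical sketch of why finer granularity lowers quantization loss: with one scale per small group, an outlier value only degrades its own group instead of the whole tensor. The symmetric uniform quantizer and the group size below are generic illustrations, not the paper's actual scheme.

```python
def quantize(vals, bits):
    # symmetric uniform quantization, scale set by the max magnitude
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in vals) / qmax
    return [round(v / scale) * scale for v in vals]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# a weight vector with one outlier: the per-tensor scale is dominated by it
w = [0.01, -0.02, 0.03, -0.01, 0.02, -0.03, 0.01, 8.0]
per_tensor = quantize(w, 4)

# fine-grained: one scale per group of 4 values
groups = [w[i:i + 4] for i in range(0, len(w), 4)]
per_group = [q for g in groups for q in quantize(g, 4)]

print(mse(w, per_group) < mse(w, per_tensor))  # True: smaller loss
```

The outlier-free group gets a tight scale and near-lossless 4-bit values, while per-tensor quantization rounds all the small weights to zero.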
1 code implementation • 5 Oct 2023 • Yefei He, Jing Liu, Weijia Wu, Hong Zhou, Bohan Zhuang
In this paper, we introduce a data-free and parameter-efficient fine-tuning framework for low-bit diffusion models, dubbed EfficientDM, to achieve QAT-level performance with PTQ-like efficiency.
1 code implementation • NeurIPS 2023 • Weijia Wu, Yuzhong Zhao, Hao Chen, YuChao Gu, Rui Zhao, Yefei He, Hong Zhou, Mike Zheng Shou, Chunhua Shen
To showcase the power of the proposed approach, we generate datasets with rich dense pixel-wise labels for a wide range of downstream tasks, including semantic segmentation, instance segmentation, and depth estimation.
1 code implementation • ICCV 2023 • Yuzhong Zhao, Qixiang Ye, Weijia Wu, Chunhua Shen, Fang Wan
During training, GenPromp converts image category labels to learnable prompt embeddings which are fed to a generative model to conditionally recover the input image with noise and learn representative embeddings.
Ranked #1 on Weakly-Supervised Object Localization on CUB-200-2011 (Top-1 Localization Accuracy metric, using extra training data)
2 code implementations • NeurIPS 2023 • YuChao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yunpeng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, Yixiao Ge, Ying Shan, Mike Zheng Shou
Public large-scale text-to-image diffusion models, such as Stable Diffusion, have gained significant attention from the community.
1 code implementation • 5 May 2023 • Weijia Wu, Yuzhong Zhao, Zhuang Li, Jiahong Li, Hong Zhou, Mike Zheng Shou, Xiang Bai
Most existing cross-modal language-to-video retrieval (VR) research focuses on single-modal input from video, i.e., visual representation, while text is omnipresent in human environments and frequently critical to understanding video.
1 code implementation • 5 May 2023 • Yuzhong Zhao, Weijia Wu, Zhuang Li, Jiahong Li, Weiqiang Wang
This paper introduces a novel video text synthesis technique called FlowText, which utilizes optical flow estimation to synthesize a large amount of text video data at a low cost for training robust video text spotters.
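A toy sketch of the core idea, propagating a text box label to the next frame with optical flow. The real pipeline warps per-pixel text regions using an estimated flow field; here `flow` is a hypothetical dense (dx, dy) displacement map and the box is simply shifted by the mean flow inside it.

```python
def propagate_box(box, flow):
    """Shift a text box (x0, y0, x1, y1) to the next frame using the
    mean optical flow inside it. (Simplified illustration; FlowText
    itself warps per-pixel with estimated flow, not a mean shift.)"""
    x0, y0, x1, y1 = box
    pts = [(x, y) for x in range(x0, x1) for y in range(y0, y1)]
    dx = sum(flow[p][0] for p in pts) / len(pts)
    dy = sum(flow[p][1] for p in pts) / len(pts)
    return (x0 + round(dx), y0 + round(dy), x1 + round(dx), y1 + round(dy))

# constant flow of (+2, +1) everywhere in a 20x20 frame
flow = {(x, y): (2, 1) for x in range(20) for y in range(20)}
print(propagate_box((3, 4, 8, 7), flow))  # (5, 5, 10, 8)
```

Because the annotations move with the rendered text, every synthesized frame comes with labels for free, which is what makes the data cheap to produce.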
no code implementations • 10 Apr 2023 • Weijia Wu, Yuzhong Zhao, Zhuang Li, Jiahong Li, Mike Zheng Shou, Umapada Pal, Dimosthenis Karatzas, Xiang Bai
In this competition report, we establish a video text reading benchmark, DSText, which focuses on dense and small text reading challenges in the video with various scenarios.
1 code implementation • ICCV 2023 • Weijia Wu, Yuzhong Zhao, Mike Zheng Shou, Hong Zhou, Chunhua Shen
In contrast, synthetic data can be freely obtained using a generative model (e.g., DALL-E, Stable Diffusion).
no code implementations • 10 Feb 2023 • Yu Li, Yi Zhang, Weijia Wu, Zimu Zhou, Qiang Li
Such personalized opening sentence generation is challenging because (i) there are limited historical samples for conversation topic recommendation in online insurance sales and (ii) existing text generation schemes often fail to support customized topic ordering based on user preferences.
no code implementations • ICCV 2023 • Yefei He, Zhenyu Lou, Luoming Zhang, Jing Liu, Weijia Wu, Hong Zhou, Bohan Zhuang
To solve this, we propose Softmax-aware Binarization, which dynamically adapts to the data distribution and reduces the error caused by binarization.
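For context, the standard scaled binarization that such methods build on can be sketched as follows; the paper's Softmax-aware variant additionally adapts to the data distribution feeding the softmax, which is not reproduced here.

```python
def binarize(weights):
    """Binarize to {-a, +a}, where a = mean(|w|) is the L2-optimal scale
    (the classic XNOR-Net choice). Softmax-aware Binarization further
    adapts to the input distribution, which this sketch does not model."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    return [alpha if w >= 0 else -alpha for w in weights]

w = [0.5, -0.3, 0.1, -0.7]
print([round(v, 6) for v in binarize(w)])  # [0.4, -0.4, 0.4, -0.4]
```

Keeping only signs plus one scalar per tensor is what lets binarized networks replace multiplications with sign flips, at the cost of the binarization error the paper targets.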
no code implementations • 4 Jul 2022 • Yuzhong Zhao, Yuanqiang Cai, Weijia Wu, Weiqiang Wang
Generally, pre-training and long training schedules are necessary to obtain a well-performing text detector based on deep networks.
no code implementations • 16 May 2022 • Yefei He, Luoming Zhang, Weijia Wu, Hong Zhou
Extensive experiments demonstrate that the proposed method yields surprisingly strong performance on both image classification and human pose estimation tasks.
Ranked #1 on Binarization on ImageNet (Top 1 Accuracy metric)
no code implementations • 8 Apr 2022 • Yefei He, Luoming Zhang, Weijia Wu, Hong Zhou
In this paper, we present a simple yet effective data-free quantization method with accurate activation clipping and adaptive batch normalization.
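The activation-clipping idea can be sketched numerically: clipping rare outliers before choosing the quantization scale preserves resolution for the bulk of activations. The unsigned quantizer and the 0.4 clip threshold below are illustrative assumptions, not the paper's actual clipping rule.

```python
def quantize(vals, bits, clip):
    # unsigned uniform quantization for ReLU activations, range [0, clip]
    qmax = 2 ** bits - 1
    scale = clip / qmax
    return [round(min(max(v, 0.0), clip) / scale) * scale for v in vals]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

acts = [0.1, 0.2, 0.3, 0.15, 0.25, 0.05, 6.0]   # one rare outlier
naive = quantize(acts, 4, clip=max(acts))        # scale dominated by outlier
clipped = quantize(acts, 4, clip=0.4)            # tight clip on the tail

# clipping sacrifices the outlier but keeps the bulk far more precise
inliers = acts[:-1]
print(mse(inliers, clipped[:-1]) < mse(inliers, naive[:-1]))  # True
```

The trade-off is deliberate: a large error on one rare activation costs less accuracy than rounding the entire typical range to a handful of coarse levels.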
1 code implementation • 20 Mar 2022 • Weijia Wu, Yuanqiang Cai, Chunhua Shen, Debing Zhang, Ying Fu, Hong Zhou, Ping Luo
Recent video text spotting methods usually require a three-stage pipeline, i.e., detecting text in individual images, recognizing the localized text, and tracking text streams with post-processing to generate the final results.
1 code implementation • 30 Dec 2021 • Zhuang Li, Weijia Wu, Mike Zheng Shou, Jiahong Li, Size Li, Zhongyuan Wang, Hong Zhou
Semantic representation is of great benefit to the video text tracking (VTT) task, which requires simultaneously classifying, detecting, and tracking texts in video.
3 code implementations • 9 Dec 2021 • Weijia Wu, Yuanqiang Cai, Debing Zhang, Sibo Wang, Zhuang Li, Jiahong Li, Yejun Tang, Hong Zhou
Most existing video text spotting benchmarks focus on evaluating a single language and scenario with limited data.
no code implementations • 10 Sep 2021 • Jue Wang, Haofan Wang, Jincan Deng, Weijia Wu, Debing Zhang
Additional rich unpaired single-modal text data is used to boost the generalization of the text branch.
1 code implementation • 26 Nov 2020 • Weijia Wu, Enze Xie, Ruimao Zhang, Wenhai Wang, Hong Zhou, Ping Luo
For example, without using polygon annotations, PSENet achieves an 80.5% F-score on TotalText [3] (vs. 80.9% for the fully supervised counterpart), 31.1% better than training directly with upright bounding box annotations, and saves 80%+ labeling costs.
1 code implementation • 3 Sep 2020 • Weijia Wu, Ning Lu, Enze Xie
To address the severe domain distribution mismatch, we propose a synthetic-to-real domain adaptation method for scene text detection, which transfers knowledge from synthetic data (source domain) to real data (target domain).
no code implementations • 18 Nov 2019 • Qiang Huang, Jianhui Bu, Weijian Xie, Shengwen Yang, Weijia Wu, Li-Ping Liu
Sentence matching is an essential task in QA systems and is usually reformulated as a Paraphrase Identification (PI) problem.
Ranked #13 on Paraphrase Identification on Quora Question Pairs (Accuracy metric)
no code implementations • 22 Apr 2019 • Weijia Wu, Jici Xing, Hong Zhou
In this paper, we propose a pixel-wise method named TextCohesion for scene text detection, which splits a text instance into five key components: a Text Skeleton and four Directional Pixel Regions.
Ranked #1 on Curved Text Detection on SCUT-CTW1500