no code implementations • 2 Apr 2024 • Kei Sawada, Tianyu Zhao, Makoto Shing, Kentaro Mitsui, Akio Kaga, Yukiya Hono, Toshiaki Wakatsuki, Koh Mitsuda
AI democratization aims to create a world in which the average person can utilize AI techniques.
no code implementations • 6 Dec 2023 • Yukiya Hono, Koh Mitsuda, Tianyu Zhao, Kentaro Mitsui, Toshiaki Wakatsuki, Kei Sawada
Advances in machine learning have made it possible to perform various text and speech processing tasks, including automatic speech recognition (ASR), in an end-to-end (E2E) manner.
Automatic Speech Recognition (ASR)
no code implementations • 2 Oct 2023 • Kentaro Mitsui, Yukiya Hono, Kei Sawada
The advent of large language models (LLMs) has made it possible to generate natural written dialogues between two agents.
no code implementations • 1 Jun 2023 • Congda Ma, Tianyu Zhao, Makoto Shing, Kei Sawada, Manabu Okumura
In a controllable text generation dataset, there exist unannotated attributes that could provide irrelevant learning signals to models that use it for training and thus degrade their performance.
no code implementations • 28 Feb 2023 • Kentaro Mitsui, Yukiya Hono, Kei Sawada
The two primary frameworks used for talking face generation comprise a text-driven framework, which generates synchronized speech and talking faces from text, and a speech-driven framework, which generates talking faces from speech.
no code implementations • 14 Feb 2023 • AprilPyone MaungMaung, Makoto Shing, Kentaro Mitsui, Kei Sawada, Fumio Okura
We leverage knowledge from recent large-scale pre-trained generative models, resulting in text-guided sketch-to-photo synthesis without the need for reference images.
no code implementations • 24 Jun 2022 • Kentaro Mitsui, Tianyu Zhao, Kei Sawada, Yukiya Hono, Yoshihiko Nankaku, Keiichi Tokuda
A style encoder that extracts a latent speaking style representation from speech is trained jointly with TTS.
no code implementations • 28 Sep 2021 • Kentaro Mitsui, Kei Sawada
In this study, we propose a method to handle multiple sampling rates in a single neural vocoder (NV), called the MSR-NV.
no code implementations • 17 Sep 2020 • Yukiya Hono, Kazuna Tsuboi, Kei Sawada, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, Keiichi Tokuda
The framework consists of a multi-grained variational autoencoder, a conditional prior, and a multi-level auto-regressive latent converter. Together they obtain latent variables at different time resolutions and sample the finer-level latent variables from the coarser-level ones, taking the input text into account.
no code implementations • ICLR 2021 • Ruozi Huang, Huang Hu, Wei Wu, Kei Sawada, Mi Zhang, Daxin Jiang
In this paper, we formalize music-conditioned dance generation as a sequence-to-sequence learning problem and devise a novel seq2seq architecture to efficiently process long sequences of music features and capture the fine-grained correspondence between music and dance.
Ranked #1 on Motion Synthesis on BRACE