ESPnet: end-to-end speech processing toolkit

Docs | Example | Example (ESPnet2) | Docker | Notebook | Tutorial (2019)

ESPnet is an end-to-end speech processing toolkit that mainly focuses on end-to-end speech recognition and end-to-end text-to-speech. ESPnet uses chainer and pytorch as its main deep learning engines, and it also follows Kaldi-style data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments.

Key Features

Kaldi style complete recipe

  • Supports numerous ASR recipes (WSJ, Switchboard, CHiME-4/5, Librispeech, TED, CSJ, AMI, HKUST, Voxforge, REVERB, etc.)
  • Supports numerous TTS recipes in a manner similar to the ASR recipes (LJSpeech, LibriTTS, M-AILABS, etc.)
  • Supports numerous ST recipes (Fisher-CallHome Spanish, Libri-trans, IWSLT'18, How2, Must-C, Mboshi-French, etc.)
  • Supports numerous MT recipes (IWSLT'16, the above ST recipes, etc.)
  • Supports a speech separation and recognition recipe (WSJ-2mix)
  • Supports a voice conversion recipe (VCC2020 baseline) (new!)

ASR: Automatic Speech Recognition

  • State-of-the-art performance in several ASR benchmarks (comparable/superior to hybrid DNN/HMM and CTC)
  • Hybrid CTC/attention based end-to-end ASR
    • Fast/accurate training with CTC/attention multitask training
    • CTC/attention joint decoding to boost monotonic alignment decoding
    • Encoder: VGG-like CNN + BiRNN (LSTM/GRU), sub-sampling BiRNN (LSTM/GRU) or Transformer
  • Attention: Dot product, location-aware attention, variants of multihead
  • Incorporate RNNLM/LSTMLM/TransformerLM/N-gram trained only with text data
  • Batch GPU decoding
  • Transducer based end-to-end ASR
    • Available: RNN-based encoder/decoder or custom encoder/decoder with support for Transformer, Conformer, TDNN (encoder) and causal conv1d (decoder) blocks.
    • Also supported: mixed RNN/custom encoder-decoder, VGG2L (RNN/custom encoder) and various decoding algorithms.

    Please refer to the tutorial page for complete documentation.

  • CTC segmentation
  • Non-autoregressive model based on Mask-CTC
  • ASR examples for supporting endangered language documentation (Please refer to egs/puebla_nahuatl and egs/yoloxochitl_mixtec for details)
  • Wav2Vec2.0 pretrained model as Encoder, imported from FairSeq.

Demonstration

  • Real-time ASR demo with ESPnet2 Open In Colab

TTS: Text-to-speech

  • Tacotron2
  • Transformer-TTS
  • FastSpeech
  • FastSpeech2 (in ESPnet2)
  • Conformer-based FastSpeech & FastSpeech2 (in ESPnet2)
  • Multi-speaker model with pretrained speaker embedding
  • Multi-speaker model with GST (in ESPnet2)
  • Phoneme-based training (En, Jp, and Zh)
  • Integration with neural vocoders (WaveNet, ParallelWaveGAN, and MelGAN)

Demonstration

  • Real-time TTS demo with ESPnet2 Open In Colab
  • Real-time TTS demo with ESPnet1 Open In Colab

To train a neural vocoder, please check vocoder repositories such as kan-bayashi/ParallelWaveGAN (see the TTS results section below).

NOTE:

  • We are moving to ESPnet2-based development for TTS.
  • If you are a beginner, we recommend using ESPnet2-TTS.

ST: Speech Translation & MT: Machine Translation

  • State-of-the-art performance in several ST benchmarks (comparable/superior to cascaded ASR and MT)
  • Transformer based end-to-end ST (new!)
  • Transformer based end-to-end MT (new!)

VC: Voice conversion

  • Transformer and Tacotron2 based parallel VC using melspectrogram (new!)
  • End-to-end VC based on cascaded ASR+TTS (Baseline system for Voice Conversion Challenge 2020!)

DNN Framework

  • Flexible network architecture thanks to chainer and pytorch
  • Flexible front-end processing thanks to kaldiio and HDF5 support
  • Tensorboard based monitoring

ESPnet2

See ESPnet2.

  • Independent from Kaldi/Chainer, unlike ESPnet1
  • On-the-fly feature extraction and text processing during training
  • Supports both DistributedDataParallel and DataParallel
  • Supports multi-node training and is integrated with Slurm or MPI
  • Supports Sharded Training provided by fairscale
  • A template recipe that can be applied to all corpora
  • Possible to train on a corpus of any size without CPU memory errors
  • ESPnet Model Zoo
  • Integrated with wandb

Installation

  • If you intend to do full experiments including DNN training, then see Installation.

  • If you just need the Python module only:

    pip install espnet
    # To install latest
    # pip install git+https://github.com/espnet/espnet

    You also need to install the following packages (some are optional, as noted):

    pip install torch
    pip install chainer==6.0.0 cupy==6.0.0    # [Option] If you'll use ESPnet1
    pip install torchaudio                    # [Option] If you'll use enhancement task
    pip install torch_optimizer               # [Option] If you'll use additional optimizers in ESPnet2

    Some tasks require additional packages beyond the above. If you hit an ImportError, please install the missing package at that point.
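
    As a quick sanity check after installation, the following minimal Python sketch (assuming a standard pip install) just confirms that the package imports and reports the installed version:

    # Minimal post-install sanity check (a sketch; assumes a standard pip install
    # so that package metadata is available via pkg_resources).
    import pkg_resources

    import espnet  # noqa: F401  # raises ImportError if the installation is broken

    print("espnet version:", pkg_resources.get_distribution("espnet").version)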

Usage

See Usage.

Docker Container

Go to docker/ and follow the instructions there.

Contribution

Thank you for taking time for ESPnet! Any contributions to ESPnet are welcome, and feel free to ask questions or make requests via issues. If this is your first contribution to ESPnet, please follow the contribution guide.

Results and demo

You can find useful tutorials and demos in the Interspeech 2019 Tutorial.

ASR results

We list the character error rate (CER) and word error rate (WER) of major ASR tasks.

| Task | CER (%) | WER (%) | Pretrained model |
| --- | --- | --- | --- |
| Aishell dev/test | 4.6/5.1 | N/A | link |
| ESPnet2 Aishell dev/test | 4.4/4.7 | N/A | link |
| Common Voice dev/test | 1.7/1.8 | 2.2/2.3 | link |
| CSJ eval1/eval2/eval3 | 5.7/3.8/4.2 | N/A | link |
| ESPnet2 CSJ eval1/eval2/eval3 | 4.5/3.3/3.6 | N/A | link |
| HKUST dev | 23.5 | N/A | link |
| ESPnet2 HKUST dev | 21.2 | N/A | link |
| Librispeech dev_clean/dev_other/test_clean/test_other | N/A | 1.9/4.9/2.1/4.9 | link |
| ESPnet2 Librispeech dev_clean/dev_other/test_clean/test_other | 0.7/2.2/0.7/2.1 | 1.9/4.6/2.1/4.7 | link |
| Switchboard (eval2000) callhm/swbd | N/A | 14.0/6.8 | link |
| TEDLIUM2 dev/test | N/A | 8.6/7.2 | link |
| TEDLIUM3 dev/test | N/A | 9.6/7.6 | link |
| WSJ dev93/eval92 | 3.2/2.1 | 7.0/4.7 | N/A |
| ESPnet2 WSJ dev93/eval92 | 2.7/1.8 | 6.6/4.6 | link |

Note that the performance of the CSJ, HKUST, and Librispeech tasks was significantly improved by using the wide network (#units = 1024) and, where necessary, large subword units, as reported by RWTH.

If you want to check the results of the other recipes, please check egs/<name_of_recipe>/asr1/RESULTS.md.

ASR demo

You can recognize speech in a WAV file using pretrained models. Go to a recipe directory and run utils/recog_wav.sh as follows:

# go to recipe directory and source path of espnet tools
cd egs/tedlium2/asr1 && . ./path.sh
# let's recognize speech!
recog_wav.sh --models tedlium2.transformer.v1 example.wav

where example.wav is a WAV file to be recognized. The sampling rate must be consistent with that of data used in training.

The pretrained models available in the demo script are listed below.

| Model | Notes |
| --- | --- |
| tedlium2.rnn.v1 | Streaming decoding based on CTC-based VAD |
| tedlium2.rnn.v2 | Streaming decoding based on CTC-based VAD (batch decoding) |
| tedlium2.transformer.v1 | Joint-CTC attention Transformer trained on Tedlium 2 |
| tedlium3.transformer.v1 | Joint-CTC attention Transformer trained on Tedlium 3 |
| librispeech.transformer.v1 | Joint-CTC attention Transformer trained on Librispeech |
| commonvoice.transformer.v1 | Joint-CTC attention Transformer trained on CommonVoice |
| csj.transformer.v1 | Joint-CTC attention Transformer trained on CSJ |
| csj.rnn.v1 | Joint-CTC attention VGGBLSTM trained on CSJ |
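
For ESPnet2 models, recognition can also be run directly from Python. The following is only a rough sketch, assuming `pip install espnet_model_zoo soundfile` and a valid ASR model tag from the ESPnet Model Zoo; the tag below is a placeholder, not a real model name:

# Rough sketch of ESPnet2 ASR inference from Python (not the ESPnet1 demo above).
# Assumes `pip install espnet_model_zoo soundfile`; replace the placeholder tag
# with a real one from the ESPnet Model Zoo.
import soundfile as sf
from espnet2.bin.asr_inference import Speech2Text

speech2text = Speech2Text.from_pretrained("<asr-model-tag-from-espnet-model-zoo>")

speech, rate = sf.read("example.wav")  # sampling rate must match the training data
nbests = speech2text(speech)
text, tokens, token_ids, hyp = nbests[0]  # best hypothesis first
print(text)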

ST results

We list the 4-gram BLEU of major ST tasks.

end-to-end system

| Task | BLEU | Pretrained model |
| --- | --- | --- |
| Fisher-CallHome Spanish fisher_test (Es->En) | 51.03 | link |
| Fisher-CallHome Spanish callhome_evltest (Es->En) | 20.44 | link |
| Libri-trans test (En->Fr) | 16.70 | link |
| How2 dev5 (En->Pt) | 45.68 | link |
| Must-C tst-COMMON (En->De) | 22.91 | link |
| Mboshi-French dev (Fr->Mboshi) | 6.18 | N/A |

cascaded system

| Task | BLEU | Pretrained model |
| --- | --- | --- |
| Fisher-CallHome Spanish fisher_test (Es->En) | 42.16 | N/A |
| Fisher-CallHome Spanish callhome_evltest (Es->En) | 19.82 | N/A |
| Libri-trans test (En->Fr) | 16.96 | N/A |
| How2 dev5 (En->Pt) | 44.90 | N/A |
| Must-C tst-COMMON (En->De) | 23.65 | N/A |

If you want to check the results of the other recipes, please check egs/<name_of_recipe>/st1/RESULTS.md.

ST demo

(New!) We made a new real-time E2E-ST + TTS demonstration in Google Colab. Please access the notebook from the following button and enjoy the real-time speech-to-speech translation!

Open In Colab


You can translate speech in a WAV file using pretrained models. Go to a recipe directory and run utils/translate_wav.sh as follows:

# go to recipe directory and source path of espnet tools
cd egs/fisher_callhome_spanish/st1 && . ./path.sh
# download example wav file
wget -O - https://github.com/espnet/espnet/files/4100928/test.wav.tar.gz | tar zxvf -
# let's translate speech!
translate_wav.sh --models fisher_callhome_spanish.transformer.v1.es-en test.wav

where test.wav is a WAV file to be translated. The sampling rate must be consistent with that of data used in training.

The pretrained models available in the demo script are listed below.

| Model | Notes |
| --- | --- |
| fisher_callhome_spanish.transformer.v1 | Transformer-ST trained on Fisher-CallHome Spanish Es->En |

MT results

| Task | BLEU | Pretrained model |
| --- | --- | --- |
| Fisher-CallHome Spanish fisher_test (Es->En) | 61.45 | link |
| Fisher-CallHome Spanish callhome_evltest (Es->En) | 29.86 | link |
| Libri-trans test (En->Fr) | 18.09 | link |
| How2 dev5 (En->Pt) | 58.61 | link |
| Must-C tst-COMMON (En->De) | 27.63 | link |
| IWSLT'14 test2014 (En->De) | 24.70 | link |
| IWSLT'14 test2014 (De->En) | 29.22 | link |
| IWSLT'16 test2014 (En->De) | 24.05 | link |
| IWSLT'16 test2014 (De->En) | 29.13 | link |

TTS results

ESPnet2

You can listen to the generated samples at the following URL.

Note that in the generation we use Griffin-Lim (wav/) and Parallel WaveGAN (wav_pwg/).

You can download pretrained models via espnet_model_zoo.

You can download pretrained vocoders via kan-bayashi/ParallelWaveGAN.
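
As a hedged sketch of how the Model Zoo download works from Python (assuming espnet_model_zoo is installed; the tag below is a placeholder to be replaced with a real model name):

# Sketch of downloading a pretrained ESPnet2 model through espnet_model_zoo
# (assumes `pip install espnet_model_zoo`; the tag below is a placeholder).
from espnet_model_zoo.downloader import ModelDownloader

d = ModelDownloader()  # downloads are cached locally
files = d.download_and_unpack("<tts-model-tag-from-espnet-model-zoo>")
print(files)  # paths to the unpacked files, e.g. the training config and checkpoint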

ESPnet1

NOTE: We are moving to ESPnet2-based development for TTS. Please check the latest results in the ESPnet2 results above.

You can listen to our samples on the demo page espnet-tts-sample. Here we list some notable ones:

You can download all of the pretrained models and generated samples:

Note that for the generated samples we use the following vocoders: Griffin-Lim (GL), WaveNet vocoder (WaveNet), Parallel WaveGAN (ParallelWaveGAN), and MelGAN (MelGAN). The neural vocoders are based on the following repositories.

If you want to build your own neural vocoder, please check the above repositories. kan-bayashi/ParallelWaveGAN provides a manual on how to decode ESPnet-TTS models' features with neural vocoders, so please check it.

Here we list all of the pretrained neural vocoders. Please download and enjoy the generation of high quality speech!

| Model link | Lang | Fs [Hz] | Mel range [Hz] | FFT / Shift / Win [pt] | Model type |
| --- | --- | --- | --- | --- | --- |
| ljspeech.wavenet.softmax.ns.v1 | EN | 22.05k | None | 1024 / 256 / None | Softmax WaveNet |
| ljspeech.wavenet.mol.v1 | EN | 22.05k | None | 1024 / 256 / None | MoL WaveNet |
| ljspeech.parallel_wavegan.v1 | EN | 22.05k | None | 1024 / 256 / None | Parallel WaveGAN |
| ljspeech.wavenet.mol.v2 | EN | 22.05k | 80-7600 | 1024 / 256 / None | MoL WaveNet |
| ljspeech.parallel_wavegan.v2 | EN | 22.05k | 80-7600 | 1024 / 256 / None | Parallel WaveGAN |
| ljspeech.melgan.v1 | EN | 22.05k | 80-7600 | 1024 / 256 / None | MelGAN |
| ljspeech.melgan.v3 | EN | 22.05k | 80-7600 | 1024 / 256 / None | MelGAN |
| libritts.wavenet.mol.v1 | EN | 24k | None | 1024 / 256 / None | MoL WaveNet |
| jsut.wavenet.mol.v1 | JP | 24k | 80-7600 | 2048 / 300 / 1200 | MoL WaveNet |
| jsut.parallel_wavegan.v1 | JP | 24k | 80-7600 | 2048 / 300 / 1200 | Parallel WaveGAN |
| csmsc.wavenet.mol.v1 | ZH | 24k | 80-7600 | 2048 / 300 / 1200 | MoL WaveNet |
| csmsc.parallel_wavegan.v1 | ZH | 24k | 80-7600 | 2048 / 300 / 1200 | Parallel WaveGAN |

If you want to use the above pretrained vocoders, please make sure your feature settings exactly match theirs.

TTS demo

ESPnet2

You can try the real-time demo in Google Colab. Please access the notebook from the following button and enjoy the real-time synthesis!

  • Real-time TTS demo with ESPnet2 Open In Colab

English, Japanese, and Mandarin models are available in the demo.
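
The same interface can also be used locally outside Colab; a minimal sketch (assuming espnet_model_zoo and scipy are installed; the model tag is a placeholder, and the 22050 Hz sampling rate is only an assumption that must match the chosen model):

# Minimal local ESPnet2 TTS sketch (assumes `pip install espnet_model_zoo scipy`;
# the model tag is a placeholder, and 22050 Hz must match the chosen model).
from espnet2.bin.tts_inference import Text2Speech
from scipy.io.wavfile import write

tts = Text2Speech.from_pretrained(model_tag="<tts-model-tag-from-espnet-model-zoo>")
output = tts("This is a demonstration of text to speech.")
write("out.wav", 22050, output["wav"].numpy())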

ESPnet1

NOTE: We are moving to ESPnet2-based development for TTS. Please check the latest demo in the ESPnet2 demo above.

You can try the real-time demo in Google Colab. Please access the notebook from the following button and enjoy the real-time synthesis.

  • Real-time TTS demo with ESPnet1 Open In Colab

We also provide a shell script to perform synthesis. Go to a recipe directory and run utils/synth_wav.sh as follows:

# go to recipe directory and source path of espnet tools
cd egs/ljspeech/tts1 && . ./path.sh
# we use upper-case char sequence for the default model.
echo "THIS IS A DEMONSTRATION OF TEXT TO SPEECH." > example.txt
# let's synthesize speech!
synth_wav.sh example.txt

# also you can use multiple sentences
echo "THIS IS A DEMONSTRATION OF TEXT TO SPEECH." > example_multi.txt
echo "TEXT TO SPEECH IS A TECHQNIQUE TO CONVERT TEXT INTO SPEECH." >> example_multi.txt
synth_wav.sh example_multi.txt

You can change the pretrained model as follows:

synth_wav.sh --models ljspeech.fastspeech.v1 example.txt

Waveform synthesis is performed with the Griffin-Lim algorithm and neural vocoders (WaveNet and ParallelWaveGAN). You can change the pretrained vocoder model as follows:

synth_wav.sh --vocoder_models ljspeech.wavenet.mol.v1 example.txt

The WaveNet vocoder provides very high-quality speech, but generation takes time.

See more details or available models via --help.

synth_wav.sh --help

VC results

  • Transformer and Tacotron2 based VC

You can listen to some samples on the demo webpage.

  • Cascade ASR+TTS as one of the baseline systems of VCC2020

The Voice Conversion Challenge 2020 (VCC2020) adopts ESPnet to build an end-to-end based baseline system. In VCC2020, the objective is intra/cross lingual nonparallel VC. You can download converted samples of the cascade ASR+TTS baseline system here.

CTC Segmentation demo

CTC segmentation determines utterance segments within audio files. Aligned utterance segments constitute the labels of speech datasets.

As a demo, we align the start and end of utterances within the audio file ctc_align_test.wav, using the example script utils/ctc_align_wav.sh. For preparation, set up a data directory:

cd egs/tedlium2/align1/
# data directory
align_dir=data/demo
mkdir -p ${align_dir}
# wav file
base=ctc_align_test
wav=../../../test_utils/${base}.wav
# recipe files
echo "batchsize: 0" > ${align_dir}/align.yaml

cat << EOF > ${align_dir}/utt_text
${base} THE SALE OF THE HOTELS
${base} IS PART OF HOLIDAY'S STRATEGY
${base} TO SELL OFF ASSETS
${base} AND CONCENTRATE
${base} ON PROPERTY MANAGEMENT
EOF

Here, utt_text is the file containing the list of utterances. Choose a pre-trained ASR model that includes a CTC layer to find utterance segments:

# pre-trained ASR model
model=wsj.transformer_small.v1
mkdir ./conf && cp ../../wsj/asr1/conf/no_preprocess.yaml ./conf

../../../utils/asr_align_wav.sh \
    --models ${model} \
    --align_dir ${align_dir} \
    --align_config ${align_dir}/align.yaml \
    ${wav} ${align_dir}/utt_text

Segments are written to aligned_segments as a list of file/utterance name, utterance start and end times in seconds, and a confidence score. The confidence score is a probability in log space that indicates how well the utterance was aligned. If needed, remove bad utterances:

min_confidence_score=-5
awk -v ms=${min_confidence_score} '{ if ($5 > ms) {print} }' ${align_dir}/aligned_segments

The demo script utils/ctc_align_wav.sh uses an already pretrained ASR model (see the list above for more models). It is recommended to use models with RNN-based encoders (such as BLSTMP) for aligning large audio files, rather than Transformer models, which have high memory consumption on longer audio data. The sample rate of the audio must be consistent with that of the data used in training; adjust with sox if needed. A full example recipe is in egs/tedlium2/align1/.
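
The same confidence filtering as the awk one-liner above can also be done in Python; the short sketch below only assumes that aligned_segments is whitespace-separated with the confidence score in the fifth column, exactly as the awk filter does:

# Python equivalent of the awk confidence filter above: keep only segments whose
# confidence score (fifth whitespace-separated column) exceeds the threshold.
min_confidence_score = -5.0

with open("data/demo/aligned_segments") as f:
    for line in f:
        fields = line.split()
        if len(fields) >= 5 and float(fields[4]) > min_confidence_score:
            print(line.rstrip())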

References

[1] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai, "ESPnet: End-to-End Speech Processing Toolkit," Proc. Interspeech'18, pp. 2207-2211 (2018)

[2] Suyoun Kim, Takaaki Hori, and Shinji Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," Proc. ICASSP'17, pp. 4835--4839 (2017)

[3] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey and Tomoki Hayashi, "Hybrid CTC/Attention Architecture for End-to-End Speech Recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240-1253, Dec. 2017

Citations

@inproceedings{watanabe2018espnet,
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  title={{ESPnet}: End-to-End Speech Processing Toolkit},
  year={2018},
  booktitle={Proceedings of Interspeech},
  pages={2207--2211},
  doi={10.21437/Interspeech.2018-1456},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
}
@inproceedings{hayashi2020espnet,
  title={{Espnet-TTS}: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit},
  author={Hayashi, Tomoki and Yamamoto, Ryuichi and Inoue, Katsuki and Yoshimura, Takenori and Watanabe, Shinji and Toda, Tomoki and Takeda, Kazuya and Zhang, Yu and Tan, Xu},
  booktitle={Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={7654--7658},
  year={2020},
  organization={IEEE}
}
@inproceedings{inaguma-etal-2020-espnet,
    title = "{ESP}net-{ST}: All-in-One Speech Translation Toolkit",
    author = "Inaguma, Hirofumi  and
      Kiyono, Shun  and
      Duh, Kevin  and
      Karita, Shigeki  and
      Yalta, Nelson  and
      Hayashi, Tomoki  and
      Watanabe, Shinji",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-demos.34",
    pages = "302--311",
}
@inproceedings{li2020espnet,
  title={{ESPnet-SE}: End-to-End Speech Enhancement and Separation Toolkit Designed for {ASR} Integration},
  author={Chenda Li and Jing Shi and Wangyou Zhang and Aswin Shanmugam Subramanian and Xuankai Chang and Naoyuki Kamo and Moto Hira and Tomoki Hayashi and Christoph Boeddeker and Zhuo Chen and Shinji Watanabe},
  booktitle={Proceedings of IEEE Spoken Language Technology Workshop (SLT)},
  pages={785--792},
  year={2021},
  organization={IEEE},
}
Comments
  • [not-for-merge] Transformer

    I am currently working on the Transformer for ASR (https://arxiv.org/pdf/1706.03762.pdf). I am implementing it with minimal changes from the original script. If it works, I will adapt it to fit into the E2E module; meanwhile I will keep using a different script (e2e_transformer).

    I am currently testing with the voxforge dataset on CPU, because I am having some memory issues. It seems that the multihead attention layers consume a huge amount of memory. When the model was training for MNT (utterances with a length of ~50), it consumed 5 GB of GPU memory. For ASR, the input length is more than 100, so the model requires more than 20 GB of GPU memory for training with reduced parameters.

    I am training with: ./run.sh --stage 3 --ngpu 0 --verbose 1 --backend chainer --mtlalpha 0.0 --elayers 3 --batchsize 20 --maxlen_in 500 --epochs 2

    Some changes: implemented subsampling in the encoder (/4); I will also test it with different subsampling, though it may cause a memory error. Layers for enc and dec reduced to 3, once more to avoid memory issues.

    TODO:

    • Implement recog script.
  • [ESPnet2] Transducer

    Hi,

    This PR adds vanilla RNN-T training + decoding (with all search algorithms supported) for ESPnet2. For now, it's quite straightforward: I duplicated and modified classes from ESPnet1: RNNTDecoder, BeamSearchTransducerESPnet2 and ErrorCalculatorTransESPnet2. I also added a run script and config files in the vivos recipe for testing purposes. The former will be removed in the future.

    Concerning performance, I observed some degradation. I'll have to investigate; I may have made a mistake:

    | dataset | cer / wer (espnet1) | cer / wer (espnet2) |
    | - | - | - |
    | dev | 17.4 / 39.7 | 18.9 / 45.1 |
    | test | 18.4 / 38.9 | 21.1 / 45.8 |

    @kamo-naoyuki I'm not sure what should be modified or what the future actions are; feel free to assign me the next tasks!

    P.S: I'll extend espnet2 test later to cover RNN-T when we have a proper v1.

  • add wav2vec_encoder

    This is the initial PR for importing the Wav2Vec2.0 model into ESPnet. Before the code review, there is one issue: FairSeq is now made an optional package, but I cannot pass the test. The error message is related to a failure to "import fairseq". Can anyone give some suggestions?

  • [WIP] transducer v4

    This PR adds/modifies a bunch of things:

    To do / in progress:

    • Default beam search:
      • [x] Optimization techniques
        • [x] prediction net caching
    • N-Step Constrained beam search (modified version of: https://arxiv.org/pdf/2002.03577.pdf):
      • [x] RNN-T
      • [x] RNN-T w/ att.
      • [x] T-T
      • [x] VIVOS decode config
      • [x] Voxforge decode config
      • [x] VIVOS results
      • [x] Voxforge results
      • [x] Optimization techniques
        • [x] prediction net caching
      • [ ] Code optimization
    • Time Synchronous Decoding (https://ieeexplore.ieee.org/document/9053040):
      • [x] RNN-T
      • [x] RNN-T w/ att.
      • [x] T-T
      • [x] VIVOS decode config
      • [x] Voxforge decode config
      • [x] VIVOS results
      • [x] Voxforge results
      • [x] Optimization techniques
        • [x] prediction net caching
        • [x] prediction net batching
        • [x] joint net caching
      • [ ] Code optimization
    • Alignment-Length Synchronous Decoding (https://ieeexplore.ieee.org/document/9053040):
      • [x] RNN-T
      • [x] RNN-T w/ att.
      • [x] T-T
      • [x] VIVOS decode config
      • [x] Voxforge decode config
      • [x] VIVOS results
      • [x] Voxforge results
      • [x] ! Fix bug final hypothesis empty with transformer (Edit: lazy patch for now, I'll investigate later)
      • [x] Optimization techniques
        • [x] prediction net caching
        • [x] prediction net batching
        • [x] joint net caching
      • [ ] Code optimization
    • Transformer-Transducer:
      • [x] Customizable architecture
      • [x] TDNN-BN-ReLU blocks for encoder part
      • [x] Conformer blocks for encoder part
      • [x] CausalConv1d for decoder part
    • General:
      • [x] Beam search interface (first version)
      • [x] Beam search tests (I'll add more tests in another PR I think)
      • [x] Benchmark: decoding speed (edit: only rough benchmark for now, complete will be done after optimizations)
      • [x] Fix mtl_mode (#2162)
      • [x] Fix attention weights saving condition + PlotAttentionReport selection for unified transducer architecture
      • [x] Fix CER/WER reporting CPU-GPU bug (#2232)
      • [ ] Fix CER/WER reporting multi GPUs bug (#2084)
      • [x] Documentation
        • [x] T-T w/ customizable architecture
        • [x] New features in main README
    • Recipe
      • [x] Voxforge: remove transducer run script + transfer learning scripts for voxforge (too complicated to maintain and documentation is now available for finetuning)

    Note/Important: Modifications related to the customizable architecture should be discussed. This version is only intended to start the discussion: it works correctly, however it's not user-friendly. This feature would be better suited to espnet2, but I think it should be accessible in espnet1 for now, as it is necessary to reproduce some (if not most) transformer-transducer related papers.

    To be added later: prefix tree when dealing with BPE.

    ~P.S: I'm now focused on porting transducer to espnet2 so I'll probably keep remaining work on standby.~

  • espnet2 ASR recipe

    We're thinking of converting the espnet ASR recipes to new espnet (espnet2) ASR recipes (https://github.com/espnet/espnet/tree/v.0.7.0/egs2). The following is the current assignment. I did not finish assigning some recipes, so if you volunteer to take one, please let me know!

    @ftshijt, @Emrys365, @sas91, @YosukeHiguchi, @simpleoier, thanks a lot for helping with this! This is a temporary assignment. Please let me know if you have any requests regarding the assignment. Also, if you have any problems, comments on our new design, etc., you may use this issue.

    • [x] aishell @Emrys365
    • [x] ami @ftshijt
    • [ ] aurora4 @sas91
    • [x] babel @ftshijt
    • [x] chime4 @sas91
    • [ ] chime5 @b-flo
    • [x] commonvoice @ftshijt
    • [x] csj @YosukeHiguchi
    • [x] dirha_wsj @ruizhilijhu
    • [ ] fisher_callhome_spanish
    • [ ] fisher_swbd @YosukeHiguchi
    • [x] hkust @Emrys365
    • [x] how2 @b-flo
    • [ ] hub4_spanish @ruizhilijhu
    • [ ] iwslt18
    • [ ] iwslt19
    • [ ] jnas @YosukeHiguchi
    • [x] jsut @YosukeHiguchi
    • [ ] libri_trans
    • [ ] librispeech @simpleoier
    • [ ] must_c
    • [ ] reverb @sas91
    • [ ] ru_open_stt
    • [ ] swbd @YosukeHiguchi
    • [ ] tedlium2 @simpleoier
    • [ ] tedlium3 @simpleoier
    • [ ] timit @Emrys365
    • [x] vivos @b-flo
    • [x] voxforge @kamo-naoyuki
    • [x] yesno @b-flo
    • [x] wsj @kamo-naoyuki
  • [ESPnet2] distributed training

    Could anyone test this?

    This PR is too complex to explain, so I'd like to show the examples only:

    See: https://github.com/espnet/espnet/wiki/About-distributed-training

    I added a drop_last argument for the batch sampler.

    • ~~For training, drop_last is true. In Distributed training, mini-batch is divided by worldsize and each worker must have 1 or more batch-size. To avoid 0-batchsize, drop_last=true for training~~.
    • For training, drop_last is false.
    • For validation, drop_last is false. ~~Validation mode perform only at RANK==0 worker.~~
    • For inference, drop_last is false.
    • By default, drop_last is false.
  • Pytorch transformer (take 2)

    This is a rework of #555 with the upstream chainer implementation #655 on master.

    • [x] split modules as previous discussion in #555 #655
    • [x] add docstrings
    • [x] update asr train and recog with transformer
    • [x] implement dynamic module loading like preprocessing transform module (also in chainer)
    • [x] CTC/LM joint decoding
    • [x] consistent ASRInterface for all the E2E implementations
    • [x] add pytorch exp and RESULTS (maybe finish tomorrow)
  • Why is the multichannel data randomly processed by "chime4/asr1_multich" in v0.4.0?

    I checked the procedures in chime4/asr1_multich of v0.4.0 and have some questions about the code at line 90 of espnet/espnet/nets/pytorch_backend/frontends/frontend.py.

    It means that during training "use_beamformer" is randomly set to true or false, so not all of the data is processed by the beamformer.
    I changed the code and found that the loss is worse than with the original code. Can anyone tell me why the data is processed like this? In my opinion, the beamformer is not trained on all the data, so the results cannot be better than using all the data.

    Thank you

  • Development plan of TTS recipes for v.0.5.0

    Continue from #561

    • [x] Integrate neural vocoder #1081
    • [ ] GPU batch inference
    • [x] Multi-speaker Transformer #1001
    • [x] Multi-speaker FastSpeech #1006
    • [x] Add Transformer recipe
      • [x] JSUT #1009
      • [x] LibriTTS #1005
    • [x] Add Chinese recipe #1259
    • [ ] Re-design interface to be compatible with other types of embeddings
    • [ ] Integrate online text cleaning in training #998

    If you have other suggestions, please let me know.

  • ASR-based CER/WER eval for TTS

    This is the branch for evaluating TTS objectively. Stage 6 (ASR-based CER/WER evaluation) is added to /egs/ljspeech/tts1/run.sh. librispeech (ngpu4) is used as the ASR model.

    I opened this pull request to show my progress. This branch is not complete yet.

  • TTS Tacotron 2, "Weak" Alignment, any suggestions?

    So I am trying to train Tacotron 2 on a custom dataset; I have a single-speaker dataset that is roughly 11 hours.

    I have trained other implementations of Tacotron 1 before, and on one such implementation the learnt alignment was very good.

    ESPnet, though, for some reason learns an alignment that is a bit "weak", meaning that at every timestep the predicted phonemes are sometimes a bit off.

    I am training this on 6 GPUs with batch size 32. I trained LibriTTS as well from the default recipes, and according to the config, the model learnt very good alignment in only 30 epochs. But as can be seen from the GIF below, the alignment does get better, yet it takes 800 epochs and is still somewhat weak.

    Can anyone give suggestions on what the problem could be, or what I could do to make things better? Any help would be greatly appreciated.

    EDIT:::

    To not mess with the specifications too much, I have only changed the sampling rate in the config to match my data's SR; all other parameters like n_mels, n_fft, etc. remain the same. Could this be an issue? Should I resample my data to 24000 to match the LibriTTS specifications and try training the model again? My params are as follows:

    fs=16000      # sampling frequency
    fmax=""       # maximum frequency
    fmin=""       # minimum frequency
    n_mels=80     # number of mel basis
    n_fft=1024    # number of fft points
    n_shift=256   # number of shift points
    win_length="" # window length
     
    

    Additionally, this is what my mel spectrograms look like from feats.scp and feats.ark:

    (screenshots of random mel spectrograms from feats.scp and feats.ark omitted)

    This is very different from what the LibriTTS features looked like, but my limited knowledge of signal processing does not help me identify what the problem could be.

    Alignment, epochs 280-870

  • Add support of test-only criterions after each epoch

    This PR enables specifying whether certain criterions in the ENH task will only be calculated during inference (after each epoch). This can now be achieved by adding an additional argument, only_for_test, in the model config file:

    criterions: 
      # The first criterion
      - name: si_snr 
        conf:
          eps: 1.0e-7
          only_for_test: True
        # the wrapper for the current criterion
        # for single-talker case, we simply use the fixed_order wrapper
        wrapper: fixed_order
        wrapper_conf:
          weight: 1.0
    
  • Add talromur recipe

    Added a recipe for the Talrómur corpus; see the following links: https://repository.clarin.is/repository/xmlui/handle/20.500.12537/104 https://aclanthology.org/2021.nodalida-main.50.pdf

    This recipe is mostly based on the LJSpeech TTS recipe, with the obvious new local files such as data_download.sh, data.sh and data_utils.sh.

    I also included a data_multi_speaker variant which creates a data directory containing all of the speakers rather than a single one. This can be used to train multi-speaker TTS systems such as xvector-VITS.

    The recipe also employs MFA alignments obtainable at https://repository.clarin.is/repository/xmlui/handle/20.500.12537/201. We used them to trim leading and trailing silences from the audio files, and we hope to be able to use them for more reliable duration modelling in FastSpeech2 in the future.

  • Multi-GPU performance improvement for ASR in ESPnet1 by DDP

    Describe the bug

    Dear ESPnet team,

    This issue is a proposal to improve multi-GPU training performance for v1 ASR. I'm filing this because I didn't find any other issues and/or PRs except for https://github.com/espnet/espnet/issues/3583, as far as I checked. (Sorry in advance if I overlooked any other similar issues/PRs, or if this should be discussed on https://github.com/espnet/espnet/issues/3583.) I have already made a basic DDP implementation for verification, as mentioned later, and can send a PR. Here, I would first like to discuss with you whether this feature is wanted. After that, I would also like to talk about whether the current modifications are acceptable, etc.

    As discussed in https://github.com/espnet/espnet/issues/3583, it looks like the ASR training script in v1 has a large bottleneck in the multi-GPU case. As far as I investigated, it's caused by the thread-based multi-GPU implementation. According to a profiling result from Nsight Systems, GPU computation occasionally stopped even though the actual training code doesn't contain any blockers and the GPU has enough capacity to continue the calculation. Also, while some GPUs were stopped, it looks like only one GPU could run at a time, and the GPU that could run was determined in a round-robin fashion. Therefore, I guess that this is caused by the GIL.

    To verify whether my guess is correct, I made several changes to the ASR v1 training code to support PyTorch's DDP. As a result, the 8-GPU case is about 2x faster than the non-DDP version (i.e., the current implementation). Performance data is below. Note that the 8-GPU case includes several validation runs, because 2500 iters correspond to around 10 epochs in my setting, and only the train-clean-100 dataset is used for this measurement to avoid an overly long training time.

    | num of GPUs | num of iters | current impl [sec.] | DDP [sec.] |
    |:--:|:--:|:--:|:--:|
    | 1 | 2000 | 1270.37 | 1299.95 |
    | 8 | 2500 | 5327.21 | 2470.36 |

    And the throughput roughly calculated from the result above is as follows (bs=32 is assumed).

    | num of GPUs | current impl [samples / sec.] | DDP [samples / sec.] | DDP vs current impl |
    |:--:|:--:|:--:|:--:|
    | 1 | 201.5160937 | 196.9306512 | x0.98 |
    | 8 | 480.5517335 | 1036.286209 | x2.16 |
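
    For context, the change boils down to replacing the thread-based multi-GPU loop with PyTorch's one-process-per-GPU DDP pattern. The sketch below is plain PyTorch (not ESPnet code) showing that pattern, launched e.g. with torchrun:

    # Generic one-process-per-GPU DDP pattern (plain PyTorch sketch, not ESPnet code):
    # each process owns one GPU, so the GIL no longer serializes per-GPU computation.
    import os

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP


    def main():
        dist.init_process_group(backend="nccl")  # RANK/WORLD_SIZE set by the launcher
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(80, 500).cuda(local_rank)  # stand-in for the ASR model
        model = DDP(model, device_ids=[local_rank])
        optimizer = torch.optim.Adam(model.parameters())

        x = torch.randn(32, 80).cuda(local_rank)  # stand-in for one mini-batch
        loss = model(x).sum()
        loss.backward()  # gradients are all-reduced across processes here
        optimizer.step()

        dist.destroy_process_group()


    if __name__ == "__main__":
        main()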

    Basic environments:

    I did an early investigation and PoC implementation on a DGX A100 (80GB memory version) within an NGC PyTorch docker container. Detailed information is below.

    • OS information: Linux 4c930d24d0ff 5.4.0-110-generic #124-Ubuntu SMP Thu Apr 14 19:46:19 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
    • python version: Python 3.8.5
    • espnet version: espnet 0.10.4
    • Git hash: eee2566f2395d676e3978376315a4cd116a4e335
      • Commit date: Wed Nov 10 17:23:24 2021 +0900
      • We need to test the transducer recipe, so the recipe is only v.202204 (commit hash: 48c351057df1ff9d5605d9cb4361cf1f216cf533).
    • pytorch version pytorch 1.8.0a0+1606899
      • Base container image: https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_21-12.html#rel_21-12

    Environments from torch.utils.collect_env:

    (If necessary, please let me know. I will add these information later.)

    Task information:

    • Task: ASR
    • Recipe: librispeech
    • ESPnet1

    To Reproduce

    1. cd egs/librispeech/asr1
    2. run run.sh until stage 3 to set up everything,
    3. execute asr_train.py as below.
    python /workspace/espnet/espnet/bin/asr_train.py \
        --epochs 10 \
        --minibatches -1 \
        --config conf/tuning/transducer/train_conformer-rnn_transducer.yaml \
        --preprocess-conf conf/specaug.yaml \
        --ngpu 8 \
        --backend pytorch \
        --outdir exp/train_100_pytorch_train_conformer-rnn_transducer_specaug/results \
        --tensorboard-dir tensorboard/train_100_pytorch_train_conformer-rnn_transducer_specaug \
        --debugmode 1 \
        --dict /data/dataset/asr/librispeech/data/lang_char/train_100_unigram5000_units.txt \
        --debugdir exp/train_100_pytorch_train_conformer-rnn_transducer_specaug \
        --verbose 0 \
        --resume \
        --train-json /data/dataset/asr/librispeech/dump/train_sp/deltafalse/data_unigram5000.json \
        --valid-json /data/dataset/asr/librispeech/dump/dev/deltafalse/data_unigram5000.json \
        --n-iter-processes 8
    

    Note that --ngpu is changed to 1 for single GPU case, and --epochs should be 1 for single GPU performance measurement.

    Error logs

    N/A

    I would highly appreciate all your feedback and comments.

  • Adding vocoder_tag (or vocoder_url) that supports urls to tts_inference

    Hi, I made a (now merged) pull request to the parallel_wavegan repo that allows URLs for download_pretrained_model. In the current implementation of Text2Speech.from_pretrained, the vocoders are limited to predefined model tags, and they should start with parallel_wavegan in tts_inference. For example, I tried the following code and it works fine (my own vocoder on Google Drive):

    from espnet2.bin.tts_inference import Text2Speech
    from scipy.io.wavfile import write
    
    model = Text2Speech.from_pretrained(model_tag="espnet/english_male_ryanspeech_fastspeech",
                                        vocoder_tag="parallel_wavegan/https://drive.google.com/file/d/10GYvB_mIKzXzSjD67tSnBhknZRoBjsNb")
    
    output = model("this is a fresh new start.")
    write("new.wav", 22050, output['wav'].numpy())
    

    However, this looks unnatural to me. I was thinking that we might be able to define vocoder_url, or keep vocoder_tag but check whether it's a URL and then download the model. What are your thoughts on this? I can create a pull request following your suggestion.
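
    As a purely hypothetical illustration of that suggestion (not existing ESPnet code), the tag-or-URL dispatch could look roughly like this:

    # Hypothetical helper sketching the suggestion above: treat vocoder_tag as a
    # URL when it parses as one, otherwise as a predefined tag. Not ESPnet code.
    from urllib.parse import urlparse


    def resolve_vocoder(vocoder_tag: str) -> str:
        if urlparse(vocoder_tag).scheme in ("http", "https"):
            return f"download vocoder from URL: {vocoder_tag}"
        return f"resolve predefined vocoder tag: {vocoder_tag}"


    print(resolve_vocoder("parallel_wavegan/<predefined-tag>"))
    print(resolve_vocoder("https://drive.google.com/file/d/10GYvB_mIKzXzSjD67tSnBhknZRoBjsNb"))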

  • Can't finish collect-stats by using s3prl as the frontend.

    Describe the bug: Can't finish collect-stats when using s3prl as the frontend.

    Basic environments:

    • OS information: Linux 5.4.0-104-generic #118-Ubuntu SMP Wed Mar 2 19:02:41 UTC 2022 x86_64
    • python version: 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]
    • espnet version: espnet 202204
    • Git hash 4ab6912ae80bcc4454b30b62b36f6b40c03cf8cf
    • Commit date Thu Oct 21 18:40:13 2021 +0900
    • pytorch version pytorch 1.9.0

    Environments from torch.utils.collect_env:

    Collecting environment information...
    PyTorch version: 1.9.0
    Is debug build: False
    CUDA used to build PyTorch: 10.2
    ROCM used to build PyTorch: N/A
    
    OS: Ubuntu 20.04.3 LTS (x86_64)
    GCC version: (Ubuntu 7.5.0-6ubuntu2) 7.5.0
    Clang version: 10.0.0 
    CMake version: version 3.16.3
    Libc version: glibc-2.31
    
    Python version: 3.8 (64-bit runtime)
    Python platform: Linux-5.4.0-104-generic-x86_64-with-glibc2.17
    Is CUDA available: True
    CUDA runtime version: 10.2.89
    GPU models and configuration: 
    GPU 0: NVIDIA GeForce RTX 2080 Ti
    GPU 1: NVIDIA GeForce RTX 2080 Ti
    GPU 2: NVIDIA GeForce RTX 2080 Ti
    GPU 3: NVIDIA GeForce RTX 2080 Ti
    
    Nvidia driver version: 510.54
    cuDNN version: Probably one of the following:
    /usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.8.2.4
    /usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.2.4
    /usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.2.4
    /usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.2.4
    /usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.2.4
    /usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.2.4
    /usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.2.4
    HIP runtime version: N/A
    MIOpen runtime version: N/A
    
    Versions of relevant libraries:
    [pip3] numpy==1.20.3
    [pip3] pytorch-ranger==0.1.1
    [pip3] pytorch-wpe==0.0.1
    [pip3] torch==1.9.0
    [pip3] torch-complex==0.2.1
    [pip3] torch-optimizer==0.1.0
    [pip3] torchaudio==0.9.0
    [conda] blas                      1.0                         mkl  
    [conda] cudatoolkit               10.2.89              hfd86e86_1  
    [conda] mkl                       2021.3.0           h06a4308_520  
    [conda] numpy                     1.20.3                   pypi_0    pypi
    [conda] pytorch                   1.9.0           py3.8_cuda10.2_cudnn7.6.5_0    pytorch
    [conda] pytorch-ranger            0.1.1                    pypi_0    pypi
    [conda] pytorch-wpe               0.0.1                    pypi_0    pypi
    [conda] torch-complex             0.2.1                    pypi_0    pypi
    [conda] torch-optimizer           0.1.0                    pypi_0    pypi
    [conda] torchaudio                0.9.0                    pypi_0    pypi
    

    Task information:

    • Task: ASR
    • Recipe: librispeech + Aishell
    • ESPnet2

    Error logs

    2022-05-16T14:43:56 (asr_streaming-Copy1.sh:249:main) ./asr_streaming-Copy1.sh --audio_format wav --feats_type raw --token_type char --use_lm false --use_word_lm false --lm_config conf/train_lm.yaml --asr_config conf/train_asr_conformer_s3prl.yaml --inference_config conf/decode_asr_streaming.yaml --train_set mandarin_english_train --valid_set mandarin_english_dev --test_sets mandarin_english_test --speed_perturb_factors 0.9 1.0 1.1 --feats_normalize utt_mvn --asr_speech_fold_length 512 --asr_text_fold_length 150 --lm_fold_length 150 --stage 10 --stop-stage 100
    2022-05-16T14:43:56 (asr_streaming-Copy1.sh:886:main) Stage 6-8: Skip lm-related stages: use_lm=false
    2022-05-16T14:43:56 (asr_streaming-Copy1.sh:907:main) Stage 10: ASR collect stats: train_set=dump/raw/mandarin_english_train_sp, valid_set=dump/raw/mandarin_english_dev
    2022-05-16T14:43:57 (asr_streaming-Copy1.sh:957:main) Generate 'exp/stats_hubert_conformer/run.sh'. You can resume the process from stage 10 using this script
    2022-05-16T14:43:57 (asr_streaming-Copy1.sh:961:main) ASR collect-stats started... log: 'exp/stats_hubert_conformer/logdir/stats.*.log'
    bash: line 1: 3580624 Killed                  ( python3 -m espnet2.bin.asr_train --collect_stats true --use_preprocessor true --bpemodel none --token_type char --token_list data/token_list/char/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --train_data_path_and_name_and_type dump/raw/mandarin_english_train_sp/wav.scp,speech,sound --train_data_path_and_name_and_type dump/raw/mandarin_english_train_sp/text,text,text --valid_data_path_and_name_and_type dump/raw/mandarin_english_dev/wav.scp,speech,sound --valid_data_path_and_name_and_type dump/raw/mandarin_english_dev/text,text,text --train_shape_file exp/stats_hubert_conformer/logdir/train.7.scp --valid_shape_file exp/stats_hubert_conformer/logdir/valid.7.scp --output_dir exp/stats_hubert_conformer/logdir/stats.7 --config conf/train_asr_conformer_s3prl.yaml --frontend_conf fs=16k ) 2>> exp/stats_hubert_conformer/logdir/stats.7.log >> exp/stats_hubert_conformer/logdir/stats.7.log
    bash: line 1: 3580641 Killed                  ( python3 -m espnet2.bin.asr_train --collect_stats true --use_preprocessor true --bpemodel none --token_type char --token_list data/token_list/char/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --train_data_path_and_name_and_type dump/raw/mandarin_english_train_sp/wav.scp,speech,sound --train_data_path_and_name_and_type dump/raw/mandarin_english_train_sp/text,text,text --valid_data_path_and_name_and_type dump/raw/mandarin_english_dev/wav.scp,speech,sound --valid_data_path_and_name_and_type dump/raw/mandarin_english_dev/text,text,text --train_shape_file exp/stats_hubert_conformer/logdir/train.6.scp --valid_shape_file exp/stats_hubert_conformer/logdir/valid.6.scp --output_dir exp/stats_hubert_conformer/logdir/stats.6 --config conf/train_asr_conformer_s3prl.yaml --frontend_conf fs=16k ) 2>> exp/stats_hubert_conformer/logdir/stats.6.log >> exp/stats_hubert_conformer/logdir/stats.6.log
    

    Parts of stats.6.log

    [s3prl.upstream.experts] Warning: can not import s3prl.upstream.byol_a.expert: No module named 'easydict'. Pass.
    [s3prl.upstream.experts] Warning: can not import s3prl.upstream.wav2vec2_hug.expert: No module named 'transformers'. Pass.
    [s3prl.hub] Warning: can not import s3prl.upstream.byol_a.hubconf: No module named 'easydict'. Please see upstream/byol_a/README.md
    [s3prl.hub] Warning: can not import s3prl.upstream.wav2vec2_hug.hubconf: No module named 'transformers'. Please see upstream/wav2vec2_hug/README.md
    [s3prl.downstream.experts] Warning: can not import s3prl.downstream.voxceleb2_ge2e.expert: No module named 'sox'. Pass.
    [s3prl.downstream.experts] Warning: can not import s3prl.downstream.speech_commands.expert: No module named 'catalyst'. Pass.
    [s3prl.downstream.experts] Warning: can not import s3prl.downstream.quesst14_dtw.expert: No module named 'dtw'. Pass.
    [s3prl.downstream.experts] Warning: can not import s3prl.downstream.sws2013.expert: No module named 'lxml'. Pass.
    [s3prl.downstream.experts] Warning: can not import s3prl.downstream.separation_stft.expert: No module named 'asteroid'. Pass.
    [s3prl.downstream.experts] Warning: can not import s3prl.downstream.sv_voxceleb1.expert: No module named 'sox'. Pass.
    [s3prl.downstream.experts] Warning: can not import s3prl.downstream.enhancement_stft.expert: No module named 'asteroid'. Pass.
    [s3prl.downstream.experts] Warning: can not import s3prl.downstream.quesst14_embedding.expert: No module named 'lxml'. Pass.
    [s3prl.downstream.experts] Warning: can not import s3prl.downstream.a2a-vc-vctk.expert: No module named 'resemblyzer'. Pass.
    Using cache found in ./hub/s3prl_cache/4a54d64fa42b41e39db994c958d8107d5785a100f38c6eba680b6a3cc79babb3
    for https://dl.fbaipublicfiles.com/hubert/hubert_large_ll60k.pt
    # Accounting: time=52 threads=1
    # Ended (code 137) at Mon May 16 14:37:41 CST 2022, elapsed time 52 seconds
    