Audio Tokenizers

DASB benchmarks 6 popular discrete audio encoders on 9 audio and speech datasets. The encoders fall into three groups: semantic (Discrete HuBERT, Discrete WavLM, Discrete Wav2Vec2), compression (EnCodec, DAC), and hybrid (SpeechTokenizer).

🎌 Discrete Audio Encoder

| Model | Dataset | Repo |
|---|---|---|
| Discrete HuBERT | LibriSpeech960 | huggingface.co/speechbrain/SSL_Quantization |
| Discrete WavLM | LibriSpeech960 | huggingface.co/speechbrain/SSL_Quantization |
| Discrete Wav2Vec2 | LibriSpeech960 | huggingface.co/speechbrain/SSL_Quantization |
| EnCodec | DNS, CommonVoice, AudioSet, FSD50K, and Jamendo | github.com/facebookresearch/encodec |
| DAC | DAPS, DNS, CommonVoice, VCTK, MUSDB, and Jamendo | github.com/descriptinc/descript-audio-codec |
| SpeechTokenizer | LibriSpeech960 | github.com/ZhangXInFD/SpeechTokenizer |

Key Features of the Discrete Audio Encoders

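| Model | #Params | Sampling Rate | Bitrate (kbps), low | Bitrate (kbps), medium | Bitrate (kbps), high | #Codebooks, low | #Codebooks, medium | #Codebooks, high |
|---|---|---|---|---|---|---|---|---|
| Discrete HuBERT | 309.0M | 16 kHz | 0.98 | 2.9 | - | 2 | 6 | - |
| Discrete WavLM | 309.0M | 16 kHz | 0.98 | 2.9 | - | 2 | 6 | - |
| Discrete Wav2Vec2 | 309.0M | 16 kHz | 0.98 | 2.9 | - | 2 | 6 | - |
| EnCodec | 17.9M | 24 kHz | 1.5 | 6.0 | 24.0 | 2 | 8 | 32 |
| DAC | 22.4M | 24 kHz | 1.5 | 6.0 | 24.0 | 2 | 8 | 32 |
| SpeechTokenizer | 85.3M | 16 kHz | 1.0 | 4.0 | - | 2 | 8 | - |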

*#Params is computed for the medium bitrate setting.*
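
The bitrate and #Codebooks columns above are linked: for a codec that emits one token per codebook per frame, the bitrate is roughly `codebooks × frame rate × bits per token`. The sketch below illustrates this relation. The frame rates (75 Hz for EnCodec and DAC at 24 kHz, 50 Hz for SpeechTokenizer) and the codebook size of 1024 are assumptions that do not appear in the table, and `bitrate_kbps` is a hypothetical helper, not part of DASB.

```python
import math


def bitrate_kbps(num_codebooks: int, frame_rate_hz: float, codebook_size: int) -> float:
    """Approximate bitrate of a multi-codebook tokenizer in kbps.

    Each frame emits one token per codebook, and each token carries
    log2(codebook_size) bits.
    """
    bits_per_token = math.log2(codebook_size)
    return num_codebooks * frame_rate_hz * bits_per_token / 1000.0


# Assumed settings (not stated in the table): 75 Hz frames, 1024-entry codebooks.
for n in (2, 8, 32):
    print(f"75 Hz, {n:2d} codebooks: {bitrate_kbps(n, 75, 1024):.1f} kbps")

# Assumed settings (not stated in the table): 50 Hz frames, 1024-entry codebooks.
for n in (2, 8):
    print(f"50 Hz, {n:2d} codebooks: {bitrate_kbps(n, 50, 1024):.1f} kbps")
```

Under these assumptions the helper gives 1.5, 6.0, and 24.0 kbps for 2, 8, and 32 codebooks at 75 Hz, and 1.0 and 4.0 kbps for 2 and 8 codebooks at 50 Hz, which matches the EnCodec, DAC, and SpeechTokenizer rows. The semantic tokenizers quantize with a different codebook size, so their rows need different inputs.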