Audio Tokenizers
DASB benchmarks 6 popular discrete audio encoders on 9 audio and speech datasets. The encoders fall into three categories: semantic (Discrete HuBERT, Discrete WavLM, Discrete Wav2Vec2), compression (EnCodec, DAC), and hybrid (SpeechTokenizer).
🎌 Discrete Audio Encoder
Model | Dataset | Repo |
---|---|---|
Discrete HuBERT | LibriSpeech960 | huggingface.co/speechbrain/SSL_Quantization |
Discrete WavLM | LibriSpeech960 | huggingface.co/speechbrain/SSL_Quantization |
Discrete Wav2Vec2 | LibriSpeech960 | huggingface.co/speechbrain/SSL_Quantization |
EnCodec | DNS, CommonVoice, AudioSet, FSD50K, and Jamendo | github.com/facebookresearch/encodec |
DAC | DAPS, DNS, CommonVoice, VCTK, MUSDB, and Jamendo | github.com/descriptinc/descript-audio-codec |
SpeechTokenizer | LibriSpeech960 | github.com/ZhangXInFD/SpeechTokenizer |
Key Features of the Discrete Audio Encoders
Model | #Params | Sampling Rate | Bitrate (kbps), low | Bitrate (kbps), medium | Bitrate (kbps), high | #Codebooks, low | #Codebooks, medium | #Codebooks, high |
---|---|---|---|---|---|---|---|---|
Discrete HuBERT | 309.0M | 16 kHz | 0.98 | 2.9 | - | 2 | 6 | - |
Discrete WavLM | 309.0M | 16 kHz | 0.98 | 2.9 | - | 2 | 6 | - |
Discrete Wav2Vec2 | 309.0M | 16 kHz | 0.98 | 2.9 | - | 2 | 6 | - |
EnCodec | 17.9M | 24 kHz | 1.5 | 6.0 | 24.0 | 2 | 8 | 32 |
DAC | 22.4M | 24 kHz | 1.5 | 6.0 | 24.0 | 2 | 8 | 32 |
SpeechTokenizer | 85.3M | 16 kHz | 1.0 | 4.0 | - | 2 | 8 | - |
**#Params is computed for the medium bitrate.**
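
The bitrate and #Codebooks columns are linked: for a codec that emits one token per codebook per frame, the bitrate is roughly the number of codebooks times the frame rate times the bits per token. The sketch below reproduces the EnCodec, DAC, and SpeechTokenizer rows under assumptions that are not stated in the table: 1024 entries per codebook (10 bits per token), a frame rate of 75 frames/s for the 24 kHz codecs, and 50 frames/s for SpeechTokenizer. The helper name `bitrate_kbps` is hypothetical and is not part of DASB. The semantic encoders are left out because their frame rate and cluster count are not given here.

```python
import math


def bitrate_kbps(num_codebooks: int, frame_rate_hz: float, codebook_size: int) -> float:
    """Approximate bitrate of a multi-codebook tokenizer in kbps.

    Each frame carries one token per codebook, and each token costs
    log2(codebook_size) bits.
    """
    bits_per_token = math.log2(codebook_size)
    return num_codebooks * frame_rate_hz * bits_per_token / 1000.0


# Assumed values (not taken from the table above).
CODEBOOK_SIZE = 1024          # entries per codebook, i.e. 10 bits per token
FRAME_RATE_24KHZ_CODECS = 75  # frames per second, EnCodec and DAC at 24 kHz
FRAME_RATE_SPEECHTOKENIZER = 50  # frames per second at 16 kHz

for n in (2, 8, 32):
    kbps = bitrate_kbps(n, FRAME_RATE_24KHZ_CODECS, CODEBOOK_SIZE)
    print(f"EnCodec/DAC, {n} codebooks: {kbps} kbps")

for n in (2, 8):
    kbps = bitrate_kbps(n, FRAME_RATE_SPEECHTOKENIZER, CODEBOOK_SIZE)
    print(f"SpeechTokenizer, {n} codebooks: {kbps} kbps")
```

Under these assumptions the computed values match the table: 1.5, 6.0, and 24.0 kbps for 2, 8, and 32 codebooks at 75 frames/s, and 1.0 and 4.0 kbps for 2 and 8 codebooks at 50 frames/s.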