Audio Tokenizers

DASB includes a benchmark e benchmark a diverse set of discrete audio encoders from all three categories: semantic (Discrete HuBERT, Discrete WavLM, Discrete Wav2Vec2) , acoustic (EnCodec, DAC, WavTokenizer, SQCodec), and hybrid (SpeechTokenizer, and Mimi).

⚡ Key Features of the Discrete Audio Encoders

Model Abr. SR FR #Vocab Speech Music Sound
EnCodec [Defossez et al., 2022] Enc-SMA-24 24kHz 75 1024
DAC [Kumar et al., 2023] DAC-SMA-24 24kHz 75 1024
SpeechTokenizer [Zhang et al., 2023] ST-S-16 16kHz 50 1024
Mimi [Defossez et al., 2024] Mimi-S-24 24kHz 12.5 2048
Discrete WavLM [Mousavi et al., 2024] DWavL-S-16 16kHz 50 1000
Discrete HuBERT [Mousavi et al., 2024] DHuBERT-S-16 16kHz 50 1000
Discrete Wav2Vec2 [Mousavi et al., 2024] DWav2Vec2-S-16 16kHz 50 1000
SQ-Codec [SimpleSpeech, 2024] SQ-SMA-16 16kHz 50 20000
WavTokenizer [Ji et al., 2024] WT-SMA-24-2 24kHz 75 4096

Runtime & Memor

Tokenizer legend (colors and markers)
Encoder time vs bitrate
Time (s) vs Bitrate (kbps)
Encoder memory vs bitrate
Memory (GB) vs Bitrate (kbps)
Decoder time vs bitrate
Time (s) vs Bitrate (kbps)
Decoder memory vs bitrate
Memory (GB) vs Bitrate (kbps)

Notes: Shaded backgrounds/borders are for presentation only. Colors/markers match the legend above.