Audio Tokenizers

DASB benchmarks 6 popular discrete audio encoders on 9 audio and speech datasets. The encoders fall into three categories: semantic (Discrete HuBERT, Discrete WavLM, Discrete Wav2Vec2), compression (EnCodec, DAC), and hybrid (SpeechTokenizer).

🎌 Discrete Audio Encoders

Model	Dataset	Repo
Discrete HuBERT	LibriSpeech960	huggingface.co/speechbrain/SSL_Quantization
Discrete WavLM	LibriSpeech960	huggingface.co/speechbrain/SSL_Quantization
Discrete Wav2Vec2	LibriSpeech960	huggingface.co/speechbrain/SSL_Quantization
EnCodec	DNS, CommonVoice, AudioSet, FSD50K, and Jamendo	github.com/facebookresearch/encodec
DAC	DAPS, DNS, CommonVoice, VCTK, MUSDB, and Jamendo	github.com/descriptinc/descript-audio-codec
SpeechTokenizer	LibriSpeech960	github.com/ZhangXInFD/SpeechTokenizer

Key Features of the Discrete Audio Encoders

Model	#Params	Sampling Rate	Bitrate (kbps), low	Bitrate (kbps), medium	Bitrate (kbps), high	#Codebooks, low	#Codebooks, medium	#Codebooks, high
Discrete HuBERT	309.0M	16 kHz	0.98	2.9	-	2	6	-
Discrete WavLM	309.0M	16 kHz	0.98	2.9	-	2	6	-
Discrete Wav2Vec2	309.0M	16 kHz	0.98	2.9	-	2	6	-
EnCodec	17.9M	24 kHz	1.5	6.0	24.0	2	8	32
DAC	22.4M	24 kHz	1.5	6.0	24.0	2	8	32
SpeechTokenizer	85.3M	16 kHz	1.0	4.0	-	2	8	-

Note: #Params is computed for the medium bitrate setting.
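
The bitrate and #Codebooks columns are linked: for a token stream built from several codebooks, the bitrate is roughly the number of codebooks times the token frame rate times the bits per token. The sketch below illustrates this for the compression and hybrid encoders. The frame rates (75 Hz for EnCodec and DAC at 24 kHz, 50 Hz for SpeechTokenizer) and the codebook size of 1024 are assumptions that do not appear in the table above, and the helper names are hypothetical. The semantic encoders are left out because their bitrate depends on the number of k-means clusters, which the table does not state.

```python
import math


def bitrate_kbps(n_codebooks: int, frame_rate_hz: float, codebook_size: int) -> float:
    """Bitrate in kbps of a token stream with `n_codebooks` tokens per frame."""
    # Each token indexes one of `codebook_size` entries, so it costs log2(size) bits.
    bits_per_frame = n_codebooks * math.log2(codebook_size)
    return bits_per_frame * frame_rate_hz / 1000.0


# ASSUMED values, not taken from the table: token frame rate and codebook size.
ASSUMED_CONFIG = {
    "EnCodec": {"frame_rate_hz": 75, "codebook_size": 1024},
    "DAC": {"frame_rate_hz": 75, "codebook_size": 1024},
    "SpeechTokenizer": {"frame_rate_hz": 50, "codebook_size": 1024},
}

# Codebook counts for the low / medium / high settings, copied from the table.
CODEBOOKS = {
    "EnCodec": [2, 8, 32],
    "DAC": [2, 8, 32],
    "SpeechTokenizer": [2, 8],
}


def bitrate_table() -> dict:
    """Return {model: [bitrate in kbps for each codebook setting]}."""
    return {
        model: [bitrate_kbps(n, **ASSUMED_CONFIG[model]) for n in counts]
        for model, counts in CODEBOOKS.items()
    }


if __name__ == "__main__":
    for model, rates in bitrate_table().items():
        print(model, rates)
```

Under these assumptions the computed values agree with the EnCodec, DAC, and SpeechTokenizer rows of the table, which suggests the low, medium, and high settings differ only in how many codebooks are kept.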