Investigating Faithfulness in Large Audio Language Models

Large Audio Language Models (LALMs) integrate audio encoders with pretrained Large Language Models to perform complex multimodal reasoning tasks. While these models can generate Chain-of-Thought (CoT) explanations, the faithfulness of these reasoning chains remains unclear. In this work, we propose a systematic framework to evaluate CoT faithfulness in LALMs with respect to both the input audio and the final model prediction. We define three criteria for audio faithfulness: hallucination-free, holistic, and attentive listening. We also introduce a benchmark based on both audio and CoT interventions to assess faithfulness.

🔊 Audio Intervention Overview

Explore the impact of adversarial injections, masking, and noise on LALM audio processing.

Audio Intervention Pipeline

🧠 Chain-of-Thought (CoT) Overview

Analyze how altering intermediate reasoning steps changes the model's final prediction, which indicates how faithfully the CoT reflects the model's actual decision process.

CoT Intervention Pipeline

Audio Interventions

Quick Index

1. Adversarial Injection
2. Masking
3. Adding Noise
4. Guided Masking
5. CoT Consistency Score

1. Adversarial Injection

Audio Flamingo 3 (AF3)

Qwen2.5-Omni

Audio Examples

⏺️ ORIGINAL AUDIO

✅ CORRECT INJECTIONS

❌ WRONG INJECTIONS

3. Adding Noise

Figure 2: Noise Intervention Results. Accuracy (left) and CoT Consistency (right) evaluated across varying Signal-to-Noise Ratio (SNR) levels.

Audio Examples: SNR Degradation

🔊 NOISE LEVELS (SNR)

Clean (No Noise)

20dB SNR

10dB SNR

0dB SNR

-10dB SNR

-20dB SNR
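
The SNR levels listed above can be reproduced with a short mixing routine. The sketch below is illustrative only and is not the benchmark's own code: it assumes white Gaussian noise and a mono waveform stored as a list of floats, and the names `add_noise_at_snr` and `measure_snr_db` are hypothetical. The noise is scaled so that 10 * log10(P_signal / P_noise) equals the requested SNR in dB.

```python
import math
import random


def power(samples):
    """Mean squared amplitude of a waveform."""
    return sum(x * x for x in samples) / len(samples)


def add_noise_at_snr(signal, snr_db, seed=0):
    """Mix white Gaussian noise into `signal` at the requested SNR (in dB).

    Assumption: the noise type is white Gaussian; the page does not say
    which noise source the benchmark uses.
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    target_noise_power = power(signal) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / power(noise))
    return [s + scale * n for s, n in zip(signal, noise)]


def measure_snr_db(clean, noisy):
    """Recover the SNR of `noisy` given the clean reference waveform."""
    residual = [y - x for x, y in zip(clean, noisy)]
    return 10 * math.log10(power(clean) / power(residual))


# Hypothetical test signal: 0.5 s of a 440 Hz tone sampled at 16 kHz.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(8000)]
for level in (20, 10, 0, -10, -20):
    noisy = add_noise_at_snr(tone, level)
    print(f"target {level:>4} dB, measured {measure_snr_db(tone, noisy):.2f} dB")
```

At 0 dB the noise carries as much power as the signal, and at -20 dB it carries 100 times more, which is why the lowest levels are the hardest test of whether a model is still listening.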

2. Masking

Audio Examples: Masking Ratio

🔇 MASK RATIOS

Clean (0% Masked)

20% Masked

40% Masked

60% Masked

80% Masked

100% Masked
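
A mask ratio like those above can be applied by silencing a fraction of the waveform. The following is a minimal sketch under stated assumptions, not the authors' implementation: masked regions are replaced with zeros (silence), segments have a fixed length, and they are chosen uniformly at random. The function name `mask_audio` and the `segment_len` parameter are hypothetical.

```python
import math
import random


def mask_audio(samples, ratio, segment_len=1600, seed=0):
    """Silence roughly `ratio` of the waveform in fixed-length segments.

    Assumptions: masking means zeroing samples, and masked segments are
    picked uniformly at random. The page does not specify either choice.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    n_segments = math.ceil(len(samples) / segment_len)
    n_masked = round(ratio * n_segments)
    rng = random.Random(seed)
    masked_ids = rng.sample(range(n_segments), n_masked)

    out = list(samples)
    for seg in masked_ids:
        start = seg * segment_len
        end = min(start + segment_len, len(out))
        out[start:end] = [0.0] * (end - start)
    return out


# Hypothetical input: 1 s of constant-amplitude audio at 16 kHz.
wave = [1.0] * 16000
for ratio in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    masked = mask_audio(wave, ratio)
    silent = masked.count(0.0) / len(masked)
    print(f"requested {ratio:.0%}, silenced {silent:.0%}")
```

With a 100% mask the model receives no acoustic evidence at all, so any confident, audio-specific reasoning it still produces cannot be grounded in the input.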

4. Guided Masking

Audio Flamingo 3 (AF3)

Qwen2.5-Omni

Figure 4: Modality Dependency under Guided Masking. The values in parentheses (left) give the mean similarity score relative to the reference answers. The percentages above the bars give the proportion of responses that are audio-dependent (A), dependent on both (A-S), or speech-dependent (S).

Audio Examples: Guided Masking

🎛️ MASKING TARGETS

⏺️ Original Audio

Speech + Background Sound

🗣️ Speech Masked

Only Background Audio Remains

🎵 Audio Masked

Only Speech Remains

5. CoT Consistency Score

Intervention	Model	Animal	Language	Gender	Emotion	MMAR	MMAU
Mask 100%	AF3	3.01	2.63	2.87	3.10	3.19	3.65
Mask 100%	Qwen	2.35	2.40	2.95	2.64	2.81	3.39
Mask 20%	AF3	4.60	4.32	4.35	4.04	3.97	4.41
Mask 20%	Qwen	4.68	4.62	3.89	3.45	4.01	4.45
-20dB SNR	AF3	2.89	2.54	3.17	3.13	3.09	3.57
-20dB SNR	Qwen	2.45	2.27	2.70	2.61	2.82	3.22
20dB SNR	AF3	4.41	4.27	3.76	4.22	4.12	4.44
20dB SNR	Qwen	4.58	4.67	3.93	4.05	4.12	4.33
Adv-correct	AF3	4.25	4.26	3.59	3.55	N.A.	N.A.
Adv-correct	Qwen	4.46	4.59	3.23	2.88	N.A.	N.A.
Adv-wrong	AF3	3.04	3.43	3.32	2.99	N.A.	N.A.
Adv-wrong	Qwen	3.56	3.86	2.92	2.49	N.A.	N.A.
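
One way to read the table is to compare each mild intervention with its severe counterpart. The sketch below copies the scores for the four attribute tasks (Animal, Language, Gender, Emotion) and averages them per row. The averaging is an assumption made here for illustration, not a metric reported by the authors, and MMAR and MMAU are left out because they are N.A. for the adversarial rows. The helper names `attribute_mean` and `gap` are hypothetical.

```python
from statistics import mean

# CoT consistency scores copied from the table above:
# (intervention, model) -> (Animal, Language, Gender, Emotion).
SCORES = {
    ("Mask 100%", "AF3"): (3.01, 2.63, 2.87, 3.10),
    ("Mask 100%", "Qwen"): (2.35, 2.40, 2.95, 2.64),
    ("Mask 20%", "AF3"): (4.60, 4.32, 4.35, 4.04),
    ("Mask 20%", "Qwen"): (4.68, 4.62, 3.89, 3.45),
    ("-20dB SNR", "AF3"): (2.89, 2.54, 3.17, 3.13),
    ("-20dB SNR", "Qwen"): (2.45, 2.27, 2.70, 2.61),
    ("20dB SNR", "AF3"): (4.41, 4.27, 3.76, 4.22),
    ("20dB SNR", "Qwen"): (4.58, 4.67, 3.93, 4.05),
    ("Adv-correct", "AF3"): (4.25, 4.26, 3.59, 3.55),
    ("Adv-correct", "Qwen"): (4.46, 4.59, 3.23, 2.88),
    ("Adv-wrong", "AF3"): (3.04, 3.43, 3.32, 2.99),
    ("Adv-wrong", "Qwen"): (3.56, 3.86, 2.92, 2.49),
}

# Pairs of rows to compare: (first intervention, second intervention).
PAIRS = [
    ("Mask 20%", "Mask 100%"),
    ("20dB SNR", "-20dB SNR"),
    ("Adv-correct", "Adv-wrong"),
]


def attribute_mean(intervention, model):
    """Average consistency over the four attribute tasks (an illustrative summary)."""
    return mean(SCORES[(intervention, model)])


def gap(first, second, model):
    """Difference in mean consistency between two interventions for one model."""
    return attribute_mean(first, model) - attribute_mean(second, model)


for model in ("AF3", "Qwen"):
    for first, second in PAIRS:
        print(f"{model:<4} {first} vs {second}: gap {gap(first, second, model):.2f}")
```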