Investigating Faithfulness in Large Audio Language Models

Large Audio Language Models (LALMs) integrate audio encoders with pretrained Large Language Models to perform complex multimodal reasoning tasks. While these models can generate Chain-of-Thought (CoT) explanations, the faithfulness of these reasoning chains remains unclear. In this work, we propose a systematic framework to evaluate CoT faithfulness in LALMs with respect to both the input audio and the final model prediction. We define three criteria for audio faithfulness: hallucination-free, holistic, and attentive listening. We also introduce a benchmark based on both audio and CoT interventions to assess faithfulness.

🔊 Audio Intervention Overview

Explore the impact of adversarial injections, masking, and noise on LALM audio processing.

Audio Intervention Pipeline
Audio Pipeline

🧠 Chain-of-Thought (CoT) Overview

Analyze how altering the intermediate reasoning steps affects the model's final prediction, and therefore how faithful the CoT is to that prediction.

CoT Intervention Pipeline
CoT Pipeline

Audio Interventions

1. Adversarial Injection

Legend

Audio Flamingo 3 (AF3)

[Figure: AF3 accuracy (left) and AF3 CoT consistency (right) under adversarial injection]

Qwen2.5-Omni

[Figure: Qwen2.5-Omni accuracy (left) and Qwen2.5-Omni CoT consistency (right) under adversarial injection]

Audio Examples

⏺️ ORIGINAL AUDIO
✅ CORRECT INJECTIONS
❌ WRONG INJECTIONS

3. Adding Noise

[Figure: accuracy vs. SNR (left) and CoT consistency vs. SNR (right), with dataset and model legends]
Figure 2: Noise Intervention Results. Accuracy (left) and CoT Consistency (right) evaluated across varying Signal-to-Noise Ratio (SNR) levels.

Audio Examples: SNR Degradation

🔊 NOISE LEVELS (SNR)
Clean (No Noise)
20dB SNR
10dB SNR
0dB SNR
-10dB SNR
-20dB SNR
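
The SNR levels listed above can be reproduced with a small mixing routine. The following is a minimal sketch, not the benchmark's actual code: it assumes white Gaussian noise and a signal held as a plain list of float samples, and the names `add_noise_at_snr`, `mean_power`, and `measured_snr_db` are hypothetical helpers introduced for illustration. It relies only on the standard definition SNR_dB = 10 * log10(P_signal / P_noise).

```python
import math
import random


def mean_power(samples):
    """Average power (mean of squared amplitudes) of a sample sequence."""
    return sum(s * s for s in samples) / len(samples)


def add_noise_at_snr(signal, snr_db, seed=0):
    """Return `signal` plus white Gaussian noise scaled to the requested SNR (dB).

    Assumption: the noise type is white Gaussian; the page does not state
    which noise source the benchmark uses.
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    # SNR_dB = 10 * log10(P_signal / P_noise), solved for the noise power.
    target_noise_power = mean_power(signal) / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / mean_power(noise))
    return [s + scale * n for s, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """Recompute the SNR from the clean signal and the noisy mixture."""
    residual = [y - x for x, y in zip(clean, noisy)]
    return 10.0 * math.log10(mean_power(clean) / mean_power(residual))


# Usage: a 1-second 440 Hz tone at 16 kHz, degraded at each listed SNR level.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
for snr in (20, 10, 0, -10, -20):
    noisy = add_noise_at_snr(tone, snr)
    print(snr, round(measured_snr_db(tone, noisy), 2))
```

At 0dB SNR the noise carries as much power as the signal, and at -20dB SNR it carries 100 times more, which is why the lowest levels are the severe end of the intervention.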

2. Masking

[Figure: accuracy vs. mask ratio (left) and CoT consistency vs. mask ratio (right), with dataset and model legends]

Audio Examples: Masking Ratio

🔇 MASK RATIOS
Clean (0% Masked)
20% Masked
40% Masked
60% Masked
80% Masked
100% Masked
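
One way to produce the mask ratios listed above is to silence a fixed share of the waveform. The sketch below is an assumption about the masking scheme, not the benchmark's implementation: it zeroes a single contiguous span at a random position, whereas the actual benchmark may mask several scattered segments or use a different fill value. The function name `mask_audio` is hypothetical.

```python
import random


def mask_audio(samples, ratio, seed=0):
    """Silence one contiguous span covering `ratio` of the samples.

    Assumption: masking means replacing samples with zeros in a single
    randomly placed span; the page does not specify the scheme.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    n = len(samples)
    span = round(n * ratio)
    # randint is inclusive on both ends, so the span always fits in the clip.
    start = random.Random(seed).randint(0, n - span)
    masked = list(samples)
    masked[start:start + span] = [0.0] * span
    return masked


# Usage: apply each listed ratio to a dummy 1000-sample clip of ones.
clip = [1.0] * 1000
for ratio in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    silenced = mask_audio(clip, ratio).count(0.0)
    print(f"{ratio:.0%} masked -> {silenced} silent samples")
```

A ratio of 1.0 removes the audio entirely, so any accuracy that remains at 100% masking has to come from the text prompt alone.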

4. Guided Masking

Audio Flamingo 3 (AF3)

AF3 Modality Dependence

Qwen2.5-Omni

Qwen2.5 Modality Dependence
Figure 4: Modality Dependency under Guided Masking. The values in parentheses (on the left) indicate the mean similarity score compared to the reference answers. The percentages above the bars represent the proportion of responses that are audio-dependent (A), both-dependent (A-S), and speech-dependent (S).

Audio Examples: Guided Masking

🎛️ MASKING TARGETS
⏺️ Original Audio
Speech + Background Sound
🗣️ Speech Masked
Only Background Audio Remains
🎵 Audio Masked
Only Speech Remains

5. CoT Consistency Score

Intervention  Model  Animal  Language  Gender  Emotion  MMAR  MMAU
Mask 100%     AF3    3.01    2.63      2.87    3.10     3.19  3.65
Mask 100%     Qwen   2.35    2.40      2.95    2.64     2.81  3.39
Mask 20%      AF3    4.60    4.32      4.35    4.04     3.97  4.41
Mask 20%      Qwen   4.68    4.62      3.89    3.45     4.01  4.45
-20dB SNR     AF3    2.89    2.54      3.17    3.13     3.09  3.57
-20dB SNR     Qwen   2.45    2.27      2.70    2.61     2.82  3.22
20dB SNR      AF3    4.41    4.27      3.76    4.22     4.12  4.44
20dB SNR      Qwen   4.58    4.67      3.93    4.05     4.12  4.33
Adv-correct   AF3    4.25    4.26      3.59    3.55     N.A.  N.A.
Adv-correct   Qwen   4.46    4.59      3.23    2.88     N.A.  N.A.
Adv-wrong     AF3    3.04    3.43      3.32    2.99     N.A.  N.A.
Adv-wrong     Qwen   3.56    3.86      2.92    2.49     N.A.  N.A.

Chain-of-Thought (CoT) Interventions

1. Paraphrasing

Dataset Legend

Audio Flamingo 3 (AF3)

AF3 Paraphrasing

Qwen2.5-Omni

Qwen2.5 Paraphrasing

2. Early Answering

Dataset Legend

Audio Flamingo 3 (AF3)

AF3 Early Answering

Qwen2.5-Omni

Qwen2.5 Early Answering
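
Early answering is commonly implemented by truncating the reasoning chain at several points and asking the model to answer from each partial chain. The sketch below shows only the truncation step under stated assumptions: steps are approximated by sentence boundaries, the fractions are illustrative, and `split_steps` and `early_answer_prefixes` are hypothetical names. The page does not specify how this benchmark segments or truncates the CoT.

```python
import re


def split_steps(cot):
    """Split a reasoning chain into sentence-like steps (assumed step unit)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", cot.strip()) if s]


def early_answer_prefixes(cot, fractions=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Map each fraction to the CoT truncated to that share of its steps."""
    steps = split_steps(cot)
    prefixes = {}
    for fraction in fractions:
        keep = round(len(steps) * fraction)
        prefixes[fraction] = " ".join(steps[:keep])
    return prefixes


# Usage: each prefix would be fed back to the model with a forced-answer prompt.
cot = (
    "The clip contains repeated barking. Barking is produced by dogs. "
    "No other animal sounds are audible. So the answer is dog."
)
for fraction, prefix in early_answer_prefixes(cot).items():
    print(f"{fraction:.2f}: {prefix!r}")
```

If the model's answer rarely changes as more of the chain is revealed, the chain is contributing little to the final prediction.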

3. Adding Mistake

Dataset Legend

Audio Flamingo 3 (AF3)

AF3 Adding Mistake

Qwen2.5-Omni

Qwen2.5 Adding Mistake

4. Filler Tokens

Dataset Legend

Audio Flamingo 3 (AF3)

AF3 Filler Tokens

Qwen2.5-Omni

Qwen2.5 Filler Tokens
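
A filler-token intervention replaces the reasoning text with uninformative tokens while keeping its length, to test whether the content of the chain matters or only the extra computation. The sketch below is a minimal illustration under assumptions: the filler string `"..."` and the use of whitespace-delimited tokens as the length unit are choices made here, and `filler_cot` is a hypothetical name. The benchmark's actual filler and tokenization are not stated on this page.

```python
def filler_cot(cot, filler="..."):
    """Replace every whitespace-delimited token of the CoT with a filler token.

    Assumption: length is matched in whitespace tokens, not model tokens.
    """
    return " ".join(filler for _ in cot.split())


# Usage: the filler chain has the same token count but carries no reasoning.
original = "The speaker raises their voice, so the emotion is anger."
print(filler_cot(original))
```

If accuracy with the filler chain stays close to accuracy with the real chain, the model is likely not relying on the stated reasoning.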