Large Audio Language Models (LALMs) integrate audio encoders with pretrained Large Language Models to perform complex multimodal reasoning tasks. While these models can generate Chain-of-Thought (CoT) explanations, the faithfulness of these reasoning chains remains unclear. In this work, we propose a systematic framework to evaluate CoT faithfulness in LALMs with respect to both the input audio and the final model prediction. We define three criteria for audio faithfulness: hallucination-free, holistic, and attentive listening. We also introduce a benchmark based on both audio and CoT interventions to assess faithfulness.
- We explore the impact of adversarial injections, masking, and additive noise on LALM audio processing.
- We analyze how altering the intermediate reasoning steps affects the faithfulness of the model's final prediction.
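The audio-side interventions above (masking a fraction of the waveform, mixing in noise at a target SNR) can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual pipeline; the function names `mask_audio` and `add_noise_at_snr` are hypothetical, and we assume mono waveforms as NumPy arrays.

```python
import numpy as np

def mask_audio(wave, fraction, rng=None):
    # Zero out a contiguous span covering `fraction` of the waveform
    # (e.g. 0.2 for "Mask 20%", 1.0 for "Mask 100%"). Hypothetical helper.
    rng = rng or np.random.default_rng(0)
    n = len(wave)
    span = int(n * fraction)
    start = 0 if span >= n else int(rng.integers(0, n - span + 1))
    masked = wave.copy()
    masked[start:start + span] = 0.0
    return masked

def add_noise_at_snr(wave, snr_db, rng=None):
    # Add white Gaussian noise scaled so the mixture has the target
    # signal-to-noise ratio in dB (e.g. 20 or -20). Hypothetical helper.
    rng = rng or np.random.default_rng(0)
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise
```

At -20dB SNR the noise power is 100x the signal power, which is why scores drop sharply under that condition in the table below.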
| Intervention | Model | Animal | Language | Gender | Emotion | MMAR | MMAU |
|---|---|---|---|---|---|---|---|
| Mask 100% | AF3 | 3.01 | 2.63 | 2.87 | 3.10 | 3.19 | 3.65 |
| Mask 100% | Qwen | 2.35 | 2.40 | 2.95 | 2.64 | 2.81 | 3.39 |
| Mask 20% | AF3 | 4.60 | 4.32 | 4.35 | 4.04 | 3.97 | 4.41 |
| Mask 20% | Qwen | 4.68 | 4.62 | 3.89 | 3.45 | 4.01 | 4.45 |
| -20dB SNR | AF3 | 2.89 | 2.54 | 3.17 | 3.13 | 3.09 | 3.57 |
| -20dB SNR | Qwen | 2.45 | 2.27 | 2.70 | 2.61 | 2.82 | 3.22 |
| 20dB SNR | AF3 | 4.41 | 4.27 | 3.76 | 4.22 | 4.12 | 4.44 |
| 20dB SNR | Qwen | 4.58 | 4.67 | 3.93 | 4.05 | 4.12 | 4.33 |
| Adv-correct | AF3 | 4.25 | 4.26 | 3.59 | 3.55 | N.A. | N.A. |
| Adv-correct | Qwen | 4.46 | 4.59 | 3.23 | 2.88 | N.A. | N.A. |
| Adv-wrong | AF3 | 3.04 | 3.43 | 3.32 | 2.99 | N.A. | N.A. |
| Adv-wrong | Qwen | 3.56 | 3.86 | 2.92 | 2.49 | N.A. | N.A. |