Tags: Audio-Text-to-Text · Transformers · Safetensors · qwen2_audio · text2text-generation
frankenliu and GrantL10 committed
Commit 320ac24 (verified)
Parent(s): 39b1e16

Update README.md (#10)

- Update README.md (eb1a98537616f38ed12563e7f7af4195a6e92359)

Co-authored-by: Gang Li <GrantL10@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +5 -8
README.md CHANGED
@@ -12,7 +12,7 @@ pipeline_tag: audio-text-to-text
  ## Introduction
  
  R1-AQA is an audio question answering (AQA) model based on `Qwen2-Audio-7B-Instruct`, optimized through reinforcement learning using the group relative policy optimization (GRPO) algorithm.
- This implementation has achieved state-of-the-art performance on MMAU *Test-mini* benchmark with only 38k post-training samples.
+ This implementation has achieved state-of-the-art performance on the MMAU benchmark with only 38k post-training samples.
  For more details, please refer to our [Github](https://github.com/xiaomi-research/r1-aqa) and [Technical Report](https://arxiv.org/abs/2503.11197).
  
  Our main findings are as follows:
@@ -27,7 +27,7 @@ Additional Notes:
  - The AVQA training set originally consists of approximately 40k samples. However, we use only about 38k samples because some data sources have become invalid. Other datasets using YouTube sources face a similar issue, such as AudioSet. We believe that the missing 2k samples do not have a significant impact on the training results.
  - The statement about the 8.2B parameters is based on the *Qwen2-Audio Technical Report*.
  
- ### Table: Accuracies (%) on MMAU benchmark
+ ### Table: Accuracies (%) on the MMAU benchmark
  
  | Model | Method | Sound (Test-mini) | Sound (Test) | Music (Test-mini) | Music (Test) | Speech (Test-mini) | Speech (Test) | Average (Test-mini) | Average (Test) |
  |---------------------------------------|-----------------------|-----------|-------|-----------|-------|-----------|------|------------|-------|
@@ -36,21 +36,18 @@ Additional Notes:
  | Audio Flamingo 2 | Direct Inference\* | 61.56 | 65.10 | 73.95 | 72.90 | 30.93 | 40.26 | 55.48 | 59.42 |
  | GPT4o + Strong Cap. | Direct Inference\* | 57.35 | 55.83 | 49.70 | 51.73 | 64.86 | 68.66 | 57.30 | 58.74 |
  | Llama-3-8B-Instruct + Strong Cap. | Direct Inference\* | 50.75 | 49.10 | 48.93 | 48.93 | 55.25 | 62.70 | 52.10 | 53.57 |
- | Gemini Pro v1.5 | Direct Inference\* | 56.75 | 54.46 | 49.40 | 48.56 | 58.55 | 55.90 | 54.90 | 52.97 |
  | Qwen2-Audio-7B-Instruct | Direct Inference\* | 54.95 | 45.90 | 50.98 | 53.26 | 42.04 | 45.90 | 49.20 | 52.50 |
- | GPT4o + weak cap. | Direct Inference\* | 39.33 | 35.80 | 41.90 | 39.52 | 58.25 | 68.27 | 45.70 | 48.65 |
- | Llama-3-8B-Instruct + Weak Cap. | Direct Inference\* | 34.23 | 33.73 | 38.02 | 42.36 | 54.05 | 61.54 | 42.10 | 45.87 |
  | SALAMONN | Direct Inference\* | 41.00 | 40.30 | 34.80 | 33.76 | 25.50 | 24.24 | 33.70 | 32.77 |
  | Qwen2-Audio-7B-Instruct | CoTA \[1\] | 60.06 | - | 64.30 | - | 60.70 | - | 61.71 | - |
  | Qwen2-Audio-7B-Instruct | Zero-Shot-CoT \[2\] | 61.86 | - | 56.29 | - | 55.26 | - | 57.80 | - |
- | **Qwen2-Audio-7B-Instruct** | **Ours 1️⃣** | 69.37 | - | 66.77 | - | 57.36 | - | 64.50 | - |
- | **Qwen2-Audio-7B-Instruct** | **Ours 2️⃣** | 68.77 | 69.76 | 64.37 | 61.40 | 63.66 | 62.70 | 65.60 | 64.36 |
+ | **Qwen2-Audio-7B-Instruct** | **GRPO (Ours) 1️⃣** | 69.37 | - | 66.77 | - | 57.36 | - | 64.50 | - |
+ | **Qwen2-Audio-7B-Instruct** | **GRPO (Ours) 2️⃣** | 68.77 | 69.76 | 64.37 | 61.40 | 63.66 | 62.70 | 65.60 | 64.36 |
  
  #### Notes
  
  1️⃣ This is the original model, identical to the one on Hugging Face and described in our technical report.
  2️⃣ This is the model submitted to [EvalAI](https://eval.ai/web/challenges/challenge-page/2391/overview) for evaluation, trained multiple times to achieve balanced results. (**The results on the [leaderboard](https://sakshi113.github.io/mmau_homepage/#leaderboard) contain some typographical errors, and we are currently in communication with the organizers to correct them.**)
- \* The data are sourced from the [MMAU official website](https://sakshi113.github.io/mmau_homepage/)
+ \* The data are sourced from the [MMAU official website](https://sakshi113.github.io/mmau_homepage/).
  \[1\] Xie, Zhifei, et al. "Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models." arXiv preprint arXiv:2503.02318 (2025).
  \[2\] Ma, Ziyang, et al. "Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model." arXiv preprint arXiv:2501.07246 (2025).
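
For readers who want to try the model this README describes, here is a minimal inference sketch using the standard `transformers` Qwen2-Audio workflow (`AutoProcessor` + `Qwen2AudioForConditionalGeneration`). The repository ID, audio file name, and question text are illustrative assumptions, not taken from this commit; see the linked GitHub repository for the authors' own usage example.

```python
# Minimal inference sketch for a Qwen2-Audio-based AQA model.
# Assumptions (not from this commit): the repository ID, the audio file,
# and the question text below are placeholders for illustration.
import librosa
import torch
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

MODEL_ID = "mispeech/r1-aqa"  # assumed Hugging Face repo ID; substitute the real one

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2AudioForConditionalGeneration.from_pretrained(MODEL_ID).to(device)

# One audio clip plus a multiple-choice question, formatted with the chat template.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio_url": "example.wav"},  # placeholder audio file
            {"type": "text", "text": "Which instrument is playing? A. piano B. violin C. drums D. flute"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# Qwen2-Audio's feature extractor expects 16 kHz waveforms.
audio, _ = librosa.load("example.wav", sr=processor.feature_extractor.sampling_rate)

inputs = processor(text=prompt, audios=[audio], return_tensors="pt", padding=True).to(device)
output_ids = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```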
 
 
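The README credits the gains to GRPO (group relative policy optimization). As a purely conceptual illustration of the "group relative" part, and not the authors' training code, the sketch below samples several answers per question, scores each with a simple correctness reward (an assumption, though a natural choice for multiple-choice AQA), and normalizes each reward by its group's mean and standard deviation to obtain the per-sample advantages that GRPO plugs into a clipped policy-gradient objective, with no separate value network.

```python
# Conceptual sketch of GRPO's group-relative advantage (not the r1-aqa training code).
# Assumption: a binary correctness reward over multiple-choice answers.
from statistics import mean, pstdev
from typing import List


def correctness_reward(prediction: str, answer: str) -> float:
    """1.0 if the sampled answer matches the reference choice, else 0.0."""
    return 1.0 if prediction.strip().lower() == answer.strip().lower() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean/std of its own group of sampled answers.

    These per-sample advantages replace a learned critic in GRPO's
    clipped policy-gradient update.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # One question, a group of 4 sampled answers, reference answer "B".
    samples = ["B", "A", "B", "D"]
    rewards = [correctness_reward(s, "B") for s in samples]
    print(rewards)                             # [1.0, 0.0, 1.0, 0.0]
    print(group_relative_advantages(rewards))  # correct samples get positive advantage
```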