Update README.md #10
by GrantL10 - opened

README.md CHANGED
## Introduction

R1-AQA is an audio question answering (AQA) model based on `Qwen2-Audio-7B-Instruct`, optimized through reinforcement learning using the group relative policy optimization (GRPO) algorithm.
This implementation has achieved state-of-the-art performance on the MMAU benchmark with only 38k post-training samples.
For more details, please refer to our [GitHub](https://github.com/xiaomi-research/r1-aqa) and [Technical Report](https://arxiv.org/abs/2503.11197).
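
As background for the GRPO objective mentioned above, the sketch below illustrates the group-relative advantage computation that gives the algorithm its name: several candidate answers are sampled per question, each is scored with a rule-based reward, and each sample's advantage is its reward normalized by the group's mean and standard deviation. This is an illustrative simplification, not the R1-AQA training code, and the toy reward function is our own assumption.

```python
# Illustrative sketch of GRPO's group-relative advantages (not the R1-AQA training code).
# For one question, G candidate answers are sampled from the policy, scored with a simple
# rule-based reward, and each advantage is computed relative to the group statistics.
from statistics import mean, pstdev


def accuracy_reward(predicted: str, correct_choice: str) -> float:
    """Toy reward: 1.0 if the predicted answer mentions the correct choice, else 0.0."""
    return 1.0 if correct_choice.lower() in predicted.lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """A_i = (r_i - mean(r)) / (std(r) + eps), computed within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled answers for one AQA question whose correct option is "B".
sampled_answers = ["The answer is B.", "I think it is A.", "B", "C sounds right."]
rewards = [accuracy_reward(a, "B") for a in sampled_answers]
advantages = group_relative_advantages(rewards)
print(rewards)     # [1.0, 0.0, 1.0, 0.0]
print(advantages)  # positive for correct answers, negative for incorrect ones
```

In the full algorithm these advantages weight a clipped policy-gradient objective with a KL penalty toward the reference model; see the technical report for the exact formulation used here.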

Our main findings are as follows:

Additional Notes:
- The AVQA training set originally consists of approximately 40k samples. However, we use only about 38k samples because some data sources have become invalid. Other datasets using YouTube sources face a similar issue, such as AudioSet. We believe that the missing 2k samples do not have a significant impact on the training results.
- The statement about the 8.2B parameters is based on the *Qwen2-Audio Technical Report*.
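
Before the benchmark results, a quick usage sketch: because R1-AQA keeps the `Qwen2-Audio-7B-Instruct` architecture, the checkpoint should load with the standard Qwen2-Audio classes in `transformers`. The snippet below is a minimal, illustrative example rather than an official one from this repository; the model path, audio file, and question are placeholders.

```python
# Minimal, illustrative inference sketch (assumes the standard Qwen2-Audio interface).
import librosa
import torch
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "path/to/r1-aqa-checkpoint"  # placeholder: use the id shown on this model card
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Build a chat-style prompt with one audio clip and one multiple-choice question.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio_url": "example.wav"},
            {"type": "text", "text": "Which instrument is playing? A. piano B. guitar C. drums D. violin"},
        ],
    }
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audio, _ = librosa.load("example.wav", sr=processor.feature_extractor.sampling_rate)

inputs = processor(text=text, audios=[audio], return_tensors="pt", padding=True).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```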

### Table: Accuracies (%) on the MMAU benchmark

| Model | Method | Sound (Test-mini) | Sound (Test) | Music (Test-mini) | Music (Test) | Speech (Test-mini) | Speech (Test) | Average (Test-mini) | Average (Test) |
|-------|--------|-------------------|--------------|-------------------|--------------|--------------------|---------------|---------------------|----------------|
| Audio Flamingo 2 | Direct Inference\* | 61.56 | 65.10 | 73.95 | 72.90 | 30.93 | 40.26 | 55.48 | 59.42 |
| GPT4o + Strong Cap. | Direct Inference\* | 57.35 | 55.83 | 49.70 | 51.73 | 64.86 | 68.66 | 57.30 | 58.74 |
| Llama-3-8B-Instruct + Strong Cap. | Direct Inference\* | 50.75 | 49.10 | 48.93 | 48.93 | 55.25 | 62.70 | 52.10 | 53.57 |
| Qwen2-Audio-7B-Instruct | Direct Inference\* | 54.95 | 45.90 | 50.98 | 53.26 | 42.04 | 45.90 | 49.20 | 52.50 |
| SALMONN | Direct Inference\* | 41.00 | 40.30 | 34.80 | 33.76 | 25.50 | 24.24 | 33.70 | 32.77 |
| Qwen2-Audio-7B-Instruct | CoTA \[1\] | 60.06 | - | 64.30 | - | 60.70 | - | 61.71 | - |
| Qwen2-Audio-7B-Instruct | Zero-Shot-CoT \[2\] | 61.86 | - | 56.29 | - | 55.26 | - | 57.80 | - |
| **Qwen2-Audio-7B-Instruct** | **GRPO (Ours) 1️⃣** | 69.37 | - | 66.77 | - | 57.36 | - | 64.50 | - |
| **Qwen2-Audio-7B-Instruct** | **GRPO (Ours) 2️⃣** | 68.77 | 69.76 | 64.37 | 61.40 | 63.66 | 62.70 | 65.60 | 64.36 |

#### Notes

1️⃣ The original model, identical to the one released on Hugging Face and described in our technical report.
2️⃣ The model submitted to [EvalAI](https://eval.ai/web/challenges/challenge-page/2391/overview) for evaluation, trained multiple times to achieve balanced results. (**The results on the [leaderboard](https://sakshi113.github.io/mmau_homepage/#leaderboard) contain some typographical errors, and we are currently in communication with the organizers to correct them.**)
\* The data are sourced from the [MMAU official website](https://sakshi113.github.io/mmau_homepage/).
\[1\] Xie, Zhifei, et al. "Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models." arXiv preprint arXiv:2503.02318 (2025).
\[2\] Ma, Ziyang, et al. "Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model." arXiv preprint arXiv:2501.07246 (2025).