Ke-Omni-R: Achieving Advanced Audio Reasoning with a Concise 50-Words Think Process
If you wish to train or run inference with the model, please visit the GitHub repository: https://github.com/shuaijiang/Ke-Omni-R/. If you find this model helpful, please like this model and star our GitHub repository.
Ke-Omni-R is an advanced audio reasoning model built upon Qwen2.5-Omni-7B. With only 10k post-training samples, Ke-Omni-R has achieved state-of-the-art performance on the MMAU Test-mini and Test benchmarks. Key insights from its development include:
- GRPO Algorithm: The GRPO algorithm significantly enhances the performance of the already strong base model (Qwen2.5-Omni-7B), demonstrating superior generalization even in unseen speech domains (a reward/advantage sketch appears after this list).
- Think Process: Incorporating a concise think process (less than 50 words) plays a crucial role in improving reasoning capabilities.
- KL Divergence: Slight improvements were observed during GRPO training by leveraging KL divergence.
- Domain Ratio vs. Data Volume: Domain diversity outweighs data volume. We utilized only 10k samples, with 5k randomly selected from AVQA and another 5k from MusicBench.
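As referenced in the GRPO item above, the sketch below illustrates the general idea of GRPO-style post-training with a verifiable reward: each sampled completion is scored for answer correctness and for following the `<think>` (under 50 words) `</think><answer></answer>` format, and advantages are computed relative to the group of completions sampled for the same question. The reward weights, normalization, and example strings are illustrative assumptions, not the exact values used for Ke-Omni-R; see the GitHub repository for the actual training code.

```python
import re
import numpy as np

def reward_fn(completion: str, reference: str) -> float:
    """Illustrative reward: format adherence plus answer accuracy (weights are assumptions)."""
    m = re.fullmatch(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>",
                     completion.strip(), re.DOTALL)
    if m is None:
        return 0.0                                   # reject completions without the required tags
    think, answer = m.group(1), m.group(2)
    concise = len(think.split()) < 50                # concise think process (< 50 words)
    correct = answer.strip().lower() == reference.strip().lower()
    return 0.5 * concise + 1.0 * correct

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize rewards within the group sampled for one question."""
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + 1e-6)

# A group of completions sampled for the same audio question (reference answer: "Car")
group = [
    "<think>Engine revving, consistent with a car.</think><answer>Car</answer>",
    "<think>Sounds like a jet engine taking off.</think><answer>aircraft</answer>",
    "Car",                                           # missing <think>/<answer> tags
]
advantages = group_relative_advantages([reward_fn(c, "Car") for c in group])
print(advantages)                                    # used to weight the policy-gradient update
```

In full GRPO these group-relative advantages weight a clipped policy-gradient objective, optionally regularized by a KL-divergence penalty toward the reference policy, which is where the slight KL-related gains mentioned above come in.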
Performance: Accuracies (%) on the MMAU Test-mini and Test benchmarks
Model | Method | Sound (Test-mini) | Sound (Test) | Music (Test-mini) | Music (Test) | Speech (Test-mini) | Speech (Test) | Average (Test-mini) | Average (Test) |
---|---|---|---|---|---|---|---|---|---|
- | Human* | 86.31 | - | 78.22 | - | 82.17 | - | 82.23 | - |
Gemini Pro 2.0 Flash | Direct Inference* | 56.46 | 61.73 | 58.68 | 56.53 | 51.65 | 61.53 | 55.60 | 59.93 |
Audio Flamingo 2 | Direct Inference* | 61.56 | 65.10 | 73.95 | 72.90 | 30.93 | 40.26 | 55.48 | 59.42 |
GPT4o + Strong Cap. | Direct Inference* | 57.35 | 55.83 | 49.70 | 51.73 | 64.86 | 68.66 | 57.30 | 58.74 |
Llama-3-8B-Instruct + Strong Cap. | Direct Inference* | 50.75 | 49.10 | 48.93 | 48.93 | 55.25 | 62.70 | 52.10 | 53.57 |
Qwen2-Audio-7B-Instruct | Direct Inference* | 54.95 | 45.90 | 50.98 | 53.26 | 42.04 | 45.90 | 49.20 | 52.50 |
SALMONN | Direct Inference* | 41.00 | 40.30 | 34.80 | 33.76 | 25.50 | 24.24 | 33.70 | 32.77 |
Audio-Reasoner(Qwen2-Audio-7B-Instruct) | [1] | 60.06 | - | 64.30 | - | 60.70 | - | 61.71 | - |
Audio-Cot(Qwen2-Audio-7B-Instruct) | [2] | 61.86 | - | 56.29 | - | 55.26 | - | 57.80 | - |
R1-AQA(Qwen2-Audio-7B-Instruct) | [3] | 68.77 | 69.76 | 64.37 | 61.40 | 63.66 | 62.70 | 65.60 | 64.36 |
Qwen2.5-Omni-7B | [4] | 67.87 | - | 69.16 | - | 59.76 | - | 65.60 | - |
Ke-Omni-R(Qwen2.5-Omni-7B) | GRPO(ours) | 69.37 | 71.90 | 69.46 | 67.13 | 67.87 | 67.10 | 68.90 | 68.71 |
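Each average is consistent with the unweighted mean of the three domain accuracies; for example, Ke-Omni-R's Test-mini average is (69.37 + 69.46 + 67.87) / 3 = 68.90.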
Note:
* The data are sourced from the MMAU leaderboard.
[1] Xie, Zhifei, et al. "Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models." arXiv preprint arXiv:2503.02318.
[2] Ma, Ziyang, et al. "Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model." arXiv preprint arXiv:2501.07246.
[3] Li, Gang, et al. "Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering." arXiv preprint arXiv:2503.11197.
[4] Xu, Jin, et al. "Qwen2.5-Omni Technical Report." arXiv preprint arXiv:2503.20215.
Usage
```python
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info
# You can insert a local file path, a URL, or base64-encoded audio at the desired position in the text.
messages = [
# Audio
## Local audio path
[{"role": "system", "content":[{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}]},
{"role": "user", "content": [{"type": "audio", "audio": "/path_to_avqa_wavs/-IBtBeR6B00_000000.wav"}, {"type": "text", "text": "Please describe this audio."}]}],
[{"role": "user", "content": [{"type": "audio", "audio": "/path_to_avqa_wavs/-IBtBeR6B00_000000.wav"}, {"type": "text", "text": "What is the main source of sound in the audio? ['aircraft', 'Car', 'Tank', 'Missile'] Output the thinking process (less than 50 words) in <think> </think> and final answer in <answer> </answer>."}]}],
[{"role": "user", "content": [{"type": "audio", "audio": "/path_to_avqa_wavs/-IBXTktoom8_000030.wav"}, {"type": "text", "text": "What animal is the main source of sound in the video? ['dog', 'wasp', 'honeybee', 'dragonfly'] Output the thinking process (less than 50 words) in <think> </think> and final answer in <answer> </answer>."}]}],
]
model_path = 'KE-Team/Ke-Omni-R'
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(model_path)
processor = Qwen2_5OmniProcessor.from_pretrained(model_path)
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
audios, images, videos = process_mm_info(messages, use_audio_in_video=False)
inputs = processor(text=text, images=images, videos=videos, audio=audios, padding=True, return_tensors="pt")
generation = model.generate(**inputs, thinker_temperature=0, thinker_do_sample=False, return_audio=False)  # return text tokens only
generated_ids = generation[:, inputs.input_ids.size(1):]
completions = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(completions)
```
The output should be:
```
["Well, it sounds like there's a car accelerating. You can hear the engine revving up, and there's a bit of a thump or thud sound too. It might be the car hitting something or just a part of the acceleration process. It gives off a sense of speed and power. What do you think about it? Do you have any other audio samples you want to talk about?", '<think>The audio features a vehicle accelerating and revving, which is characteristic of a car. The sound is consistent with a car engine, not an aircraft, tank, or missile.</think>\n<answer>Car</answer>', "<think>The main source of sound is a buzzing insect, which is consistent with the size and sound of a honeybee. The other options don't match the sound or context.</think>\n<answer>honeybee</answer>"]
```
Acknowledgements
We express our gratitude to the following projects and teams for their contributions:
- R1-AQA: Our GRPO-based training implementation references R1-AQA.
- Qwen Team: Special thanks for Qwen2.5-Omni-7B, which provides a robust foundation.
- Datasets: AVQA and MusicBench, from which the 10k post-training samples were drawn.
Citation
```bibtex
@misc{zhao2025keomnir,
  author       = {Zhao, Shuaijiang and Guo, Tingwei and Wen, Cheng and Xiang, Bajian and Zou, Wei},
  title        = {Ke-Omni-R: Achieving Advanced Audio Reasoning with a Concise 50-Words Think Process},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub Repository},
  howpublished = {\url{https://github.com/shuaijiang/Ke-Omni-R}},
}
```