Kimi-Audio

🤗 Kimi-Audio-7B | 🤗 Kimi-Audio-7B-Instruct | 📑 Paper

Introduction

We present Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation. This repository hosts the model checkpoints for Kimi-Audio-7B.

Kimi-Audio is designed as a universal audio foundation model capable of handling a wide variety of audio processing tasks within a single unified framework. Key features include:

  • Universal Capabilities: Handles diverse tasks like speech recognition (ASR), audio question answering (AQA), audio captioning (AAC), speech emotion recognition (SER), sound event/scene classification (SEC/ASC) and end-to-end speech conversation.
  • State-of-the-Art Performance: Achieves SOTA results on numerous audio benchmarks (see our Technical Report).
  • Large-Scale Pre-training: Pre-trained on over 13 million hours of diverse audio data (speech, music, sounds) and text data.
  • Novel Architecture: Employs a hybrid audio input (continuous acoustic + discrete semantic tokens) and an LLM core with parallel heads for text and audio token generation.
  • Efficient Inference: Features a chunk-wise streaming detokenizer based on flow matching for low-latency audio generation.
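The chunk-wise streaming idea in the last bullet can be illustrated with a minimal sketch. It shows only the control flow: audio tokens are grouped into fixed-size chunks, and each chunk is decoded as soon as it is complete, so the first audio is available before the full token sequence exists. The names `chunked`, `toy_decode_chunk`, and `stream_detokenize` are hypothetical, and the toy decoder is a placeholder standing in for the flow-matching detokenizer. It is not the Kimi-Audio implementation or API.

```python
from typing import Callable, Iterable, Iterator, List


def chunked(tokens: Iterable[int], chunk_size: int) -> Iterator[List[int]]:
    """Group an incoming token stream into lists of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    buf: List[int] = []
    for token in tokens:
        buf.append(token)
        if len(buf) == chunk_size:
            yield buf
            buf = []
    if buf:  # flush the final, possibly shorter, chunk
        yield buf


def toy_decode_chunk(chunk: List[int]) -> List[float]:
    """Hypothetical placeholder decoder: emits two 'samples' per token.

    A real detokenizer would run a learned model here; this is an assumption
    made only so the sketch is runnable.
    """
    samples: List[float] = []
    for token in chunk:
        samples.extend([token / 10.0, token / 10.0])
    return samples


def stream_detokenize(
    tokens: Iterable[int],
    chunk_size: int,
    decode_chunk: Callable[[List[int]], List[float]] = toy_decode_chunk,
) -> Iterator[List[float]]:
    """Yield decoded audio piece by piece instead of waiting for all tokens."""
    for chunk in chunked(tokens, chunk_size):
        yield decode_chunk(chunk)


# Five tokens with chunk_size=2 produce three pieces of audio.
pieces = list(stream_detokenize(range(5), chunk_size=2))
```

Because `stream_detokenize` is a generator, a caller can start playback after the first `next()` call. Latency then depends on `chunk_size`, not on the total sequence length.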

For more details, please refer to our GitHub Repository and Technical Report.

Note

Kimi-Audio-7B is a base model that has not been fine-tuned, so it cannot be used directly. The base model is flexible: you can fine-tune it for any downstream task.

If you are looking for an out-of-the-box model, please refer to Kimi-Audio-7B-Instruct.

Citation

If you find Kimi-Audio useful in your research or applications, please cite our technical report:

@misc{kimi_audio_2024,
      title={Kimi-Audio Technical Report},
      author={Kimi Team},
      year={2024},
      eprint={arXiv:placeholder},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

The model is based on and modified from Qwen2.5-7B. Code derived from Qwen2.5-7B is licensed under the Apache 2.0 License. Other parts of the code are licensed under the MIT License.

Model size: 9.77B params (Safetensors, tensor type BF16)