lbourdois committed
Commit fb516b7 · verified · 1 Parent(s): f88ca16

Improve language tag


Hi! As the model is multilingual, this is a PR to add languages other than English to the language tag to improve referencing. Note that 29 languages are announced in the README, but only 13 are explicitly listed, so I was only able to add those 13 languages.

Files changed (1)
  1. README.md +112 -100
README.md CHANGED
@@ -1,101 +1,113 @@
---
license: mit
language:
- - zh
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
base_model:
- Qwen/Qwen2.5-14B-Instruct
pipeline_tag: audio-text-to-text
datasets:
- EastBrook/COIG-Kun-Aug-Audio
---

## Model Details
This is a speech+text -> text multimodal model: it combines the audio encoder of seamless-m4t-v2-large with the Qwen2.5-14B-Instruct text model.
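
For intuition, here is a minimal sketch of one common way to wire an audio encoder into a frozen text LM. The projection layer, module names, and dimensions below are illustrative assumptions, not taken from this repository's actual implementation (which ships via trust_remote_code).

```python
import torch
import torch.nn as nn

class SpeechTextLM(nn.Module):
    """Illustrative sketch only: couple a speech encoder to a frozen text LM
    through a trainable projection. Names and dimensions are assumptions,
    not taken from this repository."""

    def __init__(self, audio_encoder: nn.Module, text_lm: nn.Module,
                 audio_dim: int = 1024, lm_dim: int = 5120):
        super().__init__()
        self.audio_encoder = audio_encoder      # e.g. the seamless-m4t-v2-large speech encoder
        self.text_lm = text_lm                  # e.g. Qwen2.5-14B-Instruct, kept frozen
        for p in self.text_lm.parameters():
            p.requires_grad = False             # "everything except the text model is trainable"
        self.proj = nn.Linear(audio_dim, lm_dim)  # assumed trainable connector

    def forward(self, audio_feats: torch.Tensor, text_embeds: torch.Tensor):
        # Encode audio into a short token sequence (~6-7 tokens per second),
        # project it into the LM embedding space, and prepend it to the text embeddings.
        audio_embeds = self.proj(self.audio_encoder(audio_feats))
        inputs_embeds = torch.cat([audio_embeds, text_embeds], dim=1)
        return self.text_lm(inputs_embeds=inputs_embeds)
```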

Training:
Stage 1 trains on the ASR task with cleaned WeNet Chinese data (about 6,000 hours); everything except the text model is trainable at this stage.
Stage 2 starts from the questions in datasets such as chatgpt-corpus and moss-003-sft-data: Qwen2.5-72B-Instruct-GPTQ-Int4 first generates additional turns of questions, then generates answers to the multi-turn questions, and cosyvoice converts the questions to audio. This yields a multi-turn speech-input -> text-answer dataset of about 620k examples, used to train the speech-input -> text-answer Q&A task; again, everything except the text model is trainable.
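
A minimal sketch of the stage-2 data pipeline just described; `ask_llm` and `synthesize_speech` are hypothetical stand-ins for prompting Qwen2.5-72B-Instruct-GPTQ-Int4 and calling cosyvoice, not real APIs from this repo.

```python
# Hypothetical sketch of the stage-2 synthetic-data pipeline described above.
# ask_llm() stands in for prompting Qwen2.5-72B-Instruct-GPTQ-Int4;
# synthesize_speech() stands in for cosyvoice TTS. Neither is a real API here.

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for Qwen2.5-72B-Instruct-GPTQ-Int4")

def synthesize_speech(text: str) -> bytes:
    raise NotImplementedError("stand-in for cosyvoice TTS")

def build_example(seed_question: str, extra_turns: int = 2) -> list[dict]:
    """Expand a seed question into one multi-turn speech-in / text-out example."""
    questions = [seed_question]
    for _ in range(extra_turns):
        # The LLM first extends the dialogue with follow-up questions...
        questions.append(ask_llm(f"Given the dialogue so far {questions}, ask a follow-up question."))
    example = []
    for q in questions:
        answer = ask_llm(q)  # ...then answers each question,
        example.append({
            "question_audio": synthesize_speech(q),  # and each question is spoken via TTS.
            "answer_text": answer,
        })
    return example
```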

Advantages: the text model is kept fully frozen during training, so its original abilities are preserved; and the seamless-m4t-v2-large encoder maps each second of audio to 6-7 tokens on average, far fewer than whisper's 50.
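
To make that context-length saving concrete, a quick back-of-the-envelope calculation using only the per-second rates quoted above (6.5 tokens/s as the midpoint of 6-7):

```python
# Audio-token budget for a 60-second clip, using the rates quoted above.
seamless_rate = 6.5   # tokens/s, midpoint of the claimed 6-7 for seamless-m4t-v2-large
whisper_rate = 50.0   # tokens/s claimed for whisper

seconds = 60
print(f"seamless: ~{seamless_rate * seconds:.0f} tokens")  # ~390
print(f"whisper:  ~{whisper_rate * seconds:.0f} tokens")   # ~3000
```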

### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Model type:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]
- **Finetuned from model [optional]:** [More Information Needed]

## Uses
```python
import torch
import librosa
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "EastBrook/Qwen2.5-14B-SeamlessV2"
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
).to("cuda")
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "audio"},
            # Optionally pair the audio with a text instruction:
            # {"type": "text", "text": "请详细介绍一下强化学习中的GRPO。"},
        ],
    },
]

# Build the chat prompt as plain text; the audio is attached via the processor below.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Load the audio at 16 kHz, the sample rate the encoder expects.
audio_paths = [
    "/mnt/diskhd/Backup/Dataset/WenetSpeech/audio/train/podcast/B00022/X0000005821_5113963_S01270.mp3",
]
audios = [librosa.load(path, sr=16000)[0] for path in audio_paths]

inputs = processor(
    text=[text],
    images=None,
    videos=None,
    audios=audios,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Drop the prompt tokens so only the newly generated answer is decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("output_text:", output_text)
```
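
The commented-out text content part above suggests the processor also accepts plain text input. Continuing from the setup above, a text-only call might look like the following sketch; this mode is an assumption inferred from those comments, not verified against the repo's custom processor.

```python
# Text-only variant. ASSUMPTION: the custom processor accepts audios=None,
# as the commented-out lines in the example above suggest.
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": [{"type": "text", "text": "请详细介绍一下强化学习中的GRPO。"}]},
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=None, videos=None, audios=None,
                   padding=True, return_tensors="pt").to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
```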

### Out-of-Scope Use

The model was trained mainly on Chinese audio; its English capability is comparatively weak.
 