---
datasets:
- lmg-anon/VNTL-v3.1-1k
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
base_model:
- Qwen/Qwen2.5-7B
pipeline_tag: translation
tags:
- translation
- vntl
- qwen
license: apache-2.0
library_name: transformers
---
# Qwen2.5-7B-VNTL-JP-EN

Qwen2.5-7B fine-tuned for Japanese-to-English translation.

Trained on ~150k sentences from [VNTL-v3.1-1k](https://huggingface.co/datasets/lmg-anon/VNTL-v3.1-1k).

The model was trained on standalone sentences in random order, without surrounding context, to keep it flexible and useful beyond visual novel translation.

## Usage

### Ollama

1. `ollama run technobyte/Qwen2.5-7B-VNTL-JP-EN:q4_k_m`
2. Input just the Japanese sentence. (To call the model from code instead, see the API sketch below.)
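
Ollama also serves a local HTTP API, so the model can be scripted. A minimal sketch, assuming the server is running on its default port (11434); the `translate` helper is our naming, not part of this repo:

```python
import json
import urllib.request

def translate(sentence: str) -> str:
    """Send one Japanese sentence to the locally running Ollama server."""
    payload = json.dumps({
        "model": "technobyte/Qwen2.5-7B-VNTL-JP-EN:q4_k_m",
        "prompt": sentence,  # the Modelfile template wraps this in the prompt format
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()

print(translate("放課後はマンガ喫茶でまったり〜♡ おすすめのマンガ教えて!"))
```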

### Llama.cpp

1. Download the [GGUF](https://huggingface.co/TechnoByte/Qwen2.5-7B-VNTL-JP-EN-GGUF/tree/main).
2. `llama-cli -m Qwen2.5-7B-VNTL-JP-EN-Q4_K_M.gguf -no-cnv -p "A Japanese sentence along with a proper English equivalent.\nJapanese: 放課後はマンガ喫茶でまったり〜♡ おすすめのマンガ教えて! \nEnglish: "`
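
The GGUF can also be loaded from Python via llama-cpp-python. A minimal sketch, assuming the quant above sits in the working directory (adjust `model_path` to wherever you downloaded it):

```python
from llama_cpp import Llama

llm = Llama(model_path="Qwen2.5-7B-VNTL-JP-EN-Q4_K_M.gguf")

# Build the plaintext prompt format documented below.
prompt = (
    "A Japanese sentence along with a proper English equivalent.\n"
    "Japanese: 放課後はマンガ喫茶でまったり〜♡ おすすめのマンガ教えて!\n"
    "English: "
)

# Stop at the newline so generation ends after the single translation line.
out = llm(prompt, max_tokens=128, stop=["\n"])
print(out["choices"][0]["text"].strip())
```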

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TechnoByte/Qwen2.5-7B-VNTL-JP-EN"

# Load the model in its native dtype and place it on available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# A single user message containing just the Japanese sentence;
# the bundled chat template turns it into the translation prompt.
messages = [
    {"role": "user", "content": "放課後はマンガ喫茶でまったり〜♡ おすすめのマンガ教えて!"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Strip the prompt tokens so only the newly generated tokens remain.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

## Prompt template

### Plaintext

```
A Japanese sentence along with a proper English equivalent.
Japanese: JAPANESE SENTENCE HERE
English:
```
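
Building this prompt by hand is straightforward; a small helper (our naming, not part of this repo) for use with raw-completion backends:

```python
def build_prompt(japanese: str) -> str:
    """Format one Japanese sentence into the model's plaintext prompt."""
    return (
        "A Japanese sentence along with a proper English equivalent.\n"
        f"Japanese: {japanese}\n"
        "English: "
    )
```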

### Jinja (HF Transformers)

```jinja
{% for i in range(0, messages|length, 2) %}A Japanese sentence along with a proper English equivalent.
Japanese: {{ messages[i].content }}
English:{% if i+1 < messages|length %} {{ messages[i+1].content }}<|endoftext|>{{ "
" }}{% else %}{% endif %}{% endfor %}
```
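
Because the template walks `messages` in user/assistant pairs, completed pairs render as in-context examples and an unpaired final user message is left open for the model to complete. That means past translations can be passed as few-shot context; an illustrative sketch reusing the tokenizer loaded above (the example pair is ours):

```python
messages = [
    # A completed pair, rendered as an in-context example.
    {"role": "user", "content": "お疲れ様です。"},
    {"role": "assistant", "content": "Thank you for your hard work."},
    # The open sentence the model should translate next.
    {"role": "user", "content": "放課後はマンガ喫茶でまったり〜♡ おすすめのマンガ教えて!"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```

Note the model was trained on single sentences in random order, so whether extra pairs actually improve quality is untested here.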

### Go (Ollama)

```
A Japanese sentence along with a proper English equivalent.
Japanese: {{ .Prompt }}
English: {{ .Response }}<|endoftext|>
```

## Limitations

- Translates only one sentence per turn (for longer passages, see the splitting sketch below).
- May use incorrect pronouns, since each sentence is translated without surrounding context.
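
One workaround for multi-sentence passages is to split the text and translate sentence by sentence. An illustrative sketch (our naming, reusing the `translate` helper from the Ollama section; each sentence still loses cross-sentence context):

```python
import re

def translate_passage(passage: str) -> str:
    """Split on Japanese/Latin sentence enders and translate each piece."""
    sentences = [s.strip() for s in re.split(r"(?<=[。!?!?])", passage) if s.strip()]
    return " ".join(translate(s) for s in sentences)
```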