Update README.md

README.md
---
base_model: google/gemma-2-9b
library_name: peft
license: cc-by-sa-4.0
datasets:
- universal-dependencies/universal_dependencies
language:
- en
- ja
- zh
- ko
- fr
- de
- sl
metrics:
- LAS
- UAS
- UPOS
pipeline_tag: text-generation
---

# Model Card for prompt-parsing-v0-gemma-2-9b-lora

[megagonlabs/prompt-based-parsing-gemma-2-9b-lora-v1](https://huggingface.co/megagonlabs/prompt-based-parsing-gemma-2-9b-lora-v1) is a dependency parsing model that analyzes a gold token sequence given in the user prompt in a step-by-step manner.
The model is trained on Universal Dependencies datasets covering 7 languages and provides SoTA-level accuracy for UPOS, UAS, and LAS.

## Terms of Use

This LoRA adapter package is released under the CC BY-SA 4.0 license.

However, please note the following important conditions regarding its usage:
- This package **does not contain any part of the original Gemma 2 model**.
- In order to use this package, you must obtain and use the base model distributed by Google:
  [Gemma 2 9B base on Hugging Face](https://huggingface.co/google/gemma-2-9b)
- **Use of the Gemma models requires agreement to the [Gemma Terms of Use](https://ai.google.dev/gemma/terms)**.

## Usage

- Install
```console
pip install -U vllm==0.7.2 sudachipy sudachidict-core
```

In this first release, we only provide a code example using the [sudachipy](https://github.com/WorksApplications/SudachiPy) tokenizer, whose token boundaries match those of the UD Japanese datasets (see the quick check below).
Code examples for other languages will be provided in upcoming releases.

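- Tokenizer check

Sudachi's SplitMode.A produces short-unit words, the token granularity used by UD Japanese-GSD, so its boundaries line up with the Task 1 rows in the example output further down. A minimal check (our addition, not part of the released example):
```python
import sudachipy

# SplitMode.A = short-unit words, matching UD Japanese token boundaries.
tokenizer = sudachipy.Dictionary().create(mode=sudachipy.Tokenizer.SplitMode.A)
print([m.surface() for m in tokenizer.tokenize("銀座でランチをご一緒しましょう。")])
# ['銀座', 'で', 'ランチ', 'を', 'ご', '一緒', 'し', 'ましょう', '。']
```
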
- Code example
```python
import sudachipy
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

base_model = "google/gemma-2-9b"
adapter_model = "megagonlabs/prompt-based-parsing-gemma-2-9b-lora-v1"
input_language = "Japanese"
input_sentences = ["銀座でランチをご一緒しましょう。", "この時代から、日本列島に人類が住んだ遺跡や遺物が多く発見されている。"]

tokenizer = sudachipy.Dictionary().create(mode=sudachipy.Tokenizer.SplitMode.A)

def tokenize_japanese_space_after(sentence: str) -> list[str]:
    # Tokenize with Sudachi and fold half-width / full-width spaces into a
    # trailing space on the preceding token (a "SpaceAfter" marker).
    tokens = []
    for m in tokenizer.tokenize(sentence):
        surface = m.surface()
        if surface in [" ", "　"]:
            if tokens and tokens[-1][-1] != " ":
                tokens[-1] += " "
        else:
            tokens.append(surface)
    if tokens and tokens[-1][-1] != " ":
        tokens[-1] += " "
    return tokens

def apply_template(language: str, sentence: str, tokens: list[str]) -> str:
    # The template text is kept verbatim: it is the prompt format the adapter expects.
    return """You are an <<<LANGUAGE>>> linguist and specialize in <<<LANGUAGE>>> dependency analysis based on Universal Dependencies.
We will now perform dependency parsing on <<<LANGUAGE>>> sentence.
After splitting the input sentence into words as shown below, execute following three tasks:

- Task 1

Create a TSV with three fields: word index from 1 to <<<TOKEN_NUM>>> + word + part of speech.

- Task 2
Add a field for the dependent word indexes to each row to the output of Task 1.
However, for the word that is the main predicate of the sentence, the dependent word index should be 0.

- Task 3

Add a field for the Universal Dependencies relation labels to the output of Task 2.


input sentence:
<<<SENTENCE>>>

words:
<<<TOKENS>>>
""".replace("<<<LANGUAGE>>>", language).replace("<<<TOKEN_NUM>>>", str(len(tokens))).replace("<<<SENTENCE>>>", sentence).replace("<<<TOKENS>>>", "\n".join(tokens))

input_prompts = [
    [
        {
            "role": "user",
            "content": apply_template(input_language, s, tokenize_japanese_space_after(s)),
        }
    ] for s in input_sentences
]

llm = LLM(
    model=base_model,
    enable_lora=True,
    tokenizer=adapter_model,
    dtype="bfloat16",
    gpu_memory_utilization=0.9,
    tensor_parallel_size=1,
    enforce_eager=True,
)
sampling_params = SamplingParams(
    temperature=0.,
    max_tokens=1024,  # <= 8192
)
lora_request = LoRARequest("adapter", 1, adapter_model)

results = llm.chat(
    messages=input_prompts,
    sampling_params=sampling_params,
    use_tqdm=False,
    lora_request=lora_request,
)
for sentence, result in zip(input_sentences, results):
    print("# text =", sentence)
    print(result.outputs[0].text)
```

- Output of code example
```
# text = 銀座でランチをご一緒しましょう。
- Task 1

1 銀座 PROPN
2 で ADP
3 ランチ NOUN
4 を ADP
5 ご NOUN
6 一緒 NOUN
7 し AUX
8 ましょう AUX
9 。 PUNCT

- Task 2

1 銀座 PROPN 6
2 で ADP 1
3 ランチ NOUN 6
4 を ADP 3
5 ご NOUN 6
6 一緒 NOUN 0
7 し AUX 6
8 ましょう AUX 6
9 。 PUNCT 6

- Task 3

1 銀座 PROPN 6 nmod
2 で ADP 1 case
3 ランチ NOUN 6 obj
4 を ADP 3 case
5 ご NOUN 6 compound
6 一緒 NOUN 0 root
7 し AUX 6 aux
8 ましょう AUX 6 aux
9 。 PUNCT 6 punct


# text = この時代から、日本列島に人類が住んだ遺跡や遺物が多く発見されている。
- Task 1

1 この DET
2 時代 NOUN
3 から ADP
4 、 PUNCT
5 日本 PROPN
6 列島 NOUN
7 に ADP
8 人類 NOUN
9 が ADP
10 住ん VERB
11 だ AUX
12 遺跡 NOUN
13 や ADP
14 遺物 NOUN
15 が ADP
16 多く ADJ
17 発見 VERB
18 さ AUX
19 れ AUX
20 て SCONJ
21 いる VERB
22 。 PUNCT

- Task 2

1 この DET 2
2 時代 NOUN 17
3 から ADP 2
4 、 PUNCT 2
5 日本 PROPN 6
6 列島 NOUN 10
7 に ADP 6
8 人類 NOUN 10
9 が ADP 8
10 住ん VERB 12
11 だ AUX 10
12 遺跡 NOUN 14
13 や ADP 12
14 遺物 NOUN 17
15 が ADP 14
16 多く ADJ 17
17 発見 VERB 0
18 さ AUX 17
19 れ AUX 17
20 て SCONJ 17
21 いる VERB 20
22 。 PUNCT 17

- Task 3

1 この DET 2 det
2 時代 NOUN 17 obl
3 から ADP 2 case
4 、 PUNCT 2 punct
5 日本 PROPN 6 compound
6 列島 NOUN 10 obl
7 に ADP 6 case
8 人類 NOUN 10 nsubj
9 が ADP 8 case
10 住ん VERB 12 acl
11 だ AUX 10 aux
12 遺跡 NOUN 14 nmod
13 や ADP 12 case
14 遺物 NOUN 17 nsubj
15 が ADP 14 case
16 多く ADJ 17 advcl
17 発見 VERB 0 root
18 さ AUX 17 aux
19 れ AUX 17 aux
20 て SCONJ 17 mark
21 いる VERB 20 fixed
22 。 PUNCT 17 punct
```

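- Parsing the output

To consume the result programmatically, the Task 3 block can be parsed back into structured rows. This is a minimal sketch of our own (not the recovery process used for evaluation, which is unreleased); it assumes tab-separated five-field rows and skips anything malformed:
```python
def parse_task3(output: str) -> list[tuple[int, str, str, int, str]]:
    """Collect (index, word, UPOS, head, deprel) tuples from the
    '- Task 3' section of the model output; malformed rows are skipped."""
    rows, in_task3 = [], False
    for line in output.splitlines():
        if line.startswith("- Task"):
            in_task3 = line.strip() == "- Task 3"
            continue
        fields = line.split("\t")
        if in_task3 and len(fields) == 5:
            try:
                rows.append((int(fields[0]), fields[1], fields[2],
                             int(fields[3]), fields[4]))
            except ValueError:
                pass  # index or head field is not an integer
    return rows

# Example: parse_task3(result.outputs[0].text)[:2]
# -> [(1, '銀座', 'PROPN', 6, 'nmod'), (2, 'で', 'ADP', 1, 'case')]
```
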
## Training and Evaluation

### Training Data and Hyperparameters

We used the training sets of the following UD datasets for LoRA SFT:
- [UD_English-EWT](https://github.com/UniversalDependencies/UD_English-EWT) r2.15
- [UD_Japanese-GSD](https://github.com/UniversalDependencies/UD_Japanese-GSD) r2.15
- [UD_Chinese-GSDSimp](https://github.com/UniversalDependencies/UD_Chinese-GSDSimp) r2.15
- [UD_Korean-GSD](https://github.com/UniversalDependencies/UD_Korean-GSD) r2.15
- [UD_French-GSD](https://github.com/UniversalDependencies/UD_French-GSD) r2.15
- [UD_German-GSD](https://github.com/UniversalDependencies/UD_German-GSD) r2.15
- [UD_Slovenian-SSJ](https://github.com/UniversalDependencies/UD_Slovenian-SSJ) r2.15

We also used the following training hyperparameters (a configuration sketch follows the list):
- lr: 5e-5
- num_train_epochs: 2
- lora_target_modules: "all-linear"
- lora_r: 8
- lora_alpha: 8
- lora_dropout: 0.05
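
These settings map onto the TRL/PEFT APIs roughly as sketched below. This is an illustrative reconstruction, not the released training script; the placeholder dataset rows and `output_dir` are ours:
```python
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Placeholder rows: the real data is the UD train splits rendered into the
# prompt/response format shown under Usage (not released).
train_dataset = Dataset.from_list([
    {"prompt": "...", "completion": "..."},
])

peft_config = LoraConfig(
    r=8,                          # lora_r
    lora_alpha=8,                 # lora_alpha
    lora_dropout=0.05,            # lora_dropout
    target_modules="all-linear",  # lora_target_modules
    task_type="CAUSAL_LM",
)
training_args = SFTConfig(
    output_dir="prompt-parsing-lora",  # illustrative
    learning_rate=5e-5,                # lr
    num_train_epochs=2,
)
trainer = SFTTrainer(
    model="google/gemma-2-9b",
    args=training_args,
    train_dataset=train_dataset,
    peft_config=peft_config,
)
trainer.train()
```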

The details of the experimental conditions will be released later.

### Evaluation Results

The accuracies in the table below were computed after applying a simple recovery process to the TSV output of Task 3.

| dataset | UPOS | UAS | LAS |
| ---- | ---- | ---- | ---- |
| [UD_English-EWT](https://github.com/UniversalDependencies/UD_English-EWT) | 0.982 | 0.951 | 0.937 |
| [UD_Japanese-GSD](https://github.com/UniversalDependencies/UD_Japanese-GSD) | 0.987 | 0.952 | 0.939 |
| [UD_Chinese-GSDSimp](https://github.com/UniversalDependencies/UD_Chinese-GSDSimp) | 0.972 | 0.889 | 0.862 |
| [UD_Korean-GSD](https://github.com/UniversalDependencies/UD_Korean-GSD) | 0.970 | 0.898 | 0.868 |
| [UD_French-GSD](https://github.com/UniversalDependencies/UD_French-GSD) | 0.981 | 0.956 | 0.943 |
| [UD_German-GSD](https://github.com/UniversalDependencies/UD_German-GSD) | 0.974 | 0.908 | 0.873 |
| [UD_Slovenian-SSJ](https://github.com/UniversalDependencies/UD_Slovenian-SSJ) | 0.989 | 0.954 | 0.939 |
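
For reference, UAS is the fraction of tokens whose head is predicted correctly, and LAS additionally requires the correct relation label. A minimal sketch over aligned gold/predicted (head, deprel) pairs:
```python
def attachment_scores(gold: list[tuple[int, str]],
                      pred: list[tuple[int, str]]) -> tuple[float, float]:
    """UAS: fraction of tokens with the correct head.
    LAS: fraction with the correct head and the correct relation label."""
    assert len(gold) == len(pred)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las
```
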
### Framework versions

- TRL 0.15.2 (for training)
- vLLM 0.7.2 (for inference)

## Citation

```bibtex
@article{matsuda-nl263,
  title={大規模言語モデルによる対話型依存構造解析},
  author={松田寛},
  journal={研究報告自然言語処理 (NL)},
  volume={2025},
  number={17},
  pages={1--7},
  year={2025},
  publisher={情報処理学会}
}
```