---
tags:
- text-generation
license: cc-by-nc-sa-4.0
language:
- ko
base_model: yanolja/KoSOLAR-10.7B-v0.1
pipeline_tag: text-generation
datasets:
- Edentns/data_go_kr-PublicDoc
- Edentns/aihub-TL_unanswerable_output
- Edentns/aihub-TL_span_extraction_how_output
- Edentns/aihub-TL_multiple_choice_output
- Edentns/aihub-TL_text_entailment_output
- jojo0217/korean_rlhf_dataset
- kyujinpy/KOR-OpenOrca-Platypus-v3
- beomi/KoAlpaca-v1.1a
- HumanF-MarkrAI/WIKI_QA_Near_dedup
---
# **DataVortexS-10.7B-v0.4**
<img src="./DataVortex.png" alt="DataVortex" style="height: 8em;">
## **Model Details**
### **Base Model**
[yanolja/KoSOLAR-10.7B-v0.1](https://huggingface.co/yanolja/KoSOLAR-10.7B-v0.1) _(Tokenizer Issue Fixed Version)_
### **Trained On**
- **OS**: Ubuntu 20.04
- **GPU**: 2 × H100 80GB
- **transformers**: v4.36.2
### **Dataset**
- Edentns/data_go_kr-PublicDoc - private
- Edentns/aihub-TL_unanswerable_output - private
- Edentns/aihub-TL_span_extraction_how_output - private
- Edentns/aihub-TL_multiple_choice_output - private
- Edentns/aihub-TL_text_entailment_output - private
- [jojo0217/korean_rlhf_dataset](https://huggingface.co/datasets/jojo0217/korean_rlhf_dataset)
- [kyujinpy/KOR-OpenOrca-Platypus-v3](https://huggingface.co/datasets/kyujinpy/KOR-OpenOrca-Platypus-v3)
- [beomi/KoAlpaca-v1.1a](https://huggingface.co/datasets/beomi/KoAlpaca-v1.1a)
- [HumanF-MarkrAI/WIKI_QA_Near_dedup](https://huggingface.co/datasets/HumanF-MarkrAI/WIKI_QA_Near_dedup)
### **Instruction format**
The model follows the **Alpaca** prompt format.
For example:
```python
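# English gloss of the Korean prompt:
#   system:      "You are an AI assistant that helps people find information."
#   instruction: "What is the capital of South Korea?"
#   response:    "The capital of South Korea is Seoul."
#   instruction: "What is the total population of Seoul?"
text = """\
당신은 사람들이 정보를 찾을 수 있도록 도와주는 인공지능 비서입니다.

### Instruction:
대한민국의 수도는 어디야?

### Response:
대한민국의 수도는 서울입니다.

### Instruction:
서울 인구는 총 몇 명이야?
"""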
```
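The layout above can also be assembled programmatically. The following is a minimal sketch, not the model's official template: `build_alpaca_prompt` is a hypothetical helper, and the blank-line separator between sections is an assumption. The exact whitespace the model was trained with is defined by the tokenizer's `chat_template`.

```python
def build_alpaca_prompt(system, turns, separator="\n\n"):
    """Render a system message and (instruction, response) turns as one prompt.

    `turns` is a list of (instruction, response) pairs. Pass None as the
    response of the final pair to leave that turn open for the model.
    The separator is an assumption, not taken from the model's template.
    """
    parts = [system]
    for instruction, response in turns:
        parts.append(f"### Instruction:\n{instruction}")
        if response is not None:
            parts.append(f"### Response:\n{response}")
    return separator.join(parts) + "\n"


prompt = build_alpaca_prompt(
    # "You are an AI assistant that helps people find information."
    "당신은 사람들이 정보를 찾을 수 있도록 도와주는 인공지능 비서입니다.",
    [
        # "What is the capital of South Korea?" / "The capital of South Korea is Seoul."
        ("대한민국의 수도는 어디야?", "대한민국의 수도는 서울입니다."),
        # "What is the total population of Seoul?" (left open for the model)
        ("서울 인구는 총 몇 명이야?", None),
    ],
)
print(prompt)
```

Keeping the turns as pairs makes it easy to append the model's reply and the next instruction when continuing a conversation.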
## **Model Benchmark**
### **[Ko-LLM-Leaderboard](https://huggingface.co/spaces/upstage/open-ko-llm-leaderboard)**
Benchmarking is in progress; scores of 0 in the table below are placeholders for results that are not yet available.
| Model | Average | Ko-ARC | Ko-HellaSwag | Ko-MMLU | Ko-TruthfulQA | Ko-CommonGen V2 |
| ---------------------------- | ------- | ------ | ------------ | ------- | ------------- | --------------- |
| DataVortexM-7B-Instruct-v0.1 | 39.81 | 34.13 | 42.35 | 38.73 | 45.46 | 38.37 |
| DataVortexS-10.7B-v0.1 | 0 | 0 | 0 | 0 | 0 | 0 |
| DataVortexS-10.7B-v0.2 | 0 | 0 | 0 | 0 | 0 | 0 |
| DataVortexS-10.7B-v0.3 | 0 | 0 | 0 | 0 | 0 | 0 |
| **DataVortexS-10.7B-v0.4** | **0** | **0** | **0** | **0** | **0** | **0** |
| DataVortexS-10.7B-v0.5 | 0 | 0 | 0 | 0 | 0 | 0 |
| DataVortexTL-1.1B-v0.1 | 0 | 0 | 0 | 0 | 0 | 0 |
| DataVortexS-10.7B-dpo-v0.1 | 0 | 0 | 0 | 0 | 0 | 0 |
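
As a quick sanity check on the table above, the Average column appears to be the unweighted mean of the five task scores. This is inferred from the DataVortexM-7B-Instruct-v0.1 row rather than stated in this card, so treat it as an assumption.

```python
# Scores copied from the DataVortexM-7B-Instruct-v0.1 row of the table.
scores = {
    "Ko-ARC": 34.13,
    "Ko-HellaSwag": 42.35,
    "Ko-MMLU": 38.73,
    "Ko-TruthfulQA": 45.46,
    "Ko-CommonGen V2": 38.37,
}

# Assumption: Average is the unweighted mean of the five task scores.
average = round(sum(scores.values()) / len(scores), 2)
print(average)  # 39.81, matching the Average reported for this row
```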
## **Implementation Code**
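The model's tokenizer includes a `chat_template` that applies this instruction format, so you can pass a list of messages to `apply_chat_template` instead of building the prompt by hand.
You can use the code below.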
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained("Edentns/DataVortexS-10.7B-v0.4")
tokenizer = AutoTokenizer.from_pretrained("Edentns/DataVortexS-10.7B-v0.4")

# Conversation in Korean; English glosses are given in the comments.
messages = [
    # "You are an AI assistant that helps people find information."
    {"role": "system", "content": "당신은 사람들이 정보를 찾을 수 있도록 도와주는 인공지능 비서입니다."},
    # "What is the capital of South Korea?"
    {"role": "user", "content": "대한민국의 수도는 어디야?"},
    # "The capital of South Korea is Seoul."
    {"role": "assistant", "content": "대한민국의 수도는 서울입니다."},
    # "What is the total population of Seoul?"
    {"role": "user", "content": "서울 인구는 총 몇 명이야?"},
]

# Render the conversation with the tokenizer's chat template and tokenize it.
encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = encodeds.to(device)
model.to(device)

# Sample up to 1000 new tokens; the returned ids include the prompt tokens.
generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])
```
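Because `generate` returns the prompt tokens followed by the new tokens, the decoded string contains the whole conversation. The sketch below shows one way to keep only the final reply by splitting on the response marker. `extract_last_response` is a hypothetical helper, and both the `### Response:` marker and the `</s>` end-of-sequence string are assumptions about what the decoded text looks like for this tokenizer.

```python
def extract_last_response(decoded, marker="### Response:", stop_tokens=("</s>",)):
    """Return the text after the final response marker, minus stop tokens.

    The marker and stop tokens are assumptions; adjust them to match what
    the tokenizer actually emits.
    """
    _, found, tail = decoded.rpartition(marker)
    if not found:
        # No marker present: fall back to the whole string.
        tail = decoded
    for token in stop_tokens:
        tail = tail.replace(token, "")
    return tail.strip()


# Made-up decoded text for illustration only, not real model output.
sample = (
    "### Instruction:\n서울 인구는 총 몇 명이야?\n\n"
    "### Response:\n(model reply goes here)</s>"
)
reply = extract_last_response(sample)
print(reply)  # (model reply goes here)
```

An alternative that avoids string matching is to slice the prompt length off the generated ids before decoding; the text-based version is shown here only because it needs no model or tensor library to try out.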
## **License**
The model is licensed under the [cc-by-nc-sa-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license, which allows others to copy, modify, and share the work non-commercially, as long as they give appropriate credit and distribute any derivative works under the same license.
<div align="center">
<a href="https://edentns.com/">
<img src="./Logo.png" alt="Logo" style="height: 3em;">
</a>
</div>