---
license: llama3.1
language:
- ko
- en
base_model:
- meta-llama/Llama-3.1-8B
tags:
- biology
- medical
- Llama
- pre-trained
- text-generation-inference
pipeline_tag: text-generation
---

# KoBioMed-Llama-3.1-8B
## Introduction
We introduce KoBioMed-Llama-3.1-8B, a bilingual (English and Korean) generative model specialized in the biomedical domain, developed by ezCaretech. The model has been continually pre-trained (CPT) on a dataset of PubMed abstracts and their Korean translations, which underwent extensive preprocessing including cleansing, de-duplication, and quality filtering.
KoBioMed-Llama-3.1-8B achieves the highest mean score among the compared models across Korean and English biomedical benchmarks, with the best results on KMMLU, KorMedMCQA, and PubMedQA. We hope this model will contribute significantly to the biomedical and medical research community.
This repository contains an 8-billion-parameter generative language model with the following key features:
- Developed by: AI Team, ezCaretech R&D Center
- Language Support: English and Korean
- Context Length: 8,192 tokens
- Vocab Size: 12,800
- License: llama3.1
### Notice!
- **This is a pre-trained model. It will be a great starting point for post-training, such as instruction tuning.**
- **This model was developed with support from the Korea Artificial Intelligence Industry Cluster Agency (AICA).**
- **A post-trained version (Instruction Tuning, DPO) is in progress and is scheduled for release in March 2025.**
## Evaluation
We evaluated KoBioMed-Llama-3.1-8B on several Korean and English biomedical benchmarks.
- All benchmarks were run with EleutherAI/lm-evaluation-harness in a 5-shot setting.
- The subsets used for the KMMLU and MMLU evaluations are listed below.
  - KMMLU: 'kmmlu_direct_biology'
  - MMLU: 'mmlu_college_biology', 'mmlu_clinical_knowledge', 'mmlu_anatomy', 'mmlu_college_medicine', 'mmlu_medical_genetics', 'mmlu_professional_medicine'
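
The setup above could be reproduced roughly as follows. This is a minimal sketch, not the authors' exact script: it assumes the `lm_eval` Python package is installed and uses its `simple_evaluate` entry point with the `hf` backend. The helper names `build_eval_config` and `run_eval` are hypothetical, and only the KMMLU and MMLU subsets listed above are included because the card does not give task names for the other benchmarks.

```python
# Sketch of a 5-shot evaluation config for lm-evaluation-harness.
# Helper names are illustrative; task names come from the list above.

MODEL_REPO = "Lowenzahn/KoBioMed-Llama-3.1-8B"

KMMLU_TASKS = ["kmmlu_direct_biology"]
MMLU_TASKS = [
    "mmlu_college_biology",
    "mmlu_clinical_knowledge",
    "mmlu_anatomy",
    "mmlu_college_medicine",
    "mmlu_medical_genetics",
    "mmlu_professional_medicine",
]


def build_eval_config(num_fewshot=5):
    """Return keyword arguments for lm_eval.simple_evaluate."""
    return {
        "model": "hf",
        # dtype=bfloat16 is an assumption that mirrors the Quickstart snippet
        "model_args": f"pretrained={MODEL_REPO},dtype=bfloat16",
        "tasks": KMMLU_TASKS + MMLU_TASKS,
        "num_fewshot": num_fewshot,
    }


def run_eval():
    """Run the evaluation; downloads the model and needs a suitable GPU."""
    import lm_eval  # imported lazily so the config can be inspected without it

    return lm_eval.simple_evaluate(**build_eval_config())
```

Calling `run_eval()` returns a dictionary whose `"results"` entry holds the per-task metrics; scores may differ slightly from the table depending on harness version and hardware.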
<table>
<tr>
<th>Models</th>
<th>KMMLU</th>
<th>KorMedMCQA</th>
<th>MedMCQA</th>
<th>MMLU</th>
<th>PubMedQA</th>
<th>Mean</th>
</tr>
<tr>
<td><a href="https://huggingface.co/Lowenzahn/KoBioMed-Llama-3.1-8B">KoBioMed-Llama-3.1-8B</a></td>
<td align="center"><strong>0.4010</strong></td>
<td align="center"><strong>0.5705</strong></td>
<td align="center"><u>0.5367</u></td>
<td align="center">0.6837</td>
<td align="center"><strong>0.7800</strong></td>
<td align="center"><strong>0.5944</strong></td>
</tr>
<tr>
<td><a href="https://huggingface.co/meta-llama/Llama-3.1-8B">Llama-3.1-8B</a></td>
<td align="center"><u>0.3620</u></td>
<td align="center">0.5105</td>
<td align="center"><strong>0.5635</strong></td>
<td align="center"><strong>0.7159</strong></td>
<td align="center">0.7600</td>
<td align="center"><u>0.5824</u></td>
</tr>
<tr>
<td><a href="https://huggingface.co/mistralai/Mistral-7B-v0.3">Mistral-7B-v0.3</a></td>
<td align="center">0.3130</td>
<td align="center">0.3958</td>
<td align="center">0.4927</td>
<td align="center">0.6693</td>
<td align="center">0.7740</td>
<td align="center">0.5290</td>
</tr>
<tr>
<td><a href="https://huggingface.co/beomi/Llama-3-Open-Ko-8B">Llama-3-Open-Ko-8B</a></td>
<td align="center">0.3340</td>
<td align="center">0.4941</td>
<td align="center">0.4743</td>
<td align="center">0.6251</td>
<td align="center">0.7320</td>
<td align="center">0.5319</td>
</tr>
<tr>
<td><a href="https://huggingface.co/upstage/SOLAR-10.7B-v1.0">SOLAR-10.7B-v1.0</a></td>
<td align="center">0.3200</td>
<td align="center"><u>0.5146</u></td>
<td align="center">0.5075</td>
<td align="center"><u>0.7050</u></td>
<td align="center"><u>0.7760</u></td>
<td align="center">0.5646</td>
</tr>
</table>
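
The Mean column is consistent with an unweighted average of the five benchmark scores in each row. The sketch below recomputes it from the table values under that assumption; the names `SCORES` and `mean_score` are illustrative only.

```python
# Recompute the Mean column, assuming it is the unweighted average of the
# five benchmark scores (KMMLU, KorMedMCQA, MedMCQA, MMLU, PubMedQA).

SCORES = {
    "KoBioMed-Llama-3.1-8B": [0.4010, 0.5705, 0.5367, 0.6837, 0.7800],
    "Llama-3.1-8B": [0.3620, 0.5105, 0.5635, 0.7159, 0.7600],
    "Mistral-7B-v0.3": [0.3130, 0.3958, 0.4927, 0.6693, 0.7740],
    "Llama-3-Open-Ko-8B": [0.3340, 0.4941, 0.4743, 0.6251, 0.7320],
    "SOLAR-10.7B-v1.0": [0.3200, 0.5146, 0.5075, 0.7050, 0.7760],
}


def mean_score(values):
    """Average the scores and round to four decimals, as in the table."""
    return round(sum(values) / len(values), 4)


for name, values in SCORES.items():
    print(f"{name}: {mean_score(values):.4f}")
```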
## Quickstart
Here is a code snippet for model inference.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo = 'Lowenzahn/KoBioMed-Llama-3.1-8B'

# Load the model in bfloat16 and place it on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(repo)

# Tokenize the prompt and move the tensors to the model's device
prompts = ["Machine learning is"]
inputs = tokenizer(prompts, return_tensors="pt").to(model.device)

# top_p and temperature only take effect when sampling is enabled
gen_kwargs = {"max_new_tokens": 1024, "top_p": 0.8, "temperature": 0.8, "do_sample": True, "repetition_penalty": 1.2}

# Passing **inputs forwards both input_ids and attention_mask
output = model.generate(**inputs, **gen_kwargs)
output = tokenizer.decode(output[0].tolist(), skip_special_tokens=True)
print(output)
```
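
Because the card lists a context length of 8,192 tokens, it may help to cap `max_new_tokens` so that prompt and generated tokens together stay within that window. The helper below is a small sketch under that assumption; `clamp_new_tokens` is a hypothetical name and is not part of `transformers`.

```python
# Keep prompt length + generated tokens within the stated context length.

CONTEXT_LENGTH = 8192  # value taken from the feature list in this card


def clamp_new_tokens(prompt_len, requested=1024, context_length=CONTEXT_LENGTH):
    """Return the largest max_new_tokens that still fits in the context."""
    if prompt_len >= context_length:
        raise ValueError("prompt already fills the context window")
    return min(requested, context_length - prompt_len)
```

With the Quickstart variables, this could be used as `gen_kwargs["max_new_tokens"] = clamp_new_tokens(inputs["input_ids"].shape[1])`.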
## Limitations
KoBioMed-Llama-3.1-8B demonstrates strong performance in the biomedical domain, but it can sometimes generate inappropriate responses. While we have made considerable efforts to exclude sensitive data and discriminatory, harmful, or biased content from the training data, issues may still arise. We emphasize that the text generated by KoBioMed-Llama-3.1-8B does not reflect the views of the AI Team at the ezCaretech R&D Center.
- The model may generate responses containing biased information related to age, gender, or race.
- The model may generate responses containing personal information, harmful content, or other inappropriate information.
- Since the model does not reflect the most up-to-date information, its responses may be outdated or contradictory.
- The model's performance may degrade on tasks unrelated to the biomedical and healthcare domains.
- KoBioMed-Llama-3.1-8B can make mistakes. Critical information should be verified independently.
## Training Data
This model was trained on preprocessed abstracts of papers published in PubMed from 2000 to 2023.
The preprocessing includes the following steps:
- Removal of URLs
- Removal of HTML tags
- Removal of reference citations
- Removal of identifiable information
- MinHash-based duplicate removal
- Removal of low-quality text using a scoring model
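
The steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the pipeline used to build the training set: the regular expressions, the use of e-mail addresses as the only form of identifiable information, the 3-word shingles, the 64 hash functions, and the 0.8 similarity threshold are all hypothetical choices, and the scoring-model filter is omitted because the card does not describe it.

```python
import hashlib
import re

# Illustrative patterns; the actual rules used for the model are not published.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")                # HTML tags
URL_RE = re.compile(r"https?://\S+|www\.\S+")            # URLs
CITATION_RE = re.compile(r"\[\d+(?:\s*[,-]\s*\d+)*\]")   # [1], [2, 3], [4-6]
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")       # e-mail addresses


def clean_text(text):
    """Strip tags, URLs, bracketed citations and e-mails, then tidy spaces."""
    for pattern in (TAG_RE, URL_RE, CITATION_RE, EMAIL_RE):
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def shingles(text, n=3):
    """Return the set of n-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(text, num_perm=64):
    """Compute a MinHash signature using salted BLAKE2b as the hash family."""
    doc_shingles = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in doc_shingles
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Keep the first document of every group of near-duplicates."""
    kept, kept_signatures = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_signatures):
            kept.append(doc)
            kept_signatures.append(sig)
    return kept
```

The pairwise comparison in `deduplicate` is quadratic in the number of documents, which is fine for a demonstration; at the scale of PubMed, a locality-sensitive hashing index over the signatures would normally replace it.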
## License
This model is released under the llama3.1 license.
## Supported by
This model was developed with support from the Korea Artificial Intelligence Industry Cluster Agency (AICA).
## Contact
조형민(Hyeongmin Cho), hyeongmin0121@gmail.com <br>
김인후(Inhu Kim), markaki72@gmail.com <br>
이동형(Donghyoung Lee), abidan88@gmail.com <br>
박달호(Dalho Park), dhpark@ezcaretech.com <br>
## Citation
**KoBioMed-Llama-3.1-8B**
```
@article{kobiomedllama,
title={KoBioMed-Llama-3.1-8B},
author={Hyeongmin Cho and Inhu Kim and Donghyoung Lee and Sanghwan Kim and Dalho Park and Inchul Kang and Kyul Kim and Jihoon Cho and Jongbeom Park},
year={2025},
url={https://huggingface.co/Lowenzahn/KoBioMed-Llama-3.1-8B}
}
```