Introduction

About the Model

We introduce ATOMIS, a large language model (LLM) with 32 billion parameters developed by the Korea Institute of Nuclear Safety (KINS) and designed specifically for the nuclear field. It achieves the best scores among the compared models on LogicKor, a real-world Korean task benchmark; NuclearQA, a nuclear-domain benchmark; and RAGEval, a RAG benchmark. Please refer to the evaluation tables below for details.

Key Features

  • Korean Real-World Use Cases: The model understands and generates Korean text with high accuracy, making it suitable for practical, real-world scenarios.
  • Specialized in the Nuclear Domain: The model has been specifically trained on a vast, specialized corpus of nuclear data.
  • RAG: The model delivers accurate answers grounded in real documents thanks to its strong RAG performance (see the retrieval-grounded example after the usage code below).

Pre-Training

We created the base model by expanding the layers of gemma-2-27b using a passthrough method. We then extended the context length to 32K with RoPE and performed continued pretraining to restore the model's performance. In particular, the pretraining corpus includes specialized nuclear-domain data so that the model learns knowledge specific to the nuclear field.
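
The exact up-scaling recipe is not part of this card, so the snippet below is only a minimal sketch of what passthrough-style layer expansion and RoPE-based context extension can look like with transformers. The duplicated layer range and the rope_theta value are illustrative assumptions, not the settings used for ATOMIS.

# Illustrative sketch only: the duplicated layer range and rope_theta value
# are assumptions, not the actual ATOMIS up-scaling recipe.
import copy
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b", torch_dtype=torch.bfloat16
)

# Passthrough-style depth up-scaling: repeat a block of middle decoder layers
# to deepen the model before continued pretraining.
layers = base.model.layers                                # ModuleList of decoder layers
dup = [copy.deepcopy(layer) for layer in layers[14:32]]   # hypothetical range to repeat
base.model.layers = torch.nn.ModuleList(list(layers[:32]) + dup + list(layers[32:]))
base.config.num_hidden_layers = len(base.model.layers)
# (Per-layer attention index bookkeeping for KV caching is omitted for brevity.)

# RoPE-based context extension: raise the rotary base frequency so positions
# up to 32K stay well resolved, then restore quality with continued pretraining
# on long sequences.
base.config.rope_theta = 160000.0                         # assumed value for illustration
base.config.max_position_embeddings = 32768

In practice, this kind of layer duplication is often done declaratively with merge tools such as mergekit's passthrough merge method.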

Post-Training

The fine-tuning data comprises over 1M publicly available instruction examples as well as high-quality synthetic data. We used this data to perform supervised fine-tuning (SFT) followed by direct preference optimization (DPO).
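
As a rough illustration of this pipeline, the snippet below sketches SFT followed by DPO with the TRL library. The dataset files, output paths, and hyperparameters are placeholder assumptions, not the configuration used to train ATOMIS, and the exact trainer arguments vary across TRL versions.

# Hedged sketch of an SFT -> DPO pipeline with TRL; dataset files and
# hyperparameters are placeholders, not the ATOMIS training configuration.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

model_id = "path/to/continually-pretrained-base"   # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# 1) Supervised fine-tuning on instruction data (prompt/response pairs).
sft_data = load_dataset("json", data_files="instructions.jsonl", split="train")
sft_trainer = SFTTrainer(
    model=model,
    train_dataset=sft_data,
    processing_class=tokenizer,
    args=SFTConfig(output_dir="sft-checkpoint"),
)
sft_trainer.train()

# 2) Direct preference optimization on (prompt, chosen, rejected) triples.
pref_data = load_dataset("json", data_files="preferences.jsonl", split="train")
dpo_trainer = DPOTrainer(
    model=sft_trainer.model,
    train_dataset=pref_data,
    processing_class=tokenizer,
    args=DPOConfig(output_dir="dpo-checkpoint", beta=0.1),
)
dpo_trainer.train()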

How to use

# pip install transformers==4.43.4 or later
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("KINS-ai/ATOMIS")
model = AutoModelForCausalLM.from_pretrained(
    "KINS-ai/ATOMIS",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "user", "content": "์•ˆ๋…•ํ•˜์„ธ์š”?"},
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
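
Because the model is tuned for RAG, you can ground its answer in retrieved text by placing the document directly in the prompt. The snippet below continues from the example above; the document placeholder, question, and prompt wording are illustrative, not a required format.

# Retrieval-grounded prompting (continues from the example above).
# Replace `context` with text returned by your own retriever; the prompt
# format shown here is illustrative.
context = "..."  # retrieved document text
question = "What does the document say about reactor shutdown criteria?"

rag_messages = [
    {
        "role": "user",
        "content": f"Answer using only the document below.\n\n[Document]\n{context}\n\n[Question]\n{question}",
    },
]

rag_inputs = tokenizer.apply_chat_template(
    rag_messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to("cuda")

rag_outputs = model.generate(**rag_inputs, max_new_tokens=512)
print(tokenizer.decode(rag_outputs[0][rag_inputs["input_ids"].shape[-1]:], skip_special_tokens=True))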

Evaluation

Overall

Model LogicKor NuclearQA RAGEval Avg
c4ai-command-r-08-2024 8.27 7.82 9.41 8.50
gemma-2-27b-it 8.66 8.18 8.97 8.60
Qwen2.5-32B-instruct 8.93 8.61 9.36 8.97
phi-4 8.62 8.67 9.55 8.95
Mistral-Small-24B-Instruct-2501 8.36 8.68 9.04 8.69
Llama-3.3-70b-instruct 7.94 8.42 9.25 8.54
ATOMIS 9.00 8.72 9.65 9.12

LogicKor

We evaluated performance using the official LogicKor evaluation code. As the judge model, we employed the officially recommended gpt-4-1106-preview. The scores reflect only the default zero-shot evaluation.
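
For reference, the snippet below is a simplified sketch of an LLM-as-judge call; the reported scores were produced with the official LogicKor code, and the grading prompt here is a placeholder, not the LogicKor rubric.

# Simplified LLM-as-judge sketch; the official LogicKor code was used for the
# reported scores, and this grading prompt is only a placeholder.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def judge(question: str, answer: str) -> str:
    """Ask the judge model to grade a single model response on a 1-10 scale."""
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {"role": "system", "content": "Score the answer from 1 to 10 and briefly justify the score."},
            {"role": "user", "content": f"Question:\n{question}\n\nAnswer:\n{answer}"},
        ],
    )
    return response.choices[0].message.content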

Model Math Reasoning Coding Writing Understanding Grammar Single-turn Multi-turn Avg
c4ai-command-r-08-2024 6.14 7.36 9.43 9.64 9.21 7.86 8.05 8.52 8.27
gemma-2-27b-it 8.93 8.29 8.43 9.29 9.43 7.57 8.43 8.88 8.66
Qwen2.5-32B-instruct 8.79 8.64 9.36 9.50 9.29 8.00 8.79 9.10 8.93
phi-4 8.79 9.21 9.86 9.21 9.00 5.64 8.50 8.74 8.62
Mistral-Small-24B-Instruct-2501 8.00 8.14 9.36 9.43 8.50 6.71 8.29 8.43 8.36
Llama-3.3-70b-instruct 7.43 6.50 8.79 8.43 8.64 7.86 8.14 7.74 7.94
ATOMIS 8.36 8.71 9.79 9.64 8.29 9.21 9.14 8.86 9.00

NuclearQA

We employed NuclearQA [1], a human-made benchmark consisting of 100 questions designed by experts to evaluate language models in the nuclear domain.

We then used this question set to assess the models' responses in a manner similar to the LogicKor benchmark.

[1] Acharya, A., Munikoti, S., Hellinger, A., Smith, S., Wagle, S. and Horawalavithana, S., 2023. NuclearQA: A Human-Made Benchmark for Language Models for the Nuclear Domain. arXiv:2310.10920.

Model Easy Medium Hard General Scientific Numerical Num+Sci Avg
c4ai-command-r-08-2024 8.77 8.21 6.47 7.73 8.38 7.35 7.35 7.82
gemma-2-27b-it 8.97 8.24 7.33 7.92 8.23 8.12 8.45 8.18
Qwen2.5-32B-instruct 8.97 8.42 8.38 8.54 8.15 8.76 9.03 8.61
phi-4 8.94 8.97 8.11 8.46 8.73 9.00 8.50 8.67
Mistral-Small-24B-Instruct-2501 9.13 8.76 8.14 8.41 8.81 8.59 8.95 8.68
Llama-3.3-70b-instruct 9.29 8.58 7.44 8.22 8.62 8.47 8.35 8.42
ATOMIS 9.10 8.64 8.31 8.16 9.00 8.71 9.10 8.72

RAGEval

We used RAGEval [2], a benchmark designed to evaluate RAG performance in terms of factual accuracy, using three novel metrics: Completeness, Hallucination, and Irrelevance.

We evaluated performance using the RAGEval code. As the judge model, we employed the officially recommended gpt-4o. These scores reflect only the completeness metric of the single-document QA evaluation.

[2] Zhu, K., Luo, Y., Xu, D., Wang, R., Yu, S., Wang, S., Yan, Y., Liu, Z., Han, X., Liu, Z. and Sun, M., 2024. RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework. arXiv:2408.01262.

Model Factual Summarization Multi-hop Reasoning Avg
c4ai-command-r-08-2024 1.000 0.913 0.908 0.941
gemma-2-27b-it 0.987 0.890 0.814 0.897
Qwen2.5-32B-instruct 0.980 0.906 0.923 0.936
phi-4 1.000 0.931 0.934 0.955
Mistral-Small-24B-Instruct-2501 0.980 0.951 0.781 0.904
Llama-3.3-70b-instruct 0.977 0.907 0.893 0.925
ATOMIS 0.993 0.942 0.960 0.965