Introduction

About the Model

We introduce ATOMIS, a large language model (LLM) with 32 billion parameters developed by the Korea Institute of Nuclear Safety (KINS) and designed specifically for the nuclear field. It achieves the best scores among the compared models on LogicKor, a real-world Korean task benchmark; NuclearQA, a nuclear-domain benchmark; and RAGEval, a RAG benchmark. Please refer to the evaluation tables below for details.

Key Features

  • Korean Real-World Use Cases: The model understands and generates Korean text with high accuracy, making it suitable for practical, real-world scenarios.
  • Specialized in the Nuclear Domain: The model has been specifically trained on a vast, specialized corpus of nuclear data.
  • RAG: The model delivers accurate answers grounded in real documents thanks to its strong RAG performance (see the retrieval-grounded example after the usage code below).

Pre-Training

We created the base model by expanding the layers of gemma-2-27b using a passthrough method. We then extended the context length to 32K with RoPE and performed continued pretraining to restore the model's performance. In particular, the pretraining corpus includes specialized nuclear-domain data so that the model learns knowledge specific to the nuclear field.
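
The exact up-scaling recipe is not part of this card, so the snippet below is only a minimal sketch of what passthrough-style layer expansion and RoPE-based context extension can look like with transformers. The duplicated layer range and the rope_theta value are illustrative assumptions, not the settings used for ATOMIS.

# Illustrative sketch only: the duplicated layer range and rope_theta value
# are assumptions, not the actual ATOMIS up-scaling recipe.
import copy
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b", torch_dtype=torch.bfloat16
)

# Passthrough-style depth up-scaling: repeat a block of middle decoder layers
# to deepen the model before continued pretraining.
layers = base.model.layers                                # ModuleList of decoder layers
dup = [copy.deepcopy(layer) for layer in layers[14:32]]   # hypothetical range to repeat
base.model.layers = torch.nn.ModuleList(list(layers[:32]) + dup + list(layers[32:]))
base.config.num_hidden_layers = len(base.model.layers)
# (Per-layer attention index bookkeeping for KV caching is omitted for brevity.)

# RoPE-based context extension: raise the rotary base frequency so positions
# up to 32K stay well resolved, then restore quality with continued pretraining
# on long sequences.
base.config.rope_theta = 160000.0                         # assumed value for illustration
base.config.max_position_embeddings = 32768

In practice, this kind of layer duplication is often done declaratively with merge tools such as mergekit's passthrough merge method.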

Post-Training

The fine-tuning data comprises over 1M publicly available instruction examples as well as high-quality synthetic data. We used this data to perform supervised fine-tuning (SFT) followed by direct preference optimization (DPO).
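
As a rough illustration of this pipeline, the snippet below sketches SFT followed by DPO with the TRL library. The dataset files, output paths, and hyperparameters are placeholder assumptions, not the configuration used to train ATOMIS, and the exact trainer arguments vary across TRL versions.

# Hedged sketch of an SFT -> DPO pipeline with TRL; dataset files and
# hyperparameters are placeholders, not the ATOMIS training configuration.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

model_id = "path/to/continually-pretrained-base"   # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# 1) Supervised fine-tuning on instruction data (prompt/response pairs).
sft_data = load_dataset("json", data_files="instructions.jsonl", split="train")
sft_trainer = SFTTrainer(
    model=model,
    train_dataset=sft_data,
    processing_class=tokenizer,
    args=SFTConfig(output_dir="sft-checkpoint"),
)
sft_trainer.train()

# 2) Direct preference optimization on (prompt, chosen, rejected) triples.
pref_data = load_dataset("json", data_files="preferences.jsonl", split="train")
dpo_trainer = DPOTrainer(
    model=sft_trainer.model,
    train_dataset=pref_data,
    processing_class=tokenizer,
    args=DPOConfig(output_dir="dpo-checkpoint", beta=0.1),
)
dpo_trainer.train()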

How to use

# pip install transformers==4.43.4 or later
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("KINS-ai/ATOMIS")
model = AutoModelForCausalLM.from_pretrained(
    "KINS-ai/ATOMIS",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "user", "content": "์•ˆ๋…•ํ•˜์„ธ์š”?"},
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
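
Because the model is tuned for RAG, you can ground its answer in retrieved text by placing the document directly in the prompt. The snippet below continues from the example above; the document placeholder, question, and prompt wording are illustrative, not a required format.

# Retrieval-grounded prompting (continues from the example above).
# Replace `context` with text returned by your own retriever; the prompt
# format shown here is illustrative.
context = "..."  # retrieved document text
question = "What does the document say about reactor shutdown criteria?"

rag_messages = [
    {
        "role": "user",
        "content": f"Answer using only the document below.\n\n[Document]\n{context}\n\n[Question]\n{question}",
    },
]

rag_inputs = tokenizer.apply_chat_template(
    rag_messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to("cuda")

rag_outputs = model.generate(**rag_inputs, max_new_tokens=512)
print(tokenizer.decode(rag_outputs[0][rag_inputs["input_ids"].shape[-1]:], skip_special_tokens=True))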

Evaluation

Overall

Model LogicKor NuclearQA RAGEval Avg
c4ai-command-r-08-2024 8.27 7.82 9.41 8.50
gemma-2-27b-it 8.66 8.18 8.97 8.60
Qwen2.5-32B-instruct 8.93 8.61 9.36 8.97
phi-4 8.62 8.67 9.55 8.95
Mistral-Small-24B-Instruct-2501 8.36 8.68 9.04 8.69
Llama-3.3-70b-instruct 7.94 8.42 9.25 8.54
ATOMIS 9.00 8.72 9.65 9.12

LogicKor

We evaluated performance using the official LogicKor evaluation code. As the judge model, we employed the officially recommended gpt-4-1106-preview. The scores reflect only the default zero-shot evaluation.
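
For reference, the snippet below is a simplified sketch of an LLM-as-judge call; the reported scores were produced with the official LogicKor code, and the grading prompt here is a placeholder, not the LogicKor rubric.

# Simplified LLM-as-judge sketch; the official LogicKor code was used for the
# reported scores, and this grading prompt is only a placeholder.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def judge(question: str, answer: str) -> str:
    """Ask the judge model to grade a single model response on a 1-10 scale."""
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {"role": "system", "content": "Score the answer from 1 to 10 and briefly justify the score."},
            {"role": "user", "content": f"Question:\n{question}\n\nAnswer:\n{answer}"},
        ],
    )
    return response.choices[0].message.content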

Model Math Reasoning Coding Writing Understanding Grammar Single-turn Multi-turn Avg
c4ai-command-r-08-2024 6.14 7.36 9.43 9.64 9.21 7.86 8.05 8.52 8.27
gemma-2-27b-it 8.93 8.29 8.43 9.29 9.43 7.57 8.43 8.88 8.66
Qwen2.5-32B-instruct 8.79 8.64 9.36 9.50 9.29 8.00 8.79 9.10 8.93
phi-4 8.79 9.21 9.86 9.21 9.00 5.64 8.50 8.74 8.62
Mistral-Small-24B-Instruct-2501 8.00 8.14 9.36 9.43 8.50 6.71 8.29 8.43 8.36
Llama-3.3-70b-instruct 7.43 6.50 8.79 8.43 8.64 7.86 8.14 7.74 7.94
ATOMIS 8.36 8.71 9.79 9.64 8.29 9.21 9.14 8.86 9.00

NuclearQA

We employed NuclearQA [1], a human-made benchmark consisting of 100 questions designed by experts to evaluate language models in the nuclear domain.

We then used this question set to assess the models' responses in a manner similar to the LogicKor benchmark.

[1] Acharya, A., Munikoti, S., Hellinger, A., Smith, S., Wagle, S. and Horawalavithana, S., 2023. NuclearQA: A Human-Made Benchmark for Language Models for the Nuclear Domain. arXiv:2310.10920.

Model Easy Medium Hard General Scientific Numerical Num+Sci Avg
c4ai-command-r-08-2024 8.77 8.21 6.47 7.73 8.38 7.35 7.35 7.82
gemma-2-27b-it 8.97 8.24 7.33 7.92 8.23 8.12 8.45 8.18
Qwen2.5-32B-instruct 8.97 8.42 8.38 8.54 8.15 8.76 9.03 8.61
phi-4 8.94 8.97 8.11 8.46 8.73 9.00 8.50 8.67
Mistral-Small-24B-Instruct-2501 9.13 8.76 8.14 8.41 8.81 8.59 8.95 8.68
Llama-3.3-70b-instruct 9.29 8.58 7.44 8.22 8.62 8.47 8.35 8.42
ATOMIS 9.10 8.64 8.31 8.16 9.00 8.71 9.10 8.72

RAGEval

We used RAGEval [2], a benchmark designed to evaluate RAG performance in terms of factual accuracy, using three novel metrics: Completeness, Hallucination, and Irrelevance.

We evaluated performance using the RAGEval code. As the judge model, we employed the officially recommended gpt-4o. These scores reflect only the completeness metric of the single-document QA evaluation.

[2] Zhu, K., Luo, Y., Xu, D., Wang, R., Yu, S., Wang, S., Yan, Y., Liu, Z., Han, X., Liu, Z. and Sun, M., 2024. RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework. arXiv:2408.01262.

Model Factual Summarization Multi-hop Reasoning Avg
c4ai-command-r-08-2024 1.000 0.913 0.908 0.941
gemma-2-27b-it 0.987 0.890 0.814 0.897
Qwen2.5-32B-instruct 0.980 0.906 0.923 0.936
phi-4 1.000 0.931 0.934 0.955
Mistral-Small-24B-Instruct-2501 0.980 0.951 0.781 0.904
Llama-3.3-70b-instruct 0.977 0.907 0.893 0.925
ATOMIS 0.993 0.942 0.960 0.965