Model Card


Model Summary

This model is the result of continual pre-training of Llama-3.1-8B on a mix of the Python subset of The-Stack-v2-train-smol-ids (from SwallowCode, Experiment 1) and multilingual text datasets. It was trained to establish the baseline performance of unfiltered Python code from The-Stack-v2 in the SwallowCode ablation experiments.

On code generation, the model scores 0.3689 on HumanEval and 0.3720 on HumanEval+ at 50B tokens, and it maintains general capabilities on knowledge, reasoning, and common-sense benchmarks. It serves as the reference point against which subsequent ablation experiments in the SwallowCode pipeline are compared.

It was trained on 50 billion tokens using a mix of 16% SwallowCode (Experiment 1, Python subset) and 84% multilingual text, following the setup described in the SwallowCode paper. Training was performed using Megatron-LM.

Use

Generation

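# pip install -q transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Keep the repository ID and the loaded model in separate variables.
model_id = "tokyotech-llm/<model-name>"
device = "cuda"  # use "cpu" to run without a GPU

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

# Tokenize the prompt (input IDs plus attention mask) and move it to the device.
inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(device)
# max_length counts the prompt tokens as well as the generated tokens.
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))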

Training

Model

Architecture: Llama-3.1
Pretraining tokens: 50B
Precision: bfloat16
Sequence length: 8,192
Tokenizer: Llama-3 tokenizer

Data

The training mix consists of:

16% Code: Python subset of The-Stack-v2-train-smol-ids (8B tokens), from SwallowCode, Experiment 1.
84% Multilingual Text:
- Japanese Wikipedia (0.84B tokens)
- Japanese Swallow Corpus v2 (26.1B tokens)
- Laboro-ParaCorpus (0.22B tokens)
- English Wikipedia (1.1B tokens)
- English Cosmopedia (3.7B tokens)
- English DCLM (10.0B tokens)

Details are in the paper’s Appendix.
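
As a quick sanity check on the mix above, the sketch below adds up the token counts listed in this card and compares them with the stated 16% / 84% split. The variable names are illustrative, and the figures are copied from the list above rather than taken from the training configuration.

```python
# Token budget reported in this model card (billions of tokens).
TOTAL_B = 50.0
CODE_B = 8.0  # Python subset of The-Stack-v2-train-smol-ids

# Multilingual text components as listed above (billions of tokens).
text_b = {
    "Japanese Wikipedia": 0.84,
    "Japanese Swallow Corpus v2": 26.1,
    "Laboro-ParaCorpus": 0.22,
    "English Wikipedia": 1.1,
    "English Cosmopedia": 3.7,
    "English DCLM": 10.0,
}

text_total_b = sum(text_b.values())
code_share = CODE_B / TOTAL_B
text_share = text_total_b / TOTAL_B

print(f"code: {CODE_B:.2f}B ({code_share:.1%})")
print(f"text: {text_total_b:.2f}B ({text_share:.1%})")
print(f"sum:  {CODE_B + text_total_b:.2f}B of {TOTAL_B:.0f}B")
```

The listed text components sum to 41.96B tokens, about 83.9% of the 50B budget, which is consistent with the rounded 84% figure.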

Hardware

GPUs: 64 NVIDIA H100 (94GB)
Interconnect: InfiniBand NDR200
Supercomputer: TSUBAME, Institute of Science Tokyo

Software

Megatron-LM (version core_r0.9.0) for training
lm-evaluation-harness for evaluation
BigCodeBench for code evaluation

Evaluation

The model was evaluated using the setup described in the SwallowCode paper, with the lm-evaluation-harness and BigCodeBench. Benchmarks include code generation (HumanEval, HumanEval+) and general tasks (OpenBookQA, TriviaQA, HellaSwag, SQuAD 2.0, XWINO, MMLU, GSM8K, BBH). Results are reported for checkpoints at 10B, 20B, 30B, 40B, and 50B tokens.

Evaluation Results (Experiment 1)

Tokens (B)	OpenBookQA	TriviaQA	HellaSwag	SQuAD2.0	XWINO	MMLU	GSM8K	BBH	HumanEval	HumanEval+
10	0.3640	0.6659	0.5995	0.3354	0.9032	0.6294	0.4602	0.6019	0.3366	0.3366
20	0.3540	0.6567	0.6019	0.3360	0.9024	0.6238	0.4852	0.5898	0.3433	0.3433
30	0.3700	0.6588	0.6034	0.3377	0.9045	0.6263	0.5072	0.5939	0.3402	0.3421
40	0.3800	0.6618	0.6053	0.3380	0.9097	0.6341	0.5011	0.6016	0.3659	0.3701
50	0.3700	0.6679	0.6054	0.3350	0.9045	0.6340	0.5027	0.6091	0.3689	0.3720
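
To make the trend across checkpoints easier to read, the following sketch computes the change in each score between the 10B and 50B checkpoints. The numbers are copied from the table above; the variable names are illustrative only.

```python
# Scores copied from the table above: first (10B) and last (50B) checkpoints.
benchmarks = ["OpenBookQA", "TriviaQA", "HellaSwag", "SQuAD2.0", "XWINO",
              "MMLU", "GSM8K", "BBH", "HumanEval", "HumanEval+"]
at_10b = [0.3640, 0.6659, 0.5995, 0.3354, 0.9032,
          0.6294, 0.4602, 0.6019, 0.3366, 0.3366]
at_50b = [0.3700, 0.6679, 0.6054, 0.3350, 0.9045,
          0.6340, 0.5027, 0.6091, 0.3689, 0.3720]

# Difference per benchmark, rounded to the table's four decimal places.
deltas = {name: round(b - a, 4)
          for name, a, b in zip(benchmarks, at_10b, at_50b)}

# Print benchmarks from largest gain to largest drop.
for name, d in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<12}{d:+.4f}")
```

GSM8K shows the largest gain (+0.0425), followed by HumanEval+ (+0.0354) and HumanEval (+0.0323), while SQuAD2.0 is essentially flat (-0.0004).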

Citation

@misc{fujii2025rewritingpretrainingdataboosts,
      title={Rewriting Pre-Training Data Boosts LLM Performance in Math and Code}, 
      author={Kazuki Fujii and Yukito Tajima and Sakae Mizuki and Hinari Shimada and Taihei Shiotani and Koshiro Saito and Masanari Ohi and Masaki Kawamura and Taishi Nakamura and Takumi Okamoto and Shigeki Ishida and Kakeru Hattori and Youmi Ma and Hiroya Takamura and Rio Yokota and Naoaki Okazaki},
      year={2025},
      eprint={2505.02881},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.02881}, 
}
