360Zhinao3 (360智脑)

Visit the official 360Zhinao website at https://ai.360.com to try the models.

Introduction

🎉🎉🎉 Qihoo 360 has recently upgraded and open-sourced its self-developed 7B-parameter model, 360Zhinao3-7B. It is available in the 360zhinao3 repository on GitHub and is free for commercial use. The model's capabilities have improved across the board: among models with fewer than 10B parameters, 360Zhinao3-7B ranks first on multiple benchmarks.

360Zhinao3-7B
360Zhinao3-7B-Instruct
360Zhinao3-7B-O1.5

Notable features of our 360Zhinao3 models are:

360Zhinao3-7B is continually pre-trained on 700B high-quality tokens, starting from 360Zhinao2-7B. The two models share exactly the same architecture; the performance gain stems mainly from improved training data quality.

News and Updates

[2025.04.14] 🔥🔥🔥 We released the 360Zhinao3 series, open-sourcing 360Zhinao3-7B, 360Zhinao3-7B-Instruct, and the long chain-of-thought model 360Zhinao3-7B-O1.5.
[2024.11.18] We released 360Zhinao2-7B, providing both the base model and chat models with context lengths of 4K, 32K, and 360K.
[2024.05.23] We released two models, 360Zhinao-search and 360Zhinao-1.8B-Reranking, which ranked first in the Retrieval and Reranking tasks of the C-MTEB Leaderboard, respectively.
[2024.05.20] We extended llama3 and released llama3-8B-360Zhinao-360k-Instruct🤗
[2024.04.12] We released 360Zhinao-7B v1.0, including the base model and three chat models with context lengths of 4K, 32K, and 360K. The technical report is on arXiv.

Download URL
Model Evaluation
Quickstart
Model Inference
Model Finetune
License

Download URL

Size	Model	BF16
7B	360Zhinao3-7B	🤗
7B	360Zhinao3-7B-Instruct	🤗
7B	360Zhinao3-7B-O1.5	🤗

Model Evaluation

Base Model

We used the open-source tool opencompass to evaluate the model across multiple dimensions. Its average benchmark score ranks first among models with fewer than 10B parameters, making it competitive at this size.

Type	Datasets	Language	glm4-9b	Qwen2.5-7B	internlm2.5-7b	Yi1.5-9B	gemma2-9b	Llama3.1-8B	360Zhinao2-7B	360Zhinao3-7B
Exam	ceval	zh	75.83	81.41	77.71	73.51	56.36	51.67	83.04	84.7
Exam	mmlu	en	75.5	75.5	71.55	71.43	72.22	66.75	67.84	75.42
Exam	cmmlu	zh	74.24	81.79	78.77	74.2	58.89	52.49	73.8	82.17
Exam	ARC-c	en	94.92	80	85.08	87.46	77.63	80.68	87.12	88.14
Exam	ARC-e	en	98.41	84.83	95.24	94.53	78.84	89.77	92.77	94
Language	WiC	en	51.57	52.82	50.78	50.63	50.47	50	49.84	50.31
Language	WSC	en	68.27	68.27	69.23	66.35	68.27	67.31	65.38	71.15
Knowledge	BoolQ	en	81.8	83.88	89.51	84.46	85.6	82.2	88.29	88.38
Knowledge	commonsense_qa	en	71.17	73.22	68.55	71.58	68.47	71.25	69.78	71.33
Understanding	C3	zh	91.51	92	93.04	85.86	81.64	83.51	93.26	92.77
Understanding	race-middle	en	91.99	91.02	92.06	91.16	88.09	81.69	90.46	90.04
Understanding	race-high	en	90.71	87.91	90.08	88.34	82.08	78.73	86.74	85.96
Understanding	lcsts	zh	18.29	15.82	15.96	16.49	10.62	17.29	18.61	18.85
Understanding	eprstmt-dev	zh	91.88	86.88	91.25	91.88	48.12	83.12	90	92.50
Understanding	lambada	en	71.67	71.14	69.98	70.64	75.43	74.23	72.56	68.17
Reasoning	hellaswag	en	70.25	72.76	70.38	71.55	66.83	74.65	71.49	73.61
Reasoning	siqa	en	81.73	72.52	78.97	76.2	58.96	64.18	77.12	79.02
Reasoning	bbh	en	73.68	54.63	59.43	67.86	68.45	59.9	46.54	73.74
Code	humaneval	en	69.51	75	60.37	26.22	5.49	27.44	60.98	64.63
Code	mbpp	en	60	60	43.6	56.8	51.2	42.6	54	67.80
Math	math	en	26.86	38	27.14	27.06	28.52	15.32	38.34	37.60
Math	gsm8k	en	78.54	79.76	52.54	71.11	73.09	56.25	75.51	78.77
Overall	avg_zh		70.35	71.58	71.35	68.39	51.13	57.62	71.74	74.20
Overall	avg_all		73.11	71.78	69.60	68.88	61.60	62.32	70.61	74.83
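
The avg_zh row appears to be the unweighted mean of the five Chinese-language (zh) datasets in the table: ceval, cmmlu, C3, lcsts, and eprstmt-dev. This is an inference from the numbers rather than a documented rule. The sketch below checks it for two columns, with scores copied from the table; the function name is hypothetical.

```python
from statistics import mean

# Scores on the five zh datasets, copied from the table above
# (order: ceval, cmmlu, C3, lcsts, eprstmt-dev).
zh_scores = {
    "Qwen2.5-7B": [81.41, 81.79, 92, 15.82, 86.88],
    "360Zhinao3-7B": [84.7, 82.17, 92.77, 18.85, 92.50],
}


def avg_zh(scores):
    """Unweighted mean rounded to two decimals (assumed aggregation)."""
    return round(mean(scores), 2)


for name, scores in zh_scores.items():
    print(name, avg_zh(scores))
```

Under this assumption the two columns reproduce the reported avg_zh values of 71.58 and 74.20.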

Instruct Model

We evaluated 360Zhinao3-7B-Instruct on three popular benchmarks: IFEval, MT-bench, and CFBench. On MT-bench and CFBench it ranks first among open-source models of comparable size. On IFEval (strict prompt), it is second only to GLM4-9B-Chat and has the highest score among 7B models.

Model	MT-bench	IFEval (strict prompt)	CFBench CSR	CFBench ISR	CFBench PSR
Qwen2.5-7B-Instruct	8.07	0.556	0.81	0.46	0.57
Yi-9B-16k-Chat	7.44	0.455	0.75	0.4	0.52
GLM4-9B-Chat	8.08	0.634	0.82	0.48	0.61
InternLM2.5-7B-Chat	7.39	0.540	0.78	0.4	0.54
360Zhinao2-7B-Chat-4k	7.86	0.577	0.8	0.44	0.57
360Zhinao3-7B-Instruct	8.17	0.626	0.83	0.52	0.64

Long COT Model

Building on Zhinao's previously open-sourced Light-R1 method, we continued fine-tuning 360Zhinao3-7B-Instruct on long-COT data, and also applied RFT and GRPO. There is still a gap to the latest OpenThinker2-7B on GPQA Diamond, but on AIME24 and AIME25 the model surpasses all previous models based on the general-purpose Qwen2.5-7B-Instruct.

Model	Date	Base Model	AIME24	AIME25	GPQA Diamond
OpenThinker2-7B	25.4.3	Qwen2.5-7B-Instruct	50	33.3	49.3
OpenThinker-7B	25.1.28	Qwen2.5-7B-Instruct	31.3	23.3	42.4
360Zhinao3-7B-O1.5	25.4.14	360Zhinao3-7B-Instruct	54.2	36.3	40.0
OpenR1-Qwen-7B	25.2.11	Qwen2.5-Math-7B-Instruct	48.7	34.7	21.2
DeepSeek-R1-Distill-Qwen-7B	25.1.20	Qwen2.5-Math-7B-Instruct	57.3	33.3	47.3
Light-R1-7B-DS	25.3.12	DeepSeek-R1-Distill-Qwen-7B	59.1	44.3	49.4
Areal-boba-RL-7B	25.3.31	DeepSeek-R1-Distill-Qwen-7B	61.9	48.3	47.6

Quickstart

The examples below show how to quickly use 360Zhinao3-7B, 360Zhinao3-7B-Instruct, and 360Zhinao3-7B-O1.5 with 🤗 Transformers.

🤗 Transformers

Demonstration of Base Model Inference

from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.generation import GenerationConfig

MODEL_NAME_OR_PATH = "qihoo360/360Zhinao3-7B"

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME_OR_PATH, 
    trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME_OR_PATH,
    trust_remote_code=True).cuda()

generation_config = GenerationConfig.from_pretrained(
    MODEL_NAME_OR_PATH,
    trust_remote_code=True)
generation_config.max_new_tokens = 1024

inputs = tokenizer('中国二十四节气\n1. 立春\n2. 雨水\n3. 惊蛰\n4. 春分\n5. 清明\n', return_tensors='pt')
inputs = inputs.to(model.device)

pred = model.generate(input_ids=inputs["input_ids"], generation_config=generation_config)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))

Demonstration of Instruct Model Inference

from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.generation import GenerationConfig

MODEL_NAME_OR_PATH = "qihoo360/360Zhinao3-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME_OR_PATH,
    trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME_OR_PATH,
    trust_remote_code=True).cuda()

generation_config = GenerationConfig.from_pretrained(
    MODEL_NAME_OR_PATH,
    trust_remote_code=True)
generation_config.max_new_tokens = 2048

messages = []

#round-1
print(f"user: 简单介绍一下刘德华")
messages.append({"role": "user", "content": "简单介绍一下刘德华"})
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
pred = model.generate(input_ids=input_ids, generation_config=generation_config)
response = tokenizer.decode(pred.cpu()[0][len(input_ids[0]):], skip_special_tokens=True)
messages.append({"role": "assistant", "content": response})
print(f"gpt: {response}")


#round-2
print(f"user: 他有什么代表作?")
messages.append({"role": "user", "content": "他有什么代表作?"})
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
pred = model.generate(input_ids=input_ids, generation_config=generation_config)
response = tokenizer.decode(pred.cpu()[0][len(input_ids[0]):], skip_special_tokens=True)
messages.append({"role": "assistant", "content": response})
print(f"gpt: {response}")

Demonstration of Long COT Model Inference

import re
import json
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.generation import GenerationConfig

MODEL_NAME_OR_PATH = "qihoo360/360Zhinao3-7B-O1.5"

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME_OR_PATH,
    trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME_OR_PATH,
    trust_remote_code=True).cuda()

generation_config = GenerationConfig.from_pretrained(
    MODEL_NAME_OR_PATH,
    trust_remote_code=True)
generation_config.max_new_tokens = 2048


def extract_thinking_and_answer(input_string):
    thinking, answer = "", ""
    # Extract the answer (everything after the last </think>)
    pattern_answer = r'.*</think>(.*)$'
    match_answer = re.search(pattern_answer, input_string, re.S)
    if match_answer:
        answer = match_answer.group(1)
    else:
        return thinking, input_string

    # Extract the thinking process (text between <think> and </think>)
    pattern_thinking = r'<think>(.*?)</think>'
    match_thinking = re.search(pattern_thinking, input_string, re.S)
    if match_thinking:
        thinking = match_thinking.group(1)

    return thinking, answer


messages = []
messages.append({"role": "user", "content": "现有一笼子，里面有鸡和兔子若干只，数一数，共有头14个，腿38条，求鸡和兔子各有多少只？"})
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
pred = model.generate(input_ids=input_ids, generation_config=generation_config)
response = tokenizer.decode(pred.cpu()[0][len(input_ids[0]):], skip_special_tokens=True)
thinking, answer = extract_thinking_and_answer(response)
messages.append({"role": "assistant", "content": answer, "reasoning_content": thinking})
print(json.dumps(messages, ensure_ascii=False, indent=4))
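
To see how extract_thinking_and_answer splits a response without loading the model, the sketch below reimplements the same two regular expressions in compact form and applies them to made-up strings. The function name split_response and the sample strings are hypothetical and are not real model output.

```python
import re


def split_response(text):
    """Return (thinking, answer) using the same patterns as above."""
    # Everything after the last </think> is the answer.
    match_answer = re.search(r'.*</think>(.*)$', text, re.S)
    if not match_answer:
        # No closing tag: treat the whole text as the answer.
        return "", text
    # The first <think>...</think> span is the reasoning.
    match_thinking = re.search(r'<think>(.*?)</think>', text, re.S)
    thinking = match_thinking.group(1) if match_thinking else ""
    return thinking, match_answer.group(1)


# Hypothetical response strings, for illustration only.
with_tags = "<think>14 heads, 38 legs.</think>9 chickens and 5 rabbits."
without_tags = "9 chickens and 5 rabbits."

print(split_response(with_tags))
print(split_response(without_tags))
```

When the closing tag is missing, the whole string is returned as the answer and the thinking part stays empty, which matches the early return in the original function.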

Model Inference

Deployment

vLLM Installation

We recommend using vllm==0.6.0.

If you are using CUDA 12.1 and PyTorch 2.1, you can install vLLM directly with:

pip install vllm==0.6.0

Otherwise, please refer to the official vLLM Installation Instructions.

After installation, perform the following steps:

Copy vllm/zhinao.py into the vllm/model_executor/models directory of your vLLM installation (inside your Python/conda environment).
Then add the following line to vllm/model_executor/models/__init__.py:
```
"ZhinaoForCausalLM": ("zhinao", "ZhinaoForCausalLM"),
```
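
Finding the installation directory by hand can be error-prone when several environments exist. A minimal sketch, assuming only the standard library: the hypothetical helper models_dir below resolves a package's location with importlib and appends the model_executor/models subpath named in the steps above. It returns None if the package is not installed in the current environment, and it does not copy or modify any files.

```python
import importlib.util
from pathlib import Path


def models_dir(package="vllm"):
    """Return <package dir>/model_executor/models, or None if not installed.

    Hypothetical helper for illustration only.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at the package's __init__.py.
    return Path(spec.origin).parent / "model_executor" / "models"


target = models_dir()
if target is None:
    print("vllm is not installed in this environment")
else:
    print("Copy zhinao.py into:", target)
    print("Then edit:", target / "__init__.py")
```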

vLLM Service Start

Start the service:

python -m vllm.entrypoints.openai.api_server \
    --model qihoo360/360Zhinao3-7B-O1.5 \
    --served-model-name 360Zhinao3-7B-O1.5 \
    --port 8360 \
    --host 0.0.0.0 \
    --dtype bfloat16 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.8 \
    --trust-remote-code

Use curl to request the service:

curl http://localhost:8360/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "360Zhinao3-7B-O1.5",
    "max_tokens": 200,
    "top_k": -1,
    "top_p": 0.8,
    "temperature": 1.0,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "你好"}
    ],
    "stop": [
        "<eod>",
        "<|im_end|>",
        "<|im_start|>"
    ]
}'

Use Python to request the service:

from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8360/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="360Zhinao3-7B-O1.5",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "你好"},
    ],
    stop=[
        "<eod>",
        "<|im_end|>",
        "<|im_start|>"
    ],
    presence_penalty=0.0,
    frequency_penalty=0.0
)
print("Chat response:", chat_response)

If you need to enable repetition penalty, we recommend setting presence_penalty and frequency_penalty instead of repetition_penalty.
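
The difference between the two kinds of penalty can be illustrated on a single logit. The sketch below follows the commonly documented definitions: presence and frequency penalties are subtracted from the logit of a token that has already been generated (the frequency term scaled by its count), while repetition penalty divides a positive logit, or multiplies a negative one, by the penalty factor. These are simplified, hypothetical functions rather than vLLM's actual implementation, and the numbers are made up.

```python
def additive_penalty(logit, count, presence_penalty, frequency_penalty):
    """Presence/frequency penalties: subtract from the logit of a seen token."""
    if count == 0:
        return logit
    return logit - presence_penalty - count * frequency_penalty


def repetition_penalty(logit, count, penalty):
    """Multiplicative penalty: shrink positive logits, lower negative ones."""
    if count == 0:
        return logit
    return logit / penalty if logit > 0 else logit * penalty


# A token that has already appeared 3 times (illustrative values).
print(additive_penalty(2.0, 3, presence_penalty=0.5, frequency_penalty=0.25))
print(repetition_penalty(2.0, 3, penalty=2.0))
print(repetition_penalty(-2.0, 3, penalty=2.0))
```

In this simplified form, the additive penalty grows with how often a token has been used, whereas the multiplicative one applies the same factor regardless of count.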

Model Finetune

Training data

Training Data: data/training_data_sample.json. This example data has 10,000 rows sampled from multiturn_chat_0.8M with converted format.

Data Format:

[
  {
    "id": 1,
    "conversations": [
        {
            "from": "system",
            "value": "You are a helpful assistant."
        },
        {
            "from": "user",
            "value": "您好啊"
        },
        {
            "from": "assistant",
            "value": "你好！我今天能为您做些什么？有什么问题或需要帮助吗? 我在这里为您提供服务。"
        }
    ]
  }
]
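
Each record holds a conversations list whose turns use from and value keys. A minimal sketch, assuming you want the role/content message format that the tokenizer.apply_chat_template calls in the Quickstart consume: the hypothetical helper to_messages below validates one record and converts it. The set of allowed roles is taken from the sample record and is an assumption about the full dataset; the assistant reply is shortened here.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}  # assumed from the sample


def to_messages(record):
    """Convert one training record to a list of role/content dicts."""
    messages = []
    for turn in record["conversations"]:
        role, text = turn["from"], turn["value"]
        if role not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {role!r}")
        messages.append({"role": role, "content": text})
    return messages


# One record in the format shown above (assistant reply shortened).
sample = json.loads("""
[{"id": 1, "conversations": [
    {"from": "system", "value": "You are a helpful assistant."},
    {"from": "user", "value": "您好啊"},
    {"from": "assistant", "value": "你好！"}
]}]
""")

for record in sample:
    print(json.dumps(to_messages(record), ensure_ascii=False))
```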

Finetuning scripts

set -x

HOSTFILE=hostfile
DS_CONFIG=./finetune/ds_config_zero2.json

# PARAMS
LR=5e-6
EPOCHS=3
MAX_LEN=32768
BATCH_SIZE=4
NUM_NODES=1
NUM_GPUS=8
MASTER_PORT=29500

IS_CONCAT=False # Whether to concatenate to maximum length (MAX_LEN)

DATA_PATH="./data/training_data_sample.json"
MODEL_PATH="qihoo360/360Zhinao3-7B-Instruct"
OUTPUT_DIR="./outputs/"

deepspeed --hostfile ${HOSTFILE} \
        --master_port ${MASTER_PORT} \
        --num_nodes ${NUM_NODES} \
        --num_gpus ${NUM_GPUS} \
        finetune.py \
        --report_to "tensorboard" \
        --data_path ${DATA_PATH} \
        --model_name_or_path ${MODEL_PATH} \
        --output_dir ${OUTPUT_DIR} \
        --model_max_length ${MAX_LEN} \
        --num_train_epochs ${EPOCHS} \
        --per_device_train_batch_size ${BATCH_SIZE} \
        --gradient_accumulation_steps 1 \
        --save_strategy steps \
        --save_steps 200 \
        --learning_rate ${LR} \
        --lr_scheduler_type cosine \
        --adam_beta1 0.9 \
        --adam_beta2 0.95 \
        --adam_epsilon 1e-8 \
        --max_grad_norm 1.0 \
        --weight_decay 0.1 \
        --warmup_ratio 0.01 \
        --gradient_checkpointing True \
        --bf16 True \
        --tf32 True \
        --deepspeed ${DS_CONFIG} \
        --is_concat ${IS_CONCAT} \
        --logging_steps 1 \
        --log_on_each_node False

bash finetune/ds_finetune.sh

Configuring HOSTFILE switches between single-machine and multi-machine training.
Configuring ds_config switches between ZeRO-1, ZeRO-2, and ZeRO-3.
fp16 and bf16 configure mixed-precision training; bf16 is recommended for consistency with the pretrained model.
is_concat configures whether the training data is concatenated.
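
When changing NUM_NODES, NUM_GPUS, or BATCH_SIZE, it helps to keep track of the resulting global batch size. Under standard data-parallel training, the number of sequences per optimizer step is the per-device batch size times the number of GPUs per node, times the number of nodes, times the gradient accumulation steps. The sketch below applies this to the values in the script above; the function name is hypothetical.

```python
def global_batch_size(per_device, num_gpus, num_nodes, grad_accum=1):
    """Sequences consumed per optimizer step under data parallelism."""
    return per_device * num_gpus * num_nodes * grad_accum


# Values from the finetuning script: BATCH_SIZE=4, NUM_GPUS=8,
# NUM_NODES=1, gradient_accumulation_steps=1.
size = global_batch_size(per_device=4, num_gpus=8, num_nodes=1, grad_accum=1)
print(size)  # 32
```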

License

The source code of this repository follows the open-source license Apache 2.0.

360Zhinao3 open-source models support free commercial use. You do not need to submit a request for commercial use.
