TON-AITZ

TON is a series of vision-language models based on Qwen2.5-VL, trained with our efficient algorithm that automatically decides whether to think or not. We apply Group Relative Policy Optimization (GRPO) for reinforcement learning, with "thought dropout" supervised fine-tuning as a preliminary step.

Introduction

Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO), a prominent recent method, encourages models to generate complete reasoning traces before answering, which increases token usage and computational cost. Inspired by the human thinking process, in which people skip reasoning for easy questions but think carefully when needed, we explore how to let VLMs first decide when reasoning is necessary. To this end, we propose TON, a two-stage training strategy:

  1. A supervised fine-tuning (SFT) stage with a simple yet effective "thought dropout" operation, where reasoning traces are randomly replaced with empty thoughts. This introduces a think-or-not format that serves as a cold start for selective reasoning (see the sketch after this list).
  2. A GRPO stage that lets the model freely explore when to think and when not to, while maximizing task-aware outcome rewards.
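
To make the thought-dropout idea concrete, here is a minimal sketch of how an SFT target could be built. This is an illustration only: the dropout probability `p`, the `build_sft_target` helper, and the example strings are our own assumptions for exposition, not the released training code; only the <think>/<action> output format comes from the prompt used in the quickstart below.

import random

# Hypothetical helper: with probability p, replace the reasoning trace with an
# empty thought so the model learns a "think-or-not" output format.
def build_sft_target(reasoning: str, action: str, p: float = 0.5) -> str:
    if random.random() < p:
        thought = ""          # thought dropout: keep the tags, drop the content
    else:
        thought = reasoning   # keep the full reasoning trace
    return f"<think>{thought}</think><action>{action}</action>"

# Example usage on one (reasoning, action) pair from an SFT corpus.
target = build_sft_target(
    reasoning="The search bar is at the top, so I should click it first.",
    action="{'action': 'CLICK', 'value': None, 'position': [0.5, 0.07]}",
)
print(target)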

Experimental results show that TON can reduce completion length by up to 90% compared to vanilla GRPO without sacrificing performance, and in some cases even improving it. Further evaluations across diverse vision-language tasks, covering a range of reasoning difficulties with both 3B and 7B models, consistently show that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in reinforcement learning approaches.

Quickstart

from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration


example={
  'image': your_image_path,
  'problem': 'You are an assistant trained to navigate the mobile phone. \nGiven a task instruction, a screen observation, and an action history sequence, \noutput the next action and wait for the next observation. \nHere is the action space:\n1. `CLICK`: Click on an element, value is not applicable and the position [x,y] is required. \n2. `TYPE`: Type a string into an element, value is a string to type and the position is not applicable.\n3. `SCROLL UP`: Scroll up for the screen.\n4. `SCROLL DOWN`: Scroll down for the screen.\n5. `SCROLL LEFT`: Scroll left for the screen.\n6. `SCROLL RIGHT`: Scroll right for the screen.\n7. `PRESS BACK`: Press for returning to the previous step, value and position are not applicable.\n8. `PRESS HOME`: Press for returning to the home screen, value and position are not applicable.\n9. `PRESS ENTER`: Press for submitting the input content, value and position are not applicable.\n10. `STATUS TASK COMPLETE`: Indicate the task is completed, value and position are not applicable.\n\nFormat the action as a dictionary with the following keys:\n{'action': 'ACTION_TYPE', 'value': 'element', 'position': [x,y]}\n\nIf value or position is not applicable, set it as `None`.\nPosition represents the relative coordinates on the screenshot and should be scaled to a range of 0-1.\n\n**Please first thinks about the reasoning process in the mind and then provides the user with the action. The reasoning process and answer are enclosed within <think> </think> and <action> </action> tags, respectively, i.e., <think> reasoning process here </think><action> action here </action>**\nTask: Search for hotels in London\n'
}

def make_conversation_image(example):
    # Wrap the screenshot path and the task prompt into a single-turn conversation.
    return {
        'image': example['image'],  # store the path instead of the loaded image
        'prompt': [{
            'role': 'user',
            'content': [
                {'type': 'image', 'text': None},
                {'type': 'text', 'text': example['problem']}
            ]
        }]
    }

model_name = "kolerk/TON-3B-AITZ"

# TON-3B-AITZ is fine-tuned from Qwen2.5-VL, so load it with the Qwen2.5-VL
# model class and its processor (which handles both text and images).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)


conversation = make_conversation_image(example)

# Render the chat template, load the screenshot, and pack both into model inputs.
text = processor.apply_chat_template(
    conversation['prompt'],
    tokenize=False,
    add_generation_prompt=True
)
image = Image.open(conversation['image'])
model_inputs = processor(
    text=[text],
    images=[image],
    padding=True,
    return_tensors="pt"
).to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=4096,
    do_sample=True,   # sample instead of greedy decoding; top_k=1 would make top_p/temperature no-ops
    top_p=0.95,
    temperature=0.6
)
# Keep only the newly generated tokens before decoding.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
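
Continuing from the quickstart above, the model replies in the <think> ... </think><action> ... </action> format described in the prompt, with an empty think block when it decides not to reason. The snippet below is a small, illustrative way to separate the two parts; the `parse_response` helper and its regular expressions are our own additions, not part of the released code.

import re

def parse_response(response: str):
    """Split a TON reply into its (possibly empty) thought and its action string."""
    think_match = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    action_match = re.search(r"<action>(.*?)</action>", response, re.DOTALL)
    thought = think_match.group(1).strip() if think_match else ""
    action = action_match.group(1).strip() if action_match else response.strip()
    return thought, action

thought, action = parse_response(response)
print("thought:", thought if thought else "(skipped)")
print("action:", action)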

Evaluation

Refer to our code repository for more details.

In brief, first use LLaMA-Factory to run the model on the test data, and then use our scoring tool to compute the results.

Citation

If you find our work helpful, feel free to cite us.

@misc{wang2025think,
    title={Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models},
    author={Jiaqi Wang and Kevin Qinghong Lin and James Cheng and Mike Zheng Shou},
    year={2025},
    eprint={2505.16854},
    archivePrefix={arXiv},
    primaryClass={cs.AI}
}