Toward Effective Tool-Integrated Reasoning via Self-Evolved Preference Learning
Abstract
The Tool-Light framework improves large language models' tool-integrated reasoning efficiency and accuracy by leveraging information-entropy signals and a two-stage fine-tuning process.
Tool-Integrated Reasoning (TIR) enables large language models (LLMs) to improve their internal reasoning ability by integrating external tools. However, models employing TIR often display suboptimal behaviors, such as insufficient or excessive tool usage and overthinking after tool calls. The challenge of incentivizing LLMs to perform TIR efficiently and accurately, while stabilizing the reasoning process, remains open. In this paper, we start by exploring the impact of tool calls on model reasoning from the perspective of information entropy. Our findings indicate that tool call results lead to a distinct change in the information entropy of subsequent reasoning, with the overall entropy of the reasoning chain varying with the number of tool calls. Building on these insights, we propose Tool-Light, a framework designed to encourage LLMs to perform TIR efficiently and accurately. Our framework comprises dataset construction and multi-stage fine-tuning. For dataset construction, we employ continuous self-evolved sampling with the fine-tuned model, integrating both vanilla sampling and entropy-guided sampling. In addition, we establish strict criteria for selecting positive-negative pairs during sampling. The training process follows a two-stage approach: Supervised Fine-Tuning (SFT) and Self-Evolved Direct Preference Optimization (DPO). Experimental results on 10 datasets demonstrate the effectiveness of Tool-Light, significantly improving the model's efficiency on TIR tasks.
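The entropy signal the abstract refers to can be made concrete. The sketch below (an illustration, not the paper's implementation) computes the Shannon entropy of a next-token distribution from raw logits and averages it over a reasoning chain's decoding steps, which is one simple way to quantify how "uncertain" the reasoning becomes after a tool call:

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution
    obtained by softmaxing raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def chain_entropy(per_step_logits):
    """Mean per-token entropy over a reasoning chain's decoding steps."""
    entropies = [token_entropy(step) for step in per_step_logits]
    return sum(entropies) / len(entropies)
```

A uniform distribution over V tokens yields the maximum entropy log(V); a sharply peaked distribution yields a value near zero, so comparing `chain_entropy` before and after a tool call exposes the kind of entropy shift the paper analyzes.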
Community
Overview
Tool-Light is a framework focused on enabling models to complete TIR tasks efficiently. It introduces an Entropy-Guided Sampling Strategy to construct the training set and trains the model through a Self-Evolved DPO Pipeline. This design lets the model gradually acquire the ability to call tools efficiently and accurately. Results on two types of reasoning tasks demonstrate superior performance compared to traditional methods.
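One plausible reading of entropy-guided sampling is to branch extra rollouts at the decoding positions where next-token entropy is highest, since those are the steps where alternative reasoning paths diverge. The selection rule below is a hypothetical sketch of that idea; the threshold and branch budget are illustrative parameters, not values from the paper:

```python
def entropy_guided_branch_points(step_entropies, threshold=1.0, max_branches=3):
    """Pick decoding steps whose next-token entropy exceeds a threshold
    as branch points for additional sampled rollouts.

    step_entropies: per-step entropy values for one reasoning chain.
    Returns branch-point indices in chain order (hypothetical rule)."""
    candidates = [i for i, h in enumerate(step_entropies) if h > threshold]
    # prefer the highest-entropy positions when over budget
    candidates.sort(key=lambda i: step_entropies[i], reverse=True)
    return sorted(candidates[:max_branches])
```

Rollouts branched from these points, together with vanilla samples, would then feed the positive-negative pair selection described in the abstract.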
Contribution
- We are the first to explore and analyze the TIR paradigm from the perspective of information entropy, demonstrating the connection between TIR effectiveness and entropy change.
- We propose an entropy-guided sampling strategy, combined with a two-stage training method incorporating a self-evolution mechanism, thereby enhancing the effectiveness of the TIR process.
- Experimental results across 10 challenging reasoning datasets confirm the effectiveness of Tool-Light. Further quantitative analyses offer practical guidance for efficient tool-integrated reasoning.
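The second stage of training uses Direct Preference Optimization over the selected positive-negative pairs. As a reference point, the standard DPO objective (from Rafailov et al.'s DPO formulation, not code from this paper) for a single pair can be written as:

```python
import math

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities of the chosen (w) and rejected (l)
    responses under the trainable policy and the frozen reference model.
    Loss is -log sigmoid(beta * (policy/reference log-ratio margin))."""
    margin = beta * ((logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference, the margin is zero and the loss is log 2; the loss shrinks as the policy raises the chosen (efficient, accurate) TIR trace relative to the rejected one.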
Our code and model checkpoints of Tool-Light are open-sourced:
Github: https://github.com/asilverlight/Tool-Light
Model: https://huggingface.co/zhangboguodong/Tool-Light-Qwen2.5-7B-it
The following similar papers were recommended by the Semantic Scholar API (automated message from Librarian Bot):
- THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning (2025)
- Don't Overthink It: A Survey of Efficient R1-style Large Reasoning Models (2025)
- Tool-integrated Reinforcement Learning for Repo Deep Search (2025)
- Your Models Have Thought Enough: Training Large Reasoning Models to Stop Overthinking (2025)
- Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization (2025)
- DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL (2025)
- Dissecting Tool-Integrated Reasoning: An Empirical Study and Analysis (2025)