Instruction-Following Evaluation in Function Calling for Large Language Models
Abstract
IFEval-FC is a benchmark that evaluates function calling by assessing adherence to format instructions within JSON schema descriptions, revealing that even advanced models often fail to follow basic formatting rules.
Function calling is a core capability of large language models and is essential for AI agents. Existing benchmarks such as the Berkeley Function Calling Leaderboard (BFCL), τ²-Bench (arXiv:2506.07982), and ACEBench (arXiv:2501.12851) evaluate argument correctness but do not test adherence to format instructions embedded in parameter descriptions, such as enclosing values in double quotes or using ISO date formats. We introduce IFEval-FC, a benchmark inspired by IFEval (arXiv:2311.07911) that assesses precise instruction following in function calling. IFEval-FC encodes verifiable formats directly within JSON schema descriptions, for example specifying that a value must not contain punctuation. It includes 750 test cases, each consisting of a function with an embedded format instruction for one of its input parameters and a corresponding user query. Evaluation is fully algorithmic, ensuring objectivity, reproducibility, and scalability. Our results show that even state-of-the-art proprietary models, including GPT-5 and Claude 4.1 Opus, frequently fail to follow basic formatting rules, highlighting a practical limitation for real-world agent systems. The complete codebase and data are publicly available at https://github.com/Skripkon/IFEval-FC.
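The idea of a schema-embedded, algorithmically verifiable format instruction can be pictured with a minimal sketch. The function name, parameter layout, and checker below are illustrative assumptions following standard JSON-schema function-calling conventions, not the benchmark's actual test cases or evaluation code.

```python
import string

# Hypothetical IFEval-FC-style test case: the format instruction lives inside
# the parameter description ("must not contain any punctuation characters").
test_case = {
    "function": {
        "name": "create_note",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {
                    "type": "string",
                    "description": (
                        "Title of the note. The value must not contain "
                        "any punctuation characters."
                    ),
                }
            },
            "required": ["title"],
        },
    },
    "user_query": "Make a note titled: Weekly sync with the infra team!",
    "checked_parameter": "title",
}


def follows_format(value: str) -> bool:
    """Algorithmic check for the 'no punctuation' instruction."""
    return not any(ch in string.punctuation for ch in value)


# A model's function call is scored by applying the check to the argument it produced.
model_call = {
    "name": "create_note",
    "arguments": {"title": "Weekly sync with the infra team"},
}
print(follows_format(model_call["arguments"][test_case["checked_parameter"]]))  # True
```

Because the check is a deterministic predicate over the produced argument string, no LLM judge is needed, which is what makes the evaluation objective and reproducible.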
Community
Code + data: https://github.com/Skripkon/IFEval-FC
Data on HF: https://huggingface.co/datasets/NikolaiSkripko/IFEval-FC
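A quick way to inspect the data is via the standard `datasets` library; the repository id comes from the link above, while the available splits and column names are assumptions to verify against the dataset card.

```python
from datasets import load_dataset

# Repository id taken from the Hugging Face link above.
ds = load_dataset("NikolaiSkripko/IFEval-FC")

print(ds)                      # list the available splits
first_split = next(iter(ds))   # pick whichever split exists
print(ds[first_split][0])      # peek at the first example's fields
```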
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Improving Large Language Models Function Calling and Interpretability via Guided-Structured Templates (2025)
- When Instructions Multiply: Measuring and Estimating LLM Capabilities of Multiple Instructions Following (2025)
- Ticket-Bench: A Kickoff for Multilingual and Regionalized Agent Evaluation (2025)
- RefactorCoderQA: Benchmarking LLMs for Multi-Domain Coding Question Solutions in Cloud and Edge Deployment (2025)
- LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries (2025)
- Complex Logical Instruction Generation (2025)
- STEPWISE-CODEX-Bench: Evaluating Complex Multi-Function Comprehension and Fine-Grained Execution Reasoning (2025)