arXiv:2508.02998

MRG-Bench: Evaluating and Exploring the Requirements of Context for Repository-Level Code Generation

Published on Aug 5, 2025

AI-generated summary

MRG-Bench, a multi-language repository-level code generation benchmark, reveals significant performance deficiencies in current code generation techniques and highlights models' difficulty in understanding user requirements across different programming languages.

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities in code generation. However, current evaluation datasets suffer from issues such as a lack of runnable test cases, deviation from the distribution of real-world code, and coverage limited to the Python language. These limitations undermine the credibility of evaluation results. To address them, we introduce MRG-Bench (Multi-language Repository-level Code Generation Benchmark), a novel dataset that provides a more accurate evaluation of LLMs on practical repository-level code generation tasks. MRG-Bench has three main features: (1) practical data sourced from real-world code repositories that aligns with the distribution of code found in practice, (2) support for multiple programming languages, including Python, Java, and Go, and (3) project-level runnable test cases to assess the quality of the generated code. Based on MRG-Bench, we conducted extensive experiments covering large language models, long-context models, and RAG-based methods. The results demonstrate that current repository-level code generation techniques suffer from significant performance deficiencies. To investigate why models fail, we designed novel experiments to annotate the underlying causes of generation errors. The results show that the majority of methods suffer from "difficulty in understanding user requirements": they fail to comprehend their assigned tasks accurately. Moreover, the impact of different repository-level contexts on this issue varies significantly across programming languages, suggesting that, in practice, contextual information needs to be designed specifically for each language.
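To make the evaluation protocol concrete, here is a minimal sketch of what evaluating against project-level runnable test cases could look like. This is not MRG-Bench's actual harness: the task fields (repo_path, target_file, placeholder, test_command) and the pass@1-style metric are illustrative assumptions, chosen only to show how a generated completion can be spliced into a real repository and judged by that repository's own test suite.

```python
import subprocess
from pathlib import Path

def evaluate_task(task: dict, generate_fn) -> bool:
    """Run one repository-level task: generate code, patch it into the
    repository, and execute the project's own test suite.

    All task fields here are hypothetical, for illustration only.
    """
    repo = Path(task["repo_path"])
    # Prompt = task description plus repository-level context
    # (e.g. imports and signatures from neighboring files).
    completion = generate_fn(task["prompt"], task["context"])

    target = repo / task["target_file"]
    original = target.read_text()
    try:
        # Splice the generated body into the placeholder region.
        target.write_text(original.replace(task["placeholder"], completion))
        # Per-language test command, e.g. pytest / mvn test / go test.
        result = subprocess.run(
            task["test_command"], shell=True, cwd=repo,
            capture_output=True, timeout=300,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        target.write_text(original)  # restore the repository

def pass_rate(tasks: list[dict], generate_fn) -> float:
    """Fraction of tasks whose project test suite passes (one sample each)."""
    passed = sum(evaluate_task(t, generate_fn) for t in tasks)
    return passed / len(tasks)
```

Restoring the original file after each run keeps tasks independent, and driving the judgment through the repository's own test command is what distinguishes project-level evaluation from standalone-snippet benchmarks.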
