Spaces:

osunlp
/

Online_Mind2Web_Leaderboard

Running

File size: 8,906 Bytes

4dcba74
2dba94f
 
579db59
5d1208d
2dba94f
 
 
 
 
 
49268fc
 
9da9431
ad604a4
9da9431
49268fc
1685572
49268fc
 
 
6820b1a
 
49268fc
 
 
 
 
6820b1a
 
49268fc
2dba94f
 
 
 
 
 
 
 
 
dbd7a03
 
 
49268fc
 
dbd7a03
3489875
2dba94f
579db59
 
 
 
 
 
 
 
5d1208d
808487a
3621285
 
 
 
 
 
 
 
 
 
5d1208d
2dba94f
 
49268fc
 
1685572
2dba94f
49268fc
 
 
 
1f3e8c7
1685572
 
2dba94f
 
579db59
1685572
579db59
2dba94f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1685572
 
 
 
 
 
 
 
 
 
 
 
 
 
 
2dba94f
302f5c8
2dba94f
4dcba74
2dba94f

TITLE = """<h1 align="center" id="space-title">🏆 Online-Mind2Web Leaderboard</h1>"""
LINKS = """
<div align="center">
    <a href="https://arxiv.org/abs/2504.01382">Paper</a> |
    <a href="https://tiancixue.notion.site/An-Illusion-of-Progress-Assessing-the-Current-State-of-Web-Agents-1ac6cd2b9aac80719cd6f68374aaf4b4?pvs=4">Blog</a> |
    <a href="https://github.com/OSU-NLP-Group/Online-Mind2Web">Code</a> |
    <a href="https://huggingface.co/datasets/osunlp/Online-Mind2Web">Data</a>
</div>
"""

INTRODUCTION_TEXT = """
Online-Mind2Web is a benchmark designed to evaluate the real-world performance of web agents on live websites, featuring 300 tasks across 136 popular sites in diverse domains with reliable LLM-as-a-Judge (WebJudge) automactic evaluation. 
Based on the number of steps required by human annotators, tasks are divided into three difficulty levels: Easy (1-5 steps), Medium (6-10 steps), and Hard (11+ steps). 
"""

LEADERBOARD_TEXT = """
## Leaderboard
Our goal is to conduct a rigorous assessment of the current state of web agents. We maintain two leaderboards—one for automatic evaluation and another for human evaluation. 

When using our benchmark or submitting results, please first carefully review the important notes to ensure proper usage and obtain reliable evaluation results and follow the "Submission Guideline".

**We usually need about one week to review the results. If your results require urgent verification, please let us know in advance. Thank you for your understanding.**

### ⚠ Important Notes for Reliable Evaluation:
- **Start from the specified websites, not Google Search**: To enable fair comparisons, please ensure that each task starts from the specified website in our benchmark. Starting from Google Search or alternative websites can lead agents to use different websites to solve the task, resulting in varying difficulty levels and potentially skewed evaluation results.
- **Include only factual actions, not agent outputs**: The action history should contain only the factual actions taken by the agent to complete the task (e.g., Clicking elements and Typing text). Do not include the final response or any other agent's outputs, as they may contain hallucinated content and result in a high rate of false positives.
- **Use o4-mini for WebJudge**: WebJudge powered by o4-mini demonstrates a higher alignment with human judgment, achieving an average agreement rate of 85.7% and maintaining a narrow success rate gap of just 3.8%. Therefore, please use o4-mini as the backbone for automatic evaluation.

To obtain more reliable automatic evaluation results, the action representation should be as detailed as possible, including only factual actions and excluding any agent outputs. Here is an example [script](https://github.com/OSU-NLP-Group/Online-Mind2Web/blob/main/src/clean_html.py) to process the element's HTML as the action representation. It can preserve valuable information while filtering out irrelevant attributes.

**Please do not use it as training data for your agent.**
"""

SUBMISSION_TEXT = """
## Submissions
Participants are invited to submit your agent's trajectory to test. The submissions will be evaluated based on our auto-eval.
### Format of submission
Submissions must include a sequence of images (i.e., screenshots in the trajectory) and a result.json file for each task. The JSON file should contain the fields: "Task", "Task_id", and "action_history". You can refer to an example of the submission files.
"""

EVALUATION_DETAILS = """
In certain scenarios, testing on the full Online-Mind2Web dataset may not be feasible due to cost, privacy, or legal constraints. To facilitate fair and apple-to-apple comparisons, we release both our human evaluation labels and auto-eval details.

- **Human Evaluation**: Task-level human evaluation labels are provided in the [file](https://github.com/OSU-NLP-Group/Online-Mind2Web/blob/main/data/evaluation_results/online_mind2web_evaluation_results/human_label.json).
- **Auto-Evaluation**: The results of WebJudge are available in the [folder](https://github.com/OSU-NLP-Group/Online-Mind2Web/tree/main/data/evaluation_results/online_mind2web_evaluation_results/webjudge_o4-mini)."""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results. Note: Online-Mind2Web is derived from the original Mind2Web dataset. We kindly ask that you cite both the original and this work when using or referencing the data."
CITATION_BUTTON_TEXT = r"""
@article{xue2025illusionprogressassessingcurrent,
      title={An Illusion of Progress? Assessing the Current State of Web Agents}, 
      author={Tianci Xue and Weijian Qi and Tianneng Shi and Chan Hee Song and Boyu Gou and Dawn Song and Huan Sun and Yu Su},
      year={2025},
      eprint={2504.01382},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2504.01382}, 
}
@inproceedings{deng2023mind2web,
 author = {Deng, Xiang and Gu, Yu and Zheng, Boyuan and Chen, Shijie and Stevens, Sam and Wang, Boshi and Sun, Huan and Su, Yu},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {A. Oh and T. Naumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
 pages = {28091--28114},
 publisher = {Curran Associates, Inc.},
 title = {Mind2Web: Towards a Generalist Agent for the Web},
 url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/5950bf290a1570ea401bf98882128160-Paper-Datasets_and_Benchmarks.pdf},
 volume = {36},
 year = {2023}
}
"""

SUBMIT_INTRODUCTION = """
You should use the script provided in our GitHub [repository](https://github.com/OSU-NLP-Group/Online-Mind2Web) to obtain automatic evaluation results on your own and submit them along with all trajectories to enhance transparency. 
To ensure the authenticity and reliability of the reported results, we will also verify the auto-eval results.
If you have conducted your own human evaluation, please also attach your human-eval results. We will spot-check these before adding them to the human-eval table.

## ⚠ Important Notes for Reliable Evaluation:
- **Start from the specified websites, not Google Search**:To enable fair comparisons, please ensure that each task starts from the specified website in our benchmark. Starting from Google Search or alternative websites can lead agents to use different websites to solve the task, resulting in varying difficulty levels and potentially skewed evaluation results.
- **Include only factual actions, not agent outputs**: The action history should contain only the factual actions taken by the agent to complete the task (e.g., Clicking elements and Typing text). Do not include the final response or any other agent's outputs, as they may contain hallucinated content and result in a high rate of false positives.
- **Use o4-mini for WebJudge**: WebJudge powered by o4-mini demonstrates a higher alignment with human judgment, achieving an average agreement rate of 85.7% and maintaining a narrow success rate gap of just 3.8%. Therefore, please use o4-mini as the backbone for automatic evaluation.

## ⚠ Please submit the trajectory file with the following format:
The result of each task is stored in a folder named as its `task_id`, containing:
- `trajectory/`: Stores screenshots of each step.
- `result.json`: Task metadata and action history.

Here is an [example](https://github.com/OSU-NLP-Group/Online-Mind2Web/tree/main/data/example/fb7b4f784cfde003e2548fdf4e8d6b4f) of the format.

**Structure:**
```
main_directory/
└── task_id/
    ├── result.json
    └── trajectory/
        ├── 0_screenshot.png
        ├── 1_screenshot.png
        └── ...
```
**`result.json` format:**
```json
{
    "task_id": 123,
    "task": "abc",
    "action_history": ["abc", "xyz", "..."]
}
```
**`human_result.json` format:**
```json
[
    {
        "task_id": 123,
        "task": "abc",
        "human_label": 0 or 1 (failure or success)
    },
    {
        "task_id": 456,
        "task": "def",
        "human_label": 0 or 1 (failure or success)
    },
]
```

Please email your agent's name, model family, and organization to xue.681@osu.edu, and include the trajectory directory and auto-eval results file as attachments (optional: human evaluation results).
"""
DATA_DATASET = """## More Statistics for Online-Mind2Web Benchmark
"""


def format_error(msg):
    return f"<p style='color: red; font-size: 20px; text-align: center;'>{msg}</p>"

def format_warning(msg):
    return f"<p style='color: orange; font-size: 20px; text-align: center;'>{msg}</p>"

def format_log(msg):
    return f"<p style='color: green; font-size: 20px; text-align: center;'>{msg}</p>"

def model_hyperlink(link, model_name):
    return f'<a target="_blank" href="{link}" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">{model_name}</a>'