TITLE = '<h1 align="center" id="space-title">Open Dutch LLM Evaluation Leaderboard</h1>'

INTRO_TEXT = """
## About

This is a leaderboard that tracks the performance of large language models on Dutch benchmarks.

This is a fork of the [Open Multilingual LLM Evaluation Leaderboard](https://huggingface.co/spaces/uonlp/open_multilingual_llm_leaderboard), restricted to Dutch models and augmented with additional model results.

We evaluate the models on the following benchmarks (**Dutch versions only!**), which were automatically translated into Dutch with `gpt-35-turbo` by the original authors of the Open Multilingual LLM Evaluation Leaderboard:

- <a href="https://arxiv.org/abs/1803.05457" target="_blank">AI2 Reasoning Challenge</a> (25-shot)
- <a href="https://arxiv.org/abs/1905.07830" target="_blank">HellaSwag</a> (10-shot)
- <a href="https://arxiv.org/abs/2009.03300" target="_blank">MMLU</a> (5-shot)
- <a href="https://arxiv.org/abs/2109.07958" target="_blank">TruthfulQA</a> (0-shot)

I do not maintain these datasets; I only run the benchmarks and add the results to this space. For questions about the test sets, or to run the evaluations yourself (sketched below), see [the original GitHub repository](https://github.com/laiviet/lm-evaluation-harness).
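For illustration, here is a minimal sketch of what running these evaluations yourself might look like. It assumes the fork keeps the upstream EleutherAI `simple_evaluate` interface; the Dutch task names and the model identifier below are placeholders of mine, so check the repository for the actual names:

```python
# Minimal sketch, NOT the exact pipeline used for this leaderboard.
# Assumes the fork keeps the upstream `lm_eval.evaluator.simple_evaluate`
# interface; task names and the model id are placeholder assumptions.
from lm_eval import evaluator

# Benchmark -> number of few-shot examples, matching this leaderboard.
SHOTS = {
    "arc_nl": 25,        # AI2 Reasoning Challenge
    "hellaswag_nl": 10,  # HellaSwag
    "mmlu_nl": 5,        # MMLU
    "truthfulqa_nl": 0,  # TruthfulQA
}

for task, num_fewshot in SHOTS.items():
    # Evaluate a Hugging Face causal LM one task at a time, so that
    # each task can use its own shot count.
    results = evaluator.simple_evaluate(
        model="hf-causal",
        model_args="pretrained=your-org/your-dutch-model",
        tasks=[task],
        num_fewshot=num_fewshot,
    )
    print(task, results["results"])
```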

Disclaimer: I am aware that benchmarking models on *translated* data is not ideal. However, for Dutch there are currently no alternatives for generative models. If you have suggestions for other Dutch benchmarks, please let me know so I can add them!
"""

CREDIT = """
## Credit

This leaderboard has borrowed heavily from the following sources:

- Datasets (AI2_ARC, HellaSwag, MMLU, TruthfulQA)
- Evaluation code (EleutherAI's `lm-evaluation-harness` repo)
- Leaderboard code (HuggingFaceH4's `open_llm_leaderboard` repo)
- The multilingual version of the leaderboard (uonlp's `open_multilingual_llm_leaderboard` repo)
"""

CITATION = """
## Citation

If you use or cite the Dutch benchmark results or this specific leaderboard page, please cite the following paper:

TBD

If you use the multilingual benchmarks, please cite the following paper:

```bibtex
@misc{lai2023openllmbenchmark,
  author={Viet Lai and Nghia Trung Ngo and Amir Pouran Ben Veyseh and Franck Dernoncourt and Thien Huu Nguyen},
  title={Open Multilingual LLM Evaluation Leaderboard},
  year={2023}
}
```
"""