---
license: apache-2.0
datasets:
- Mavkif/roman-urdu-msmarco-dataset
language:
- ur
base_model:
- unicamp-dl/mt5-base-mmarco-v2
pipeline_tag: question-answering
tags:
- mt5
- information
- retrieval
- NLP
- urdu
- roman-urdu
---

# Roman Urdu mT5 msmarco: Fine-Tuned mT5 Model for Roman-Urdu Information Retrieval

As part of ongoing efforts to make Information Retrieval (IR) more inclusive, this model addresses the needs of low-resource languages, focusing specifically on Urdu.
We created the training data for this model by translating the MS MARCO dataset into Roman Urdu with the IndicTrans2 model.
To establish a baseline, we first evaluated zero-shot IR performance in Roman Urdu with the unicamp-dl/mt5-base-mmarco-v2 model. We then fine-tuned that model on the translated dataset following the mMARCO multilingual IR methodology, which yielded state-of-the-art results for Urdu IR.

## Model Details

### Model Description

- **Developed by:** Umer Butt
- **Model type:** IR model for reranking
- **Language(s) (NLP):** Roman Urdu (Urdu written in Latin script)

## Bias, Risks, and Limitations

Although this model performs well and is currently state-of-the-art, it is fine-tuned from an mMARCO model on a translated dataset (created with the IndicTrans2 model). The limitations of both the base model and the translation model therefore apply here too.

## Evaluation

The evaluation was done using the scripts in the pygaggle library, specifically these files:

- `evaluate_monot5_reranker.py`
- `ms_marco_eval.py`

### Model Architecture and Objective

```json
{
  "_name_or_path": "unicamp-dl/mt5-base-mmarco-v2",
  "architectures": ["MT5ForConditionalGeneration"],
  "d_model": 768,
  "num_heads": 12,
  "num_layers": 12,
  "dropout_rate": 0.1,
  "vocab_size": 250112,
  "model_type": "mt5",
  "transformers_version": "4.45.2"
}
```

For more details on how to customize the decoding parameters (such as `max_length`, `num_beams`, and `early_stopping`), refer to the Hugging Face documentation.
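
Because this is a reranking model evaluated with pygaggle's monoT5 script, the scoring step can be illustrated with a minimal sketch. It assumes the usual monoT5 convention: a `Query: ... Document: ... Relevant:` prompt, and a relevance score equal to the log-probability of the "yes" token after a softmax over the "yes" and "no" logits at the first decoding step. The prompt template, the function names, and the example logits are illustrative assumptions and are not confirmed by this model card. In practice the two logits would come from the fine-tuned `MT5ForConditionalGeneration` model.

```python
import math

# Assumed monoT5-style prompt template; not confirmed by this model card.
PROMPT_TEMPLATE = "Query: {query} Document: {document} Relevant:"


def build_prompt(query: str, document: str) -> str:
    """Format one query/passage pair as a monoT5-style input string."""
    return PROMPT_TEMPLATE.format(query=query, document=document)


def relevance_score(yes_logit: float, no_logit: float) -> float:
    """Log-probability of 'yes' under a softmax over the two logits.

    Uses the log-sum-exp trick for numerical stability.
    """
    m = max(yes_logit, no_logit)
    log_norm = m + math.log(math.exp(yes_logit - m) + math.exp(no_logit - m))
    return yes_logit - log_norm


def rerank(passages, logits):
    """Sort passages by descending relevance score.

    `logits` holds one (yes_logit, no_logit) pair per passage.
    """
    scored = [(p, relevance_score(y, n)) for p, (y, n) in zip(passages, logits)]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical Roman Urdu query and passages, for illustration only.
    query = "Pakistan ka darul hukumat kya hai"
    passages = [
        "Lahore Punjab ka sab se bara shehar hai.",
        "Islamabad Pakistan ka darul hukumat hai.",
    ]
    # Placeholder logits; a real run would read them from the model output.
    logits = [(0.2, 1.5), (2.8, -0.4)]

    print(build_prompt(query, passages[0]))
    for passage, score in rerank(passages, logits):
        print(f"{score:.4f}  {passage}")
```

Scoring from a single decoding step means the decoding parameters mentioned above do not affect the ranking in this setup; they matter only if the model is used for free-form generation.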