---
language:
- it
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
base_model:
- meta-llama/Llama-3.1-8B
---

# Llama-3.1-8B-Italian-FVT

<div align="center">

<img src="https://github.com/Andrew-Wyn/images/blob/master/sava/italian_adapt-img.jpg?raw=true" width="400" height="400" style="border-radius:10%" />

</div>

The **Llama-3.1-8B-Adapted** collection is a set of 8B-parameter generative large language models (LLMs, text in/text out) adapted from **Llama-3.1-8B**.

*Llama-3.1-8B-Italian-FVT* is a Llama model that was continually trained after its tokenizer was substituted.

After adaptation, the tokenizer of this model is the same as that of [Minerva-3B](https://huggingface.co/sapienzanlp/Minerva-3B-base-v1.0).

**Model developers:** SapienzaNLP, ISTI-CNR, ILC-CNR

**Model Architecture:** Llama-3.1-8B-Adapted is an auto-regressive language model that uses an optimized transformer architecture.

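The tokenizer substitution described above is motivated by token fertility, the quantity named in the title of the cited paper. The sketch below assumes the common definition of fertility (average number of tokens per whitespace-separated word), which may differ from the exact measurement used by the authors. The helper names `token_fertility` and `toy_tokenize` are hypothetical, and the toy tokenizer is only a stand-in so that the example runs without downloading anything.

```python
from typing import Callable, Iterable, List


def token_fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Return the average number of tokens per whitespace-separated word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_words += len(text.split())
        n_tokens += len(tokenize(text))
    if n_words == 0:
        raise ValueError("no words found in the input texts")
    return n_tokens / n_words


def toy_tokenize(text: str, chunk_size: int = 4) -> List[str]:
    """Hypothetical stand-in tokenizer: cut each word into chunks of at most 4 characters."""
    tokens = []
    for word in text.split():
        tokens.extend(word[i:i + chunk_size] for i in range(0, len(word), chunk_size))
    return tokens


fertility = token_fertility(["una bella giornata di sole"], toy_tokenize)
print(f"{fertility:.2f} tokens per word")
```

To compare real tokenizers, the `tokenize` method of a tokenizer loaded with `AutoTokenizer.from_pretrained` can be passed in place of `toy_tokenize`; a lower value on Italian text means fewer tokens are needed for the same input.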
## Data used for the adaptation

The **Llama-3.1-8B-Adapted** model was trained on a collection of Italian and English data extracted from [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX).

The data was skewed toward Italian, with English making up one quarter of the mix: the first 9B tokens were taken from the Italian part of CulturaX and the first 3B tokens from the English part.

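As a quick check of the proportions stated above, the following sketch computes each language's share of the adaptation data from the two token counts given in this section. The function name `language_shares` is hypothetical and only serves the illustration.

```python
# Token counts stated in this section: 9B Italian and 3B English tokens.
TOKEN_COUNTS = {"it": 9_000_000_000, "en": 3_000_000_000}


def language_shares(token_counts):
    """Return each language's fraction of the total token count."""
    total = sum(token_counts.values())
    return {lang: count / total for lang, count in token_counts.items()}


shares = language_shares(TOKEN_COUNTS)
print(shares)  # {'it': 0.75, 'en': 0.25}
```

English accounts for 3B of the 12B tokens, which is the one-in-four proportion mentioned above.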
## Use with Transformers

You can run inference using the Transformers `pipeline` abstraction or by using the Auto classes with the `generate()` function.

Make sure to update your transformers installation via `pip install --upgrade transformers`.

```python
import transformers
import torch

model_id = "SemanticAlignment/Llama-3.1-8B-Italian-FVT"

# Load the model in bfloat16 and place it automatically on the available devices.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

pipeline("Cosa si può fare in una bella giornata di sole?")
```

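The Auto classes route mentioned at the start of this section can be sketched as follows. This is a minimal sketch, not the authors' reference code: the helper name `generate_text` and the default of 64 new tokens are assumptions made for illustration. The helper is only defined here, not called, because calling it (for example `generate_text("Cosa si può fare in una bella giornata di sole?")`) downloads the 8B weights and needs enough memory to hold them in bfloat16.

```python
MODEL_ID = "SemanticAlignment/Llama-3.1-8B-Italian-FVT"


def generate_text(prompt: str, model_id: str = MODEL_ID, max_new_tokens: int = 64) -> str:
    """Generate a continuation of `prompt` with the Auto classes and generate()."""
    # Imported lazily so that defining this helper does not load the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    # Tokenize the prompt and move the tensors to the device that holds the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # The decoded string contains the prompt followed by the generated continuation.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```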
Code: https://github.com/SapienzaNLP/sava

## Citation

If you use any part of this work, please consider citing the paper as follows:

```bibtex
@misc{moroni2025optimizingllmsitalianreducing,
    title={Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation},
    author={Luca Moroni and Giovanni Puccetti and Pere-Lluis Huguet Cabot and Andrei Stefan Bejgu and Edoardo Barba and Alessio Miaschi and Felice Dell'Orletta and Andrea Esuli and Roberto Navigli},
    year={2025},
    eprint={2504.17025},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2504.17025},
}
```