Sagui-7B-Instruct-v0.1

Sagui-7B-Instruct-v0.1 is a fine-tuned language model that understands and generates text in both Portuguese and English. It was fine-tuned from the sabia-7b model, which builds on the Llama architecture, using the SlimOrca dataset together with a Portuguese version of that dataset translated with LibreTranslate.

Model Details

Model Description

Sagui-7B-Instruct-v0.1 is designed to assist with natural language understanding and generation tasks. The model was trained to improve its instruction-following capabilities and can be used in a variety of applications.

  • Model type: LlamaForCausalLM
  • Parameters: 6.74B (BF16)
  • Languages (NLP): Portuguese and English
  • License: Llama
  • Fine-tuned from model: sabia-7b

Uses

Direct Use

Sagui-7B-Instruct-v0.1 can be directly used for general language understanding tasks.

Downstream Use

  • Fine-tuning for specific domain-related applications
  • Integration into multilingual applications and tools

Out-of-Scope Use

  • Generating harmful, biased, or offensive content
  • Unauthorized personal data extraction
  • Tasks requiring real-time decision making without human oversight

Bias, Risks, and Limitations

  • Potential biases inherited from training data
  • Risks of misuse in generating misleading or harmful content
  • Limitations in understanding context-specific nuances

Recommendations

It is recommended to have human oversight in applications involving sensitive information or high-stakes decisions.

How to Get Started with the Model

The following code snippet shows how to load the tokenizer and model and how to generate content using apply_chat_template.

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

# Load the model in its saved precision and place it on available devices
model = AutoModelForCausalLM.from_pretrained(
    "OliveiraJLT/Sagui-7B-Instruct-v0.1",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("OliveiraJLT/Sagui-7B-Instruct-v0.1")

# User prompt (Portuguese): "Please tell me about the communication skills of marmosets."
prompt = "Por favor, conte-me sobre as habilidades de comunicação dos saguis."
messages = [
    # System message (Portuguese): "You are Sagui-7B-Instruct-v0.1, a language model.
    # Your mission is to help users with various tasks, providing accurate, relevant,
    # and useful information in a polite, informative, engaging, and professional manner."
    {"role": "system", "content": "Você é Sagui-7B-Instruct-v0.1, um modelo de linguagem. Sua missão é ajudar os usuários em diversas tarefas, fornecendo informações precisas, relevantes e úteis de maneira educada, informativa, envolvente e profissional."},
    {"role": "user", "content": prompt}
]

# Render the conversation with the model's chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_length=2048
)
# Strip the prompt tokens so that only the newly generated text is decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Citation

@software{OliveiraJLT2024Sagui7BInstruct01,
  title = {Sagui-7B-Instruct-v0.1},
  author = {Oliveira, J. L. T.},
  year = {2024},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
  howpublished = {\url{https://huggingface.co/OliveiraJLT/Sagui-7B-Instruct-v0.1}}
}

Open Portuguese LLM Leaderboard Evaluation Results

Detailed results can be found here and on the 🚀 Open Portuguese LLM Leaderboard

| Metric                     | Value |
|----------------------------|-------|
| Average                    | 39.87 |
| ENEM Challenge (No Images) | 51.36 |
| BLUEX (No Images)          | 43.67 |
| OAB Exams                  | 36.22 |
| Assin2 RTE                 | 71.16 |
| Assin2 STS                 | 3.16  |
| FaQuAD NLI                 | 58.05 |
| HateBR Binary              | 46.46 |
| PT Hate Speech Binary      | 30.38 |
| tweetSentBR                | 18.34 |
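The reported Average is the arithmetic mean of the nine task scores; a quick sanity check with the values from the table:

```python
# Task scores from the Open Portuguese LLM Leaderboard table
scores = {
    "ENEM Challenge (No Images)": 51.36,
    "BLUEX (No Images)": 43.67,
    "OAB Exams": 36.22,
    "Assin2 RTE": 71.16,
    "Assin2 STS": 3.16,
    "FaQuAD NLI": 58.05,
    "HateBR Binary": 46.46,
    "PT Hate Speech Binary": 30.38,
    "tweetSentBR": 18.34,
}

# The leaderboard "Average" is the plain mean of the nine task scores
average = round(sum(scores.values()) / len(scores), 2)
print(average)  # 39.87
```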

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric              | Value |
|---------------------|-------|
| Avg.                | 8.39  |
| IFEval (0-Shot)     | 28.92 |
| BBH (3-Shot)        | 5.04  |
| MATH Lvl 5 (4-Shot) | 0.38  |
| GPQA (0-shot)       | 0.00  |
| MuSR (0-shot)       | 10.61 |
| MMLU-PRO (5-shot)   | 5.39  |
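Here too, the Avg. figure is consistent with the plain mean of the six benchmark scores:

```python
# Benchmark scores from the Open LLM Leaderboard table
scores = [28.92, 5.04, 0.38, 0.00, 10.61, 5.39]

avg = round(sum(scores) / len(scores), 2)
print(avg)  # 8.39
```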