Model Summary
Alisia-7B-it is a 7-billion-parameter instruction-tuned language model. It is designed for general-purpose conversational AI and assistant-style tasks, demonstrating strong performance in factual knowledge, commonsense reasoning, and mathematical problem-solving.
Evaluation
Custom Benchmark Results
The model was evaluated on a manually curated suite of 25 questions across 5 categories. The results for the base model are summarized below:
| Benchmark Category | Score | Notes |
|---|---|---|
| Knowledge & Comprehension | 100% | Excellent factual recall. |
| Commonsense Reasoning | 100% | Strong understanding of everyday scenarios. |
| Mathematical Reasoning | 100% | Proficient in arithmetic and algebra. |
| Linguistic Semantics | 80% | Struggles with complex pronoun resolution. |
| Logical & Creative Reasoning | 60% | Primary weakness. Fails on abstract logic and spatial puzzles. |
| **Overall Score** | **88%** | A capable generalist with a clear performance profile. |
Further evaluation on standard benchmarks (e.g., MMLU, HellaSwag, ARC-Challenge) is recommended to confirm these findings at a larger scale.
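As one possible route, such an evaluation could be run with EleutherAI's lm-evaluation-harness. The sketch below is an assumption based on recent 0.4.x releases; the task names and the `simple_evaluate` signature should be verified against the installed version:

```python
# Hypothetical evaluation sketch using lm-evaluation-harness (pip install lm-eval).
# Task names and arguments are assumptions; verify against your installed version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face backend
    model_args="pretrained=Gems234/Alisia-7B-it",  # model under test
    tasks=["mmlu", "hellaswag", "arc_challenge"],
)
print(results["results"])
```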
Training Details
Training Data
The model was fine-tuned on a mixture of publicly available instruction datasets, including but not limited to cleaned versions of the Alpaca dataset. This data primarily consists of instruction-response pairs designed to teach the model to follow user commands.
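For illustration, records in Alpaca-style datasets typically consist of an instruction, an optional input, and a target output. The field names below follow the common Alpaca schema and are shown as a representative, hypothetical example rather than an actual record from the training mixture:

```python
# A representative (hypothetical) Alpaca-style training record.
example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "The Eiffel Tower, completed in 1889, was initially criticized by many artists.",
    "output": "Despite early criticism, the Eiffel Tower became a beloved Paris landmark.",
}
```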
Uses
Direct Use
This model is intended for direct use in the following applications:
- Conversational AI: As a chatbot or interactive assistant.
- Question Answering: Providing factual information and explanations.
- Text Generation: Creative writing, summarization, and ideation.
- Educational Tool: Assisting with homework, particularly in mathematics and general knowledge subjects.
Out-of-Scope Use
The model should not be used for:
- Critical decision-making in legal, medical, or financial contexts.
- Generating highly technical or scientific content without human verification.
- Tasks requiring flawless logical or spatial reasoning (see Limitations).
How to Get Started with the Model
Use the code below to get started with the model. Ensure you have the required libraries installed (e.g., via `pip install "transformers>=4.56.2" torch accelerate`):
Requirements
- transformers>=4.56.2
- torch
- accelerate (required for `device_map="auto"`)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Gems234/Alisia-7B-it"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Create a prompt
prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate a response
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
You can also use the chat template format:
```python
# Chat template
messages = [
    {"role": "user", "content": "Write me a poem about Machine Learning."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

# Generate a response
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
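Note that decoding the full sequence includes the prompt as well as the reply. A minimal follow-up sketch (reusing `inputs` and `outputs` from the snippet above) that prints only the newly generated tokens:

```python
# Slice off the prompt tokens so only the model's reply is decoded.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))
```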
Instruction format
To take full advantage of the model's performance, we recommend using the Alpaca format:
```
### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}
```
For example:

```python
import torch

# Build the prompt from the Alpaca template above
alpaca_prompt = """### Instruction:
{}

### Input:
{}

### Response:
{}"""

instruction = "You are Alisia. Be concise and helpful."
input_text = "Where is the Eiffel Tower?"
output_text = ""  # left empty so the model completes the response

prompt = alpaca_prompt.format(instruction, input_text, output_text)

# Tokenize
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
Limitations
A manual evaluation on a custom benchmark suite revealed the following performance profile for Alisia-7B-it:
Identified Strengths
- Knowledge & Comprehension (MMLU-like): Achieved a perfect score (100%), demonstrating excellent recall of factual information across history, science, and literature.
- Commonsense Reasoning (HellaSwag-like): Achieved a perfect score (100%), showing a robust understanding of everyday physical and social causality.
- Mathematical Reasoning (GSM8K-like): Achieved a perfect score (100%), excelling at basic arithmetic, algebra, and problem-solving.
Identified Weaknesses
- Logical & Creative Reasoning (ARC-like): Achieved a score of 60%. The model struggles with formal logic puzzles (e.g., syllogisms) and non-intuitive spatial reasoning problems. It is not recommended for applications requiring infallible abstract reasoning.
- Linguistic Semantics (Winogrande-like): Achieved a score of 80%. While generally very good, the model can occasionally fail to resolve complex pronoun coreference ambiguities, potentially leading to minor misunderstandings in narrative text or dialogue.
Overall Benchmark Score: 88% (22/25 correct). The model is a robust generalist with a specific, predictable profile of strengths and weaknesses.
Bias, Risks, and Recommendations
Known Biases
- Identity Bias: Due to the nature of its training data (which includes datasets like alpaca-cleaned), the model may occasionally incorrectly identify itself as "ChatGPT" or another AI system. This is a known artifact and does not reflect its actual origin or capabilities.
- As with all large language models, Alisia-7B-it may reflect and amplify social biases present in its training data. Outputs should not be assumed to be free from bias.
Recommendations
Users should:
- Be aware of the model's limitations in logical and spatial reasoning.
- Critically evaluate its outputs, especially for critical applications.
- Use a safety classifier or content filter in production environments.
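As one illustration of the last point, here is a minimal sketch of gating generations behind a content check before returning them. The `is_safe` keyword check is a hypothetical placeholder; in production it should be replaced with a real moderation model or classifier:

```python
# Hypothetical safety gate around generation; replace the placeholder
# keyword check with a real safety classifier in production.
BLOCKLIST = {"example_banned_term"}  # placeholder terms, not a real blocklist

def is_safe(text: str) -> bool:
    return not any(term in text.lower() for term in BLOCKLIST)

def safe_generate(prompt: str, max_new_tokens: int = 128) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return text if is_safe(text) else "[response withheld by content filter]"
```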
Model tree
Base model: Gems234/Alisia-7B-Instruct-V1.0-private