attention_mask bug
#18 opened by ngxson
Using the provided example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
import re

model_name_or_path = os.environ['MODEL_PATH']
# model_name_or_path = "tencent/Hunyuan-A13B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=True)  # You may want to use bfloat16 and/or move to GPU here
messages = [
    {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt",
    enable_thinking=True  # Toggle thinking mode (default: True)
)
outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=4096)
...
```
I noticed that `attention_mask` is never initialized. That means the model always runs in non-causal mode, even for text generation.
Wondering if this is a bug.
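
For reference, here is a minimal sketch of a workaround that should force a proper mask, assuming the installed transformers version supports `return_dict=True` in `apply_chat_template` (the other names are taken from the example above):

```python
# Ask apply_chat_template for a dict so we also get the attention_mask,
# then pass both tensors to generate() explicitly.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,        # returns input_ids AND attention_mask
    return_tensors="pt",
    enable_thinking=True,
).to(model.device)

outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=4096,
)
```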
Some updates:
- For the `eager` attn impl, the missing `attention_mask` causes the model to always be in non-causal mode, thus producing wrong results
- For `sdpa`, it doesn't care about the mask, so the output is correct
- No idea if `flash_attn` works or not, it seems like it's broken (see the comparison sketch below)
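
In case anyone wants to reproduce the comparison, here is a minimal sketch of switching backends via the standard `attn_implementation` argument of `from_pretrained`; it reuses `model_name_or_path`, `tokenizer`, and `tokenized_chat` from the example above and assumes flash-attn is installed for the last entry:

```python
import torch
from transformers import AutoModelForCausalLM

# Load the same checkpoint once per attention backend and compare the outputs
# for the same prompt.
for impl in ("eager", "sdpa", "flash_attention_2"):
    model = AutoModelForCausalLM.from_pretrained(
        model_name_or_path,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
        attn_implementation=impl,  # selects the attention backend
    )
    out = model.generate(tokenized_chat.to(model.device), max_new_tokens=64)
    print(impl, tokenizer.decode(out[0], skip_special_tokens=True))
```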
Also, your router has a bug where some tokens end up using 0 experts.
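
To illustrate what I mean, here is a hypothetical sketch of one common way this can happen and how to detect it; the top-k-plus-capacity routing below is an illustrative stand-in, not the model's actual router code:

```python
import torch

# Hypothetical router check: pick top-k experts per token, then drop assignments
# that exceed each expert's capacity (a common MoE pattern). A token whose every
# assignment is dropped ends up with 0 experts, which is the symptom above.
def count_experts_per_token(router_logits: torch.Tensor, top_k: int, capacity: int) -> torch.Tensor:
    # router_logits: (num_tokens, num_experts)
    topk_idx = router_logits.topk(top_k, dim=-1).indices          # (num_tokens, top_k)
    kept = torch.zeros_like(topk_idx, dtype=torch.bool)
    for expert in range(router_logits.shape[-1]):
        slots = (topk_idx == expert).nonzero(as_tuple=False)      # (n, 2): token/slot pairs
        kept[slots[:capacity, 0], slots[:capacity, 1]] = True     # keep at most `capacity` tokens per expert
    return kept.sum(dim=-1)                                       # experts actually used per token

experts_per_token = count_experts_per_token(torch.randn(16, 8), top_k=2, capacity=3)
print((experts_per_token == 0).sum().item(), "tokens routed to 0 experts")
```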