Creation Process: Upscale > CPT > SFT > Merge
After upscaling, the model was pretrained on approximately 100 MB of light novels and a subset of DCLM records.
SFT used approximately 10 million tokens of SFW / NSFW RP, stories and creative instruct data. I removed some chat data that I think hurt more than it helped and replaced it with conversations from real character cards.
I experimented with LoRA methods, particularly DoRA vs. rsLoRA. With DoRA the writing was fantastic, but the model wasn't able to handle its own creativity, even with further RLHF applied. rsLoRA took the data far better, but was significantly less adept at writing. I merged the two models together, using the stable (rsLoRA) version as the base, which seems to have successfully combined the strengths of both models.
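For context on the rsLoRA side of that comparison: rank-stabilized LoRA changes the scaling applied to the low-rank update from alpha / r to alpha / sqrt(r). A minimal sketch of what that means for the rank and alpha used in the SFT config (r = 128, alpha = 128); the helper name `lora_scale` is made up for illustration and is not part of any library:

```python
import math


def lora_scale(alpha: float, r: int, rslora: bool = False) -> float:
    """Scaling factor applied to the low-rank update B @ A.

    Standard LoRA uses alpha / r; rsLoRA uses alpha / sqrt(r).
    """
    return alpha / math.sqrt(r) if rslora else alpha / r


# Values taken from the SFT config: lora_r = 128, lora_alpha = 128
standard = lora_scale(128, 128)
stabilized = lora_scale(128, 128, rslora=True)

print(f"standard LoRA scale: {standard:.2f}")
print(f"rsLoRA scale:        {stabilized:.2f}")
```

At the same rank and alpha the adapter update is weighted much more heavily under rsLoRA, which may be part of why the two runs behaved so differently. That is a guess, not something I measured.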
Mergekit configs
Upscale
base_model: Darkhn/Magistral-Small-2509-Text-Only
merge_method: passthrough
dtype: bfloat16
slices:
  - sources:
      - model: Darkhn/Magistral-Small-2509-Text-Only
        layer_range: [0, 29]
  - sources:
      - model: Darkhn/Magistral-Small-2509-Text-Only
        layer_range: [10, 40]
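A quick sketch of what the two slices add up to, assuming mergekit treats `layer_range` as half-open (start included, end excluded). It only counts layer indices; it does not touch any weights:

```python
from collections import Counter

# layer_range values from the upscale config above
slices = [(0, 29), (10, 40)]

# Assumption: ranges are half-open, so (0, 29) covers source layers 0..28.
stacked = [layer for start, end in slices for layer in range(start, end)]

# Source layers that appear more than once in the upscaled stack
counts = Counter(stacked)
duplicated = sorted(layer for layer, n in counts.items() if n > 1)

print("layers in upscaled model:", len(stacked))
print("duplicated source layers:", duplicated[0], "to", duplicated[-1])
```

Under that assumption the passthrough stack has 59 layers, with source layers 10 to 28 appearing twice.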
Slerp Merge
models:
  - model: ApocalypseParty/Magi-PT-2-SFT-1-DPO-3
  - model: ApocalypseParty/Magi-PT-2-SFT-2
merge_method: slerp
base_model: ApocalypseParty/Magi-PT-2-SFT-2
parameters:
  t: [0, 0, 0, 0.1, 0.2]
dtype: bfloat16
dtype: bfloat16
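To give a feel for how light this merge is, here is a rough sketch of spherical interpolation and of how a `t` list could be spread across layer depth. Assumptions: the list is linearly interpolated across layers, t = 0 returns the base model, and the stack has 59 layers. The functions `slerp` and `t_for_layer` are illustrative only and are not mergekit's actual code:

```python
import math


def slerp(t, v0, v1, eps=1e-8):
    """Spherical interpolation between two non-zero flat weight vectors."""
    n0 = math.sqrt(sum(x * x for x in v0))
    n1 = math.sqrt(sum(x * x for x in v1))
    dot = sum(a * b for a, b in zip(v0, v1)) / (n0 * n1)
    theta = math.acos(max(-1.0, min(1.0, dot)))
    if math.sin(theta) < eps:
        # Nearly parallel vectors: fall back to plain linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    w0 = math.sin((1 - t) * theta) / math.sin(theta)
    w1 = math.sin(t * theta) / math.sin(theta)
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


def t_for_layer(layer, num_layers, anchors):
    """Linearly interpolate the anchor list across layer depth (assumed behaviour)."""
    pos = layer / (num_layers - 1) * (len(anchors) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(anchors) - 1)
    frac = pos - lo
    return (1 - frac) * anchors[lo] + frac * anchors[hi]


anchors = [0, 0, 0, 0.1, 0.2]  # t from the slerp config
num_layers = 59                # assumed depth of the upscaled model
schedule = [t_for_layer(i, num_layers, anchors) for i in range(num_layers)]

print("first layer t:", schedule[0])
print("last layer t: ", round(schedule[-1], 3))
print("layers left as pure base:", sum(1 for t in schedule if t == 0))
```

Read this way, the first half of the stack stays identical to the base (SFT-2) model, and the other model is blended in only gradually toward the final layers, topping out at t = 0.2.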
Axolotl configs
Pretrain (2*H100)
# ====================
# MODEL CONFIGURATION
# ====================
base_model: ApocalypseParty/magistral-34b
model_type: MistralForCausalLM
tokenizer_type: AutoTokenizer
chat_template: mistral_v7_tekken
# ====================
# DATASET CONFIGURATION
# ====================
datasets:
  - path: ./data/text_files_minimal_dataset.jsonl
    type: completion
  - path: ./data/filtered_results.jsonl
    type: completion
dataset_prepared_path:
train_on_inputs: false # Has no practical effect on completion-type data, which has no prompt/response split
# ====================
# QLORA CONFIGURATION
# ====================
adapter: qlora
load_in_4bit: true
lora_r: 64
lora_alpha: 64
lora_dropout: 0.0
lora_target_linear: true
# lora_modules_to_save: # Uncomment only if you added NEW tokens
# ====================
# TRAINING PARAMETERS
# ====================
num_epochs: 1
micro_batch_size: 4
gradient_accumulation_steps: 1
learning_rate: 3e-5
optimizer: paged_adamw_8bit
lr_scheduler: rex
warmup_ratio: 0.05
weight_decay: 0.0
max_grad_norm: 1.0
# ====================
# SEQUENCE & PACKING
# ====================
sequence_len: 16384
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true
# ====================
# HARDWARE OPTIMIZATIONS
# ====================
bf16: auto
flash_attention: true
gradient_checkpointing: offload
deepspeed: deepspeed_configs/zero1.json
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
cut_cross_entropy: true
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_cross_entropy: false # Cut Cross Entropy overrides this
liger_fused_linear_cross_entropy: false # Cut Cross Entropy overrides this
# ====================
# EVALUATION & CHECKPOINTING
# ====================
save_strategy: steps
save_steps: 40
save_total_limit: 5 # Keep best + last few checkpoints
load_best_model_at_end: true
greater_is_better: false
# ====================
# LOGGING & OUTPUT
# ====================
output_dir: ./Magi-PT-2
logging_steps: 1
save_safetensors: true
# ====================
# WANDB TRACKING
# ====================
wandb_project: Magi-PT
# wandb_entity: your_entity
wandb_name: Magi-PT-2
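As a sanity check on the pretrain settings, this sketch works out how many tokens go into one optimizer step. It assumes both H100s run data-parallel (ZeRO-1 does not split the batch any other way) and that packing fills every sequence to `sequence_len`; the function name is made up for illustration:

```python
def tokens_per_step(micro_batch_size, grad_accum_steps, num_gpus, sequence_len):
    """Upper bound on tokens consumed per optimizer step with full packing."""
    sequences = micro_batch_size * grad_accum_steps * num_gpus
    return sequences * sequence_len


# Pretrain config: micro_batch_size 4, gradient_accumulation_steps 1,
# 2 GPUs, sequence_len 16384
pretrain_tokens = tokens_per_step(4, 1, 2, 16384)

print("sequences per step:", 4 * 1 * 2)
print("tokens per step:   ", pretrain_tokens)
```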
SFT (2*H100)
base_model: ApocalypseParty/Magi-PT-2
model_type: MistralForCausalLM
tokenizer_type: AutoTokenizer
chat_template: mistral_v7_tekken
plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
load_in_8bit: true
load_in_4bit: false
deepspeed: deepspeed_configs/zero1.json
datasets:
  - path: ./data/automated_dataset.jsonl
    type: chat_template
  - path: ./data/chub_dataset.jsonl
    type: chat_template
  - path: ./data/handcrafted_dataset.jsonl
    type: chat_template
  - path: ./data/cw_dataset.jsonl
    type: chat_template
  - path: ./data/instruct_dataset.jsonl
    type: chat_template
  - path: ./data/nsfw_stories.jsonl
    type: chat_template
  - path: ./data/stories_dataset.jsonl
    type: chat_template
dataset_prepared_path: last_run_prepared
val_set_size: 0
output_dir: ./Magi-PT-2-SFT-2
adapter: lora
peft_use_rslora: true
lora_model_dir:
sequence_len: 8192
sample_packing: true
lora_r: 128
lora_alpha: 128
lora_dropout: 0.05
lora_target_linear: true
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj
wandb_project: Magi-SFT
wandb_name: Magi-PT-2-SFT-2
gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 1e-5
bf16: auto
tf32: false
gradient_checkpointing: true
resume_from_checkpoint:
logging_steps: 1
flash_attention: true
warmup_ratio: 0.05
evals_per_epoch: 1
saves_per_epoch: 2
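For reference, a sketch of the learning-rate curve the SFT settings describe: linear warmup over the first 5% of steps, then cosine decay from the peak of 1e-5. The formula is the usual warmup-plus-cosine shape and may differ in small details from what the trainer does; `total_steps = 200` is a made-up number, not the real step count of this run:

```python
import math


def lr_at(step, total_steps, peak_lr=1e-5, warmup_ratio=0.05):
    """Linear warmup followed by cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 200  # hypothetical step count, for illustration only
for step in (0, 5, 10, 100, 200):
    print(step, f"{lr_at(step, total_steps):.2e}")
```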