Painted Fantasy Visage v4

Overview

Magistral 24B Upscaled to 34B.

The latest Magistral model seems pretty good. Has some refreshing prose.

This model is an uncensored, creative writing and RP model. It uses a new (still a work in progress) dataset I've been curating based on real character cards.

Has some structural repetition, at this point it's a calling card of Mistral models. I think it's better than v3 though.

SillyTavern Settings

Recommended Roleplay Format

> Actions: In plaintext

> Dialogue: "In quotes"

> Thoughts: *In asterisks*

Recommended Samplers

> Temp: 0.8

> MinP: 0.05

> TopP: 0.95

Instruct

Mistral v7 Tekken

Quantizations

EXL3

> 3bpw

> 4bpw

> 5bpw

> 6bpw

GGUF

> iMatrix (mradermacher)

Creation Process

Creation Process: Upscale > CPT > SFT > Merge

After upscaling, was pretrained on approx 100MB of light novels and a subset of DCLM records.

SFT on approx 10 million tokens, SFW / NSFW RP, stories and creative instruct. I've removed some chat data which I think hurt more than helped and replaced it with conversations from real character cards.

Did some experimenting with lora methods. Particularly dora vs rslora. With dora the writing was fantastic, but the model wasn't able to handle its own creativity, even with further RLHF applied. Rslora took the data far better, but was significantly less adept at writing. Merged the two models together, using the stable version as a base, which seems to have successfully combined the positives of both models.

Upscale

base_model: Darkhn/Magistral-Small-2509-Text-Only
merge_method: passthrough
dtype: bfloat16
slices:
  - sources:
      - model: Darkhn/Magistral-Small-2509-Text-Only
        layer_range: [0, 29]
  - sources:
      - model: Darkhn/Magistral-Small-2509-Text-Only
        layer_range: [10, 40]

Slerp Merge

models:
  - model: ApocalypseParty/Magi-PT-2-SFT-1-DPO-3
  - model: ApocalypseParty/Magi-PT-2-SFT-2
merge_method: slerp
base_model: ApocalypseParty/Magi-PT-2-SFT-2
parameters:
  t: [0, 0, 0, 0.1, 0.2]
dtype: bfloat16

Pretrain (2*H100)

# ====================
# MODEL CONFIGURATION
# ====================
base_model: ApocalypseParty/magistral-34b
model_type: MistralForCausalLM
tokenizer_type: AutoTokenizer
chat_template: mistral_v7_tekken
# ====================
# DATASET CONFIGURATION
# ====================
datasets:
  - path: ./data/text_files_minimal_dataset.jsonl
    type: completion
  - path: ./data/filtered_results.jsonl
    type: completion


dataset_prepared_path:
train_on_inputs: false  # Only train on assistant responses


# ====================
# QLORA CONFIGURATION
# ====================
adapter: qlora
load_in_4bit: true
lora_r: 64
lora_alpha: 64
lora_dropout: 0.0
lora_target_linear: true
# lora_modules_to_save:  # Uncomment only if you added NEW tokens


# ====================
# TRAINING PARAMETERS
# ====================
num_epochs: 1
micro_batch_size: 4
gradient_accumulation_steps: 1
learning_rate: 3e-5
optimizer: paged_adamw_8bit
lr_scheduler: rex
warmup_ratio: 0.05
weight_decay: 0.0
max_grad_norm: 1.0


# ====================
# SEQUENCE & PACKING
# ====================
sequence_len: 16384
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true


# ====================
# HARDWARE OPTIMIZATIONS
# ====================
bf16: auto
flash_attention: true
gradient_checkpointing: offload
deepspeed: deepspeed_configs/zero1.json


plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
cut_cross_entropy: true
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_cross_entropy: false  # Cut Cross Entropy overrides this
liger_fused_linear_cross_entropy: false  # Cut Cross Entropy overrides this


# ====================
# EVALUATION & CHECKPOINTING
# ====================
save_strategy: steps
save_steps: 40
save_total_limit: 5  # Keep best + last few checkpoints
load_best_model_at_end: true
greater_is_better: false


# ====================
# LOGGING & OUTPUT
# ====================
output_dir: ./Magi-PT-2
logging_steps: 1
save_safetensors: true


# ====================
# WANDB TRACKING
# ====================
wandb_project: Magi-PT
# wandb_entity: your_entity
wandb_name: Magi-PT-2

SFT (2*H100)

base_model: ApocalypseParty/Magi-PT-2
model_type: MistralForCausalLM
tokenizer_type: AutoTokenizer
chat_template: mistral_v7_tekken


plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin


load_in_8bit: true
load_in_bit: false
deepspeed: deepspeed_configs/zero1.json


datasets:
  - path: ./data/automated_dataset.jsonl
    type: chat_template
  - path: ./data/chub_dataset.jsonl
    type: chat_template
  - path: ./data/handcrafted_dataset.jsonl
    type: chat_template
  - path: ./data/cw_dataset.jsonl
    type: chat_template
  - path: ./data/instruct_dataset.jsonl
    type: chat_template
  - path: ./data/nsfw_stories.jsonl
    type: chat_template
  - path: ./data/stories_dataset.jsonl
    type: chat_template


dataset_prepared_path: last_run_prepared
val_set_size: 0
output_dir: ./Magi-PT-2-SFT-2


adapter: lora
peft_use_rslora: true
lora_model_dir:


sequence_len: 8192
sample_packing: true


lora_r: 128
lora_alpha: 128
lora_dropout: 0.05
lora_target_linear: true
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj


wandb_project: Magi-SFT
wandb_name: Magi-PT-2-SFT-2


gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 1e-5


bf16: auto
tf32: false


gradient_checkpointing: true
resume_from_checkpoint:
logging_steps: 1
flash_attention: true


warmup_ratio: 0.05
evals_per_epoch: 1
saves_per_epoch: 2