---
license: cc0-1.0
datasets:
- PleIAs/common_corpus
- isaacus/mteb-GovReport
- sedthh/gutenberg_english
- wikimedia/wikipedia
language:
- en
---
# LibreModel I (0.96B)

## Model Description

LibreModel I is a 960M-parameter language model trained exclusively on copyright-free, public domain data using a novel 4-phase curriculum learning approach. This model demonstrates that competitive language models can be built without relying on copyrighted content, making AI development more accessible and legally clear.

**Key Innovation:** First model to use curriculum learning with exclusively public domain data, showing that copyright-free training can achieve competitive results at a fraction of typical training costs ($500 total budget).

## Model Details
- **Model Type:** Causal Language Model (GPT-style)
- **Parameters:** 960M (0.96B)
- **Architecture:** LlamaConfig with optimizations
- **Context Length:** 3,072 tokens
- **Vocabulary Size:** 128,256 (LLaMA 3 tokenizer)
- **Training Tokens:** 19.2B (Chinchilla-optimal, ~20 tokens per parameter)
- **Training Cost:** ~$500 using AWS spot instances
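Since the model uses a standard Llama-style architecture and the LLaMA 3 tokenizer, it should load with the Hugging Face `transformers` API. The sketch below is illustrative only; the hub ID `openconstruct/libremodel-i` is a placeholder assumption, not confirmed by this card.

```python
# Minimal loading sketch, assuming a standard transformers checkpoint layout.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openconstruct/libremodel-i"  # hypothetical hub ID

tokenizer = AutoTokenizer.from_pretrained(model_id)  # LLaMA 3 tokenizer, 128,256-token vocab
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the model was trained in bf16 mixed precision
)

# Plain completion (this is a base model, not instruction-tuned), within the 3,072-token context.
inputs = tokenizer("The history of public domain literature", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```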
### Architecture Features
- **Layers:** 22 transformer layers
- **Attention Heads:** 24 total, 8 key-value heads (3:1 GQA)
- **Hidden Size:** 1,536
- **Sink Tokens:** 4 persistent context tokens for improved long-range attention
- **Optimizations:** Flash Attention 2, gradient checkpointing, bf16 mixed precision
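For reference, the listed dimensions map onto a `LlamaConfig` roughly as sketched below. The intermediate (MLP) size is not stated in this card and is an assumption; the 4 sink tokens are not a standard `LlamaConfig` field and would require a custom attention implementation.

```python
# Approximate configuration sketch matching the figures above (not the official training config).
from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=128_256,            # LLaMA 3 tokenizer
    hidden_size=1_536,
    num_hidden_layers=22,
    num_attention_heads=24,
    num_key_value_heads=8,         # 3:1 grouped-query attention
    max_position_embeddings=3_072, # 3,072-token context window
    intermediate_size=4_096,       # assumed; not stated in the card
)
```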
## 4-Phase Curriculum Training

### Phase 1: Foundation (0-8%)
- 70% Project Gutenberg (literature, classics)
- 30% Government Reports (analytical structure)
### Phase 2: Diversification (8-20%)
- 50% Project Gutenberg
- 45% Wikipedia (factual knowledge)
- 5% Government Reports
### Phase 3: Advanced Reasoning (20-40%)
- 40% Project Gutenberg
- 30% Harvard Legal Cases (logical reasoning)
- 30% Wikipedia
### Phase 4: Optimization (40-100%)
- 40% Project Gutenberg
- 30% Wikipedia
- 30% OpenGovernment (diverse analytical content)
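The phase percentages refer to the fraction of total training tokens seen. A minimal sketch of that schedule as sampling weights is shown below; the actual data-mixing code used during training is not published, so this is illustrative only.

```python
# Illustrative sketch of the 4-phase sampling schedule described above.
def curriculum_mix(progress: float) -> dict[str, float]:
    """Return source sampling weights for a given training progress in [0, 1]."""
    if progress < 0.08:   # Phase 1: Foundation
        return {"gutenberg": 0.70, "gov_reports": 0.30}
    if progress < 0.20:   # Phase 2: Diversification
        return {"gutenberg": 0.50, "wikipedia": 0.45, "gov_reports": 0.05}
    if progress < 0.40:   # Phase 3: Advanced Reasoning
        return {"gutenberg": 0.40, "legal_cases": 0.30, "wikipedia": 0.30}
    # Phase 4: Optimization (legal cases replaced with OpenGovernment data, see note below)
    return {"gutenberg": 0.40, "wikipedia": 0.30, "open_government": 0.30}
```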
**Note:** Harvard legal data was removed after the 40% mark due to persistent training instabilities and replaced with OpenGovernment data, which provided better stability while preserving analytical reasoning patterns.

## Training Data Sources (100% Public Domain)
- **Project Gutenberg:** Classical literature, philosophy, science texts
- **Wikipedia:** Encyclopedia articles and factual content
- **Government Documents:** Policy papers, reports, legal documents
- **OpenGovernment:** Diverse government publications and analyses
Total: ~19.2B tokens across all phases, with careful curation to ensure public domain status.
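The datasets listed in this card's metadata are available on the Hugging Face Hub and can be streamed for inspection; the sketch below assumes the default splits and a particular Wikipedia dump config, neither of which is specified by this card.

```python
# Streaming the public-domain source datasets listed in the card metadata (split/config names assumed).
from datasets import load_dataset

gutenberg = load_dataset("sedthh/gutenberg_english", split="train", streaming=True)
gov_reports = load_dataset("isaacus/mteb-GovReport", split="train", streaming=True)
wikipedia = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)
common_corpus = load_dataset("PleIAs/common_corpus", split="train", streaming=True)
```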
This is a base model and is not yet ready for general use. Post-training begins at the end of the month, and we will upload the resulting model once it is done. GGUFs can be found at https://github.com/openconstruct/libremodel/releases
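One way to try the GGUF releases locally is via `llama-cpp-python`; this is a suggestion on our part rather than an official recommendation, and the filename below is a placeholder for whatever file is attached to the release.

```python
# Sketch: running a downloaded GGUF with llama-cpp-python (filename is a placeholder).
from llama_cpp import Llama

llm = Llama(model_path="libremodel-i.gguf", n_ctx=3072)  # match the 3,072-token context
out = llm("Public domain texts are", max_tokens=64)
print(out["choices"][0]["text"])
```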