--- title: AI‑Culture‑Commons emoji: 📚 colorFrom: indigo colorTo: gray sdk: static pinned: true thumbnail: >- https://cdn-uploads.huggingface.co/production/uploads/678d64ee7967054e64970908/_DjOT2roXXm_sltcskzDZ.jpeg short_description: Multilingual cultural corpora for AI research license: cc-by-4.0 --- # AI‑Culture‑Commons AI‑Culture‑Commons curates [multilingual cultural corpora](https://huggingface.co/datasets/AI-Culture-Commons/ai-culture-html-multilingual) for language‑model research. We are a **non-profit digital humanities project**, advancing humane AI development through high-quality, rich cultural content. We strive to contribute to the **cultural evolution of artificial intelligence** by providing sophisticated training data that explores the intersection of technology, artificial intelligence, and human culture. [Our repositories](https://huggingface.co/datasets/AI-Culture-Commons/ai-culture-html-multilingual) provide models with deep philosophical-intellectual context, diverse connections between culture, philosophy, literature, and technology—particularly AI. Our content is specifically designed to help train more **culturally aware and philosophically grounded AI models**. ## Our Datasets | Dataset | Size | Languages | Formats | License | Citation & Research | |---------|------|-----------|---------|---------|---------| | **Multilingual Culture Corpus** | 16M words | 12 ALIGNED languages | HTML · CSV · DOLMA JSONL | CC‑BY‑4.0 | [![DOI](https://zenodo.org/badge/1021100370.svg)](https://doi.org/10.5281/zenodo.16001657) | | **Project Websites Raw** | 160MB | 12 ALIGNED languages | ZIP (HTML + images + CSS) | CC‑BY‑4.0 | [![DOI](https://zenodo.org/badge/1021100223.svg)](https://doi.org/10.5281/zenodo.16001641) | ## Key Features - **Perfect Alignment**: All 12 languages contain identical content with exact same complex HTML structure. All datasets include both pure text and HTML source files - **AI-Optimized**: Designed specifically for training multilingual AI systems - **Truly Open**: [CC-BY-4.0 license](https://creativecommons.org/licenses/by/4.0/) - use freely, even commercially - **Content Quality**: Sophisticated content with intellectual depth, authored by a group of academics and writers - **Completely Clean Data**: No user comments, scraped texts, or unwanted content - pure, high-quality, carefully edited content - **Full Documentation**: [Complete pipeline](https://github.com/AI-Culture-Commons/ai-culture-pipeline) description and documentation in dataset cards. All datasets are versioned and archived for research reproducibility ## Languages English, French, German, Spanish, Portuguese, Italian, Japanese, Russian, Korean, Mandarin, Hindi, Hebrew ## Source Websites & Licensing Our corpora are carefully extracted from our websites: - **Original Project**: [https://hitdarderut-haaretz.org](https://hitdarderut-haaretz.org) - Cultural analysis - [License Terms](https://hitdarderut-haaretz.org/license): CC-BY-4.0 - **Multicultural Project**: [https://degeneration-of-nation.org](https://degeneration-of-nation.org) - Critical philosophical commentary - [License Terms](https://degeneration-of-nation.org/license): CC-BY-4.0 --- *As a non-profit organization, we're committed to advancing humane AI through high-quality, clean cultural datasets with perfect multilingual alignment*