---
title: AI‑Culture‑Commons
emoji: 📚
colorFrom: indigo
colorTo: gray
sdk: static
pinned: true
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/678d64ee7967054e64970908/_DjOT2roXXm_sltcskzDZ.jpeg
short_description: Multilingual cultural corpora for AI research
license: cc-by-4.0
---

# AI‑Culture‑Commons
AI‑Culture‑Commons curates [multilingual cultural corpora](https://huggingface.co/datasets/AI-Culture-Commons/ai-culture-html-multilingual) for language‑model research.

We are a **non-profit digital humanities project**, advancing humane AI development through high-quality, rich cultural content. We strive to contribute to the **cultural evolution of artificial intelligence** by providing sophisticated training data that explores the intersection of technology, artificial intelligence, and human culture.

[Our repositories](https://huggingface.co/datasets/AI-Culture-Commons/ai-culture-html-multilingual) provide models with deep philosophical-intellectual context, diverse connections between culture, philosophy, literature, and technology—particularly AI. Our content is specifically designed to help train more **culturally aware and philosophically grounded AI models**.

## Our Datasets
| Dataset | Size | Languages | Formats | License | Citation & Research |
|---------|------|-----------|---------|---------|---------|
| **Multilingual Culture Corpus** | 16M words | 12 ALIGNED languages | HTML · CSV · DOLMA JSONL | CC‑BY‑4.0 | [![DOI](https://zenodo.org/badge/1021100370.svg)](https://doi.org/10.5281/zenodo.16001657) |
| **Project Websites Raw** | 160MB | 12 ALIGNED languages | ZIP (HTML + images + CSS) | CC‑BY‑4.0 | [![DOI](https://zenodo.org/badge/1021100223.svg)](https://doi.org/10.5281/zenodo.16001641) |

## Key Features
- **Perfect Alignment**: All 12 languages contain identical content with exact same complex HTML structure. All datasets include both pure text and HTML source files
- **AI-Optimized**: Designed specifically for training multilingual AI systems
- **Truly Open**: [CC-BY-4.0 license](https://creativecommons.org/licenses/by/4.0/) - use freely, even commercially
- **Content Quality**: Sophisticated content with intellectual depth, authored by a group of academics and writers
- **Completely Clean Data**: No user comments, scraped texts, or unwanted content - pure, high-quality, carefully edited content
- **Full Documentation**: [Complete pipeline](https://github.com/AI-Culture-Commons/ai-culture-pipeline) description and documentation in dataset cards. All datasets are versioned and archived for research reproducibility

## Languages
English, French, German, Spanish, Portuguese, Italian, Japanese, Russian, Korean, Mandarin, Hindi, Hebrew

## Source Websites & Licensing
Our corpora are carefully extracted from our websites:
- **Original Project**: [https://hitdarderut-haaretz.org](https://hitdarderut-haaretz.org) - Cultural analysis
  - [License Terms](https://hitdarderut-haaretz.org/license): CC-BY-4.0
- **Multicultural Project**: [https://degeneration-of-nation.org](https://degeneration-of-nation.org) - Critical philosophical commentary  
  - [License Terms](https://degeneration-of-nation.org/license): CC-BY-4.0

---
*As a non-profit organization, we're committed to advancing humane AI through high-quality, clean cultural datasets with perfect multilingual alignment*