HuggingFaceFW-Dev

mjaggi

authored a paper 4 days ago

Apertus: Democratizing Open and Compliant LLMs for Global Language Environments

Paper • 2509.14233 • Published 18 days ago • 11

davanstrien

posted an update about 1 month ago

Post

661

I fine-tuned a smol VLM to generate specialized art history metadata!

https://huggingface.co/davanstrien/iconclass-vlm: Qwen2.5-VL-3B trained using SFT to generate ICONCLASS codes (think Dewey Decimal for art!)

Trained with TRL + HF Jobs - single UV script, no GPU needed!

Space to explore predictions on a test set: davanstrien/iconclass-predictions

Blog soon!

mjaggi

authored a paper about 1 month ago

Benchmarking Optimizers for Large Language Model Pretraining

Paper • 2509.01440 • Published Sep 1 • 24

eliebak

posted an update about 1 month ago

Post

3150

Super excited to announce that our research team at Hugging Face will be doing an AMA on Reddit's r/LocalLLaMA.

Come ask the team behind SmolLM, FineWeb and more any questions you have! And who knows, maybe there’ll be a shiny new release to talk about?

Thursday 4th September, 8AM-11AM PST 🤗

science

eliebak

posted an update about 1 month ago

Post

606

Motif 2.6B tech report is pretty insane, first time I've seen a model with differential attention and PolyNorm trained at scale!

> It's trained on 2.5T tokens, with a "data mixture schedule" that continuously adjusts the mixture over training.
> They use WSD with a "simple moving average" that averages the last 6 checkpoints, taken every 8B tokens.
> They trained on FineMath, FineWeb2, DCLM, TxT360.
> Lots of details on the fine-tuning data they used: for instance, they used EvolKit and did some "dataset fusion" to pack more compressed knowledge into the data.
> They mention they also tried Normalized GPT, QK-Norm and Cross Layer Attention.

Motif-Technologies/Motif-2.6B
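To make the "WSD with a simple moving average" note above concrete, here is a minimal sketch, not the Motif implementation. It assumes linear warmup and linear decay (the post does not say which decay shape is used) and represents a checkpoint as a plain dict of flat weight lists. The names `wsd_lr` and `average_last_checkpoints` are hypothetical.

```python
from typing import Dict, List

# Toy checkpoint format (an assumption): parameter name -> flat list of weights.
Checkpoint = Dict[str, List[float]]


def wsd_lr(step: int, peak_lr: float, warmup: int, stable: int, decay: int) -> float:
    """Warmup-stable-decay schedule with linear warmup and linear decay."""
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    if step < warmup + stable:
        return peak_lr
    remaining = warmup + stable + decay - step
    return peak_lr * max(remaining, 0) / decay


def average_last_checkpoints(checkpoints: List[Checkpoint], window: int = 6) -> Checkpoint:
    """Uniformly average the most recent `window` checkpoints, parameter by parameter."""
    recent = checkpoints[-window:]
    if not recent:
        raise ValueError("need at least one checkpoint")
    averaged: Checkpoint = {}
    for name in recent[0]:
        # zip(*...) walks the same weight position across all recent checkpoints.
        columns = zip(*(ckpt[name] for ckpt in recent))
        averaged[name] = [sum(col) / len(recent) for col in columns]
    return averaged


if __name__ == "__main__":
    # Ten fake checkpoints, as if one were saved every 8B tokens.
    ckpts = [{"w": [float(i), 2.0 * i]} for i in range(10)]
    print(average_last_checkpoints(ckpts, window=6))
    print([wsd_lr(s, 1.0, 10, 100, 10) for s in (0, 50, 115, 120)])
```

In a real training run the same averaging would be applied to framework tensors rather than Python lists; the window of 6 mirrors the number quoted in the post.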

joelniklaus

authored 5 papers about 2 months ago

INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge

Paper • 2411.19799 • Published Nov 29, 2024 • 14

LEXam: Benchmarking Legal Reasoning on 340 Law Exams

Paper • 2505.12864 • Published May 19 • 2

Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization

Paper • 2508.04796 • Published Aug 6

From Citations to Criticality: Predicting Legal Decision Influence in the Multilingual Swiss Jurisprudence

Paper • 2410.13460 • Published Oct 17, 2024

Unlocking Legal Knowledge: A Multilingual Dataset for Judicial Summarization in Switzerland

Paper • 2410.13456 • Published Oct 17, 2024

hynky

updated a Space about 2 months ago

Trackio Classification Space

🚀

Visualize project metrics and runs

hynky

published a Space about 2 months ago

Trackio Classification Space

🚀

Visualize project metrics and runs

hynky

updated a model about 2 months ago

HuggingFaceFW-Dev/fine-pdfs-classification-teacher-eng_Latn-Qwen_Qwen3-235B-A22B-Instruct-2507

Text Classification • 0.1B • Updated Aug 18 • 13 • 1

hynky

published a model about 2 months ago

HuggingFaceFW-Dev/fine-pdfs-classification-teacher-eng_Latn-Qwen_Qwen3-235B-A22B-Instruct-2507

Text Classification • 0.1B • Updated Aug 18 • 13 • 1

eliebak

posted an update 3 months ago

Post

4723

Kimi K2 tech report is full of gems as always. Here are my notes on it:

> MuonClip: Pretty crazy how training stabilizes after 70k steps and QK-clip becomes basically inactive. There is also no loss in performance with QK-clip, which is not trivial at all (shown at small scale, but with an aggressive threshold). There's also a cool explanation in Appendix E of why Muon makes the logits explode (tl;dr: Muon makes the singular values of the update matrix larger).
> Sparsity scaling laws justify their ratio. They have a very solid training infra that allows the model to be trained at this sparsity level; they could have increased it even more, but training becomes less efficient as sparsity increases.
> They reduce the number of attention heads to make long context more efficient, since attention heads are a big bottleneck there. They also remove 2 of the 3 "first dense" layers in the DeepSeek V3 (dsv3) architecture.

With the sparsity and attention heads (divided by 2) they achieve 83% increased flops compared to deepseek v3 arch at 128k.
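As a rough illustration of the QK-clip idea from the MuonClip note above: when the largest attention logit exceeds a threshold, the query and key projection weights are scaled down so the logit lands back at the threshold. This is a toy single-head sketch in plain Python, not the Kimi K2 code. As I understand the report, the real version works per attention head after each optimizer step. The helper names and the threshold value here are assumptions.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, enough for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def max_attention_logit(x: Matrix, w_q: Matrix, w_k: Matrix) -> float:
    """Largest scaled dot-product logit q_i . k_j / sqrt(d) over all token pairs."""
    queries = matmul(x, w_q)
    keys = matmul(x, w_k)
    scale = 1.0 / math.sqrt(len(queries[0]))
    return max(
        scale * sum(qi * ki for qi, ki in zip(q, k))
        for q in queries
        for k in keys
    )


def qk_clip(w_q: Matrix, w_k: Matrix, s_max: float, tau: float) -> Tuple[Matrix, Matrix]:
    """Rescale W_q and W_k so that a max logit above tau is pulled back to tau."""
    if s_max <= tau:
        return w_q, w_k  # clip is inactive
    # Each matrix takes sqrt(tau / s_max), so their product scales by tau / s_max.
    factor = math.sqrt(tau / s_max)
    clipped_q = [[factor * v for v in row] for row in w_q]
    clipped_k = [[factor * v for v in row] for row in w_k]
    return clipped_q, clipped_k


if __name__ == "__main__":
    x = [[1.0, 0.0], [0.0, 2.0]]
    w_q = [[3.0, 0.0], [0.0, 3.0]]
    w_k = [[2.0, 0.0], [0.0, 2.0]]
    s_max = max_attention_logit(x, w_q, w_k)
    new_q, new_k = qk_clip(w_q, w_k, s_max, tau=10.0)
    print(s_max, max_attention_logit(x, new_q, new_k))
```

Splitting the correction evenly between the two matrices is what makes the clip a no-op once the logits stay below the threshold, which matches the observation that it goes inactive later in training.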

> Data: Rephrasing is KEY. They do a lot more synthetic data generation and rephrase their corpus to have different styles, for longer documents they do it by chunk. I'm (half) surprised by the fact that ONLY 1 epoch (assuming same number of training tokens I think?) of data rephrased 10 times has better accuracy than 10 epochs of the same data rephrased once.
> They do rewriting for Math and Knowledge. For Math, they apply the SwallowMath recipe and instruct the model to rephrase in a "learning note" style.
> They talk about diversity and probably have some internal stuff/eval to test that; as always, it's still a bit unclear to me how to properly measure that.
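The chunk-wise rephrasing mentioned in the data notes above could look roughly like this. It is a sketch under stated assumptions, not the Kimi K2 pipeline: `rephrase` stands in for a call to a language model, `toy_rephrase` is a placeholder that only tags the text, and the word-based chunk size is arbitrary.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def rephrase_document(
    text: str,
    style: str,
    rephrase: Callable[[str, str], str],
    max_words: int = 256,
) -> str:
    """Rephrase short documents in one call and long ones chunk by chunk."""
    if len(text.split()) <= max_words:
        return rephrase(text, style)
    chunks = split_into_chunks(text, max_words)
    return " ".join(rephrase(chunk, style) for chunk in chunks)


def toy_rephrase(chunk: str, style: str) -> str:
    """Placeholder for a model call (hypothetical): just tags the chunk with the style."""
    return f"[{style}] {chunk}"


if __name__ == "__main__":
    doc = "a b c d e"
    # Several styles for the same document, in the spirit of "rephrased 10 times".
    for style in ("learning note", "dialogue"):
        print(rephrase_document(doc, style, toy_rephrase, max_words=2))
```

A real pipeline would likely pass earlier context to each chunk and check the rewrite for fidelity; both are left out here to keep the sketch small.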

The infra is also very nice, quick summary:
> PP=16 (1F1B schedule, a bit custom), EP=16, ZeRO-1
> No FP8 computation, only FP8 storage for specific layers; selective recomputation for inexpensive blocks; activation offloading to CPU

NXz64Fdf8Y

authored a paper 3 months ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26 • 71

thomwolf

authored a paper 3 months ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26 • 71

lvwerra

authored a paper 3 months ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26 • 71

mjaggi

authored a paper 3 months ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26 • 71

negar-foroutan

authored a paper 3 months ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26 • 71

Team members 18
