ml-fw-prerelease (ml-fw-prerelease)

posted an update 4 days ago

Post

255

You're probably training on outdated Wikipedia data right now and don't know it. 💡

In June last year, a friend from the Moroccan Wikipedia community slid into my DMs: "Are you using the current version? The official dataset is severely outdated. We added so many articles nowhere to be found on HuggingFace."

He was right. I was running a 2023 snapshot. In 2025. The official Wikipedia dataset, the one hundreds of labs and researchers grab by default without a second thought, was frozen in time.
• For English, that's 700,000 missing articles.
• For Moroccan Arabic, 30% of the language's entire Wikipedia.
• For 31 other languages, there was literally no text corpus at all until recently.

I could've shrugged and moved on. Instead I spent the next months building a monthly automated pipeline for 340+ languages, on my personal laptop, nearly killing it several times in the process (100% disk, frozen screen, the works).

Nous Research trained Hermes 4 on it. INRIA cited it. It's now three years ahead of what most people are training on.

Here's the full story of how I built Wikipedia Monthly 👇

https://omarkamali.com/blog/wikipedia-monthly-pipeline

omarkamali

posted an update 2 months ago

Post

1694

New year, new dataset 🚀

I just released omarkamali/wikipedia-labels, with all the structural labels and namespaces from Wikipedia in 300+ languages. A gift for the data preprocessors and cleaners among us.

Happy new year 2026 everyone! 🎆

omarkamali

posted an update 3 months ago

Post

297

Picomon v0.2.0 released! 💫

- Supports all of AMD, Nvidia and Apple Silicon 🧑‍🧑‍🧒‍🧒
- Beautiful TUI with themes (who said monitoring should be boring?) 💅
- Shareable Rig Cards! Boast to friends, family and foes alike 🫨

Get it now! Run uvx picomon, or pip install picomon and then picomon.


omarkamali

posted an update 3 months ago

Post

3491

Hello picomon! AMD GPU Monitoring made easy

Just run uvx picomon and behold:

┌──────────────────────────────────────────┐  ┌──────────────────────────────────────────┐
│ GPU 0  GFX  42%  UMC  21%                │  │ GPU 1  GFX  78%  UMC  66%                │
│ PWR 135/250W (54%)  VRAM 10.0/16.0GB 62% │  │ PWR 210/250W (84%)  VRAM 14.5/16.0GB 90% │
│                                          │  │                                          │
│ GFX ▁▂▂▃▄▄▅▆▆▇█▇▆▅▄▃▂▁                   │  │ GFX ▂▃▄▅▆▇██▇▆▅▄▂▂▃▅▆                    │
│ PWR ▁▁▂▂▃▄▄▅▆▇██▇▆▅▄▂▁                   │  │ PWR ▂▂▃▄▅▆▇██▇▆▅▄▃▂▂▃                    │
│ VRM ▁▁▂▂▃▄▄▅▆▇███▇▆▅▄▂                   │  │ VRM ▂▃▄▅▆▆▇███▇▆▅▄▃▂▂                    │
└──────────────────────────────────────────┘  └──────────────────────────────────────────┘

Repo at https://github.com/omarkamali/picomon
Or on PyPI at https://pypi.org/project/picomon

omarkamali

posted an update 3 months ago

Post

5229

Exciting updates to the Wikipedia Monthly dataset for November! 🚀

・ Fixed a bug to remove infobox leftovers and other wiki markers such as __TOC__
・ New Python package https://pypi.org/project/wikisets: a dataset builder with efficient sampling, so you can seamlessly combine the languages you want for any date (ideal for pretraining data, but it works for any purpose)
・ Moved the pipeline to a large server. Much higher costs but with better reliability and predictability (let me know if you'd like to sponsor this!).
・ Dataset sizes are unfortunately missing for this month due to shenanigans with the migration, but should be back in December's update.
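To make the first fix concrete, here is a minimal sketch of stripping leftover markers such as __TOC__ from article text. It is not the pipeline's actual code: the function name strip_magic_words and the regular expression are assumptions for illustration, and the pattern only covers uppercase ASCII switches (localized wikis can use translated ones).

```python
import re

# Behavior switches ("magic words") look like __TOC__ or __NOTOC__.
# Assumption: uppercase ASCII letters only; localized variants are not handled.
MAGIC_WORD = re.compile(r"__[A-Z]+__")


def strip_magic_words(text: str) -> str:
    """Remove behavior switches and tidy the whitespace they leave behind."""
    cleaned = MAGIC_WORD.sub("", text)
    # Collapse runs of spaces or tabs created by the removal.
    cleaned = re.sub(r"[ \t]{2,}", " ", cleaned)
    # Drop trailing spaces at the end of each line.
    cleaned = re.sub(r"[ \t]+\n", "\n", cleaned)
    # Collapse three or more newlines into a single blank line.
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    return cleaned.strip()


sample = "Intro paragraph.\n__TOC__\nFirst section __NOEDITSECTION__ text."
print(strip_magic_words(sample))
```

Lowercase double-underscore names such as __init__ are left alone, which matters for articles that quote code.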

Check out the dataset:
omarkamali/wikipedia-monthly

SivilTaram

authored a paper 4 months ago

Diffusion Language Models are Super Data Learners

Paper • 2511.03276 • Published Nov 5, 2025 • 129

nouamanetazi

posted an update 4 months ago

Post

4547

After training SmolLM3 on 384 H100s for nearly a month, I've come to realize something most people overlook: infrastructure is the make-or-break factor in LLM training. 🔥

Everyone talks about model architecture and data quality. And yes, those matter immensely. But here's what nobody tells you: when your training run fails at 2 AM because of mysterious NCCL errors, or when your expensive GPU cluster is running at 60% efficiency, the problem isn't your model. It's most probably a misuse of the hardware. 🛠️

Questions that seemed simple but had no clear answers: Why is MoE training slower than dense models? Which NCCL flags should we actually set? How often should we checkpoint without killing throughput?

That's why we built The Smol Training Playbook 📖: a complete guide covering everything from model architecture and data curation to the SmolLM3 training marathon, post-training techniques, and crucially, the infrastructure layer that most teams get wrong.

We validated real vs theoretical bandwidth across the entire stack: HBM3 hitting 3 TB/s, NVLink 4.0 reaching 786 GB/s, PCIe Gen4 at 14.2 GB/s. Then we ran collective operations across 128 GPUs (16 nodes, 8xH100s each) and measured how performance degrades at scale: all-reduce drops from 480 GB/s on a single node to 320-350 GB/s across 16 nodes.

If you've ever wondered why your training runs are slower than they should be, or you're planning to scale up and want to avoid expensive mistakes, this guide might save you weeks of debugging.

The Smol Training Playbook: https://lnkd.in/e5MKXUHS
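To put the all-reduce numbers in proportion, here is a small back-of-the-envelope calculation using only the figures quoted above. The variable and function names are illustrative, not taken from the playbook.

```python
# Figures quoted in the post (GB/s).
single_node_gbps = 480.0
multi_node_range_gbps = (320.0, 350.0)  # measured across 16 nodes


def retained_fraction(multi_node: float, single_node: float) -> float:
    """Share of single-node all-reduce bandwidth that survives at scale."""
    return multi_node / single_node


low, high = (retained_fraction(x, single_node_gbps) for x in multi_node_range_gbps)
print(f"{low:.0%} to {high:.0%} of single-node bandwidth retained")
```

In other words, going from one node to 16 costs roughly a quarter to a third of the all-reduce bandwidth in these measurements.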

Shared with ❤️ by the HuggingFace team

Zaid

authored a paper 4 months ago

Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures

Paper • 2510.24081 • Published Oct 28, 2025 • 20

lbourdois

posted an update 5 months ago

Post

1508

New blog post analyzing the top 50 entities with the most downloaded models on @huggingface 🤗!

https://huggingface.co/blog/lbourdois/huggingface-models-stats

The purpose here is to get an idea of the profile of the models with the greatest impact in open source (we are not interested in closed models here!).

32 figures + data

Enjoy 🤗

omarkamali

posted an update 5 months ago

Post

303

Another month, another Wikipedia Monthly release! 🎃

Highlights of October's edition:
· 🗣️ 341 languages
· 📚 64.7M articles (+2.5%)
· 📦 89.4GB of data (+3.3%)

We now use reservoir sampling to draw a random subset of each language, published as the splits 1000, 5000, and 10000 in addition to the existing train split, which contains all the data.

Now you can load the English (or your favorite language) subset in seconds:
from datasets import load_dataset
dataset = load_dataset("omarkamali/wikipedia-monthly", "latest.en", split="10000")
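For readers unfamiliar with reservoir sampling, here is a minimal sketch of the classic single-pass variant (Algorithm R). The post does not say which variant the pipeline uses, so treat this as an illustration of the general technique; reservoir_sample and its parameters are hypothetical names.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Pick k items uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # replacing a uniformly chosen slot.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Stand-in for a stream of articles; nothing is held in memory except the sample.
articles = (f"article-{n}" for n in range(100_000))
subset = reservoir_sample(articles, k=1000, seed=0)
print(len(subset))
```

The appeal for a dump-sized corpus is that the stream is read once and memory stays proportional to the sample size, not to the number of articles.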

Happy data engineering! 🧰

omarkamali/wikipedia-monthly


BramVanroy

posted an update 5 months ago

Post

558

What are currently the best multilingual models with at most 72B parameters? Are Llama 3.3 70B and Qwen 2.5 72B still king?


Zaid

authored a paper 5 months ago

MeXtract: Light-Weight Metadata Extraction from Scientific Papers

Paper • 2510.06889 • Published Oct 8, 2025 • 1

omarkamali

posted an update 6 months ago

Post

1607

**Wikipedia Monthly's September edition is now live 🎉**

Highlights of this edition:
· 🗣️ 341 languages
· 📚 63.1M articles
· 📦 86.5GB of data

This update also solves upload issues in the August edition where some languages had missing parts. Happy data engineering!

omarkamali/wikipedia-monthly


SivilTaram

authored a paper 6 months ago

SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

Paper • 2509.02479 • Published Sep 2, 2025 • 84

alibayram

authored 3 papers 7 months ago

Tokens with Meaning: A Hybrid Tokenization Approach for NLP

Paper • 2508.14292 • Published Aug 19, 2025 • 1

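Doğal Dil İşlemede Tokenizasyon Standartları ve Ölçümü: Türkçe Üzerinden Büyük Dil Modellerinin Karşılaştırmalı Analizi (English: Tokenization Standards and Measurement in Natural Language Processing: A Comparative Analysis of Large Language Models on Turkish)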

Paper • 2508.13058 • Published Aug 18, 2025 • 1

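Büyük Dil Modelleri için TR-MMLU Benchmarkı: Performans Değerlendirmesi, Zorluklar ve İyileştirme Fırsatları (English: The TR-MMLU Benchmark for Large Language Models: Performance Evaluation, Challenges, and Opportunities for Improvement)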

Paper • 2508.13044 • Published Aug 18, 2025 • 1

BramVanroy

posted an update 7 months ago

Post

1024

By popular request, I've just added two subsets to the CommonCrawl-Creative Commons Corpus (C5; BramVanroy/CommonCrawl-CreativeCommons) so that you do not have to do the filtering manually.

- C5f (BramVanroy/CommonCrawl-CreativeCommons-fine): only retains high-quality samples that are also present in FineWeb or FineWeb-2;
- C5r (https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons-recommended): additional strict filtering that removes samples with license disagreement, non-commercial licenses, and Wikipedia samples. Wikipedia samples are removed because you should probably get those from a more reliable source that provides better-parsed content.

It goes without saying that these filters lead to a massive reduction in quantity. Doc and token counts are given on the dataset pages.

SivilTaram

authored 2 papers 8 months ago

SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?

Paper • 2507.12415 • Published Jul 16, 2025 • 43

First Return, Entropy-Eliciting Explore

Paper • 2507.07017 • Published Jul 9, 2025 • 24
