ml-fw-prerelease

community
Activity Feed

omarkamali
posted an update 4 days ago
You're probably training on outdated Wikipedia data right now and don't know it. 💡

In June last year, a friend from the Moroccan Wikipedia community slid into my DMs: "Are you using the current version? The official dataset is severely outdated. We added so many articles nowhere to be found on HuggingFace."

He was right. I was running a 2023 snapshot. In 2025. The official Wikipedia dataset, the one hundreds of labs and researchers grab by default without a second thought, was frozen in time.
• For English, that's 700,000 missing articles.
• For Moroccan Arabic, 30% of the language's entire Wikipedia.
• For 31 other languages, there was literally no text corpus at all until recently.

I could've shrugged and moved on. Instead I spent the next months building a monthly automated pipeline for 340+ languages, on my personal laptop, nearly killing it several times in the process (100% disk, frozen screen, the works).

Nous Research trained Hermes 4 on it. INRIA cited it. It's now three years ahead of what most people are training on.

Here's the full story of how I built Wikipedia Monthly 👇

https://omarkamali.com/blog/wikipedia-monthly-pipeline

omarkamali
posted an update 2 months ago
New year, new dataset 🚀

I just released omarkamali/wikipedia-labels, with all the structural labels and namespaces from Wikipedia in 300+ languages. A gift for the data preprocessors and cleaners among us.

Happy new year 2026 everyone! 🎆

omarkamali
posted an update 3 months ago
Picomon v0.2.0 released! 💫

- Supports all of AMD, Nvidia and Apple Silicon 🧑‍🧑‍🧒‍🧒
- Beautiful TUI with themes (who said monitoring should be boring?) 💅
- Shareable Rig Cards! Boast to friends, family and foes alike 🫨

Get it now! Run uvx picomon, or pip install picomon and then picomon.

omarkamali
posted an update 3 months ago
Hello picomon! AMD GPU Monitoring made easy

Just run uvx picomon and behold:
┌──────────────────────────────────────────┐  ┌──────────────────────────────────────────┐
│ GPU 0  GFX  42%  UMC  21%                │  │ GPU 1  GFX  78%  UMC  66%                │
│ PWR 135/250W (54%)  VRAM 10.0/16.0GB 62% │  │ PWR 210/250W (84%)  VRAM 14.5/16.0GB 90% │
│                                          │  │                                          │
│ GFX ▁▂▂▃▄▄▅▆▆▇█▇▆▅▄▃▂▁                   │  │ GFX ▂▃▄▅▆▇██▇▆▅▄▂▂▃▅▆                    │
│ PWR ▁▁▂▂▃▄▄▅▆▇██▇▆▅▄▂▁                   │  │ PWR ▂▂▃▄▅▆▇██▇▆▅▄▃▂▂▃                    │
│ VRM ▁▁▂▂▃▄▄▅▆▇███▇▆▅▄▂                   │  │ VRM ▂▃▄▅▆▆▇███▇▆▅▄▃▂▂                    │
└──────────────────────────────────────────┘  └──────────────────────────────────────────┘
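
For readers curious how history rows like the ones in the panels above can be drawn, here is a minimal sketch of a text sparkline. It is not taken from picomon's source; the function name sparkline and the 0-100 default range are assumptions made for illustration.

```python
# Eight Unicode block characters, from lowest to highest bar.
BLOCKS = "▁▂▃▄▅▆▇█"


def sparkline(values, lo=0.0, hi=100.0):
    """Map each value in [lo, hi] to one of eight block characters.

    Hypothetical helper for illustration; picomon may render differently.
    """
    span = hi - lo
    chars = []
    for v in values:
        clamped = min(max(v, lo), hi)  # keep out-of-range samples on the scale
        # Round to the nearest of the eight levels (indices 0..7).
        idx = int((clamped - lo) / span * (len(BLOCKS) - 1) + 0.5)
        chars.append(BLOCKS[idx])
    return "".join(chars)


if __name__ == "__main__":
    # Example: a short series of utilization percentages.
    print("GFX", sparkline([5, 20, 35, 50, 65, 80, 95, 80, 50, 20]))
```

Each sample becomes one character, so a fixed-width panel simply keeps the most recent N samples.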


Repo at https://github.com/omarkamali/picomon
Or pypi at https://pypi.org/project/picomon

omarkamali
posted an update 3 months ago
Exciting updates to the Wikipedia Monthly dataset for November! 🚀

・ Fixed a bug to remove infobox leftovers and other wiki markers such as __TOC__
・ New Python package https://pypi.org/project/wikisets: a dataset builder with efficient sampling so you can seamlessly combine the languages you want for any date (ideal for pretraining data but works for any purpose)
・ Moved the pipeline to a large server. Much higher costs, but with better reliability and predictability (let me know if you'd like to sponsor this!).
・ Dataset sizes are unfortunately missing for this month due to shenanigans with the migration, but should be back in December's update.
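
As an illustration of the kind of cleanup the first item describes, here is a minimal sketch that strips MediaWiki double-underscore markers such as __TOC__ from extracted text. It is not the pipeline's actual code: the name strip_behavior_switches is hypothetical, and the pattern only covers upper-case ASCII markers, so localized variants would need extra handling.

```python
import re

# Markers like __TOC__, __NOTOC__ or __NOEDITSECTION__: two underscores,
# upper-case ASCII letters, two underscores. Assumption: localized or
# mixed-case variants are out of scope for this sketch.
BEHAVIOR_SWITCH = re.compile(r"__[A-Z]+__")


def strip_behavior_switches(text: str) -> str:
    """Remove double-underscore markers and trailing whitespace they leave."""
    cleaned = BEHAVIOR_SWITCH.sub("", text)
    # Tidy each line without touching paragraph breaks.
    return "\n".join(line.rstrip() for line in cleaned.splitlines())


if __name__ == "__main__":
    raw = "__TOC__\nIntro text __NOEDITSECTION__\nMore text"
    print(strip_behavior_switches(raw))
```

Lower-case identifiers such as snake__case__name are left alone because the pattern only matches upper-case letters between the underscores.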

Check out the dataset:
omarkamali/wikipedia-monthly

nouamanetazi
posted an update 4 months ago
After training SmolLM3 on 384 H100s for nearly a month, I've come to realize something most people overlook: infrastructure is the make-or-break factor in LLM training. 🔥

Everyone talks about model architecture and data quality. And yes, those matter immensely. But here's what nobody tells you: when your training run fails at 2 AM because of mysterious NCCL errors, or when your expensive GPU cluster is running at 60% efficiency, the problem isn't your model. It's most probably a misuse of the hardware. 🛠️

Questions that seemed simple but had no clear answers: Why is MoE training slower than dense models? Which NCCL flags should we actually set? How often should we checkpoint without killing throughput?

That's why we built The Smol Training Playbook 📖: a complete guide covering everything from model architecture and data curation to the SmolLM3 training marathon, post-training techniques, and crucially, the infrastructure layer that most teams get wrong.

We validated real vs theoretical bandwidth across the entire stack: HBM3 hitting 3 TB/s, NVLink 4.0 reaching 786 GB/s, PCIe Gen4 at 14.2 GB/s. Then we ran collective operations across 128 GPUs (16 nodes, 8xH100s each) and measured how performance degrades at scale: all-reduce drops from 480 GB/s on a single node to 320-350 GB/s across 16 nodes.
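
To put the all-reduce figures in relative terms, here is a small sketch that only does arithmetic on the numbers quoted above (480 GB/s on one node, 320-350 GB/s across 16 nodes). The helper name relative_drop is ours and does not come from the playbook.

```python
def relative_drop(baseline_gbps: float, measured_gbps: float) -> float:
    """Fraction of the baseline bandwidth that is lost at the measured rate."""
    return (baseline_gbps - measured_gbps) / baseline_gbps


# Figures quoted in the post.
SINGLE_NODE_GBPS = 480.0
MULTI_NODE_LOW_GBPS = 320.0
MULTI_NODE_HIGH_GBPS = 350.0

worst = relative_drop(SINGLE_NODE_GBPS, MULTI_NODE_LOW_GBPS)
best = relative_drop(SINGLE_NODE_GBPS, MULTI_NODE_HIGH_GBPS)

if __name__ == "__main__":
    print(f"all-reduce bandwidth lost at 16 nodes: {best:.0%} to {worst:.0%}")
```

In other words, going from one node to 16 costs roughly a quarter to a third of the single-node all-reduce bandwidth.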

If you've ever wondered why your training runs are slower than they should be, or you're planning to scale up and want to avoid expensive mistakes, this guide might save you weeks of debugging.

๐“๐ก๐ž ๐’๐ฆ๐จ๐ฅ ๐“๐ซ๐š๐ข๐ง๐ข๐ง๐  ๐๐ฅ๐š๐ฒ๐›๐จ๐จ๐ค: https://lnkd.in/e5MKXUHS

Shared with โค๏ธ by the HuggingFace team

lbourdois
posted an update 5 months ago

omarkamali
posted an update 5 months ago
Another month, another Wikipedia Monthly release! 🎃

Highlights of October's edition:
· 🗣️ 341 languages
· 📚 64.7M articles (+2.5%)
· 📦 89.4GB of data (+3.3%)

We now sample a random subset of each language with reservoir sampling, producing splits named 1000, 5000, and 10000 alongside the existing train split, which contains all the data.

Now you can load the English (or your favorite language) subset in seconds:
from datasets import load_dataset
dataset = load_dataset("omarkamali/wikipedia-monthly", "latest.en", split="10000")
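
For context on the reservoir sampling mentioned above, here is a minimal sketch of the classic single-pass version (often called Algorithm R), which draws k items uniformly from a stream without knowing its length in advance. This is an illustration under that assumption, not the pipeline's actual implementation; the function name and the seed parameter are ours.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1) by picking a
            # slot in [0, i]; only slots below k are inside the reservoir.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Draw 10 article ids out of a million without holding them all in memory.
    print(reservoir_sample(range(1_000_000), 10, seed=42))
```

Memory use stays proportional to k, which is what makes this approach practical for very large dumps processed in one pass.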

Happy data engineering! 🧰

omarkamali/wikipedia-monthly

BramVanroy
posted an update 5 months ago
What are currently the best multilingual models with at most 72B parameters? Are Llama 3.3 70B and Qwen 2.5 72B still king?

omarkamali
posted an update 6 months ago
**Wikipedia Monthly's September edition is now live 🎉**

Highlights of this edition:
· 🗣️ 341 languages
· 📚 63.1M articles
· 📦 86.5GB of data

This update also solves upload issues in the August edition where some languages had missing parts. Happy data engineering!

omarkamali/wikipedia-monthly

BramVanroy
posted an update 7 months ago
By popular request, I've just added two subsets to the CommonCrawl-Creative Commons Corpus (C5; BramVanroy/CommonCrawl-CreativeCommons) so that you do not have to do the filtering manually:

- C5f (BramVanroy/CommonCrawl-CreativeCommons-fine): only retains high-quality samples that are also present in FineWeb or FineWeb-2;
- C5r (https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons-recommended): additional strict filtering that removes samples with license disagreement, non-commercial licenses, and Wikipedia samples. The latter are removed because you should probably get those from a more reliable source that provides better-parsed content.

It goes without saying that these filters lead to a massive reduction in quantity. Doc and token counts are given on the dataset pages.