AI & ML interests

A one-year-long research workshop on large language models: the Summer of Language Models 21 🌸

Recent Activity

christopher in bigscience/petals-api about 1 month ago
discussion "Bloom" (#2), opened about 1 month ago by Raz-Test
nouamanetazi posted an update about 2 months ago:
After training 𝐒𝐦𝐨𝐥𝐋𝐌𝟑 on 𝟑𝟖𝟒 𝐇𝟏𝟎𝟎𝐬 for nearly a month, I've come to realize something most people overlook: 𝐢𝐧𝐟𝐫𝐚𝐬𝐭𝐫𝐮𝐜𝐭𝐮𝐫𝐞 𝐢𝐬 𝐭𝐡𝐞 𝐦𝐚𝐤𝐞-𝐨𝐫-𝐛𝐫𝐞𝐚𝐤 𝐟𝐚𝐜𝐭𝐨𝐫 𝐢𝐧 𝐋𝐋𝐌 𝐭𝐫𝐚𝐢𝐧𝐢𝐧𝐠. 🔥

Everyone talks about model architecture and data quality. And yes, those matter immensely. But here's what nobody tells you: when your training run fails at 2 AM because of mysterious 𝐍𝐂𝐂𝐋 𝐞𝐫𝐫𝐨𝐫𝐬, or when your expensive GPU cluster is running at 𝟔𝟎% 𝐞𝐟𝐟𝐢𝐜𝐢𝐞𝐧𝐜𝐲, the problem isn't your model. It's most probably a 𝐦𝐢𝐬𝐮𝐬𝐞 𝐨𝐟 𝐭𝐡𝐞 𝐡𝐚𝐫𝐝𝐰𝐚𝐫𝐞. 🛠️

Questions that seemed simple but had no clear answers: Why is 𝐌𝐨𝐄 𝐭𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐬𝐥𝐨𝐰𝐞𝐫 𝐭𝐡𝐚𝐧 𝐝𝐞𝐧𝐬𝐞 𝐦𝐨𝐝𝐞𝐥𝐬? Which 𝐍𝐂𝐂𝐋 𝐟𝐥𝐚𝐠𝐬 should we actually set? How often should we checkpoint without killing throughput?
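To make the NCCL question concrete, here is a minimal sketch of how such flags are typically set as environment variables before a PyTorch distributed run; the variables are real NCCL/PyTorch settings, but the chosen values and interface name are illustrative assumptions, not the playbook's recommendations.

```python
# Minimal sketch, assuming PyTorch with the NCCL backend, launched via torchrun.
# The variables below exist in NCCL / PyTorch, but these values and the
# interface name are illustrative assumptions, not recommended settings.
import os

import torch.distributed as dist

# Surface NCCL's own logging so failures are traceable instead of "mysterious".
os.environ.setdefault("NCCL_DEBUG", "WARN")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")

# Pin the network interface NCCL should use (hypothetical name, cluster-specific).
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")

# Let PyTorch surface asynchronous NCCL errors rather than hanging the job.
os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")

dist.init_process_group(backend="nccl")
```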

That's why we built 𝐓𝐡𝐞 𝐒𝐦𝐨𝐥 𝐓𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐏𝐥𝐚𝐲𝐛𝐨𝐨𝐤 📖: a complete guide covering everything from model architecture and data curation to the SmolLM3 training marathon, post-training techniques, and crucially, the 𝐢𝐧𝐟𝐫𝐚𝐬𝐭𝐫𝐮𝐜𝐭𝐮𝐫𝐞 𝐥𝐚𝐲𝐞𝐫 that most teams get wrong.

We validated real vs theoretical bandwidth across the entire stack: 𝐇𝐁𝐌𝟑 𝐡𝐢𝐭𝐭𝐢𝐧𝐠 𝟑 𝐓𝐁/𝐬, 𝐍𝐕𝐋𝐢𝐧𝐤 𝟒.𝟎 𝐫𝐞𝐚𝐜𝐡𝐢𝐧𝐠 𝟕𝟖𝟔 𝐆𝐁/𝐬, 𝐏𝐂𝐈𝐞 𝐆𝐞𝐧𝟒 𝐚𝐭 𝟏𝟒.𝟐 𝐆𝐁/𝐬. Then we ran collective operations across 𝟏𝟐𝟖 𝐆𝐏𝐔𝐬 (16 nodes, 8xH100s each) and measured how performance degrades at scale: all-reduce drops from 𝟒𝟖𝟎 𝐆𝐁/𝐬 on a single node to 𝟑𝟐𝟎-𝟑𝟓𝟎 𝐆𝐁/𝐬 across 16 nodes.
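Numbers like the all-reduce figures above can be sanity-checked with a short script; the sketch below assumes a torchrun + NCCL launch and uses the standard ring bus-bandwidth formula, with illustrative message sizes rather than the exact benchmark behind the playbook.

```python
# Minimal all-reduce bandwidth check, assuming a torchrun + NCCL launch, e.g.
#   torchrun --nproc_per_node=8 allreduce_bench.py
# Message sizes and iteration counts are illustrative assumptions.
import os
import time

import torch
import torch.distributed as dist


def allreduce_busbw(num_bytes: int, iters: int = 20, warmup: int = 5) -> float:
    """Measure all-reduce bus bandwidth (GB/s) for one message size."""
    world_size = dist.get_world_size()
    tensor = torch.empty(num_bytes // 4, dtype=torch.float32, device="cuda")

    for _ in range(warmup):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters

    # Ring all-reduce moves 2 * (n - 1) / n of the payload per link, which is
    # the "bus bandwidth" convention used by NCCL's own benchmarks.
    return (num_bytes * 2 * (world_size - 1) / world_size) / elapsed / 1e9


if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    for size_mib in (64, 256, 1024):
        bw = allreduce_busbw(size_mib * 1024 * 1024)
        if dist.get_rank() == 0:
            print(f"all-reduce {size_mib} MiB: {bw:.1f} GB/s bus bandwidth")
    dist.destroy_process_group()
```

Running the same script on a single node and then across nodes exposes the kind of degradation quoted above, since the bus-bandwidth convention keeps the two numbers directly comparable.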

If you've ever wondered why your training runs are slower than they should be, or you're planning to scale up and want to avoid expensive mistakes, this guide might save you weeks of debugging.

𝐓𝐡𝐞 𝐒𝐦𝐨𝐥 𝐓𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐏𝐥𝐚𝐲𝐛𝐨𝐨𝐤: https://lnkd.in/e5MKXUHS

Shared with ❤️ by the HuggingFace team
BramVanroy posted an update 3 months ago:
What are currently the best multilingual models with at most 72B parameters? Are Llama 3.3 70B and Qwen 2.5 72B still king?
giadap posted an update 3 months ago:
🌎 AI ethics and sustainability are two sides of the same coin.

In our new blog post with Dr. Sasha Luccioni, we argue that separating them (as is too often the case) means missing the bigger picture of how AI systems impact both people and the planet.

Ethical and sustainable AI development can’t be pursued in isolation. The same choices that affect who benefits or is harmed by AI systems also determine how much energy and resources they consume.

We explore how two key concepts, evaluation and transparency, can serve as bridges between these domains:

📊 Evaluation, by moving beyond accuracy or performance metrics to include environmental and social costs, as we’ve done with tools like the AI Energy Score.

🔍 Transparency, by enabling reproducibility, accountability, and environmental reporting through open tools like the Environmental Transparency Space.

AI systems mirror our priorities. If we separate ethics from sustainability, we risk building technologies that are efficient but unjust, or fair but unsustainable.

Read our blog post here: https://huggingface.co/blog/sasha/ethics-sustainability

AIEnergyScore/Leaderboard
sasha/environmental-transparency
christopher posted an update 3 months ago:
Something very cool is cooking at Lichess.
giadap posted an update 3 months ago:
One of the hardest challenges in AI safety is finding the right balance: how do we protect people from harm without undermining their agency? This tension is especially visible in conversational systems, where safeguards can sometimes feel more paternalistic than supportive.

In my latest piece for Hugging Face, I argue that open source and community-driven approaches offer a promising (though not exclusive) way forward.

✨ Transparency can make safety mechanisms into learning opportunities.
✨ Collaboration with diverse communities makes safeguards more relevant across contexts.
✨ Iteration in the open lets protections evolve rather than freeze into rigid, one-size-fits-all rules.

Of course, this isn’t a silver bullet. Top-down safety measures will still be necessary in some cases. But if we only rely on corporate control, we risk building systems that are safe at the expense of trust and autonomy.

Read the blog post here: https://huggingface.co/blog/giadap/preserving-agency
giadap posted an update 4 months ago:
I've noticed something. While we're careful about what we post on social media, we're sharing our deepest and most intimate thoughts with AI chatbots -- health concerns, financial worries, relationship issues, business ideas...

With OpenAI hinting at ChatGPT advertising, this matters more than ever. Unlike banner ads, AI advertising happens within the conversation itself. Sponsors could subtly influence that relationship advice or financial guidance.

The good news? We have options.
🤝 Open source AI models let us keep conversations private, avoid surveillance-based business models, and build systems that actually serve users first.

Read more about it in our latest blog post, co-written with @frimelle: https://huggingface.co/blog/giadap/privacy-conversational-ai