Guilherme Penedo

guipenedo
https://guipenedo.com
  • gui_penedo
  • guipenedo

AI & ML interests

None yet

Organizations

  • Hugging Face
  • BigScience Data
  • HuggingFaceBR4
  • Hugging Face H4
  • Hugging Face Extreme-Scale
  • Hugging Face Smol Models Research
  • Hugging Face Smol Cluster
  • Nanotron Research
  • FineData
  • Data Is Better Together
  • mlo-data-cleaning
  • Dev Mode Explorers
  • HuggingFaceFW-Dev
  • Hugging Face Science
  • ml-fw-prerelease
  • Open R1
  • OpenEvals
  • todo
  • carbon

authored 2 papers 7 months ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26, 2025 • 75

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published Jun 5, 2025 • 59
authored a paper 11 months ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4, 2025 • 253
authored a paper 12 months ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published Jan 14, 2025 • 62
authored a paper over 1 year ago

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Paper • 2406.17557 • Published Jun 25, 2024 • 99
authored a paper about 2 years ago

The Falcon Series of Open Language Models

Paper • 2311.16867 • Published Nov 28, 2023 • 14
authored a paper over 2 years ago

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

Paper • 2306.01116 • Published Jun 1, 2023 • 41