In a Training Loop 🔄

Asankhaya Sharma

codelion
hugging-science

AI & ML interests

Creator of OptiLLM, OpenEvolve, Adaptive Classifier, and Ellora. Pioneering a new category in AI infrastructure: inference-time compute for LLMs.

Recent Activity

reacted to their post with 🤗, 👀, and 🚀 1 day ago
Scaling Pedagogical Pre-training to 10 Billion Tokens

New blog post exploring what happens when you take optimal data mixing insights and scale up the data generation itself.

We built Sutra, a multi-stage framework for generating pedagogical pre-training data guided by a knowledge graph of ~2,000 concepts across 9 domains. The pipeline includes structured content generation, six-dimension quality evaluation, diversity management across 20 content styles, and a cleaning stage to prevent collapse. The result is https://huggingface.co/datasets/codelion/sutra-10B, a 10.2-billion-token pedagogical dataset with rich metadata (domain, complexity, prerequisites, quality scores) on every entry.

We trained https://huggingface.co/codelion/SmolLM2-70M on it for 3 full epochs (30.6B tokens) on a single A10 GPU in ~78 hours. Key finding: perplexity kept improving across epochs, but benchmark gains plateaued fast. At 70M parameters, the model hits a representational ceiling that more data alone can't break through.

Full writeup with comparisons against 7 other datasets, detailed benchmark breakdowns, and connections to recent work on synthetic data scaling, curriculum learning, and data mixing laws: https://huggingface.co/blog/codelion/scaling-pedagogical-pretraining-10-billion-tokens

All datasets at multiple scales (10M, 100M, 1B, 10B), plus seed concepts and an SFT variant, are in the Sutra Pedagogical Datasets collection.
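To get a feel for the dataset, a minimal sketch of streaming sutra-10B and filtering on its metadata might look like the following; the column names used below ("domain", "quality_score", "text") are assumptions inferred from the post, not confirmed schema.

```python
# A minimal sketch of loading sutra-10B and peeking at its metadata.
# Column names ("domain", "quality_score", "text") are assumptions, not confirmed schema.
from datasets import load_dataset

# Stream to avoid downloading the full 10.2B-token dataset up front.
ds = load_dataset("codelion/sutra-10B", split="train", streaming=True)

# Inspect the first record to see the real field names.
first = next(iter(ds))
print(sorted(first.keys()))

# Hypothetical filter: keep only high-quality entries from one domain.
subset = ds.filter(
    lambda ex: ex.get("domain") == "mathematics"
    and ex.get("quality_score", 0.0) >= 0.8
)
for example in subset.take(3):
    print(example["text"][:200])
```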
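For orientation, a from-scratch pre-training run in the spirit of the one described above could be sketched with the Hugging Face Trainer as below; the base config, tokenizer, "text" column name, and every hyperparameter are illustrative stand-ins, not the author's actual Sutra training setup.

```python
# A rough sketch of a 3-epoch from-scratch pre-training run on sutra-10B.
# Config, tokenizer, column names, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Stand-in tokenizer/config; the post's model is a custom ~70M-parameter SmolLM2.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
config = AutoConfig.from_pretrained("HuggingFaceTB/SmolLM2-135M")
model = AutoModelForCausalLM.from_config(config)

# For a quick trial, one of the smaller Sutra scales in the collection
# is a more practical starting point than the full 10B split.
ds = load_dataset("codelion/sutra-10B", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = ds.map(tokenize, batched=True, remove_columns=ds.column_names)

args = TrainingArguments(
    output_dir="smollm2-sutra",
    num_train_epochs=3,               # matches the 3 full epochs in the post
    per_device_train_batch_size=8,    # illustrative; tune for a single A10
    gradient_accumulation_steps=8,
    learning_rate=3e-4,
    bf16=True,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```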

Organizations

meraGPT, Lambda Security, National University of Singapore, Patched, ZeroGPU Explorers, MLX Community, Social Post Explorers, Hugging Face Discord Community, Dria, Adaptive Classifier, Reasoning Datasets Competition, Cerebras Hugging Face Hackathon, LeRobot Worldwide Hackathon, Hugging Face MCP Course, Agents-MCP-Hackathon, Hugging Science, MCP-1st-Birthday, Algorithmic SuperIntelligence Labs, Unsloth Jobs Explorers