nyuuzyou
good work!
The pipeline adapts to the source. It begins by collecting target URLs from sitemaps or APIs into a text file, which doubles as a progress tracker. I then fetch the content concurrently: Go with 50 to 200 goroutines handles large scrapes, while Python's ThreadPoolExecutor works for smaller jobs. This stage requires retry logic, rate limiting, and checkpoint files to resume interrupted downloads.

The custom work happens during parsing, since every site structures its data differently. I extract the target data using BeautifulSoup or goquery for HTML and standard parsers for APIs, then filter the output to drop binaries, validate UTF-8, and skip generated files using tools like go-enry. The clean data gets written to an intermediate JSONL format, appending under a file lock for thread safety.

Finally, I convert the JSONL files to Parquet using DuckDB, PyArrow, or parquet-go, compressed with Zstandard at level 19, using 10K to 100K row groups and 512 MB to 1 GB shards. Go handles the high-throughput scraping, Python manages the custom parsing, and DuckDB takes care of the format conversions.
Dataset: ajibawa-2023/Python-Code-Large
Python-Code-Large is a large-scale corpus of Python source code comprising more than 2 million rows. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the Python ecosystem.
By providing a high-volume, language-specific corpus, Python-Code-Large enables systematic experimentation in Python-focused model training, domain adaptation, and downstream code understanding tasks.
Python-Code-Large addresses the need for a dedicated Python-only dataset at substantial scale, enabling focused research across data science, backend systems, automation, scientific computing, and AI-driven Python environments.
Thanks! Since the datasets vary so much in size and format, I write custom parsing and processing pipelines for almost every single one.
934,191 image records index Eastern Europe and Northern Asia. Temporal links map historical views at identical coordinates across nine years.
Key Stats:
- 905,940 unique images
- Coverage spanning 2016 to 2025
- Average 14.3 historical links per location
Geographic bounds span 20.49° E to 152.32° E. Urban centers show higher data density.
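The per-location link counts can be derived by grouping records on coordinates: every image at a given coordinate links to the other images taken there. A toy sketch with a hypothetical schema (`id`, `lat`, `lon`, `year` are illustrative field names, not the dataset's actual columns):

```python
from collections import defaultdict

# Hypothetical records: each image has an id, coordinates, and a capture year.
records = [
    {"id": 1, "lat": 55.75, "lon": 37.62, "year": 2016},
    {"id": 2, "lat": 55.75, "lon": 37.62, "year": 2020},
    {"id": 3, "lat": 55.75, "lon": 37.62, "year": 2024},
    {"id": 4, "lat": 43.11, "lon": 131.87, "year": 2019},
]

def temporal_links(records):
    """Link every image to the other images captured at the same coordinates."""
    by_coord = defaultdict(list)
    for r in records:
        by_coord[(r["lat"], r["lon"])].append(r["id"])
    return {i: [j for j in ids if j != i]
            for ids in by_coord.values() for i in ids}

links = temporal_links(records)
avg = sum(len(v) for v in links.values()) / len(links)
```

For this toy data the first three images link to each other and the fourth has no links, giving an average of 1.5 links per image; the dataset-wide figure of 14.3 would come out of the same kind of aggregation at scale.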
In short, the students won, and they did it by fine-tuning LFM2, a foundation model built by Liquid AI, the $2 billion startup out of MIT.
ajibawa-2023/JavaScript-Code-Large
JavaScript-Code-Large is a large-scale corpus of JavaScript source code comprising around 5 million files. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the JavaScript ecosystem.
By providing a high-volume, language-specific corpus, JavaScript-Code-Large enables systematic experimentation in JavaScript-focused model training, domain adaptation, and downstream code understanding tasks.
JavaScript-Code-Large addresses the need for a dedicated JavaScript-only dataset at substantial scale, enabling focused research across frontend, backend, and full-stack JavaScript environments.
nyuuzyou/casino-benchmark
14 models faced 1,400 simulations of heads-up Blackjack and European Roulette. Shared seeds locked every model into identical cards and spins.
Key Stats:
- 14 models benchmarked
- 59,483 rows
- 35 MB compressed Parquet
- 35,000 scored decisions
- Full prompts, JSON responses, reasoning traces, latency
- Bankroll tracking from $1,000 start per run
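The shared-seed setup can be illustrated with a toy Blackjack shoe; this is a sketch of the reproducibility idea, not the benchmark's actual dealing code. Seeding a per-run RNG makes the shuffle deterministic, so every model evaluated on the same seed sees the same card order and bankroll differences reflect decisions, not luck.

```python
import random

def deal_shoe(seed, decks=6):
    """Deterministically shuffle a multi-deck Blackjack shoe from a seed."""
    ranks = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
    suits = ["S", "H", "D", "C"]
    shoe = [r + s for _ in range(decks) for r in ranks for s in suits]
    # A dedicated random.Random(seed) keeps the shuffle independent of
    # any other RNG use elsewhere in the harness.
    random.Random(seed).shuffle(shoe)
    return shoe

# Two models evaluated on the same seed face the exact same card order.
assert deal_shoe(seed=42) == deal_shoe(seed=42)
assert deal_shoe(seed=42) != deal_shoe(seed=43)
```

The same trick applied to a roulette wheel (one seeded RNG per simulation) locks the spin sequence as well.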
Live leaderboard tracks bets, hits, stands, and risk management.
Gemini 3 Flash leads at +$3,396; Claude 4.5 Haiku trails at -$7,788.
Traces in the dataset. Leaderboard in the space.