AI & ML interests

Open science and open source

lunarflu
posted an update about 1 month ago
💸🤑You don’t need 100 GPUs to train something amazing!

Our Smol Training Playbook teaches you a better path to world-class LLMs, for free!

Check out the #1 trending space on 🤗:
HuggingFaceTB/smol-training-playbook
lunarflu 
posted an update 3 months ago
Cool stuff from the past few weeks on Hugging Face! 🤗🚀
• 📈Trackio, a local-first W&B alternative (see the sketch after this list)
https://github.com/gradio-app/trackio/issues
• 🌍EmbeddingGemma, 300M-param, multilingual embeddings, on-device
https://huggingface.co/blog/embeddinggemma
• 💻Open LLMs in VS Code (Inference Providers)
https://x.com/reach_vb/status/1966185427582497171
• 🤖Smol2Operator GUI agents
https://huggingface.co/blog/smol2operator
• 🖼️Gradio visible watermarking
https://huggingface.co/blog/watermarking-with-gradio
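
For context, here is a minimal sketch of Trackio's drop-in, wandb-style API for local experiment tracking (the project and metric names are illustrative, not from the post):

import trackio as wandb  # Trackio mirrors the wandb interface

wandb.init(project="demo-run", config={"lr": 3e-4, "epochs": 3})
for step in range(100):
    wandb.log({"loss": 1.0 / (step + 1)})  # metrics are stored locally, viewable in a Gradio dashboard
wandb.finish()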
ehristoforu 
posted an update 3 months ago
🚀Hello from the Project Fluently team!

✨ We are happy to share our new general-purpose LLMs based on Qwen3 1.7B and 4B: powerful, multilingual, and ready for a wide range of tasks!

🛠️ We further trained the base models and carefully merged the results to achieve even better performance and maximize their potential.

🆓 And most importantly, the models are completely open and free under the Apache-2.0 license!

🔗 Links to repositories:
- FluentlyQwen3-4B: fluently/FluentlyQwen3-4B
- FluentlyQwen3-1.7B: fluently/FluentlyQwen3-1.7B

😍 We will be very glad to hear your feedback and impressions! Your opinion is very important to us!
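
Below is a minimal sketch of loading one of the models with transformers (the prompt and generation settings are illustrative assumptions, not the authors' recipe):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "fluently/FluentlyQwen3-4B"  # repo id from the links above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Explain model merging in one sentence."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))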
ehristoforu 
posted an update 10 months ago
Introducing our first standalone model – FluentlyLM Prinum

Introducing the first standalone model from Project Fluently LM! We worked on it for several months, tried different approaches, and eventually found the optimal one.

General characteristics:
- Model type: Causal language model (QwenForCausalLM, transformer LM)
- Number of parameters: 32.5B
- Number of parameters (non-embedding): 31.0B
- Number of layers: 64
- Context: 131,072 tokens
- Language(s) (NLP): English, French, Spanish, Russian, Chinese, Japanese, Persian (officially supported)
- License: MIT

Creation strategy:
The basis of the strategy is shown in Pic. 2.
We used Axolotl & Unsloth for SFT fine-tuning with PEFT LoRA (rank=64, alpha=64) and Mergekit for SLERP and TIES merges.
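
As a rough illustration, here is a minimal peft LoRA configuration matching the rank and alpha above (the base checkpoint, dropout, and target modules are assumptions, not the exact training recipe):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B")  # assumed base checkpoint
lora_cfg = LoraConfig(
    r=64,               # LoRA rank, as stated above
    lora_alpha=64,      # LoRA alpha, as stated above
    lora_dropout=0.05,  # assumption; not specified in the post
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # common choice for Qwen-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()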

Evaluation:
🏆 12th place on the Open LLM Leaderboard (open-llm-leaderboard/open_llm_leaderboard) (21.02.2025)

Detailed results and comparisons are presented in Pic. 3.

Links:
- Model: https://huggingface.co/fluently-lm/FluentlyLM-Prinum
- GGUF version: mradermacher/FluentlyLM-Prinum-GGUF
- Demo on ZeroGPU: ehristoforu/FluentlyLM-Prinum-demo
umarigan 
posted an update 11 months ago
**Extracting Reasoning Prompts with DeepSeek-R1: A Step Towards Better AI Reasoning**

Hi everyone! 👋

I’m excited to share a small but impactful project I’ve been working on, where I extracted **reasoning prompts** using the **DeepSeek-R1 model**. Reasoning prompts are a powerful way to understand how AI models arrive at their answers, and they can be used to train smaller, more efficient models to generate reasoning. Let me walk you through the process and explain why this is important.

---

#### **The Code: Extracting Reasoning Prompts**

Here’s the code I used to extract reasoning prompts from the openaccess-ai-collective/oo-gpt4-filtered dataset:

from tqdm import tqdm
import time
from datasets import load_dataset
from openai import OpenAI

# DeepSeek-R1 is served through an OpenAI-compatible API
client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

# Source dataset of system prompts and questions (split assumed to be "train")
ds = load_dataset("openaccess-ai-collective/oo-gpt4-filtered", split="train")

reasoning_data = []

for example in tqdm(ds, desc="answering"):
    try:
        response = client.chat.completions.create(
            model='deepseek-reasoner',  # Using DeepSeek-R1 for reasoning
            messages=[
                {"role": "system", "content": example['system_prompt']},
                {"role": "user", "content": example['question']},
            ],
            stream=False,
            max_tokens=4096,
            temperature=0.7,
        )

        answer = response.choices[0].message.content  # final answer
        reasoning = response.choices[0].message.reasoning_content  # reasoning trace

        reasoning_example = {
            "id": example['id'],
            "question": example['question'],
            'answer': answer,
            'reasoning': reasoning,
        }

        reasoning_data.append(reasoning_example)
    except Exception as e:
        print(f"Error processing example: {e}")
        time.sleep(3)  # Wait for 3 seconds before continuing
        continue  # Skip the current example and move to the next one

Dataset: umarigan/deepseek-r1-reasoning-prompts
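
To reuse the extracted traces, the resulting dataset can be loaded straight from the Hub (a minimal sketch; the split name is assumed to be "train"):

from datasets import load_dataset

reasoning_ds = load_dataset("umarigan/deepseek-r1-reasoning-prompts", split="train")
print(reasoning_ds[0]["question"])
print(reasoning_ds[0]["reasoning"][:200])  # first part of the extracted reasoning trace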
ehristoforu 
posted an update 12 months ago
✒️ Ultraset - all-in-one dataset for SFT training in Alpaca format.
fluently-sets/ultraset

❓ Ultraset is a comprehensive dataset for training Large Language Models (LLMs) with SFT (supervised fine-tuning). It consists of over 785 thousand entries in eight languages: English, Russian, French, Italian, Spanish, German, Chinese, and Korean.

🤯 Ultraset solves the problem of choosing an appropriate dataset for LLM training: it combines the various types of data needed to strengthen a model's skills in areas such as text writing and editing, mathematics, coding, biology, medicine, finance, and multilingualism.

🤗 For best results, use only the "instruction", "input", and "output" columns and train for 1-3 epochs (see the sketch below). The dataset does not include DPO or Instruct-style data, making it suitable for training many kinds of LLMs.
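
A minimal sketch of turning Ultraset rows into Alpaca-style prompts for SFT, using only the recommended columns (the split name and prompt template are assumptions, not part of the dataset card):

from datasets import load_dataset

ds = load_dataset("fluently-sets/ultraset", split="train")

def to_alpaca_text(row):
    # Build the classic Alpaca prompt, with or without an input block
    if row["input"]:
        prompt = (f"### Instruction:\n{row['instruction']}\n\n"
                  f"### Input:\n{row['input']}\n\n### Response:\n")
    else:
        prompt = f"### Instruction:\n{row['instruction']}\n\n### Response:\n"
    return {"text": prompt + row["output"]}

sft_ds = ds.map(to_alpaca_text, remove_columns=ds.column_names)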

❇️ Ultraset is an excellent tool to improve your language model's skills in diverse knowledge areas.
lunarflu 
posted an update about 1 year ago