🎨 Thumbnail VLM — Janus-Pro-7B for Thumbnail Generation

A Vision-Language Model fine-tuned for professional thumbnail generation. Accepts flexible multimodal inputs (text, image, or both) and always outputs a thumbnail image.

🎯 Capabilities

Input Mode	Description	Example
Text → Thumbnail	Generate thumbnail from text description	`"Epic gaming video about Minecraft"` → 🖼️
Image → Thumbnail	Generate thumbnail from reference image	📷 → 🖼️
Text + Image → Thumbnail	Generate thumbnail from both	`"Make a cooking thumbnail"` + 📷 → 🖼️
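
The three input modes reduce to a check on which inputs are present. A minimal sketch, assuming a hypothetical helper `resolve_mode` that is not part of this repository; the values `image` and `both` match the `--mode` flags used later in this README, while `text` is an assumed name for the text-only mode:

```python
from typing import Optional


def resolve_mode(prompt: Optional[str], image_path: Optional[str]) -> str:
    """Pick an input mode from whichever inputs were supplied (hypothetical helper)."""
    has_text = bool(prompt and prompt.strip())
    has_image = bool(image_path)
    if has_text and has_image:
        return "both"
    if has_image:
        return "image"
    if has_text:
        return "text"  # assumed name; this README only shows "image" and "both"
    raise ValueError("Provide a text prompt, a reference image, or both.")
```

Every branch that returns leads to the same output type, a thumbnail image; only the conditioning differs.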

🏗️ Architecture

┌─────────────────────────────────────────────────┐
│              Janus-Pro-7B Architecture           │
├─────────────────────────────────────────────────┤
│                                                  │
│  Input Text ──→ Tokenizer ──→ ┐                 │
│                                ├──→ DeepSeek-LLM │
│  Input Image ──→ SigLIP ──→  ┘    (7B, 30 layers│
│                                     4096-dim)    │
│                                                  │
│  DeepSeek-LLM ──→ gen_head ──→ VQ Logits        │
│                    (4096→16384)                   │
│                                                  │
│  VQ Tokens ──→ VQ-16 Decoder ──→ Output Image   │
│                (16384 codebook,   (384×384)      │
│                 576 tokens/img)                   │
└─────────────────────────────────────────────────┘

Base Model: deepseek-ai/Janus-Pro-7B (7.4B params)
Understanding Encoder: SigLIP-Large (384×384, 576 tokens)
Generation Tokenizer: VQ-16 (codebook=16384, 576 discrete tokens per image)
Training Method: Full supervised fine-tuning (SFT) following the Janus-4o recipe
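
The token counts above follow directly from the image size. A quick arithmetic check, assuming the "16" in VQ-16 is the spatial downsampling factor, so each generated token covers a 16×16 pixel patch:

```python
IMAGE_SIZE = 384       # pixels per side, from the spec above
PATCH_SIZE = 16        # assumed: the "16" in VQ-16 is the downsampling factor
CODEBOOK_SIZE = 16384  # number of discrete VQ codes

grid_side = IMAGE_SIZE // PATCH_SIZE             # tokens per side
tokens_per_image = grid_side * grid_side         # tokens per generated image
bits_per_token = CODEBOOK_SIZE.bit_length() - 1  # bits needed to index one code

print(grid_side, tokens_per_image, bits_per_token)  # 24 576 14
```

The 24×24 grid is the same shape that the Quick Start code passes to `decode_code`.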

📊 Training Recipe

Parameter	Value	Source / Notes
Base model	`deepseek-ai/Janus-Pro-7B`	Janus-4o paper
Learning Rate	5e-6	Janus-4o §3.3
Epochs	3	Janus-4o §3.3
Effective Batch Size	16 (micro-batch 1 × 16 gradient-accumulation steps)	Adapted from the paper's 128
Optimizer	AdamW (β₁=0.9, β₂=0.95)	Janus-4o
CFG Prompt Masking	10%	Janus-4o §3.1
Precision	bfloat16	Model default
Image Resolution	384×384	Architecture constraint
Frozen	SigLIP + VQ Tokenizer	Efficiency
Trainable	LLM + gen_head + aligners	~6.5B params
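
The recipe above can be sketched in a few lines. This is an illustration under assumptions, not the repository's training script: the attribute names `vision_model` (SigLIP) and `gen_vision_model` (VQ tokenizer) are assumed from the Janus codebase, and the masking rule mirrors the padded unconditional prompt used at inference time rather than a confirmed training detail.

```python
import random

# Assumed attribute names on the Janus model: `vision_model` is the SigLIP
# encoder and `gen_vision_model` is the VQ tokenizer.
FROZEN_PREFIXES = ("vision_model.", "gen_vision_model.")


def is_trainable(param_name: str) -> bool:
    """True for parameters outside the frozen SigLIP and VQ modules."""
    return not param_name.startswith(FROZEN_PREFIXES)


def maybe_mask_prompt(prompt_ids, pad_id, drop_prob=0.10, rng=random):
    """With probability drop_prob, pad out the prompt (keeping the first and
    last token) so the model also learns the unconditional branch used by CFG."""
    if len(prompt_ids) > 2 and rng.random() < drop_prob:
        return [prompt_ids[0]] + [pad_id] * (len(prompt_ids) - 2) + [prompt_ids[-1]]
    return list(prompt_ids)


def build_optimizer(model):
    """Freeze the two vision modules and return AdamW over everything else."""
    import torch  # imported lazily so the helpers above have no dependencies

    for name, param in model.named_parameters():
        param.requires_grad = is_trainable(name)
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=5e-6, betas=(0.9, 0.95))
```

Freezing by name prefix keeps the rule in one place, which makes it easy to verify which parameters end up trainable before a long run starts.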

Training Data

Dataset	Samples	Type
PosterCraft/Poster100K	8,000	Movie/TV posters (T2I)
Synthetic thumbnail prompts	2,000	YouTube-style prompts (T2I)
Total	~10,000
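
Combining the dataset size with the batch settings from the Training Recipe gives the length of a run. A small worked calculation, assuming a single GPU and that a partial final batch still triggers an optimizer step:

```python
import math

num_samples = 10_000   # approximate total from the table above
micro_batch_size = 1
grad_accum_steps = 16
epochs = 3

effective_batch = micro_batch_size * grad_accum_steps       # samples per optimizer step
steps_per_epoch = math.ceil(num_samples / effective_batch)  # optimizer steps per epoch
total_steps = steps_per_epoch * epochs                      # optimizer steps for the full run

print(effective_batch, steps_per_epoch, total_steps)  # 16 625 1875
```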

🚀 Quick Start

Installation

# Install Janus library
git clone https://github.com/deepseek-ai/Janus.git
pip install -e ./Janus

# Install other dependencies
pip install torch transformers Pillow numpy

Text → Thumbnail

import torch
import numpy as np
import PIL.Image
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor

model_path = "asats/thumbnail-vlm-janus-pro"
processor = VLChatProcessor.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
).cuda().eval()

# Generate thumbnail
prompt = "Professional tech review thumbnail: iPhone 16 with dramatic lighting, text 'BEST PHONE 2025'"
conversation = [
    {"role": "<|User|>", "content": prompt},
    {"role": "<|Assistant|>", "content": ""},
]
sft_format = processor.apply_sft_template_for_multi_turn_prompts(
    conversations=conversation, sft_format=processor.sft_format, system_prompt=""
)
prompt_text = sft_format + processor.image_start_tag

cfg_weight = 5.0        # classifier-free guidance (CFG) strength
num_image_tokens = 576  # 24×24 grid of VQ tokens for one 384×384 image

with torch.inference_mode():
    input_ids = torch.LongTensor(processor.tokenizer.encode(prompt_text))
    # Row 0 holds the conditional prompt; row 1 holds the unconditional prompt for CFG.
    tokens = torch.zeros((2, len(input_ids)), dtype=torch.int).cuda()
    tokens[0] = input_ids
    tokens[1] = input_ids
    tokens[1, 1:-1] = processor.pad_id  # pad everything except the first and last token

    inputs_embeds = model.language_model.get_input_embeddings()(tokens)
    generated = torch.zeros((1, num_image_tokens), dtype=torch.int).cuda()

    past_kv = None
    for t in range(num_image_tokens):
        outputs = model.language_model.model(inputs_embeds=inputs_embeds, use_cache=True, past_key_values=past_kv)
        past_kv = outputs.past_key_values
        logits = model.gen_head(outputs.last_hidden_state[:, -1, :])
        # Push the unconditional logits toward the conditional ones.
        guided = logits[1:2] + cfg_weight * (logits[0:1] - logits[1:2])
        next_tok = torch.multinomial(torch.softmax(guided, -1), 1)
        generated[:, t] = next_tok.squeeze(-1)
        # Feed the sampled token back to both branches as an image embedding.
        img_emb = model.prepare_gen_img_embeds(torch.cat([next_tok, next_tok], 0).squeeze(-1))
        inputs_embeds = img_emb.unsqueeze(1)

    # Decode the 24×24 token grid to pixels, then map [-1, 1] to [0, 255].
    dec = model.gen_vision_model.decode_code(generated, shape=[1, 8, 24, 24])
    img = np.clip((dec.float().cpu().numpy().transpose(0, 2, 3, 1) + 1) / 2 * 255, 0, 255).astype(np.uint8)
    PIL.Image.fromarray(img[0]).save("thumbnail.png")
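
The guidance step inside the loop combines the two branches as `uncond + w * (cond - uncond)`, with `w = 5.0` in the code above. A dependency-free sketch of that one line on toy logits (the numbers are made up for illustration):

```python
import math


def cfg_logits(cond, uncond, weight):
    """Classifier-free guidance: move the unconditional logits toward the conditional ones."""
    return [u + weight * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


cond = [2.0, 1.0, 0.0]    # toy logits computed with the prompt
uncond = [1.0, 1.0, 1.0]  # toy logits computed with the padded prompt

print(cfg_logits(cond, uncond, 1.0))  # [2.0, 1.0, 0.0]: weight 1 returns the conditional logits
print(cfg_logits(cond, uncond, 5.0))  # [6.0, 1.0, -4.0]: weight 5 exaggerates the difference
```

A larger weight concentrates probability on tokens the prompt favors, which tends to improve prompt adherence at the cost of sample diversity.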

Image → Thumbnail

# Captions the reference image with the model's understanding path, then generates a thumbnail from that caption
python scripts/inference_janus.py --mode image --input_image photo.jpg
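
The contents of `scripts/inference_janus.py` are not shown here, so the following is only a guess at what the captioning stage might look like. It follows the image-understanding usage from the Janus repository; the helper names and the default instruction text are hypothetical.

```python
def build_caption_conversation(image_path, instruction="Describe this image in detail for a video thumbnail."):
    """Conversation in the format the Janus chat processor expects for image understanding."""
    return [
        {"role": "<|User|>", "content": f"<image_placeholder>\n{instruction}", "images": [image_path]},
        {"role": "<|Assistant|>", "content": ""},
    ]


def caption_image(model, processor, image_path, max_new_tokens=128):
    """Stage 1 (hypothetical helper): caption the reference image with the understanding path."""
    from janus.utils.io import load_pil_images  # lazy import: needs the Janus package

    conversation = build_caption_conversation(image_path)
    pil_images = load_pil_images(conversation)
    inputs = processor(conversations=conversation, images=pil_images, force_batchify=True).to(model.device)
    inputs_embeds = model.prepare_inputs_embeds(**inputs)
    tokenizer = processor.tokenizer
    outputs = model.language_model.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=inputs.attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        use_cache=True,
    )
    return tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
```

Stage 2 would then pass the returned caption as `prompt` to the Text → Thumbnail code above.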

Text + Image → Thumbnail

# Uses both text instruction and reference image
python scripts/inference_janus.py --mode both \
    --prompt "Create a cooking video thumbnail with text 'EASY RECIPE'" \
    --input_image food_photo.jpg

🔧 Training from Scratch

Option 1: HuggingFace Jobs (Recommended)

# Launch via HF Jobs API
from huggingface_hub import HfApi
api = HfApi()

# Requires: a100-large hardware, 8h timeout
# Dependencies: torch, transformers, datasets, Pillow, numpy, tqdm, 
#               trackio, accelerate, janus @ git+https://github.com/deepseek-ai/Janus.git

Option 2: Local Training

# Clone and install the Janus library (the working directory stays at this repo's root)
git clone https://github.com/deepseek-ai/Janus.git && pip install -e ./Janus
pip install torch transformers datasets Pillow numpy tqdm trackio accelerate

# Run training (needs ~40 GB VRAM; an A100 is recommended)
python scripts/run_training.py

Option 3: Alternative — OmniGen LoRA (Lower VRAM)

For a lighter approach, use OmniGen-v1 (3.8B params) with LoRA fine-tuning on a single 24 GB GPU:

pip install OmniGen accelerate peft
accelerate launch scripts/train_omnigen.py \
    --model_name_or_path Shitao/OmniGen-v1 \
    --json_file train.jsonl \
    --image_path ./images \
    --use_lora --lora_rank 8 \
    --lr 1e-3 --epochs 3
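
The command above reads its examples from `train.jsonl`. A minimal sketch of writing that file, assuming the field names `instruction`, `input_images`, and `output_image` from OmniGen's fine-tuning data format; the helper `make_record` and the image file names are hypothetical, so check the format that `train_omnigen.py` actually expects before relying on this:

```python
import json


def make_record(instruction, output_image, input_images=()):
    """One training example; field names are assumed from OmniGen's fine-tuning data format."""
    return {
        "instruction": instruction,
        "input_images": list(input_images),  # empty for pure text-to-image examples
        "output_image": output_image,        # assumed to be relative to --image_path
    }


records = [
    make_record("YouTube thumbnail for a cooking video, bold text 'EASY RECIPE'", "cooking_001.png"),
    make_record("Thumbnail for a tech review with dramatic lighting", "tech_002.png"),
]

# JSON Lines: one JSON object per line.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```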

📁 Repository Structure

├── README.md                    # This file
├── scripts/
│   ├── run_training.py          # End-to-end training pipeline (data prep + train + eval)
│   ├── inference_janus.py       # Inference for all 3 input modes
│   ├── train_janus.py           # Modular Janus training script
│   ├── train_omnigen.py         # Alternative OmniGen LoRA training
│   └── prepare_data.py          # Data preparation utilities

📈 Training Data Sources

Dataset	Size	Content	Format
PosterCraft/Poster100K	93K	Movie/TV posters	image + rich caption
ShareGPT-4o-Image	91K	GPT-4o synthetic pairs	prompt + image
CSU-JPG/TextAtlas5M	5M+	Text-in-image data	image + annotation
fantasyfish/laion-art	20K	High-aesthetic images	image + text