🎨 Thumbnail VLM: Janus-Pro-7B for Thumbnail Generation

A Vision-Language Model fine-tuned for professional thumbnail generation. Accepts flexible multimodal inputs (text, image, or both) and always outputs a thumbnail image.

🎯 Capabilities

| Input Mode | Description | Example |
|---|---|---|
| Text → Thumbnail | Generate a thumbnail from a text description | "Epic gaming video about Minecraft" → 🖼️ |
| Image → Thumbnail | Generate a thumbnail from a reference image | 📷 → 🖼️ |
| Text + Image → Thumbnail | Generate a thumbnail from both | "Make a cooking thumbnail" + 📷 → 🖼️ |

🏗️ Architecture

┌──────────────────────────────────────────────────┐
│              Janus-Pro-7B Architecture           │
├──────────────────────────────────────────────────┤
│                                                  │
│  Input Text ──→ Tokenizer ──→ ┐                  │
│                               ├──→ DeepSeek-LLM  │
│  Input Image ──→ SigLIP ──→   ┘   (7B, 30 layers,│
│                                    4096-dim)     │
│                                                  │
│  DeepSeek-LLM ──→ gen_head ──→ VQ Logits         │
│                   (4096→16384)                   │
│                                                  │
│  VQ Tokens ──→ VQ-16 Decoder ──→ Output Image    │
│                (16384 codebook,  (384×384)       │
│                 576 tokens/img)                  │
└──────────────────────────────────────────────────┘
  • Base Model: deepseek-ai/Janus-Pro-7B (7.4B params)
  • Understanding Encoder: SigLIP-Large (384×384, 576 tokens)
  • Generation Tokenizer: VQ-16 (codebook=16384, 576 discrete tokens per image)
  • Training Method: Full SFT following Janus-4o recipe
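
The figures in the diagram are tied together by simple arithmetic. The following is a minimal sketch, assuming the "16" in VQ-16 denotes a 16× spatial downsampling factor (an assumption, though it is consistent with the 24×24 latent shape passed to `decode_code` in the Quick Start code). The function name is illustrative and not part of the Janus library.

```python
import math

def vq_token_grid(image_size: int, downsample: int) -> tuple[int, int]:
    """Return (grid side, token count) for a square image and a VQ downsampling factor."""
    if image_size % downsample != 0:
        raise ValueError("image_size must be divisible by downsample")
    side = image_size // downsample
    return side, side * side

# 384x384 image, assumed 16x downsampling
side, n_tokens = vq_token_grid(384, 16)

# Each token indexes one of 16384 codebook entries, i.e. log2(16384) bits of choice
bits_per_token = math.log2(16384)

print(side, n_tokens, bits_per_token)  # 24 576 14.0
```

This is why the generation loop runs for 576 steps and why `gen_head` projects the 4096-dimensional hidden state to 16384 logits, one per codebook entry.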

📊 Training Recipe

| Parameter | Value | Source / Notes |
|---|---|---|
| Base model | deepseek-ai/Janus-Pro-7B | Janus-4o paper |
| Learning Rate | 5e-6 | Janus-4o §3.3 |
| Epochs | 3 | Janus-4o §3.3 |
| Effective Batch Size | 16 (micro-batch 1 × 16 gradient accumulation steps) | Adapted from paper's 128 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) | Janus-4o |
| CFG Prompt Masking | 10% | Janus-4o §3.1 |
| Precision | bfloat16 | Model default |
| Image Resolution | 384×384 | Architecture constraint |
| Frozen | SigLIP + VQ Tokenizer | Efficiency |
| Trainable | LLM + gen_head + aligners | ~6.5B params |
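
CFG prompt masking presumably means that a fraction of training samples (10% here) have their prompt replaced by padding, so that the model also learns the unconditional distribution that classifier-free guidance relies on at inference time. The sketch below assumes the masking mirrors the unconditional branch of the Quick Start inference code (keep the first and last token, pad everything in between). The function name and the `pad_id` value are hypothetical, and this is not the actual training script.

```python
import random

def mask_prompt_for_cfg(prompt_ids, pad_id, drop_prob=0.1, rng=random):
    """With probability drop_prob, replace the interior prompt tokens with pad_id."""
    if rng.random() >= drop_prob or len(prompt_ids) <= 2:
        return list(prompt_ids)  # keep the conditional prompt unchanged
    # Keep the first and last token, pad out everything in between
    return [prompt_ids[0]] + [pad_id] * (len(prompt_ids) - 2) + [prompt_ids[-1]]

# Toy token ids; pad_id=0 is a placeholder, not the real tokenizer's pad id
rng = random.Random(0)
prompt = [101, 7, 8, 9, 102]
masked = sum(
    mask_prompt_for_cfg(prompt, pad_id=0, rng=rng) != prompt for _ in range(10_000)
)
print(f"masked fraction: {masked / 10_000:.3f}")
```

Over many samples the masked fraction should settle close to the configured 10%.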

Training Data

| Dataset | Samples | Type |
|---|---|---|
| PosterCraft/Poster100K | 8,000 | Movie/TV posters (T2I) |
| Synthetic thumbnail prompts | 2,000 | YouTube-style prompts (T2I) |
| Total | ~10,000 | |
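
Combined with the recipe above, the dataset size determines the length of a training run. A rough sketch of the arithmetic, assuming exactly 10,000 samples and that a final partial batch would still count as a step (both are assumptions; the actual script may differ):

```python
import math

def optimizer_steps(num_samples, micro_batch, grad_accum, epochs):
    """Return (effective batch size, steps per epoch, total optimizer steps)."""
    effective_batch = micro_batch * grad_accum
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs

# Micro-batch 1, 16 accumulation steps, 3 epochs, as in the recipe table
print(optimizer_steps(10_000, micro_batch=1, grad_accum=16, epochs=3))  # (16, 625, 1875)
```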

🚀 Quick Start

Installation

# Install Janus library
git clone https://github.com/deepseek-ai/Janus.git
cd Janus && pip install -e .

# Install other dependencies
pip install torch transformers Pillow numpy

Text → Thumbnail

import torch
import numpy as np
import PIL.Image
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor

model_path = "asats/thumbnail-vlm-janus-pro"
processor = VLChatProcessor.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
).cuda().eval()

# Generate thumbnail
prompt = "Professional tech review thumbnail: iPhone 16 with dramatic lighting, text 'BEST PHONE 2025'"
conversation = [
    {"role": "<|User|>", "content": prompt},
    {"role": "<|Assistant|>", "content": ""},
]
sft_format = processor.apply_sft_template_for_multi_turn_prompts(
    conversations=conversation, sft_format=processor.sft_format, system_prompt=""
)
prompt_text = sft_format + processor.image_start_tag

with torch.inference_mode():
    input_ids = torch.LongTensor(processor.tokenizer.encode(prompt_text))
    tokens = torch.zeros((2, len(input_ids)), dtype=torch.int).cuda()
    tokens[0] = input_ids  # conditional
    tokens[1] = input_ids  # unconditional copy for classifier-free guidance
    tokens[1, 1:-1] = processor.pad_id  # pad out the prompt, keeping the first and last token
    
    inputs_embeds = model.language_model.get_input_embeddings()(tokens)
    generated = torch.zeros((1, 576), dtype=torch.int).cuda()
    
    past_kv = None
    for t in range(576):
        outputs = model.language_model.model(inputs_embeds=inputs_embeds, use_cache=True, past_key_values=past_kv)
        past_kv = outputs.past_key_values
        logits = model.gen_head(outputs.last_hidden_state[:, -1, :])
        guided = logits[1:2] + 5.0 * (logits[0:1] - logits[1:2])
        next_tok = torch.multinomial(torch.softmax(guided, -1), 1)
        generated[:, t] = next_tok.squeeze(-1)
        img_emb = model.prepare_gen_img_embeds(torch.cat([next_tok, next_tok], 0).squeeze(-1))
        inputs_embeds = img_emb.unsqueeze(1)
    
    dec = model.gen_vision_model.decode_code(generated, shape=[1, 8, 24, 24])
    img = np.clip((dec.float().cpu().numpy().transpose(0,2,3,1) + 1) / 2 * 255, 0, 255).astype(np.uint8)
    PIL.Image.fromarray(img[0]).save("thumbnail.png")
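
The line that computes `guided` applies classifier-free guidance: it starts from the unconditional logits and moves them toward the conditional logits, scaled by a guidance weight (5.0 in the code above). Here is a toy sketch of that single step in plain Python, using made-up logits over a four-entry vocabulary. The helper names are illustrative only.

```python
import math

def cfg_logits(cond, uncond, weight):
    """Classifier-free guidance: uncond + weight * (cond - uncond), element-wise."""
    return [u + weight * (c - u) for c, u in zip(cond, uncond)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

cond = [2.0, 1.0, 0.0, -1.0]    # toy logits given the prompt
uncond = [1.0, 1.0, 1.0, 1.0]   # toy logits with the prompt padded out

guided = cfg_logits(cond, uncond, 5.0)
print(guided)  # [6.0, 1.0, -4.0, -9.0]

# Guidance sharpens the distribution toward tokens the prompt favors
print(softmax(cond)[0] < softmax(guided)[0])  # True
```

A weight of 1.0 reproduces the conditional logits and a weight of 0.0 reproduces the unconditional ones. Larger weights follow the prompt more strongly, usually at some cost in diversity.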

Image → Thumbnail

# Captions the input image with the model's understanding branch, then generates a thumbnail from that caption
python scripts/inference_janus.py --mode image --input_image photo.jpg

Text + Image → Thumbnail

# Uses both text instruction and reference image
python scripts/inference_janus.py --mode both \
    --prompt "Create a cooking video thumbnail with text 'EASY RECIPE'" \
    --input_image food_photo.jpg
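
The three input modes differ only in which inputs are present. Below is a hypothetical sketch of how a script might pick the mode from its arguments. It is not taken from `scripts/inference_janus.py`. The function name is an assumption, and so is the mode name `"text"` (only `image` and `both` appear in the commands above).

```python
from typing import Optional

def select_mode(prompt: Optional[str], input_image: Optional[str]) -> str:
    """Pick an input mode from whichever inputs were supplied."""
    has_text = bool(prompt and prompt.strip())
    has_image = bool(input_image)
    if has_text and has_image:
        return "both"
    if has_image:
        return "image"
    if has_text:
        return "text"
    raise ValueError("Provide a prompt, an input image, or both")

print(select_mode("Create a cooking video thumbnail", "food_photo.jpg"))  # both
print(select_mode(None, "photo.jpg"))                                     # image
```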

🔧 Training from Scratch

Option 1: HuggingFace Jobs (Recommended)

# Launch via HF Jobs API
from huggingface_hub import HfApi
api = HfApi()

# Requires: a100-large hardware, 8h timeout
# Dependencies: torch, transformers, datasets, Pillow, numpy, tqdm, 
#               trackio, accelerate, janus @ git+https://github.com/deepseek-ai/Janus.git

Option 2: Local Training

# Clone repo and install
git clone https://github.com/deepseek-ai/Janus.git && cd Janus && pip install -e .
pip install torch transformers datasets Pillow numpy tqdm trackio accelerate

# Run training from the root of this repository (needs ~40GB VRAM, A100 recommended)
python scripts/run_training.py

Option 3: OmniGen LoRA Alternative (Lower VRAM)

For a lighter approach using OmniGen-v1 (3.8B params, LoRA fine-tuning on a single 24GB GPU):

pip install OmniGen accelerate peft
accelerate launch scripts/train_omnigen.py \
    --model_name_or_path Shitao/OmniGen-v1 \
    --json_file train.jsonl \
    --image_path ./images \
    --use_lora --lora_rank 8 \
    --lr 1e-3 --epochs 3
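
LoRA keeps the pretrained weights frozen and trains two small low-rank matrices per adapted layer, which is why it fits on a smaller GPU. For a weight matrix of shape d_out × d_in, rank r adds r · (d_in + d_out) trainable parameters. Here is a sketch of that count with rank 8, as in the command above. The 3072×3072 layer shape is purely illustrative and is not taken from OmniGen's configuration.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter (A: rank x d_in, B: d_out x rank)."""
    return rank * (d_in + d_out)

# Hypothetical square projection layer
full = 3072 * 3072
extra = lora_params(3072, 3072, rank=8)

print(extra, f"{extra / full:.2%}")  # 49152 0.52%
```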

📁 Repository Structure

├── README.md                    # This file
├── scripts/
│   ├── run_training.py          # End-to-end training pipeline (data prep + train + eval)
│   ├── inference_janus.py       # Inference for all 3 input modes
│   ├── train_janus.py           # Modular Janus training script
│   ├── train_omnigen.py         # Alternative OmniGen LoRA training
│   └── prepare_data.py          # Data preparation utilities

📈 Training Data Sources

| Dataset | Size | Content | Format |
|---|---|---|---|
| PosterCraft/Poster100K | 93K | Movie/TV posters | image + rich caption |
| ShareGPT-4o-Image | 91K | GPT-4o synthetic pairs | prompt + image |
| CSU-JPG/TextAtlas5M | 5M+ | Text-in-image data | image + annotation |
| fantasyfish/laion-art | 20K | High-aesthetic images | image + text |

⚖️ License

MIT (code) + DeepSeek Model License (model weights)
