How to use asats/thumbnail-vlm-janus-pro with Transformers:
# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("asats/thumbnail-vlm-janus-pro", dtype="auto")

A Vision-Language Model fine-tuned for professional thumbnail generation. It accepts flexible multimodal inputs (text, image, or both) and always outputs a thumbnail image.

| Input Mode | Description | Example |
|---|---|---|
| Text → Thumbnail | Generate thumbnail from text description | "Epic gaming video about Minecraft" → 🖼️ |
| Image → Thumbnail | Generate thumbnail from reference image | 📷 → 🖼️ |
| Text + Image → Thumbnail | Generate thumbnail from both | "Make a cooking thumbnail" + 📷 → 🖼️ |
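
The three modes in the table come down to checking which inputs are present. A minimal sketch, assuming a hypothetical helper `resolve_input_mode` that is not part of this repository; the mode names `image` and `both` follow the `--mode` values used with `scripts/inference_janus.py`, while `text` is an assumed name for the text-only mode:

```python
from typing import Optional


def resolve_input_mode(prompt: Optional[str], image_path: Optional[str]) -> str:
    """Return which of the three input modes applies (hypothetical helper)."""
    has_text = bool(prompt and prompt.strip())
    has_image = bool(image_path)
    if has_text and has_image:
        return "both"
    if has_text:
        return "text"
    if has_image:
        return "image"
    raise ValueError("Provide a text prompt, a reference image, or both.")
```

Rejecting the empty case up front keeps the generation code free of a fourth, undefined mode.
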
┌─────────────────────────────────────────────────┐
│            Janus-Pro-7B Architecture            │
├─────────────────────────────────────────────────┤
│                                                 │
│  Input Text ──→ Tokenizer ──→ ┐                 │
│                               ├──→ DeepSeek-LLM │
│  Input Image ──→ SigLIP ──→   ┘  (7B, 30 layers,│
│                                   4096-dim)     │
│                                                 │
│  DeepSeek-LLM ──→ gen_head ──→ VQ Logits        │
│                  (4096→16384)                   │
│                                                 │
│  VQ Tokens ──→ VQ-16 Decoder ──→ Output Image   │
│            (16384 codebook,      (384×384)      │
│             576 tokens/img)                     │
└─────────────────────────────────────────────────┘
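
The figures in the diagram are consistent with each other: a 384×384 image divided by a downsampling factor of 16 gives a 24×24 grid, which is 576 tokens per image. A small sketch of that arithmetic, assuming "VQ-16" denotes a 16× spatial downsampling factor (the function name is illustrative):

```python
def image_token_grid(image_size: int = 384, downsample: int = 16) -> tuple[int, int]:
    """Return (tokens per side, tokens per image) for a square image."""
    if image_size % downsample != 0:
        raise ValueError("image_size must be divisible by the downsampling factor")
    side = image_size // downsample
    return side, side * side


side, total = image_token_grid()
print(side, total)  # 24 576
```

Each of the 576 positions holds one index into the 16384-entry codebook, which is why `gen_head` projects from 4096 dimensions to 16384 logits.
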
| Parameter | Value | Source |
|---|---|---|
| Base model | deepseek-ai/Janus-Pro-7B | Janus-4o paper |
| Learning Rate | 5e-6 | Janus-4o §3.3 |
| Epochs | 3 | Janus-4o §3.3 |
| Effective Batch Size | 16 (1×16 grad accum) | Adapted from paper's 128 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) | Janus-4o |
| CFG Prompt Masking | 10% | Janus-4o §3.1 |
| Precision | bfloat16 | Model default |
| Image Resolution | 384×384 | Architecture constraint |
| Frozen | SigLIP + VQ Tokenizer | Efficiency |
| Trainable | LLM + gen_head + aligners | ~6.5B params |
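
The "CFG Prompt Masking" row means that for roughly 10% of training samples the text prompt is blanked out, so the model also learns the unconditional distribution that classifier-free guidance needs at inference time. The training script's own implementation is not shown here; the sketch below is one common way to do it, assuming the prompt body is replaced with a pad token while the first and last tokens are kept (`mask_prompt_for_cfg` is a hypothetical name):

```python
import random
from typing import List, Optional


def mask_prompt_for_cfg(
    prompt_ids: List[int],
    pad_id: int,
    drop_prob: float = 0.10,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """With probability drop_prob, replace the prompt body with pad tokens.

    The first and last tokens are kept, mirroring an unconditional prompt
    of the form [first, pad, pad, ..., last].
    """
    rng = rng or random
    if len(prompt_ids) > 2 and rng.random() < drop_prob:
        return [prompt_ids[0]] + [pad_id] * (len(prompt_ids) - 2) + [prompt_ids[-1]]
    return list(prompt_ids)
```

Passing a seeded `random.Random` makes the masking reproducible across runs.
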
| Dataset | Samples | Type |
|---|---|---|
| PosterCraft/Poster100K | 8,000 | Movie/TV posters (T2I) |
| Synthetic thumbnail prompts | 2,000 | YouTube-style prompts (T2I) |
| Total | ~10,000 | |
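
Combining the dataset size with the batch settings gives a rough count of optimizer updates for the run. A back-of-the-envelope sketch, assuming a single GPU and exactly 10,000 samples (the table gives an approximate total):

```python
import math


def optimizer_steps(num_samples: int, per_device_batch: int, grad_accum: int, epochs: int) -> int:
    """Total optimizer updates, counting a final partial batch in each epoch."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return steps_per_epoch * epochs


print(optimizer_steps(10_000, per_device_batch=1, grad_accum=16, epochs=3))  # 1875
```
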
# Install Janus library
git clone https://github.com/deepseek-ai/Janus.git
cd Janus && pip install -e .
# Install other dependencies
pip install torch transformers Pillow numpy
import torch
import numpy as np
import PIL.Image
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
model_path = "asats/thumbnail-vlm-janus-pro"
processor = VLChatProcessor.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
).cuda().eval()
# Generate thumbnail
prompt = "Professional tech review thumbnail: iPhone 16 with dramatic lighting, text 'BEST PHONE 2025'"
conversation = [
    {"role": "<|User|>", "content": prompt},
    {"role": "<|Assistant|>", "content": ""},
]
sft_format = processor.apply_sft_template_for_multi_turn_prompts(
    conversations=conversation, sft_format=processor.sft_format, system_prompt=""
)
prompt_text = sft_format + processor.image_start_tag
with torch.inference_mode():
    input_ids = torch.LongTensor(processor.tokenizer.encode(prompt_text))
    # Row 0 is the conditional prompt; row 1 is the unconditional prompt (body replaced by padding).
    tokens = torch.zeros((2, len(input_ids)), dtype=torch.int).cuda()
    tokens[0] = input_ids
    tokens[1] = input_ids
    tokens[1, 1:-1] = processor.pad_id
    inputs_embeds = model.language_model.get_input_embeddings()(tokens)
    generated = torch.zeros((1, 576), dtype=torch.int).cuda()
    past_kv = None
    # Sample the 576 image tokens (24×24 grid) one at a time, reusing the KV cache.
    for t in range(576):
        outputs = model.language_model.model(inputs_embeds=inputs_embeds, use_cache=True, past_key_values=past_kv)
        past_kv = outputs.past_key_values
        logits = model.gen_head(outputs.last_hidden_state[:, -1, :])
        # Classifier-free guidance with weight 5.0.
        guided = logits[1:2] + 5.0 * (logits[0:1] - logits[1:2])
        next_tok = torch.multinomial(torch.softmax(guided, -1), 1)
        generated[:, t] = next_tok.squeeze(-1)
        # Feed the sampled token back to both the conditional and unconditional rows.
        img_emb = model.prepare_gen_img_embeds(torch.cat([next_tok, next_tok], 0).squeeze(-1))
        inputs_embeds = img_emb.unsqueeze(1)
    # Decode the token grid to pixels in [-1, 1].
    dec = model.gen_vision_model.decode_code(generated, shape=[1, 8, 24, 24])
# Rescale to [0, 255], convert to HWC uint8, and save.
img = np.clip((dec.float().cpu().numpy().transpose(0,2,3,1) + 1) / 2 * 255, 0, 255).astype(np.uint8)
PIL.Image.fromarray(img[0]).save("thumbnail.png")
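
The `guided = ...` line in the loop above is classifier-free guidance: it starts from the unconditional logits and moves them toward the conditional ones by a weight of 5.0. A toy illustration of the same formula on plain Python lists (the helper name `apply_cfg` and the three-entry "vocabulary" are made up for the example):

```python
from typing import List


def apply_cfg(cond: List[float], uncond: List[float], weight: float) -> List[float]:
    """Classifier-free guidance: uncond + weight * (cond - uncond), per logit."""
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + weight * (c - u) for c, u in zip(cond, uncond)]


# Toy 3-entry "vocabulary"; the real gen_head emits 16384 logits per step.
cond = [2.0, 1.0, 0.0]
uncond = [1.0, 1.0, 1.0]
print(apply_cfg(cond, uncond, weight=5.0))  # [6.0, 1.0, -4.0]
```

A weight of 1.0 returns the conditional logits unchanged and 0.0 returns the unconditional ones; larger weights exaggerate whatever the prompt favours, typically trading diversity for prompt adherence.
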
# Uses model's understanding to caption, then generates
python scripts/inference_janus.py --mode image --input_image photo.jpg
# Uses both text instruction and reference image
python scripts/inference_janus.py --mode both \
--prompt "Create a cooking video thumbnail with text 'EASY RECIPE'" \
--input_image food_photo.jpg
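
The argument handling inside `scripts/inference_janus.py` is not reproduced in this README. As a rough sketch of how the `--mode`, `--prompt`, and `--input_image` flags used above could be parsed and validated, assuming a `text` mode exists alongside `image` and `both` and is the default (both are assumptions):

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse flags for a hypothetical inference entry point."""
    parser = argparse.ArgumentParser(description="Thumbnail generation (sketch)")
    parser.add_argument("--mode", choices=["text", "image", "both"], default="text")
    parser.add_argument("--prompt", default=None, help="Text instruction")
    parser.add_argument("--input_image", default=None, help="Path to a reference image")
    args = parser.parse_args(argv)
    # Each mode needs its own inputs; fail early with a readable message.
    if args.mode in ("text", "both") and not args.prompt:
        parser.error("--prompt is required for --mode text and --mode both")
    if args.mode in ("image", "both") and not args.input_image:
        parser.error("--input_image is required for --mode image and --mode both")
    return args
```

Validating before the model is loaded avoids waiting on a 7B checkpoint only to fail on a missing flag.
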
# Launch via HF Jobs API
from huggingface_hub import HfApi
api = HfApi()
# Requires: a100-large hardware, 8h timeout
# Dependencies: torch, transformers, datasets, Pillow, numpy, tqdm,
# trackio, accelerate, janus @ git+https://github.com/deepseek-ai/Janus.git
# Clone repo and install
git clone https://github.com/deepseek-ai/Janus.git && cd Janus && pip install -e .
pip install torch transformers datasets Pillow numpy tqdm trackio accelerate
# Run training (needs ~40GB VRAM, A100 recommended)
python run_training.py
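
The training configuration freezes SigLIP and the VQ tokenizer and trains the LLM, `gen_head`, and the aligners. One way to express that split is by parameter-name prefix. The sketch below is not taken from `run_training.py`; `gen_vision_model` appears in the inference code above, while `vision_model` is an assumed attribute name for the SigLIP encoder and should be checked against `model.named_parameters()`:

```python
from typing import Iterable, List, Tuple

# Assumed top-level submodule names; verify against model.named_parameters().
FROZEN_PREFIXES: Tuple[str, ...] = ("vision_model.", "gen_vision_model.")


def split_trainable(param_names: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Partition parameter names into (trainable, frozen) by prefix."""
    trainable, frozen = [], []
    for name in param_names:
        (frozen if name.startswith(FROZEN_PREFIXES) else trainable).append(name)
    return trainable, frozen


# With a real model, the same rule would be applied roughly as:
#   for name, param in model.named_parameters():
#       param.requires_grad = not name.startswith(FROZEN_PREFIXES)
```

Printing the two lists before training is a cheap check that the trainable set matches the "~6.5B params" figure in the configuration table.
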
For a lighter approach, use OmniGen-v1 (3.8B parameters) with LoRA fine-tuning on a single 24GB GPU:
pip install OmniGen accelerate peft
accelerate launch train_omnigen.py \
--model_name_or_path Shitao/OmniGen-v1 \
--json_file train.jsonl \
--image_path ./images \
--use_lora --lora_rank 8 \
--lr 1e-3 --epochs 3
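
To see why `--lora_rank 8` fits on a smaller GPU: LoRA leaves each adapted weight matrix frozen and learns two thin matrices whose product is added to it, so the trainable parameter count per matrix is `rank * (d_in + d_out)`. A worked example with an illustrative layer width (3072 is a placeholder, not a confirmed OmniGen dimension):

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Extra trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns W + B @ A with A of shape (rank, d_in) and B of shape (d_out, rank).
    """
    return rank * (d_in + d_out)


# Illustrative square layer; 3072 is a placeholder width, not a confirmed OmniGen value.
full = 3072 * 3072
added = lora_param_count(3072, 3072, rank=8)
print(added, f"{added / full:.2%}")  # 49152 0.52%
```
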
├── README.md                  # This file
├── scripts/
│   ├── run_training.py        # End-to-end training pipeline (data prep + train + eval)
│   ├── inference_janus.py     # Inference for all 3 input modes
│   ├── train_janus.py         # Modular Janus training script
│   ├── train_omnigen.py       # Alternative OmniGen LoRA training
│   └── prepare_data.py        # Data preparation utilities
| Dataset | Size | Content | Format |
|---|---|---|---|
| PosterCraft/Poster100K | 93K | Movie/TV posters | image + rich caption |
| ShareGPT-4o-Image | 91K | GPT-4o synthetic pairs | prompt + image |
| CSU-JPG/TextAtlas5M | 5M+ | Text-in-image data | image + annotation |
| fantasyfish/laion-art | 20K | High-aesthetic images | image + text |
MIT (code) + DeepSeek Model License (model weights)