---
license: apache-2.0
language:
- en
tags:
- robotics
- vla
- lerobot
- imitation-learning
- diffusion-policy
- gemma-3
- siglip
- scaledp
- multimodal
---
# Gemma-Le: SigLIP + Gemma 3 + ScaleDP (LeRobot VLA Policy)

Gemma-Le is a compact Vision-Language-Action (VLA) policy for robotic manipulation, built on top of LeRobot.
It replaces the NVIDIA Eagle backbone with standard Hugging Face components:

- SigLIP `google/siglip-so400m-patch14-384` for vision
- Gemma 3 `google/gemma-3-4b-it` for language/reasoning (with LoRA PEFT)
- ScaleDP (Scalable Diffusion Transformer) as the action head

This repo hosts exported checkpoints trained on LeRobot-format datasets (e.g., `robot_sim.PickNPlace`).
## Architecture

- Vision: SigLIP ViT encoder (384px, patch14), pooled embedding
- Text: Gemma 3 4B-IT, mean-pooled hidden states
- LoRA: rank=16 on `[q_proj, k_proj, v_proj, o_proj]`
- Fusion: MLP projects [vision || text] -> `conditioning_dim=768`
- Action head: ScaleDP Transformer (layers=12, d_model=320, heads=8, ff=1280) predicts diffusion noise
- Temporal context: `chunk_size=8`; diffusion steps `num_diffusion_steps=50`
- Mixed precision: AMP auto-selects bf16/fp16; bf16 uses no GradScaler
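
To make "predicts diffusion noise" concrete, here is a minimal pure-Python sketch of how a noise-prediction training target could be built for one action chunk. It is not the repository's implementation: the cosine noise schedule and `ACTION_DIM = 7` are assumptions for illustration (the card names neither), and only `chunk_size` and `num_diffusion_steps` come from the list above.

```python
import math
import random

NUM_DIFFUSION_STEPS = 50  # num_diffusion_steps from the card
CHUNK_SIZE = 8            # chunk_size from the card
ACTION_DIM = 7            # hypothetical: the card does not state the action dimension


def cosine_alpha_bars(num_steps, s=0.008):
    """Cumulative signal fractions for a cosine schedule (assumed, not from the card)."""
    def f(u):
        return math.cos((u + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return [f(t / num_steps) / f(0.0) for t in range(1, num_steps + 1)]


def add_noise(actions, t, alpha_bars, rng):
    """Return (noisy_actions, noise); the noise is what the action head would regress."""
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    noise = [[rng.gauss(0.0, 1.0) for _ in row] for row in actions]
    noisy = [
        [signal_scale * a + noise_scale * e for a, e in zip(row, noise_row)]
        for row, noise_row in zip(actions, noise)
    ]
    return noisy, noise


rng = random.Random(0)
alpha_bars = cosine_alpha_bars(NUM_DIFFUSION_STEPS)
clean_chunk = [[0.0] * ACTION_DIM for _ in range(CHUNK_SIZE)]  # placeholder actions
t = rng.randrange(NUM_DIFFUSION_STEPS)
noisy_chunk, target_noise = add_noise(clean_chunk, t, alpha_bars, rng)
```

In a setup like this, the action head would receive `noisy_chunk`, the timestep `t`, and the 768-dimensional conditioning vector, and would be trained to output `target_noise`.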
## Default config (excerpt)

```yaml
vision_model_id: google/siglip-so400m-patch14-384
text_model_id: google/gemma-3-4b-it
image_features: ["observation.images.ego_view"]
action_feature: "action"
chunk_size: 8
num_diffusion_steps: 50
conditioning_dim: 768
plan_update_interval: 10
scaledp_num_layers: 12
scaledp_dim_model: 320
scaledp_num_heads: 8
scaledp_dim_feedforward: 1280
use_lora: true
lora_rank: 16
lora_target_modules: ["q_proj","k_proj","v_proj","o_proj"]
optimizer_lr: 1e-4
optimizer_weight_decay: 1e-6
```
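
When editing the ScaleDP sizes, it may help to check that they stay mutually consistent. The following is a small sketch: the dictionary copies values from the excerpt above, while `derived_shapes` is a hypothetical helper, not part of the repository.

```python
# Values copied from the config excerpt above.
config = {
    "scaledp_num_layers": 12,
    "scaledp_dim_model": 320,
    "scaledp_num_heads": 8,
    "scaledp_dim_feedforward": 1280,
}


def derived_shapes(cfg):
    """Hypothetical helper: derive the per-head width and feed-forward expansion ratio."""
    d_model = cfg["scaledp_dim_model"]
    heads = cfg["scaledp_num_heads"]
    if d_model % heads != 0:
        raise ValueError("scaledp_dim_model must be divisible by scaledp_num_heads")
    return {
        "head_dim": d_model // heads,
        "ff_ratio": cfg["scaledp_dim_feedforward"] / d_model,
    }


shapes = derived_shapes(config)
print(shapes)  # {'head_dim': 40, 'ff_ratio': 4.0}
```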
## Usage (with this repo’s LeRobot fork)

Install the dependencies and add this repository's `lerobot` directory to `PYTHONPATH`.

Evaluation-style load:

```python
import torch
from huggingface_hub import snapshot_download

from lerobot.common.policies.gemma_le.modeling_gemma_le import GemmaLePolicy

# Download the checkpoint files from the Hub into the local cache.
ckpt_dir = snapshot_download(repo_id="Ryukijano/gemma-groot", revision="main")

# Load the policy weights in bf16 and switch to inference mode.
policy = GemmaLePolicy.from_pretrained(ckpt_dir, torch_dtype=torch.bfloat16)
policy.eval()
```
Training entrypoint:

```bash
python lerobot/lerobot/scripts/train.py \
  --policy.type gemma_le \
  --dataset.repo_id local/robot_sim.PickNPlace \
  --dataset.root /path/to/robot_sim.PickNPlace \
  --dataset.episodes "[0,1,2,3,4]" \
  --batch_size 3 \
  --steps 200000 \
  --log_freq 100 \
  --save_freq 5000 \
  --policy.vision_model_id google/siglip-so400m-patch14-384 \
  --policy.text_model_id google/gemma-3-4b-it \
  --policy.use_amp true \
  --progress_bar true \
  --push_to_hub true \
  --push_repo_id Ryukijano/gemma-groot \
  --push_branch main \
  --push_exist_ok true
```
### Slurm (3× L40)

See `submit_job.sh`. Keep the caches on scratch storage and set:

- `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`
- `HF_HOME`, `HUGGINGFACE_HUB_CACHE`, `TRANSFORMERS_CACHE` to paths on scratch
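
If the same settings are needed from a Python launcher rather than the shell script, they could be applied as in the sketch below. The scratch path and the `hf_home/hub` and `hf_home/transformers` layout are assumptions; `scratch_cache_env` is a hypothetical helper and is not taken from `submit_job.sh`.

```python
import os
from pathlib import Path


def scratch_cache_env(scratch_root):
    """Build the environment variables listed above, rooted at scratch_root."""
    hf_home = Path(scratch_root) / "hf_home"  # hypothetical directory layout
    return {
        "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
        "HF_HOME": str(hf_home),
        "HUGGINGFACE_HUB_CACHE": str(hf_home / "hub"),
        "TRANSFORMERS_CACHE": str(hf_home / "transformers"),
    }


# Apply before importing torch or transformers so the settings take effect.
# "/scratch/your_user" is a placeholder path.
os.environ.update(scratch_cache_env("/scratch/your_user"))
```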
## Checkpoints

- The latest runs are uploaded under `runs/<date>/<run>/<step>` in this repo.
- Example: `runs/2025-08-12/13-06-07_gemma_le/020000/`.
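
To fetch a single checkpoint instead of the whole repository, the path layout above can be combined with the `allow_patterns` argument of `snapshot_download`. This is a sketch: both helper names are hypothetical, and the six-digit zero padding of the step is inferred from the example path rather than documented.

```python
def checkpoint_subdir(date, run, step):
    """Compose runs/<date>/<run>/<step>; the 6-digit padding is inferred from the example."""
    return f"runs/{date}/{run}/{step:06d}"


def download_checkpoint(repo_id, subdir):
    """Download only one checkpoint folder (needs huggingface_hub and network access)."""
    from huggingface_hub import snapshot_download  # imported lazily on purpose

    local_root = snapshot_download(repo_id=repo_id, allow_patterns=[f"{subdir}/*"])
    return f"{local_root}/{subdir}"


subdir = checkpoint_subdir("2025-08-12", "13-06-07_gemma_le", 20000)
print(subdir)  # runs/2025-08-12/13-06-07_gemma_le/020000
```

Calling `download_checkpoint("Ryukijano/gemma-groot", subdir)` would then return a local directory to load the policy from.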
## Data

- LeRobotDataset (parquet + mp4 + metadata). Single RGB view: `observation.images.ego_view`. Targets: `action`.
- Timestamp tolerance is auto-relaxed to `max(tolerance_s, 1/fps + 1e-4)` during training for robust decoding.
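
The relaxation rule above amounts to allowing at least one frame period plus a small margin. A minimal sketch follows; the function name is hypothetical and the formula is the one quoted above.

```python
def relaxed_tolerance(tolerance_s, fps):
    """Return max(tolerance_s, 1/fps + 1e-4): at least one frame period plus a margin."""
    return max(tolerance_s, 1.0 / fps + 1e-4)


# At 30 fps a tight 1e-4 s tolerance is widened to just over one frame period,
# while a tolerance that is already larger is left unchanged.
tight = relaxed_tolerance(1e-4, 30)
loose = relaxed_tolerance(0.5, 30)
```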
## Notes

- Base model access: `google/gemma-3-4b-it` is gated; you may need to accept its terms of use on the Hugging Face Hub before downloading.
- Intended for imitation learning; ThinkAct-style planning can be layered on top.
## Citations

- LeRobot: https://github.com/huggingface/lerobot
- Gemma 3: https://ai.google.dev/gemma
- SigLIP: https://huggingface.co/timm/ViT-SigLIP
- Diffusion Policy: https://arxiv.org/abs/2303.04137
| ``` |