	---
	license: apache-2.0
	language:
	- en
	tags:
	- robotics
	- vla
	- lerobot
	- imitation-learning
	- diffusion-policy
	- gemma-3
	- siglip
	- scaledp
	- multimodal
	---

	# Gemma-Le: SigLIP + Gemma 3 + ScaleDP (LeRobot VLA Policy)

	Gemma-Le is a compact Vision-Language-Action (VLA) policy for robotic manipulation, built on top of LeRobot.
	It replaces the NVIDIA Eagle backbone with standard Hugging Face components:

	- SigLIP `google/siglip-so400m-patch14-384` for vision
	- Gemma 3 `google/gemma-3-4b-it` for language/reasoning (with LoRA PEFT)
	- ScaleDP (Scalable Diffusion Transformer) as the action head

	This repo hosts exported checkpoints trained on LeRobot-format datasets (e.g., `robot_sim.PickNPlace`).

	## Architecture
	- Vision: SigLIP ViT encoder (384px, patch14), pooled embedding
	- Text: Gemma 3 4B-IT, mean-pooled hidden states
	- LoRA: rank=16 on `[q_proj, k_proj, v_proj, o_proj]`
	- Fusion: MLP projects [vision || text] -> `conditioning_dim=768`
	- Action head: ScaleDP Transformer (layers=12, d_model=320, heads=8, ff=1280) predicts diffusion noise
	- Temporal context: `chunk_size=8`; diffusion steps `num_diffusion_steps=50`
	- Mixed precision: AMP auto-selects bf16 or fp16; bf16 runs without a GradScaler because it does not need loss scaling
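
The mixed-precision rule in the last bullet can be sketched as a small decision function. This is a minimal sketch, not the repository's code: `AmpChoice` and `select_amp` are hypothetical names, and both the fp16-with-GradScaler pairing and the fp32 fallback on CPU are assumptions inferred from the bullet. In a real training script the two flags would typically come from `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()`.

```python
from typing import NamedTuple


class AmpChoice(NamedTuple):
    dtype: str             # "bfloat16", "float16", or "float32"
    use_grad_scaler: bool  # whether loss scaling should be enabled


def select_amp(cuda_available: bool, bf16_supported: bool) -> AmpChoice:
    """Pick an autocast dtype and decide whether loss scaling is needed."""
    if not cuda_available:
        # Assumption: without a GPU, fall back to full precision.
        return AmpChoice("float32", False)
    if bf16_supported:
        # bf16 keeps the fp32 exponent range, so no GradScaler is needed.
        return AmpChoice("bfloat16", False)
    # fp16 has a narrow exponent range; small gradients can underflow
    # unless the loss is scaled.
    return AmpChoice("float16", True)
```

For example, `select_amp(True, True)` selects bf16 with no scaler, while `select_amp(True, False)` selects fp16 with scaling enabled.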

	## Default config (excerpt)
	```yaml
	vision_model_id: google/siglip-so400m-patch14-384
	text_model_id: google/gemma-3-4b-it
	image_features: ["observation.images.ego_view"]
	action_feature: "action"
	chunk_size: 8
	num_diffusion_steps: 50
	conditioning_dim: 768
	plan_update_interval: 10
	scaledp_num_layers: 12
	scaledp_dim_model: 320
	scaledp_num_heads: 8
	scaledp_dim_feedforward: 1280
	use_lora: true
	lora_rank: 16
	lora_target_modules: ["q_proj","k_proj","v_proj","o_proj"]
	optimizer_lr: 1e-4
	optimizer_weight_decay: 1e-6
	```
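
The excerpt implies a few consistency constraints that are easy to check before launching a run. The sketch below mirrors a subset of the fields in a plain dataclass; `ScaleDPConfigSketch`, `head_dim`, and `validate` are hypothetical names for illustration, not the policy's actual config class.

```python
from dataclasses import dataclass


@dataclass
class ScaleDPConfigSketch:
    # Default values copied from the config excerpt above.
    chunk_size: int = 8
    num_diffusion_steps: int = 50
    conditioning_dim: int = 768
    scaledp_num_layers: int = 12
    scaledp_dim_model: int = 320
    scaledp_num_heads: int = 8
    scaledp_dim_feedforward: int = 1280
    lora_rank: int = 16

    @property
    def head_dim(self) -> int:
        """Per-head width of the action-head attention layers."""
        return self.scaledp_dim_model // self.scaledp_num_heads

    def validate(self) -> None:
        """Raise ValueError if the fields are mutually inconsistent."""
        for name, value in vars(self).items():
            if value <= 0:
                raise ValueError(f"{name} must be positive, got {value}")
        if self.scaledp_dim_model % self.scaledp_num_heads != 0:
            raise ValueError(
                "scaledp_dim_model must be divisible by scaledp_num_heads"
            )


cfg = ScaleDPConfigSketch()
cfg.validate()
```

With the defaults, each attention head is 320 / 8 = 40 dimensions wide and the feed-forward width is 4 × `scaledp_dim_model`.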

	## Usage (with this repo’s LeRobot fork)
	Install the dependencies and add this repository's `lerobot` directory to `PYTHONPATH`.

	Evaluation-style load:
	```python
	import torch
	from huggingface_hub import snapshot_download
	from lerobot.common.policies.gemma_le.modeling_gemma_le import GemmaLePolicy

	# Download a snapshot of the repository from the Hugging Face Hub.
	ckpt_dir = snapshot_download(repo_id="Ryukijano/gemma-groot", revision="main")
	# Load the policy weights in bf16 and switch to inference mode.
	policy = GemmaLePolicy.from_pretrained(ckpt_dir, torch_dtype=torch.bfloat16)
	policy.eval()
	```

	Training entrypoint:
	```bash
	python lerobot/lerobot/scripts/train.py \
	--policy.type gemma_le \
	--dataset.repo_id local/robot_sim.PickNPlace \
	--dataset.root /path/to/robot_sim.PickNPlace \
	--dataset.episodes "[0,1,2,3,4]" \
	--batch_size 3 \
	--steps 200000 \
	--log_freq 100 \
	--save_freq 5000 \
	--policy.vision_model_id google/siglip-so400m-patch14-384 \
	--policy.text_model_id google/gemma-3-4b-it \
	--policy.use_amp true \
	--progress_bar true \
	--push_to_hub true \
	--push_repo_id Ryukijano/gemma-groot \
	--push_branch main \
	--push_exist_ok true
	```

	### Slurm (3× L40)
	See `submit_job.sh`. Keep caches on scratch storage and set:
	- `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`
	- `HF_HOME`, `HUGGINGFACE_HUB_CACHE`, `TRANSFORMERS_CACHE` to paths on scratch
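
When launching from Python rather than a shell script, the same environment can be set up programmatically. This is a minimal sketch: `configure_scratch_env` and the `hf_cache` subdirectory layout are hypothetical, and it does not reproduce the contents of `submit_job.sh`.

```python
import os
from pathlib import Path


def configure_scratch_env(scratch_root: str) -> dict:
    """Point Hugging Face caches at scratch and enable expandable CUDA segments."""
    cache_dir = Path(scratch_root) / "hf_cache"  # hypothetical layout
    cache_dir.mkdir(parents=True, exist_ok=True)
    env = {
        "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
        "HF_HOME": str(cache_dir),
        "HUGGINGFACE_HUB_CACHE": str(cache_dir / "hub"),
        "TRANSFORMERS_CACHE": str(cache_dir / "transformers"),
    }
    # Apply to the current process so later imports see the values.
    os.environ.update(env)
    return env
```

Call it before importing `torch` or `transformers`, since these variables are generally read at import or initialization time.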

	## Checkpoints
	- Latest runs are uploaded under `runs/<date>/<run>/<step>` in this repo.
	- Example: `runs/2025-08-12/13-06-07_gemma_le/020000/`.
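
To fetch a single checkpoint rather than the whole repository, the run layout can be turned into a path filter. The helpers below are a hypothetical sketch that only builds strings; the six-digit zero padding of the step is inferred from the example path, not documented. The resulting pattern could be passed as `allow_patterns` to `huggingface_hub.snapshot_download`.

```python
from pathlib import PurePosixPath


def checkpoint_subdir(date: str, run: str, step: int) -> str:
    """Build `runs/<date>/<run>/<step>` with the step zero-padded to six digits."""
    return str(PurePosixPath("runs") / date / run / f"{step:06d}")


def checkpoint_pattern(date: str, run: str, step: int) -> str:
    """Glob pattern matching every file under one checkpoint directory."""
    return checkpoint_subdir(date, run, step) + "/*"


pattern = checkpoint_pattern("2025-08-12", "13-06-07_gemma_le", 20000)
```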

	## Data
	- LeRobotDataset (parquet + mp4 + metadata). Single RGB view: `observation.images.ego_view`. Targets: `action`.
	- Timestamp tolerance is auto-relaxed to `max(tolerance_s, 1/fps + 1e-4)` during training for robust decoding.
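
The tolerance rule above is a one-line computation. As a minimal sketch (the function name `relaxed_tolerance_s` is hypothetical; only the formula comes from the text):

```python
def relaxed_tolerance_s(tolerance_s: float, fps: float) -> float:
    """Return max(tolerance_s, 1/fps + 1e-4): at least one frame period plus a margin."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return max(tolerance_s, 1.0 / fps + 1e-4)
```

At 30 fps, a configured tolerance of `1e-4` s is relaxed to roughly 0.0334 s (one frame period plus the margin), while a configured tolerance of 0.5 s is left unchanged.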

	## Notes
	- Base model access: `google/gemma-3-4b-it` may require accepting the model's terms of use on the Hugging Face Hub before download.
	- Intended for imitation learning; ThinkAct-style planning can be layered on top.

	## Citations
	- LeRobot: https://github.com/huggingface/lerobot
	- Gemma 3: https://ai.google.dev/gemma
	- SigLIP: https://huggingface.co/timm/ViT-SigLIP
	- Diffusion Policy: https://arxiv.org/abs/2303.04137