EgoPush: Learning End-to-End Egocentric Multi-Object Rearrangement for Mobile Robots
Abstract
EgoPush enables robot manipulation in cluttered environments through perception-driven policy learning that uses object-centric latent spaces and stage-decomposed rewards for long-horizon tasks.
Humans can rearrange objects in cluttered environments using egocentric perception, navigating occlusions without global coordinates. Inspired by this capability, we study long-horizon multi-object non-prehensile rearrangement for mobile robots using a single egocentric camera. We introduce EgoPush, a policy learning framework that enables egocentric, perception-driven rearrangement without relying on explicit global state estimation that often fails in dynamic scenes. EgoPush designs an object-centric latent space to encode relative spatial relations among objects, rather than absolute poses. This design enables a privileged reinforcement-learning (RL) teacher to jointly learn latent states and mobile actions from sparse keypoints, which is then distilled into a purely visual student policy. To reduce the supervision gap between the omniscient teacher and the partially observed student, we restrict the teacher's observations to visually accessible cues. This induces active perception behaviors that are recoverable from the student's viewpoint. To address long-horizon credit assignment, we decompose rearrangement into stage-level subproblems using temporally decayed, stage-local completion rewards. Extensive simulation experiments demonstrate that EgoPush significantly outperforms end-to-end RL baselines in success rate, with ablation studies validating each design choice. We further demonstrate zero-shot sim-to-real transfer on a mobile platform in the real world. Code and videos are available at https://ai4ce.github.io/EgoPush/.
Community
EgoPush trains an object-centric, egocentric RL framework for long-horizon multi-object rearrangement with latent relational states, achieving effective sim-to-real transfer without explicit global state estimation.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VIGOR: Visual Goal-In-Context Inference for Unified Humanoid Fall Safety (2026)
- Learning Situated Awareness in the Real World (2026)
- Act, Sense, Act: Learning Non-Markovian Active Perception Strategies from Large-Scale Egocentric Human Data (2026)
- Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures (2026)
- RoboMirror: Understand Before You Imitate for Video to Humanoid Locomotion (2025)
- Now You See That: Learning End-to-End Humanoid Locomotion from Raw Pixels (2026)
- EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper