ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents
Abstract
ClawGUI is an open-source framework that addresses key challenges in GUI agent development through unified reinforcement learning, standardized evaluation, and cross-platform deployment capabilities.
GUI agents drive applications through their visual interfaces instead of programmatic APIs, interacting with arbitrary software via taps, swipes, and keystrokes and reaching a long tail of applications that CLI-based agents cannot. Yet progress in this area is bottlenecked less by modeling capacity than by the absence of coherent full-stack infrastructure: online RL training suffers from environment instability and closed pipelines, evaluation protocols drift silently across works, and trained agents rarely reach real users on real devices. We present ClawGUI, an open-source framework addressing these three gaps within a single harness. ClawGUI-RL provides the first open-source GUI agent RL infrastructure with validated support for both parallel virtual environments and real physical devices, integrating GiGPO with a Process Reward Model for dense step-level supervision. ClawGUI-Eval enforces a fully standardized evaluation pipeline across 6 benchmarks and 11+ models, achieving 95.8% reproduction against official baselines. ClawGUI-Agent brings trained agents to Android, HarmonyOS, and iOS through 12+ chat platforms with hybrid CLI-GUI control and persistent personalized memory. Trained end to end within this pipeline, ClawGUI-2B achieves a 17.1% Success Rate on MobileWorld GUI-Only, outperforming the same-scale MAI-UI-2B baseline by 6.0%.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience (2026)
- Generalization in Online Reinforcement Learning for Mobile Agents (2026)
- WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents (2026)
- KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation (2026)
- GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents (2026)
- OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards (2026)
- Gym-V: A Unified Vision Environment System for Agentic Vision Research (2026)
Dense step-level supervision with GiGPO and the Process Reward Model is the most interesting hinge here; it directly tackles the sparse-reward problem in long-horizon GUI tasks. An ablation removing the step-level rewards would be the cleanest way to confirm it is driving the gains, since episode-level GRPO could still be a plausible contributor. The arxivlens breakdown helped me parse the method details, and I appreciated the clear triad of training, evaluation, and deployment (arxivlens: https://arxivlens.com/PaperView/Details/clawgui-a-unified-framework-for-training-evaluating-and-deploying-gui-agents-3771-36ae137f). I would also like to know how the framework handles real-device quirks when emulators and physical devices diverge, and how the memory system's personalization squares with its privacy implications. If the ablations confirm that dense step-level supervision is the driver, ClawGUI could push us toward a principled, end-to-end GUI agent harness rather than piecemeal stacks.
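To make the ablation question concrete, here is a minimal sketch of how episode-level group-relative advantages (GRPO-style) could be blended with dense per-step scores from a process reward model. The function names, the linear mixing scheme, and the `beta` weight are all my assumptions for illustration, not the paper's actual GiGPO + PRM implementation; setting `beta=0` recovers the pure episode-level baseline the ablation would compare against.

```python
def group_relative_advantages(episode_returns):
    """GRPO-style: normalize each rollout's return against its group
    (same task, multiple sampled trajectories)."""
    n = len(episode_returns)
    mean = sum(episode_returns) / n
    var = sum((r - mean) ** 2 for r in episode_returns) / n
    std = var ** 0.5 or 1.0  # avoid div-by-zero when all returns are equal
    return [(r - mean) / std for r in episode_returns]

def dense_step_advantages(step_prm_scores, episode_adv, beta=0.5):
    """Mix one shared episode-level advantage with per-step PRM scores.
    beta=0 -> episode-only (sparse); beta=1 -> PRM-only (dense)."""
    return [(1 - beta) * episode_adv + beta * s for s in step_prm_scores]

# Toy usage: 3 rollouts of one task; only the first succeeds.
returns = [1.0, 0.0, 0.0]
eps_adv = group_relative_advantages(returns)
prm_scores = [[0.8, 0.9], [0.2, 0.1], [0.5, 0.3]]  # per-step PRM outputs
per_step = [dense_step_advantages(s, a) for s, a in zip(prm_scores, eps_adv)]
```

Under this scheme, every step in a failed rollout still receives a graded signal from the PRM instead of a uniform negative advantage, which is exactly the sparse-reward relief the comment above is asking to see ablated.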
Wrote up a summary of this one here: https://arxivexplained.com/paper/clawgui-a-unified-framework-for-training-evaluating-and-deploying-gui-agents — the section on the unified framework for training, evaluating, and deploying GUI agents is especially worth a look.
Get this paper in your agent:
hf papers read 2604.11784

Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash