QTrack: Query-Driven Reasoning for Multi-modal MOT

QTrack is an end-to-end vision-language model designed for query-driven multi-object tracking (MOT). Unlike traditional MOT, which tracks all objects in a scene, QTrack selectively localizes and tracks specific targets based on natural language instructions while maintaining temporal coherence and identity consistency.

Paper: QTrack: Query-Driven Reasoning for Multi-modal MOT
Project Page: https://gaash-lab.github.io/QTrack/
Repository: https://github.com/gaash-lab/QTrack

Description

Multi-object tracking has traditionally focused on estimating trajectories of all objects. QTrack introduces a query-driven tracking paradigm that formulates tracking as a spatiotemporal reasoning problem conditioned on natural language queries.
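
As a toy illustration of the difference in output (not of QTrack's method): a conventional tracker returns a trajectory for every object, while a query-driven tracker returns only the trajectories the query refers to, with identities preserved across frames. In the sketch below, the `Track` type, its `label` field, and the keyword match standing in for language-conditioned reasoning are all hypothetical; the actual model reasons jointly over video and text.

```python
from dataclasses import dataclass, field

# Hypothetical types for illustration only; this is not QTrack's API.
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Track:
    track_id: int   # identity kept constant across frames
    label: str      # toy stand-in for the object's visual content
    boxes: dict[int, Box] = field(default_factory=dict)  # frame index -> box


def track_all(tracks: list[Track]) -> list[Track]:
    """Conventional MOT output: every object's trajectory."""
    return list(tracks)


def track_by_query(tracks: list[Track], query: str) -> list[Track]:
    """Query-driven output: only the trajectories the query refers to.

    A keyword match stands in for the language-conditioned reasoning
    that a real model would perform.
    """
    words = set(query.lower().split())
    return [t for t in tracks if t.label.lower() in words]


scene = [
    Track(1, "car", {0: (10, 20, 50, 60), 1: (14, 20, 54, 60)}),
    Track(2, "person", {0: (100, 40, 120, 90), 1: (102, 40, 122, 90)}),
    Track(3, "car", {0: (200, 25, 240, 65), 1: (196, 25, 236, 65)}),
]

selected = track_by_query(scene, "track every car in the clip")
print([t.track_id for t in selected])  # IDs of the two car tracks
```

Each selected track keeps the same `track_id` in every frame, which is the identity-consistency requirement in miniature.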

Key Contributions

RMOT26 Benchmark: A large-scale benchmark with grounded queries and sequence-level splits to enable robust evaluation of generalization.
QTrack Model: An end-to-end vision-language model that integrates multimodal reasoning with tracking-oriented localization.
Temporal Perception-Aware Policy Optimization (TPA-PO): A structured reward strategy to encourage motion-aware reasoning.

Benchmark Results

On the RMOT26 benchmark, QTrack (3B parameters) achieves the best result on all four reported metrics among the compared models, which include 8B and 27B models.

Model	Params	MCP↑	MOTP↑	CLE (px)↓	NDE↓
GPT-5.2	-	0.25	0.61	94.2	0.55
Qwen3-VL-Instruct	8B	0.25	0.64	96.0	0.97
Gemma 3	27B	0.24	0.56	58.4	0.88
InternVL	8B	0.21	0.66	117.44	0.64
QTrack (Ours)	3B	0.30	0.75	44.61	0.39
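
The comparison can be checked mechanically. The following is a small sketch that copies the rows from the table above and, using the arrow directions in the header (↑ higher is better, ↓ lower is better), reports the best model and the gap to the runner-up for each metric. The helper names `best_per_metric` and `runner_up_gap` are illustrative and not part of the QTrack codebase.

```python
# Rows copied from the results table above: (model, MCP, MOTP, CLE px, NDE).
RESULTS = [
    ("GPT-5.2", 0.25, 0.61, 94.2, 0.55),
    ("Qwen3-VL-Instruct", 0.25, 0.64, 96.0, 0.97),
    ("Gemma 3", 0.24, 0.56, 58.4, 0.88),
    ("InternVL", 0.21, 0.66, 117.44, 0.64),
    ("QTrack (Ours)", 0.30, 0.75, 44.61, 0.39),
]

# Column index and direction, following the arrows in the table header.
METRICS = {
    "MCP": (1, max),
    "MOTP": (2, max),
    "CLE (px)": (3, min),
    "NDE": (4, min),
}


def best_per_metric(rows):
    """Return {metric: (model, value)} for the best entry in each column."""
    best = {}
    for name, (col, pick) in METRICS.items():
        row = pick(rows, key=lambda r: r[col])
        best[name] = (row[0], row[col])
    return best


def runner_up_gap(rows, metric):
    """Absolute gap between the best and second-best value for a metric."""
    col, pick = METRICS[metric]
    ordered = sorted((r[col] for r in rows), reverse=(pick is max))
    return abs(ordered[0] - ordered[1])


for metric, (model, value) in best_per_metric(RESULTS).items():
    gap = runner_up_gap(RESULTS, metric)
    print(f"{metric}: {model} = {value} (gap to runner-up {gap:.2f})")
```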

Installation

To set up the environment and use the model, please follow the instructions in the official repository:

# Create conda environment
conda create -n qtrack python=3.12
conda activate qtrack

# Install QTrack and dependencies
git clone https://github.com/gaash-lab/QTrack.git
cd QTrack
pip install -r requirements.txt
pip install -e .

Citation

If you find QTrack useful for your research, please cite:

@article{ashraf2026qtrack,
  title={QTrack: Query-Driven Reasoning for Multi-modal MOT},
  author={Ashraf, Tajamul and Tariq, Tavaheed and Yadav, Sonia and Ul Riyaz, Abrar and Tak, Wasif and Abdar, Moloud and Bashir, Janibul},
  journal={arXiv preprint arXiv:2603.13759},
  year={2026}
}
