Localized Visual Understanding
• GLaMM: Pixel Grounding Large Multimodal Model (arXiv:2311.03356)
• SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models (arXiv:2311.07575)
• CoVLM: Composing Visual Entities and Relationships in Large Language Models via Communicative Decoding (arXiv:2311.03354)
• Language-Informed Visual Concept Learning (arXiv:2312.03587)
• Denoising Vision Transformers (arXiv:2401.02957)
• Learning Anatomically Consistent Embedding for Chest Radiography (arXiv:2312.00335)
• Representing Part-Whole Hierarchies in Foundation Models by Learning Localizability, Composability, and Decomposability from Anatomy via Self-Supervision (arXiv:2404.15672)
• OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding (arXiv:2406.19389)
• MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning (arXiv:2406.17770)
• INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model (arXiv:2407.16198)
• Contrastive Localized Language-Image Pre-Training (arXiv:2410.02746)