Paper: NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval | Blog

NanoVDR-L

ModernBERT-base ablation variant. For production use, we recommend NanoVDR-S-Multi.

NanoVDR-L is a 151M-parameter text-only query encoder for visual document retrieval, trained via asymmetric cross-modal distillation from Qwen3-VL-Embedding-2B. It pairs ModernBERT-base with a 2-layer MLP projector and achieves the highest ViDoRe v1 score (82.4 NDCG@5) among all NanoVDR variants.

Highlights

Single-vector retrieval: queries and documents share the same 2048-dim embedding space as Qwen3-VL-Embedding-2B, so retrieval is a plain dot product, FAISS-compatible, at 4 KB per page (float16)
Lightweight on storage: 612 MB model; the document index is 64× smaller than ColPali's multi-vector patch index
Asymmetric setup: a small 151M text encoder runs at query time; the large VLM indexes documents offline, once
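The 4 KB per-page figure follows from the embedding width: 2048 float16 values at 2 bytes each. A minimal sketch of that arithmetic is below. The `index_bytes` helper and the one-million-page corpus are hypothetical, chosen only for illustration, and the estimate covers raw vectors only, not any index overhead.

```python
def index_bytes(num_pages: int, dim: int = 2048, bytes_per_value: int = 2) -> int:
    """Raw size of a flat single-vector index: one dim-wide vector per page."""
    return num_pages * dim * bytes_per_value

# One page: 2048 dims x 2 bytes (float16)
per_page = index_bytes(1)

# Hypothetical corpus size, used only to illustrate scaling
corpus = index_bytes(1_000_000)

print(f"{per_page / 1024:.0f} KB per page")
print(f"{corpus / 1024**3:.2f} GiB for 1,000,000 pages")
```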

Results

Model	Params	ViDoRe v1	ViDoRe v2	ViDoRe v3	Avg Retention
Qwen3-VL-Emb (Teacher)	2.0B	84.3	65.3	50.0	—
NanoVDR-L	151M	82.4	61.5	44.2	93.4%
NanoVDR-S-Multi	69M	82.2	61.9	46.5	95.1%

NDCG@5 (×100). Retention = Student / Teacher, averaged across v1/v2/v3.
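As a worked check of the retention column, the sketch below assumes retention is the mean of the per-benchmark Student / Teacher ratios (rather than a ratio of mean scores). That reading reproduces the values in the table; the `avg_retention` helper name is ours.

```python
# Scores copied from the results table above (NDCG@5 x 100)
teacher = {"v1": 84.3, "v2": 65.3, "v3": 50.0}
students = {
    "NanoVDR-L": {"v1": 82.4, "v2": 61.5, "v3": 44.2},
    "NanoVDR-S-Multi": {"v1": 82.2, "v2": 61.9, "v3": 46.5},
}

def avg_retention(student: dict, teacher: dict) -> float:
    """Mean of per-benchmark student/teacher ratios, as a percentage."""
    ratios = [student[k] / teacher[k] for k in teacher]
    return 100.0 * sum(ratios) / len(ratios)

for name, scores in students.items():
    print(f"{name}: {avg_retention(scores, teacher):.1f}%")
```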

Usage

Prerequisite: Documents must be indexed offline using Qwen3-VL-Embedding-2B (the teacher model). See the NanoVDR-S-Multi model page for a complete indexing guide.

from sentence_transformers import SentenceTransformer
import numpy as np

# doc_embeddings: (N, 2048) from teacher indexing (see prerequisite above)

model = SentenceTransformer("nanovdr/NanoVDR-L")
query_embeddings = model.encode(["What was the revenue growth in Q3?"])  # (1, 2048)

scores = query_embeddings @ doc_embeddings.T
top_k_indices = np.argsort(scores[0])[-5:][::-1]
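For several queries at once, the same dot-product ranking can be batched. The sketch below is one possible way to do it, not part of the released code: it uses random arrays as stand-ins for the teacher-built `doc_embeddings` and the encoded queries, and the `top_k` helper is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: in practice doc_embeddings come from the teacher index
# and query_embeddings from model.encode(...). Only the shapes are realistic.
doc_embeddings = rng.standard_normal((1000, 2048)).astype(np.float16)
query_embeddings = rng.standard_normal((3, 2048)).astype(np.float32)

def top_k(queries: np.ndarray, docs: np.ndarray, k: int = 5):
    """Return (indices, scores) of the k highest dot products per query."""
    scores = queries @ docs.astype(np.float32).T            # (Q, N)
    # argpartition selects the top-k without a full sort; order only those k.
    part = np.argpartition(scores, -k, axis=1)[:, -k:]      # (Q, k), unordered
    part_scores = np.take_along_axis(scores, part, axis=1)
    order = np.argsort(-part_scores, axis=1)                # descending
    idx = np.take_along_axis(part, order, axis=1)
    return idx, np.take_along_axis(scores, idx, axis=1)

indices, scores = top_k(query_embeddings, doc_embeddings, k=5)
print(indices.shape, scores.shape)
```

Upcasting the float16 index to float32 before the matrix product keeps the on-disk footprint small while avoiding half-precision accumulation on CPU.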

Training Details

Setting	Value
Architecture	ModernBERT-base (149M) + MLP projector (768 → 768 → 2048, 2.4M) = 151M
Objective	Pointwise cosine alignment with teacher query embeddings
Data	711K query-document pairs
Epochs / lr	20 / 2e-4
Training cost	~11.7 GPU-hours (1× H200)
CPU query latency	109 ms
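The objective row can be illustrated with a small sketch. This card does not give the exact loss formula, so the form below (mean of 1 − cosine similarity between each student query embedding and its teacher counterpart) is an assumption, and `cosine_alignment_loss` is a hypothetical name. NumPy is used here only to show the computation; actual training would use an autodiff framework.

```python
import numpy as np

def cosine_alignment_loss(student: np.ndarray, teacher: np.ndarray,
                          eps: float = 1e-8) -> float:
    """Mean of (1 - cos(student_i, teacher_i)) over paired rows (assumed form)."""
    s = student / (np.linalg.norm(student, axis=1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=1, keepdims=True) + eps)
    cos = np.sum(s * t, axis=1)          # one cosine per (student, teacher) pair
    return float(np.mean(1.0 - cos))

rng = np.random.default_rng(0)
teacher_q = rng.standard_normal((4, 2048))   # stand-in teacher query embeddings

# Perfect alignment gives a loss near 0; opposite directions give a loss near 2.
print(cosine_alignment_loss(teacher_q, teacher_q))
print(cosine_alignment_loss(-teacher_q, teacher_q))
```

The loss is "pointwise" in the sense that each query is matched only to its own teacher embedding; no negatives or in-batch comparisons are involved in this sketch.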

All NanoVDR Models

Model	Backbone	Params	v1	v2	v3	Retention
NanoVDR-S-Multi	DistilBERT	69M	82.2	61.9	46.5	95.1%
NanoVDR-S	DistilBERT	69M	82.2	60.5	43.5	92.4%
NanoVDR-M	BERT-base	112M	82.1	62.2	44.7	94.0%
NanoVDR-L	ModernBERT	151M	82.4	61.5	44.2	93.4%

Citation

@article{nanovdr2026,
  title={NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval},
  author={Liu, Zhuchenyang and Zhang, Yao and Xiao, Yu},
  journal={arXiv preprint arXiv:2603.12824},
  year={2026}
}

License

Apache 2.0

Base model

answerdotai/ModernBERT-base
