Self-editing context for long-horizon retrieval: practical deployment questions
The self-editing context mechanism, with 0.94 prune accuracy, addresses a core pain point in production RAG systems: context bloat that degrades retrieval quality over long reasoning chains. Query decomposition plus parallel tool calling (2.56 calls/turn) is an elegant efficiency win.
A few deployment questions:
Is the 10x faster inference claim measured against frontier models like GPT-4o running full retrieval loops, or against smaller specialized retrievers? For a 20B MoE I'd expect significant latency gains, but I'm curious about the baseline.
On the staged curriculum training (SFT + RL with CISPO): is there a threshold where the RL fine-tuning becomes critical? In my experience with retrieval agents, SFT-only models often struggle with strategic context pruning.
The harness requirement is notable. For teams building with Context-1 before the harness release, what's the minimum scaffold needed to get functional retrieval? Is it primarily the token budget manager and deduplication layer?
Looking forward to the harness release — self-editing context is the right abstraction for agentic RAG pipelines.
- It's comparing against models such as Opus 4.5 and GPT-5.2. We benchmarked on a single B200 with the MXFP4 checkpoint, and it is even faster with speculative decoding.
- We didn't ablate RL vs. no-RL rigorously, but I can say RL was necessary here for performance.
- That's about right. We should have the harness ready in open source this week, and you can also use the hosted version today: https://www.trychroma.com/products/agent
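For anyone experimenting before the harness lands, here is a rough sketch of the two pieces named in the question above, a token budget manager and a deduplication layer. It is not the Chroma harness: `ContextBuffer`, `estimate_tokens`, the whitespace token count, and the score-based eviction are all hypothetical stand-ins (Context-1 decides what to prune itself; the score here only imitates that decision).

```python
import hashlib
from dataclasses import dataclass, field


def estimate_tokens(text: str) -> int:
    # Assumption: crude whitespace count; a real scaffold would use the model's tokenizer.
    return len(text.split())


@dataclass
class ContextBuffer:
    """Hypothetical scaffold: dedupe retrieved chunks and keep them under a token budget."""

    token_budget: int
    chunks: dict = field(default_factory=dict)  # digest -> (text, score)

    def used_tokens(self) -> int:
        return sum(estimate_tokens(text) for text, _ in self.chunks.values())

    def add(self, text: str, score: float) -> bool:
        """Store a chunk; return False if it duplicates one already held."""
        # Normalize case and whitespace so trivially different copies collide.
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in self.chunks:
            return False
        self.chunks[digest] = (text, score)
        self._enforce_budget()
        return True

    def _enforce_budget(self) -> None:
        # Evict the lowest-scored chunk until the buffer fits the budget again.
        while self.chunks and self.used_tokens() > self.token_budget:
            lowest = min(self.chunks, key=lambda d: self.chunks[d][1])
            del self.chunks[lowest]


buf = ContextBuffer(token_budget=6)
buf.add("alpha beta gamma", score=0.9)
buf.add("Alpha  beta gamma", score=0.5)   # rejected as a duplicate
buf.add("delta epsilon zeta", score=0.2)
buf.add("eta theta", score=0.7)           # pushes past the budget, evicts the 0.2 chunk
print([text for text, _ in buf.chunks.values()])
```

In a real loop the eviction decision would come from the model's own prune calls rather than a fixed score, so treat the scoring here as a placeholder.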
For speculative decoding, did you use nvidia/gpt-oss-120b-Eagle3-long-context?
Could you share any updates on the timeline for open-sourcing the harness?
Can't wait for the open source harness to be available!