ForeAgent — Qwen3-VL-8B for Agentic Image Forensics
ForeAgent (Forensics Agent) is a fine-tuned Qwen3-VL-8B model for AI-generated image detection. It determines whether an image is real (authentic) or fake (AI-generated) through multi-view forensic analysis spanning semantic, frequency-domain, and spatial-domain features.
Paper: Perception, Verdict, and Evolution: Hindsight-Driven Self-Refining Forensics Agent for AI-Generated Image Detection
Highlights
- 82.18% accuracy on Chameleon benchmark, outperforming AIDE by 16.41%
- Competitive results on AIGCDetectBenchmark
- Reasoning quality comparable to GPT-5 (per qualitative evaluations)
- Dual-input analysis: original image + frequency-domain representation (wavelet cD)
- Produces structured JSON output with conclusion, confidence, and reasoning
Model Details
| Attribute | Value |
|---|---|
| Base Model | Qwen/Qwen3-VL-8B |
| Fine-tuning | LoRA SFT with iterative self-refinement |
| LoRA Config | r=16, alpha=32, lr=3e-6 |
| Training Epochs | 2 per iteration, multiple iterations |
| Max Sequence Length | 1400 |
| Framework | Transformers + PEFT |
| Input | Image(s) + text prompt |
| Output | JSON: {"conclusion": "real/fake", "confidence": 0.0-1.0, "reasoning": "..."} |
How It Works
Perception-Verdict Mechanism
ForeAgent aggregates multi-view cues for forensic analysis:
- Semantic features — the original image, analyzed for texture anomalies, anatomical integrity, physical consistency, and artifact detection
- Frequency-domain features — diagonal detail coefficients (cD) from wavelet transform, revealing spectral patterns characteristic of AI-generated images
- Spatial-domain features — noise pattern residuals (NPR) from a spatial expert model, detecting GAN-specific artifacts
An MLLM-based Critic fuses these multi-view signals to produce a logically grounded verdict.
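The spatial-domain cue can be illustrated with a minimal NPR-style sketch. The snippet below is illustrative and not the paper's spatial expert model: it approximates a noise pattern residual as the difference between an image and its nearest-neighbor down-/up-sampled reconstruction, which tends to expose up-sampling artifacts common in GAN outputs.

```python
import numpy as np
from PIL import Image

def noise_pattern_residual(img: Image.Image, factor: int = 2) -> np.ndarray:
    """Illustrative NPR: residual between a grayscale image and its
    nearest-neighbor down-/up-sampled copy. Up-sampling artifacts from
    generators concentrate in this residual."""
    arr = np.asarray(img.convert("L"), dtype=np.float64)
    h, w = arr.shape
    small = img.convert("L").resize((w // factor, h // factor), Image.NEAREST)
    rec = np.asarray(small.resize((w, h), Image.NEAREST), dtype=np.float64)
    return arr - rec

# Demo on a synthetic gradient image
demo = Image.fromarray((np.arange(64 * 64).reshape(64, 64) % 256).astype(np.uint8))
npr = noise_pattern_residual(demo)
print(npr.shape)  # (64, 64)
```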
Hindsight-Driven Self-Refining (EFA Pipeline)
The model is trained through an iterative Sampling → Reflection → Evolution loop:
Iteration N:
1. Agent Inference — Two-round reasoning (Perception + Critic) on data split N
2. Sample Classification — Separate correct vs incorrect predictions
3. Quality Assessment — Dual-model gating (Qwen3-VL-8B + Qwen3-VL-Plus)
4. Reflection — Guided by ground-truth, regenerate high-quality reasoning
5. Training Data — Merge reflections + knowledge retention, label-balanced
6. LoRA Fine-tuning — Update model weights
7. Evaluation — Test on held-out benchmarks
→ Next iteration uses the improved model
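The loop above can be sketched in code. Everything below is illustrative pseudocode in runnable form: the function and variable names (`efa_iteration`, `reflect`, `quality_gate`) are hypothetical and do not come from the released training code.

```python
import random

def efa_iteration(model, data_split, reflect, quality_gate):
    """One illustrative Sampling -> Reflection -> Evolution step."""
    preds = [model(x) for x, _ in data_split]                      # 1. agent inference
    correct = [s for s, p in zip(data_split, preds) if p == s[1]]  # 2. split by correctness
    wrong = [s for s, p in zip(data_split, preds) if p != s[1]]
    # 3-4. regenerate reasoning for errors, keep only gated high-quality samples
    reflections = [r for r in (reflect(x, y) for x, y in wrong) if quality_gate(r)]
    # 5. label-balanced merge: retention samples preserve existing capabilities
    retention = random.sample(correct, min(len(correct), len(reflections)))
    train_set = reflections + retention
    return train_set  # 6. would then be fed to LoRA fine-tuning

# Toy run with a degenerate model that always predicts "real"
data = [("img_a", "real"), ("img_b", "fake"), ("img_c", "fake")]
batch = efa_iteration(lambda x: "real", data,
                      reflect=lambda x, y: (x, y),
                      quality_gate=lambda r: True)
print(len(batch))  # 3
```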
Usage
Quick Start (Single Image)
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor  # Qwen3-VL class; requires a recent transformers release
from qwen_vl_utils import process_vision_info
model = Qwen3VLForConditionalGeneration.from_pretrained(
"Shimin/qwen3_vl_8b_foreagent",
torch_dtype="auto",
device_map="auto",
)
processor = AutoProcessor.from_pretrained("Shimin/qwen3_vl_8b_foreagent")
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "path/to/image.jpg"},
{"type": "text", "text": (
"You are an expert forensic analyst specializing in distinguishing "
"natural images from AI-generated images. Analyze the given image "
"systematically.\n\n"
"## Output JSON Format:\n"
'```json\n{\n "conclusion": "real or fake",\n'
' "confidence": 0.0-1.0,\n'
' "reasoning": "Brief reasoning (max 64 words)."\n}\n```\n\n'
"Please analyze this image and determine if it is real or AI-generated."
)},
],
}
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=256)
output_text = processor.batch_decode(
generated_ids[:, inputs.input_ids.shape[1]:],
skip_special_tokens=True,
)[0]
print(output_text)
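The model's reply may wrap the JSON in a ```json fence, so a small parsing helper is useful downstream. This is a sketch (the helper name is ours); the field names follow the output format documented above.

```python
import json
import re

def parse_verdict(output_text: str) -> dict:
    """Extract the first JSON object from model output, with or without
    a ```json code fence, and sanity-check the expected fields."""
    match = re.search(r"\{.*\}", output_text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    verdict = json.loads(match.group(0))
    assert verdict["conclusion"] in ("real", "fake")
    assert 0.0 <= float(verdict["confidence"]) <= 1.0
    return verdict

sample = '```json\n{"conclusion": "fake", "confidence": 0.91, "reasoning": "..."}\n```'
print(parse_verdict(sample)["conclusion"])  # fake
```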
Dual-Image Mode (Original + Wavelet Frequency Domain)
For best performance, provide both the original image and its wavelet-transformed frequency-domain representation (diagonal detail coefficients cD):
import numpy as np
import pywt
from PIL import Image
def extract_wavelet_cd(image_path, output_size=256):
"""Extract diagonal detail coefficients (cD) via wavelet transform."""
img = Image.open(image_path).convert("L")
img_array = np.array(img, dtype=np.float64)
coeffs = pywt.dwt2(img_array, "db1")
_, (_, _, cD) = coeffs
cD_normalized = np.clip((cD - cD.min()) / (cD.max() - cD.min() + 1e-8) * 255, 0, 255)
cD_image = Image.fromarray(cD_normalized.astype(np.uint8))
return cD_image.resize((output_size, output_size), Image.BILINEAR)
wavelet_image = extract_wavelet_cd("path/to/image.jpg")
wavelet_image.save("path/to/wavelet.png")
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "path/to/image.jpg"},
{"type": "image", "image": "path/to/wavelet.png"},
{"type": "text", "text": (
"You are an expert forensic analyst specializing in distinguishing "
"natural images from AI-generated images. You are given two images: "
"the original image and its frequency domain representation "
"(diagonal detail coefficients cD from wavelet transform). "
"Analyze them systematically.\n\n"
"## Output JSON Format:\n"
'```json\n{\n "conclusion": "real or fake",\n'
' "confidence": 0.0-1.0,\n'
' "reasoning": "Brief reasoning (max 64 words)."\n}\n```\n\n'
"Based on the original image and its frequency domain representation, "
"judge whether this image is real or fake."
)},
],
}
]
# ... same inference code as above
Serving with SGLang (High-Throughput)
# Start SGLang server
python -m sglang.launch_server \
--model-path Shimin/qwen3_vl_8b_foreagent \
--port 8001 \
--tp 1
# Call via OpenAI-compatible API
curl http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
{"type": "text", "text": "Is this image real or AI-generated? Output JSON: {\"conclusion\": \"real or fake\"}"}
]
}
],
"max_tokens": 1024
}'
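The same call can be issued from Python. The helper below only builds the OpenAI-compatible request body (the endpoint URL and `"default"` model name mirror the curl example above; actually posting it requires the SGLang server to be running).

```python
import json

# Matches the server started above; adjust host/port as needed.
SGLANG_URL = "http://localhost:8001/v1/chat/completions"

def build_forensics_request(image_url: str, max_tokens: int = 1024) -> dict:
    """Build an OpenAI-compatible chat payload for the forensics prompt."""
    return {
        "model": "default",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": (
                    "Is this image real or AI-generated? "
                    'Output JSON: {"conclusion": "real or fake"}'
                )},
            ],
        }],
        "max_tokens": max_tokens,
    }

payload = build_forensics_request("https://example.com/image.jpg")
print(json.dumps(payload)[:40])
# To send: requests.post(SGLANG_URL, json=payload).json()
```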
Detection Criteria
The model analyzes images across multiple forensic dimensions:
- Texture Analysis — skin smoothness, surface uniformity, edge transitions
- Anatomical Integrity — hand/finger correctness, facial symmetry, body proportions
- Physical Consistency — lighting coherence, shadow correctness, reflection plausibility
- Artifact Detection — compression anomalies, generation artifacts, blending errors
- Frequency Domain — wavelet coefficient distributions, spectral anomalies
- Semantic Coherence — object relationships, scene composition logic
Training Data
The model is trained on a mixture of:
- GenImage — diverse AI-generated images from multiple generators
- ProGAN — GAN-generated face images
- Iteratively refined through the EFA pipeline with dual-model quality gating
Training data undergoes label balancing (undersampling) and includes:
- Reflection samples — full reasoning traces for error correction
- Knowledge retention samples — conclusion-only samples to preserve existing capabilities
Intended Use
- AI-generated image detection and forensic analysis
- Deepfake detection in content moderation pipelines
- Research on multimodal reasoning for image authenticity verification
- Integration into agentic forensic workflows
Limitations
- Performance varies across different AI generation methods; GAN-generated faces may be harder to detect than diffusion-based generations depending on the specific generator
- Frequency-domain analysis (dual-image mode) improves accuracy but requires wavelet preprocessing
- The model outputs natural language reasoning which may occasionally be inconsistent with the final conclusion
- Detection accuracy may degrade on heavily compressed or low-resolution images
Technical Requirements
- GPU Memory: ~32 GB (float16) for inference
- Dependencies: `transformers`, `qwen-vl-utils`, and `PyWavelets` (imported as `pywt`) for wavelet preprocessing
- Serving: Compatible with SGLang, vLLM, and standard Transformers inference
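The Python dependencies can be installed from PyPI (package names as published there; `PyWavelets` provides the `pywt` module, and `peft` is needed if you load the LoRA adapter separately):

```shell
pip install transformers peft qwen-vl-utils PyWavelets
```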
Citation
@article{foreagent2025,
title={Perception, Verdict, and Evolution: Hindsight-Driven Self-Refining Forensics Agent for AI-Generated Image Detection},
author={Yangjun Wu and Keyu Yan and Yu Liu and Jingren Zhou and Fei Huang and Rong Zhang and Zhou Zhao and Fei Wu},
year={2026}
}
License
This model is released under the Apache 2.0 License.