ForeAgent — Qwen3-VL-8B for Agentic Image Forensics

ForeAgent (Forensics Agent) is a fine-tuned Qwen3-VL-8B model for AI-generated image detection. It determines whether an image is real (authentic) or fake (AI-generated) through multi-view forensic analysis spanning semantic, frequency-domain, and spatial-domain features.

Paper: Perception, Verdict, and Evolution: Hindsight-Driven Self-Refining Forensics Agent for AI-Generated Image Detection

Highlights

  • 82.18% accuracy on Chameleon benchmark, outperforming AIDE by 16.41%
  • Competitive results on AIGCDetectBenchmark
  • Reasoning quality comparable to GPT-5 (per qualitative evaluations)
  • Dual-input analysis: original image + frequency-domain representation (wavelet cD)
  • Produces structured JSON output with conclusion, confidence, and reasoning

Model Details

| Attribute | Value |
|---|---|
| Base Model | Qwen/Qwen3-VL-8B |
| Fine-tuning | LoRA SFT with iterative self-refinement |
| LoRA Config | r=16, alpha=32, lr=3e-6 |
| Training Epochs | 2 per iteration, multiple iterations |
| Max Sequence Length | 1400 |
| Framework | Transformers + PEFT |
| Input | Image(s) + text prompt |
| Output | JSON: `{"conclusion": "real/fake", "confidence": 0.0-1.0, "reasoning": "..."}` |
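The LoRA hyperparameters in the table map directly onto a PEFT configuration. A minimal sketch; the `target_modules` and `lora_dropout` values below are illustrative assumptions (the adapter's `adapter_config.json` is authoritative), and the learning rate (3e-6) belongs to the optimizer, not the LoRA config:

```python
from peft import LoraConfig

# Sketch of the LoRA setup from the table above (r=16, alpha=32).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,  # assumed; not stated in this card
    # Assumed attention projections; check adapter_config.json for the
    # modules actually targeted during training.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```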

How It Works

Perception-Verdict Mechanism

ForeAgent aggregates multi-view cues for forensic analysis:

  1. Semantic features — the original image, analyzed for texture anomalies, anatomical integrity, physical consistency, and artifact detection
  2. Frequency-domain features — diagonal detail coefficients (cD) from wavelet transform, revealing spectral patterns characteristic of AI-generated images
  3. Spatial-domain features — noise pattern residuals (NPR) from a spatial expert model, detecting GAN-specific artifacts

An MLLM-based Critic fuses these multi-view signals to produce a logically grounded verdict.
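Concretely, the Critic's multi-view input can be assembled as a single chat message carrying all three views. A schematic sketch only; the exact prompt wording and the NPR image path are illustrative assumptions, not the published training prompt:

```python
def build_critic_message(original_path, wavelet_cd_path, npr_path, perception_notes):
    """Assemble a three-view message for the Critic (schematic sketch).

    The structure illustrates the semantic / frequency / spatial layout;
    the real prompt wording is not published in this card.
    """
    return {
        "role": "user",
        "content": [
            {"type": "image", "image": original_path},    # semantic view
            {"type": "image", "image": wavelet_cd_path},  # frequency view (wavelet cD)
            {"type": "image", "image": npr_path},         # spatial view (NPR residual)
            {"type": "text", "text": (
                "Perception findings:\n" + perception_notes +
                "\nFuse the semantic, frequency, and spatial evidence and output "
                'JSON: {"conclusion": "real or fake", "confidence": 0.0-1.0, '
                '"reasoning": "..."}'
            )},
        ],
    }
```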

Hindsight-Driven Self-Refining (EFA Pipeline)

The model is trained through an iterative Sampling → Reflection → Evolution loop:

Iteration N:
  1. Agent Inference    — Two-round reasoning (Perception + Critic) on data split N
  2. Sample Classification — Separate correct vs incorrect predictions
  3. Quality Assessment — Dual-model gating (Qwen3-VL-8B + Qwen3-VL-Plus)
  4. Reflection         — Guided by ground-truth, regenerate high-quality reasoning
  5. Training Data      — Merge reflections + knowledge retention, label-balanced
  6. LoRA Fine-tuning   — Update model weights
  7. Evaluation         — Test on held-out benchmarks
  → Next iteration uses the improved model
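The loop above can be sketched as an orchestration skeleton. Every helper here is a hypothetical stand-in, shown as a stub so the control flow is runnable; the real pipeline invokes the agent, the dual-model judges, and a LoRA trainer, and step 5's merging with knowledge retention and label balancing is simplified away:

```python
# Schematic EFA (Sampling -> Reflection -> Evolution) iteration.

def run_agent(model, sample):             # Perception + Critic rounds (stub)
    return {"conclusion": sample["label"], "reasoning": "stub trace"}

def passes_quality_gate(reasoning):       # dual-model gating (stub)
    return bool(reasoning)

def reflect(sample, bad_prediction):      # ground-truth-guided regeneration (stub)
    return {"conclusion": sample["label"], "reasoning": "corrected trace"}

def efa_iteration(model, data_split, finetune, evaluate):
    correct, incorrect = [], []
    for sample in data_split:                        # 1. agent inference
        pred = run_agent(model, sample)
        bucket = correct if pred["conclusion"] == sample["label"] else incorrect
        bucket.append((sample, pred))                # 2. sample classification
    train_data = []
    for sample, pred in correct:                     # 3. quality assessment
        if passes_quality_gate(pred["reasoning"]):
            train_data.append({**sample, "reasoning": pred["reasoning"]})
    for sample, pred in incorrect:                   # 4. reflection on errors
        train_data.append({**sample, **reflect(sample, pred)})
    model = finetune(model, train_data)              # 5-6. merge + LoRA update
    return model, evaluate(model)                    # 7. held-out evaluation
```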

Usage

Quick Start (Single Image)

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Note: this is a Qwen3-VL checkpoint, so use the Qwen3-VL model class.
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Shimin/qwen3_vl_8b_foreagent",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Shimin/qwen3_vl_8b_foreagent")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "text", "text": (
                "You are an expert forensic analyst specializing in distinguishing "
                "natural images from AI-generated images. Analyze the given image "
                "systematically.\n\n"
                "## Output JSON Format:\n"
                '```json\n{\n  "conclusion": "real or fake",\n'
                '  "confidence": 0.0-1.0,\n'
                '  "reasoning": "Brief reasoning (max 64 words)."\n}\n```\n\n'
                "Please analyze this image and determine if it is real or AI-generated."
            )},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)
output_text = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)[0]
print(output_text)
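The reply is JSON, sometimes wrapped in a ```json fence, so a small tolerant parser is convenient. A minimal sketch (the helper name is ours, not part of the model's API):

```python
import json
import re

def parse_verdict(output_text):
    """Extract the {"conclusion", "confidence", "reasoning"} object from the
    model's reply, tolerating an optional ```json ... ``` fence around it."""
    match = re.search(r"\{.*\}", output_text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    verdict = json.loads(match.group(0))
    if verdict.get("conclusion") not in ("real", "fake"):
        raise ValueError("unexpected conclusion: %r" % verdict.get("conclusion"))
    return verdict
```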

Dual-Image Mode (Original + Wavelet Frequency Domain)

For best performance, provide both the original image and its wavelet-transformed frequency-domain representation (diagonal detail coefficients cD):

import numpy as np
import pywt
from PIL import Image

def extract_wavelet_cd(image_path, output_size=256):
    """Extract diagonal detail coefficients (cD) via wavelet transform."""
    img = Image.open(image_path).convert("L")
    img_array = np.array(img, dtype=np.float64)
    coeffs = pywt.dwt2(img_array, "db1")
    _, (_, _, cD) = coeffs
    cD_normalized = np.clip((cD - cD.min()) / (cD.max() - cD.min() + 1e-8) * 255, 0, 255)
    cD_image = Image.fromarray(cD_normalized.astype(np.uint8))
    return cD_image.resize((output_size, output_size), Image.BILINEAR)

wavelet_image = extract_wavelet_cd("path/to/image.jpg")
wavelet_image.save("path/to/wavelet.png")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "image", "image": "path/to/wavelet.png"},
            {"type": "text", "text": (
                "You are an expert forensic analyst specializing in distinguishing "
                "natural images from AI-generated images. You are given two images: "
                "the original image and its frequency domain representation "
                "(diagonal detail coefficients cD from wavelet transform). "
                "Analyze them systematically.\n\n"
                "## Output JSON Format:\n"
                '```json\n{\n  "conclusion": "real or fake",\n'
                '  "confidence": 0.0-1.0,\n'
                '  "reasoning": "Brief reasoning (max 64 words)."\n}\n```\n\n'
                "Based on the original image and its frequency domain representation, "
                "judge whether this image is real or fake."
            )},
        ],
    }
]

# ... same inference code as above

Serving with SGLang (High-Throughput)

# Start SGLang server
python -m sglang.launch_server \
    --model-path Shimin/qwen3_vl_8b_foreagent \
    --port 8001 \
    --tp 1

# Call via OpenAI-compatible API
curl http://localhost:8001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "default",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                    {"type": "text", "text": "Is this image real or AI-generated? Output JSON: {\"conclusion\": \"real or fake\"}"}
                ]
            }
        ],
        "max_tokens": 1024
    }'
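The same endpoint can be called from Python with only the standard library. A sketch that builds the OpenAI-compatible payload; the server URL and image URL are placeholders, and the request itself is left commented out since it needs the server running:

```python
import json
from urllib import request

def build_payload(image_url, prompt):
    """OpenAI-compatible chat-completions payload for the SGLang endpoint."""
    return {
        "model": "default",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
        "max_tokens": 1024,
    }

payload = build_payload(
    "https://example.com/image.jpg",
    'Is this image real or AI-generated? Output JSON: {"conclusion": "real or fake"}',
)
req = request.Request(
    "http://localhost:8001/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = json.load(request.urlopen(req))  # uncomment with the server running
```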

Detection Criteria

The model analyzes images across multiple forensic dimensions:

  1. Texture Analysis — skin smoothness, surface uniformity, edge transitions
  2. Anatomical Integrity — hand/finger correctness, facial symmetry, body proportions
  3. Physical Consistency — lighting coherence, shadow correctness, reflection plausibility
  4. Artifact Detection — compression anomalies, generation artifacts, blending errors
  5. Frequency Domain — wavelet coefficient distributions, spectral anomalies
  6. Semantic Coherence — object relationships, scene composition logic

Training Data

The model is trained on a mixture of:

  • GenImage — diverse AI-generated images from multiple generators
  • ProGAN — GAN-generated face images
  • Iteratively refined through the EFA pipeline with dual-model quality gating

Training data undergoes label balancing (undersampling) and includes:

  • Reflection samples — full reasoning traces for error correction
  • Knowledge retention samples — conclusion-only samples to preserve existing capabilities
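Label balancing by undersampling can be sketched as follows. A minimal illustration of the idea, not the pipeline's actual code; the `conclusion` label key is an assumption:

```python
import random

def balance_labels(samples, label_key="conclusion", seed=0):
    """Undersample the majority class so all label groups have equal counts."""
    rng = random.Random(seed)
    by_label = {}
    for s in samples:
        by_label.setdefault(s[label_key], []).append(s)
    n = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))  # drop surplus majority samples
    rng.shuffle(balanced)
    return balanced
```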

Intended Use

  • AI-generated image detection and forensic analysis
  • Deepfake detection in content moderation pipelines
  • Research on multimodal reasoning for image authenticity verification
  • Integration into agentic forensic workflows

Limitations

  • Performance varies across different AI generation methods; GAN-generated faces may be harder to detect than diffusion-based generations depending on the specific generator
  • Frequency-domain analysis (dual-image mode) improves accuracy but requires wavelet preprocessing
  • The model outputs natural-language reasoning that may occasionally be inconsistent with the final conclusion
  • Detection accuracy may degrade on heavily compressed or low-resolution images

Technical Requirements

  • GPU Memory: ~32 GB (float16) for inference
  • Dependencies: transformers, qwen-vl-utils, pywt (PyWavelets) for wavelet preprocessing
  • Serving: Compatible with SGLang, vLLM, and standard Transformers inference

Citation

@article{foreagent2025,
  title={Perception, Verdict, and Evolution: Hindsight-Driven Self-Refining Forensics Agent for AI-Generated Image Detection},
  author={Yangjun Wu and Keyu Yan and Yu Liu and Jingren Zhou and Fei Huang and Rong Zhang and Zhou Zhao and Fei Wu},
  year={2026}
}

License

This model is released under the Apache 2.0 License.
