ForeAgent — Qwen3-VL-8B for Agentic Image Forensics

ForeAgent (Forensics Agent) is a fine-tuned Qwen3-VL-8B model for AI-generated image detection. It determines whether an image is real (authentic) or fake (AI-generated) through multi-view forensic analysis spanning semantic, frequency-domain, and spatial-domain features.

Paper: Perception, Verdict, and Evolution: Hindsight-Driven Self-Refining Forensics Agent for AI-Generated Image Detection

Highlights

  • 82.18% accuracy on Chameleon benchmark, outperforming AIDE by 16.41%
  • Competitive results on AIGCDetectBenchmark
  • Reasoning quality comparable to GPT-5 (per qualitative evaluations)
  • Dual-input analysis: original image + frequency-domain representation (wavelet cD)
  • Produces structured JSON output with conclusion, confidence, and reasoning

Model Details

| Attribute | Value |
|---|---|
| Base Model | Qwen/Qwen3-VL-8B |
| Fine-tuning | LoRA SFT with iterative self-refinement |
| LoRA Config | r=16, alpha=32, lr=3e-6 |
| Training Epochs | 2 per iteration, multiple iterations |
| Max Sequence Length | 1400 |
| Framework | Transformers + PEFT |
| Input | Image(s) + text prompt |
| Output | JSON: `{"conclusion": "real/fake", "confidence": 0.0-1.0, "reasoning": "..."}` |
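The LoRA hyperparameters in the table map directly onto a PEFT configuration. A minimal sketch; the `target_modules` and `lora_dropout` values below are illustrative assumptions (the adapter's `adapter_config.json` is authoritative), and the learning rate (3e-6) belongs to the optimizer, not the LoRA config:

```python
from peft import LoraConfig

# Sketch of the LoRA setup from the table above (r=16, alpha=32).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,  # assumed; not stated in this card
    # Assumed attention projections; check adapter_config.json for the
    # modules actually targeted during training.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```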

How It Works

Perception-Verdict Mechanism

ForeAgent aggregates multi-view cues for forensic analysis:

  1. Semantic features — the original image, analyzed for texture anomalies, anatomical integrity, physical consistency, and artifact detection
  2. Frequency-domain features — diagonal detail coefficients (cD) from wavelet transform, revealing spectral patterns characteristic of AI-generated images
  3. Spatial-domain features — noise pattern residuals (NPR) from a spatial expert model, detecting GAN-specific artifacts

An MLLM-based Critic fuses these multi-view signals to produce a logically grounded verdict.
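Concretely, the Critic's multi-view input can be assembled as a single chat message carrying all three views. A schematic sketch only; the exact prompt wording and the NPR image path are illustrative assumptions, not the published training prompt:

```python
def build_critic_message(original_path, wavelet_cd_path, npr_path, perception_notes):
    """Assemble a three-view message for the Critic (schematic sketch).

    The structure illustrates the semantic / frequency / spatial layout;
    the real prompt wording is not published in this card.
    """
    return {
        "role": "user",
        "content": [
            {"type": "image", "image": original_path},    # semantic view
            {"type": "image", "image": wavelet_cd_path},  # frequency view (wavelet cD)
            {"type": "image", "image": npr_path},         # spatial view (NPR residual)
            {"type": "text", "text": (
                "Perception findings:\n" + perception_notes +
                "\nFuse the semantic, frequency, and spatial evidence and output "
                'JSON: {"conclusion": "real or fake", "confidence": 0.0-1.0, '
                '"reasoning": "..."}'
            )},
        ],
    }
```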

Hindsight-Driven Self-Refining (EFA Pipeline)

The model is trained through an iterative Sampling → Reflection → Evolution loop:

Iteration N:
  1. Agent Inference    — Two-round reasoning (Perception + Critic) on data split N
  2. Sample Classification — Separate correct vs incorrect predictions
  3. Quality Assessment — Dual-model gating (Qwen3-VL-8B + Qwen3-VL-Plus)
  4. Reflection         — Guided by ground-truth, regenerate high-quality reasoning
  5. Training Data      — Merge reflections + knowledge retention, label-balanced
  6. LoRA Fine-tuning   — Update model weights
  7. Evaluation         — Test on held-out benchmarks
  → Next iteration uses the improved model
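The loop above can be sketched as an orchestration skeleton. Every helper here is a hypothetical stand-in, shown as a stub so the control flow is runnable; the real pipeline invokes the agent, the dual-model judges, and a LoRA trainer, and step 5's merging with knowledge retention and label balancing is simplified away:

```python
# Schematic EFA (Sampling -> Reflection -> Evolution) iteration.

def run_agent(model, sample):             # Perception + Critic rounds (stub)
    return {"conclusion": sample["label"], "reasoning": "stub trace"}

def passes_quality_gate(reasoning):       # dual-model gating (stub)
    return bool(reasoning)

def reflect(sample, bad_prediction):      # ground-truth-guided regeneration (stub)
    return {"conclusion": sample["label"], "reasoning": "corrected trace"}

def efa_iteration(model, data_split, finetune, evaluate):
    correct, incorrect = [], []
    for sample in data_split:                        # 1. agent inference
        pred = run_agent(model, sample)
        bucket = correct if pred["conclusion"] == sample["label"] else incorrect
        bucket.append((sample, pred))                # 2. sample classification
    train_data = []
    for sample, pred in correct:                     # 3. quality assessment
        if passes_quality_gate(pred["reasoning"]):
            train_data.append({**sample, "reasoning": pred["reasoning"]})
    for sample, pred in incorrect:                   # 4. reflection on errors
        train_data.append({**sample, **reflect(sample, pred)})
    model = finetune(model, train_data)              # 5-6. merge + LoRA update
    return model, evaluate(model)                    # 7. held-out evaluation
```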

Usage

Quick Start (Single Image)

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Note: this is a Qwen3-VL checkpoint, so use the Qwen3-VL model class.
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Shimin/qwen3_vl_8b_foreagent",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Shimin/qwen3_vl_8b_foreagent")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "text", "text": (
                "You are an expert forensic analyst specializing in distinguishing "
                "natural images from AI-generated images. Analyze the given image "
                "systematically.\n\n"
                "## Output JSON Format:\n"
                '```json\n{\n  "conclusion": "real or fake",\n'
                '  "confidence": 0.0-1.0,\n'
                '  "reasoning": "Brief reasoning (max 64 words)."\n}\n```\n\n'
                "Please analyze this image and determine if it is real or AI-generated."
            )},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)
output_text = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)[0]
print(output_text)
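The reply is JSON, sometimes wrapped in a ```json fence, so a small tolerant parser is convenient. A minimal sketch (the helper name is ours, not part of the model's API):

```python
import json
import re

def parse_verdict(output_text):
    """Extract the {"conclusion", "confidence", "reasoning"} object from the
    model's reply, tolerating an optional ```json ... ``` fence around it."""
    match = re.search(r"\{.*\}", output_text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    verdict = json.loads(match.group(0))
    if verdict.get("conclusion") not in ("real", "fake"):
        raise ValueError("unexpected conclusion: %r" % verdict.get("conclusion"))
    return verdict
```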

Dual-Image Mode (Original + Wavelet Frequency Domain)

For best performance, provide both the original image and its wavelet-transformed frequency-domain representation (diagonal detail coefficients cD):

import numpy as np
import pywt
from PIL import Image

def extract_wavelet_cd(image_path, output_size=256):
    """Extract diagonal detail coefficients (cD) via wavelet transform."""
    img = Image.open(image_path).convert("L")
    img_array = np.array(img, dtype=np.float64)
    coeffs = pywt.dwt2(img_array, "db1")
    _, (_, _, cD) = coeffs
    cD_normalized = np.clip((cD - cD.min()) / (cD.max() - cD.min() + 1e-8) * 255, 0, 255)
    cD_image = Image.fromarray(cD_normalized.astype(np.uint8))
    return cD_image.resize((output_size, output_size), Image.BILINEAR)

wavelet_image = extract_wavelet_cd("path/to/image.jpg")
wavelet_image.save("path/to/wavelet.png")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "image", "image": "path/to/wavelet.png"},
            {"type": "text", "text": (
                "You are an expert forensic analyst specializing in distinguishing "
                "natural images from AI-generated images. You are given two images: "
                "the original image and its frequency domain representation "
                "(diagonal detail coefficients cD from wavelet transform). "
                "Analyze them systematically.\n\n"
                "## Output JSON Format:\n"
                '```json\n{\n  "conclusion": "real or fake",\n'
                '  "confidence": 0.0-1.0,\n'
                '  "reasoning": "Brief reasoning (max 64 words)."\n}\n```\n\n'
                "Based on the original image and its frequency domain representation, "
                "judge whether this image is real or fake."
            )},
        ],
    }
]

# ... same inference code as above

Serving with SGLang (High-Throughput)

# Start SGLang server
python -m sglang.launch_server \
    --model-path Shimin/qwen3_vl_8b_foreagent \
    --port 8001 \
    --tp 1

# Call via OpenAI-compatible API
curl http://localhost:8001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "default",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                    {"type": "text", "text": "Is this image real or AI-generated? Output JSON: {\"conclusion\": \"real or fake\"}"}
                ]
            }
        ],
        "max_tokens": 1024
    }'
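The same endpoint can be called from Python with only the standard library. A sketch that builds the OpenAI-compatible payload; the server URL and image URL are placeholders, and the request itself is left commented out since it needs the server running:

```python
import json
from urllib import request

def build_payload(image_url, prompt):
    """OpenAI-compatible chat-completions payload for the SGLang endpoint."""
    return {
        "model": "default",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
        "max_tokens": 1024,
    }

payload = build_payload(
    "https://example.com/image.jpg",
    'Is this image real or AI-generated? Output JSON: {"conclusion": "real or fake"}',
)
req = request.Request(
    "http://localhost:8001/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = json.load(request.urlopen(req))  # uncomment with the server running
```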

Detection Criteria

The model analyzes images across multiple forensic dimensions:

  1. Texture Analysis — skin smoothness, surface uniformity, edge transitions
  2. Anatomical Integrity — hand/finger correctness, facial symmetry, body proportions
  3. Physical Consistency — lighting coherence, shadow correctness, reflection plausibility
  4. Artifact Detection — compression anomalies, generation artifacts, blending errors
  5. Frequency Domain — wavelet coefficient distributions, spectral anomalies
  6. Semantic Coherence — object relationships, scene composition logic

Training Data

The model is trained on a mixture of:

  • GenImage — diverse AI-generated images from multiple generators
  • ProGAN — GAN-generated face images
  • Iteratively refined through the EFA pipeline with dual-model quality gating

Training data undergoes label balancing (undersampling) and includes:

  • Reflection samples — full reasoning traces for error correction
  • Knowledge retention samples — conclusion-only samples to preserve existing capabilities
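Label balancing by undersampling can be sketched as follows. A minimal illustration of the idea, not the pipeline's actual code; the `conclusion` label key is an assumption:

```python
import random

def balance_labels(samples, label_key="conclusion", seed=0):
    """Undersample the majority class so all label groups have equal counts."""
    rng = random.Random(seed)
    by_label = {}
    for s in samples:
        by_label.setdefault(s[label_key], []).append(s)
    n = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))  # drop surplus majority samples
    rng.shuffle(balanced)
    return balanced
```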

Intended Use

  • AI-generated image detection and forensic analysis
  • Deepfake detection in content moderation pipelines
  • Research on multimodal reasoning for image authenticity verification
  • Integration into agentic forensic workflows

Limitations

  • Performance varies across different AI generation methods; GAN-generated faces may be harder to detect than diffusion-based generations depending on the specific generator
  • Frequency-domain analysis (dual-image mode) improves accuracy but requires wavelet preprocessing
  • The model outputs natural-language reasoning that may occasionally be inconsistent with the final conclusion
  • Detection accuracy may degrade on heavily compressed or low-resolution images

Technical Requirements

  • GPU Memory: ~32 GB (float16) for inference
  • Dependencies: transformers, qwen-vl-utils, pywt (PyWavelets) for wavelet preprocessing
  • Serving: Compatible with SGLang, vLLM, and standard Transformers inference

Citation

@article{foreagent2025,
  title={Perception, Verdict, and Evolution: Hindsight-Driven Self-Refining Forensics Agent for AI-Generated Image Detection},
  author={Yangjun Wu and Keyu Yan and Yu Liu and Jingren Zhou and Fei Huang and Rong Zhang and Zhou Zhao and Fei Wu},
  year={2026}
}

License

This model is released under the Apache 2.0 License.
