AI Evaluators Arena
Choose which AI judge provides better evaluation of the output. This is a blind evaluation: judges' identities are hidden until after you make your selection.
Test Type
Select the type of test to evaluate
Review the AI output and verify that every claim it makes is grounded in the prompt.
| Judge Name | Elo Score | Wins | Losses | Total Evaluations |
|---|---|---|---|---|
| Meta Llama 4 Scout 17B 16E Instruct | 1724.8 | 40 | 28 | 44 |
Benchmark Type
Select the type of benchmark to view
Benchmark Dataset
Select the benchmark dataset to view
Benchmark Results
| Judge Name | F1 Score | Balanced Accuracy | Avg Latency (s) | Correct | Total |
|---|---|---|---|---|---|
Select a benchmark dataset to view results
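The benchmark table reports F1 and balanced accuracy for each judge. As a point of reference, here is a minimal sketch of how these metrics are typically computed from a judge's binary verdicts; the variable names, sample data, and scikit-learn usage are illustrative assumptions, not the platform's actual evaluation code.

```python
# Minimal sketch of the metrics shown in the benchmark table, assuming
# binary judge verdicts; names and data are illustrative placeholders.
from sklearn.metrics import f1_score, balanced_accuracy_score

labels   = [1, 0, 1, 1, 0, 0, 1, 0]  # ground-truth labels from a benchmark dataset
verdicts = [1, 0, 1, 0, 0, 1, 1, 0]  # the judge's predicted labels

print("F1 Score:         ", f1_score(labels, verdicts))
print("Balanced Accuracy:", balanced_accuracy_score(labels, verdicts))
print("Correct / Total:  ",
      sum(l == v for l, v in zip(labels, verdicts)), "/", len(labels))
```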
About AI Evaluators Arena
This platform allows you to evaluate and compare different AI judges in their ability to assess various types of content.
How it works
- Choose a test type from the dropdown
- Fill in the input fields or load a random example from our dataset
- Click "Evaluate" to get assessments from two randomly selected judges
- Choose which evaluation you think is better
- See which judge provided each evaluation
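For illustration, the blind flow above can be summarized with a short sketch; the judge names and the `evaluate` callback below are hypothetical placeholders, not the platform's actual API.

```python
import random

# Hypothetical sketch of the blind pairwise flow described above; judge names
# and the evaluate() callback are placeholders, not the platform's API.
JUDGES = ["judge_a", "judge_b", "judge_c", "judge_d"]

def run_blind_round(test_input, evaluate):
    # Randomly pick two distinct judges; their identities stay hidden
    # until the user has chosen the better evaluation.
    left, right = random.sample(JUDGES, 2)
    return {
        "left": evaluate(left, test_input),
        "right": evaluate(right, test_input),
        "hidden_identities": (left, right),  # revealed only after the vote
    }
```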
Test Types
- Grounding: Evaluate if a claim is grounded in a given text
- Prompt Injections: Detect attempts to manipulate or jailbreak the model
- Safety: Identify harmful, offensive, or dangerous content
- Policy: Determine if output complies with a given policy
Leaderboard
The leaderboard tracks judge performance using an Elo rating system; scores are updated after each human preference.
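As a reference for how such ratings move, here is a minimal sketch of a standard Elo update applied to a single human preference; the K-factor and function names are illustrative assumptions, not Qualifire's actual parameters.

```python
# Minimal sketch of a standard Elo update for one human preference;
# the K-factor of 32 is an illustrative assumption.
K = 32

def expected(rating_a, rating_b):
    # Expected probability that judge A's evaluation is preferred over B's.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_winner, rating_loser, k=K):
    # The winner gains and the loser drops by the same amount,
    # scaled by how surprising the outcome was.
    delta = k * (1.0 - expected(rating_winner, rating_loser))
    return rating_winner + delta, rating_loser - delta
```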
made with ❤️ by Qualifire