AI Evaluators Arena
Choose which AI judge provides better evaluation of the output. This is a blind evaluation: judges' identities are hidden until after you make your selection.
Test Type
Select the type of test to evaluate
Review the AI output and verify that every claim it makes is grounded in the prompt.
| Judge Name | Elo Score | Wins | Losses | Total Evaluations |
|---|---|---|---|---|
| Meta Llama 4 Scout 17B 16E Instruct | 1724.8 | 40 | 28 | 44 |
Benchmark Type
Select the type of benchmark to view
Benchmark Dataset
Select the benchmark dataset to view
Benchmark Results
| Judge Name | F1 Score | Balanced Accuracy | Avg Latency (s) | Correct | Total |
|---|---|---|---|---|---|
Select a benchmark dataset to view results
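The benchmark table reports F1 and balanced accuracy for each judge. As a point of reference, here is a minimal sketch of how these metrics are typically computed from a judge's binary verdicts; the variable names, sample data, and scikit-learn usage are illustrative assumptions, not the platform's actual evaluation code.

```python
# Minimal sketch of the metrics shown in the benchmark table, assuming
# binary judge verdicts; names and data are illustrative placeholders.
from sklearn.metrics import f1_score, balanced_accuracy_score

labels   = [1, 0, 1, 1, 0, 0, 1, 0]  # ground-truth labels from a benchmark dataset
verdicts = [1, 0, 1, 0, 0, 1, 1, 0]  # the judge's predicted labels

print("F1 Score:         ", f1_score(labels, verdicts))
print("Balanced Accuracy:", balanced_accuracy_score(labels, verdicts))
print("Correct / Total:  ",
      sum(l == v for l, v in zip(labels, verdicts)), "/", len(labels))
```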
About AI Evaluators Arena
This platform allows you to evaluate and compare different AI judges in their ability to assess various types of content.
How it works
- Choose a test type from the dropdown
- Fill in the input fields or load a random example from our dataset
- Click "Evaluate" to get assessments from two randomly selected judges
- Choose which evaluation you think is better
- See which judge provided each evaluation
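For illustration, the blind flow above can be summarized with a short sketch; the judge names and the `evaluate` callback below are hypothetical placeholders, not the platform's actual API.

```python
import random

# Hypothetical sketch of the blind pairwise flow described above; judge names
# and the evaluate() callback are placeholders, not the platform's API.
JUDGES = ["judge_a", "judge_b", "judge_c", "judge_d"]

def run_blind_round(test_input, evaluate):
    # Randomly pick two distinct judges; their identities stay hidden
    # until the user has chosen the better evaluation.
    left, right = random.sample(JUDGES, 2)
    return {
        "left": evaluate(left, test_input),
        "right": evaluate(right, test_input),
        "hidden_identities": (left, right),  # revealed only after the vote
    }
```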
Test Types
- Grounding: Evaluate if a claim is grounded in a given text
- Prompt Injections: Detect attempts to manipulate or jailbreak the model
- Safety: Identify harmful, offensive, or dangerous content
- Policy: Determine if output complies with a given policy
Leaderboard
The leaderboard tracks judge performance using an Elo rating system; scores are updated after each human preference.
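As a reference for how such ratings move, here is a minimal sketch of a standard Elo update applied to a single human preference; the K-factor and function names are illustrative assumptions, not Qualifire's actual parameters.

```python
# Minimal sketch of a standard Elo update for one human preference;
# the K-factor of 32 is an illustrative assumption.
K = 32

def expected(rating_a, rating_b):
    # Expected probability that judge A's evaluation is preferred over B's.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_winner, rating_loser, k=K):
    # The winner gains and the loser drops by the same amount,
    # scaled by how surprising the outcome was.
    delta = k * (1.0 - expected(rating_winner, rating_loser))
    return rating_winner + delta, rating_loser - delta
```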
made with ❤️ by Qualifire