LLM Playground

Evals

Run a question across multiple models and have a judge model rate each response pass/fail.
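
The flow described above can be sketched as a simple loop. This is an illustration only, not the playground's implementation: `run_eval`, the `models` mapping, and the `judge` callable are hypothetical names, and the model and judge here are stand-in functions rather than real API calls.

```python
from typing import Callable, Dict

# Hypothetical types: a model takes a prompt and returns a response string;
# a judge takes (prompt, rubric, response) and returns True for pass.
Model = Callable[[str], str]
Judge = Callable[[str, str, str], bool]

def run_eval(prompt: str, rubric: str,
             models: Dict[str, Model], judge: Judge) -> Dict[str, bool]:
    """Send the same prompt to every model and let the judge grade each reply."""
    results = {}
    for name, ask in models.items():
        response = ask(prompt)
        results[name] = judge(prompt, rubric, response)
    return results

# Stand-in models and judge, for illustration only.
stub_models = {
    "model-a": lambda prompt: "There are 3 r's.",
    "model-b": lambda prompt: "There are 2 r's.",
}
stub_judge = lambda prompt, rubric, response: "3" in response

verdicts = run_eval("How many r's are in the word strawberry?",
                    "The correct answer is 3.", stub_models, stub_judge)
```

In a real run the judge would itself be a model call that receives the rubric text. The stub judge above only checks for the character "3".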

Question

Prompt

How many r's are in the word strawberry?

Rubric (sent to judge)

The correct answer is 3. The response passes if its final stated count is 3, regardless of how it explains the reasoning. It fails if it states any other number (most commonly 2). Extra commentary, listing positions, or step-by-step reasoning is fine as long as the final answer is 3.
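
As a rough cross-check on the rubric, the expected count and a pass/fail rule can be sketched in a few lines. This is an assumption-laden approximation, not how the judge model works: the helper names are hypothetical, and the "final stated count" is taken to be the last number written in digits, which misreads answers that spell the number out or list letter positions after the count.

```python
import re
from typing import Optional

def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())

def final_stated_count(response: str) -> Optional[int]:
    """Heuristic: treat the last run of digits in the response as the final answer."""
    numbers = re.findall(r"\d+", response)
    return int(numbers[-1]) if numbers else None

def passes(response: str, expected: int = 3) -> bool:
    """Pass only if the final stated count equals the expected count."""
    return final_stated_count(response) == expected

expected = count_letter("strawberry", "r")
```

A response such as "There are 2... wait, final answer: 3" passes under this heuristic, while "There are 2 r's." fails. The heuristic's blind spots are one reason to grade with a judge model and a written rubric instead.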

Models

Moonshot AI
Qwen
DeepSeek
Google
MiniMax
OpenAI
Z.AI
Anthropic
xAI

0 models selected · judge: Claude Sonnet 4.6