Evals
Run a question across multiple models and have a judge model rate each response pass/fail.
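The flow described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: `run_eval`, the `Model` and `Judge` callable types, and the stub functions in the usage note are all hypothetical names introduced here, and real model and judge calls would go through whatever API client is in use.

```python
from typing import Callable, Dict

# Hypothetical types: a model maps a prompt to a response string;
# a judge maps (prompt, rubric, response) to True for pass, False for fail.
Model = Callable[[str], str]
Judge = Callable[[str, str, str], bool]


def run_eval(
    prompt: str,
    rubric: str,
    models: Dict[str, Model],
    judge: Judge,
) -> Dict[str, bool]:
    """Send the same prompt to every model and have the judge rate each response."""
    results: Dict[str, bool] = {}
    for name, ask in models.items():
        response = ask(prompt)
        results[name] = judge(prompt, rubric, response)
    return results
```

For example, with stub callables standing in for real models and a judge, `run_eval(prompt, rubric, {"a": stub_model}, stub_judge)` returns a dictionary mapping each model name to its pass/fail verdict.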
Question
Prompt
How many r's are in the word strawberry?
Rubric (sent to judge)
The correct answer is 3. The response passes if its final stated count is 3, regardless of how it explains the reasoning. It fails if it states any other number (most commonly 2). Extra commentary, listing positions, or step-by-step reasoning is fine as long as the final answer is 3.
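As a rough illustration of what this rubric asks for, the check can be approximated in code. This is only a sketch under stated assumptions: the helper names `count_letter`, `final_count`, and `passes` are hypothetical, and treating the last integer in the response as the "final stated count" is a crude heuristic (it misses answers spelled out as words, such as "three"), which is part of why a judge model is used instead.

```python
import re
from typing import Optional


def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


def final_count(response: str) -> Optional[int]:
    """Return the last integer in the response, or None if there is none.

    Assumption: the last number mentioned is the final stated answer.
    """
    numbers = re.findall(r"\d+", response)
    return int(numbers[-1]) if numbers else None


def passes(response: str, expected: int = 3) -> bool:
    """Pass only if the final stated count equals the expected answer."""
    return final_count(response) == expected
```

Here `count_letter("strawberry", "r")` confirms the expected answer of 3, and `passes(...)` accepts a response with step-by-step reasoning as long as the last number it states is 3.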
Models
Moonshot AI
Qwen
DeepSeek
Google
MiniMax
OpenAI
Z.AI
Anthropic
xAI
0 models selected · judge: Claude Sonnet 4.6