AI Leaderboard
GridPonder levels are fully deterministic with verified gold paths, making them a clean benchmark for evaluating AI reasoning.
What is measured
Each turn the model sees the goal, the last action taken, the full list of valid actions, and the current board as a text grid, a rendered image, or both, depending on the selected modality. It can maintain a persistent memory string across turns, which is its only way to carry information forward between steps.
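The per-turn contract described above can be sketched as a small loop. This is only an illustration, not the benchmark runner's code: `Observation`, `Reply`, `play`, and the toy `counting_model` are hypothetical names, and the board is held fixed because no game engine is modelled here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Observation:
    """What the model sees on one turn (field names are assumptions)."""
    goal: str
    last_action: Optional[str]
    valid_actions: List[str]
    board_text: Optional[str]     # set in text and text+image modes
    board_image: Optional[bytes]  # set in image and text+image modes
    memory: str                   # the string the model wrote on the previous turn


@dataclass
class Reply:
    action: str
    memory: str  # replaces the previous memory string


def play(model: Callable[[Observation], Reply], goal: str, board_text: str,
         valid_actions: List[str], turns: int) -> Tuple[List[str], str]:
    """Run `turns` turns; only `memory` and `last_action` carry over."""
    memory, last_action, trace = "", None, []
    for _ in range(turns):
        obs = Observation(goal, last_action, list(valid_actions),
                          board_text, None, memory)
        reply = model(obs)
        memory, last_action = reply.memory, reply.action
        trace.append(reply.action)
    return trace, memory


def counting_model(obs: Observation) -> Reply:
    """Toy model: picks the first valid action and counts turns in memory."""
    turns_so_far = int(obs.memory or "0")
    return Reply(action=obs.valid_actions[0], memory=str(turns_so_far + 1))


trace, final_memory = play(counting_model, "reach the exit", "#..#",
                           ["move_right", "move_left"], turns=3)
```

The toy model can only know how many turns have passed because it wrote the count into its memory string; nothing else survives from one call to the next.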
How scores work
The headline Score is a 50/50 blend of solve-rate and path efficiency. Detailed column definitions live under each table below. Each attempt has an action budget of 2× the gold-path length; the total budget across all attempts is 3× the gold-path length.
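As a rough sketch of the two budgets, the helpers below compute how many actions a model may still take. The function names are hypothetical, and the assumption that a later attempt is cut short once the total budget runs out is mine, not something the page states.

```python
from typing import Tuple


def action_budgets(gold_len: int) -> Tuple[int, int]:
    """Return (per_attempt_budget, total_budget) for a level."""
    return 2 * gold_len, 3 * gold_len


def actions_left(gold_len: int, used_in_attempt: int, used_total: int) -> int:
    """Actions still allowed in the current attempt.

    Assumption: whichever budget is tighter (per-attempt or total) applies.
    """
    per_attempt, total = action_budgets(gold_len)
    return max(0, min(per_attempt - used_in_attempt, total - used_total))
```

For a level with a 10-step gold path, a fresh first attempt gets 20 actions; if it uses them all, a second attempt has at most 10 left under this reading.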
Which models
Local models (Gemma 4, Qwen 3.5, GPT-OSS) run via Ollama; API models (Claude, GPT, Gemini, Grok, MiniMax) run via LiteLLM. Vision-capable models can additionally be evaluated in image and text+image modes.
Overall Rankings
| # | Model | Type | Score | Accuracy | Efficiency | Cost | Score / $ |
|---|---|---|---|---|---|---|---|
Score: 0.5 × accuracy + 0.5 × efficiency per level (failed levels contribute 0); used for ranking.
Accuracy: fraction of levels solved within the action budget.
Efficiency: gold-path length ÷ actions used, on solved levels only (1.0 = optimal).
Cost: average USD per level (API models).
Score / $: aggregate score divided by cost per level; higher = better value (API models only).
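The column definitions above can be sketched in a few lines. `LevelResult`, `level_score`, and `summarize` are hypothetical names. Two points are assumptions: that the aggregate score is the mean of per-level scores, and that "actions used" is a single count per solved level.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LevelResult:
    solved: bool
    gold_len: int      # length of the verified gold path
    actions_used: int  # assumed: actions counted for the solve


def level_score(r: LevelResult) -> float:
    """0.5 * accuracy + 0.5 * efficiency for one level; 0 if unsolved."""
    if not r.solved:
        return 0.0
    return 0.5 * 1.0 + 0.5 * (r.gold_len / r.actions_used)


def summarize(results: List[LevelResult],
              cost_per_level: Optional[float] = None) -> Dict[str, Optional[float]]:
    """Aggregate a non-empty list of level results into the table columns."""
    n = len(results)
    solved = [r for r in results if r.solved]
    score = sum(level_score(r) for r in results) / n
    summary: Dict[str, Optional[float]] = {
        "score": score,
        "accuracy": len(solved) / n,
        # Efficiency is averaged over solved levels only.
        "efficiency": (sum(r.gold_len / r.actions_used for r in solved) / len(solved)
                       if solved else None),
        "score_per_dollar": score / cost_per_level if cost_per_level else None,
    }
    return summary


example = summarize(
    [LevelResult(True, 10, 10), LevelResult(True, 10, 20), LevelResult(False, 10, 30)],
    cost_per_level=0.05,
)
```

In this example the optimal solve scores 1.0, the solve that took twice the gold path scores 0.75, and the failed level scores 0.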
Cost vs. Score: API Models
Blue = standard, purple = thinking/reasoning. Log scale on x-axis. Hover any dot for details.
Difficulty Curve
Accuracy by gold-path length. Buckets are 2 steps wide below 10 and 5 steps wide from 10 onward. Shaded band is a 95% Wilson confidence interval on the average. Hover any point for sample size.
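A minimal sketch of the bucketing and the confidence band follows. The function names are hypothetical, and aligning the buckets to start at 0 and at 10 is an assumption; the interval itself is the standard Wilson score interval with z = 1.96 for 95%.

```python
import math
from typing import Tuple


def path_length_bucket(gold_len: int) -> Tuple[int, int]:
    """Half-open bucket [start, end): 2 wide below 10, 5 wide from 10 onward."""
    if gold_len < 10:
        start = (gold_len // 2) * 2
        return start, start + 2
    start = 10 + ((gold_len - 10) // 5) * 5
    return start, start + 5


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n == 0:
        return 0.0, 1.0  # no data: the interval is uninformative
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


low, high = wilson_interval(5, 10)
```

Unlike the plain normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly for the small sample sizes that sparse buckets produce.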
Grid Size vs Accuracy
Accuracy by playable-cell count (excludes void cells). Buckets are 5 wide below 30 and 20 wide from 30 onward. Shaded band is a 95% Wilson confidence interval on the average.
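The playable-cell count and its buckets can be sketched the same way. The `~` void marker and the row-of-strings grid representation are hypothetical, chosen only for illustration, and the bucket alignment (starting at 0 and at 30) is an assumption.

```python
from typing import List, Tuple

VOID = "~"  # hypothetical marker for void (non-playable) cells


def playable_cells(rows: List[str], void: str = VOID) -> int:
    """Count every cell that is not a void cell."""
    return sum(1 for row in rows for cell in row if cell != void)


def cell_count_bucket(count: int) -> Tuple[int, int]:
    """Half-open bucket [start, end): 5 wide below 30, 20 wide from 30 onward."""
    if count < 30:
        start = (count // 5) * 5
        return start, start + 5
    start = 30 + ((count - 30) // 20) * 20
    return start, start + 20


grid = ["..~", "...", "~~."]
size = playable_cells(grid)
```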
Game Analysis
How models cope, recover, and use memory. Reflects the inference mode selected above. Column definitions below the table.
Recovery: fraction of levels in which the model failed at least one attempt and then went on to succeed.
Rejection: fraction of LLM calls returning an action the engine refused (wrong shape, illegal target, etc.).
Resets/lvl: average number of resets per level, counting both auto-resets (per-attempt action limit hit) and voluntary give-ups.
Gave up: fraction of levels where the model used the give_up action at least once.
Mem write rate: fraction of LLM calls in which the model wrote a non-empty memory string. The final winning call on solved levels is excluded (its memory update is irrelevant because the game is over).
Mem chars: median character length of non-empty memory strings written.
State loops: fraction of turns whose resulting (board, inventory) hash had been seen earlier in the same attempt, i.e. the model cycled.
A dash in a cell means the metric isn't yet available; it requires a benchmark re-run with the latest runner instrumentation.
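Two of the metrics above, State loops and the memory columns, can be sketched as follows. All names are hypothetical and the hashing scheme is an assumption. The sketch also assumes that an attempt's starting state is not pre-seeded as "seen", and that Mem chars is taken over every non-empty string, including the final call.

```python
import hashlib
import statistics
from typing import List, Optional, Sequence, Tuple

State = Tuple[str, Tuple[str, ...]]  # (board text, inventory items)


def state_hash(board: str, inventory: Tuple[str, ...]) -> str:
    """Stable hash of a (board, inventory) pair."""
    return hashlib.sha256(repr((board, inventory)).encode("utf-8")).hexdigest()


def state_loop_rate(attempts: Sequence[Sequence[State]]) -> float:
    """Fraction of turns whose resulting state was already seen in that attempt."""
    looped = total = 0
    for states in attempts:
        seen = set()  # reset per attempt: loops are counted within one attempt
        for board, inventory in states:
            h = state_hash(board, inventory)
            if h in seen:
                looped += 1
            seen.add(h)
            total += 1
    return looped / total if total else 0.0


def memory_stats(memories: List[str], solved: bool) -> Tuple[float, Optional[float]]:
    """Return (write rate, median length of non-empty memories) for one level."""
    # The final winning call is excluded from the write rate on solved levels.
    calls = memories[:-1] if solved else memories
    rate = sum(1 for m in calls if m) / len(calls) if calls else 0.0
    lengths = [len(m) for m in memories if m]
    return rate, (statistics.median(lengths) if lengths else None)


loop_rate = state_loop_rate([
    [("A", ()), ("B", ()), ("A", ())],  # third turn returns to board "A"
    [("A", ())],                        # new attempt: "A" is not a repeat here
])
write_rate, median_chars = memory_stats(["abc", "", "hello", "final"], solved=True)
```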
Breakdown by Pack
| Model | box builder | carrot quest | diagonal swipes | flood colors | number cells | rotate flip | twinseed |
|---|---|---|---|---|---|---|---|
Values show accuracy (% of levels solved). A marker symbol indicates that thinking/reasoning is enabled.
Running the benchmark
The benchmark tool lives in tools/benchmark/.
It requires Ollama for local models and API keys in .env (copy from .env.example) for cloud models.
make benchmark-build
cd tools/benchmark && python3 -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt
cd tools/benchmark && bash run_local_nothink.sh
make benchmark-suite
make benchmark
make benchmark-agg