🏆 AI Benchmarks

AI Leaderboard

GridPonder levels are fully deterministic with verified gold paths, making them a clean benchmark for evaluating AI reasoning.

🧩

What is measured

Each turn the model sees the goal, the last action taken, the full list of valid actions, and the current board — as a text grid, a rendered image, or both, depending on the selected modality. It can maintain a persistent memory string across turns — its only way to carry information forward between steps.
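As a rough illustration of the per-turn loop described above, the sketch below threads a memory string through successive turns. All names here (`Observation`, `run_turns`, `counting_policy`) are hypothetical and do not come from the GridPonder runner; only the text-grid modality is modelled, and the boards are supplied up front rather than produced by a game engine.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Observation:
    """What the model sees on one turn (field names are hypothetical)."""
    goal: str
    last_action: str           # empty string on the first turn
    valid_actions: List[str]
    board_text: str            # text-grid modality only in this sketch
    memory: str                # string the model wrote on the previous turn


# A policy maps an observation to (chosen action, new memory string).
Policy = Callable[[Observation], Tuple[str, str]]


def run_turns(policy: Policy, goal: str, boards: List[str],
              valid_actions: List[str]) -> List[Tuple[str, str]]:
    """Feed a fixed sequence of boards to the policy, carrying memory forward."""
    memory, last_action = "", ""
    trace = []
    for board in boards:
        obs = Observation(goal, last_action, valid_actions, board, memory)
        last_action, memory = policy(obs)
        trace.append((last_action, memory))
    return trace


def counting_policy(obs: Observation) -> Tuple[str, str]:
    """Toy policy: picks the first valid action and counts turns in memory."""
    turns_so_far = int(obs.memory or "0")
    return obs.valid_actions[0], str(turns_so_far + 1)


trace = run_turns(counting_policy, "reach the exit",
                  ["A..", ".A.", "..A"], ["right", "left"])
print(trace)
```

The toy policy can only know how many turns have passed because it wrote that count into memory on the previous turn, which is the point of the memory string: nothing else persists between calls.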

📊

How scores work

The headline Score is a 50/50 blend of solve rate and path efficiency. Detailed column definitions live under each table below. Each attempt has an action budget of 2× the gold-path length, and the total across all attempts is capped at 3× the gold-path length.
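The two budget limits can be sketched as follows. The function names are hypothetical, and the way the limits interact (a later attempt gets whatever is left of the total, up to the per-attempt cap) is an assumption drawn from the description above, not from the runner's code.

```python
from typing import Tuple


def action_budgets(gold_len: int) -> Tuple[int, int]:
    """Return (per-attempt budget, total budget) for a level."""
    return 2 * gold_len, 3 * gold_len


def next_attempt_budget(gold_len: int, actions_used_so_far: int) -> int:
    """Actions available to the next attempt (assumed interaction of both caps)."""
    per_attempt, total = action_budgets(gold_len)
    return max(0, min(per_attempt, total - actions_used_so_far))


print(action_budgets(10))
print(next_attempt_budget(10, 0))
print(next_attempt_budget(10, 20))
```

Under this reading, a model that burns a full first attempt on a 10-step level has only 10 actions left, so a second attempt must follow the gold path exactly to succeed.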

🤖

Which models

Local models (Gemma 4, Qwen 3.5, GPT-OSS) run via Ollama; API models (Claude, GPT, Gemini, Grok, MiniMax) run via LiteLLM. Vision-capable models can additionally be evaluated in image and text+image modes.

Overall Rankings

Updated 25 Apr 2026

#	Model	Type	Score	Accuracy	Efficiency	Cost	Score / $

Score — 0.5 × accuracy + 0.5 × efficiency per level (failed levels contribute 0); used for ranking. · Accuracy — fraction of levels solved within the action budget. · Efficiency — gold path length ÷ actions used, on solved levels only (1.0 = optimal). · Cost — average USD per level (API models). · Score / $ — aggregate score divided by cost per level; higher = better value (API models only).
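A minimal sketch of how these columns could be computed from per-level results. The `LevelResult` type and its field names are hypothetical. It assumes that a solved level counts as accuracy 1 in its per-level score, and that "actions used" is a single count per level; the text does not say whether actions from failed attempts are included.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LevelResult:
    gold_len: int                     # length of the verified gold path
    actions_used: int                 # actions the model spent (assumed definition)
    solved: bool
    cost_usd: Optional[float] = None  # None for local models


def level_score(r: LevelResult) -> float:
    """0.5 * accuracy + 0.5 * efficiency for one level; failed levels score 0."""
    if not r.solved:
        return 0.0
    return 0.5 + 0.5 * (r.gold_len / r.actions_used)


def summarize(results: List[LevelResult]) -> Dict[str, Optional[float]]:
    n = len(results)
    solved = [r for r in results if r.solved]
    costs = [r.cost_usd for r in results if r.cost_usd is not None]
    score = sum(level_score(r) for r in results) / n
    cost = sum(costs) / len(costs) if costs else None
    return {
        "score": score,
        "accuracy": len(solved) / n,
        # Efficiency is averaged over solved levels only.
        "efficiency": (sum(r.gold_len / r.actions_used for r in solved) / len(solved)
                       if solved else 0.0),
        "cost": cost,
        "score_per_dollar": score / cost if cost else None,
    }


results = [
    LevelResult(10, 10, True, 0.02),   # optimal solve
    LevelResult(10, 20, True, 0.02),   # solved with twice the gold-path length
    LevelResult(10, 30, False, 0.02),  # failed
    LevelResult(8, 16, True, 0.02),
]
print(summarize(results))
```

Because failed levels contribute 0 to Score while Efficiency is averaged over solved levels only, the headline Score is not simply the mean of the Accuracy and Efficiency columns.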

Cost vs. Score — API Models

Blue = standard, purple = thinking/reasoning. Log scale on x-axis.

Difficulty Curve

Accuracy by gold-path length. Buckets are 2 steps wide below 10 and 5 steps wide from 10 onward. Shaded band is a 95% Wilson confidence interval on the average.
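For reference, a Wilson score interval for a solve rate can be computed as in the sketch below. This is the standard formula, not code taken from the benchmark tool; how the page pools results across models before computing the band is not stated, so the inputs here are simply a count of solved levels and a count of attempted levels in one bucket.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int,
                    z: float = 1.959963984540054) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z defaults to 95%)."""
    if n == 0:
        return 0.0, 1.0  # no data: the interval is uninformative
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


low, high = wilson_interval(8, 10)
print(round(low, 4), round(high, 4))
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and remains sensible for small buckets or solve rates near 0% or 100%, which is the usual reason to prefer it for sparse per-bucket data.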


Grid Size vs Accuracy

Accuracy by playable-cell count (excludes void cells). Buckets are 5 wide below 30 and 20 wide from 30 onward. Shaded band is a 95% Wilson confidence interval on the average.
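The two-width bucketing used by both charts can be sketched with one helper. The function names are hypothetical, and the assumption that narrow buckets are aligned to 0 and wide buckets start exactly at the split point is mine; the page only states the widths.

```python
from typing import Tuple


def bucket_range(value: int, split: int, narrow: int, wide: int) -> Tuple[int, int]:
    """Return the inclusive (low, high) bucket containing value."""
    if value < split:
        low = (value // narrow) * narrow
        return low, low + narrow - 1
    low = split + ((value - split) // wide) * wide
    return low, low + wide - 1


def gold_path_bucket(steps: int) -> Tuple[int, int]:
    """2 steps wide below 10, 5 steps wide from 10 onward."""
    return bucket_range(steps, split=10, narrow=2, wide=5)


def grid_size_bucket(playable_cells: int) -> Tuple[int, int]:
    """5 cells wide below 30, 20 cells wide from 30 onward."""
    return bucket_range(playable_cells, split=30, narrow=5, wide=20)


print(gold_path_bucket(7), gold_path_bucket(12))
print(grid_size_bucket(28), grid_size_bucket(45))
```

Wider buckets at the high end trade resolution for sample size, which keeps the confidence band from becoming too wide where few levels exist.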


Game Analysis

How models cope, recover, and use memory. Reflects the inference mode selected above. Column definitions below the table.

Recovery — fraction of levels in which the model failed at least one attempt and then went on to succeed.

Rejection — fraction of LLM calls returning an action the engine refused (wrong shape, illegal target, etc.).

Resets/lvl — average number of resets per level, counting both auto-resets (per-attempt action limit hit) and voluntary give-ups.

Gave up — fraction of levels where the model used the give_up action at least once.

Mem write rate — fraction of LLM calls in which the model wrote a non-empty memory string. The final winning call on solved levels is excluded (its memory update is irrelevant — the game is over).

Mem chars — median character length of non-empty memory strings written.

State loops — fraction of turns whose resulting (board, inventory) hash had been seen earlier in the same attempt — i.e. the model cycled.

A dash (—) means the metric isn't yet available — it requires a benchmark re-run with the latest runner instrumentation.
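The State loops metric defined above can be illustrated with a short sketch for a single attempt. The helper names and the state representation (a board string plus an inventory tuple) are hypothetical, as is the choice of SHA-256 for hashing. The sketch also assumes that the attempt's starting state counts as already seen, which the definition does not specify.

```python
import hashlib
from typing import Iterable, Tuple

State = Tuple[str, Tuple[str, ...]]  # (board text, inventory items)


def state_hash(board: str, inventory: Tuple[str, ...]) -> str:
    """Stable hash of a (board, inventory) pair."""
    payload = repr((board, tuple(inventory))).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def state_loop_fraction(initial: State, states_after_turns: Iterable[State]) -> float:
    """Fraction of turns whose resulting state was seen earlier in the attempt."""
    seen = {state_hash(*initial)}  # assumption: the starting state counts as seen
    repeats = 0
    turns = 0
    for board, inventory in states_after_turns:
        h = state_hash(board, inventory)
        turns += 1
        if h in seen:
            repeats += 1
        seen.add(h)
    return repeats / turns if turns else 0.0


attempt = [
    (".A", ()),        # new state
    ("A.", ()),        # back to the starting state: a loop
    (".A", ()),        # seen on turn 1: a loop
    (".A", ("key",)),  # same board, new inventory: not a loop
]
print(state_loop_fraction(("A.", ()), attempt))
```

Including the inventory in the hash matters: returning to the same board position after picking up an item is progress, not a cycle.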

Breakdown by Pack

Model	box builder	carrot quest	diagonal swipes	flood colors	number cells	rotate flip	twinseed

Values show accuracy (% levels solved). ✦ = thinking/reasoning enabled.

Running the benchmark

The benchmark tool lives in tools/benchmark/. It requires Ollama for local models and API keys in .env (copy from .env.example) for cloud models.

Build the runner (once):

    make benchmark-build

Install Python deps:

    cd tools/benchmark && python3 -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt

Run local non-think (5 combos):

    cd tools/benchmark && bash run_local_nothink.sh

Run curated suite (all models):

    make benchmark-suite

Run all packs × all models:

    make benchmark

Aggregate results:

    make benchmark-agg