🏆 AI Benchmarks

AI Leaderboard

GridPonder levels are fully deterministic with verified gold paths, making them a clean benchmark for evaluating AI reasoning.

🧩

What is measured

Each turn the model sees the goal, the last action taken, the full list of valid actions, and the current board, shown as a text grid, a rendered image, or both, depending on the selected modality. It can also maintain a persistent memory string across turns, which is its only way to carry information forward between steps.
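As a rough illustration of this turn loop, the Python sketch below plays one attempt. The Observation fields mirror the inputs listed above, but every name here (Observation, Corridor, play, counting_policy) is hypothetical and not taken from the GridPonder runner; the toy corridor level only stands in for a real board.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Observation:
    """What the model sees on one turn (field names are assumptions)."""
    goal: str
    last_action: Optional[str]
    valid_actions: List[str]
    board_text: Optional[str]     # set in text and text+image modes
    board_image: Optional[bytes]  # set in image and text+image modes
    memory: str                   # the string the model wrote last turn


class Corridor:
    """Toy stand-in for a level: walk from cell 0 to the last cell."""

    def __init__(self, length: int = 4):
        self.length, self.pos = length, 0

    def board_text(self) -> str:
        return "".join("@" if i == self.pos else "." for i in range(self.length))

    def valid_actions(self) -> List[str]:
        return ["left", "right"]

    def apply(self, action: str) -> None:
        delta = 1 if action == "right" else -1
        self.pos = min(self.length - 1, max(0, self.pos + delta))

    def solved(self) -> bool:
        return self.pos == self.length - 1


def play(level, policy, max_actions: int) -> Tuple[Optional[int], str]:
    """Run one attempt; return (actions used or None if unsolved, final memory)."""
    memory, last_action = "", None
    for used in range(1, max_actions + 1):
        obs = Observation(
            goal="reach the rightmost cell",
            last_action=last_action,
            valid_actions=level.valid_actions(),
            board_text=level.board_text(),
            board_image=None,  # text-only modality in this sketch
            memory=memory,
        )
        # The policy returns an action and the memory string for the next turn.
        action, memory = policy(obs)
        level.apply(action)
        last_action = action
        if level.solved():
            return used, memory
    return None, memory


def counting_policy(obs: Observation) -> Tuple[str, str]:
    """Always move right; use memory to count the steps taken so far."""
    steps = int(obs.memory or 0) + 1
    return "right", str(steps)


print(play(Corridor(4), counting_policy, max_actions=6))  # (3, '3')
```

The policy receives nothing but the observation, so the step counter survives between turns only because it is written back into the memory string.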

📊

How scores work

The headline Score is a 50/50 blend of solve-rate and path efficiency. Detailed column definitions live under each table below. Each attempt has an action budget of 2× the gold-path length; the total budget across all attempts on a level is 3× the gold-path length.
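The budget arithmetic can be worked through in a few lines of Python. This is a sketch under one assumption that the text does not spell out: that a later attempt is capped by whichever is smaller, the per-attempt budget or what remains of the total. The function names are illustrative only.

```python
from typing import List, Tuple


def action_budgets(gold_path_length: int) -> Tuple[int, int]:
    """Return (per-attempt budget, total budget across attempts)."""
    return 2 * gold_path_length, 3 * gold_path_length


def attempt_caps(gold_path_length: int) -> List[int]:
    """Cap of each successive attempt if every attempt exhausts its cap.

    Assumption: an attempt may use min(per-attempt budget, remaining total).
    """
    per_attempt, remaining = action_budgets(gold_path_length)
    caps = []
    while remaining > 0:
        cap = min(per_attempt, remaining)
        caps.append(cap)
        remaining -= cap
    return caps


print(action_budgets(10))  # (20, 30)
print(attempt_caps(10))    # [20, 10]
```

Under that reading, a level with a 10-step gold path allows a full 20-action first attempt, after which only 10 actions remain, exactly enough for an optimal second attempt.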

🤖

Which models

Local models (Gemma 4, Qwen 3.5, GPT-OSS) run via Ollama; API models (Claude, GPT, Gemini, Grok, MiniMax) run via LiteLLM. Vision-capable models can additionally be evaluated in image and text+image modes.

Overall Rankings

Updated 25 Apr 2026
# | Model | Type | Score | Accuracy | Efficiency | Cost | Score / $

Score: 0.5 × accuracy + 0.5 × efficiency per level (failed levels contribute 0); used for ranking.

Accuracy: fraction of levels solved within the action budget.

Efficiency: gold path length ÷ actions used, on solved levels only (1.0 = optimal).

Cost: average USD per level (API models).

Score / $: aggregate score divided by cost per level; higher = better value (API models only).
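To make the column definitions concrete, here is a minimal Python sketch that computes them from per-level results. The LevelResult record and its field names are hypothetical, and "actions used" is taken to be a single count per level; the actual runner may count actions differently (for example, across attempts).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LevelResult:
    gold_length: int                  # length of the verified gold path
    actions_used: int                 # actions the model spent on this level
    solved: bool
    cost_usd: Optional[float] = None  # None for local models


def level_score(r: LevelResult) -> float:
    """0.5 * accuracy + 0.5 * efficiency for one level; failed levels score 0."""
    if not r.solved:
        return 0.0
    return 0.5 * 1.0 + 0.5 * (r.gold_length / r.actions_used)


def summarize(results: List[LevelResult]) -> dict:
    solved = [r for r in results if r.solved]
    score = sum(level_score(r) for r in results) / len(results)
    accuracy = len(solved) / len(results)
    # Efficiency is averaged over solved levels only.
    efficiency = (
        sum(r.gold_length / r.actions_used for r in solved) / len(solved)
        if solved else None
    )
    costs = [r.cost_usd for r in results if r.cost_usd is not None]
    cost = sum(costs) / len(costs) if costs else None
    return {
        "score": score,
        "accuracy": accuracy,
        "efficiency": efficiency,
        "cost": cost,
        "score_per_dollar": score / cost if cost else None,
    }


results = [
    LevelResult(10, 10, True, 0.02),   # optimal solve: level score 1.0
    LevelResult(10, 20, True, 0.02),   # twice the gold path: 0.75
    LevelResult(10, 30, False, 0.02),  # failed: 0.0
    LevelResult(8, 16, True, 0.02),    # 0.75
]
summary = summarize(results)
print(summary["score"], summary["accuracy"])  # 0.625 0.75
```

Because failed levels contribute 0 to Score but are left out of Efficiency, a model can show high Efficiency with a low Score if it solves only a few levels, and solves those cleanly.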

Cost vs. Score: API Models

Blue = standard, purple = thinking/reasoning. Log scale on x-axis. Hover any dot for details.

Difficulty Curve

Accuracy by gold-path length. Buckets are 2 steps wide below 10 and 5 steps wide from 10 onward. Shaded band is a 95% Wilson confidence interval on the average. Hover any point for sample size.
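For readers unfamiliar with the shaded band, the following Python sketch computes a 95% Wilson score interval for a solve rate. It uses the standard Wilson formula; treating each level in a bucket as an independent pass/fail trial is an assumption about how the page pools its data, and the function name is illustrative.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n == 0:
        return 0.0, 1.0  # no data: the interval is uninformative
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


lo, hi = wilson_interval(8, 10)
print(round(lo, 3), round(hi, 3))  # 0.49 0.943
```

Unlike the naive p ± z·sqrt(p(1−p)/n) interval, the Wilson interval stays inside [0, 1] and remains sensible for small buckets and for solve rates near 0 or 1, which is presumably why it suits sparse long-path buckets.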


Grid Size vs Accuracy

Accuracy by playable-cell count (excludes void cells). Buckets are 5 wide below 30 and 20 wide from 30 onward. Shaded band is a 95% Wilson confidence interval on the average.
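The two-width bucketing can be sketched in Python as follows. The bucket alignment (fine buckets starting at 0, coarse buckets starting at the threshold) and the "#" void marker are assumptions made for illustration; the page does not state either.

```python
from typing import List, Tuple


def bucket(value: int, fine_width: int, coarse_width: int, threshold: int) -> Tuple[int, int]:
    """Return the half-open [lo, hi) bucket containing value.

    Values below threshold use fine_width; values at or above it use coarse_width.
    """
    if value < threshold:
        lo = (value // fine_width) * fine_width
        return lo, min(lo + fine_width, threshold)
    lo = threshold + ((value - threshold) // coarse_width) * coarse_width
    return lo, lo + coarse_width


def playable_cells(rows: List[str], void: str = "#") -> int:
    """Count cells that are not void (the void marker is a hypothetical choice)."""
    return sum(1 for row in rows for cell in row if cell != void)


grid = ["..#", "#.."]
cells = playable_cells(grid)
print(cells, bucket(cells, 5, 20, 30))  # 4 (0, 5)
print(bucket(47, 5, 20, 30))            # (30, 50)
```

The same helper covers a scheme that is 2 wide below 10 and 5 wide from 10 onward by calling bucket(length, 2, 5, 10).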


Game Analysis

How models cope, recover, and use memory. Reflects the inference mode selected above. Column definitions below the table.

Recovery: fraction of levels in which the model failed at least one attempt and then went on to succeed.

Rejection: fraction of LLM calls returning an action the engine refused (wrong shape, illegal target, etc.).

Resets/lvl: average number of resets per level, counting both auto-resets (per-attempt action limit hit) and voluntary give-ups.

Gave up: fraction of levels where the model used the give_up action at least once.

Mem write rate: fraction of LLM calls in which the model wrote a non-empty memory string. The final winning call on solved levels is excluded (its memory update is irrelevant because the game is over).

Mem chars: median character length of non-empty memory strings written.

State loops: fraction of turns whose resulting (board, inventory) hash had been seen earlier in the same attempt, i.e. the model cycled.

A dash in a cell means the metric isn't yet available; it requires a benchmark re-run with the latest runner instrumentation.
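Two of these columns, State loops and the memory metrics, are easy to misread, so here is a hedged Python sketch of how they could be computed from a per-attempt log. The log shapes, the function names, the choice of SHA-256 over a JSON encoding, and counting the starting state as already seen are all assumptions, not the runner's actual instrumentation.

```python
import hashlib
import json
from statistics import median
from typing import List, Optional, Tuple

State = Tuple[List[str], List[str]]  # (board rows, inventory items)


def state_hash(board: List[str], inventory: List[str]) -> str:
    """Stable hash of a (board, inventory) pair; inventory order is ignored."""
    payload = json.dumps([board, sorted(inventory)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def state_loop_fraction(initial: State, after_each_turn: List[State]) -> float:
    """Fraction of turns whose resulting state was already seen this attempt."""
    if not after_each_turn:
        return 0.0
    seen = {state_hash(*initial)}
    loops = 0
    for board, inventory in after_each_turn:
        h = state_hash(board, inventory)
        if h in seen:
            loops += 1
        seen.add(h)
    return loops / len(after_each_turn)


def memory_stats(memories: List[str], solved: bool) -> Tuple[float, Optional[float]]:
    """Return (write rate, median length of non-empty memory strings).

    memories holds the memory string written on each LLM call of a level;
    the final winning call is dropped on solved levels.
    """
    calls = memories[:-1] if solved else memories
    written = [m for m in calls if m]
    rate = len(written) / len(calls) if calls else 0.0
    chars = median(len(m) for m in written) if written else None
    return rate, chars


start = (["@.."], [])
turns = [([".@."], []), (["@.."], []), (["..@"], ["key"]), ([".@."], [])]
print(state_loop_fraction(start, turns))  # 0.5
print(memory_stats(["", "go right", "key at 2,3", "done"], solved=True))
```

In the example, the second and fourth turns return to boards already visited, so half of the turns count as loops; the memory write rate is 2 of 3 calls because the winning call is excluded.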

Breakdown by Pack

Model | box builder | carrot quest | diagonal swipes | flood colors | number cells | rotate flip | twinseed

Values show accuracy (% levels solved). ✦ = thinking/reasoning enabled.

Running the benchmark

The benchmark tool lives in tools/benchmark/. It requires Ollama for local models and, for cloud models, API keys in .env (copy from .env.example).

Build the runner (once): make benchmark-build
Install Python deps: cd tools/benchmark && python3 -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt
Run local non-think (5 combos): cd tools/benchmark && bash run_local_nothink.sh
Run curated suite (all models): make benchmark-suite
Run all packs × all models: make benchmark
Aggregate results: make benchmark-agg