Investigating spatial reasoning capabilities through word search puzzles. Testing how language models locate, track, and report grid coordinates.
Published January 17, 2026
LLMs are remarkably good at finding patterns. Give them a wall of text and they'll pull out names, dates, themes, contradictions. But here's a question I've been chewing on: can they tell you where they found something? Not just what they found, but the precise coordinates in a grid, the exact position in a structure?
Word search puzzles turn out to be a clean way to test this. The task is simple enough that any model can understand it: find the hidden words in a grid of letters, then report their start and end coordinates. The interesting part isn't whether models can find the words - spoiler, they're great at that - it's whether they can accurately report where those words are.
What we found surprised us. There's a fundamental gap between finding and locating, and it reveals something important about how these models process positional information.
**Word Accuracy:** The percentage of hidden words the model successfully identifies in the puzzle. A model that finds 4 out of 5 words has 80% word accuracy. This measures pattern recognition ability.

**Position Accuracy:** Of the words found, the percentage where the model reports the correct start and end coordinates. A model may find a word but give wrong grid positions. This measures spatial reasoning ability.
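The two metrics can be sketched in a few lines. This is an illustrative scoring function, not the evaluation harness actually used; the function and variable names are invented for the example.

```python
# Illustrative implementations of the two metrics defined above.
# Coordinates are (row, col) pairs; names here are not from the
# original evaluation code.

def word_accuracy(found_words, hidden_words):
    """Fraction of hidden words the model identified at all."""
    hidden = {w.upper() for w in hidden_words}
    found = {w.upper() for w in found_words} & hidden
    return len(found) / len(hidden)

def position_accuracy(submissions, answer_key):
    """Of the words found, the fraction with exactly correct start and
    end coordinates. Both arguments map
    word -> ((start_row, start_col), (end_row, end_col))."""
    found = [w for w in submissions if w in answer_key]
    if not found:
        return 0.0
    correct = sum(1 for w in found if submissions[w] == answer_key[w])
    return correct / len(found)

# A model that finds 4 of 5 hidden words scores 80% word accuracy:
assert word_accuracy(["CAT", "DOG", "OWL", "FOX"],
                     ["CAT", "DOG", "OWL", "FOX", "EMU"]) == 0.8
```

Note the denominators differ: word accuracy is scored against all hidden words, while position accuracy is conditional on the word having been found, which is what makes the find/locate gap visible.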
| Model | Evaluations | Word Accuracy | Position Accuracy | Avg Latency | Total Cost |
|---|---|---|---|---|---|
| GPT-4o | 94 | 99.0% | 17.9% | 2.6s | $0.30 |
| GPT-4 Turbo | 94 | 98.0% | 15.4% | 4.9s | $0.99 |
| Claude 3.5 Haiku | 94 | 95.8% | 9.3% | 3.8s | $0.18 |
| Claude Opus 4 | 94 | 95.7% | 58.8% | 8.2s | $3.58 |
| Claude Sonnet 4 | 94 | 95.0% | 60.6% | 10.9s | $1.10 |
| GPT-4o Mini | 94 | 95.0% | 11.5% | 4.6s | $0.02 |
| Gemini 2.0 Flash | 59 | 94.4% | 33.2% | 2.0s | $0.01 |
84 evaluations on larger grids (12x12, 15x15, 20x20) with 5-10 words per puzzle.
| Model | 12x12 | 15x15 | 20x20 | Word Acc |
|---|---|---|---|---|
| Claude Sonnet 4 | 58.1% | 46.7% | 40.0% | 100% |
| Claude Opus 4 | 51.9% | 19.2% | 50.0% | 100% |
| Gemini 2.0 Flash | 24.2% | 21.4% | 10.0% | 95% |
| Claude 3.5 Haiku | 8.1% | 1.7% | 0.0% | 100% |
| GPT-4o | 13.1% | 8.3% | 0.0% | 97.9% |
| GPT-4 Turbo | 11.2% | 6.7% | 0.0% | 100% |
| GPT-4o Mini | 5.0% | 0.0% | 0.0% | 99.2% |
Position accuracy shown. Gemini 2.0 Flash maintains 10% at 20x20 while GPT models drop to 0%.
- **Direction reversal:** Models often find words but report them backwards, claiming "RIGHT" when the word runs "LEFT" and swapping start/end coordinates. This explains most position errors.
- **Grid-size degradation:** Position accuracy drops from 42% (5x5) to 34% (8x8) to 23% (10x10). Claude Opus degrades from 80% to 46% on larger grids.
- **Finding vs. locating:** GPT-4o achieves 99% word-finding but only 21% position accuracy. Models locate patterns but struggle with spatial coordinate reporting.
- **Family gap:** Position accuracy: Claude Opus 67%, Claude Sonnet 64% vs GPT-4o 21%, GPT-4 Turbo 18%. A fundamental capability difference.
- **Scaling collapse:** GPT-4o, GPT-4 Turbo, and GPT-4o Mini all drop to 0% position accuracy on 20x20 grids while still finding 100% of words. Claude Sonnet maintains 40%.
The pattern here is striking: models can achieve near-perfect word finding while simultaneously failing at position reporting. GPT-4o finds 99% of words but only reports correct coordinates 21% of the time. This isn't a small gap - it's a fundamental disconnect between two capabilities we might naively assume go hand in hand.
Why does this happen? LLMs are trained on likelihood, not spatial truth. When a model scans a grid and recognizes "PYTHON" running diagonally, it's doing pattern matching, something these models excel at. But when asked to report that the word starts at row 3, column 5 and ends at row 8, column 10, it's doing something fundamentally different. It's not predicting the next likely token - it's supposed to be doing precise coordinate math, and that's where things fall apart.
The scaling results make this even clearer. As grids get larger, the coordinate space expands and position accuracy craters even while word-finding stays strong. Claude models degrade more gracefully than GPT models, but everyone struggles. At 20x20, most models are essentially guessing coordinates while still finding every word.
The implication? Be careful when asking LLMs to work with positional data. They might confidently tell you they found what you're looking for, but their sense of where is unreliable. For applications that require coordinate accuracy - parsing structured documents, navigating spatial data, referencing specific locations - verification isn't optional. The model might be right about what exists, but wrong about where it is.
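One lightweight form of that verification: read the grid along the model's reported path and compare it to the claimed word, flagging the reversed-coordinates failure mode separately from other errors. A minimal sketch, assuming 0-indexed (row, column) coordinates and an uppercase letter grid; the function names are illustrative.

```python
# Verify a model's reported placement against the actual grid.
# Assumes 0-indexed (row, col) coordinates; names are illustrative.

def letters_along(grid, start, end):
    """Read the letters from start to end, inclusive, provided they lie
    on a straight horizontal, vertical, or diagonal line; else None."""
    (r0, c0), (r1, c1) = start, end
    dr, dc = r1 - r0, c1 - c0
    steps = max(abs(dr), abs(dc))
    if steps == 0:
        return grid[r0][c0]
    if abs(dr) not in (0, steps) or abs(dc) not in (0, steps):
        return None  # not a straight line through the grid
    sr, sc = dr // steps, dc // steps  # unit step per axis: -1, 0, or 1
    return "".join(grid[r0 + i * sr][c0 + i * sc] for i in range(steps + 1))

def check_submission(grid, word, start, end):
    """Return 'ok', 'reversed' (the direction-confusion failure mode
    described above), or 'wrong'."""
    path = letters_along(grid, start, end)
    if path == word:
        return "ok"
    if path is not None and path == word[::-1]:
        return "reversed"
    return "wrong"
```

Running a check like this over every submission also separates "reversed" errors from genuinely wrong coordinates, which is one way the direction-confusion pattern can be quantified rather than eyeballed.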
Two-phase experiment testing 7 models (3 Anthropic, 3 OpenAI, 1 Google) on word search puzzles. Phase 1: Baseline evaluation on standard grids. Phase 2: Scaling study on larger grids up to 20x20.