Phase 01
Scraping
- Crawl 10000+ closed PRs across 1100 repositories
- 12-stage filter, including merge check, linked issue, patch split, image verify, and leakage detection
- Yields 610 multimodal candidates
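A filter chain like the one listed above can be sketched as a sequence of predicates applied in order, with a count of how many candidates each stage drops. This is a minimal sketch, not the project's actual pipeline: the `PullRequest` fields, the stage implementations, and the leakage heuristic are all assumptions made for illustration, and only the five stage names that the list mentions are modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class PullRequest:
    # Hypothetical schema; the real crawler's record format is not described.
    merged: bool
    linked_issue: Optional[int]
    source_files: list = field(default_factory=list)
    test_files: list = field(default_factory=list)
    issue_images: list = field(default_factory=list)
    issue_text: str = ""
    added_lines: list = field(default_factory=list)


def merge_check(pr: PullRequest) -> bool:
    return pr.merged


def has_linked_issue(pr: PullRequest) -> bool:
    return pr.linked_issue is not None


def patch_splits(pr: PullRequest) -> bool:
    # Assumption: a usable PR touches both source and test files,
    # so it can be split into a fix patch and a test patch.
    return bool(pr.source_files) and bool(pr.test_files)


def image_verify(pr: PullRequest) -> bool:
    # Assumption: "verify" is reduced here to "at least one image is present".
    return len(pr.issue_images) > 0


def no_leakage(pr: PullRequest) -> bool:
    # Crude assumed heuristic: reject if a non-trivial line of the fix
    # appears verbatim in the issue text.
    return not any(
        len(line.strip()) > 10 and line.strip() in pr.issue_text
        for line in pr.added_lines
    )


STAGES: list = [
    ("merge check", merge_check),
    ("linked issue", has_linked_issue),
    ("patch split", patch_splits),
    ("image verify", image_verify),
    ("leakage detection", no_leakage),
]


def run_filters(prs: list) -> tuple:
    """Apply each stage in order; return survivors and per-stage drop counts."""
    dropped = {}
    survivors = list(prs)
    for name, keep in STAGES:
        kept = [pr for pr in survivors if keep(pr)]
        dropped[name] = len(survivors) - len(kept)
        survivors = kept
    return survivors, dropped
```

Recording drops per stage makes it easy to see where most candidates are lost between the crawl and the final candidate set.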
Can AI models bridge the gap between code logic and pixel-level visual data?
§01 · The question
Tesseract evaluates AI coding agents on their ability to resolve real-world GitHub issues that include visual content such as screenshots of UI bugs, architectural diagrams, plot rendering errors, and widget behavior recordings. It tests whether models can reason over this embedded content alongside code context to produce correct patches. Each instance reconstructs a production pull request drawn from one of 13 open-source repositories across 5 programming languages, with Docker environments and automated test oracles for reproducible evaluation.
§02 · Valid Instances
Curated from thousands of real production pull requests, filtered through a 12-stage pipeline.
Repositories
1100
open-source projects
Languages
8
across the stack
Embedded images
24000
screenshots · diagrams · plots
Instances
10000
production PRs reconstructed
§03 · Pipeline
The method
Three phases build the dataset and three evaluate any model against it; run the back half as many times as you like.
Phase 01
Phase 02
base -> env -> eval
Phase 03
Phase 04
Phase 05
Phase 06
FAIL_TO_PASS (must now pass)
PASS_TO_PASS (must not regress)
So how did the models actually do? §04 →
§04 · Results
Kimi K2.5 resolves 4 of 20. Nova 2 Lite resolves 2 of 20.
Kimi K2.5 is ~9× more expensive and 2× as effective.
Overall resolution rate (Fig. 1) plus per-model degradation as the number of embedded images in an issue grows (Fig. 2). See §05 for per-instance receipts.
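The headline numbers above reduce to a few lines of arithmetic. The sketch below recomputes the resolution rates from the stated counts and, as a derived quantity that the page itself does not report, the relative cost per resolved instance. It assumes the ~9× figure compares total run cost over the same 20 instances; the function and variable names are illustrative.

```python
def resolution_rate(resolved: int, attempted: int) -> float:
    """Fraction of attempted instances that were resolved."""
    return resolved / attempted


kimi_rate = resolution_rate(4, 20)  # Kimi K2.5: 4 of 20
nova_rate = resolution_rate(2, 20)  # Nova 2 Lite: 2 of 20

# How many times more instances the stronger model resolves.
effectiveness_ratio = kimi_rate / nova_rate

# Approximate cost ratio quoted in the text (assumed to be total cost
# over the same set of instances).
cost_ratio = 9.0

# Derived, not reported: relative cost per resolved instance.
cost_per_resolution_ratio = cost_ratio / effectiveness_ratio

print(f"Kimi K2.5 rate:   {kimi_rate:.0%}")
print(f"Nova 2 Lite rate: {nova_rate:.0%}")
print(f"Effectiveness:    {effectiveness_ratio:.1f}x")
print(f"Cost per resolved instance: ~{cost_per_resolution_ratio:.1f}x")
```

Under these assumptions, doubling the resolution rate at roughly nine times the cost works out to about 4.5 times the cost per resolved instance.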
§05 · Instances
Every run, receipt-by-receipt. PASS means the model's patch fixed the issue and didn't regress existing tests.
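The PASS criterion can be written as a check over the two test lists named in the pipeline, FAIL_TO_PASS and PASS_TO_PASS. This is a minimal sketch under assumed data shapes (a mapping from test name to pass/fail after the model's patch is applied); it is not the harness's actual code, and the test names in the usage example are made up.

```python
def is_resolved(results: dict, fail_to_pass: list, pass_to_pass: list) -> bool:
    """Return True when the patch fixes the issue without regressions.

    results maps test name -> True if the test passed after applying the
    model's patch. A test missing from results is treated as failed.
    """
    # Every test that failed before the fix must now pass.
    fixed = all(results.get(name, False) for name in fail_to_pass)
    # Every test that passed before the fix must still pass.
    no_regression = all(results.get(name, False) for name in pass_to_pass)
    return fixed and no_regression


# Hypothetical run: the target test is fixed, but an existing test regressed.
run = {"test_render_axis": True, "test_resize": False, "test_export": True}
verdict = is_resolved(
    run,
    fail_to_pass=["test_render_axis"],
    pass_to_pass=["test_resize", "test_export"],
)
print("PASS" if verdict else "FAIL")
```

Treating a missing test as a failure is a conservative choice: a patch that prevents a test from running at all does not count as resolving the instance.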
Kimi K2.5 · Pass@1
20%
4 of 20 resolved
Nova 2 Lite · Pass@1
10%
2 of 20 resolved
Both models passed
2 instances
p5.js-6222 · p5.js-6251
Runtime range
9.8s → 12m
Pillow-6592 → libvips-4510
| Instance | Lang | Diff | Kimi K2.5 | Nova 2 Lite | |
|---|---|---|---|---|---|
| Totals | | | 4 / 20 | 2 / 20 | |
§06 · Resources