TESSERACT

MULTIMODAL SWE-BENCH

Can AI models bridge the gap between code logic and pixel-level visual data?

Tesseract evaluates AI coding agents on their ability to resolve real-world GitHub issues that include visual content such as screenshots of UI bugs, architectural diagrams, plot-rendering errors, and recordings of widget behavior. It tests whether models can reason over that embedded multimodal content alongside code context to produce correct patches. Each instance reconstructs a production pull request drawn from 1100 open-source repositories across 8 programming languages, with Docker environments and automated test oracles for reproducible evaluation.

Keep scrolling. The answer is in §04.

Curated from thousands of real production pull requests, filtered through a 12-stage pipeline.

Repositories

1100

open-source projects

Languages

8

across the stack

Embedded images

24000

screenshots · diagrams · plots

Instances

10000

production PRs reconstructed

The method

Six stages turn raw GitHub history into scored results.

Three to build the dataset, three to evaluate any model against it; run the back half as many times as you like.

A · Build the dataset · runs once

Phase 01

Scraping

  • Crawl 10000+ closed PRs across 1100 repositories
  • 12-stage filter: merge check, linked issue, patch split, image verification, leakage detection (sketched below)
  • Yields 610 multimodal candidates
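As a sketch, the staged filter can be read as a chain of predicates over each scraped PR. The Candidate fields and stage checks below are hypothetical, written only to mirror the bullets above, not Tesseract's actual code:

```python
# Hypothetical staged filter over scraped PRs; fields and checks
# are illustrative, not the project's real implementation.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Candidate:
    merged: bool                      # PR landed on the default branch
    linked_issue: str | None          # text of the issue the PR closes
    code_patch: str                   # non-test half of the diff
    test_patch: str                   # test half of the diff
    image_urls: list[str] = field(default_factory=list)

STAGES: list[tuple[str, Callable[[Candidate], bool]]] = [
    ("merge check",  lambda c: c.merged),
    ("linked issue", lambda c: c.linked_issue is not None),
    ("patch split",  lambda c: bool(c.code_patch) and bool(c.test_patch)),
    ("image verify", lambda c: len(c.image_urls) > 0),
    # leakage detection: drop issues that quote the fix verbatim
    ("leakage",      lambda c: c.code_patch not in (c.linked_issue or "")),
]

def survives(candidate: Candidate) -> bool:
    """A PR enters the candidate pool only if every stage passes."""
    return all(check(candidate) for _, check in STAGES)
```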

Phase 02

Docker Images

  • Build 3-tier Docker images per instance
  • base → env → eval
  • Language-specific toolchains
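A minimal sketch of how the three-tier build could be driven, assuming each Dockerfile takes a BASE_IMAGE build argument pointing at the tier below it; the tags and paths are illustrative, not the project's real layout:

```python
# Hypothetical build driver for the base -> env -> eval layering.
import subprocess

TIERS = [
    # (tag, dockerfile, parent tag the tier builds FROM)
    ("tesseract/base:py311",       "docker/base.Dockerfile", None),
    ("tesseract/env:pillow-6592",  "docker/env.Dockerfile",  "tesseract/base:py311"),
    ("tesseract/eval:pillow-6592", "docker/eval.Dockerfile", "tesseract/env:pillow-6592"),
]

for tag, dockerfile, parent in TIERS:
    cmd = ["docker", "build", "-f", dockerfile, "-t", tag]
    if parent:
        # assumes each Dockerfile declares: ARG BASE_IMAGE / FROM ${BASE_IMAGE}
        cmd += ["--build-arg", f"BASE_IMAGE={parent}"]
    subprocess.run(cmd + ["."], check=True)
```

The presumable payoff of the layering is cache reuse: base and env tiers rebuild rarely, while the thin eval tier pins one instance's repository state.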

Phase 03

Dataset

  • Curate the final instance set from the 610 filtered candidates
  • Drawn from a crawl of 1100 repositories in 8 languages
  • 24000 embedded images · JSONL + HuggingFace
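One released record might look like the following; the field names are assumptions patterned on SWE-bench-style instances, and every value is abridged or invented for illustration:

```python
# Hypothetical shape of one JSONL record (fields and values illustrative).
import json

instance = {
    "instance_id": "p5.js-6222",
    "repo": "processing/p5.js",
    "language": "JavaScript",
    "base_commit": "<sha of the pre-fix checkout>",
    "problem_statement": "<issue text with inline image references>",
    "image_assets": ["<url of embedded screenshot>"],
    "patch": "<gold code diff>",
    "test_patch": "<gold test diff>",
    "FAIL_TO_PASS": ["<tests the fix must make pass>"],
    "PASS_TO_PASS": ["<tests that must keep passing>"],
}
print(json.dumps(instance))
```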

B · Run the pipeline · runs per model

Phase 04

Inference

  • Agent receives issue text + embedded screenshots in Docker sandbox
  • Tool-use agent loop (two models evaluated)
  • Produces a git diff patch
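In outline, the inference step reduces to: issue text plus images in, a git diff out. The agent interface below is hypothetical; only that shape is the point:

```python
# Hypothetical inference wrapper; `agent.solve` stands in for whatever
# tool-use loop the model runs inside the sandbox.
import subprocess

def run_instance(agent, instance: dict, workdir: str) -> str:
    prompt = {
        "text": instance["problem_statement"],
        "images": instance["image_assets"],  # screenshots, diagrams, plots
    }
    # the agent reads, runs, and edits files inside the Docker sandbox
    agent.solve(prompt, workdir=workdir)
    # stage everything (including new files) and export the diff
    subprocess.run(["git", "-C", workdir, "add", "-A"], check=True)
    diff = subprocess.run(
        ["git", "-C", workdir, "diff", "--cached"],
        capture_output=True, text=True, check=True,
    )
    return diff.stdout
```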

Phase 05

Evaluation

  • Apply model's git patch to clean repo checkout
  • Run the full project test suite in a hermetic container
  • Capture test output
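A sketch of that step, reusing the eval-tier tag convention assumed in Phase 02; the run_tests.sh entry point is likewise an assumption:

```python
# Hypothetical evaluation harness: apply the patch to the pristine
# checkout baked into the eval image, run the suite, keep the logs.
import subprocess
from pathlib import Path

def evaluate(instance_id: str, model_patch: str) -> str:
    Path("/tmp/model.patch").write_text(model_patch)
    script = "git apply /tmp/model.patch && ./run_tests.sh"
    result = subprocess.run(
        ["docker", "run", "--rm",
         "--network=none",  # hermetic: no network during tests
         "-v", "/tmp/model.patch:/tmp/model.patch:ro",
         f"tesseract/eval:{instance_id.lower()}",  # docker tags are lowercase
         "bash", "-lc", script],
        capture_output=True, text=True,
    )
    return result.stdout + result.stderr
```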

Phase 06

Scoring

  • Grade FAIL_TO_PASS (must now pass)
  • Grade PASS_TO_PASS (must not regress)
  • Produce resolution verdict per instance
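Assuming the raw test log has already been parsed into a test-name-to-status map, the verdict itself is a two-line conjunction:

```python
# Verdict: every FAIL_TO_PASS test now passes, no PASS_TO_PASS regressed.
def resolve_verdict(instance: dict, status: dict[str, str]) -> bool:
    fixed = all(status.get(t) == "PASSED" for t in instance["FAIL_TO_PASS"])
    intact = all(status.get(t) == "PASSED" for t in instance["PASS_TO_PASS"])
    return fixed and intact
```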

So how did the models actually do? §04 →

Kimi K2.5 resolves 4 of 20. Nova 2 Lite resolves 2 of 20.

One model is ~9× more expensive and 2× as effective.

Fig. 1 — Resolution rate by model
Fig. 2 — Success rate vs. image count

Overall resolution rate (Fig. 1) plus per-model degradation as the number of embedded images in an issue grows (Fig. 2). See §05 for per-instance receipts.

Every run, receipt by receipt. PASS means the model's patch fixed the issue and didn't regress existing tests.

Kimi K2.5 · Pass@1

20%

4 of 20 resolved

Nova 2 Lite · Pass@1

10%

2 of 20 resolved

Both models passed

2 instances

p5.js-6222 · p5.js-6251

Runtime range

9.8s → 12m

Pillow-6592 → libvips-4510

Pass/Fail matrix of 20 instances evaluated against two models, Kimi K2.5 and Nova 2 Lite. A filled cell means the model's patch resolved the issue.
Columns: Instance · Lang · Diff · Kimi K2.5 · Nova 2 Lite
Totals: Kimi K2.5 4 / 20 · Nova 2 Lite 2 / 20

Click any row for full run metadata.