TESSERACT

MULTIMODAL SWE-BENCH

Can AI models bridge the gap between code logic and pixel-level visual data?

Tesseract evaluates AI coding agents on their ability to resolve real-world GitHub issues that include visual content such as screenshots of UI bugs, architectural diagrams, plot-rendering errors, and recordings of widget behavior. It tests whether models can reason over that embedded multimodal content alongside code context to produce correct patches. Each instance reconstructs a production pull request drawn from 1100 open-source repositories across 8 programming languages, with Docker environments and automated test oracles for reproducible evaluation.

Keep scrolling. The answer is in §04.

Curated from thousands of real production pull requests, filtered through a 12-stage pipeline.

Repositories

1100

open-source projects

Languages

8

across the stack

Embedded images

24000

screenshots · diagrams · plots

Instances

10000

production PRs reconstructed

The method

Six stages turn raw GitHub history into scored results.

Three to build the dataset, three to evaluate any model against it; run the back half as many times as you like.

A · Build the dataset · runs once

Phase 01

Scraping

  • Crawl 10000+ closed PRs across 1100 repositories
  • 12-stage filter: merge check, linked issue, patch split, image verification, leakage detection (sketched below)
  • Yields 610 multimodal candidates
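As a sketch, the staged filter can be read as a chain of predicates over each scraped PR. The Candidate fields and stage checks below are hypothetical, written only to mirror the bullets above, not Tesseract's actual code:

```python
# Hypothetical staged filter over scraped PRs; fields and checks
# are illustrative, not the project's real implementation.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Candidate:
    merged: bool                      # PR landed on the default branch
    linked_issue: str | None          # text of the issue the PR closes
    code_patch: str                   # non-test half of the diff
    test_patch: str                   # test half of the diff
    image_urls: list[str] = field(default_factory=list)

STAGES: list[tuple[str, Callable[[Candidate], bool]]] = [
    ("merge check",  lambda c: c.merged),
    ("linked issue", lambda c: c.linked_issue is not None),
    ("patch split",  lambda c: bool(c.code_patch) and bool(c.test_patch)),
    ("image verify", lambda c: len(c.image_urls) > 0),
    # leakage detection: drop issues that quote the fix verbatim
    ("leakage",      lambda c: c.code_patch not in (c.linked_issue or "")),
]

def survives(candidate: Candidate) -> bool:
    """A PR enters the candidate pool only if every stage passes."""
    return all(check(candidate) for _, check in STAGES)
```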

Phase 02

Docker Images

  • Build 3-tier Docker images per instance
  • base → env → eval
  • Language-specific toolchains
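A minimal sketch of how the three-tier build could be driven, assuming each Dockerfile takes a BASE_IMAGE build argument pointing at the tier below it; the tags and paths are illustrative, not the project's real layout:

```python
# Hypothetical build driver for the base -> env -> eval layering.
import subprocess

TIERS = [
    # (tag, dockerfile, parent tag the tier builds FROM)
    ("tesseract/base:py311",       "docker/base.Dockerfile", None),
    ("tesseract/env:pillow-6592",  "docker/env.Dockerfile",  "tesseract/base:py311"),
    ("tesseract/eval:pillow-6592", "docker/eval.Dockerfile", "tesseract/env:pillow-6592"),
]

for tag, dockerfile, parent in TIERS:
    cmd = ["docker", "build", "-f", dockerfile, "-t", tag]
    if parent:
        # assumes each Dockerfile declares: ARG BASE_IMAGE / FROM ${BASE_IMAGE}
        cmd += ["--build-arg", f"BASE_IMAGE={parent}"]
    subprocess.run(cmd + ["."], check=True)
```

The presumable payoff of the layering is cache reuse: base and env tiers rebuild rarely, while the thin eval tier pins one instance's repository state.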

Phase 03

Dataset

  • Curate the final instance set from the 610 filtered candidates
  • Drawn from a crawl of 1100 repositories in 8 languages
  • 24000 embedded images · JSONL + HuggingFace
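One released record might look like the following; the field names are assumptions patterned on SWE-bench-style instances, and every value is abridged or invented for illustration:

```python
# Hypothetical shape of one JSONL record (fields and values illustrative).
import json

instance = {
    "instance_id": "p5.js-6222",
    "repo": "processing/p5.js",
    "language": "JavaScript",
    "base_commit": "<sha of the pre-fix checkout>",
    "problem_statement": "<issue text with inline image references>",
    "image_assets": ["<url of embedded screenshot>"],
    "patch": "<gold code diff>",
    "test_patch": "<gold test diff>",
    "FAIL_TO_PASS": ["<tests the fix must make pass>"],
    "PASS_TO_PASS": ["<tests that must keep passing>"],
}
print(json.dumps(instance))
```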

B · Run the pipeline · runs per model

Phase 04

Inference

  • Agent receives issue text + embedded screenshots in Docker sandbox
  • Tool-use agent loop (two models evaluated)
  • Produces a git diff patch
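In outline, the inference step reduces to: issue text plus images in, a git diff out. The agent interface below is hypothetical; only that shape is the point:

```python
# Hypothetical inference wrapper; `agent.solve` stands in for whatever
# tool-use loop the model runs inside the sandbox.
import subprocess

def run_instance(agent, instance: dict, workdir: str) -> str:
    prompt = {
        "text": instance["problem_statement"],
        "images": instance["image_assets"],  # screenshots, diagrams, plots
    }
    # the agent reads, runs, and edits files inside the Docker sandbox
    agent.solve(prompt, workdir=workdir)
    # stage everything (including new files) and export the diff
    subprocess.run(["git", "-C", workdir, "add", "-A"], check=True)
    diff = subprocess.run(
        ["git", "-C", workdir, "diff", "--cached"],
        capture_output=True, text=True, check=True,
    )
    return diff.stdout
```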

Phase 05

Evaluation

  • Apply model's git patch to clean repo checkout
  • Run the full project test suite in a hermetic container
  • Capture test output
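A sketch of that step, reusing the eval-tier tag convention assumed in Phase 02; the run_tests.sh entry point is likewise an assumption:

```python
# Hypothetical evaluation harness: apply the patch to the pristine
# checkout baked into the eval image, run the suite, keep the logs.
import subprocess
from pathlib import Path

def evaluate(instance_id: str, model_patch: str) -> str:
    Path("/tmp/model.patch").write_text(model_patch)
    script = "git apply /tmp/model.patch && ./run_tests.sh"
    result = subprocess.run(
        ["docker", "run", "--rm",
         "--network=none",  # hermetic: no network during tests
         "-v", "/tmp/model.patch:/tmp/model.patch:ro",
         f"tesseract/eval:{instance_id.lower()}",  # docker tags are lowercase
         "bash", "-lc", script],
        capture_output=True, text=True,
    )
    return result.stdout + result.stderr
```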

Phase 06

Scoring

  • Grade FAIL_TO_PASS (must now pass)
  • Grade PASS_TO_PASS (must not regress)
  • Produce resolution verdict per instance
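Assuming the raw test log has already been parsed into a test-name-to-status map, the verdict itself is a two-line conjunction:

```python
# Verdict: every FAIL_TO_PASS test now passes, no PASS_TO_PASS regressed.
def resolve_verdict(instance: dict, status: dict[str, str]) -> bool:
    fixed = all(status.get(t) == "PASSED" for t in instance["FAIL_TO_PASS"])
    intact = all(status.get(t) == "PASSED" for t in instance["PASS_TO_PASS"])
    return fixed and intact
```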

So how did the models actually do? §04 →

Kimi K2.5 resolves 4 of 20. Nova 2 Lite resolves 2 of 20.

One model is ~9× more expensive and 2× as effective.

Fig. 1 — Resolution rate by model
Fig. 2 — Success rate vs. image count

Overall resolution rate (Fig. 1) plus per-model degradation as the number of embedded images in an issue grows (Fig. 2). See §05 for per-instance receipts.

Every run, receipt by receipt. PASS means the model's patch fixed the issue and didn't regress existing tests.

Kimi K2.5 · Pass@1

20%

4 of 20 resolved

Nova 2 Lite · Pass@1

10%

2 of 20 resolved

Both models passed

2 instances

p5.js-6222 · p5.js-6251

Runtime range

9.8s → 12m

Pillow-6592 → libvips-4510

Pass/Fail matrix of 20 instances evaluated against two models, Kimi K2.5 and Nova 2 Lite. A filled cell means the model's patch resolved the issue.
Columns: Instance · Lang · Diff · Kimi K2.5 · Nova 2 Lite
Totals: Kimi K2.5 4 / 20 · Nova 2 Lite 2 / 20

Click any row for full run metadata.