GPT-5.5 leads the DeepSWE benchmark

Paul Charlesworth 31 May 2026

I’ve just been introduced to a new AI benchmark, DeepSWE. It’s a software engineering benchmark that’s attempting to correct some issues the developers see in existing benchmarks, namely:

Contamination
Diversity
Real-world complexity
Reliable verification

You can read about the methodology on their site, which explains each of the points above.

The noteworthy conclusions I want to write about here:

On Performance

GPT-5.5, released on 23 April 2026, leads. It is followed by Opus 4.8, which was released on 28 May 2026.

What’s more interesting is cost/performance

If you want cheap, this benchmark says: Use GPT-5.5 on medium.

Figure: DeepSWE results table across all effort levels, showing GPT-5.5 medium scoring 48 percent at an average cost of $2.34, cheaper than most higher-scoring models. Screenshot from 31 May 2026.

Of all the models in this table (which I have cherry-picked to keep the illustration obvious), it’s the cheapest.

Most notably, for a straight comparison, it’s less than half the price of Sonnet 4.6 ($2.34 vs $5.52 AVG COST) but “better” by 16 percentage points (48% vs 32% PASS@1), which is a 50% relative improvement.
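To make that comparison concrete, here is a minimal Python sketch using only the four figures quoted above. The "cost per solved task" figure is my own derived metric, not something the benchmark reports: it assumes AVG COST is the average spend per attempt and that failed attempts cost about the same as successful ones.

```python
# Figures quoted from the DeepSWE table (screenshot from 31 May 2026).
models = {
    "GPT-5.5 (medium)": {"pass_at_1": 0.48, "avg_cost": 2.34},
    "Sonnet 4.6": {"pass_at_1": 0.32, "avg_cost": 5.52},
}


def cost_per_pass(pass_at_1, avg_cost):
    """Expected spend per solved task.

    Assumption (not from the benchmark): avg_cost is per attempt,
    and a failed attempt costs the same as a successful one.
    """
    return avg_cost / pass_at_1


gpt = models["GPT-5.5 (medium)"]
sonnet = models["Sonnet 4.6"]

# How much of Sonnet's average cost GPT-5.5 medium needs.
price_ratio = gpt["avg_cost"] / sonnet["avg_cost"]

# Absolute gap in PASS@1, in percentage points.
point_gap = (gpt["pass_at_1"] - sonnet["pass_at_1"]) * 100

# Relative gap in PASS@1, in percent.
relative_gain = (gpt["pass_at_1"] / sonnet["pass_at_1"] - 1) * 100

print(f"Price ratio: {price_ratio:.2f}")
print(f"PASS@1 gap: {point_gap:.0f} percentage points")
print(f"Relative PASS@1 gain: {relative_gain:.0f}%")
for name, m in models.items():
    print(f"{name}: ${cost_per_pass(**m):.2f} per solved task")
```

Under those assumptions, GPT-5.5 medium comes out at under $5 per solved task against more than $17 for Sonnet 4.6, so the gap in value is wider than either the price or the score suggests on its own.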

But there’s more

This benchmark also says: Why not just use GPT-5.5 on “High”?

It outperforms Opus 4.8 (Max) (Max!!) and is still cheaper than Sonnet 4.6 on average.

WTH.

Given we only have the table to go on, output tokens look like the explanation. GPT-5.5 medium and high have the lowest and second-lowest number of output tokens of all the models on the benchmark. So naturally, even if the ticket price of the tokens is high-ish, they’re still going to deliver better value for money since they’re so “token frugal”.
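Here is a rough sketch of why token frugality can outweigh ticket price. Every number in it is hypothetical and made up for illustration; none of them come from the DeepSWE table or from any provider's price list. It also ignores input tokens entirely, which is a simplification.

```python
def run_cost(output_tokens, price_per_million_output):
    """Cost of one run, counting output tokens only (a simplification)."""
    return output_tokens / 1_000_000 * price_per_million_output


# Hypothetical "frugal" model: pricier tokens, far fewer of them.
frugal = run_cost(output_tokens=40_000, price_per_million_output=30.0)

# Hypothetical "verbose" model: half the token price, many more tokens.
verbose = run_cost(output_tokens=150_000, price_per_million_output=15.0)

print(f"Frugal model:  ${frugal:.2f} per run")
print(f"Verbose model: ${verbose:.2f} per run")
print("Frugal model is cheaper" if frugal < verbose else "Verbose model is cheaper")
```

With these invented numbers, the model with double the token price is still the cheaper one per run, because it writes fewer than a third as many tokens.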

To generalise, and take something useful away:

It’s always important to take benchmarks with a pinch of salt, but this is a good one for demonstrating the triangle of efficiency. To understand which model is right for your use case, you have to factor in:

Price
Performance
Output tokens

And, since benchmarks differ, your mileage will vary. One benchmark will often draw different conclusions from another. They’re useful for drawing comparisons and illustrating a point, but no single one is going to tell you everything you need to know.