GPT-5.5 leads the DeepSWE benchmark
I’ve just been introduced to a new AI benchmark, DeepSWE. It’s a software engineering benchmark that’s attempting to correct some issues the developers see in existing benchmarks, namely:
- Contamination
- Diversity
- Real-world complexity
- Reliable verification
You can read about the methodology on their site. It explains the bullets above.
The noteworthy conclusions I want to write about here:
On Performance
GPT-5.5 (released 23 April 2026) leads. It is followed by Opus 4.8, which was released on 28 May 2026.
What’s more interesting is cost/performance
If you want cheap, this benchmark says: Use GPT-5.5 on medium.
DeepSWE results table across all effort levels. Screenshot from 31 May 2026.
Of all the models on this entire table (which I have cherry-picked to keep the illustration obvious), GPT-5.5 on medium is the cheapest.
Most notably, for a straight comparison, it’s less than half the price of Sonnet 4.6 ($2.34 vs $5.52 AVG COST) but “better” by 16 percentage points (48% vs 32% PASS@1).
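To make that arithmetic explicit, here is a minimal sketch using only the four figures quoted above. It assumes the 48% belongs to GPT-5.5 (medium) and the 32% to Sonnet 4.6, since GPT-5.5 is the one described as better. The variable names are mine.

```python
# Figures quoted from the DeepSWE table (screenshot from 31 May 2026).
# Assumption: 48% PASS@1 is GPT-5.5 (medium), 32% is Sonnet 4.6.
gpt_cost, gpt_pass = 2.34, 0.48
sonnet_cost, sonnet_pass = 5.52, 0.32

cost_ratio = gpt_cost / sonnet_cost          # GPT-5.5 cost as a fraction of Sonnet's
point_gap = (gpt_pass - sonnet_pass) * 100   # absolute gap, in percentage points
relative_gain = gpt_pass / sonnet_pass - 1   # relative improvement over Sonnet

print(f"cost ratio: {cost_ratio:.2f}")
print(f"gap: {point_gap:.0f} percentage points")
print(f"relative gain: {relative_gain:.0%}")
```

So the gap is 16 percentage points, which works out to a 50% relative improvement, at roughly 42% of the average cost.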
But there’s more
This benchmark also says: Why not just use GPT-5.5 on “High”?
It outperforms Opus 4.8 (Max) (Max!!) and is still cheaper than Sonnet 4.6 on average.
WTH.
Given we only have the table to go off, it looks like output tokens are the culprit. GPT-5.5 medium and high have the lowest and second lowest number of output tokens of all the models on the benchmark. So naturally, even if the ticket price of the tokens is high-ish, they’re still going to deliver better value for money since they’re so “token frugal”.
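A rough sketch of why token frugality can beat a lower per-token price. I don't have per-token prices or exact token counts to hand, so every number below is made up for illustration, and `task_cost` is a hypothetical helper that ignores input and cached tokens entirely.

```python
def task_cost(output_tokens: int, usd_per_million_output: float) -> float:
    """Output-token cost of one task in USD (input tokens are ignored)."""
    return output_tokens / 1_000_000 * usd_per_million_output

# Hypothetical numbers, NOT taken from the benchmark.
frugal_pricey = task_cost(output_tokens=40_000, usd_per_million_output=30.0)
chatty_cheap = task_cost(output_tokens=150_000, usd_per_million_output=15.0)

print(f"frugal but pricey: ${frugal_pricey:.2f}")
print(f"chatty but cheap:  ${chatty_cheap:.2f}")
```

In this invented case the model that charges twice as much per token still costs about half as much per task, because it writes far fewer tokens.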
To generalise, and take something useful away:
It’s always important to take benchmarks with a pinch of salt, but this is a good one for demonstrating the triangle of efficiency. You have to factor in:
- Price
- Performance
- Output tokens
To understand which model is right for your use case.
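One hedged way to fold price and performance into a single number is an expected cost per solved task: average cost divided by pass rate. This assumes AVG COST is per attempt and that you would retry until a pass, which is my simplification and not something the benchmark reports. `cost_per_solve` is a hypothetical helper, and the figures are the two rows quoted earlier (48% taken as GPT-5.5 on medium, 32% as Sonnet 4.6).

```python
def cost_per_solve(avg_cost_usd: float, pass_at_1: float) -> float:
    """Expected spend per solved task, assuming independent retries."""
    if not 0 < pass_at_1 <= 1:
        raise ValueError("pass_at_1 must be in (0, 1]")
    return avg_cost_usd / pass_at_1

# (AVG COST in USD, PASS@1) as quoted from the table.
models = {
    "GPT-5.5 (medium)": (2.34, 0.48),
    "Sonnet 4.6": (5.52, 0.32),
}

# Cheapest per solved task first.
ranked = sorted(models, key=lambda name: cost_per_solve(*models[name]))
for name in ranked:
    print(f"{name}: ${cost_per_solve(*models[name]):.2f} per solved task")
```

Output tokens don't appear as a separate input here because they are already baked into the average cost. They are the corner of the triangle that explains why the cost is what it is.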
And, since benchmarks differ, your mileage will vary. This benchmark may well draw different conclusions from another one. Benchmarks are useful for drawing comparisons and illustrating a point, but they’re not going to tell you everything you need to know.