Blog

  1. GPT 5.5 leads the DeepSWE benchmark

    DeepSWE is a new software engineering benchmark built around fresh, diverse, real-world tasks with handwritten verification.

  2. Composer 2.5 sits with the frontier coding agents

    Cursor Composer 2.5 ranks third on Artificial Analysis' coding agents benchmark, at a fraction of the cost of frontier models.

  3. The Falling Cost of Inference

    The latest models are getting more expensive, but the absolute cost of inference is falling rapidly.

  4. AI Proves You Don't Know What You Want

    Why weak AI prompts often reveal unclear requirements, not weak tools.