GPT-5.4 Lands: A Reasoning Powerhouse That Writes Code, Uses Your Computer, and Thinks Ahead


OpenAI’s March 2026 release, GPT-5.4, reads like a careful step toward AI that can carry an entire project from first idea to final delivery. It isn’t just a faster chatbot or a slightly smarter code generator — it’s a consolidated system that bundles advanced reasoning, strong coding skills, and native computer-use capabilities into a single model. The result is a platform designed for sustained, multi-step professional workflows where the model can plan, act, and recover with less hand-holding.

Why this release matters

A recurring frustration with earlier models was fragmentation: different variants excelled at different tasks, and stitching them together for a real-world workflow required a lot of glue work from users. GPT-5.4 is explicitly engineered to reduce that friction. It combines the coding pedigree of GPT-5.3-Codex with a leap in general reasoning and the new ability for agents to control a computer directly — pointing at screens, issuing mouse and keyboard commands, and interpreting screenshots. For professionals building end-to-end pipelines — spreadsheets, presentations, legal drafting, data extraction, or automated agent tasks — GPT-5.4 is pitched as a single coherent partner rather than a toolkit of separate specialists.
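The observe-decide-act loop behind computer use — read a screenshot, choose a mouse or keyboard action, repeat — can be sketched in a few lines. Everything below is a hypothetical simulation: the `Action` schema, `plan_action`, and the string-based "screenshots" are stand-ins for illustration, not OpenAI's actual computer-use API.

```python
from dataclasses import dataclass

# Hypothetical action schema, not OpenAI's real computer-use format.
@dataclass
class Action:
    kind: str            # "click", "type", or "done"
    payload: str = ""

def plan_action(screenshot: str, goal: str) -> Action:
    """Stand-in for a model call that maps a screenshot + goal to one action.
    Faked here with simple rules so the loop is runnable end to end."""
    if "login form" in screenshot:
        return Action("type", "user@example.com")
    if "submit button" in screenshot:
        return Action("click", "submit")
    return Action("done")

def run_agent(goal: str, screens: list[str]) -> list[Action]:
    """Observe -> decide -> act: one action per screenshot, stop on 'done'."""
    trace = []
    for screenshot in screens:
        action = plan_action(screenshot, goal)
        trace.append(action)
        if action.kind == "done":
            break
    return trace

trace = run_agent("log in", ["login form", "submit button", "dashboard"])
print([a.kind for a in trace])  # ['type', 'click', 'done']
```

The point of the sketch is the control flow: the model never sees the application's internals, only pixels (here, strings), which is why the screenshot-interpretation scores discussed below matter so much.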

New interaction patterns: planning and steerability

One notable interaction enhancement in ChatGPT’s GPT-5.4 Thinking mode is an upfront reasoning plan that the model exposes while it works. That plan lets users interrupt, redirect, or refine the model mid-response without restarting the whole process. Practically, that means fewer wasted drafts and faster convergence to a useful answer — a major usability improvement for complex tasks where course corrections are common.
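The mid-response steerability described above can be mimicked in application code as a plan executor with a revision hook. The `revise` callback and step names are invented for illustration; this is a sketch of the interaction pattern, not ChatGPT's implementation.

```python
def execute_plan(steps, revise=None):
    """Run plan steps in order, letting a caller-supplied callback inspect
    progress after each step and swap in a revised remaining plan mid-run.
    `revise` is a hypothetical hook, not a real API feature."""
    done = []
    remaining = list(steps)
    while remaining:
        done.append(remaining.pop(0))
        if revise:
            new_plan = revise(done, remaining)
            if new_plan is not None:
                remaining = list(new_plan)
    return done

# Redirect after the first step: replace "draft outline" with "revise intro".
def steer(done, remaining):
    if done == ["gather sources"]:
        return ["revise intro", "final edit"]
    return None

print(execute_plan(["gather sources", "draft outline", "final edit"], steer))
# ['gather sources', 'revise intro', 'final edit']
```

The design choice worth noting: because the plan is exposed before and during execution, a correction only discards the *remaining* steps, not the work already done — which is exactly why fewer drafts get wasted.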

Measured progress: benchmarks and numbers

OpenAI provided a suite of benchmark results showing meaningful gains over earlier versions. These are not small incremental changes — they show the model pulling ahead on several fronts where real-world utility matters: multi-step agent success, browser-based tasks, coding benchmarks, and profession-level evaluations.

Preserved benchmark table (reported values)

| Benchmark | GPT-5.4 | GPT-5.3-Codex | GPT-5.2 |
| --- | --- | --- | --- |
| GDPval (wins or ties) | 83.0% | 70.9% | 70.9% |
| SWE-Bench Pro (Public) | 57.7% | 56.8% | 55.6% |
| OSWorld-Verified | 75.0% | 74.0% | 47.3% |
| Toolathlon | 54.6% | 51.9% | 46.3% |
| BrowseComp | 82.7% | 77.3% | 65.8% |

What these numbers mean in practice

  • GDPval: GPT-5.4 matches or surpasses human professionals in 83% of comparisons across 44 occupations spanning top U.S. GDP industries — a jump from 70.9% with GPT-5.2. That suggests the model’s usefulness in industry-focused, occupation-specific tasks has substantially increased.
  • OSWorld-Verified: At 75.0%, GPT-5.4 surpasses a human benchmark reported at 72.4% and dramatically outperforms GPT-5.2 (47.3%), indicating much-improved success when interacting with software environments and task-oriented interfaces.
  • Browser and screenshot tasks: The model posts a 67.3% browser success rate on WebArena-Verified and an impressive 92.8% score on Online-Mind2Web when using screenshot observations alone — data points that underscore how effectively it interprets visual UI context.

Real-world signals and efficiency

OpenAI emphasized factuality and efficiency improvements alongside raw performance. According to reported figures, individual factual claims are 33% less likely to be false and full responses are 18% less likely to contain errors compared to GPT-5.2. Token efficiency has also improved — the model solves reasoning problems with substantially fewer tokens, which translates directly into lower API costs and faster responses for developers.
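The claim that fewer tokens translates directly into lower API costs is simple arithmetic. The token counts and the per-million-token price below are invented for illustration; only the structure of the calculation is the point.

```python
def response_cost(tokens: int, price_per_million: float) -> float:
    """Dollar cost of `tokens` output tokens at a given price per million.
    Rounded to avoid floating-point noise in small dollar amounts."""
    return round(tokens / 1_000_000 * price_per_million, 4)

# Invented figures: a task that once needed 10,000 output tokens now needs
# 30% fewer, at a hypothetical $10 per million output tokens.
old_cost = response_cost(10_000, 10.0)
new_cost = response_cost(7_000, 10.0)
print(old_cost, new_cost)  # 0.1 0.07
```

At scale the same arithmetic compounds: a 30% token reduction across millions of daily requests is a 30% cut in that line item, before accounting for the latency win of generating fewer tokens.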

Anecdotal production metrics echo the benchmarks. Mainstay’s CEO Dod Fraser reported that across roughly 30,000 property portal sessions, GPT-5.4 achieved a 95% first-attempt success rate, completed workflows three times faster, and consumed about 70% fewer tokens versus earlier computer-use models. Those are the sorts of operational advantages that can tilt business decisions about adopting new models.

Specialized strengths: legal and coding performance

On legal-document evaluation, GPT-5.4 scored 91% on the BigLaw Bench (reported via Harvey’s Head of Applied Research, Niko Grupen), demonstrating strong domain performance for complex text work. For developers, the model unifies the Codex-class coding capabilities into the general-purpose model family, reducing the need to switch contexts between coding-specialist variants and reasoning-specialist variants.

Context length and agent horizons

A key technical capability for long-running tasks is context window size. GPT-5.4’s API supports up to 1 million tokens of context, enabling extended, long-horizon agent workflows — think entire documents, datasets, and multi-step execution history kept in memory. That puts it on par with large context offerings from other frontier providers and makes it viable for longer, stateful tasks without constant context truncation.
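Even a 1-million-token window needs budgeting in practice, since input documents, execution history, and reserved output space all compete for it. A rough pre-flight check, using the common ~4-characters-per-token heuristic (an approximation; a real tokenizer such as tiktoken would be more accurate, and the reserve figure here is an arbitrary example):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_in_context(documents: list[str],
                    context_limit: int = 1_000_000,
                    reserve_for_output: int = 50_000) -> bool:
    """Check whether all documents fit alongside a reserved output budget."""
    used = sum(estimate_tokens(d) for d in documents)
    return used + reserve_for_output <= context_limit

docs = ["x" * 400_000, "y" * 1_200_000]  # roughly 100k + 300k tokens
print(fits_in_context(docs))  # True: ~400k used + 50k reserve < 1M
```

The same check fails fast for oversized inputs, which is cheaper than discovering truncation mid-run in a long-horizon agent task.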

Availability and tiers

GPT-5.4 is currently rolling out across ChatGPT (as GPT-5.4 Thinking), the OpenAI API, and Codex. A GPT-5.4 Pro variant is also available, offering maximum compute for more complex tasks and priority processing for production environments. ChatGPT Plus, Team, and Pro subscribers are being transitioned from GPT-5.2 Thinking to GPT-5.4 Thinking over a multi-month rollout.

Where this could take us next

GPT-5.4 is another step toward agents that can plan and act with minimal human supervision while remaining steerable and grounded in facts. The native computer-use capabilities — screenshots, mouse, and keyboard interactions — lower the barrier for automating desktop or web-based tasks that were previously stubbornly manual. Combined with improved factuality and token efficiency, organizations may find it easier to deploy these models into production workflows while managing costs.

But powerful capabilities invite attention to safety, oversight, and human-in-the-loop design. As models gain the ability to act autonomously on computers and interact with user interfaces, governance and robust testing become more important than ever. The production anecdotes and benchmark wins are convincing, but careful integration and monitoring will be essential to avoid automation pitfalls.

Conclusion

GPT-5.4 represents a consolidation of several strands of progress — reasoning, coding, and situated, agentic action — into a more unified model. The benchmark gains and real-world reports suggest it’s not just a new milestone on paper: developers and enterprises can expect faster, more reliable outputs, lower token costs, and models better equipped for long, context-heavy tasks. Whether you’re automating data-entry tasks across web portals or drafting complex legal documents, GPT-5.4 is positioned to be the kind of model that reduces friction and speeds outcomes — provided it’s adopted with thoughtful controls and human oversight.
