OpenAI just dropped ChatGPT Agent, and its a big deal. This isnt just another incremental update; its a move towards genuinely autonomous AI, letting ChatGPT handle complex tasks from start to finish on its own virtual computer. We’re talking about things like briefing you on client meetings, planning and buying groceries, or even analyzing competitors and cranking out editable slide decks and spreadsheets. All of that happens without you needing to jump between different apps or manually feed information. This is OpenAI trying to pull everything into one powerful system.
At its core, ChatGPT Agent combines a few previous breakthroughs: OpenAIs Operator, which handles web interaction; Deep Research, for synthesizing information; and ChatGPTs natural language abilities. The idea is to create a single brain that can fluidly shift between thinking and doing. And the best part? You’re still in charge. ChatGPT asks for permission before doing anything impactful, and you can always pause it, take over the browser, or stop a task. This balance of autonomy and control is something Ive always emphasized should be standard in AI tools.
Starting today, Pro, Plus, and Team users are getting access by activating ‘agent mode’ in the tools dropdown. This is just the beginning; they’re planning regular, significant improvements. So, while it’s already powerful, expect it to get much better over time.
From Operator to Agent: A Unified Powerhouse
Before ChatGPT Agent, OpenAI had Operator and Deep Research, each with its own strengths. Operator was good at clicking, scrolling, and typing on the web, kind of like your digital assistant for basic browsing. Deep Research, on the other hand, was a whiz at analyzing and summarizing large amounts of information, almost like having a research analyst on tap. The problem was they were separate and often couldn’t handle tasks that needed both. Operator couldnt do deep analysis, and Deep Research couldnt interact with websites that needed logins or more refined searches.
OpenAI saw that users often tried to use Operator for tasks better suited for Deep Research, which highlighted the need for a unified solution. By bringing these functionalities together in ChatGPT Agent, theyve created a tool that can actively engage with websites licking, filtering, and gathering precise results ut also performing in-depth analysis and creating detailed reports. You can now just chat with ChatGPT and smoothly transition to asking it to perform complex actions.
Its a natural step forward for truly useful AI. Why have two tools when one can do both, and do them better? This integration means less friction in your workflow and more powerful outcomes. It also means the model can choose the best way to get things done, whether its using a text-based browser for quick information gathering or a visual browser for interacting with human-designed interfaces.
ChatGPT Agent brings together web interaction (Operator) and deep analysis (Deep Research) into a single, cohesive system.
An Agent That Works For You, With You
ChatGPT Agent has a full suite of tools: a visual browser for interacting with websites, a text-based browser for simpler queries, a terminal, and direct API access. It also taps into ChatGPT connectors, letting it link to apps like Gmail and GitHub to find information or use it in responses. You can even take over the browser to securely log in to any website, allowing the agent to dig deeper and act broader in both research and task execution.
The system intelligently picks the best tool for the job. For example, it can use an API to check your calendar, process a lot of text fast with the text browser, and still visually interact with sites made for humans. All this happens on its own virtual computer, keeping task context consistent even when multiple tools are in play. It can open a page, download a file, manipulate it in the terminal, and then view the results back in the visual browser. This adaptability makes it fast, accurate, and efficient.
Figma recently tried something similar with ‘vibe coding,’ which sounds cool but often struggles with reliability, as I discussed in my post on Figma’s AI Gambit. The difference here is the focus on iterative, collaborative workflows. You can interrupt ChatGPT at any point to clarify instructions, steer it, or change the task completely. It picks up right where it left off. ChatGPT can also ask you for more details if needed. If a task is taking too long, you can pause it, ask for a summary, or stop it entirely and get partial results. And if you have the ChatGPT app, it sends a notification when it’s done.
Broadening Real-World Utility and Performance Benchmarks
These new capabilities push ChatGPT beyond being just a chatbot; its now genuinely useful for everyday and professional tasks. At work, you can automate repetitive tasks, like converting screenshots into editable presentations, rearranging meetings, planning company offsites, or updating spreadsheets with new financial data while keeping the formatting consistent. For personal use, it can effortlessly plan and book travel, design and book entire dinner parties, or find specialists and schedule appointments. This moves AI from abstract demonstrations to practical applications.
The model’s performance on real-world evaluations is impressive. On Humanitys Last Exam (HLE), which tests AI across broad subjects with expert-level questions, the model powering ChatGPT Agent scored a new SOTA (state-of-the-art) at 41.6% pass@1. When running eight attempts and picking the best one (a simple parallel rollout strategy), the score jumped to 44.4%. This is what happens when you train a model to be an agent and let it iterate.
For math, FrontierMath is known as the hardest benchmark, with novel problems that challenge expert mathematicians for hours or days. With tool use, like terminal access for code execution, ChatGPT Agent hit 27.4% accuracy, vastly outperforming previous models. This reminds me of when I tested Grok 4, and despite its strengths, it wasn’t quite hitting these levels of complex problem-solving without explicit tool integrations like this.
On an internal benchmark for “complex, economically valuable knowledge-work tasks,” ChatGPT Agent equaled or surpassed human performance in about half the cases across various task completion times, easily beating o3 and o4-mini. These tasks included competitive analyses, amortization schedules, and identifying viable water wells for green hydrogen facilities asically, real professional work. Devstral Small 2507 showed similar jumps in performance for coding benchmarks, highlighting that agentic capabilities are where the real gains are happening.
On DSBench, for realistic data science tasks, ChatGPT Agent significantly outpaced human performance. Similarly, on SpreadsheetBench, it outperformed existing models by a wide margin, especially when given direct editing access to spreadsheets (45.5% vs. Copilot in Excel’s 20.0%). My personal experience with Grok 4 showed that even powerful models have explicit limits when it comes to certain tasks, but this pushes those limits further.
For investment banking analyst modeling tasks (like building a three-statement financial model for a Fortune 500 company), the model powering ChatGPT Agent significantly outperformed Deep Research and o3. Each task was graded on hundreds of correctness and formula-use criteria. This suggests the agent mode is not just for simple tasks but for complex, domain-specific challenges.
Finally, on BrowseComp, a benchmark measuring browsing agents’ ability to find hard-to-locate web information, the model set a new SOTA with 68.9%, a 17.4 percentage point jump over Deep Research. And on WebArena, for real-world web tasks, it improved over o3-powered CUA (the model behind Operator). Its clear that unifying these capabilities actually makes them all better.
| Benchmark | ChatGPT Agent Score | Previous SOTA/Human | Significance |
|---|---|---|---|
| Humanitys Last Exam (HLE) | 41.6% (44.4% with parallel rollout) | New SOTA | Indicates improved expert-level reasoning and planning. |
| FrontierMath | 27.4% | Vastly outperforms previous models | Strong math problem-solving with tool use. |
| Complex Knowledge-Work Tasks (Internal) | Comparable to/better than humans in ~50% cases | Significantly outperforms o3, o4-mini | Directly tackles real-world professional tasks. |
| DSBench (Data Science) | Significantly surpasses human performance | Human performance | Excels in realistic data analysis and modeling. |
| SpreadsheetBench | 35.27% (45.54% with .xlsx editing) | Copilot in Excel (20.0%) | Superior spreadsheet manipulation and editing. |
| Investment Banking Analyst Modeling (Internal) | Significantly outperforms Deep Research and o3 | Deep Research, o3 | Handles complex financial modeling tasks effectively. |
| BrowseComp | 68.9% | Deep Research (51.5%) | Improved web information retrieval. |
| WebArena | Improved over o3-powered CUA | o3-powered CUA | Better performance on real-world web tasks. |
ChatGPT Agent’s benchmark results show significant advancements across various real-world tasks.
The Speculation: Is This GPT-5?
The chatter is loud: is the model powering ChatGPT Agent actually GPT-5? OpenAI hasn’t explicitly stated which model is behind these new capabilities. This silence, combined with the impressive benchmark results that significantly outperform previous models like o3 and o4-mini on complex tasks *without* explicit tool use, certainly fuels the speculation. When a new model sets State-of-the-Art performance on benchmarks like Humanitys Last Exam (HLE) and FrontierMath, its not just a minor update. The fact that it’s better than o3 at HLE speaks volumes.
My take? OpenAI is known for being tight-lipped about their latest breakthroughs before a full, splashy announcement. The performance gains, particularly in areas requiring advanced reasoning and multi-step execution, suggest a substantial underlying model improvement. Whether they call it GPT-5 or something else, it’s clear this isn’t just a re-packaging of existing tech. The capabilities are too pronounced. The ability to dynamically plan and choose its own tools, and then scale that with simple parallel rollout strategies to boost HLE scores, points to a sophisticated architecture. It’s a significant step forward, regardless of the numerical designation.
This kind of performance jump is exactly what we’ve been waiting for in the AI space. It moves beyond theoretical capabilities to practical, real-world utility that can genuinely impact professional workflows. It also sets a new bar for competitors like Grok and other models, pushing the boundaries of what an AI agent can achieve. This isn’t just about getting smarter; it’s about getting more capable and autonomous in meaningful ways. I’ve often discussed the importance of real-world benchmarks, and these results deliver.
The New Risks and Controls
With ChatGPT Agent able to take actions directly on the web and interact with your data via connectors or logged-in websites, new risks pop up. OpenAI has beefed up controls from the Operator research preview, adding new safeguards for sensitive information handling, broader user reach, and restricted terminal network access. While these measures do a lot to reduce risk, the expanded tools and wider user base mean the overall risk profile is higher. This isn’t just a chatbot anymore; it’s an agent that can act on your behalf.
A big focus is on safeguarding against adversarial manipulation through prompt injection. This is a general risk for agentic systems. A malicious instruction hidden on a webpage could trick the agent into unintended actions, like sharing private data from a connector or performing a harmful action on a logged-in site. Because ChatGPT Agent can take direct actions, successful attacks could have a bigger impact. They’ve trained and tested the agent to resist prompt injections and set up monitoring to detect and respond to attacks fast. Requiring explicit user confirmation before consequential actions, and allowing users to intervene, also helps. You need to weigh these tradeoffs and consider disabling connectors when not needed.
They’ve also implemented mitigations for model mistakes, especially since it can now perform tasks that impact the real world:
- Explicit user confirmation: ChatGPT asks for your permission before any actions with real-world consequences, like making a purchase.
- Active supervision (Watch Mode): Critical tasks, like sending emails, need your active oversight.
- Proactive risk mitigation: ChatGPT is trained to refuse high-risk tasks, such as bank transfers.
Additional controls limit the data the model can access:
- Privacy controls: One click in settings deletes all browsing data and logs you out of active website sessions.
- Secure browser takeover mode: When you use ChatGPTs browser (takeover mode), your inputs are private. ChatGPT doesn’t collect or store data like passwords, because it doesnt need them, and its safer if it never sees them.
These safety measures are critical. As AI becomes more autonomous and integrated into our workflows, the potential for misuse or unintended consequences grows. OpenAI’s transparent approach to these risks, and their proactive implementation of safeguards, sets a good precedent. It acknowledges the power of the tool and the responsibility that comes with it. This is a far cry from some of the earlier ‘move fast and break things’ mentalities we saw in tech. The stakes are much higher now.
Biological Risk and Next Steps
Given the agent’s increased capabilities, OpenAI is classifying ChatGPT Agent under their Preparedness Framework for High Biological and Chemical capabilities. This is a cautious step; they don’t have definitive proof it could help a novice cause severe biological harm, but they’re implementing safeguards now. This model has their most robust safety stack for biology: extensive threat modeling, dual-use refusal training, constant classifiers and reasoning monitors, and clear enforcement. This proactive approach to safety is exactly what’s needed for powerful AI.
They’re working with outside biosecurity experts, safety institutes, and academic researchers to refine their threat model, assessments, and policies. Biology-trained reviewers validated their evaluation data, and domain-expert red teamers stress-tested safeguards. They even held a Biodefense workshop to speed up collaboration and AI-powered biodefense research. You can read more in the system card. They’re also launching a bug bounty program to find and fix real-world risks.
The fact that OpenAI is taking such a serious stance on biological and chemical risks, even without definitive proof of harm, shows a heightened awareness of the broader implications of advanced AI. This isn’t just about preventing a chatbot from giving bad advice; it’s about ensuring a powerful agentic system doesn’t inadvertently enable dangerous activities. This level of foresight and collaboration with external experts is commendable and necessary as AI capabilities grow.
Availability and Whats Next
ChatGPT Agent started rolling out today to Pro, Plus, and Team users, with Pro gaining access by the end of the day and others over the next few days. Enterprise and Education users will get access in coming weeks. Pro users get 400 messages monthly, while other paid users get 40, with more usage via flexible credit. They’re still working on access for the European Economic Area and Switzerland.
The Operator research preview site will be sunsetted in a few weeks. Deep Research features are now part of ChatGPT Agent. If you prefer the original Deep Research feature, which is slower but more detailed, you can still select it from the dropdown. This is a similar transition to how Perplexity’s Deep Research feature works, offering specialized modes for different user needs.
ChatGPT Agent is still in early stages. It can still make mistakes. The slideshow functionality is in beta; outputs can be basic in formatting, especially without an existing document. Their initial focus was on organizing information for presentations with editable elements, prioritizing structure and flexibility. There are also occasional discrepancies between the viewer and exported PowerPoint slides, which they’re working to fix. You can upload existing spreadsheets, but not slideshows, yet. Theyre training the next iteration for more polished and capable slideshows. Overall, expect ongoing improvements in efficiency, depth, and versatility, including smoother interactions as they balance user oversight and safety. This sounds a lot like the iterative development cycle seen with models like Grok, where early versions are functional but steadily improve, as I noted in my GPT-5/Grok-4 rubric analysis.
The Future of AI Agents: Beyond ChatGPT
OpenAI’s move with ChatGPT Agent isn’t happening in a vacuum. The entire AI industry is shifting towards agentic systems. We’re seeing models that don’t just generate text or images but can plan, execute, and adapt. This means the future of AI isn’t just about raw intelligence, but about the ability to act autonomously and interact with the real world through tools and interfaces. This is where the true value lies for businesses and individuals.
Consider the broader implications. As these agents become more capable, they will fundamentally change how work is done, not just in knowledge work but across various industries. Imagine AI agents handling complex logistics, managing supply chains, or even assisting in scientific discovery with minimal human oversight. The benchmarks indicate that these systems are already outperforming humans in specific, complex tasks. This isn’t about replacing people wholesale, but about augmenting human capabilities to an unprecedented degree. It means that the skills needed to thrive will increasingly involve directing and collaborating with AI agents, rather than performing every task manually.
The competition in this space will also intensify. While OpenAI has made a significant move, other players are not far behind. Google, Meta, and even smaller startups are investing heavily in agentic AI. The race is on to create the most robust, reliable, and safe AI agents. This competition will drive innovation at an incredible pace, leading to even more sophisticated tools in the near future. It also means that businesses and individuals who adopt these technologies early will gain a significant competitive advantage. This isn’t a trend to watch from the sidelines; it’s a technology to actively integrate into your strategy.
The challenges, of course, remain. Ensuring the safety, reliability, and ethical deployment of these powerful agents will be paramount. The prompt injection risks, model mistakes, and biological safeguards highlighted by OpenAI are not just technical hurdles; they are societal responsibilities. The industry must continue to prioritize safety research and responsible deployment as capabilities grow. But despite these challenges, the trajectory is clear: autonomous AI agents are here, and they are poised to redefine what’s possible.