
OpenAI Dev Day 2025: Apps, AgentKit, GPT-5 Pro, and the Platform Play

OpenAI Dev Day 2025 made its priorities obvious: enable developers to build, test, and ship inside OpenAI’s stack. The announcements are not a rebranding of existing capabilities; they are a practical set of tools and model options that let engineering teams move from prototype to production with fewer external pieces. If you care about agent orchestration, realtime voice, coding agents, or very large-context reasoning, there are concrete new options to evaluate.

Apps inside ChatGPT and the Apps SDK

Apps inside ChatGPT change distribution. Instead of sending users to a separate product, you can bring an app to where users already spend time. That matters for product teams that want immediate adoption and minimal onboarding friction.

Key practical points:

  • SDK-first model for building interactive experiences that run inside ChatGPT.
  • Direct access to an existing user surface for discovery and distribution.
  • Works best for workflows that are short and frequent: approvals, quick lookups, small automations, or assistant-style helpers.

This reduces the plumbing between an idea and a working in-chat experience. Expect most early Apps to be integrations and workflow wrappers rather than full standalone products. That’s fine. For many teams, reaching users inside ChatGPT will be worth the narrower feature set.

AgentKit: Builder, ChatKit, Guardrails, and Evals

AgentKit is the practical scaffolding teams have been asking for. Instead of building custom state machines and ad hoc safety checks, you can use tools designed for agent workflows.

  • Agent Builder provides a visual way to wire steps, APIs, and decision points. Useful for teams that want predictable orchestration without hand-rolled state logic.
  • ChatKit reduces boilerplate for turn management, context windows, and UX primitives for chat experiences.
  • Guardrails are safety controls and policy hooks that standardize filtering, rejection conditions, and auditing.
  • Evals close the loop on regression testing and quality measurement for agent behavior.

Two operational recommendations: first, treat Guardrails as part of your CI pipeline. Run Evals on any change that touches prompts, connectors, or tool logic. Second, design your agents with observability in mind: logs, traceable tool calls, and replayable transcripts are the easiest ways to debug agent regressions.
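The observability recommendation above can be sketched as a minimal transcript recorder. The class and event shapes here are illustrative, not an OpenAI API; the point is that every turn and tool call lands in a replayable log you can feed back into Evals:

```python
import json
import time
from dataclasses import dataclass, field
from typing import Any

@dataclass
class AgentTrace:
    """Records every model turn and tool call so a failed run can be replayed."""
    run_id: str
    events: list[dict[str, Any]] = field(default_factory=list)

    def log_turn(self, role: str, content: str) -> None:
        self.events.append({"ts": time.time(), "type": "turn",
                            "role": role, "content": content})

    def log_tool_call(self, name: str, args: dict[str, Any], result: Any) -> None:
        self.events.append({"ts": time.time(), "type": "tool_call",
                            "name": name, "args": args, "result": result})

    def dump(self) -> str:
        """Serialize to JSON Lines for storage and later replay in Evals."""
        return "\n".join(json.dumps(e) for e in self.events)

# Hypothetical run: one user turn, one tool call, one assistant turn.
trace = AgentTrace(run_id="demo-001")
trace.log_turn("user", "What is the refund policy?")
trace.log_tool_call("lookup_policy", {"topic": "refunds"}, {"days": 30})
trace.log_turn("assistant", "Refunds are accepted within 30 days.")
```

Storing traces as JSON Lines keeps them greppable and lets a regression harness replay the exact tool inputs that produced a failure.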

Codex GA — practical tooling for coding agents

Codex's GA release ships with an SDK, Slack integration, and an admin dashboard. That admin experience matters for engineering organizations: org-level usage, access controls, and policy mapping make it usable for regulated teams.

If you plan to build coding assistants that operate in team chat or small internal UIs, Codex GA becomes the natural option to evaluate first. Combine it with Evals to prevent silent performance regressions on your key dev tasks. For implementation detail and prompting guidance, see my Complete Guide to GPT-5-Codex API and Prompting.

New models — what each one is actually for

GPT-5 Pro

  • Purpose: hard reasoning and long, multi-turn dialog that needs a very large context window.
  • API: Responses API with default reasoning.effort set high. Use background mode for long runs.
  • Modalities: text in/out, image input allowed. No audio or video, and no Code Interpreter.
  • Limits: 400,000 token context window and 272,000 max output tokens.
  • Pricing per 1M text tokens: Input $15, Output $120.
  • Performance: roughly 40% faster on the Priority tier.

When to pick GPT-5 Pro: large-document assistants, enterprise research agents that must keep long histories, and workflows that need deep multi-hop reasoning. It is expensive. Use it where the value of fewer API calls and huge context outweighs cost.
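Putting the bullets above together, a GPT-5 Pro request might be assembled like this. The field names (model, input, reasoning.effort, background) follow the Responses API shape described at Dev Day, but treat them as assumptions and verify against the current API reference before relying on them:

```python
def build_gpt5_pro_request(prompt: str, long_run: bool = False) -> dict:
    """Assemble a Responses API payload for GPT-5 Pro.

    Field names mirror the publicly described Responses API; check the
    current docs before shipping this.
    """
    payload = {
        "model": "gpt-5-pro",
        "input": prompt,
        # High reasoning effort is the default for GPT-5 Pro; stated
        # explicitly here for clarity.
        "reasoning": {"effort": "high"},
    }
    if long_run:
        payload["background"] = True  # background mode for long analysis runs
    return payload

request = build_gpt5_pro_request("Summarize the attached 300-page filing.",
                                 long_run=True)
```

Keeping payload construction in one function makes it easy to swap in a cheaper model for routine traffic without touching call sites.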

gpt-realtime-mini

  • Purpose: cost-efficient realtime text and audio for WebRTC, WebSocket, and SIP integration.
  • Modalities: text and audio in/out, plus image input (no image output). No video. No function-calling or structured outputs on this preview model.
  • Context/output caps: 32,000 context window and 4,096 max output tokens.
  • Pricing: text output is inexpensive compared to GPT-5 Pro. Audio pricing is higher but competitive for realtime use.
  • Rate limits: scale across five tiers up to high throughput for live apps.

gpt-realtime-mini is the practical choice for low-cost voice agents and call center assistants. But plan architecture around its feature limits. If you need reliable JSON responses or function calling, either build a post-processing layer or pick a different model.

gpt-image-1-mini

A lower-cost image model for image understanding and lightweight generation. Useful for thumbnails, metadata extraction, visual search hints, or draft creatives where cost matters more than top-tier fidelity.

Sora 2 and Sora 2 Pro

Video models for API and ChatGPT with per-second pricing. They enable richer creative workflows, but they also increase responsibility for provenance and rights. If your product uses generated video for public-facing content, you must bake in rights checks, consent collection, and watermarking or provenance metadata.

Read my Sora posts for more practical notes on render quality and account-level caveats.

Service health, snapshots, and production stability

Snapshotting model aliases is a small feature with big operational impact. Locking to gpt-realtime-mini-2025-10-06 means reproducible behavior for tests and regression checks. For regulated environments, pin versions in your CI and only upgrade after running Evals across your critical paths.
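Pinning can be enforced in code rather than by convention. A sketch, using the snapshot name from above (the resolver itself and its failure behavior are my suggestion, not a platform feature):

```python
# Floating alias -> pinned snapshot. The realtime-mini snapshot is the one
# named in this post; add entries for every model you ship against.
PINNED_MODELS = {
    "gpt-realtime-mini": "gpt-realtime-mini-2025-10-06",
}

def resolve_model(alias: str) -> str:
    """Return the pinned snapshot for an alias, failing loudly if unpinned.

    Raising instead of passing the alias through keeps CI from silently
    testing against a moving target.
    """
    try:
        return PINNED_MODELS[alias]
    except KeyError:
        raise ValueError(f"No pinned snapshot for {alias!r}; pin one before deploying")
```

Upgrading then becomes an explicit, reviewable diff to PINNED_MODELS, gated on a green Evals run.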

Use the service health dashboard to detect incidents early and automate fallback strategies. For example, degrade to a smaller model with a shorter context for non-critical responses, while throttling expensive Priority-tier calls.
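The degrade-and-throttle policy above reduces to a small routing function. Model names and limits are taken from this post; the routing logic itself is a sketch of one reasonable policy, not the only one:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelChoice:
    model: str
    max_context_tokens: int

# Limits as stated earlier in this post.
PRIMARY = ModelChoice("gpt-5-pro", 400_000)
FALLBACK = ModelChoice("gpt-realtime-mini", 32_000)

def pick_model(service_healthy: bool, request_critical: bool) -> ModelChoice:
    """During an incident, degrade non-critical traffic to the smaller,
    cheaper model and reserve expensive Priority-tier calls for requests
    that genuinely need them."""
    if service_healthy or request_critical:
        return PRIMARY
    return FALLBACK
```

The health signal would come from polling the service health dashboard or your own synthetic probes; the key point is that the fallback decision is centralized and testable.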

Cost control and architecture patterns

GPT-5 Pro is expensive for a reason. Here are tactics to control costs while getting value:

  • Chunk large inputs and cache embeddings or intermediate summaries where possible.
  • Use smaller models for routine interactions and reserve GPT-5 Pro for heavy-duty analysis runs.
  • Apply background mode for long-running jobs and charge internal teams or users for these runs when possible.
  • Use snapshot aliases in testing to prevent surprise budget regressions from model updates.
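The first two tactics in the list above can be sketched together: chunk the input with overlap, then memoize per-chunk summaries by content hash so re-runs never pay for the same model call twice. The summarize callable stands in for whatever model call you use:

```python
import hashlib

def chunk_text(text: str, max_chars: int = 2000, overlap: int = 200) -> list[str]:
    """Split a large document into overlapping chunks before sending to a model.
    Overlap preserves context across chunk boundaries."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks

_summary_cache: dict[str, str] = {}

def cached_summary(chunk: str, summarize) -> str:
    """Memoize per-chunk summaries keyed by content hash, so unchanged
    chunks skip the model call entirely on re-runs."""
    key = hashlib.sha256(chunk.encode()).hexdigest()
    if key not in _summary_cache:
        _summary_cache[key] = summarize(chunk)
    return _summary_cache[key]
```

In practice the cache would live in Redis or a database rather than process memory, but the hash-keyed structure is the same.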

For realtime voice agents: accept that you may need an orchestrator between the client and the model. gpt-realtime-mini is cheap for text and voice streams, but you will want server-side logic to handle function calls, structured data collection, and business-rule enforcement because the model cannot do function calling on that tier.
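The server-side orchestrator described above boils down to a dispatcher: something upstream parses the model's text into an intent and arguments, and this layer enforces business rules before anything touches a real system. The names here are illustrative:

```python
from typing import Any, Callable

def dispatch(intent: str, args: dict[str, Any],
             handlers: dict[str, Callable], allowed: set[str]) -> Any:
    """Server-side stand-in for function calling.

    The model's free-text output is parsed into (intent, args) upstream;
    this layer enforces the allow-list and business rules before any
    external system is invoked.
    """
    if intent not in allowed:
        raise PermissionError(f"intent {intent!r} is not allowed for this caller")
    handler = handlers.get(intent)
    if handler is None:
        raise KeyError(f"no handler registered for {intent!r}")
    return handler(**args)

# Hypothetical handler table for a support voice agent.
handlers = {"check_balance": lambda account: {"account": account, "balance": 42}}
result = dispatch("check_balance", {"account": "A-1"}, handlers,
                  allowed={"check_balance"})
```

Because the allow-list is per-caller, the same handler table can serve multiple agents with different permissions.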

Design patterns for agents

Practical patterns I recommend:

  • Tool-first agents: Make external business logic the source of truth. Keep the model stateless for decisioning where you can.
  • Guarded tool calls: Use Guardrails to validate calls, sanitize inputs, and block risky operations before invoking external systems.
  • Replayable transcripts: Store transcripts and tool invocations to reproduce failures and run Evals against known-edge cases.
  • Test harnesses: Build Evals into your CI so a model update fails your deploy if accuracy on core tasks degrades.

What to build now

  • Voice-first support assistant: gpt-realtime-mini for live calls with server-side business logic and Guardrails to enforce policies.
  • Large-document research assistant: GPT-5 Pro with background runs, intelligent chunking, and cached summaries.
  • Developer Slack bot: Codex GA integrated into Slack with Evals for regression checks on code generation tasks. See my SWE-bench comparison for how models stack on engineering tasks.
  • Low-cost image pipeline: gpt-image-1-mini for thumbnails and metadata plus Sora 2 for final video cuts if you require generated video outputs.

Community reaction and practical trade-offs

Developers are positive about Apps SDK and AgentKit because they reduce custom work. Realtime Mini gets praise as a cheap voice stack but preview limits are real. Sora 2 draws scrutiny for IP and deepfake risk, and teams shipping video need provenance and rights workflows. GPT-5 Pro draws interest but also concern about cost and when to actually use it.

Final notes and recommendations

Dev Day 2025 is less about a single breakthrough and more about putting a usable stack in front of teams. If you are shipping an agent product, the combination of Apps SDK for distribution, AgentKit for orchestration, Guardrails for safety, and Evals for test coverage is a useful baseline to adopt.

Start small: prototype with gpt-realtime-mini for voice flows or Codex GA for coding assistants. Run Evals early and pin snapshots for predictable behavior. Only move to GPT-5 Pro when large context or deep reasoning delivers clear ROI. And if you use Sora 2 in production, plan for legal and provenance requirements from day one. Links to deeper reading and practical guides are scattered above and in my other posts on Sora and Codex.

Cost comparison

Use smaller models for bulk interactions and reserve GPT-5 Pro for expensive reasoning tasks.

My take: OpenAI has assembled a pragmatic set of pieces for shipping agent-driven products. The challenge for teams will be choosing the right model for each job, building guardrails early, and using Evals and snapshots to keep behavior stable as the platform and models evolve.

Further reading and related posts: Complete Guide to GPT-5-Codex API and Prompting, Sora 2 is here, Sora 2 Pro Review, and SWE-bench Verified Models Compared.