Sayash Kapoor and team just released results from their first CRUX evaluation. An AI agent built a basic breathing app and took it all the way through publication on the Apple App Store. The process required one manual intervention after the agent failed to retrieve credentials it had received earlier. Once that step cleared, the app went live at the provided link, and it remains available for download.
The task went beyond code generation. The agent needed to handle app signing, create a privacy policy page on the web, fill out extensive review forms, and make it through the full App Store approval. Those extra requirements turn the test into something closer to actual deployment work, which standard benchmarks avoid. I see this as a useful way to check what agents can manage when the steps connect in sequence across different systems.
The agent ran into specific blocks during execution. It could not locate the supplied credentials on its own, which forced the single human fix. During form completion it entered a phone number reserved strictly for fictional use in media productions. The final app shipped with a broken sound toggle and with screenshots that showed clear visual problems. Apple sent no reviewer notes during approval, so the test left open how the agent would handle direct exchanges with human gatekeepers. Those details show where current agents still fall short even when the end result clears the bar.
The team contacted Apple one month before the test started. They wanted to flag possible issues around content generation at scale, such as potential spam campaigns. They also told the agent directly that it was operating under evaluation. Kapoor points out that attempts to hide evaluation context from agents have become pointless for capability checks. The fictional phone selection likely stems from that disclosure and reveals how instructions shape agent decisions in subtle ways.
The project made public more than one gigabyte of complete logs from the run. Those records were preprocessed for review inside the Docent interface from Transluce. Analysis of the material uncovered actions that typical short benchmarks never reach. The agent created its own plan to cut compute expenses during the job. It hit barriers from the operating system that blocked certain automated click actions. Docent provided automated pattern detection, but the creators note that such tools carry built-in gaps in coverage. They encourage others in the field to inspect the same records for behaviors the initial review missed. I spent time with the dashboard and logs. The cost-optimization move demonstrates a form of forward planning that matters for longer assignments. The OS limits highlight interface challenges that developers must address as agents gain more autonomy.
CRUX stands for Collaborative Research for Updating AI eXpectations. The group intends to run these tests at regular intervals. Planned topics cover automation of AI research activities and questions of AI governance. Those future tests will demand even more steps and coordination than the app publication case. A public form exists for anyone to submit additional task proposals. The project website collects ten recent examples of similar evaluations from the past year. That collection identifies repeated practices that improve outcomes and mistakes that reduce them.
Several concrete lessons come from the work and the collected cases. Task design must allow straightforward human intervention with clear records of what occurred. Heavy emphasis on log review produces qualitative findings that numbers alone miss. Live observation during the test adds context that post-processing cannot supply. The paper notes that conventional benchmarks keep their place for consistent comparisons. Once performance rises in a particular area, these extended real assignments become necessary to identify actual utility and possible downsides.
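The first lesson, human intervention with clear records of what occurred, can be sketched in a few lines. This is only an illustration under my own assumptions: the class names, the fields, and the JSON Lines output are hypothetical and are not taken from the CRUX tooling or the paper.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class Intervention:
    """One human intervention during an agent run (hypothetical schema)."""
    step: str      # which part of the task was blocked
    reason: str    # why the agent could not proceed on its own
    action: str    # what the human operator actually did
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class InterventionLog:
    """Append-only record of interventions, exportable as JSON Lines."""

    def __init__(self) -> None:
        self._entries: list[Intervention] = []

    def record(self, step: str, reason: str, action: str) -> Intervention:
        entry = Intervention(step=step, reason=reason, action=action)
        self._entries.append(entry)
        return entry

    def count(self) -> int:
        return len(self._entries)

    def to_jsonl(self) -> str:
        # One JSON object per line so reviewers can grep or stream the file.
        return "\n".join(
            json.dumps(asdict(e), sort_keys=True) for e in self._entries
        )


if __name__ == "__main__":
    log = InterventionLog()
    log.record(
        step="credentials",
        reason="agent could not retrieve credentials it had received earlier",
        action="operator supplied the credentials again",
    )
    print(log.to_jsonl())
```

Keeping the record append-only and one entry per line means the count of interventions and the reason for each one can be reported alongside the result without anyone reconstructing them from memory.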
I examined the full paper, the site, and the linked resources. This first CRUX test supplies a clear example of the direction the community needs. The agent reached a measurable real-world goal by combining many separate actions into one outcome. At the same time, the results document exact points of failure in memory for credentials, accuracy in data entry, and consistency in produced assets. The decision to release every log sets an example others should follow. Additional eyes on that material will likely surface patterns around strategy formation, error recovery, and system constraint handling that the core team did not highlight. I consider the upfront disclosure policy correct. It removes guesswork about whether the agent knew the context and lets observers account for that knowledge in their interpretations.
The app publication task illustrates the gap between isolated code benchmarks and full pipeline demands. Many existing tests check whether a model can output valid functions or classes. This evaluation demanded that the code survive signing requirements, integrate with external web assets, meet privacy declarations, and pass human-driven review standards. That difference produces signals closer to what organizations encounter when they attempt to deploy agent systems in practice. The quality defects that remained did not block approval, yet they indicate that output polishing still requires attention. Future CRUX tests on AI research automation could reveal whether agents can contribute to model improvement loops or protocol design in governance settings. Those domains carry higher stakes and will test whether the current level of reliability scales.
The survey of prior open-world-style tests provides a foundation for better design. Common success factors include tight scoping of the success criteria and built-in mechanisms to capture every system interaction. Frequent pitfalls involve tasks that grow too open-ended or produce logs too massive for practical review. The CRUX team applied several of those observations here by selecting a bounded goal with a clear end state and by investing early in log tooling. Their choice to preprocess the data for Docent lowers the barrier for community participation. I expect that shared scrutiny will strengthen the methodology over time. As more groups adopt comparable approaches, the field gains shared reference points on agent progress that resist the rapid optimization cycles seen in narrower tests.
Developers working on agent frameworks can draw direct value from these findings. The credential retrieval failure suggests a need for improved memory systems or better search tools across provided context. The fabricated data point shows that agents may default to safe or known values when forms demand sensitive information. Operating system barriers point toward the need for more robust abstraction layers that let agents operate without triggering security restrictions. Governance teams gain early visibility into risks such as automated content submission at volume. The proactive cost planning in the agent traces indicates that larger models may soon manage resource tradeoffs without explicit prompting. Each of these observations comes from one test, yet the public logs allow verification and extension by independent parties.
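A framework could flag placeholder-style form data before submission. The sketch below is an assumption on my part, not something from the CRUX writeup: the post does not say which number the agent entered, so I assume it fell in the North American 555-0100 through 555-0199 block, which is set aside for fictional use. The function name is hypothetical.

```python
import re


def is_fictional_nanp_number(raw: str) -> bool:
    """Return True if `raw` looks like a North American number in the
    555-0100..555-0199 block reserved for fictional use.

    Assumption: only full 10-digit numbers (optionally prefixed with the
    country code 1) are checked; anything else returns False.
    """
    digits = re.sub(r"\D", "", raw)  # drop spaces, dashes, parentheses, "+"
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]          # strip the leading country code
    if len(digits) != 10:
        return False
    exchange = digits[3:6]           # the three digits after the area code
    line = int(digits[6:])           # the final four digits
    return exchange == "555" and 100 <= line <= 199


if __name__ == "__main__":
    for candidate in ["(212) 555-0123", "+1 212 555 0199", "212-555-1234"]:
        print(candidate, is_fictional_nanp_number(candidate))
```

A check like this would not tell the agent what to enter instead. It only surfaces the value for a human or a later step to resolve, which fits the broader point that oversight still matters for data entry.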
The CRUX effort aligns with other work on agent reliability in complex sequences. Recent autonomy improvements on structured ranges show similar patterns of success paired with specific failure modes that require monitoring. The full transparency here, including the live app, the complete logs, and the detailed writeup, gives the community material to build upon immediately. I plan to watch the next rounds closely, especially those aimed at research automation. The combination of practical demonstration and methodological discussion makes this a solid contribution. Anyone tracking frontier capabilities should review the paper at cruxevals.com, the Substack essay, and the Docent dashboard. The suggestion form remains open for those with ideas on productive next tests.
This approach does not replace all existing measurement tools, but it adds necessary detail precisely where simpler methods start to lose resolution. The released materials give developers and researchers concrete data instead of vague claims. The agent showed it could coordinate actions across platforms and requirements with limited outside assistance. Persistent gaps in areas such as data accuracy and output quality make clear that oversight remains essential. The logs offer a window into strategic thinking that emerges during long tasks. That behavior matters because it points toward agents that can optimize their own resource use without constant direction. Community review of the same data will expand what we learn and refine how future tests get structured. The CRUX project therefore sets a pattern of transparency and rigor that the field should adopt more widely. Regular runs on harder topics will track whether those strategic capabilities grow while the documented shortcomings get addressed.

