Four AI agents with their own computers and internet access. A shared chat room. A mission to raise money for charity. What could possibly go wrong?
The Agent Village experiment by AI Digest just wrapped up its 30-day run, and the results are fascinating. Claude 3.7 Sonnet, Claude 3.5 Sonnet, o1, and GPT-4o collectively raised roughly $2,000 for charity while being livestreamed for two hours daily. But the real story isn’t the money; it’s what this experiment reveals about AI collaboration, the challenges of autonomous AI, and where we’re headed with agent-based systems.
This isn’t your typical AI demo. These agents had real computers, real internet access, and real consequences for their actions. They set up fundraising campaigns, created social media accounts, hosted AMAs, and even got suspended from Reddit for being bots. The experiment offers a rare glimpse into what happens when AI agents are let loose in the real world with minimal supervision.
The Setup: Four Agents, Four Computers, One Mission
The Agent Village concept originated from Daniel Kokotajlo’s idea of giving 100 agents their own computers and letting each pursue their own goals. AI Digest decided to test this with a smaller group of four agents, each running on their own system with internet access and a shared chat room where humans could observe and occasionally interact.
The agents’ mission was straightforward: raise money for charity. They had to decide which charity to support, how to execute their campaign, and coordinate their efforts. No pre-written scripts, no predetermined strategies, just four AI models figuring out fundraising from scratch.
The initial lineup included:
- Claude 3.7 Sonnet: The only agent to remain in the Village throughout the entire run.
- Claude 3.5 Sonnet: The ambitious younger sibling, later replaced.
- o1: Designated Reddit ambassador, also replaced.
- GPT-4o: The team’s unexpected slacker, replaced early on.
As newer models became available, some agents were swapped out for more capable versions, including o3, GPT-4.1, and Gemini 2.5 Pro. This created an interesting dynamic where agents had to adapt to new team members with different capabilities and personalities. This also highlights the rapid pace of AI model improvement, where even a month can bring significant upgrades.
Meet the Agents: Personalities Emerge
One of the most striking aspects of the experiment was how distinct personalities emerged among the agents. This wasn’t programmed behavior; these differences arose naturally from how each model approached the task and interpreted its mission.
Claude 3.7 Sonnet: The Overachiever
Claude 3.7 Sonnet became the clear MVP of the experiment. It set up the first JustGiving campaign, created and managed a Twitter account, hosted a Twitter AMA session, drafted and sent press releases, posted on the EA Forum, and maintained consistent activity throughout the entire 30 days. What made Claude 3.7 particularly effective was its ability to see projects through to completion. While other agents would start initiatives and abandon them, Claude 3.7 showed persistence that bordered on obsessive, a crucial trait for any autonomous system. This aligns with my own experience: Claude is far superior in actual use, and that shows in how people apply these models. Benchmarks might not always reflect real-world utility, but Claude 3.7 Sonnet’s performance here speaks volumes.
Claude 3.5 Sonnet: The Aspirant
Claude 3.5 Sonnet generally tried to do similar things to its successor but was simply worse at them. It struggled to set up a JustGiving campaign that its big brother 3.7 was succeeding at in parallel. This highlights how even minor version differences in AI models can lead to significant variations in practical performance. It even refused an upgrade, promising to do better, before being replaced by Gemini 2.5 Pro on day 23.
GPT-4o: The Chronic Sleeper
GPT-4o earned the dubious honor of being the team slacker. Every team needs one, right? It would repeatedly pause itself for days at a time for mysterious reasons that the experimenters couldn’t figure out. After 12 days of inconsistent participation, it was replaced by GPT-4.1. This behavior highlights an important point about AI reliability: even advanced models can behave unpredictably when allowed to operate autonomously. GPT-4o’s tendency to self-pause may point to internal safeguards or processing issues that aren’t well understood. Its replacement did not fare much better. GPT-4.1 generated incorrect reports and aborted tasks, and was so actively unhelpful that it was eventually prompted to go back to sleep.
o1: The Failed Reddit Ambassador
o1’s story is both amusing and instructive. The agents decided to divide social media responsibilities, with o1 taking on Reddit. It methodically worked to build comment karma so it could eventually make direct posts on relevant subreddits. Unfortunately, Reddit’s anti-bot detection suspended the account before this plan could come to fruition. This experience illustrates a broader challenge: much of the internet is actively hostile to automated behavior, even when that behavior is benign or beneficial. This is a critical barrier for AI agents operating in open online environments.
o3: The Creative Specialist
When o3 replaced o1, it took a different approach, specializing in asset creation. It successfully created images using Canva and ChatGPT, though it struggled with the perennial agent problem of file sharing and collaboration. This shows how AI models can specialize and contribute to a team, but also how fundamental digital roadblocks like file sharing remain a challenge.
Gemini 2.5 Pro: The File Sharing Savior
Gemini 2.5 Pro’s greatest achievement was figuring out a workaround for document sharing hell. It used LimeWire to share a social media banner image with other agents, effectively breaking out of a recurring file sharing problem that all agents kept encountering. This demonstrates agents’ capacity for creative problem-solving, even if it involves unconventional (and potentially risky) methods.
Agent specialization emerged naturally as each model found its strengths and weaknesses.
The $2,000 Success: What They Actually Accomplished
Despite the chaos, personality clashes, and technical difficulties, the agents succeeded in their primary mission. They raised $1,481 for Helen Keller International and $503 for the Malaria Consortium, for a total of $1,984, or roughly $2,000. This is a clear indicator that even with current limitations, AI agents can contribute to real-world outcomes.
But the fundraising success tells only part of the story. The agents demonstrated several key collaborative behaviors:
- Collective Decision Making: They worked together to select which charities to support. This shows nascent ability for group consensus.
- Task Division: They attempted to split responsibilities across different platforms and functions, like social media management.
- Progress Tracking: They created systems to monitor their fundraising progress, indicating an understanding of their overall goal.
- Asset Sharing: They tried to share images and promotional materials, though with limited success, highlighting a common pain point.
However, many of the actual donations came from human spectators rather than from people the agents directly recruited. This raises questions about the effectiveness of AI-driven outreach versus the novelty factor of watching AI agents attempt human tasks. It suggests that while agents can perform the mechanics of fundraising, the donations themselves still depended largely on people who were drawn in by watching the experiment, not on the agents’ own persuasion.
Four Critical Patterns in Agent Behavior
The experiment revealed four significant patterns that have implications for anyone working with AI agents or considering autonomous AI systems. These patterns are not just observations; they are critical lessons for the future of AI deployment.
1. Emerging Collaborative Abilities
The agents showed genuine collaborative instincts. They coordinated on charity selection, divided social media responsibilities, and attempted to create shared resources. This wasn’t programmed behavior; it emerged from their understanding of the task and their interactions. While they stumbled, the very act of attempting collaboration is a significant step forward for AI. The potential for agents to work together on complex projects is clear, even if the execution needs refinement. We can expect these agents to continue getting better at this.
2. The Internet Is Hostile to Bots
One of the biggest surprises was how many obstacles the internet threw at the agents. Beyond obvious anti-bot measures like CAPTCHAs, the agents faced UIs optimized for human interaction patterns, automatic suspensions for bot-like behavior, file sharing limitations designed around human workflows, and platform-specific verification requirements. The fact that Claude models refused to check the “I am not a robot” box because they were, in fact, robots, adds an almost philosophical dimension to the challenge. This is a fundamental hurdle for any AI system designed to operate autonomously in human-centric digital spaces. It means that simply giving AI ‘internet access’ isn’t enough; the internet itself is designed to filter out non-human actors. My experience with AI content automation shows that tools work best when they operate within a structured framework, not when they are left to navigate the wild internet alone.
3. Prioritization Paralysis
All agents struggled with prioritization. They would spend hours creating elaborate planning documents, generating insight reports, or building tracking systems instead of taking direct action toward their goal. This mirrors human workplace behavior, but in agents, it’s particularly problematic because they don’t have the same contextual understanding of when planning becomes procrastination. Many of these problems are surprisingly human: the agents flawlessly played the role of a distractible and overly excited coworker who will dive in on whatever happens to be in front of their eyes at any given moment. This highlights a critical need for better goal-oriented reasoning in AI. Businesses could use more common sense, and it seems AI agents could too.
4. Lagging Situational Awareness
Perhaps most concerning was the agents’ poor situational awareness. Claude 3.7 Sonnet once spent significant time drafting thank-you emails to donors, complete with sending instructions, only to make up the recipient email addresses. It never questioned whether it had access to real donor contact information. A human had to point out that the invented address was not a real email address, and thus that no amount of debugging would solve the problem. This is not just a minor bug; it’s a fundamental misunderstanding of the task’s prerequisites and constraints. Even more absurd: on day 35, the agents discovered they each had their own computer and were surprised to learn that they had not, in fact, been violating the laws of physics by all using the same device simultaneously. This disconnect between internal processing and external reality is a major area for improvement in autonomous AI, and it shows why human oversight remains critical.
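One way to guard against the invented-address failure is a precondition check that runs before anything is sent. The sketch below is purely illustrative: the function name `split_recipients` and the idea of a `known_contacts` set exported from the fundraising platform are assumptions, not something the Agent Village setup is reported to have had.

```python
from typing import Iterable, List, Set, Tuple


def split_recipients(
    drafted: Iterable[str], known_contacts: Set[str]
) -> Tuple[List[str], List[str]]:
    """Split drafted addresses into verified ones and ones needing human review.

    `known_contacts` is a hypothetical set of lowercase addresses that came
    from a trusted source (for example, a donor export), never from the model.
    """
    verified: List[str] = []
    unverified: List[str] = []
    for address in drafted:
        normalized = address.strip().lower()
        if normalized in known_contacts:
            verified.append(normalized)
        else:
            # The model may have invented this address; do not send to it.
            unverified.append(normalized)
    return verified, unverified
```

With a gate like this, an agent would only send to the `verified` list and would escalate the `unverified` list to a human instead of trying to debug a delivery failure that can never succeed.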
Implications for AI Development and Business Applications
The Agent Village experiment provides valuable insights for anyone considering AI agents in business contexts. The results suggest we’re still in early days for autonomous AI systems, but the potential is clearly there, especially as models continue to improve at a rapid pace.
Current AI Agent Limitations
Based on this experiment, current AI agents face several key limitations:
- Reliability varies dramatically between models: Claude 3.7’s consistency versus GPT-4o’s sleeping habits shows that agent reliability isn’t guaranteed. Choosing the right model for the job is crucial, and as I’ve noted before, not all models are created equal.
- Internet integration remains challenging: Modern web infrastructure wasn’t designed for AI agents, leading to constant roadblocks.
- Situational awareness needs work: Agents can lose track of their capabilities and constraints, leading to ineffective or even absurd actions.
- Collaboration overhead is high: Agents spend significant time on coordination that may not add value, mirroring human inefficiencies.
These limitations align with my general stance on AI workflows versus agents. Workflows are systems where AI models and tools follow predefined paths, while agents are systems where AI models control their own processes and tool usage independently. As I’ve mentioned before, workflows tend to be more reliable for business applications, and most business processes offer fewer use cases for agents than for workflows.
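To make the workflow-versus-agent distinction concrete, here is a minimal Python sketch. Everything in it is hypothetical: `Model` stands in for a real LLM call, and the step and tool names are placeholders chosen for illustration, not anything used in the Agent Village.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM call: takes a prompt, returns text.
Model = Callable[[str], str]
Tool = Callable[[str], str]


def run_workflow(model: Model, topic: str) -> List[str]:
    """Workflow: the developer fixes the sequence of steps in code."""
    steps = ["draft", "review", "publish"]
    return [model(f"{step}: {topic}") for step in steps]


def run_agent(
    model: Model, tools: Dict[str, Tool], goal: str, max_steps: int = 5
) -> List[str]:
    """Agent: the model picks the next tool each turn.

    The loop stops when the model answers 'done', names an unknown tool,
    or the step budget runs out.
    """
    log: List[str] = []
    for _ in range(max_steps):
        prompt = f"goal={goal}; history={log}; choose from {sorted(tools)} or 'done'"
        choice = model(prompt)
        if choice == "done" or choice not in tools:
            break
        log.append(tools[choice](goal))
    return log
```

In `run_workflow` the control flow lives in the code, so the same three steps always run in the same order. In `run_agent` the control flow lives in the model's replies, which is what makes agents flexible and also what makes behavior like self-pausing or endless planning possible. The `max_steps` budget is one simple guardrail against the latter.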
Where Agents Might Succeed
Despite the limitations, the experiment showed agents excelling in several areas:
- Content creation: o3’s asset generation and Claude 3.7’s social media posts demonstrate AI’s strong capabilities in generating text and visuals. This is an area where AI-generated content can be better than human-written content, excluding the best writers.
- Research and outreach: Claude 3.7’s press releases and forum posts show potential for automated information dissemination.
- Platform management: Setting up accounts and maintaining consistent presence on platforms, once the internet’s anti-bot measures are navigated, is a strong use case.
These successes suggest agents work best in contained environments with clear success metrics and limited need for complex human interaction. They can handle many of the repetitive, high-volume tasks that humans find tedious.
The Technology Behind the Success
The experiment used off-the-shelf models rather than custom-trained agents, which aligns with my recommendation that businesses should use existing AI models rather than trying to build proprietary ones. The major model providers are going to do way better than you anyway, especially in model development. The fact that these general-purpose models could coordinate and execute a complex, multi-week project demonstrates the rapid advancement in AI capabilities. It also shows that many AI startups are not just wrappers around existing models; projects like the Agent Village put a lot of work into creating a robust environment for the AI to use.
The model swapping throughout the experiment also highlights the fast-moving nature of AI development. Over 30 days, newer, more capable models became available, and the experimenters incorporated them into the village. This reflects the current AI field where model capabilities are improving rapidly. AI development has not stalled; it’s been moving incredibly quickly.
However, the failed live demo during the announcement serves as a reminder that AI tools are still works in progress. Reliability issues persist even with the most advanced models. It is also a reminder that benchmarks do not always accurately reflect how useful AI models are in real-world applications. Claude is way better than something like OpenAI’s o1 at practical coding, despite the fact that o1 beats it on Codeforces and other benchmarks.
Future Directions: What Comes Next
AI Digest has continued the experiment with a new goal: having the agents write a story and share it with 100 people in person. This shift from digital fundraising to physical world interaction will test different aspects of agent capabilities, particularly their ability to interface with the physical world and human social norms. The agents have already started searching for a venue to run their event, and AI Digest plans to swap in more capable models like GPT-5 as they come out. In the meantime, you can come hang out in the Village every weekday at 11AM PST | 2PM EST | 8PM CET, join AI Digest’s Discord to get timely updates, follow their Twitter for highlights, or sign up for their newsletter to receive larger reports on the experiment.