Sonoma Dusk Alpha & Sonoma Sky Alpha: Exploring xAI’s Stealth LLMs with 2M Token Context Windows

Sonoma Dusk Alpha and Sonoma Sky Alpha appeared suddenly as two new stealth large language models. Each carries a 2 million token context window, which exceeds the sizes available in most public models at this point. Gemini 2.5 Pro aims for that capacity, but it remains unavailable to the general public so far. These Sonoma models operate for free during their alpha phase, reachable via OpenRouter, Kilo Code, anycoder, zer0_bs, and fiction.live. The purpose centers on gathering data and user input to refine them further.

Testers in X threads, Discord channels, and email exchanges describe them as quick and capable in handling reasoning. One initial benchmark reached 92.5% on SpeechMap. Sonoma Dusk Alpha processes image inputs and performs parallel tool calling, adding flexibility for various applications like analysis of visual data combined with text or coordinating multiple external actions.

Why the Stealth Approach?

The models arrive cloaked, without obvious branding or source details. OpenRouter describes them as anonymous offerings for community input, where prompts and outputs feed back to the developers for model training. This setup fits xAI’s habit of releasing versions quietly to assess performance in actual use before any official announcement.

Elon Musk posted on August 8, 2025, about Grok 4.20 potentially claiming the top spot that month. The September timing for Sonoma matches after delays noted in prior discussions. xAI repeated this with earlier models like Yosemite and Catalina, released under pseudonyms in the LMSys Arena as “Model X by [name] AI” from entities like Sequoia AI.

A few users guessed at ties to Gemini 3 due to name resemblances, but evidence contradicts that. Refusal behaviors resemble Gemini Pro 2.5, yet the writing approach, cloaking methods, and specific traits align with Grok. Probing questions draw out references to JARVIS or Iron Man without prompting, and pressure often leads to identity slips. Language tests in Portuguese and Spanish produce responses consistent with established Grok versions. A single test yielded an incorrect OpenAI attribution, explained by the built-in cloaking designed to mislead.

Discord participants, including tester Kyle, maintain high confidence in these as Grok 4.20 variants. The patterns match precisely with prior anonymous Grok releases, from cloaking to behavioral quirks.

Technical Breakdown

The 2 million token context allows processing of extensive inputs without losing track of earlier parts. This proves valuable for tasks involving lengthy documents, full code repositories, or extended dialogues where maintaining continuity matters. Users report fast response times even on demanding queries, making them suitable for real-time applications.

Sonoma Dusk Alpha handles multimodal inputs, accepting images with text for tasks like describing visuals or integrating visual analysis into broader reasoning. Parallel tool calling enables simultaneous interactions with multiple functions, useful in agent setups or systems requiring coordination across APIs or databases.

Access comes easily through OpenRouter APIs at Sonoma Dusk Alpha and Sonoma Sky Alpha. They stay free for now, though pricing will likely follow the alpha period. Kilo Code promotes testing via emails with a $100 credit giveaway, setting them against their own Grok Code Fast 1 for direct comparisons.

Effective use requires solid prompting strategies, especially with such large contexts. Assign the model a role to guide its responses, such as positioning it as a specialist in a domain to match the expected expertise and tone. Specify structured outputs like JSON, Markdown tables, or XML, including schemas or templates where possible for easy parsing. [www.bridgemind.ai](https://www.bridgemind.ai/blog/prompt-engineering-best-practices/) covers these for 2025 practices.

Steer clear of bloated prompts; those massive blocks claiming superior results often add unnecessary noise without grasping how large language models process inputs. Testing shows most fail to deliver beyond basic setups, particularly in chatbots handling user messages or business files with directive language. A summarization tool might misfire on embedded commands if not careful. [blog.tobiaszwingmann.com](https://blog.tobiaszwingmann.com/p/5-principles-for-writing-effective-prompts) details five principles to avoid this.

Explicitly state exclusions to prevent unwanted elements. In a prompt about AI applications in healthcare, direct the model to ignore references to unrelated fields like gaming. Iteration proves key; adjust based on outputs, refining like adjusting a recipe through trials. [www.glukhov.org](https://www.glukhov.org/post/2024/08/writing-effective-llm-prompts/) provides examples, such as sentiment breakdowns or role-based descriptions like a tour guide explaining the Eiffel Tower’s history and visitor stats.

For complex tasks, chain multiple prompts instead of one large one. Version control and A/B testing help track improvements. [www.prompthub.us](https://www.prompthub.us/blog/10-best-practices-for-prompt-engineering-with-any-model) lists ten practices, including collaboration and clean versioning. Markdown aids parsing for lists or hierarchies. [danielmiessler.com](https://danielmiessler.com/blog/how-i-write-prompts) notes models respond well to it.

The chart highlights Sonoma’s lead in context capacity over competitors like Grok 3 or Claude 4, enabling handling of much larger datasets in one go.

Community Reactions

Responses lean positive. In an email reply, Adam Holter noted they perform solidly and quickly, serving well as free options before any costs apply. X discussions and Discord exchanges build excitement around reasoning abilities, backed by screenshots and comparisons to established models.

Approach with some reserve: Demos may select favorable examples, and benchmarks remain preliminary in this alpha stage. Origins lack full verification, despite compelling ties to xAI.

YouTube creators such as AICodeKing run practical tests on coding and reasoning, pitting Sonoma against Qwen Max and GLM. Results emphasize the speed advantage in processing.

Sites like Rival.tips conduct side-by-side evaluations, revealing strengths in dialogue and code production, alongside areas needing work.

Coding assessments on eval.16x.engineer examine stealth models linked to Grok, yielding encouraging outcomes on programming challenges.

Benchmark Comparison

Early SpeechMap scores across models.

This visualization compares initial benchmark results, where Sonoma Alpha edges out others in speech-related evaluation.

Comparisons to Other Models

Versus Qwen3 Max, Sonoma delivers quicker results, with the expanded context aiding extended operations. Qwen holds advantages in certain metrics, as outlined in my analysis of Alibaba’s lineup.

Moonshot’s Kimi K2.1 offers low cost and strong coding agent performance, yet Sonoma surpasses it in context length and multimodal features for wider applications. See that comparison for details.

DeepSeek and Kimi K2 manage substantial contexts, but Sonoma’s no-cost entry and probable Grok foundation provide current superiority. Leaderboard positions place them near top-tier models in reasoning categories.

Additional rivals include models from OpenAI’s recent expansions, like those in Codex developments, though Sonoma’s anonymity and scale set it apart.

ModelContext WindowMultimodalFree AccessSpeed Rating
Sonoma Dusk Alpha2M tokensYesAlpha phaseVery Fast
Kimi K2.11M tokensNoPaidFast
Qwen3 Max128k tokensLimitedPaidModerate
Gemini 2.5 Pro1M tokens (planned 2M)YesPaidFast
DeepSeek512k tokensNoPaidVery Fast

Expanded specs comparison including speed.

The table now includes DeepSeek and a speed column based on user reports, showing Sonoma’s balanced profile.

xAI’s Strategy Here

This release suggests xAI evaluating Grok 4.20 through anonymous channels. The Colossus data center drives substantial gains, as detailed on kodexolabs.com tracing Grok’s progression and plisio.net discussing tenfold enhancements over prior iterations.

No-cost availability collects usage data quietly, allowing refinements ahead of a branded launch. It positions xAI ahead in context handling and response times through OpenRouter’s model distribution.

The occasional Google association stems from name similarities alone; core behaviors confirm xAI origins.

This approach mirrors broader patterns in AI development, where companies test capabilities without immediate scrutiny, similar to discussions in AI’s focus on coherence and costs.

Getting Started

Start with OpenRouter access and apply structured prompts. For extensive contexts, construct inputs step by step to manage complexity. Exclude potential disruptive instructions in source materials to avoid unintended actions.

Search trends around “Sonoma Dusk Alpha vs Sonoma Sky Alpha performance comparison 2025” indicate interest in nuances between the two, though they share core traits with Dusk offering extra multimodal support.

As free stealth large language models with vast contexts, Sonoma leads current options. The Grok 4.20 connection holds based on observed consistencies.

OpenRouter’s no-cost APIs simplify setup. Discord input praises Sonoma Sky Alpha’s dialogue handling.

Such anonymous introductions mark standard practice in 2025 AI testing, avoiding premature attention.

Sonoma Dusk Alpha’s image and tool features open paths for integrated workflows, like combining visual review with automated calls.

Overall, the models offer reliable output. Their context depth and pace justify trials, particularly at no charge. Anticipate the formal unveiling; it may alter xAI’s standing significantly.

Further exploration could involve custom benchmarks tailored to specific uses, building on community efforts. Integration with tools like those in ChatGPT Plus agents might reveal synergies.

Links

They're clicky!

Follow me on X Visit Ironwood AI →

Adam Holter

Founder of Ironwood AI. Writing about AI stuff!