Grok 4 is out, and I had high hopes based on the rubric I adapted from Angel Bogado’s original. I figured it would hit around 50 points, but it landed at 34. The big miss? No native image generation at all. That wipes out a whole section, costing it 21 points straight up. Let’s break this down category by category, with the actual scores, the evidence, and why it fell short of what I thought it would deliver.
The Rubric Recap and My Expectations
Back in my previous post, I laid out this rubric for Grok 4 and GPT-5, adapting Angel Bogado’s setup. I expected Grok 4 to nail native image generation for about 21 points, score strongly in voice mode for another 30, and ship a big context window, among other things. My total guess: 50. But xAI skipped the image part entirely, and other areas didn’t fully deliver. Here’s how it actually shook out. If you’re curious about the full rubric setup, check my earlier take here.
| Category | Expected Score | Actual Score | Key Notes |
|---|---|---|---|
| Native Image Generation | 21 | 0 | Completely absent; only analyzes images, no generation. |
| Advanced Voice Mode | 30 | 17 | Got points for expressive voices and singing, but no background thinking or unlimited duration. |
| Context Window | 6-8 | 0 | 256k tokens, below 1M threshold. |
| Other Inputs | 20 | 0 | No audio or video uploads exposed. |
| Knowledge Cut-off | 5 | 5 | Trained to 2025, meets the tier. |
| Pricing | 3-5 | 0 | More expensive than GPT-4o, with $300/month tier. |
| Misc (Modern UI when coding) | 10 | 10 | Strong on structured code output. |
| Extra Credit | 4-10 | 2 | Multi-agent mode as a good feature. |
Grok 4’s rubric scores: Expected vs. Actual, highlighting the gaps.
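For transparency, here’s how the actual column tallies up: a minimal sketch that just sums the scores from the table above (note that the 50-point figure was my prediction for the total, not the rubric’s maximum).

```python
# Sum the "Actual Score" column from the rubric table above.
# Category names and point values come straight from this post.
actual_scores = {
    "Native Image Generation": 0,
    "Advanced Voice Mode": 17,
    "Context Window": 0,
    "Other Inputs": 0,
    "Knowledge Cut-off": 5,
    "Pricing": 0,
    "Misc (Modern UI when coding)": 10,
    "Extra Credit": 2,
}

total = sum(actual_scores.values())
predicted = 50  # my pre-launch guess for the total

print(total)              # 34
print(predicted - total)  # 16
```

A 16-point shortfall, with 21 of the lost points coming from a single empty category.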
Deep Dive: Native Image Generation, the Big Zero
This category was supposed to be a slam dunk for Grok 4. The rubric gives points for near-perfect consistency (8), no yellow tint (5), recognizable faces from new angles (3), any aspect ratio (3), and sharp small text (2), for a total of 21. But Grok 4 gets zilch because there’s no text-to-image generation at all. xAI’s launch materials confirm it handles analysis only: you can upload an image and ask about it, but creating one from scratch? Not happening.
Why does this matter? Image generation is table stakes now for top models. GPT-4o does it natively, Claude has partnerships, and even open-source options like Stable Diffusion are everywhere. Without it, Grok 4 feels incomplete for creative tasks. I expected xAI to push boundaries here, especially with Elon Musk’s hype, but they didn’t. That’s 21 points gone, dropping my 50 expectation hard.
Visualizing Grok 4’s image generation gap: A big red zero.
In practice, if you’re building content or prototypes that need visuals, Grok 4 forces you to switch tools, and that’s a workflow killer. I tested it myself for content generation: the text output is good, but the lack of images means extra steps elsewhere. For a copywriter producing blog posts or marketing materials, a separate tool is still required for every visual element, adding friction to the whole workflow.
Advanced Voice Mode: Partial Win at 17 Points
Voice was another area I thought would shine. The rubric has multi-speaker ID (3), smarter vibes (5), background thinking (5), no duration limit (3), answers when addressed (4), assistant-initiated convo (3), screen-share (2), and singing/SFX (5). I expected near-full marks, but Grok 4 got 17.
Strong points: new expressive voices like Eve, who sang an impromptu opera about Diet Coke in demos. That’s fun and shows real prosody and SFX capability. It also answers only when addressed in chat, avoiding babble. But there’s no background thinking: that inner-voice layer isn’t there. Duration limits are implied, there are no assistant-initiated conversations, and there’s no screen-share. It’s decent, but not the full package I anticipated.
From my tests, the voice feels more natural than older models, but without unlimited sessions or proactive engagement, it’s limited for long interactions. Compared to what I expect from GPT-5, this is where Grok 4 could have pulled ahead but didn’t. The lack of screen-sharing also means it can’t assist with visual tasks on your screen, a capability other leading models already offer, which limits its utility for real-time problem-solving or interactive tutorials. The singing demo was a ‘wow’ moment, but it doesn’t make up for the missing core functionality that would make voice truly transformative for productivity.
Context Window and Other Inputs: Zeros That Sting
I pegged context at 1M tokens for 6-8 points, but Grok 4’s 256k gets zero. That’s solid for many tasks, yet below the bar for massive projects, and there’s no plus-tier boost either. For comparison, some top models push well beyond 1M tokens, enough to process entire books or extensive codebases in one pass. Grok 4’s 256k, while respectable, means it will struggle with truly long-form content or multi-document analysis, which limits its use for researchers, legal professionals, or developers working on large projects.
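To make the 256k limit concrete, here’s a rough fit check. The function and the ~4-characters-per-token heuristic are my own illustration, not xAI’s tokenizer, so treat the numbers as ballpark only.

```python
# Estimate whether a document fits in Grok 4's 256k-token context window.
# Uses the rough ~4-characters-per-token heuristic, which only approximates
# a real tokenizer but is fine for back-of-the-envelope planning.
def fits_context(text: str, window_tokens: int = 256_000) -> bool:
    estimated_tokens = len(text) / 4
    return estimated_tokens <= window_tokens

# A ~300-page book at ~2,000 characters per page is ~600k characters,
# i.e. roughly 150k tokens: it fits, but leaves limited room for output.
book = "x" * 600_000
print(fits_context(book))      # True

# A large codebase dump at ~2M characters (~500k tokens) does not.
codebase = "x" * 2_000_000
print(fits_context(codebase))  # False
```

So a single book fits; a serious multi-document or whole-repo workload does not, which is exactly where a 1M+ window would have earned the points.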
Audio and video uploads? I expected 20 points; it got zero. They aren’t exposed yet, so multimedia workflows are out. That makes Grok 4 less versatile than competitors like Claude or GPT-4o, which handle these inputs seamlessly. Imagine trying to debug a video issue or transcribe an audio meeting without direct input capabilities: it’s a significant hurdle. Grok 4 can’t function as a true multimodal assistant for users who rely on visual or auditory data, and the inability to process video is a particular drawback given how prevalent video content has become in both personal and professional contexts.
Knowledge Cut-off and Pricing: Mixed Bag
Knowledge to 2025 earns the full 5 points; that’s current enough for most uses. It’s a genuine positive: users can rely on it for recent events, trends, and data, unlike models with older cut-offs.
But pricing is a bust. At $3/M input and $15/M output, plus $300/month for SuperGrok Heavy, it’s pricier than GPT-4o. I expected at least parity for 3 points, but it gets zero here. In my tests, it also used far more tokens, up to 5x Claude Sonnet for similar tasks. That high token usage, combined with the per-token rates, makes Grok 4 an expensive option for heavy users or budget-conscious businesses, and the $300/month tier is a significant commitment when more cost-effective alternatives offer comparable or better performance in several areas.
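A quick back-of-the-envelope on what that means per task. The token counts below are hypothetical, and I’m assuming, purely for illustration, equal per-token rates between the two models, so the 5x token-usage gap I observed translates directly into a 5x cost gap.

```python
# Cost of one task at the per-token prices quoted above:
# $3 per million input tokens, $15 per million output tokens.
def task_cost(input_tokens: int, output_tokens: int,
              in_price: float = 3.0, out_price: float = 15.0) -> float:
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Hypothetical task where Grok 4 burns 5x the tokens of Claude Sonnet
# at the same assumed rates.
grok_cost = task_cost(50_000, 10_000)   # $0.30
claude_cost = task_cost(10_000, 2_000)  # $0.06

print(round(grok_cost / claude_cost, 1))  # 5.0
```

Thirty cents versus six cents per task is trivial once, but across thousands of daily runs it becomes the deciding factor.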
Speed at 75 tokens/s is okay: faster than Claude Opus but slower than GPT-4o. That’s acceptable for most conversational tasks, though power users will notice the difference, and middling speed at a premium price makes the value proposition less clear for certain applications.
Misc and Extra Credit: Bright Spots
Full 10 for modern UI in coding: it outputs structured, front-end-ready code nicely, and devs on X praise it. This is a real strength. Grok 4 consistently delivers clean, well-formatted code that’s close to deployment-ready, streamlining the development workflow. The focus on developer experience is a smart move, since code generation is a high-value application for AI.
Extra 2 for multi-agent mode, which scored 44.4% on Humanity’s Last Exam. That’s a good feature, but there were no ‘wow’ or truly innovative ones worth more points. The multi-agent ‘study group’ approach is interesting and shows promise for complex problem-solving, and 44.4% on Humanity’s Last Exam is a strong result for advanced reasoning. For ‘wow’ points, though, I’m looking for something genuinely paradigm-shifting, not just an improvement on existing capabilities.
My Hands-On Test: Content Generation Strengths and Flaws
I ran Grok 4 through my content-generator setup. It held my brand voice well, cutting the fluff that plagues other models. For my style, it was spot-on: direct. That ability to adapt to a specific brand voice and produce concise copy without unnecessary jargon means less editing and more consistent output, something many generic models struggle with.
But it hallucinated once, claiming that AI only does recombinant creativity, a position I don’t buy and never stated. Could be a fluke, but it’s worth noting: even otherwise high-quality output can carry confidently asserted, unverified claims. Human oversight remains critical for factual accuracy and alignment with your actual views. As I’ve said before, AI tools are only as good as the framework and expertise guiding them.
Token usage is high: tasks cost 5x more than Claude Sonnet with thinking enabled. For heavy users, that’s a real hit. If you’re running frequent or large-scale content-generation tasks, the costs add up quickly, and economic viability often trumps raw performance in practice.
Still, for coding UI, it’s strong. Outputs are clean, and the multi-agent ‘study group’ helps on tough benchmarks. Despite the cost and occasional hallucination, its coding capabilities and performance on complex reasoning tasks demonstrate significant underlying power.
Why the Lower Score Matters for AI’s Future
This 34 vs. 50 gap shows Grok 4 is capable but incomplete. Missing image generation limits creativity, high costs hurt accessibility, and partial voice misses the full interaction potential. It does beat benchmarks in areas like multi-agent reasoning, and for devs the coding UI is a win. But compared to my GPT-5 expectations (around 70), Grok 4 needs more to compete: it’s a solid contender in certain niches, not a general-purpose powerhouse. The market increasingly demands multimodal capability, and the absence of native image generation is a glaring omission that keeps it out of the top tier.
If xAI adds image gen soon, scores jump. But right now, it’s a reminder: hype doesn’t always match delivery. As I said in my post on AI evaluation challenges, rubrics like this cut through to real capabilities. It’s not enough to just release a model; it needs to deliver on a broad range of features that users expect from a top-tier AI.
Overall, Grok 4 is decent: better than expected in voice expressiveness and coding, worse in multimedia and cost. I’ll keep testing, but for now it’s not replacing my go-tos. If you’re into AI battles, watch for updates; this space moves quickly. Grok 4 has real potential, but it needs to close these gaps to claim a top spot, because the future of AI is about comprehensive capabilities, not isolated strengths. It’s a good model, just not the game changer some hoped for, at least not yet. I’m curious to see how xAI addresses these limitations in future iterations; for now, the score speaks for itself.