OpenAI launched its GPT-4.1 family of models on April 14, 2025, featuring three new API-only models: GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano. As someone who consistently evaluates AI models, I wanted to test this new set against industry leaders, notably Claude 3.7 Sonnet, which has performed strongly in recent months. Let’s explore how these models compare in specific tasks.
GPT-4.1 Model Overview
Let’s clarify some basic information regarding the GPT-4.1 series before we get into the test results:
- All models have a 1 million token input context window.
- The knowledge cutoff is June 2024.
- Text and image input is allowed.
- The coding and logic foundation has been strengthened.
- Availability is restricted to API only — not usable within ChatGPT.
- The GPT-4.5 Preview API will be retired, with the shutdown beginning July 14, 2025.
The main GPT-4.1 model posts a benchmark score of 54.6% on SWE-bench, more than doubles GPT-4o's score on Aider's diff benchmark, and reaches 38.3% on Scale's MultiChallenge. Numbers are nice to see, but I typically don't trust benchmarks. Let's see what the test results say.
Practical Testing Methodology: GPT-4.1 Against Claude 3.7 Sonnet
I gave the models a series of assignments ranging in difficulty from light creative tasks to difficult coding problems. A summary of the results is presented here:
| Task | Difficulty | Claude 3.7 Sonnet | GPT-4.1 | Winner |
|---|---|---|---|---|
| Self-Generating RPG | Very Hard | Pass (Excellent) | Pass (Good) | Claude |
| Styled FAQ Widget | Medium | Pass (Very Good) | Pass (Decent) | Claude |
| Snake w/ AI Opponent | Hard | OK | Fail | Claude |
| SaaS Landing Page | Medium | Pass (Outstanding) | Pass (Great) | Claude |
| City SVG | Hard | Pass (Amazing) | Pass (Good) | Claude |
| 3D Mario Voxel Art | Medium | OK | Fail | Claude |
| Cool Animation | Easy | Pass | Pass | Tie |
| Brand Voice Writing | Medium | Pass (Good) | Fail | Claude |
| Humor & Creativity | Easy | Pass (Good) | Pass (Good) | Tie |
Main Points:
1. Self-Generating RPG Test
This was one of the hardest tests. The models had to create an HTML/JavaScript-based text adventure game. Claude 3.7 Sonnet delivered an exceptional finished game, whereas GPT-4.1 provided a functional but imperfect version. Claude's version excelled in sophisticated game mechanics and intriguing plotlines, with fewer flaws.
2. SaaS Landing Page Creation
While both models performed admirably, Claude 3.7 Sonnet's output was noticeably more polished and professional-looking. GPT-4.1 delivered a solid landing page, but it lacked some of the design refinements and conversion-focused elements present in Claude's work.
3. Coding Tasks
The “Snake with AI Opponent” task exposed weaknesses in GPT-4.1's coding proficiency. While Claude 3.7 Sonnet developed a functional A* pathfinding algorithm for the AI opponent, GPT-4.1 could not produce a playable game, and multiple attempts to fix the problems never yielded a working result.
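For readers unfamiliar with the technique, the A* approach Claude used can be sketched roughly like this — a minimal grid-based A* with 4-directional movement and a Manhattan-distance heuristic. The names and structure here are my own illustration, not either model's actual output:

```python
import heapq

def a_star(grid, start, goal):
    """Find a shortest path on a 2D grid (0 = free, 1 = wall) using A*."""
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Manhattan-distance heuristic: admissible for 4-directional movement.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    # Each heap entry: (f = g + h, g, cell, path-so-far)
    open_heap = [(h(start), 0, start, [start])]
    best_g = {start: 0}
    while open_heap:
        _, g, cell, path = heapq.heappop(open_heap)
        if cell == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = cell[0] + dr, cell[1] + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best_g.get((nr, nc), float("inf")):
                    best_g[(nr, nc)] = ng
                    heapq.heappush(
                        open_heap,
                        (ng + h((nr, nc)), ng, (nr, nc), path + [(nr, nc)]),
                    )
    return None  # no path exists

grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
print(a_star(grid, (0, 0), (2, 0)))  # path routed around the wall row
```

In a snake game, the grid would be rebuilt each tick with the snake bodies marked as walls and the food cell as the goal.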
In the 3D Mario voxel art challenge, Claude corrected an initial issue after a simple fix request, whereas GPT-4.1 kept struggling even after several correction attempts.
4. Visual and Creative Assignments
Both models handled the City SVG assignment effectively, but Claude's output stood out for its level of detail and aesthetic appeal. Both also performed admirably on less complex visual jobs, such as creating animations.
5. Content and Brand Integration
The brand voice writing test was quite telling. Claude 3.7 Sonnet produced a great piece that correctly expressed the brand voice and context. However, GPT-4.1’s content was too succinct, did not reflect the brand, and lacked contextual integration.
How Does GPT-4.1 Compare to GPT-4.5?
GPT-4.1 beat GPT-4.5 on multiple tasks, which I found significant: version numbers don't always track capability. Upon its debut, GPT-4.5 was widely regarded as a disappointment, and OpenAI seems to have improved on it dramatically with 4.1.
We’ve observed similar patterns across other product releases where results are obtained from focused improvements rather than global incremental upgrades. See OpenAI’s model lineup if you want to know more.
GPT-4.1 vs. GPT-4.1 Mini vs. GPT-4.1 Nano
My testing primarily pitted Claude against the regular GPT-4.1 model, but it's important to remember that the GPT-4.1 lineup includes other models with their own strengths and pricing:
GPT-4.1 mini is a bargain, balancing performance against low cost ($0.40/$1.60 per million tokens): it cuts costs by around 83% compared to GPT-4o while roughly doubling speed. Though aimed at everyday tasks, it still posts solid scores, such as 80.1% on MMLU and 50.3% on GPQA.
Lastly, GPT-4.1 nano is priced at only $0.10/$0.40 per million tokens and targets low-latency duties such as classification and autocompletion. On instruction following, graphwalks, and academic benchmarks, it matches GPT-4.1 on some measures but falls short on others — presumably a reflection of its smaller scale.
GitHub Copilot now offers GPT-4.1 for AI-assisted coding, which should improve its suggestions and completions.
Developer Implications
These are the most important takeaways from my tests:
1. Claude 3.7 Sonnet Still Leads in Key Areas
Claude 3.7 Sonnet remains the superior choice for:
- Difficult coding tasks, especially complete app creation.
- SVG and other forms of graphic design.
- In-depth content projects that depend on brand voice and context.
2. GPT-4.1 Has Advantages
GPT-4.1 performs well in areas like:
- Creating content with humor.
- Generating animations.
- Producing SaaS landing pages, even if the work is unrefined.
3. Consider the Pricing
The pricing tells its own story:
- GPT-4.1: $2.00 input, $8.00 output per million tokens
- GPT-4.1 mini: $0.40 input, $1.60 output per million tokens
- GPT-4.1 nano: $0.10 input, $0.40 output per million tokens
- For comparison, Gemini 2.0 Flash: $0.10 input, $0.40 output per million tokens
With a 75% discount on cached input tokens, the cost of repeated prompt prefixes drops considerably.
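To make the trade-offs concrete, here is a small back-of-the-envelope calculator built from the list prices above. The 75% caching discount matches the figure just mentioned; the token counts in the example are made-up values, and the rounding and structure are my own:

```python
# Per-million-token list prices (input, output) in USD, from the list above.
PRICES = {
    "gpt-4.1":      (2.00, 8.00),
    "gpt-4.1-mini": (0.40, 1.60),
    "gpt-4.1-nano": (0.10, 0.40),
}

CACHE_DISCOUNT = 0.75  # cached input tokens cost 75% less than fresh ones

def request_cost(model, input_tokens, output_tokens, cached_tokens=0):
    """Estimate the USD cost of one API call, with optional cached input tokens."""
    in_price, out_price = PRICES[model]
    fresh = input_tokens - cached_tokens
    cost = (
        fresh * in_price / 1_000_000
        + cached_tokens * in_price * (1 - CACHE_DISCOUNT) / 1_000_000
        + output_tokens * out_price / 1_000_000
    )
    return round(cost, 6)

# Example: a 100k-token prompt, a 2k-token answer, 90k tokens served from cache.
print(request_cost("gpt-4.1", 100_000, 2_000, cached_tokens=90_000))
print(request_cost("gpt-4.1-nano", 100_000, 2_000, cached_tokens=90_000))
```

Running the same 100k-token cached prompt through nano instead of the full model cuts the per-call cost by roughly 20x, which is why routing high-volume, low-stakes traffic to the smaller models matters.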
The Effect of Long Context
With an input context of roughly a million tokens, the new models can handle significantly more information than their predecessors. In my testing, the GPT-4.1 models handled large contexts acceptably, though Claude 3.7 Sonnet demonstrated noticeably better context awareness on the most difficult tasks. For prompts with considerable context, I noticed improvements from using clear start-and-end instructions paired with XML tags.
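As an illustration of that prompt pattern — instructions repeated at the start and end, with the long context fenced in XML-style tags — here is the kind of template I mean. The tag names and wording are my own convention, not an official format from either vendor:

```python
def build_long_context_prompt(instructions, documents):
    """Wrap long context in XML-style tags, repeating instructions at both ends."""
    docs = "\n".join(
        f'<document id="{i}">\n{text}\n</document>'
        for i, text in enumerate(documents, start=1)
    )
    return (
        f"<instructions>\n{instructions}\n</instructions>\n"
        f"<context>\n{docs}\n</context>\n"
        f"Remember the instructions above:\n"
        f"<instructions>\n{instructions}\n</instructions>"
    )

prompt = build_long_context_prompt(
    "Summarize each document in one sentence.",
    ["First report text...", "Second report text..."],
)
print(prompt)
```

The repeated instruction block at the end matters most in my experience: with hundreds of thousands of tokens in between, models are far more likely to follow directions they have just re-read.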
These larger-context abilities also make the following more efficient:
- Analysis and summarization of much longer documents.
- Reading, understanding, and interacting with code repositories.
- Multi-step thought processes.
In Summary: The State of AI in 2025
While impressive, GPT-4.1's release isn't the massive leap forward some had hoped for: Claude 3.7 Sonnet beat it on nearly every task I tested. For businesses, that means model selection requires more thought than a headline release suggests. GPT-4.1 does bring an impressive context window and real improvements over OpenAI's previous models, but there are plenty of tasks where Claude still produces better results.
Perhaps the most compelling part of the GPT-4.1 family is the mini and nano models, which give businesses strong options on both price and capability.
With models increasingly specialized, the market is far from settled: different models beat one another depending on the specific task at hand. That competition drives continued innovation, giving consumers even more choices.
For now, I recommend Claude 3.7 Sonnet for anything coding- or creativity-focused, and GPT-4.1 mini and nano for tasks where you can save money and time without a meaningful drop in quality.

