GPT-4.1 has officially landed in the OpenAI API, introducing a trio of models—GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano—that outperform their predecessors in nearly every dimension. These models are designed for developers who need better coding skills, stronger instruction following, and massive long-context comprehension, all while reducing latency and cost. The flagship model now supports up to 1 million tokens of context and features a fresh knowledge cutoff of June 2024.
What’s new with GPT-4.1?

The GPT-4.1 family is a direct upgrade over GPT-4o and GPT-4.5, offering improved performance across benchmarks while optimizing for real-world developer use. GPT-4.1 scores 54.6% on SWE-bench Verified, making it one of the top models for coding tasks. On Scale’s MultiChallenge benchmark, it sees a 10.5% absolute improvement over GPT-4o in instruction following. For long context tasks, it sets a new state-of-the-art score of 72% on the Video-MME benchmark.
The models are also optimized across the latency curve. GPT-4.1 mini delivers nearly the same performance as GPT-4o while cutting latency in half and reducing cost by 83%. GPT-4.1 nano is OpenAI’s fastest and most affordable model yet, built for classification and autocomplete tasks while still supporting 1 million token context windows.
Coding capabilities take a leap

From generating cleaner frontend interfaces to following diff formats more reliably, GPT-4.1 proves itself as a highly capable coding assistant. On the SWE-bench Verified benchmark, it completes over half of the tasks correctly—up from 33.2% with GPT-4o. It also outperforms GPT-4o and even GPT-4.5 on Aider’s polyglot diff benchmark, offering developers precise edits across multiple programming languages without rewriting entire files. For file-level rewrites, output token limits have been expanded to 32,768 tokens.
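The diff-style editing workflow can be illustrated with a toy applier. This is a minimal sketch of a hypothetical search/replace edit format, similar in spirit to diff-based editing harnesses, not OpenAI's or Aider's actual implementation:

```python
def apply_edit(source: str, search: str, replace: str) -> str:
    """Apply a single search/replace edit to a file's contents.

    Raises ValueError when the search block is missing or ambiguous,
    which is how a harness can detect a malformed model-generated edit.
    """
    count = source.count(search)
    if count == 0:
        raise ValueError("search block not found in source")
    if count > 1:
        raise ValueError("search block is ambiguous (multiple matches)")
    return source.replace(search, replace, 1)

# Edit one function without rewriting the whole file.
original = "def greet(name):\n    print('Hello ' + name)\n"
edited = apply_edit(
    original,
    "    print('Hello ' + name)",
    "    print(f'Hello {name}')",
)
```

The appeal of this style is exactly what the benchmark measures: the model emits only the changed span, so large files never need to be regenerated in full.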
In internal comparisons, websites generated by GPT-4.1 were preferred over GPT-4o’s outputs 80% of the time. Extraneous code edits dropped from 9% to just 2%, reflecting better context understanding and tool usage.
Early adopters highlight real-world wins

Windsurf reported a 60% improvement in internal benchmarks, while Qodo found GPT-4.1 provided better suggestions in 55% of GitHub pull requests. These improvements translate directly into better code review accuracy, fewer unnecessary suggestions, and faster iteration cycles for teams.
Sharper instruction following across scenarios

GPT-4.1 follows instructions significantly more reliably. It scores 87.4% on IFEval and 38% on the MultiChallenge benchmark, reflecting gains in handling complex formats, rejecting forbidden instructions, and sorting or ranking outputs. OpenAI’s own evaluation showed that GPT-4.1 is more precise on hard prompts and better at multi-turn instruction tracking, an essential feature for building reliable conversational systems.
Blue J and Hex both tested GPT-4.1 against domain-specific tasks. Blue J saw a 53% accuracy improvement in complex tax scenarios, while Hex reported nearly double the performance in SQL tasks, reducing debugging overhead and improving production-readiness.
1 million token context window sets a new bar

All three models in the GPT-4.1 family now support up to 1 million tokens of context—enough to hold more than 8 copies of the entire React codebase. This enables powerful new use cases in legal document analysis, financial research, and long-form software workflows. In OpenAI’s “needle in a haystack” test, GPT-4.1 reliably retrieved relevant content regardless of where it appeared in the input.
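In practice, a long-context pipeline needs to check whether a batch of documents fits in the window before sending a request. The sketch below uses a rough 4-characters-per-token heuristic, which is an assumption for English text; exact counts require the model's actual tokenizer:

```python
CONTEXT_WINDOW = 1_000_000  # GPT-4.1 family context limit, in tokens
CHARS_PER_TOKEN = 4         # rough heuristic for English text (assumption)

def estimate_tokens(text: str) -> int:
    """Rough token estimate; use the model's real tokenizer for exact counts."""
    return len(text) // CHARS_PER_TOKEN

def fits_in_context(documents: list[str], reserved_for_output: int = 32_768) -> bool:
    """Check that the concatenated documents leave room for the response.

    reserved_for_output defaults to the 32,768-token output limit
    mentioned above for file-level rewrites.
    """
    total = sum(estimate_tokens(d) for d in documents)
    return total + reserved_for_output <= CONTEXT_WINDOW

# One hundred 30,000-character contracts ≈ 750k tokens, so they fit.
print(fits_in_context(["x" * 30_000] * 100))
```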
The OpenAI-MRCR benchmark further confirmed this by testing the model’s ability to distinguish between near-identical prompts scattered across a massive context window. On the Graphwalks benchmark, which involves reasoning across nodes in a synthetic graph, GPT-4.1 scored 62%, significantly ahead of GPT-4o’s 42%.
Thomson Reuters reported a 17% boost in legal document review accuracy using GPT-4.1 in its CoCounsel system, while Carlyle saw a 50% improvement in extracting granular financial data from complex files.
Faster inference and better image understanding

OpenAI has reduced time to first token using improvements in its inference stack. GPT-4.1 nano returns its first token in under five seconds on 128K-token prompts. For multimodal tasks, GPT-4.1 mini shows stronger image comprehension than GPT-4o across benchmarks like MMMU and MathVista.
On visual benchmarks like CharXiv-Reasoning and Video-MME, GPT-4.1 consistently leads, scoring 72% on the latter without subtitles. This makes it a top choice for video understanding and scientific chart interpretation.
Price cuts and transition plans

All three GPT-4.1 models are now available in the API, with a significant price drop. GPT-4.1 is 26% cheaper for median queries compared to GPT-4o. Prompt caching discounts have increased to 75%, and there are no extra charges for long-context inputs. The GPT-4.5 preview will be deprecated by July 14, 2025, in favor of the more efficient GPT-4.1 family.
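For teams on the GPT-4.5 preview, migration can be as simple as swapping the model field in existing requests. The sketch below builds a Chat Completions-style payload without making a network call; the model IDs "gpt-4.5-preview" and "gpt-4.1" are assumptions about the API's naming, so verify them against the official model list:

```python
def migrate_request(payload: dict) -> dict:
    """Return a copy of a request payload pointed at GPT-4.1."""
    migrated = dict(payload)
    if migrated.get("model") == "gpt-4.5-preview":  # assumed model ID
        migrated["model"] = "gpt-4.1"               # assumed model ID
    return migrated

old_request = {
    "model": "gpt-4.5-preview",
    "messages": [{"role": "user", "content": "Summarize this contract."}],
}
new_request = migrate_request(old_request)
```

Keeping the migration as a pure function on the payload makes it easy to roll out behind a flag and compare outputs from both models side by side before the July 14, 2025 cutoff.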
Pricing per 1M tokens for GPT-4.1 is set at $2 for input, $0.50 for cached input, and $8 for output. GPT-4.1 nano drops those to $0.10, $0.025, and $0.40 respectively—making it the most affordable option to date.
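Those rates translate into per-request cost as follows. This is a minimal sketch using only the per-1M-token prices quoted above; the model-ID dictionary keys are illustrative labels:

```python
# Per-1M-token prices quoted above, in USD.
PRICES = {
    "gpt-4.1":      {"input": 2.00, "cached_input": 0.50,  "output": 8.00},
    "gpt-4.1-nano": {"input": 0.10, "cached_input": 0.025, "output": 0.40},
}

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Cost in USD for one request.

    cached_tokens is the share of input tokens served from the prompt
    cache at the discounted rate; the rest bill at the full input price.
    """
    p = PRICES[model]
    uncached = input_tokens - cached_tokens
    return (uncached * p["input"]
            + cached_tokens * p["cached_input"]
            + output_tokens * p["output"]) / 1_000_000

# 100k input tokens (half of them cached) plus 2k output tokens on GPT-4.1:
cost = request_cost("gpt-4.1", 100_000, 2_000, cached_tokens=50_000)
```

Running the same workload through the nano prices makes the 20x input-cost gap between the two models concrete.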