
Introducing GPT-4.1 for the API

April 14, 2025


A new generation of GPT models offering major enhancements in coding, instruction-following, and long-context capabilities—plus our first-ever nano model.



Overview

We’re excited to introduce three new models via the API: GPT‑4.1, GPT‑4.1 mini, and GPT‑4.1 nano. These additions outpace GPT‑4o and GPT‑4o mini on all fronts, showing particularly strong advances in coding and instruction-following skills. They also offer significantly larger context windows—up to 1 million tokens—and are better at leveraging that context, thanks to improved long-context comprehension. Additionally, they feature an updated knowledge cutoff of June 2024.

Highlights

  • Coding: GPT‑4.1 achieves 54.6% on SWE-bench Verified, marking a 21.4% absolute improvement over GPT‑4o and 26.6% absolute improvement over GPT‑4.5. This makes GPT‑4.1 a top-performing coding model.
  • Instruction following: On Scale’s MultiChallenge (a standard instruction-following benchmark), GPT‑4.1 scores 38.3%, a 10.5-percentage-point gain over GPT‑4o.
  • Long context: On Video-MME (testing multimodal long-context understanding), GPT‑4.1 sets a new benchmark with a 72.0% score on long, subtitle-free video tasks—improving upon GPT‑4o by 6.7% absolute.

Although benchmarks illustrate substantial gains, the driving goal behind GPT‑4.1 was to excel in real-world applications. We collaborated closely with the developer community to fine-tune these models for practical use cases that matter most to you.

In short, the GPT‑4.1 family offers outstanding performance at lower cost—continuing to push the boundaries of capability and efficiency.

The GPT-4.1 Family

Performance vs. Latency

The GPT‑4.1 lineup delivers strong results across different latency tiers:

  • GPT‑4.1: Exceptional performance at a cost that is 26% lower than GPT‑4o for typical queries.
  • GPT‑4.1 mini: Strikes a balance between high-level capability and reduced latency; it often surpasses GPT‑4o on benchmarks, while cutting response times by nearly half.
  • GPT‑4.1 nano: The fastest and most cost-effective model in the series, offering a 1 million token context window and surprisingly strong accuracy in various benchmarks—ideal for tasks that require minimal delay, such as classification or autocomplete.

These improvements make GPT‑4.1 models notably more robust for powering “agentic” systems that can handle tasks independently. Combined with primitives like the Responses API, developers can build agents that more effectively tackle software engineering, large-document analysis, customer support, and other complex operations.
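
To make that concrete, here is a minimal sketch of calling GPT‑4.1 through the Responses API from Python; the prompt and file name are illustrative placeholders, not a prescribed pattern:

```python
# Minimal sketch: send a request to GPT-4.1 via the Responses API.
# Assumes the openai Python SDK is installed and OPENAI_API_KEY is set;
# the prompt and file name are illustrative.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4.1",
    input="Summarize the risky changes in this diff:\n" + open("change.diff").read(),
)
print(response.output_text)
```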

Note: GPT‑4.1 is only available via the API. In ChatGPT, many instruction-following, coding, and intelligence upgrades have gradually been integrated into the latest GPT‑4o release. We will continue to carry over more GPT‑4.1 innovations in upcoming ChatGPT updates.

Deprecation of GPT‑4.5 Preview

We will begin deprecating GPT‑4.5 Preview in the API. GPT‑4.5 Preview will be shut down on July 14, 2025, giving developers time to transition. GPT‑4.1 provides similar or better performance, plus reduced latency and cost. GPT‑4.5 was initially a research preview for large, compute-intensive models, and we’re carrying forward its strengths—creativity, writing quality, humor, nuance—into future models.

Benchmark Performance

Below, we break down GPT‑4.1’s improvements on standard benchmarks, along with real-world use cases from early adopters like Windsurf, Qodo, Hex, Blue J, Thomson Reuters, and Carlyle. These examples highlight how GPT‑4.1 performs on tasks ranging from coding to analyzing lengthy legal documents.


Coding

GPT‑4.1 exhibits substantial gains in:

  • Agentic coding tasks: Higher success on open-ended coding challenges.
  • Frontend development: Generating more functional and visually appealing web apps.
  • Adherence to diff formats: Creating more reliable output in various patch or diff formats.
  • Reduced extraneous edits: Making fewer unwarranted changes to code.

SWE-bench Verified

On SWE-bench Verified (real-world software engineering tasks), GPT‑4.1 successfully completes 54.6% of tasks, well above GPT‑4o’s 33.2%. This reflects its better grasp of navigating codebases, producing functional code, and passing tests.

Model | SWE-bench Verified accuracy
GPT-4.1 | 54.6%
GPT-4o (2024-11-20) | 33.2%
GPT-4.5 | 38.0%
GPT-4.1 mini | 23.6%
GPT-4o mini | 8.7%
OpenAI o1 (high) | 41.0%
OpenAI o3-mini (high) | 49.3%

We excluded 23 out of 500 tasks that couldn’t be run under our infrastructure. Assigning 0% to these tasks drops GPT‑4.1’s accuracy slightly, from 54.6% to about 52%.
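
As a quick check of that adjustment, treating the 23 excluded tasks as failures and re-averaging over all 500:

```python
# Treat the 23 excluded tasks as failures and recompute accuracy over all 500.
solved = 0.546 * (500 - 23)   # ~260.4 tasks solved among the 477 that ran
adjusted = solved / 500       # ~0.521
print(f"{adjusted:.1%}")      # -> 52.1%
```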

Aider’s Polyglot Diff Benchmark

GPT‑4.1 also excels at code diffs. On Aider’s polyglot diff benchmark—requiring a model to produce accurate edits in different programming languages—GPT‑4.1 more than doubles GPT‑4o’s score and even outperforms GPT‑4.5 by an 8% margin. This benchmark highlights both coding skill and strict adherence to patch formats.

Model | Whole-file format | Diff format
GPT-4.1 | 51.6% | 52.9%
GPT-4o (2024-11-20) | 30.7% | 18.2%
GPT-4.5 | — | 44.9%
GPT-4.1 mini | 34.7% | 31.6%
GPT-4.1 nano | 9.8% | 6.2%
GPT-4o mini | 3.6% | 2.7%
OpenAI o1 (high) | 64.6% | 61.7%
OpenAI o3-mini (high) | 66.7% | 60.4%

By outputting only the changed lines (rather than the full file), developers can save time and cost. For those who prefer rewriting entire files, GPT‑4.1’s output token limit is 32,768 tokens, double GPT‑4o’s 16,384.
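
As an illustration of the diff-style workflow, here is a hedged sketch; the prompt wording and file name are made up, and max_output_tokens simply reflects the 32,768-token limit noted above:

```python
# Sketch: ask for a unified diff instead of a full-file rewrite.
# Prompt wording and file name are illustrative; max_output_tokens reflects
# the 32,768-token output limit mentioned above.
from openai import OpenAI

client = OpenAI()

source = open("app/models/order.py").read()
response = client.responses.create(
    model="gpt-4.1",
    input=(
        "Here is a Python file:\n\n" + source +
        "\n\nAdd input validation to Order.total(). "
        "Reply ONLY with a unified diff against this file."
    ),
    max_output_tokens=32_768,
)
print(response.output_text)  # a patch, not the whole file
```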

Frontend Coding

GPT‑4.1 significantly outperforms GPT‑4o in creating user interfaces and web apps. In controlled tests where paid human evaluators assessed head-to-head website generation, GPT‑4.1’s results were judged superior 80% of the time.

Example Prompt:
“Make a flashcard web application…”
(Details omitted for brevity.)

GPT-4o output | GPT-4.1 output
Basic UI and partial features | More polished UI, better animations, better code structure

Reduced Extraneous Edits

Internal tests showed that GPT‑4.1 commits unnecessary edits only 2% of the time, down from 9% with GPT‑4o. This leads to more focused and efficient coding sessions.

Real-World Coding Examples

  • Windsurf: GPT‑4.1 outperforms GPT‑4o by 60% on their coding benchmark, correlating strongly with code changes accepted in initial reviews. They also found GPT‑4.1 used tools 30% more efficiently, made fewer repetitive edits, and delivered faster iteration cycles.
  • Qodo: In an internal head-to-head with 200 real GitHub pull requests, GPT‑4.1 beat leading models 55% of the time on generating high-quality code reviews. It excelled in both precision (avoiding unnecessary recommendations) and comprehensiveness (identifying real concerns).

Instruction Following

GPT‑4.1 is markedly more reliable at following instructions, reflected in multiple benchmarks and internal evaluations.

Internal Instruction-Following Eval

We developed an internal test covering:

  • Format compliance (XML, Markdown, etc.)
  • Negative instructions (avoiding certain language or topics)
  • Ordered instructions (following steps in a specified sequence)
  • Content requirements (always including specific details)
  • Ranking (sorting outputs in a particular order)
  • Overconfidence checks (admitting if info is unavailable)
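
To make a couple of these categories concrete, here is a minimal, hypothetical checker for format compliance and negative instructions; it illustrates the idea and is not the eval we actually ran:

```python
# Hypothetical checker for two of the categories above: format compliance
# (the reply must be well-formed XML) and negative instructions (certain
# words must never appear). An illustration only, not the internal eval.
import xml.etree.ElementTree as ET

def check_response(text: str, banned_words: list[str]) -> dict:
    try:
        ET.fromstring(text)          # format compliance: parses as XML
        format_ok = True
    except ET.ParseError:
        format_ok = False
    lowered = text.lower()
    negative_ok = not any(word.lower() in lowered for word in banned_words)
    return {"format_ok": format_ok, "negative_ok": negative_ok}

print(check_response("<answer>42</answer>", banned_words=["guarantee"]))
# -> {'format_ok': True, 'negative_ok': True}
```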

GPT‑4.1 shows a dramatic jump in success on the “hard” subset of prompts, from 29% with GPT‑4o to 49%.

Model | Accuracy (hard subset)
GPT-4.1 | 49%
GPT-4o (2024-11-20) | 29%
GPT-4.5 | 54%
GPT-4.1 mini | 45%
GPT-4.1 nano | 32%
GPT-4o mini | 27%
OpenAI o1 (high) | 51%
OpenAI o3-mini (high) | 50%

Multi-Turn Instruction Adherence

MultiChallenge from Scale tests how well a model handles multi-turn conversations while retaining details from earlier prompts. GPT‑4.1 improves by 10.5 percentage points over GPT‑4o, scoring 38.3%.

Model | MultiChallenge accuracy
GPT-4.1 | 38%
GPT-4o (2024-11-20) | 28%
GPT-4.5 | 44%
GPT-4.1 mini | 36%
GPT-4.1 nano | 15%
GPT-4o mini | 20%
OpenAI o1 (high) | 45%
OpenAI o3-mini (high) | 40%

IFEval

IFEval tests whether a model can comply with various instructions (content length, tone, avoiding specific words, etc.). GPT‑4.1 scores 87.4%, up from 81.0% with GPT‑4o.

Model | IFEval accuracy
GPT-4.1 | 87%
GPT-4o (2024-11-20) | 81%
GPT-4.5 | 88%
GPT-4.1 mini | 84%
GPT-4.1 nano | 75%
GPT-4o mini | 78%
OpenAI o1 (high) | 92%
OpenAI o3-mini (high) | 94%

Real-World Examples

  • Blue J: On a challenging tax scenario benchmark, GPT‑4.1 was 53% more accurate than GPT‑4o. Blue J’s platform benefits from GPT‑4.1’s ability to handle intricate regulatory frameworks and nuanced instructions—leading to faster, more reliable research.
  • Hex: GPT‑4.1 nearly doubled accuracy on Hex’s most difficult SQL tasks, more reliably picking the correct tables from large, ambiguous schemas. This cut down on debugging time and accelerated production workflows.

Long Context

All three GPT‑4.1 models—standard, mini, and nano—support up to 1 million tokens of context, a significant jump from GPT‑4o’s 128,000 tokens. This means GPT‑4.1 can effectively handle entire codebases or large document sets while maintaining a strong grasp of relevant details and ignoring distractors.

Needle in a Haystack Eval

In our internal “needle in a haystack” test (finding a single piece of hidden info anywhere in up to 1 million tokens), GPT‑4.1 consistently retrieves the correct detail at any position in the text. Its accuracy remains stable across the entire context window.
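
A minimal sketch of how such a probe can be built (the filler text, the needle itself, and its position are all illustrative; this is not our internal harness):

```python
# Sketch of a needle-in-a-haystack probe: hide one fact inside a long filler
# context, then ask the model to retrieve it. Filler, needle, and insertion
# point are illustrative; this is not the internal harness.
import random
from openai import OpenAI

client = OpenAI()

needle = "The deploy password for project Nightjar is 7#kF2"
filler = ("Lorem ipsum dolor sit amet. " * 2000).split(". ")
position = random.randint(0, len(filler) - 1)
filler.insert(position, needle)
haystack = ". ".join(filler)

response = client.responses.create(
    model="gpt-4.1",
    input=haystack + "\n\nWhat is the deploy password for project Nightjar?",
)
print(response.output_text)
```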

OpenAI-MRCR (Multi-Round Coreference)

We are introducing OpenAI-MRCR, an evaluation that tests how well a model can locate and disambiguate multiple “needles” hidden among numerous near-duplicate requests. GPT‑4.1 surpasses GPT‑4o in performance for context lengths up to 128K tokens and continues to perform well at 1 million tokens, though the task remains challenging.

Graphwalks

Graphwalks is a dataset that requires multi-hop reasoning over large context windows: the model answers BFS (breadth-first search) queries about a large directed graph supplied in the context. GPT‑4.1 scores 61.7%, matching the performance of OpenAI o1 (high) and significantly beating GPT‑4o at 41.7%.
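
For intuition, here is a reference BFS of the general sort the benchmark involves: given a directed graph and a start node, return everything reachable within k hops. The toy graph is illustrative:

```python
# Reference BFS: list every node reachable from `start` within k hops of a
# directed graph. The graph below is a toy illustration; the real eval embeds
# a far larger edge list in the prompt.
from collections import deque

def bfs_within_k_hops(edges: dict[str, list[str]], start: str, k: int) -> set[str]:
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen - {start}

edges = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": ["f"]}
print(sorted(bfs_within_k_hops(edges, "a", 2)))  # -> ['b', 'c', 'd', 'e']
```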

Real-World Long-Context Use Cases

  • Thomson Reuters: GPT‑4.1 boosted multi-document review accuracy by 17% in CoCounsel, its legal AI assistant. The model excelled at understanding context across numerous sources and pinpointing subtle overlaps—crucial for legal tasks.
  • Carlyle: GPT‑4.1 achieved 50% better information extraction from massive, data-heavy documents. It overcame tricky pitfalls like “needle-in-the-haystack” retrieval and multi-hop cross-referencing—previously major blockers for large-scale data analysis.

Latency Enhancements

We’ve reworked our inference stack to reduce time-to-first-token. Prompt caching can further minimize wait times while reducing your costs. Under typical conditions, GPT‑4.1’s p95 latency for 128,000 tokens is around 15 seconds for the first token, and roughly 30 seconds for 1 million tokens. GPT‑4.1 mini and nano are faster still, with GPT‑4.1 nano often returning the first token in under five seconds at 128,000 tokens.
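
If you want to measure time-to-first-token on your own prompts, a rough sketch using streaming looks like this; the prompt is illustrative, and your numbers will vary with load and prompt size:

```python
# Rough sketch of measuring time-to-first-token with a streaming request.
# The prompt is illustrative.
import time
from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "Classify this ticket: 'app crashes on login'"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"first token after {time.perf_counter() - start:.2f}s")
        break
```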


Vision

The GPT‑4.1 family excels at image-based tasks, with GPT‑4.1 mini often outperforming GPT‑4o on vision-oriented benchmarks.

Benchmark | GPT-4.1 | GPT-4o (2024-11-20) | GPT-4.5 | GPT-4.1 mini | GPT-4.1 nano | GPT-4o mini
MMMU | 74.8% | 68.7% | 75.2% | 72.7% | 55.4% | 56.3%
MathVista | 72.2% | 61.4% | 72.3% | 73.1% | 56.2% | 56.5%
CharXiv-Reasoning | 56.7% | 52.7% | 55.4% | 56.8% | 40.5% | 36.8%

In Video-MME tasks with 30–60 minute videos and no subtitles, GPT‑4.1 reaches 72.0%, up from GPT‑4o’s 65.3%, a new best-in-class result for video comprehension.
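
For reference, a minimal sketch of an image-based request using the chat completions image-input format; the URL and question are placeholders:

```python
# Minimal sketch of a vision request: pass an image URL alongside a question.
# The URL and question are placeholders.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this chart show?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```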


Pricing

We’re pleased to offer these new GPT‑4.1 models at reduced rates compared to GPT‑4o, thanks to optimizations in our inference infrastructure. Additionally, GPT‑4.1 nano is now our cheapest and fastest model. We also increased the prompt caching discount to 75% (up from 50%) for all GPT‑4.1 models, and there’s no added cost for long-context usage beyond the standard per-token price.

Model | Input | Cached input | Output | Blended pricing* (all per 1M tokens)
gpt-4.1 | $2.00 | $0.50 | $8.00 | $1.84
gpt-4.1-mini | $0.40 | $0.10 | $1.60 | $0.42
gpt-4.1-nano | $0.10 | $0.025 | $0.40 | $0.12

* Blended pricing is estimated from typical ratios of input, cached input, and output tokens.
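
To see how the per-token rates translate into a bill, here is a small worked example; the token counts are made up purely to show the arithmetic:

```python
# Cost of one illustrative gpt-4.1 request, using the per-1M-token rates above.
# The token counts are hypothetical.
PRICE_PER_M = {"input": 2.00, "cached_input": 0.50, "output": 8.00}  # USD per 1M tokens

tokens = {"input": 40_000, "cached_input": 80_000, "output": 2_000}
cost = sum(tokens[k] / 1_000_000 * PRICE_PER_M[k] for k in tokens)
print(f"${cost:.4f}")  # 0.08 + 0.04 + 0.016 -> $0.1360
```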

All three models are available at an additional 50% discount when used in our Batch API.
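
For latency-tolerant workloads, the Batch API flow looks roughly like this; the file name and its contents are illustrative, and the requests file must follow the Batch API's JSONL request format:

```python
# Rough sketch of the Batch API flow behind the 50% discount: upload a JSONL
# file of requests, then create a batch against the chat completions endpoint.
# The file name and its contents are illustrative.
from openai import OpenAI

client = OpenAI()

batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)  # poll until the batch completes
```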


Conclusion

GPT‑4.1 delivers significant improvements in real-world applications—especially in coding tasks, instruction-following, and handling extensive context. Its advanced capabilities empower developers to build more sophisticated, reliable applications and AI-powered “agents.” We can’t wait to see how the community leverages GPT‑4.1’s strengths to create the next generation of intelligent systems.


Appendix: Full Benchmark Results

Below is a comprehensive table summarizing GPT‑4.1 performance across academic tests, coding evals, instruction-following tasks, long-context benchmarks, vision, and function-calling evaluations.

Academic Knowledge

Benchmark | GPT-4.1 | GPT-4.1 mini | GPT-4.1 nano | GPT-4o (2024-11-20) | GPT-4o mini | OpenAI o1 (high) | OpenAI o3-mini (high) | GPT-4.5
AIME '24 | 48.1% | 49.6% | 29.4% | 13.1% | 8.6% | 74.3% | 87.3% | 36.7%
GPQA Diamond | 66.3% | 65.0% | 50.3% | 46.0% | 40.2% | 75.7% | 77.2% | 69.5%
MMLU | 90.2% | 87.5% | 80.1% | 85.7% | 82.0% | 91.8% | 86.9% | 90.8%
Multilingual MMLU | 87.3% | 78.5% | 66.9% | 81.4% | 70.5% | 87.7% | 80.7% | 85.1%

Coding Evals

Benchmark | GPT-4.1 | GPT-4.1 mini | GPT-4.1 nano | GPT-4o (2024-11-20) | GPT-4o mini | OpenAI o1 (high) | OpenAI o3-mini (high) | GPT-4.5
SWE-bench Verified | 54.6% | 23.6% | — | 33.2% | 8.7% | 41.0% | 49.3% | 38.0%
SWE-Lancer | $176K (35.1%) | $165K (33.0%) | $77K (15.3%) | $163K (32.6%) | $116K (23.1%) | $160K (32.1%) | $90K (18.0%) | $186K (37.3%)
SWE-Lancer (IC-Diamond) | $34K (14.4%) | $31K (13.1%) | $9K (3.7%) | $29K (12.4%) | $11K (4.8%) | $29K (9.7%) | $17K (7.4%) | $41K (17.4%)
Aider’s polyglot: whole | 51.6% | 34.7% | 9.8% | 30.7% | 3.6% | 64.6% | 66.7% | —
Aider’s polyglot: diff | 52.9% | 31.6% | 6.2% | 18.2% | 2.7% | 61.7% | 60.4% | 44.9%

Instruction-Following Evals

Benchmark | GPT-4.1 | GPT-4.1 mini | GPT-4.1 nano | GPT-4o (2024-11-20) | GPT-4o mini | OpenAI o1 (high) | OpenAI o3-mini (high) | GPT-4.5
Internal (hard subset) | 49.1% | 45.1% | 31.6% | 29.2% | 27.2% | 51.3% | 50.0% | 54.0%
MultiChallenge | 38.3% | 35.8% | 15.0% | 27.8% | 20.3% | 44.9% | 39.9% | 43.8%
MultiChallenge (o3-mini grader) | 46.2% | 42.2% | 31.1% | 39.9% | 25.6% | 52.9% | 50.2% | 50.1%
COLLIE | 65.8% | 54.6% | 42.5% | 50.2% | 52.7% | 95.3% | 98.7% | 72.3%
IFEval | 87.4% | 84.1% | 74.5% | 81.0% | 78.4% | 92.2% | 93.9% | 88.2%
Multi-IF | 70.8% | 67.0% | 57.2% | 60.9% | 57.9% | 77.9% | 79.5% | 70.8%

Long Context Evals

Benchmark | GPT-4.1 | GPT-4.1 mini | GPT-4.1 nano | GPT-4o (2024-11-20) | GPT-4o mini | OpenAI o1 (high) | OpenAI o3-mini (high) | GPT-4.5
OpenAI-MRCR: 2 needle (128k) | 57.2% | 47.2% | 36.6% | 31.9% | 24.5% | 22.1% | 18.7% | 38.5%
OpenAI-MRCR: 2 needle (1M) | 46.3% | 33.3% | 12.0% | — | — | — | — | —
Graphwalks BFS (<128k) | 61.7% | 61.7% | 25.0% | 41.7% | 29.0% | 62.0% | 51.0% | 72.3%
Graphwalks BFS (>128k) | 19.0% | 15.0% | 2.9% | — | — | — | — | —
Graphwalks Parents (<128k) | 58.0% | 60.5% | 9.4% | 35.4% | 12.6% | 50.9% | 58.3% | 72.6%
Graphwalks Parents (>128k) | 25.0% | 11.0% | 5.6% | — | — | — | — | —

Vision Evals

Benchmark | GPT-4.1 | GPT-4.1 mini | GPT-4.1 nano | GPT-4o (2024-11-20) | GPT-4o mini | OpenAI o1 (high) | GPT-4.5
MMMU | 74.8% | 72.7% | 55.4% | 68.7% | 56.3% | 77.6% | 75.2%
MathVista | 72.2% | 73.1% | 56.2% | 61.4% | 56.5% | 71.8% | 72.3%
CharXiv-Reasoning | 56.7% | 56.8% | 40.5% | 52.7% | 36.8% | 55.1% | 55.4%
CharXiv-D | 87.9% | 88.4% | 73.9% | 85.3% | 76.6% | 88.9% | 90.0%

Function Calling Evals

Benchmark | GPT-4.1 | GPT-4.1 mini | GPT-4.1 nano | GPT-4o (2024-11-20) | GPT-4o mini | OpenAI o1 (high) | OpenAI o3-mini (high) | GPT-4.5
ComplexFuncBench | 65.5% | 49.3% | 57.0% | 66.5% | 38.6% | 47.6% | 17.6% | 63.0%
Taubench Airline | 49.4% | 36.0% | 14.0% | 42.8% | 22.0% | 50.0% | 32.4% | 50.0%
Taubench Retail | 68.0% | 55.8% | 22.6% | 60.3% | 44.0% | 70.8% | 57.6% | 68.4%

Thank you for exploring GPT‑4.1! We look forward to seeing how you harness these new models to build powerful applications and advanced AI-driven systems.