Benchmarks & Milestones

37 tracked

Track the latest AI achievements, model releases, and industry milestones

February 2026

Claude Opus 4.6 — 1M context, agent teams

Features a 1-million-token context window and introduces agent teams — groups of agents that can split larger tasks into segmented jobs. Outperformed GPT-5.2 on several benchmarks.

Feb 5

Model | high | anthropic | claude-opus-4.6

September 2025

Claude Sonnet 4.5 — autonomous agent benchmark leader

State-of-the-art on SWE-bench Verified and OSWorld (61.4%). Can work autonomously for 30+ hours (up from 7 hours with Opus 4). Launched alongside the Claude Agent SDK.

Sep 29

Model | high | anthropic | claude-sonnet-4.5

May 2025

Claude Sonnet 4 and Opus 4 launch

Claude Sonnet 4 is a hybrid model that delivers stronger coding and reasoning than Claude Sonnet 3.7. Anthropic calls Claude Opus 4 the world's best coding model, citing its lead on SWE-bench (72.5%) and Terminal-bench (43.2%).

May 22

Model | high | anthropic | claude-4

April 2025

o3 — OpenAI's smartest reasoning model with tool use

OpenAI releases o3, its most capable reasoning model, which makes 20% fewer major errors than o1. It is OpenAI's first reasoning model that can agentically use and combine all ChatGPT tools (web search, Python, image generation).

Apr 16

Model | high | openai | o3

Llama 4 Scout and Maverick — MoE + 10M context

First Llama models with native multimodality and a mixture-of-experts architecture. Scout supports a 10M-token context window and fits on a single H100 GPU with Int4 quantization. Meta reports that Maverick (128 experts) beats GPT-4o and Gemini 2.0 Flash across a broad range of benchmarks.

Apr 5

Open Source | high | meta | llama-4

March 2025

Gemini 2.5 Pro — Google's thinking model debuts at #1 on LMArena

A 'thinking model' that reasons through steps before responding, building on the experimental Gemini 2.0 Flash Thinking. Debuted at #1 on LMArena by a significant margin. Described by Google as its most intelligent AI model.

Mar 25

Breakthrough | high | google | gemini-2.5-pro

February 2025

Claude 3.7 Sonnet — first hybrid reasoning model

The first hybrid reasoning model on the market, able to produce near-instant responses or engage in visible extended step-by-step thinking. Launched alongside Claude Code, an agentic coding CLI tool.

Feb 24

Breakthrough | high | anthropic | claude-3.7-sonnet

Grok 3 — trained on 200K-GPU Colossus supercluster

Trained with 10x the compute of Grok-2 on the 200,000-GPU Colossus supercluster. Features a 1M-token context window and large-scale reinforcement learning for reasoning.

Feb 17

Model | high | xai | grok-3

January 2025

o3-mini — fast, affordable reasoning

OpenAI releases o3-mini, a cost-efficient reasoning model optimized for STEM. Excels in science, math, and coding, showing that reasoning capabilities can be delivered in lightweight form factors.

Jan 31

Model | medium | openai | o3-mini

DeepSeek R1 — open-source reasoning model shakes markets

An open-source reasoning model rivaling OpenAI o1. Its mobile app briefly surpassed ChatGPT as the #1 free app on the US App Store. In the market reaction that followed, Nvidia's share price fell about 17% in a single day.

Jan 20

Open Source | high | deepseek | r1

December 2024

DeepSeek V3 — trained for $6M, rivals GPT-4o

A 671B MoE model (37B active per token) with a reported training cost of only ~$5.6 million in GPU time for the final run. Competitive with GPT-4o and Claude 3.5 Sonnet on most benchmarks, demonstrating radical training cost efficiency.

Dec 26

Model | high | deepseek | v3

Gemini 2.0 Flash — built for the agentic era

Outperforms Gemini 1.5 Pro on key benchmarks at twice the speed. First Gemini with multimodal output (native image generation + text-to-speech). Positioned for the 'agentic era.'

Dec 11

Model | high | google | gemini-2.0-flash

Llama 3.3 70B — 405B performance at 70B cost

A 70B text-only model delivering performance comparable to the much larger Llama 3.1 405B at a fraction of the serving cost, demonstrating significant efficiency gains through improved training.

Dec 6

Open Source | medium | meta | llama-3.3

o1 full release with ChatGPT Pro ($200/mo)

Full release of o1, which makes 34% fewer major mistakes and responds about 50% faster than o1-preview, and is now multimodal. Launched alongside the $200/month ChatGPT Pro subscription tier featuring o1-pro mode.

Dec 5

Product | high | openai | o1

November 2024

Claude 3.5 Haiku — Opus performance at Haiku speed

Claude 3.5 Haiku matches the performance of the prior Claude 3 Opus at the speed and cost of the smaller Haiku tier. Scored 40.6% on SWE-bench Verified, outperforming many agents using larger models.

Nov 4

Model | medium | anthropic | claude-3.5-haiku

October 2024

Upgraded Claude 3.5 Sonnet + computer use beta

Upgraded Claude 3.5 Sonnet with across-the-board improvements, especially in coding. Introduces computer use in public beta — the first frontier AI to control a computer (cursor, clicks, typing) via API.

Oct 22

Breakthrough | high | anthropic | claude-3.5-sonnet

September 2024

Llama 3.2 — first multimodal + edge models

Meta's first multimodal open models (11B and 90B vision LLMs) plus lightweight models (1B and 3B) for edge and mobile devices. Announced at Meta Connect 2024.

Sep 25

Open Source | high | meta | llama-3.2

o1-preview — OpenAI's first reasoning model

OpenAI launches o1-preview, its first model trained with reinforcement learning to think step by step before answering. It ranked in the 89th percentile on Codeforces and placed among the top 500 US students on AIME.

Sep 12

Breakthrough | high | openai | o1

August 2024

Grok 2 — xAI enters the frontier race

xAI's first competitive frontier model, achieving performance on par with leading models on graduate-level science (GPQA), general knowledge (MMLU), and math. Includes image generation via Flux.

Aug 14

Model | medium | xai | grok-2

July 2024

Llama 3.1 405B — first open frontier model

The first openly available frontier-level model at 405 billion parameters, rivaling top proprietary models in general knowledge, math, tool use, and multilingual translation. 128K context.

Jul 23

Open Source | high | meta | llama-3.1

GPT-4o mini — cost-efficient intelligence

OpenAI releases GPT-4o mini, a smaller and significantly cheaper model that scores 82% on MMLU and outperforms GPT-3.5 Turbo while costing more than 60% less to run.

Jul 18

Model | medium | openai | gpt-4o-mini

June 2024

Claude 3.5 Sonnet — faster than Opus, cheaper too

Claude 3.5 Sonnet outperforms Claude 3 Opus on key benchmarks at 2x the speed and 1/5 the cost. Introduces Artifacts, an interactive workspace for AI-generated content alongside conversations.

Jun 20

Model | high | anthropic | claude-3.5-sonnet

May 2024

Gemini 1.5 Flash — speed-optimized with 1M context

A speed-optimized Gemini model for high-frequency tasks where latency matters most, while still supporting 1 million tokens of multimodal context (text, images, audio, video).

May 14

Model | medium | google | gemini-1.5-flash

GPT-4o launches — 2x faster, half the price

OpenAI releases GPT-4o ('omni'), which natively processes and generates text, audio, and images. 2x faster and 50% cheaper than GPT-4 Turbo. Made free to all ChatGPT users.

May 13

Model | high | openai | gpt-4o

DeepSeek V2 — MoE with radical cost efficiency

A MoE model with 236B total parameters but only 21B active, pioneering efficient inference with Multi-head Latent Attention. Dramatically undercut competitor pricing.

May 6

Model | medium | deepseek | v2

April 2024

Llama 3 — most capable open LLM at launch

The most capable openly available LLM at launch, with significantly improved reasoning, coding, and instruction-following over Llama 2. Available in 8B and 70B parameter sizes.

Apr 18

Open Source | high | meta | llama-3

March 2024

Claude 3 family — Opus, Sonnet, Haiku

Anthropic launches three models: Opus (most intelligent), Sonnet (balanced), and Haiku (fastest). Claude 3 Opus outperforms GPT-4 on most benchmarks. These are Anthropic's first models with vision capabilities.

Mar 4

Model | high | anthropic | claude-3

February 2024

Mistral Large — European frontier alternative

Mistral's first true flagship model competing with GPT-4. Features 32K context, specialized RAG support, and best-in-class multilingual performance across French, German, Spanish, and Italian.

Feb 26

Model | medium | mistral | mistral-large

Gemini 1.5 Pro — 1 million token context window

Gemini 1.5 Pro achieves a breakthrough 1-million-token context window, capable of processing 1 hour of video, 11 hours of audio, or 700K+ words in a single prompt.

Feb 15

Breakthrough | high | google | gemini-1.5-pro

December 2023

Mixtral 8x7B — open MoE beats Llama 2 70B

A pioneering open-weight sparse mixture-of-experts model that outperforms Llama 2 70B on most benchmarks with 6x faster inference. Released under the permissive Apache 2.0 license.

Dec 11

Open Source | high | mistral | mixtral

Google launches Gemini 1.0

Google DeepMind releases Gemini 1.0 in Ultra, Pro, and Nano sizes. The Ultra model is the first to claim human-expert performance on MMLU (90.0%). Natively multimodal from training.

Dec 6

Model | high | google | gemini

November 2023

GPT-4 Turbo announced at DevDay with 128K context

OpenAI announces GPT-4 Turbo at its first DevDay conference, featuring a 128K context window, JSON mode, and significantly lower pricing — 3x cheaper input, 2x cheaper output vs GPT-4.

Nov 6

Model | high | openai | gpt-4-turbo

July 2023

Meta open-sources Llama 2

Meta releases Llama 2 in partnership with Microsoft — one of the first major open-weight foundation models for commercial use. Available in 7B, 13B, and 70B parameter sizes.

Jul 18

Open Source | high | meta | llama-2

Claude 2 launches with 100K context to the public

Anthropic releases Claude 2 with a 100K token context window, allowing users to process entire books in a single conversation. The first Anthropic model available to the general public via claude.ai.

Jul 11

Model | medium | anthropic | claude-2

March 2023

GPT-4 released with multimodal capabilities

OpenAI launches GPT-4, a large multimodal model that accepts image and text inputs. It scores around the 90th percentile on a simulated bar exam and demonstrates significant reasoning improvements over GPT-3.5.

Mar 14

Model | high | openai | gpt-4

Anthropic introduces Claude

Anthropic launches its first public AI assistant, positioned as helpful, honest, and harmless. Available in Claude and Claude Instant variants, initially via API only.

Mar 14

Product | medium | anthropic | claude

November 2022

ChatGPT launches and reaches 100M users

OpenAI releases ChatGPT, a conversational AI assistant based on GPT-3.5. It became the fastest-growing consumer app up to that point, reaching an estimated 100 million monthly users within two months of launch.

Nov 30

Product | high | openai | chatgpt

Model Comparison

Score Breakdown (chart; per-model attribution of the values was not preserved)

MMLU: 90.8, 89.1

HumanEval: no values shown

MATH: 78.3, 73.5

GPQA: 56.3

SWE-bench Verified: n/a

Claude Sonnet 4.5

Provider: anthropic

Context: 200k

Pricing: $3 input / $15 output per 1M tokens

balanced | coding | fast

Gemini 2.0 Flash

Provider: google

Context: 1M

Pricing: $0.10 input / $0.40 output per 1M tokens

fast | cost-efficient | long-context
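
To compare the two pricing figures on a concrete workload, the rough sketch below estimates the cost of a single request. It assumes each price pair means USD per one million input tokens followed by USD per one million output tokens. The names `Pricing`, `PRICING`, and `request_cost` are hypothetical and are not part of any provider SDK, and the token counts are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens (assumed reading of the 'per M' figures)."""
    input_per_m: float
    output_per_m: float


# Prices copied from the two comparison cards; the dictionary itself is illustrative.
PRICING = {
    "Claude Sonnet 4.5": Pricing(input_per_m=3.00, output_per_m=15.00),
    "Gemini 2.0 Flash": Pricing(input_per_m=0.10, output_per_m=0.40),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request from its token counts."""
    price = PRICING[model]
    total = input_tokens * price.input_per_m + output_tokens * price.output_per_m
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: a 200k-token prompt with a 4k-token answer.
    for name in PRICING:
        print(f"{name}: ${request_cost(name, 200_000, 4_000):.4f}")
```

Under these assumptions the example workload costs about $0.66 on Claude Sonnet 4.5 and about $0.02 on Gemini 2.0 Flash. The sketch ignores caching discounts, batch pricing, and any long-context surcharges a provider may apply.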