Model Reference — Updated
Major LLMs compared:
context, cost, capability.
Model choice affects output quality as much as prompt design. This page compares eight major LLMs on the dimensions that actually matter for production use: context window, reasoning ability, cost, and the tasks each model handles best.
Pricing and context windows change frequently. Verify current specs on each provider's pricing page before making architecture decisions. Figures here are approximate as of .
01 Full Model Comparison
| Model | Provider | Context Window | Input Cost (per 1M tokens) | Primary Strengths |
|---|---|---|---|---|
| Claude Opus 4 | Anthropic | 200k tokens | ~$15 | Long-context reasoning, instruction following, coding |
| Claude Sonnet 4 | Anthropic | 200k tokens | ~$3 | Balanced capability/cost, structured output, agentic tasks |
| GPT-4o | OpenAI | 128k tokens | ~$5 | Multimodal input, function calling, broad benchmark coverage |
| o3 | OpenAI | 200k tokens | ~$10 | Deep reasoning, math, science — slower and pricier than GPT-4o |
| Gemini 2.0 Ultra | Google DeepMind | 1M tokens | ~$7 | Massive context, document-level tasks, native Google integration |
| Gemini 2.0 Flash | Google DeepMind | 1M tokens | ~$0.07 | High-volume, low-latency, cheapest frontier-quality option |
| Llama 3.3 70B | Meta (open weights) | 128k tokens | Free (self-hosted) / ~$0.90 (hosted) | Open weights, customisable, strong for fine-tuning |
| Mistral Large 2 | Mistral AI | 128k tokens | ~$2 | European compliance, multilingual, function calling |
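The per-1M-token prices above translate directly into per-request costs. A minimal sketch, using the approximate figures from the table (model identifiers here are illustrative labels, not official API model names):

```python
# USD per 1M input tokens — approximate figures from the comparison table.
PRICE_PER_1M = {
    "claude-opus-4": 15.0,
    "claude-sonnet-4": 3.0,
    "gpt-4o": 5.0,
    "o3": 10.0,
    "gemini-2.0-ultra": 7.0,
    "gemini-2.0-flash": 0.07,
    "llama-3.3-70b-hosted": 0.90,
    "mistral-large-2": 2.0,
}

def input_cost(model: str, input_tokens: int) -> float:
    """Dollar cost of the input tokens for a single request."""
    return input_tokens / 1_000_000 * PRICE_PER_1M[model]
```

At 50k input tokens per call, the spread is large: roughly $0.75 on Claude Opus 4 versus about $0.0035 on Gemini 2.0 Flash, which is why high-volume pipelines gravitate to the cheaper tier.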
02 Decision Guide — Which Model for Which Task?
| Use Case | Recommended | Why |
|---|---|---|
| Prototype / exploration | Claude Sonnet 4 or GPT-4o | Best capability-per-dollar ratio; good instruction following for iterating prompts |
| Complex reasoning / math | o3 or Claude Opus 4 | Extended thinking / reasoning chains outperform standard models on multi-step problems |
| Document summarisation (>100k tokens) | Gemini 2.0 Ultra | 1M context window handles full codebases, books, or long contract sets |
| High-volume, low-latency inference | Gemini Flash 2.0 | Lowest cost per token at frontier quality; sub-second latency at scale |
| On-premise or air-gapped deployment | Llama 3.3 70B | Open weights; can be quantized and run locally without external API calls |
| European data residency required | Mistral Large 2 | Hosted in EU; easier GDPR compliance; strong multilingual support |
| Agentic / tool-use workflows | Claude Sonnet 4 | Reliable tool-call execution, low hallucination on structured output, good instruction following |
| Fine-tuning for custom domain | Llama 3.3 70B | Open weights allow full fine-tuning; 70B is the smallest model that retains strong zero-shot capability after fine-tuning |
03 How Model Choice Affects Prompting Strategy
Larger models need less hand-holding
With GPT-4o, Claude Opus 4, or Gemini Ultra, a simple zero-shot instruction often works where smaller models would need few-shot examples or an explicit chain-of-thought trigger. However, even large models benefit from explicit output format instructions.
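A typical zero-shot prompt for a large model looks like this — no few-shot examples, but the output format is still spelled out explicitly (the classification task is a hypothetical example):

```python
# Zero-shot prompt: large models usually don't need examples, but still
# benefit from an explicit output-format constraint.
review = "The battery lasts all day, but the screen scratches far too easily."

prompt = (
    "Classify the sentiment of the review below as positive, negative, or mixed.\n"
    "Respond with exactly one word, lowercase, and nothing else.\n\n"
    f"Review: {review}"
)
```

On a smaller model, the same task would typically need two or three labelled examples prepended before the review.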
Reasoning models (o3, extended-thinking modes) change the CoT calculus
Models with built-in reasoning chains — OpenAI's o3 series, Claude's extended thinking mode — should not be prompted with explicit step-by-step instructions. The model handles internal reasoning automatically; adding CoT phrasing can actually degrade performance by interfering with the model's own process.
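The difference is easiest to see side by side (the task below is an illustrative example):

```python
# The same multi-step problem, prompted two ways.
task = "A warehouse ships 240 boxes per day. How many boxes does it ship in 3 weeks?"

# Standard models often benefit from an explicit CoT trigger:
standard_prompt = task + "\n\nLet's think step by step."

# Reasoning models (o3, extended thinking) get the bare task — internal
# reasoning replaces the trigger, which can otherwise interfere:
reasoning_prompt = task
```

The only change when switching to a reasoning model is removing the trigger phrase; the task itself stays identical.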
Open models require more defensive prompting
Llama and Mistral variants tend to be more sensitive to prompt format. Few-shot examples help significantly. Output format instructions must be very explicit — these models are more likely to add conversational preamble before a JSON block, breaking downstream parsers.
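Even with explicit instructions, it pays to parse defensively. A minimal sketch of a parser that tolerates conversational preamble and markdown code fences around a JSON object (a common failure mode with open models):

```python
import json
import re

def extract_json(text: str) -> dict:
    """Pull the first JSON object out of a model response that may include
    conversational preamble or a markdown code fence."""
    # Prefer content inside a ``` or ```json fence, if one is present.
    fenced = re.search(r"`{3}(?:json)?\s*(\{.*?\})\s*`{3}", text, re.DOTALL)
    candidate = fenced.group(1) if fenced else text
    # Fall back to the first {...} span in the text.
    match = re.search(r"\{.*\}", candidate, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in response")
    return json.loads(match.group(0))
```

This turns a response like `Sure! Here is the result: {"label": "spam"}` into a clean dict instead of a parser crash downstream.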
Context window size changes chunking strategy
With a 1M-token context, the full document strategy (pass everything in one call) is viable for most documents. With 128k or less, plan for retrieval-augmented generation (RAG) or map-reduce summarisation patterns.
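The decision can be encoded as a simple pre-flight check. A sketch under stated assumptions — the ~4-characters-per-token estimate and the 80% headroom threshold are rough heuristics for illustration, not fixed rules:

```python
def summarisation_strategy(document: str, context_window: int = 128_000) -> str:
    """Pick a summarisation strategy from a rough token estimate."""
    est_tokens = len(document) // 4  # ~4 chars per token for English text
    # Leave headroom for the instruction and the model's response.
    if est_tokens < context_window * 0.8:
        return "full-document"  # single call with the whole text
    return "map-reduce"         # chunk, summarise chunks, summarise the summaries
```

With `context_window=1_000_000` (Gemini-class), almost everything falls into the full-document branch; at 128k, long contracts and codebases tip over into map-reduce or RAG.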
Last updated: