Model Reference

Major LLMs compared: context, cost, capability.

Model choice affects output quality as much as prompt design. This page compares eight major LLMs on the dimensions that actually matter for production use: context window, reasoning ability, cost, and the tasks each model handles best.

Disclaimer

Pricing and context windows change frequently. Verify current specs on each provider's pricing page before making architecture decisions. Figures here are approximate as of March 2026.

01 Full Model Comparison

Major LLMs — context window, estimated cost, and primary strengths as of March 2026
| Model | Provider | Context Window | Input Cost (per 1M tokens) | Primary Strengths |
|---|---|---|---|---|
| Claude Opus 4 | Anthropic | 200k tokens | ~$15 | Long-context reasoning, instruction following, coding |
| Claude Sonnet 4 | Anthropic | 200k tokens | ~$3 | Balanced capability/cost, structured output, agentic tasks |
| GPT-4o | OpenAI | 128k tokens | ~$5 | Multimodal input, function calling, broad benchmark coverage |
| o3 | OpenAI | 200k tokens | ~$10 | Deep reasoning, math, science; slower and pricier than GPT-4o |
| Gemini 2.0 Ultra | Google DeepMind | 1M tokens | ~$7 | Massive context, document-level tasks, native Google integration |
| Gemini Flash 2.0 | Google DeepMind | 1M tokens | ~$0.07 | High-volume, low-latency, cheapest frontier-quality option |
| Llama 3.3 70B | Meta (open weights) | 128k tokens | Free (self-hosted) / ~$0.90 (hosted) | Open weights, customisable, strong for fine-tuning |
| Mistral Large 2 | Mistral AI | 128k tokens | ~$2 | European compliance, multilingual, function calling |
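As a quick sanity check on the table, per-call input cost can be computed directly. The prices below are the approximate figures from the table, not authoritative values; verify against each provider's pricing page.

```python
# Approximate input prices (USD per 1M tokens) taken from the table above.
# These change frequently — treat them as illustrative, not current.
INPUT_PRICE_PER_M = {
    "claude-opus-4": 15.0,
    "claude-sonnet-4": 3.0,
    "gpt-4o": 5.0,
    "o3": 10.0,
    "gemini-2.0-ultra": 7.0,
    "gemini-flash-2.0": 0.07,
    "llama-3.3-70b-hosted": 0.9,
    "mistral-large-2": 2.0,
}

def input_cost_usd(model: str, tokens: int) -> float:
    """Estimated input cost for one call, ignoring output tokens."""
    return INPUT_PRICE_PER_M[model] / 1_000_000 * tokens

# A 50k-token prompt costs ~$0.75 on Opus vs ~$0.0035 on Flash —
# a 200x spread for the same input size.
print(input_cost_usd("claude-opus-4", 50_000))
print(input_cost_usd("gemini-flash-2.0", 50_000))
```

Output tokens are usually priced several times higher than input tokens, so a real budget model needs both rates.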

02 Decision Guide — Which Model for Which Task?

Task-to-model guidance — recommended models by use case
| Use Case | Recommended | Why |
|---|---|---|
| Prototype / exploration | Claude Sonnet 4 or GPT-4o | Best capability-per-dollar ratio; good instruction following for iterating on prompts |
| Complex reasoning / math | o3 or Claude Opus 4 | Extended thinking / reasoning chains outperform standard models on multi-step problems |
| Document summarisation (>100k tokens) | Gemini 2.0 Ultra | 1M context window handles full codebases, books, or long contract sets |
| High-volume, low-latency inference | Gemini Flash 2.0 | Lowest cost per token at frontier quality; sub-second latency at scale |
| On-premise or air-gapped deployment | Llama 3.3 70B | Open weights; can be quantized and run locally without external API calls |
| European data residency required | Mistral Large 2 | Hosted in the EU; easier GDPR compliance; strong multilingual support |
| Agentic / tool-use workflows | Claude Sonnet 4 | Reliable tool-call execution, low hallucination on structured output, good instruction following |
| Fine-tuning for a custom domain | Llama 3.3 70B | Open weights allow full fine-tuning; 70B is the smallest size that retains strong zero-shot capability after fine-tuning |
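The decision table can be sketched as a toy routing function. The model identifiers are illustrative, and a production router would also weigh latency budgets and your own evaluation results, not just these static rules:

```python
def recommend_model(needs_reasoning: bool = False,
                    context_tokens: int = 0,
                    high_volume: bool = False,
                    on_premise: bool = False,
                    eu_residency: bool = False) -> str:
    """Toy encoding of the decision table above; hard constraints
    (deployment, residency, context size) are checked first."""
    if on_premise:
        return "llama-3.3-70b"
    if eu_residency:
        return "mistral-large-2"
    if context_tokens > 100_000:
        return "gemini-2.0-ultra"
    if needs_reasoning:
        return "o3"
    if high_volume:
        return "gemini-flash-2.0"
    return "claude-sonnet-4"  # default: best capability-per-dollar
```

The ordering matters: deployment and residency requirements are non-negotiable, so they short-circuit before any quality/cost trade-off is considered.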

03 How Model Choice Affects Prompting Strategy

Larger models need less hand-holding

With GPT-4o, Claude Opus 4, or Gemini Ultra, a simple zero-shot instruction often works where smaller models would need few-shot examples or an explicit chain-of-thought trigger. However, even large models benefit from explicit output format instructions.
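A minimal sketch of that point: keep the instruction itself zero-shot, but always append an explicit output-format clause. The schema string here is illustrative, not a real API contract:

```python
def build_prompt(task: str, output_schema: str = "") -> str:
    """Zero-shot prompt builder: no few-shot examples, but an
    explicit format clause even for capable models."""
    if not output_schema:
        return task
    return (task
            + "\n\nRespond with only a JSON object matching this schema, "
              "with no prose before or after:\n"
            + output_schema)

prompt = build_prompt(
    "Classify the sentiment of: 'Great battery life.'",
    '{"sentiment": "positive" | "negative" | "neutral"}',
)
print(prompt)
```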

Reasoning models (o3, extended-thinking modes) change the CoT calculus

Models with built-in reasoning chains — OpenAI's o3 series, Claude's extended thinking mode — should not be prompted with explicit step-by-step instructions. The model handles internal reasoning automatically; adding CoT phrasing can actually degrade performance by interfering with the model's own process.
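One way to encode this rule is to add the CoT trigger conditionally. The set of reasoning-model identifiers below is an illustrative assumption, not an exhaustive or official list:

```python
# Hypothetical registry of models with built-in reasoning chains.
REASONING_MODELS = {"o3", "claude-opus-4-extended-thinking"}

def maybe_add_cot(prompt: str, model: str) -> str:
    """Append a step-by-step trigger only for standard models;
    reasoning models plan internally, and extra CoT phrasing
    can interfere with that process."""
    if model in REASONING_MODELS:
        return prompt
    return prompt + "\n\nThink step by step before answering."
```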

Open models require more defensive prompting

Llama and Mistral variants tend to be more sensitive to prompt format. Few-shot examples help significantly. Output format instructions must be very explicit — these models are more likely to add conversational preamble before a JSON block, breaking downstream parsers.
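A defensive parsing sketch for that failure mode: tolerate conversational text around the JSON object instead of feeding the raw response to the parser. The greedy regex assumes one top-level JSON object per response, which is the common structured-output case:

```python
import json
import re

def extract_json(raw: str) -> dict:
    """Pull a JSON object out of a model response that may include
    conversational preamble or postamble — a common failure mode
    with open-weight models."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))

# Survives a chatty wrapper that would break json.loads() on the raw string:
chatty = 'Sure! Here is the result:\n{"label": "spam"}\nHope that helps.'
print(extract_json(chatty))
```

For multi-object or deeply adversarial outputs you would want a real streaming parser or constrained decoding, but this covers the "preamble before a JSON block" case described above.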

Context window size changes chunking strategy

With a 1M-token context, the full-document strategy (pass everything in one call) is viable for most documents. With 128k or less, plan for retrieval-augmented generation (RAG) or map-reduce summarisation patterns.
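The map-reduce pattern can be sketched as follows. The 4-characters-per-token ratio is a rough English-text heuristic (a real pipeline would use the model's tokenizer), and `summarise` stands in for a real model call:

```python
def chunk(text: str, max_tokens: int = 100_000,
          chars_per_token: int = 4) -> list[str]:
    """Naive character-based chunking sized to fit a context window.
    ~4 chars/token is a rough heuristic, not an exact tokenizer."""
    size = max_tokens * chars_per_token
    return [text[i:i + size] for i in range(0, len(text), size)]

def map_reduce_summary(text: str, summarise) -> str:
    """Map: summarise each chunk independently.
    Reduce: summarise the concatenated chunk summaries.
    `summarise` is a placeholder for your model call."""
    parts = [summarise(c) for c in chunk(text)]
    if len(parts) == 1:
        return parts[0]
    return summarise("\n\n".join(parts))
```

If the joined chunk summaries still exceed the context window, the reduce step recurses naturally by calling `map_reduce_summary` on the joined text instead of `summarise`.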