Model Reference — Updated
Major LLMs compared:
context, cost, capability.
Model choice affects output quality as much as prompt design. This page compares eight major LLMs on the dimensions that actually matter for production use: context window, reasoning ability, cost, and the tasks each model handles best.
Pricing and context windows change frequently. Verify current specs on each provider's pricing page before making architecture decisions. Figures here are approximate as of .
01 Full Model Comparison
| Model | Provider | Context Window | Input Cost (per 1M tokens) | Primary Strengths |
|---|---|---|---|---|
| Claude Opus 4 | Anthropic | 200k tokens | ~$15 | Long-context reasoning, instruction following, coding |
| Claude Sonnet 4 | Anthropic | 200k tokens | ~$3 | Balanced capability/cost, structured output, agentic tasks |
| GPT-4o | OpenAI | 128k tokens | ~$5 | Multimodal input, function calling, broad benchmark coverage |
| o3 | OpenAI | 200k tokens | ~$10 | Deep reasoning, math, science — slower and pricier than GPT-4o |
| Gemini 2.0 Ultra | Google DeepMind | 1M tokens | ~$7 | Massive context, document-level tasks, native Google integration |
| Gemini 2.0 Flash | Google DeepMind | 1M tokens | ~$0.07 | High-volume, low-latency, cheapest frontier-quality option |
| Llama 3.3 70B | Meta (open weights) | 128k tokens | Free (self-hosted) / ~$0.90 (hosted) | Open weights, customisable, strong for fine-tuning |
| Mistral Large 2 | Mistral AI | 128k tokens | ~$2 | European compliance, multilingual, function calling |
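The per-1M-token prices above translate directly into per-request costs. A minimal sketch, using the approximate figures from the table (model identifiers here are illustrative labels, not official API model names):

```python
# USD per 1M input tokens — approximate figures from the comparison table.
PRICE_PER_1M = {
    "claude-opus-4": 15.0,
    "claude-sonnet-4": 3.0,
    "gpt-4o": 5.0,
    "o3": 10.0,
    "gemini-2.0-ultra": 7.0,
    "gemini-2.0-flash": 0.07,
    "llama-3.3-70b-hosted": 0.90,
    "mistral-large-2": 2.0,
}

def input_cost(model: str, input_tokens: int) -> float:
    """Dollar cost of the input tokens for a single request."""
    return input_tokens / 1_000_000 * PRICE_PER_1M[model]
```

At 50k input tokens per call, the spread is large: roughly $0.75 on Claude Opus 4 versus about $0.0035 on Gemini 2.0 Flash, which is why high-volume pipelines gravitate to the cheaper tier.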
02 Decision Guide — Which Model for Which Task?
| Use Case | Recommended | Why |
|---|---|---|
| Prototype / exploration | Claude Sonnet 4 or GPT-4o | Best capability-per-dollar ratio; good instruction following for iterating prompts |
| Complex reasoning / math | o3 or Claude Opus 4 | Extended thinking / reasoning chains outperform standard models on multi-step problems |
| Document summarisation (>100k tokens) | Gemini 2.0 Ultra | 1M context window handles full codebases, books, or long contract sets |
| High-volume, low-latency inference | Gemini Flash 2.0 | Lowest cost per token at frontier quality; sub-second latency at scale |
| On-premise or air-gapped deployment | Llama 3.3 70B | Open weights; can be quantized and run locally without external API calls |
| European data residency required | Mistral Large 2 | Hosted in EU; easier GDPR compliance; strong multilingual support |
| Agentic / tool-use workflows | Claude Sonnet 4 | Reliable tool-call execution, low hallucination on structured output, good instruction following |
| Fine-tuning for custom domain | Llama 3.3 70B | Open weights allow full fine-tuning; 70B is the smallest model that retains strong zero-shot capability after fine-tuning |
03 How Model Choice Affects Prompting Strategy
Larger models need less hand-holding
With GPT-4o, Claude Opus 4, or Gemini Ultra, a simple zero-shot instruction often works where smaller models would need few-shot examples or an explicit chain-of-thought trigger. However, even large models benefit from explicit output format instructions.
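A typical zero-shot prompt for a large model looks like this — no few-shot examples, but the output format is still spelled out explicitly (the classification task is a hypothetical example):

```python
# Zero-shot prompt: large models usually don't need examples, but still
# benefit from an explicit output-format constraint.
review = "The battery lasts all day, but the screen scratches far too easily."

prompt = (
    "Classify the sentiment of the review below as positive, negative, or mixed.\n"
    "Respond with exactly one word, lowercase, and nothing else.\n\n"
    f"Review: {review}"
)
```

On a smaller model, the same task would typically need two or three labelled examples prepended before the review.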
Reasoning models (o3, extended-thinking modes) change the CoT calculus
Models with built-in reasoning chains — OpenAI's o3 series, Claude's extended thinking mode — should not be prompted with explicit step-by-step instructions. The model handles internal reasoning automatically; adding CoT phrasing can actually degrade performance by interfering with the model's own process.
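The difference is easiest to see side by side (the task below is an illustrative example):

```python
# The same multi-step problem, prompted two ways.
task = "A warehouse ships 240 boxes per day. How many boxes does it ship in 3 weeks?"

# Standard models often benefit from an explicit CoT trigger:
standard_prompt = task + "\n\nLet's think step by step."

# Reasoning models (o3, extended thinking) get the bare task — internal
# reasoning replaces the trigger, which can otherwise interfere:
reasoning_prompt = task
```

The only change when switching to a reasoning model is removing the trigger phrase; the task itself stays identical.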
Open models require more defensive prompting
Llama and Mistral variants tend to be more sensitive to prompt format. Few-shot examples help significantly. Output format instructions must be very explicit — these models are more likely to add conversational preamble before a JSON block, breaking downstream parsers.
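Even with explicit instructions, it pays to parse defensively. A minimal sketch of a parser that tolerates conversational preamble and markdown code fences around a JSON object (a common failure mode with open models):

```python
import json
import re

def extract_json(text: str) -> dict:
    """Pull the first JSON object out of a model response that may include
    conversational preamble or a markdown code fence."""
    # Prefer content inside a ``` or ```json fence, if one is present.
    fenced = re.search(r"`{3}(?:json)?\s*(\{.*?\})\s*`{3}", text, re.DOTALL)
    candidate = fenced.group(1) if fenced else text
    # Fall back to the first {...} span in the text.
    match = re.search(r"\{.*\}", candidate, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in response")
    return json.loads(match.group(0))
```

This turns a response like `Sure! Here is the result: {"label": "spam"}` into a clean dict instead of a parser crash downstream.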
Context window size changes chunking strategy
With a 1M-token context, the full document strategy (pass everything in one call) is viable for most documents. With 128k or less, plan for retrieval-augmented generation (RAG) or map-reduce summarisation patterns.
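The decision can be encoded as a simple pre-flight check. A sketch under stated assumptions — the ~4-characters-per-token estimate and the 80% headroom threshold are rough heuristics for illustration, not fixed rules:

```python
def summarisation_strategy(document: str, context_window: int = 128_000) -> str:
    """Pick a summarisation strategy from a rough token estimate."""
    est_tokens = len(document) // 4  # ~4 chars per token for English text
    # Leave headroom for the instruction and the model's response.
    if est_tokens < context_window * 0.8:
        return "full-document"  # single call with the whole text
    return "map-reduce"         # chunk, summarise chunks, summarise the summaries
```

With `context_window=1_000_000` (Gemini-class), almost everything falls into the full-document branch; at 128k, long contracts and codebases tip over into map-reduce or RAG.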
Last updated: