Advanced Technique Reference

Beyond the basics.
Production-grade patterns.

These techniques go beyond single-prompt interactions. RAG grounds answers in your data. ReAct lets models act and reason in a loop. Prompt chaining breaks complex tasks into auditable steps. Each pattern adds power — and complexity. Use only what the task requires.

00 When to use which pattern

Situation | Pattern | Cost vs. basic
Answer must come from your documents, not training data | RAG | Medium
Model needs real-time data or to take actions in external systems | Tool Use | Medium
Multi-step task where reasoning must adapt based on intermediate results | ReAct | High
Complex task with known sequential steps; auditability required | Prompt Chaining | Medium–High
Problem requires exploration, backtracking, or comparing multiple approaches | Tree of Thoughts | Very high
Ongoing dialogue; user refines query across multiple turns | Multi-Turn | Low–Medium

01 Retrieval-Augmented Generation (RAG)

RAG retrieves relevant documents from an external source and injects them into the prompt before the model answers. The model is instructed to use only the provided context, grounding its answer in your data rather than its training weights.

When to use it

Use RAG when the answer must come from a specific, controlled corpus: internal documentation, product manuals, legal policies, or any domain where training-data knowledge is insufficient or untrustworthy. RAG is the primary pattern for production enterprise Q&A systems.

How the prompt is structured

Every RAG call has two layers. The system prompt sets grounding rules. The user prompt concatenates retrieved context blocks followed by the query.

SYSTEM:
You are a helpful assistant. Answer using ONLY the information in the context below.
Do not use outside knowledge. Do not infer beyond what is explicitly stated.
If the context is insufficient, say: "I don't have enough information to answer this."

Context:
[DOC_1: source="policy-handbook-v3.pdf", page=12]
Employees are entitled to 20 days of annual leave per calendar year...

[DOC_2: source="policy-handbook-v3.pdf", page=14]
Leave requests must be submitted at least 5 business days in advance...

USER:
How much notice do I need to give before taking annual leave?
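
The two-layer structure above can be assembled programmatically. A minimal sketch in Python, assuming a simple `Doc` record for retrieved chunks; `build_rag_prompt` and its field names are illustrative, not any particular SDK's API:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    doc_id: str   # label used in the context block, e.g. "DOC_1"
    source: str   # originating file for attribution
    page: int
    text: str     # chunk content

GROUNDING_RULES = (
    "You are a helpful assistant. Answer using ONLY the information in the "
    "context below. Do not use outside knowledge. Do not infer beyond what is "
    "explicitly stated. If the context is insufficient, say: "
    '"I don\'t have enough information to answer this."'
)

def build_rag_prompt(docs: list[Doc], query: str) -> tuple[str, str]:
    """Return (system_prompt, user_prompt) with discrete, labelled context blocks."""
    blocks = [
        f'[{d.doc_id}: source="{d.source}", page={d.page}]\n{d.text}'
        for d in docs
    ]
    system = GROUNDING_RULES + "\n\nContext:\n" + "\n\n".join(blocks)
    return system, query
```

Keeping each chunk as a labelled block (rather than concatenating raw text) is what makes per-claim source attribution possible downstream.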

Structural best practices

  • Pass context as discrete, labelled blocks — not one raw text blob. This enables source citation and makes multi-hop retrieval easier.
  • Include source metadata (filename, page, date) in each block so the model can attribute claims.
  • Match chunk size to task: smaller chunks (<512 tokens) improve precision for factoid Q&A; larger chunks work better for summarisation.
  • Use RAG-Fusion — run multiple reformulations of the query and combine retrieved results — to improve recall on ambiguous queries.

Advanced patterns

Pattern | What it does | Use when
RAG-Fusion | Generates multiple query reformulations, retrieves for each, merges | Query intent is ambiguous
HyDE | Generates a hypothetical ideal answer, embeds it to retrieve similar real docs | Semantic gap between question and document wording
Multi-hop decomposition | Breaks a multi-part question into sub-questions, retrieves for each | Answer requires chaining multiple facts
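
The merge step in RAG-Fusion is commonly implemented as reciprocal rank fusion (RRF). A sketch assuming each query reformulation has already produced a ranked list of document IDs; the function name and the conventional `k = 60` smoothing constant are illustrative:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc IDs into one list, scoring each doc
    by the sum of 1 / (k + rank) over every list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)
```

Documents retrieved by several reformulations accumulate score from each list, so they rise above documents that matched only one phrasing of the query.
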

Security Note

PoisonedRAG (2024 research): injecting as few as 5 malicious documents into a corpus of millions caused models to return attacker-specified false answers 90% of the time on targeted queries. Sanitise and validate all documents added to your retrieval corpus.

Known failure modes

  • Hallucination despite retrieval — weak system prompts let models blend retrieved content with training knowledge. Use explicit grounding constraints.
  • Prompt injection via retrieved content — malicious instructions embedded in retrieved web pages or user-uploaded documents can hijack the system prompt.
  • Chunk-query mismatch — wrong chunk granularity causes relevant content to be scattered across multiple chunks, none retrieved individually.
  • Stale index — outdated documents produce confidently wrong answers. Set a re-indexing schedule for time-sensitive corpora.

02 Tool Use / Function Calling

Tool use lets the model output a structured specification of which external function to call and with what parameters, rather than generating a natural-language answer. The host system executes the tool and returns the result; the model then incorporates it into its response.

When to use it

  • Real-time or dynamic data the model can't know from training (stock prices, weather, user account state)
  • Computation the model does poorly natively: arithmetic, database lookups, code execution
  • Actions in external systems: send email, create ticket, update record
  • Responses that must feed into downstream systems in a strict schema

Tool definition structure

Tools are defined as JSON schemas passed in every request. The description field is the most critical part — it determines whether the model picks the right tool.

{
  "name": "search_knowledge_base",
  "description": "Search internal product documentation for policies, features,
                  or troubleshooting guides. Use when the user asks about specific
                  product behaviour or company policy. Do NOT use for general
                  knowledge questions.",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "Focused search phrase or question."
      },
      "category": {
        "type": "string",
        "enum": ["billing", "technical", "policy", "general"],
        "description": "Narrow the search scope."
      }
    },
    "required": ["query"]
  }
}
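
Before executing a tool call, the host should check the model's arguments against the schema; this catches wrong-type and out-of-enum parameters early. A hand-rolled sketch (production systems would normally use a full JSON Schema validator; `validate_args` and its type map are illustrative and cover only the types this example needs):

```python
TYPE_MAP = {"string": str, "integer": int, "number": (int, float),
            "boolean": bool, "object": dict, "array": list}

def validate_args(schema: dict, args: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the call is OK."""
    errors = []
    params = schema["parameters"]
    props = params["properties"]
    for name in params.get("required", []):
        if name not in args:
            errors.append(f"missing required parameter: {name}")
    for name, value in args.items():
        if name not in props:
            errors.append(f"unknown parameter: {name}")
            continue
        spec = props[name]
        if not isinstance(value, TYPE_MAP[spec["type"]]):
            errors.append(f"{name}: expected {spec['type']}, "
                          f"got {type(value).__name__}")
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name}: {value!r} not in {spec['enum']}")
    return errors
```

Rejecting the call and returning the error list to the model as a tool result often lets it self-correct on the next turn.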

Best practices

  • Descriptions drive tool selection. Specify when to use and when NOT to use each tool. Vague descriptions produce wrong tool selection.
  • Keep schemas flat. Nested JSON schemas beyond 2 levels degrade output quality significantly. Flatten whenever possible.
  • Limit active tools per call. Tool definitions consume tokens and confuse models when too many are present. Expose only tools relevant to the current task context.
  • Use API-level structured outputs. GPT-4o (2024-08-06+) with Structured Outputs achieves near-100% schema compliance. Prompt-only JSON mode is far less reliable.

Known failure modes

  • Wrong tool selection — ambiguous or similar tool names/descriptions cause the model to invoke the wrong tool.
  • Parameter type errors — model generates parameters in the wrong type (string instead of integer, serialised string instead of array).
  • Infinite tool loops — in agentic settings, the model gets stuck repeatedly calling the same tool with slight variations. Add a maximum iteration limit.
  • Over-calling tools — model invokes tools even when context already contains the answer, wasting latency and tokens.
  • Token budget — tool definitions count against the context limit. In large toolsets, definitions can crowd out conversation history.

03 ReAct (Reasoning + Acting)

ReAct interleaves reasoning traces (Thought) with executable actions (Action) and their results (Observation) in a loop. The model reasons about what to do, acts, observes the result, then reasons again — until it reaches an answer. It combines the accuracy of chain-of-thought with the information access of tool use.

When to use it

Use ReAct for multi-step tasks where reasoning must adapt based on what tools return. It's the foundational pattern for agentic systems. It's also more auditable than opaque agentic frameworks because every reasoning step is visible in the Thought trace.

The Thought–Action–Observation loop

Question: What is the population of the capital of the country that won the
          2022 FIFA World Cup?

Thought: I need to find out which country won the 2022 FIFA World Cup.
Action: Search[2022 FIFA World Cup winner]
Observation: Argentina won the 2022 FIFA World Cup, defeating France in the final.

Thought: Argentina's capital is Buenos Aires. Now I need its population.
Action: Search[Buenos Aires population 2024]
Observation: Buenos Aires has a population of approximately 3.1 million in the
             city proper, 15.6 million in the greater metro area.

Thought: I have the answer.
Action: Finish[The capital of Argentina (Buenos Aires) has a city population of
              ~3.1 million and a metro population of ~15.6 million.]

Prompt template

You are an assistant that answers questions using these tools:
- Search[query]: Returns a web search snippet.
- Lookup[term]: Looks up a term in the current document.
- Finish[answer]: Returns the final answer.

Format strictly as:
Thought: [your reasoning]
Action: ToolName[input]
Observation: [tool result — provided by the system]

Repeat until you call Finish.

Question: {USER_QUERY}
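
A small controller can drive the loop above: parse each `Action:` line, execute the named tool, append the `Observation:`, and stop at a step cap. A sketch with a regex parser; the `model` callable and the tools dict are stand-ins for a real LLM call and real tool implementations:

```python
import re

ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*)\]", re.DOTALL)

def react_loop(model, tools: dict, question: str, max_steps: int = 8) -> str:
    """Run Thought/Action/Observation turns until Finish[...] or the step cap."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        turn = model(transcript)  # expected: "Thought: ...\nAction: Tool[input]"
        transcript += turn + "\n"
        match = ACTION_RE.search(turn)
        if match is None:
            return "ERROR: model produced no parseable Action"
        tool, arg = match.group(1), match.group(2)
        if tool == "Finish":
            return arg
        observation = tools[tool](arg) if tool in tools else f"unknown tool: {tool}"
        transcript += f"Observation: {observation}\n"
    return "ERROR: step limit reached without Finish"
```

The `max_steps` cap is the guard against the thought–action loops listed under failure modes; production controllers usually also log the full transcript for auditing.
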

Research Note

Yao et al. (2022) "ReAct: Synergizing Reasoning and Acting in Language Models" showed ReAct outperformed chain-of-thought on knowledge-intensive tasks by grounding claims in retrieved evidence. However, chain-of-thought with self-consistency outperformed ReAct on HotpotQA — ReAct is not universally superior.

Known failure modes

  • Thought–action loops — model repeats the same Thought and Action without progress. Add a maximum step limit and a fallback handler.
  • Non-informative observation recovery — when a search returns unhelpful results, the model cannot reformulate its query. Prompt it explicitly: "If the observation is unhelpful, try a different search formulation."
  • Reasoning-action misalignment — the Thought articulates one plan but the Action doesn't match it.
  • Cost — ReAct generates significantly more tokens per query than single-pass approaches. Can be 5–10× more expensive.
  • Model dependency — works well with frontier models (GPT-4 class, Claude 3+). Degrades significantly with smaller models.

04 Prompt Chaining

Prompt chaining breaks a complex task into a sequence of simpler prompts where each output feeds into the next as input. Instead of one monolithic prompt, you decompose the task into discrete, auditable steps — each with its own model call, temperature setting, and validation gate.

When to use it

  • Task is too complex for reliable single-pass completion
  • Different steps benefit from different instructions or models
  • Intermediate outputs need to be inspected, logged, or validated
  • Error in one step should not silently propagate — you want an explicit break point

Do not chain simple tasks. If a capable model handles it reliably in one pass, the overhead is not justified.

Chain patterns

Pattern | Structure | Use when
Linear / Sequential | A → B → C | Strict logical ordering required
Parallel | A → B, C, D → merge | Independent subtasks on the same input
Feedback / Iterative | A → critique → A (revised) → loop | Quality gate needed; refinement improves output
Conditional | A → route → B or C | Step determines which path to follow

Sequential chain template (3-step example)

# Step 1 — Extract
Extract all action items from the meeting transcript below.
Return a JSON array: [{"owner": str, "task": str, "due_date": str|null}]
Transcript: {INPUT}

# Step 2 — Validate (receives Step 1 JSON)
Review these action items for completeness.
Flag any item missing an owner or due_date with "needs_review": true.
Items: {STEP_1_OUTPUT}

# Step 3 — Summarise (receives Step 2 JSON)
Write a professional follow-up email summarising the action items below.
Group by owner. Flag items marked needs_review as pending clarification.
Items: {STEP_2_OUTPUT}
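
Wiring the three steps together with a JSON gate at each handoff keeps a malformed Step 1 output from propagating silently. A sketch with a stubbed `llm` callable; the prompts are abbreviated from the templates above and `run_chain` is an illustrative name:

```python
import json

def run_chain(llm, transcript: str) -> str:
    """Extract -> validate -> summarise, checking JSON shape at each handoff."""
    step1 = llm("Extract all action items from the meeting transcript below.\n"
                "Return a JSON array.\nTranscript: " + transcript)
    items = json.loads(step1)  # gate: raises on malformed JSON
    if not isinstance(items, list):
        raise ValueError("Step 1 must return a JSON array")

    step2 = llm("Review these action items for completeness. Flag any item "
                'missing an owner or due_date with "needs_review": true.\n'
                "Items: " + json.dumps(items))
    reviewed = json.loads(step2)  # gate again before the final step
    if not isinstance(reviewed, list):
        raise ValueError("Step 2 must return a JSON array")

    return llm("Write a professional follow-up email summarising the action "
               "items below. Group by owner.\nItems: " + json.dumps(reviewed))
```

Each gate raises at the handoff where the error occurred, giving an explicit break point instead of a silently corrupted final email.
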

Research Note

ACL 2024 Findings showed feedback-loop chaining (generate → critique → revise) outperformed single-prompt "draft + critique + refine" by up to 15.6% accuracy, because each step operates on a clean context rather than a crowded single prompt.

Known failure modes

  • Error propagation — a hallucination or schema error in Step 1 cascades through all downstream steps. Validate output format at each handoff.
  • Semantic drift — each hand-off can introduce subtle interpretation shifts. By Step 5, output may diverge from original intent. Carry forward a brief task description in every step.
  • Simulated refinement — in feedback loops, models sometimes regenerate from scratch rather than genuinely revising. Use diff-based prompts ("Here is the original and the critique — show only what changed").
  • Latency accumulation — a 5-step sequential chain with 2-second steps takes 10 seconds minimum. Parallelise independent steps.
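
Independent steps can be fanned out concurrently to cut wall-clock latency. A minimal sketch using Python's `concurrent.futures`; the `llm` callable is a stand-in for a real model call:

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel_steps(llm, prompts: list[str]) -> list[str]:
    """Run independent chain steps concurrently; results keep prompt order."""
    with ThreadPoolExecutor(max_workers=max(1, len(prompts))) as pool:
        return list(pool.map(llm, prompts))
```

Threads are sufficient here because LLM API calls are I/O-bound; results come back in prompt order, ready for a merge step.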

05 Tree of Thoughts (ToT)

Tree of Thoughts generates multiple candidate reasoning paths at each step, evaluates them, and uses a search algorithm (BFS or DFS) to explore the most promising branches — backtracking from dead ends. Where chain-of-thought follows one path, ToT maintains an explicit tree and selects the best branch through evaluation.

When to use it

Use ToT only when the problem genuinely requires exploration and backtracking: complex planning, math puzzles, multi-step strategy, and creative tasks requiring comparison of distinct approaches. For tasks frontier models handle in a single pass, the overhead is not justified.

Performance Evidence

Game of 24 benchmark (Yao et al., NeurIPS 2023): GPT-4 with chain-of-thought solved 4% of tasks. GPT-4 with Tree of Thoughts solved 74%. The gap is most pronounced on problems requiring non-trivial search — not on common NLP tasks.

Two-prompt implementation (simplified ToT)

Full ToT requires custom BFS/DFS orchestration. For most use cases, this two-prompt approximation captures most of the benefit:

# Prompt 1 — Generate candidates
Imagine three expert reasoners each independently proposing their next step
to solve this problem. Write each expert's proposal separately.

Problem: {PROBLEM}
Expert 1's next step:
Expert 2's next step:
Expert 3's next step:

# Prompt 2 — Evaluate and select
Three experts proposed the following approaches to this problem.
Evaluate each for correctness and promise. Select the best one and explain why.

Problem: {PROBLEM}
Expert 1: {PROPOSAL_1}
Expert 2: {PROPOSAL_2}
Expert 3: {PROPOSAL_3}

Best approach:

Evaluation strategies

Method | How it works | Best for
Value scoring | Assign a scalar (1–10) or label (sure / likely / impossible) to each candidate | Tasks with verifiable intermediate states
Majority vote | Generate candidates, ask the model to vote across 5+ runs, take the majority | Creative tasks without a single correct answer

Known failure modes

  • Cost — ToT can require 10–100× more tokens and LLM calls than a single-pass approach. Use only where the quality gain justifies it.
  • Evaluation quality dependency — the entire approach depends on the model correctly evaluating candidate thoughts. If the evaluator is unreliable, ToT prunes the correct path.
  • Diminishing returns on extended-thinking models — o1, o3, and Claude 3.7 Sonnet with extended thinking internalise ToT-like reasoning natively. For many tasks previously requiring explicit ToT, a single inference with an extended-thinking model now matches or exceeds ToT results at lower implementation cost.
  • Implementation complexity — full ToT requires custom BFS/DFS orchestration, loop control, and evaluation scoring. Not achievable with a single API call.

06 Multi-Turn Context Management

Multi-turn prompting manages an ongoing dialogue where each model response depends on prior turns. The core challenge is that LLMs have a finite context window — and performance degrades as conversations grow. Research (2025) shows an average 39% performance drop in multi-turn vs. single-turn on the same generation tasks.

Context management strategies

Strategy | How it works | Trade-off
Sliding window | Keep only the N most recent turns; drop the oldest | Simple, but loses long-range context permanently
Contextual summarisation | Compress turns older than N into a summary; keep recent turns verbatim | Preserves key facts; summary quality affects recall
Recap injection | Prepend a 1–3 sentence state summary to each new user turn | Forces re-anchoring; adds tokens per turn
External memory | Extract facts from turns into a vector store; retrieve relevant facts per query | No token limit; requires additional infrastructure
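
The first two strategies combine naturally: keep the last N turns verbatim and fold everything older into a running summary. A sketch with a stubbed `summarise` callable standing in for a model call; `manage_context` is an illustrative helper:

```python
def manage_context(turns: list[str], summarise, keep_last: int = 10) -> list[str]:
    """Return a message list: one summary entry covering the old turns
    (if any), followed by the keep_last most recent turns verbatim."""
    if len(turns) <= keep_last:
        return list(turns)  # everything still fits; no compression needed
    old, recent = turns[:-keep_last], turns[-keep_last:]
    summary = summarise(old)
    return [f"[CONVERSATION SUMMARY] {summary}"] + recent
```

In practice the summary is regenerated (or incrementally updated) whenever another turn ages out of the verbatim window, which is what the template below assumes.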

Contextual summarisation template

SYSTEM:
You are a helpful assistant. The conversation summary below captures all prior
context. Use it alongside the recent messages to answer accurately.

[CONVERSATION SUMMARY — updated automatically]
The user is debugging a Python FastAPI application. They identified that the
issue is in the authentication middleware, specifically the JWT expiry check.
They want to preserve backward compatibility with v1 tokens.

[RECENT MESSAGES — last 10 turns verbatim]
User: Can we add a grace period for expired tokens?
Assistant: Yes, you can check the `exp` claim and allow a configurable grace
           window before rejecting...
[...]

User: {CURRENT_MESSAGE}

Known failure modes

  • Premature assumption lock-in — models make incorrect assumptions in early turns and fail to correct them when contradicting information arrives later.
  • Irrecoverable wrong turns — research (arXiv 2505.06120, 2025) shows that once a model goes in the wrong direction in a multi-turn conversation, it typically does not self-correct. Add explicit correction prompts: "Disregard your previous approach. Start fresh from…"
  • Persona drift — over many turns the model gradually drifts from system-prompt constraints, especially under adversarial user pressure. Reinforce critical constraints in the system prompt with repetition and high specificity.
  • Recency bias — the model over-weights the most recent user message and under-weights earlier context when the context window is nearly full.