
Reasoning Prompts: Chain‑of‑Thought and Self‑Consistency

In prompt engineering, few‑shot examples help with pattern matching, but they can still miss multi‑step logic.

Consider parity: “Is the sum of the odd numbers in this list even?” A model may have memorized “even + even = even” yet still pick the wrong odd numbers or miscount.

We saw a support bot invent a “premium downgrade clause.” Unverified reasoning can sound confident and still be wrong.

This chapter shows how to reveal or manage reasoning with chain‑of‑thought (CoT) and self‑consistency. These are two practical tools for your prompt engineering toolkit.

If you need a refresher on clarity and zero‑shot setup, revisit Foundations of Prompt Engineering: Clarity, Structure, and Zero‑Shot. For creating better examples, see Few‑Shot Prompting for Prompt Engineering: Formats, Labels, and Examples.

Learning goals
- Understand limits of standard few‑shot prompting on multi‑step reasoning
- Apply chain‑of‑thought prompting and self‑consistency sampling
- Keep rationales concise to control verbosity, cost, and privacy
- Keep a small accuracy log across 5 and 10 queries
- Know when to escalate to tools, retrieval (RAG), or fine‑tuning

Why few‑shot alone struggles on multi‑step reasoning

Few‑shot prompting gives the model patterns, not guaranteed logic. Take the parity task: deciding whether the sum of the odd numbers in a list is even.

Reference examples (True = sum is even; False = sum is odd):
- 4, 8, 9, 15, 12, 2, 1 → odd numbers: 9, 15, 1 → sum: 25 → False
- 17, 10, 19, 4, 8, 12, 24 → odd numbers: 17, 19 → sum: 36 → True
- 16, 11, 14, 4, 8, 13, 24 → odd numbers: 11, 13 → sum: 24 → True
- 17, 9, 10, 12, 13, 4, 2 → odd numbers: 17, 9, 13 → sum: 39 → False

Key rule: the sum of odd numbers is even exactly when the count of odd numbers is even. Models without an explicit reasoning scaffold often:
- Misidentify odd numbers
- Lose track when counting
- Confuse the decision rule (e.g., assume “sum of many numbers tends to be even”)
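
To make the key rule concrete, here is a minimal reference checker in plain Python (no LLM involved); you can use it to verify the example labels above or to label new lists for your own drills:

def sum_of_odds_is_even(numbers):
    # The sum of odd numbers is even exactly when the count of odd numbers is even.
    odds = [n for n in numbers if n % 2 != 0]
    return len(odds) % 2 == 0

# Verify the reference examples (True = sum is even; False = sum is odd)
examples = [
    ([4, 8, 9, 15, 12, 2, 1], False),
    ([17, 10, 19, 4, 8, 12, 24], True),
    ([16, 11, 14, 4, 8, 13, 24], True),
    ([17, 9, 10, 12, 13, 4, 2], False),
]
for nums, label in examples:
    assert sum_of_odds_is_even(nums) == label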

Try this fifth set as a quick check (don’t look up the answer yet): 15, 32, 5, 13, 82, 7, 1.

Chain‑of‑Thought (CoT): reveal the steps

Chain‑of‑thought prompting asks the model to show intermediate steps before the final answer. On math word problems (e.g., GSM8K), CoT can raise accuracy because it encourages step‑by‑step computation and verification.

When to use CoT
- Multi‑step arithmetic and unit conversions
- Multi‑hop factual questions (reason across statements)
- Policy checks where rules must be applied sequentially

Keep the rationale concise
- Ask for “a brief chain of thought (3–5 steps).”
- For sensitive contexts, log only the final answer or a one‑line justification (see privacy below).
- Use bullets or numbered mini‑steps; avoid restating the entire prompt.

For a concise foundation on structuring reasoning prompts and managing verbosity and privacy, see our guide on crafting effective prompts for LLMs.

Helpful overviews and research roundups on CoT are available from trusted sources like IBM and community paper hubs (see Additional Resources). These compare prompt engineering vs. RAG vs. fine‑tuning and summarize research such as self‑consistency.

Self‑Consistency: sample multiple chains and vote

Self‑consistency runs your CoT prompt multiple times with slight sampling variation, then selects the majority answer. This often lifts accuracy because the model explores reasoning variants and converges on a stable final answer.

Practical settings
- Temperature: 0.5–0.9 to generate diverse chains
- Samples per query: start at 5; increase until the majority agreement plateaus (a minimal sketch follows this list)
- Logging: keep each chain for audit, but mask sensitive data
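
As a rough sketch of the “increase samples until agreement plateaus” idea, the loop below adds samples in small batches and stops when the majority share stops improving; sample_answer is a stand‑in for a single CoT run against your model, and the fake sampler at the end exists only to make the snippet runnable:

from collections import Counter
import random

def majority(answers):
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)

def sample_until_stable(sample_answer, start=5, step=2, max_samples=15):
    # Start with `start` samples; add `step` more while majority agreement keeps improving.
    answers = [sample_answer() for _ in range(start)]
    _, agreement = majority(answers)
    while len(answers) < max_samples:
        answers.extend(sample_answer() for _ in range(step))
        _, new_agreement = majority(answers)
        if new_agreement <= agreement:    # plateau: extra samples no longer help
            break
        agreement = new_agreement
    return majority(answers)

# Toy usage with a fake sampler that answers "41" about 80% of the time
fake_sampler = lambda: "41" if random.random() < 0.8 else "40"
print(sample_until_stable(fake_sampler))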

See a summary of key papers (e.g., Self‑Consistency Improves CoT, Auto‑CoT) in this community collection: promptingguide.ai papers.

Worked example: zero‑shot vs few‑shot vs CoT

Problem
A candy shop has 3 boxes with 12, 18, and 27 candies. You eat 1/3 of the first box, 1/2 of the second, and 1/9 of the third. How many candies remain in total?

Baseline zero‑shot prompt

You are a helpful solver. Answer with a number only.
A candy shop has 3 boxes with 12, 18, and 27 candies. You eat 1/3 of the first box, 1/2 of the second, and 1/9 of the third. How many remain?

Common failure modes: forgets to subtract, mixes fractions, or calculates 1/9 of 27 incorrectly.

Few‑shot prompt (no explicit reasoning asked)

You are a helpful solver. Answer with a number only.
Example: Box A 10, eat 1/2 → remaining 5
Example: Box B 30, eat 1/3 → remaining 20
Now solve: 12, 18, 27; eat 1/3, 1/2, 1/9; total remaining?

This is often better than zero‑shot, but still prone to slips on multi‑step arithmetic.

Chain‑of‑Thought prompt (brief rationale)

You are a careful math solver. Show a brief 3–5 step chain of thought, then give the final number after “Answer:”.
Problem: 3 boxes: 12, 18, 27. Eat 1/3, 1/2, 1/9 respectively. How many remain in total?
Constraints: be concise; no repetition; compute exactly.

Likely steps: compute eaten amounts (4, 9, 3). Subtract to get (8, 9, 24). Sum = 41. Answer: 41.
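
A quick way to confirm that answer (and to grade model outputs in your own benchmark) is to compute it directly; exact fractions avoid any floating‑point surprises:

from fractions import Fraction

boxes = [12, 18, 27]
eaten_fractions = [Fraction(1, 3), Fraction(1, 2), Fraction(1, 9)]

eaten = [b * f for b, f in zip(boxes, eaten_fractions)]   # [4, 9, 3]
remaining = [b - e for b, e in zip(boxes, eaten)]         # [8, 9, 24]
print(sum(remaining))                                     # 41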

Mini‑benchmark (example results; your numbers will vary)
- Setup: general‑purpose LLM, temperature 0.7, no tools.
- Over 5 runs:
  - Zero‑shot: 2/5 correct
  - Few‑shot: 3/5 correct
  - CoT (single sample): 4/5 correct
- Over 10 runs:
  - Zero‑shot: 4/10 correct
  - Few‑shot: 6/10 correct
  - CoT (single sample): 8/10 correct
  - CoT + Self‑Consistency (5 samples per query): 9/10 correct with majority vote

After you compare zero‑shot, few‑shot, and CoT and log accuracy over multiple runs, apply a systematic prompt iteration workflow from our comprehensive prompt optimization guide.

Implementation: Self‑Consistency sampling (Python)

import re
from collections import Counter

def extract_final_number(text):
    # Prefer the value after "Answer:"; otherwise fall back to the last number in the text.
    match = re.search(r'Answer:\s*(-?\d+(?:\.\d+)?)', text)
    if match:
        return match.group(1)
    numbers = re.findall(r'-?\d+(?:\.\d+)?', text)
    return numbers[-1] if numbers else None

def solve_with_cot(prompt, samples=5, temperature=0.7):
    answers = []
    chains = []
    for _ in range(samples):
        out = llm(prompt, temperature=temperature)   # llm() is a placeholder for your model/API call
        chains.append(out)                           # store rationale for audit (mask PII!)
        answers.append(extract_final_number(out))    # extract final answer after "Answer:" or last number
    maj = Counter(answers).most_common(1)[0][0]      # majority vote across samples
    agreement = answers.count(maj) / len(answers)    # fraction of samples that agree with the majority
    return maj, agreement, chains

cot_prompt = """
You are a careful math solver. Show a brief 3–5 step chain of thought, then give the final number after "Answer:".
Problem: 3 boxes: 12, 18, 27. Eat 1/3, 1/2, 1/9 respectively. How many remain in total?
Constraints: be concise; no repetition; compute exactly.
"""
ans, agree, chains = solve_with_cot(cot_prompt, samples=5, temperature=0.7)
print(ans, agree)

Diagram (mental model)
- Zero‑shot: single arrow from problem → answer (may skip steps)
- Few‑shot: problem → pattern recall → answer (better, still brittle)
- CoT: problem → steps → answer (more reliable; auditable)
- CoT + Self‑Consistency: multiple step‑paths → voting → answer (most reliable of these)

Controlling verbosity, cost, and privacy

Verbosity controls (tokens = cost + latency)
- Ask for “brief 3–5 steps” or “bullet steps only.”
- Set a hard cap: “No more than 5 lines of reasoning.”
- Use “Answer:” + final format to extract outputs cleanly.

Privacy and sensitive reasoning
- Avoid logging raw CoT when prompts contain customer data, source code, or secrets.
- Prefer “latent CoT”: “Think step by step, then respond with the final answer only.”
- Redact or hash inputs in logs; store only final answers and agreement scores (a logging sketch appears after the example below).
- Add policy reminders: “Do not include PII, credentials, or proprietary details.”

Example: privacy‑aware latent CoT

Solve the problem. Think step by step internally, but only output the final answer.
Final format: Answer: <value>
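
One way to act on the “store only final answers and agreement scores” advice is a small logging helper like the sketch below; the hashing scheme and field names are illustrative, not a prescribed format:

import hashlib
import json

def log_result(prompt, final_answer, agreement, logfile="results.jsonl"):
    # Store a hash of the prompt instead of the raw text, plus only the final answer
    # and the agreement score; the reasoning chains are never written to disk.
    record = {
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "answer": final_answer,
        "agreement": agreement,
    }
    with open(logfile, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")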

Cost levers
- Short rationales reduce tokens (and cost) significantly.
- Use self‑consistency only when needed; start with 3–5 samples.
- Cache answers for repeated prompts; drop temperature to reduce retries.
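
A minimal in‑memory cache for the “cache answers for repeated prompts” lever could look like this, keyed on the prompt and sampling settings and built on the solve_with_cot helper from the implementation above; swap in a persistent store if you need results to survive restarts:

cache = {}

def cached_solve(prompt, samples=5, temperature=0.7):
    key = (prompt, samples, temperature)
    if key not in cache:
        # Only call the model when this exact prompt/settings combination has not been seen.
        cache[key] = solve_with_cot(prompt, samples=samples, temperature=temperature)
    return cache[key]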

When CoT isn’t enough: broaden your prompt engineering toolkit

  • Tool use (calculator/code): Offload arithmetic and parsing to tools/APIs. CoT then orchestrates tool calls (a minimal sketch follows this list).
  • Retrieval‑Augmented Generation (RAG): Ground reasoning steps in external sources; cite evidence to cut hallucinations.
  • Fine‑tuning: Teach domain‑specific procedures, formats, or policies when prompts alone can’t close the gap.
  • Other strategies: least‑to‑most decomposition, tree‑of‑thoughts (branch and evaluate), and auto‑generated exemplars.
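
As a rough illustration of the calculator pattern, the snippet below asks the model for a bare arithmetic expression and evaluates it locally; llm() is the same placeholder model call as in the implementation above, and the whitelist check is a minimal guard, not production‑grade sandboxing:

import re

def solve_with_calculator(question):
    # Ask the model only for an arithmetic expression; do the math in Python.
    prompt = (
        "Translate the problem into a single arithmetic expression. "
        "Output the expression only, no words.\n"
        f"Problem: {question}"
    )
    expression = llm(prompt, temperature=0).strip()   # placeholder model call
    if not re.fullmatch(r"[0-9+\-*/(). ]+", expression):
        raise ValueError(f"Refusing to evaluate: {expression!r}")
    return eval(expression)   # acceptable only because of the strict character whitelist above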

To evaluate these strategies with guardrails and A/B tests, jump to Prompt Engineering Tools and Evaluation: Templates, Guardrails, and A/B Tests. Later, you’ll add production‑ready patterns to your library in Build a Beginner Prompt Library: Classification, Extraction, and Q&A.

Practical exercise

1) Parity drill
- Task: Decide if “sum of odd numbers is even” for 20 random lists, including the fifth set above: 15, 32, 5, 13, 82, 7, 1.
- Run 10 zero‑shot trials, 10 few‑shot, 10 CoT; log accuracy.
- Hint: Teach the rule explicitly in CoT: “Count odd numbers; if count is even → even sum.”

2) Word‑problem benchmark
- Use the candy problem (and 2 similar ones you create). For each, run:
  - Zero‑shot (temperature 0)
  - Few‑shot (2 examples, no reasoning)
  - CoT (brief 3–5 steps)
  - CoT + self‑consistency (5 samples, majority vote)
- Track per‑method accuracy over 5 and 10 queries, plus tokens and latency (a minimal logging sketch follows).
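
To track per‑method accuracy and latency, a simple CSV log is enough; the helper below is an example layout (token counts are omitted because they depend on your API client), and the commented‑out call shows how you might use it with your own solver functions:

import csv
import time

def run_and_log(method_name, solve_fn, problem, expected, writer):
    start = time.time()
    answer = solve_fn(problem)   # your zero-shot / few-shot / CoT runner
    writer.writerow({
        "method": method_name,
        "problem": problem,
        "answer": answer,
        "correct": str(answer) == str(expected),
        "latency_s": round(time.time() - start, 2),
    })

with open("benchmark.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["method", "problem", "answer", "correct", "latency_s"])
    writer.writeheader()
    # run_and_log("cot", my_cot_solver, candy_problem, 41, writer)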

Expected outcomes
- CoT > few‑shot > zero‑shot on multi‑step tasks
- Self‑consistency adds a further lift when single‑pass CoT is unstable
- Shorter rationales keep costs manageable without large accuracy loss

Tips
- Fix one variable at a time (temperature, sample count).
- Save prompts and results for comparison; note common error patterns.
- For privacy, prefer latent CoT or redact logs.

Summary

  • Few‑shot alone can stumble on counting and multi‑step logic.
  • Chain‑of‑thought prompts expose the steps; self‑consistency stabilizes answers.
  • Control verbosity, tokens, and privacy with concise steps and latent CoT when needed.
  • When CoT plateaus, scale with tools, RAG, or fine‑tuning, and evaluate rigorously.

In short, prompt engineering gives you levers (clarity, examples, reasoning, and sampling) to boost reliability on reasoning tasks while balancing cost and privacy. Next, we’ll systematize evaluation and guardrails in Prompt Engineering Tools and Evaluation: Templates, Guardrails, and A/B Tests.

Additional Resources