
Few‑Shot Prompting for Prompt Engineering: Formats, Labels, and Examples

In Chapter 1, we focused on clarity and zero‑shot basics.

Now we will turn up the dials with few‑shot prompting: you will add a handful of examples to your prompt so the model can learn the task from context.

You will see how format and label choices shape performance. Then you will build a small sentiment classifier and measure accuracy.

Learning objectives
- Understand why few‑shot prompting works (and its limits)
- Use a consistent Instruction / Input / Output template
- Pick 3–5 representative examples that match your real label distribution
- Run a hands‑on experiment: baseline vs randomized‑label demonstrations
- Apply best practices and avoid common pitfalls

If you want a refresher on clarity and structure before diving in, revisit Foundations of Prompt Engineering: Clarity, Structure, and Zero‑Shot.

Why few‑shot works: research notes for prompt engineering

  • Early demonstrations: Brown et al. (2020) showed that large models are few‑shot learners: the more capable the model, the better it can use in‑prompt demonstrations. See the original paper for methods and scaling trends: Language Models are Few‑Shot Learners (Brown et al., 2020).
  • What matters most in examples: Min et al. (2022) found that even when labels are randomized, models can still improve if the example format is consistent and the inputs reflect the true distribution and label space. Random labels sampled from the true label distribution outperform uniformly random labels.
  • Limits for complex reasoning: Few‑shot alone may not fix multi‑step logic errors. For tasks that need deliberate reasoning, you’ll use techniques from Reasoning Prompts: Chain‑of‑Thought and Self‑Consistency next.

The few‑shot template: Instruction, Input, Output

A reliable few‑shot prompt uses a simple, consistent schema.

Keep everything tight and predictable.

Template pattern

Instruction: <clear task in 1–2 sentences>

Example 1
Input: <one input>
Output: <one output>

Example 2
Input: <one input>
Output: <one output>

Example K
Input: <one input>
Output: <one output>

Now solve this.
Input: <new input>
Output:
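If you assemble prompts in code rather than by hand, here is a minimal sketch of this template; the function name and example structure are illustrative choices, not a required API.

```python
# A small helper that renders the Instruction / Input / Output schema.
# Names (build_few_shot_prompt, the example dicts) are illustrative, not a standard API.
def build_few_shot_prompt(instruction: str, examples: list[dict], new_input: str) -> str:
    parts = [f"Instruction: {instruction}", ""]
    for i, ex in enumerate(examples, start=1):
        parts += [f"Example {i}", f"Input: {ex['input']}", f"Output: {ex['output']}", ""]
    parts += ["Now solve this.", f"Input: {new_input}", "Output:"]
    return "\n".join(parts)
```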

Formatting tips
- Use delimiters like triple backticks or XML‑style tags to separate sections.
- Use a fixed label set; keep outputs one line when possible.
- Keep examples short to save tokens and reduce distraction.

For a deeper walkthrough of template structure and formatting choices, see this guide to crafting effective prompts for LLMs.

Choosing 3–5 representative examples

  • Match label priors: If your real data has 60% Neutral, 30% Positive, 10% Negative, mirror that roughly in your K examples (as much as the small K allows); a small selection sketch follows this list.
  • Diversity within constraints: Cover different phrasings, tones, and edge cases, but keep formats identical.
  • Order matters: Models may overweight later items; consider placing your most informative example last.
  • Diminishing returns: You’ll see big gains up to ~2–3 examples; 3–5 is typical. More than 8 can hurt quality and cost.
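To operationalize the first two tips, here is a minimal selection sketch; `pool` is a hypothetical list of labeled candidate examples, and the priors shown in the comment are placeholders for your own.

```python
import random

def pick_examples(pool: list[dict], priors: dict[str, float], k: int = 5, seed: int = 0) -> list[dict]:
    """Sample roughly k demonstrations whose label mix approximates the given priors."""
    rng = random.Random(seed)
    # Turn priors like {"Neutral": 0.6, "Positive": 0.3, "Negative": 0.1} into per-label counts.
    counts = {label: max(1, round(p * k)) for label, p in priors.items()}
    picked = []
    for label, n in counts.items():
        candidates = [ex for ex in pool if ex["output"] == label]
        picked.extend(rng.sample(candidates, min(n, len(candidates))))
    rng.shuffle(picked)
    # Reorder by hand afterwards if you want your most informative example last.
    return picked[:k]
```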

Hands‑on project: Few‑shot sentiment classifier

You will build a 3‑shot classifier using the Instruction / Input / Output template.

Then you will test it on 10 new items and record accuracy.

Step 1: Draft the baseline few‑shot prompt
Copy and paste into your LLM playground.

Replace placeholders if needed.

Instruction: Classify the sentiment of the Input as one of {Positive, Neutral, Negative}. Respond ONLY with one label exactly.

Example 1
Input: "Loved the battery life, lasted all weekend."
Output: Positive

Example 2
Input: "It’s fine, does what it says."
Output: Neutral

Example 3
Input: "App kept crashing during checkout."
Output: Negative

Now solve this.
Input: "Shipping was on time, nothing special."
Output:

Notes
- Labels are case‑consistent (Capitalized) and exactly one of the set.
- Examples are short and representative (one Positive, one Neutral, one Negative). If your domain is skewed toward Neutral, you might swap in two Neutral and one Positive/Negative.
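If you would rather run this from a script than paste it into a playground, here is a minimal sketch. It reuses build_few_shot_prompt from the template section and assumes the OpenAI Python SDK with an OPENAI_API_KEY in your environment; any chat‑capable provider works the same way, and the model name is just a placeholder.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

INSTRUCTION = (
    "Classify the sentiment of the Input as one of {Positive, Neutral, Negative}. "
    "Respond ONLY with one label exactly."
)
EXAMPLES = [
    {"input": '"Loved the battery life, lasted all weekend."', "output": "Positive"},
    {"input": '"It’s fine, does what it says."', "output": "Neutral"},
    {"input": '"App kept crashing during checkout."', "output": "Negative"},
]

def classify(prompt: str) -> str:
    """Send the assembled few-shot prompt and return the model's one-line label."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use any model you have access to
        temperature=0,        # keep classification output stable
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

prompt = build_few_shot_prompt(INSTRUCTION, EXAMPLES, '"Shipping was on time, nothing special."')
print(classify(prompt))
```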

Step 2: Create a 10‑item test set
Use the starter set below, or write your own from your domain (support tickets, reviews, comments).

Test items
1) "The update fixed my issue immediately."
2) "Not impressed, still lags a lot."
3) "Exactly what I expected."
4) "Customer service went above and beyond!"
5) "Meh. It works."
6) "The screen arrived cracked."
7) "Setup was straightforward."
8) "Refund took forever and I’m annoyed."
9) "Great quality for the price."
10) "No complaints so far."

Step 3: Evaluate
- Run each test item through your prompt by replacing the final Input value each time.
- Record predictions and compute accuracy (# correct / 10). If you don’t have gold labels, annotate them yourself first; a short scoring sketch follows this list.
- Keep a simple table in your notes or spreadsheet. We’ll discuss lightweight tracking in Prompt Engineering Tools and Evaluation: Templates, Guardrails, and A/B Tests.
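The scoring sketch below reuses build_few_shot_prompt, classify, INSTRUCTION, and EXAMPLES from the earlier sketches. The gold labels shown are placeholders for illustration; annotate all ten items yourself first.

```python
# Test items from Step 2 (truncated here); pair each with your own gold label.
test_items = [
    ("The update fixed my issue immediately.", "Positive"),
    ("Not impressed, still lags a lot.", "Negative"),
    # ... add the remaining eight items with your annotations
]

def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of exact, case-sensitive label matches."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

gold = [label for _, label in test_items]
baseline_predictions = [
    classify(build_few_shot_prompt(INSTRUCTION, EXAMPLES, f'"{text}"'))
    for text, _ in test_items
]
print(f"Baseline accuracy: {accuracy(baseline_predictions, gold):.0%}")
```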

Surprise test: Randomize labels, keep the format

Min et al. (2022) showed that models benefit from the example format and input distribution, even if labels are randomized, especially when random labels follow the true label proportions.

Step 4: Randomize labels
Re‑run the same 3 examples, but shuffle their labels while preserving the exact format and label set. Example:

Instruction: Classify the sentiment of the Input as one of {Positive, Neutral, Negative}. Respond ONLY with one label exactly.

Example 1
Input: "Loved the battery life, lasted all weekend."
Output: Neutral   # randomized

Example 2
Input: "It’s fine, does what it says."
Output: Negative  # randomized

Example 3
Input: "App kept crashing during checkout."
Output: Positive  # randomized

Now solve this.
Input: "Shipping was on time, nothing special."
Output:
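If you are scripting the experiment, here is a minimal label‑shuffle sketch, reusing EXAMPLES from the Step 1 sketch. Shuffling the existing labels keeps the true label proportions, which is the variant Min et al. found outperforms uniformly random labels.

```python
import random

def randomize_labels(examples: list[dict], seed: int = 42) -> list[dict]:
    """Return a copy of the demonstrations with their labels shuffled among themselves."""
    rng = random.Random(seed)
    labels = [ex["output"] for ex in examples]
    rng.shuffle(labels)  # preserves label proportions, breaks input-label pairing
    return [{**ex, "output": label} for ex, label in zip(examples, labels)]

randomized_examples = randomize_labels(EXAMPLES)
```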

Step 5: Test again on the same 10 items
- Record accuracy and compare to your baseline (a comparison sketch follows this list).
- You may see a smaller drop than you expect, because:
  - The model still sees the label space and formatting constraints.
  - The inputs in the examples reflect the domain distribution.
  - The consistent schema acts like a contract for outputs.
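To close the loop in code, rerun the same scoring loop with the randomized demonstrations and compare; this reuses the helpers and variables from the earlier sketches.

```python
randomized_predictions = [
    classify(build_few_shot_prompt(INSTRUCTION, randomized_examples, f'"{text}"'))
    for text, _ in test_items
]
print(f"Baseline:   {accuracy(baseline_predictions, gold):.0%}")
print(f"Randomized: {accuracy(randomized_predictions, gold):.0%}")
```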

Caution: Randomized labels are not a good practice for production. This exercise reveals model sensitivities and why consistent format matters in prompt engineering.

Best practices for prompt engineering few‑shot prompts

  • Use consistent delimiters: Backticks, XML‑style tags, or clear section headings.
  • Keep examples short: 1–2 sentences each to reduce noise and token cost.
  • Match label priors: Reflect real‑world proportions when possible.
  • Fix the label space: List allowed labels explicitly and enforce exact casing.
  • Be order‑aware: Put your strongest or most frequent pattern last.
  • Separate instruction from examples: Make it scannable for the model and for you.
  • Constrain outputs: “Respond ONLY with one label exactly.”
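A minimal post‑processing sketch for the last tip, assuming the {Positive, Neutral, Negative} label set from the project above; anything that does not normalize to an allowed label gets flagged for review.

```python
ALLOWED_LABELS = {"Positive", "Neutral", "Negative"}

def normalize_label(raw: str) -> str:
    """Map a raw model reply onto the allowed label set; flag anything else."""
    cleaned = raw.strip().strip('."\'').capitalize()
    if cleaned not in ALLOWED_LABELS:
        raise ValueError(f"Unexpected label from model: {raw!r}")
    return cleaned
```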

To refine few-shot setups and avoid common pitfalls, explore this comprehensive prompt optimization guide for practical checklists and tactics.

Common pitfalls and how to fix them

  • Too many examples: More than ~5–8 can reduce accuracy and increase cost. Trim to the most representative 3–5.
  • Inconsistent casing or spacing: If labels vary (Positive vs positive), models may drift. Standardize.
  • Distribution mismatch: If your examples are all Positive but the real world is mostly Neutral, you’ll bias outputs. Re‑balance your examples.
  • Overfitting to phrasing: Avoid examples that are nearly identical; vary wording but keep the same schema.
  • Hidden instructions: Burying the Instruction inside examples can confuse the model. Keep Instruction explicit.
  • Reasoning overload: For multi‑step logic, few‑shot alone won’t always help; prefer structured reasoning prompts (see Reasoning Prompts: Chain‑of‑Thought and Self‑Consistency).

Practical extensions

  • 5‑shot variant: Add two more examples that cover tricky edge cases (e.g., sarcastic positive, polite negative) while maintaining label balance.
  • Calibrate outputs: If your model leans too Positive, swap one Positive example with a Neutral and put a strong Neutral example last.
  • Build your library: Save your best templates and example sets; we’ll expand them in Build a Beginner Prompt Library: Classification, Extraction, and Q&A.

Summary

  • Few‑shot prompting is a practical prompt engineering technique that uses a handful of demonstrations to nudge models toward the right structure and label space.
  • Format, input distribution, and explicit label sets matter, sometimes more than the correctness of labels in the examples.
  • Keep examples short, balanced, and consistent; 3–5 is usually enough.
  • Test with a small set, track accuracy, and iterate. For complex reasoning, move to structured approaches in the next chapter.

Next up: We’ll unpack step‑by‑step reasoning strategies in Reasoning Prompts: Chain‑of‑Thought and Self‑Consistency.

Additional Resources