CODING180
Chapter 1

Build a Beginner Prompt Library: Classification, Extraction, and Q&A

In this chapter, you will turn your prompt practice into a small, reusable library.

We will build on the clarity and zero‑shot basics from Foundations of Prompt Engineering: Clarity, Structure, and Zero‑Shot, and on the patterns from Few‑Shot Prompting for Prompt Engineering: Formats, Labels, and Examples. Together, we will assemble three production‑ready prompts, then test, score, and iterate on them. Along the way we will reference reasoning tactics from Reasoning Prompts: Chain‑of‑Thought and Self‑Consistency and borrow simple evaluation habits from Prompt Engineering Tools and Evaluation: Templates, Guardrails, and A/B Tests.

What you’ll build in ~15 minutes:
- A few‑shot sentiment classifier (labels: Positive, Neutral, Negative)
- A structured JSON extraction prompt with an explicit schema
- A concise Q&A prompt that produces short answers with citations you can later pair with RAG
- A reusable template to document each prompt, plus a tiny test suite and scoring recipes

Learning objectives:
- Apply prompt engineering fundamentals in three practical tasks
- Document prompts (goal, inputs, output schema, examples, parameters, evaluation, known failure modes)
- Compute accuracy and format adherence, then version from v0.1 → v0.2

Mini‑project: a prompt engineering starter library

You will create three prompts, run them on small test sets of 10 items each, and track the results.

For a concise foundation on structuring instructions, choosing few‑shot examples, and enforcing output schemas, see this guide on crafting effective prompts for LLMs.

Quick mental model (text diagram):
- Input → Prompt (instructions + examples + constraints) → Model Output → Validator (accuracy/format checks) → Notes → v0.2
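
In code, that loop can be as small as a helper that renders the prompt, collects the output, and runs a validator. A minimal sketch, assuming a hypothetical call_model(prompt) function that wraps whatever model client you use:

from typing import Callable

def call_model(prompt: str) -> str:
    # Placeholder: wire this to your model provider's SDK before running.
    raise NotImplementedError

def run_eval(prompt_template: str,
             items: list[dict],
             validator: Callable[[str, dict], bool]) -> list[dict]:
    """Render the prompt per item, call the model, and record whether the output passes validation."""
    results = []
    for item in items:
        prompt = prompt_template.format(**item["inputs"])  # fill {text}, {ticket}, {question}, ...
        output = call_model(prompt)
        results.append({
            "inputs": item["inputs"],
            "output": output,
            "passed": validator(output, item),             # accuracy/format checks go here
        })
    return results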

Prompt Spec Template (copy this)

Use this template to document each prompt.

Paste into a doc or repo, then fill it in.

name: <short-name>
version: v0.1
owner: <you>

goal:
  - What problem does this prompt solve?

inputs:
  - Variables: <e.g., {text}, {context}, {question}>
  - Assumptions/constraints: <e.g., English only, max 200 words>

output_schema:
  - Type: <label|string|json>
  - Format contract: <e.g., one of {Positive, Neutral, Negative}; valid JSON matching schema>

examples:  # include zero‑ or few‑shot examples as needed
  - input: <example input>
    output: <example output>
  - input: <example input>
    output: <example output>

parameters:
  model: <model name if applicable>
  temperature: <e.g., 0.0–0.3 for deterministic tasks>
  max_tokens: <e.g., 64/256>
  additional: <e.g., top_p, stop sequences>

evaluation_checklist:
  - Accuracy metric: <e.g., classification accuracy; field‑level F1>
  - Format adherence: <e.g., JSON parseable; keys present>
  - Constraints honored: <e.g., max 3 sentences>

known_failure_modes:
  - <e.g., sarcasm misclassified; missing field for noisy text>

notes_and_tradeoffs:
  - <e.g., lower temperature improves consistency but can reduce nuance>

version_history:
  - v0.1: <initial spec>
  - v0.2: <changes after evaluation>
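
Once a spec is filled in and saved as YAML (the template above is YAML‑shaped once you replace the <placeholders>), a tiny loader can flag missing sections. A minimal sketch, assuming PyYAML is installed and a hypothetical file path:

import yaml  # PyYAML

REQUIRED_SECTIONS = {
    "name", "version", "goal", "inputs", "output_schema",
    "parameters", "evaluation_checklist",
}

with open("specs/sentiment.v0.1.yaml") as f:   # hypothetical path
    spec = yaml.safe_load(f)

missing = REQUIRED_SECTIONS - set(spec)
if missing:
    print(f"Spec is missing sections: {sorted(missing)}")
else:
    print(f"{spec['name']} {spec['version']} looks complete")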

Prompt 1: Few‑Shot Sentiment Classification

Task: Classify a short user message as Positive, Neutral, or Negative.

Prompt (v0.1):

System: You are a careful classifier. Only output one of: Positive, Neutral, Negative.
User: Classify the sentiment of the given message. Consider valence and intent. If mixed, prefer Neutral.

Examples:
- Text: “Thrilled with the update, works flawlessly!”
  Label: Positive
- Text: “It’s okay, nothing special.”
  Label: Neutral
- Text: “Support is unresponsive and I’m frustrated.”
  Label: Negative

Now classify:
Text: {text}
Answer with exactly one label.

Parameters:
- temperature: 0.0–0.2
- max_tokens: 5
- stop: newline
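
As a concrete wiring example, here is how those parameters might map onto a chat completions call. This is a sketch assuming the openai Python SDK and a placeholder model name; adapt the client, model, and the few‑shot examples (abbreviated here) to your setup.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

SYSTEM = "You are a careful classifier. Only output one of: Positive, Neutral, Negative."
USER_TEMPLATE = (
    "Classify the sentiment of the given message. Consider valence and intent. "
    "If mixed, prefer Neutral.\n\n"
    # In practice, paste the few-shot examples from the prompt above here.
    "Now classify:\nText: {text}\nAnswer with exactly one label."
)

def classify(text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model name; use whatever you have access to
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": USER_TEMPLATE.format(text=text)},
        ],
        temperature=0.0,
        max_tokens=5,
        stop=["\n"],
    )
    return resp.choices[0].message.content.strip()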

Evaluation checklist:
- Accuracy ≥ 80% on starter test set
- Output is one of three labels, no extra text
- Consistent decisions on similar items (spot‑check)

Known failure modes:
- Sarcasm (“Great job…” used negatively)
- Domain terms (e.g., “sick” as Positive in slang)
- Mixed sentiment (defaulting to Neutral may hide nuance)

Starter test set (10 items with gold labels):

1. “Absolutely love the new design!” → Positive
2. “Meh, it works but I don’t care.” → Neutral
3. “The app keeps crashing. I’m done.” → Negative
4. “Thanks for the quick fix!” → Positive
5. “It’s fine.” → Neutral
6. “Terrible customer service, never again.” → Negative
7. “Pretty decent for the price.” → Positive
8. “Not sure yet, still evaluating.” → Neutral
9. “This is the worst update so far.” → Negative
10. “Smooth installation, no issues.” → Positive

Compute accuracy (example Python):


preds = [
  # fill in model predictions in order, e.g., "Positive", ...
]
gold = ["Positive","Neutral","Negative","Positive","Neutral","Negative","Positive","Neutral","Negative","Positive"]
acc = sum(p==g for p,g in zip(preds,gold)) / len(gold)
print(f"Accuracy: {acc:.2%}")

Upgrade to v0.2 ideas:
- Add 2–3 counterexamples (sarcasm, slang)
- Clarify tie‑breaking: “If both positive and negative cues exist, classify by overall intent of final sentence.”
- Note trade‑off: stricter rules increase consistency but may over‑simplify edge cases

Prompt 2: Structured JSON Extraction with a Schema

Task: Extract fields from short support tickets into validated JSON.

JSON schema (keep it small):

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["customer_name", "issue_type", "urgency", "product"],
  "properties": {
    "customer_name": {"type": "string"},
    "issue_type": {"type": "string", "enum": ["billing", "bug", "feature_request", "how_to"]},
    "urgency": {"type": "string", "enum": ["low", "medium", "high"]},
    "product": {"type": "string"},
    "actions_requested": {"type": "string"}
  },
  "additionalProperties": false
}
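
If you want an automated schema check, the jsonschema package can validate outputs directly against this schema. A minimal sketch (assuming jsonschema is installed):

import json
from jsonschema import validate, ValidationError

SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "required": ["customer_name", "issue_type", "urgency", "product"],
    "properties": {
        "customer_name": {"type": "string"},
        "issue_type": {"type": "string", "enum": ["billing", "bug", "feature_request", "how_to"]},
        "urgency": {"type": "string", "enum": ["low", "medium", "high"]},
        "product": {"type": "string"},
        "actions_requested": {"type": "string"},
    },
    "additionalProperties": False,
}

def validate_ticket_json(raw: str) -> bool:
    """Return True if the model output parses as JSON and matches the schema."""
    try:
        validate(instance=json.loads(raw), schema=SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False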

Prompt (v0.1):

System: You extract structured data from short emails.
User: Read the ticket and return ONLY minified JSON matching the schema. Do not include code fences or commentary.
Rules:
- If a field is missing, infer conservatively; if impossible, use an empty string.
- issue_type ∈ {billing, bug, feature_request, how_to}
- urgency ∈ {low, medium, high}

Schema fields: customer_name, issue_type, urgency, product, actions_requested

Ticket:
{ticket}

Parameters:
- temperature: 0.1–0.3
- max_tokens: 120

Evaluation checklist:
- JSON parses successfully
- Keys exactly match schema; no extra fields
- Enums are valid; empty string only when truly unknown

Known failure modes:
- Adds extra keys (“email”, “timestamp”)
- Returns explanations or code fences
- Over‑confident inference of urgency without clues

Starter test set (10 tickets):

1) “Hi, I’m Maya Chen. I was double‑charged this month for Pro. Please refund.”
2) “Hello, buttons are unclickable after the latest update to Mobile Starter.”
3) “Could you add SSO for Enterprise? – Alex”
4) “Where do I find invoices in the dashboard? Product: Pro.”
5) “This is urgent: uploads fail with error 504 on Web Basic. – Priya”
6) “Need to change my credit card on file for Pro plan. – Dan”
7) “Feature idea: dark mode on Mobile Starter. Not urgent.”
8) “How do I export data from Pro? – Lina”
9) “The app crashes when I tap settings on iOS.”
10) “Please escalate: billing shows the wrong total for Enterprise.”

Format adherence check (Python):

import json, re

ALLOWED_KEYS = {"customer_name", "issue_type", "urgency", "product", "actions_requested"}
REQUIRED_KEYS = {"customer_name", "issue_type", "urgency", "product"}

def is_minified_json(s):
    # Heuristic: reject outputs that contain code fences or prose-style "Key:" lines.
    return not bool(re.search(r"```|\n\s*[A-Za-z]+:", s))

outputs = [
    # fill in raw model outputs (one string per ticket), in order
]

valid = 0
for s in outputs:
    try:
        obj = json.loads(s)
        keys_ok = REQUIRED_KEYS <= set(obj.keys()) <= ALLOWED_KEYS  # required keys present, no extras
        enums_ok = obj.get("issue_type") in {"billing", "bug", "feature_request", "how_to"} and \
                   obj.get("urgency") in {"low", "medium", "high"}
        if keys_ok and enums_ok and is_minified_json(s):
            valid += 1
    except Exception:
        pass

if outputs:
    print(f"Format adherence: {valid/len(outputs):.2%}")

Upgrade to v0.2 ideas:
- Prepend: “Return exactly these keys in this order …”
- Add 2 positive/2 negative examples showing correct vs incorrect JSON
- Trade‑off: stricter formatting improves parse rate but can reduce recall (model may omit uncertain fields)

Prompt 3: Concise Q&A with Citation Style (RAG‑ready)

Task: Answer questions concisely from provided context and cite sources like [1], [2].

Prompt (v0.1):

System: You answer only using the provided context. If the answer is not in the context, say "Insufficient information." Include citations like [1] referencing the source list.
User: Using the context and source list, answer in ≤3 sentences. Put citations at the end of the sentence they support.

Context:
{context}

Sources:
[1] {source_1}
[2] {source_2}
[3] {source_3}

Question: {question}
Output format:
- If answerable: a short paragraph (≤ 50 words) including [n] citations that exist in the Sources list.
- If not answerable: "Insufficient information."

Parameters:
- temperature: 0.1–0.3
- max_tokens: 120
- stop: optional to curb rambling

Evaluation checklist:
- Word count ≤ 50; ≤ 3 sentences
- Every [n] exists in Sources; no hallucinated sources
- No content beyond given context

Known failure modes:
- Missing citations or wrong numbers
- Over‑confident answers when context is insufficient
- Exceeding length or adding marketing fluff

Shared mini‑corpus for testing (use the same Sources list across questions):

Context:
Acme Battery v2 charges to 80% in 30 minutes and lasts 12 hours under normal use.
Acme Charger Pro supports USB‑C PD 65W and is compatible with Acme Battery v2.
The return window is 30 days with receipt; batteries have a 1‑year warranty.

Sources:
[1] Acme Battery v2 spec sheet
[2] Acme Charger Pro compatibility list
[3] Acme Warranty and Returns policy

Starter test questions (10):

1. How fast does Acme Battery v2 charge to 80%?
2. Is Acme Charger Pro compatible with the battery?
3. What’s the battery life under normal use?
4. How long is the warranty for batteries?
5. Can I return after 45 days with no receipt?
6. Does Charger Pro support USB‑C PD 65W?
7. Are there details about wireless charging speed?
8. What percentage is mentioned for the quick charge milestone?
9. Are there any conditions for returns?
10. Is Acme Battery v2 compatible with 100W chargers?
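
To run these questions, fill the prompt template with the shared corpus and one question at a time. A minimal assembly sketch using plain string formatting (the system message and output‑format rules from the prompt above are omitted here for brevity):

CONTEXT = (
    "Acme Battery v2 charges to 80% in 30 minutes and lasts 12 hours under normal use.\n"
    "Acme Charger Pro supports USB-C PD 65W and is compatible with Acme Battery v2.\n"
    "The return window is 30 days with receipt; batteries have a 1-year warranty."
)
SOURCES = (
    "[1] Acme Battery v2 spec sheet\n"
    "[2] Acme Charger Pro compatibility list\n"
    "[3] Acme Warranty and Returns policy"
)

QA_TEMPLATE = """Using the context and source list, answer in <=3 sentences. Put citations at the end of the sentence they support.

Context:
{context}

Sources:
{sources}

Question: {question}"""

def build_qa_prompt(question: str) -> str:
    # Fill the user portion of the prompt; prepend your system message when calling the model.
    return QA_TEMPLATE.format(context=CONTEXT, sources=SOURCES, question=question)

print(build_qa_prompt("How fast does Acme Battery v2 charge to 80%?"))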

Automated format checks (Python):

import re

def check_answer(ans):
    # ≤ 50 words
    words_ok = len(ans.split()) <= 50
    # citations present and valid numbers 1-3
    cits = re.findall(r"\[(\d+)\]", ans)
    cits_ok = all(c in {"1","2","3"} for c in cits) and (len(cits) > 0 or "Insufficient information." in ans)
    # ≤ 3 sentences (naive)
    sentences_ok = len(re.split(r"[.!?]+\s+", ans.strip().rstrip(".!?"))) <= 3
    return words_ok and cits_ok and sentences_ok

print(check_answer("Charges to 80% in 30 minutes [1]."))

Upgrade to v0.2 ideas:
- Add one positive and one “insufficient info” example
- Explicitly require: “Cite each claim at sentence end”
- Trade‑off: tighter length control improves readability but risks truncating nuance

Evaluate, iterate, and version your prompts

Now run each prompt on its test set.

Record metrics and notes in your prompt spec.

For your evaluation checklist, track accuracy and format adherence, document failure modes, and iterate versions from v0.1 to v0.2. For a step‑by‑step process, see this comprehensive prompt optimization workflow.

Simple scoring plan:
- Sentiment: overall accuracy on 10 items; inspect disagreements
- Extraction: format adherence rate; optionally hand‑label 5 items for field‑level F1 (see the scoring sketch after this list)
- Q&A: proportion passing format checks; manual spot‑check faithfulness vs context
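
One simple way to compute field‑level F1 for the extraction prompt is to treat every non‑empty predicted field as a prediction and every non‑empty gold field as a target. A minimal sketch, assuming preds and golds are lists of dicts for the items you hand‑labeled:

def field_f1(preds: list[dict], golds: list[dict]) -> float:
    """Micro-averaged field-level F1 over hand-labeled items."""
    tp = fp = fn = 0
    for p, g in zip(preds, golds):
        for key in g:
            pred_val, gold_val = p.get(key, ""), g[key]
            if pred_val and gold_val:
                if pred_val == gold_val:
                    tp += 1          # correct non-empty prediction
                else:
                    fp += 1          # wrong value predicted
                    fn += 1          # gold value missed
            elif pred_val and not gold_val:
                fp += 1              # predicted a value that should be empty
            elif gold_val and not pred_val:
                fn += 1              # missed a value that should be filled
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0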

Versioning tips:
- v0.1: minimum viable prompt that runs end‑to‑end
- v0.2: apply 2–3 targeted edits informed by errors
- Keep a CHANGELOG and note trade‑offs (e.g., lower temperature → more consistent formatting; fewer creative phrasings)

Troubleshooting:
- If outputs include extra text, add “Return only … No commentary.” and reduce temperature
- If JSON breaks, add negative examples showing wrong formats and enforce enums
- If Q&A hallucinates, restate “Answer only from context; otherwise say Insufficient information.”

Linking back to the series:
- For instruction clarity and zero‑shot baselines, revisit the sections on task decomposition in Foundations of Prompt Engineering: Clarity, Structure, and Zero‑Shot
- For example curation and label wording, see Few‑Shot Prompting for Prompt Engineering: Formats, Labels, and Examples
- For reasoning prompts when classification is tricky, consult Reasoning Prompts: Chain‑of‑Thought and Self‑Consistency
- For tools and A/B testing ideas, check Prompt Engineering Tools and Evaluation: Templates, Guardrails, and A/B Tests

Practical exercise

  • Build: Implement the three prompts with v0.1 specs above
  • Test: Run each on the provided test items
  • Evaluate: Compute accuracy/format scores using the code snippets
  • Iterate: Ship a v0.2 for each prompt with 2–3 edits and a short note on trade‑offs

Expected outcome:
- A beginner prompt library you can reuse
- Documented specs (template completed), plus metrics and a CHANGELOG

Tips for success:
- Keep temperatures low for deterministic tasks
- Be explicit about output schemas and length limits
- Add a couple of adversarial examples to your few‑shot lists

Summary

You just shipped a mini prompt library. It includes a few‑shot sentiment classifier, a JSON extractor with a schema, and a concise, citation‑aware Q&A prompt.

Along the way you practiced prompt engineering fundamentals: you documented each prompt with a template, built small test sets, computed accuracy and format adherence, and versioned from v0.1 to v0.2. Keep iterating. Small, measured changes compound quickly.