Programmatic Prompt Optimization: Replacing Intuition with Algorithms

I’d built plenty of successful POCs with LLM prompts. Impressive in demos, but when it came time to deploy at scale, precision wasn’t good enough… recall wasn’t high. I’d tried methods from many prompt tutorials from openai, claude etc.

Manual prompt engineering is fundamentally trial and error. What works today might fail tomorrow and there’s no systematic way to improve it.

My task at hand was getting insights from unstructured customer interaction data. I’d spent three weeks fine-tuning a single prompt by hand, iterating through variations, hoping to stumble on something that worked consistently. It wasn’t cutting it. Even if this worked now, what about model upgrades? What about maintenance?

The key observation

LLMs are remarkably good at giving feedback. They can look at their own outputs and tell you exactly what went wrong. So why couldn’t they convert that feedback into better prompts automatically?

I searched for tools that could achieve this out of pure desperation. That’s when I stumbled upon DSPy and the concept of prompt optimizers: systems that treat prompt engineering as an optimization problem rather than an art. When GEPA was released, I knew I had to test it.

Think of it like compilation: you write high-level code (your task definition), and the compiler transforms it into optimized machine instructions (a prompt that actually works). You don’t hand-tune assembly, so why hand-tune prompts?

The Pilot

I needed a simple test to see if GEPA could work for my use case. So I created a synthetic dataset of 27 sales call transcripts that represented a real challenge we face: detecting presence of required behaviors and predicting call quality (good/bad). The transcripts were hand-labeled across 7 behavior categories (introduction, needs, value proposition, objection handling, benefit reinforcement, risk reduction, and closing). Small enough to iterate fast, realistic enough to validate the approach—and representative of a problem I’d hit repeatedly: intent extraction and call evaluation look easy for a few cases, but precision and recall tank at scale.

I expected weeks of iteration. Instead, I got meaningful results in a single run. Usually a show piece like this would be carefully selected sample to show the power of the approach. Here this is literally first attempt, that in itself tells the power of the approach.

📞 Input: Call Transcript

agent: Hi, good afternoon! This is Maya calling from Citi Corp. Am I speaking with Jordan Lee?

customer: Yes, this is Jordan.

agent: Great, Jordan. How are you doing today?

customer: I'm good, thanks. Busy afternoon, but I have a few minutes.

agent: I appreciate you taking the time...

📊 Output: Analysis

Call Quality

✓ Good

Detected Categories

Introduction/Rapport ✓ Need Assessment ✓ Value Proposition ✗ Objection Handling ✓ Benefit Reinforcement ✓ Risk Reduction ✓ Call to Action ✓

Results

Approach	Cost	Time	Accuracy
Manual prompt engineering	$100-1000 (engineer time)	Days to weeks	72%
GEPA	~$2	10 hours	81%
GEPA with error analysis	~$0.5	3 hours	90%

The optimizer ran for about 10 hours, cost roughly $2, and explored over 200 prompt variants. Through genetic mutation and Pareto selection, it whittled those down to 9 “survivors”—prompts that excelled at different subsets of the problem. The best performer jumped from 72% to 81% accuracy, a lift I hadn’t achieved in months of manual tuning.

The intermediate prompts evolving caught my eye. I could see the optimizer discovering nuances I’d never thought to include: explicit definitions for each category, step-by-step rules for edge cases, domain-specific guidance about soft pulls versus hard pulls. The quality of the reasoning it produced while iterating was genuinely impressive.

Hopefully I have convinced you that this method is powerful, lets see how i did it and you can follow similar steps for yours as well.

Detailed Implementation of GEPA

Let’s first understand what GEPA does, then dive into code for this specific usecase. If you want deeper dive on how GEPA works, i have previously written a detailed piece here: GEPA

Code-first folks: here’s the notebook: github link

We’ll use DSPy to run GEPA. If you’re new to DSPy, it’s a framework that treats prompts as code you can optimize programmatically. For background, see The Data Quarry’s guide.

How GEPA Works

GEPA (Genetic-Pareto Algorithm) differs from traditional optimization in three key ways:

Reflective Mutation: The LLM reads failure feedback and proposes targeted improvements. It’s not random guessing—it’s reasoning about what went wrong.
Pareto Selection: Instead of keeping only the single best prompt, GEPA maintains a “frontier” of diverse specialists. One prompt might excel at detecting objection handling; another at predicting outcomes. This prevents catastrophic forgetting.
Text-as-Feedback: Traditional RL uses scalar rewards. GEPA exploits rich textual feedback (“You incorrectly marked this as rapport-building because…”) to guide mutations precisely.

Prerequisites

To Use GEPA, we need 3 components.

Component	What it does	Why it matters
DSPy Signature	Your baseline prompt defining the task	The “prompt” being optimized
Metric & Feedback	Returns score + textual feedback	Tells optimizer what “good” looks like and why
Dataset	Labeled examples (train/val/test)	Ground truth for evaluation

The DSPy Signature

In DSPy, A Signature defines input/output schema; the instructions in the docstring become part of the prompt. My initial prompt was embarrassingly simple—just two lines:

class CallAnalysis(dspy.Signature):
    """
    Read the provided call transcript and analyze it comprehensively.
    Determine both: (1) which categories the agent displayed, and 
    (2) whether the call will lead to conversion or customer retention.
    """
    message: str = dspy.InputField()
    categories: List[Literal["introduction_rapport_building", "need_assessment_qualification", 
                             "value_proposition_feature_mapping", "objection_handling", 
                             "benefit_reinforcement", "risk_reduction_trust_building", 
                             "call_to_action_closing"]] = dspy.OutputField()
    final_result: Literal['good', 'bad'] = dspy.OutputField()

program = dspy.ChainOfThought(CallAnalysis)

Metric Function

A Metric tells us whether we are moving in the right direction. In this case, accuracy of categories detected and final prediction that whether call was good or bad - both were important. Hence metric will be mean of both the entities

def call_qual_metric(gold, pred):
    return 1.0 if gold == pred else 0.0

def category_qual_metric(gold, pred):
    """Compute score for categories using set operations."""
    pred_set = set(pred)
    gold_true = {k for k, v in gold.items() if v}
    gold_false = {k for k, v in gold.items() if not v}
    
    correct = len(gold_true & pred_set) + len(gold_false - pred_set)
    return correct / len(gold)

def comb_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """Overall metric combining both scores."""
    call_qual = call_qual_metric(gold.final_result, pred.final_result)
    category_qual = category_qual_metric(gold.categories, pred.categories)
    return (call_qual + category_qual) / 2

Adding Feedback

This is the key enabler. A basic metric only returns a score but not why it failed. With feedback, the optimizer can reason about failures and propose targeted fixes.

❌ Metric (Score Only)

def comb_metric(example, pred):
    gold_cat = json.loads(example['answer'])
    gold_final = example['final_result']

    # Category score
    correct = sum(1 for k, v in gold_cat.items() 
                  if (v and k in pred.categories) or 
                     (not v and k not in pred.categories))
    cat_score = correct / len(gold_cat)

    # Final result score
    final_score = 1.0 if gold_final == pred.final_result else 0.0

    return (cat_score + final_score) / 2

Problem: Optimizer only knows "0.7" — no idea why it failed.

✅ Metric with Feedback

def comb_metric_with_feedback(example, pred, pred_name=None):
    # ... same scoring logic as above ...

    # Generate textual feedback
    if gold_final != pred.final_result:
        fb = f"Incorrect: predicted {pred.final_result}, actual {gold_final}"
    else:
        fb = f"Correct: {gold_final}"

    if incorrectly_included:
        fb += f"\nFalse positives: {incorrectly_included}"
    if incorrectly_excluded:
        fb += f"\nMissed categories: {incorrectly_excluded}"

    return dspy.Prediction(score=score, feedback=fb)

Benefit: Optimizer sees "Missed: intro_rapport" → can propose targeted fix.

Code example:

def call_qual_feedback(gold, pred):
    """ Generate feedback for final result module. """
    if gold == pred:
        fb = f"You correctly classified the sales call as `{gold}`. This sales call is indeed `{gold}`."
    else:
        fb = f"You incorrectly classified the sales call as `{pred}`. The correct sales call is `{gold}`. Think about how you could have reasoned to get the correct sales call label."
    return fb

def category_qual_feedback(gold, pred):
    """Generate feedback using set operations."""
    pred_set = set(pred)
    gold_true = {k for k, v in gold.items() if v}
    gold_false = {k for k, v in gold.items() if not v}
    
    correctly_included = gold_true & pred_set
    incorrectly_included = gold_false & pred_set
    incorrectly_excluded = gold_true - pred_set
    correctly_excluded = gold_false - pred_set
    
    score = (len(correctly_included) + len(correctly_excluded)) / len(gold)
    
    if score == 1.0:
        return f"Perfect. Correctly identified: `{correctly_included}`.", score
    
    fb = f"Correctly identified: `{correctly_included}`.\n"
    if incorrectly_included:
        fb += f"False positives: `{incorrectly_included}`.\n"
    if incorrectly_excluded:
        fb += f"Missed: `{incorrectly_excluded}`.\n"
    return fb

def comb_metric_with_feedback(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """
    Computes a score and provides feedback for the call analysis prediction.
    Returns total score if pred_name is None, otherwise returns dspy.Prediction with score and feedback.
    """
    # Compute feedback and scores
    cal_fb = call_qual_feedback(gold.final_result, pred.final_result)
    cat_fb = category_qual_feedback(gold.categories, pred.categories)
    fb = cal_fb + '\n' + cat_fb
    score = comb_metric(gold, pred)
    return dspy.Prediction(score=score, feedback=fb)

Running GEPA

With prerequisites in place, optimization is straightforward:

from dspy import GEPA

optimizer = GEPA(
    metric=comb_metric_with_feedback,
    auto="light",
)

optimized_program = optimizer.compile(program, trainset=tset, valset=vset)

Post GEPA run

GEPA ran for 10 hrs on my PC. I saw the 2-line prompt evolve into ~1,500 words of instruction discovering nuances like “a bare greeting isn’t rapport-building; look for warmth and time acknowledgment”.

Pay Attention

In some cases, it came up with better instructions and heuristics than me. Kindah felt fearful yet exciting.

Initial Prompt

Read the provided call transcript and analyze it comprehensively.
Determine both: (1) which categories the agent displayed, and (2) whether the call will lead to conversion or customer retention.

Optimized Prompt (GEPA)

New Instructions for Analyzing Banking/Card Transaction Call Transcripts

Overview
You are an analysis assistant whose job is to evaluate sales/transact‑ion-focused call transcripts in the banking/credit-card domain. For each transcript, produce a compact, structured analysis with two main objectives:
(a) identify the agent behavior categories demonstrated (from the seven pillars below), and
(b) judge whether the call outcome is good or bad.

Inputs you will receive
- A complete transcript of a single call between an agent and a customer. Transcripts may include labels such as "agent:" and "customer:" and may cover topics like card offers, fees, rewards, security, and next steps.

What you must produce (three sections exactly)
1) reasoning
- Provide a concise, bullets-style justification for every pillar category you detected in the transcript.
- Include short quotes or paraphrases from the transcript to illustrate why the category applies. Do not introduce facts or assumptions beyond what is in the transcript.
- If you detect a strength/weakness signal about the outcome, include a brief, one- to two-sentence note here describing how strong the signal is and what would push it toward conversion or toward retention.
- This section may contain a small, optional note about outcome strength, but must not introduce information outside the transcript.

2) categories
- Output a Python-like list of the detected pillar categories in the exact order they first appeared in the transcript.
- Example format: ['introduction_rapport_building', 'need_assessment_qualification', ...]

3) final_result
- A single word indicating the call outcome:
- good — the call demonstrates strong agent performance and is likely to lead to conversion or retention.
- bad — the call shows weak agent performance or missed opportunities.
- Do not add any qualifiers in this field; use exactly one of the two keywords above.

Optional but encouraged: assess the strength of the outcome
- If you include it (recommendation), place this assessment only in reasoning as the optional strength_of_outcome note. Keep it concise (one or two sentences). It should address:
- How strong is the good/bad signal?
- What would most likely push the outcome toward good or toward bad?

Pillar definitions (seven bank/card-specific categories)
- introduction_rapport_building
- Includes opening greetings, courtesy, acknowledgment of time, and attempts to establish rapport.
- Examples: greetings, confirming time, polite introductions, small talk about fit or time constraints.
- need_assessment_qualification
- Involves asking about customer needs, usage, spend patterns, eligibility checks, and whether the product fits (e.g., business vs personal, employee cards, annual fees, soft vs hard pulls).
- value_proposition_feature_mapping
- Linking card features to tangible, customer-relevant benefits (rewards, protections, credits) and showing how those features align with stated needs.
- objection_handling
- Addressing concerns about price, complexity, trust, enrollment, or process obstacles. Includes acknowledging concerns and offering clarifications or mitigations.
- benefit_reinforcement
- Reiterating concrete benefits and value after objections or hesitations, often tying back to the customer's stated needs.
- risk_reduction_trust_building
- Providing security assurances, privacy protections, non-hard-pull options, guarantees, terms clarity, or brand trust signals.
- call_to_action_closing
- Concrete next steps or commitments: soft checks, secure links, email/mail options, scheduling follow-ups, or instructions to apply/get more information.

How to apply the rules
- For every transcript, read from start to finish. Mark each pillar as soon as its criteria are clearly demonstrated.
- If a single utterance clearly satisfies more than one pillar, count it under all applicable pillars.
- If a pillar is not clearly demonstrated anywhere in the transcript, do not include it in the categories list.
- Record the detected pillars in the exact order of their first appearance in the transcript.
- The final_result should reflect the overall trajectory of the call as described above.

Output constraints and format
- Do not introduce any facts not present in the transcript.
- Do not insert subjective opinions beyond what is grounded in the transcript.
- Use the exact section headings and formatting:
reasoning
categories
final_result
- Do not include extraneous content beyond the three sections above.

Domain-specific considerations
- You may encounter references to soft pulls vs hard pulls, online applications, secure links, email follow-ups, or scheduled follow-ups. Treat these as legitimate "call_to_action_closing" or "risk_reduction_trust_building" elements as appropriate.
- When quoting or paraphrasing, keep quotes brief and focused on the reason for the pillar.
- If PII appears in the transcript (e.g., partial SSN or addresses), quote minimally and do not reveal full sensitive data in your justification. You may paraphrase or reference the presence of sensitive data without reproducing it.

Example behavior (not to reproduce here)
- A transcript with strong agent behaviors and clear next steps is more likely good; a transcript with missed opportunities, customer hesitation, or weak closing is more likely bad.

End result
- Return exactly three sections for every transcript analyzed, with the content governed by the rules above. This format enables consistent, comparable, and transparent analysis across transcripts.

Result: 72% → 81% accuracy. More importantly, the process was repeatable.

For a deeper technical dive: GEPA Deepdive

Breaking Through the 81% Ceiling with Error Analysis

With the above proof, I deployed it internally and it worked reliably wherever it was implemented properly. But I wasn’t satisfied. When I manually reviewed the failing cases, something bothered me: these weren’t hard examples. Given a hint, the LLM could easily get them right. So why was it failing?

So I dug in and exported the misclassified examples to a spreadsheet and studied them systematically:

Input (truncated)	Actual	Predicted	What Went Wrong
Tyler calling from dominos… “sounds good, I’ll send that link…”	bad	good	Customer showed hesitation (“I’m really not sure…”) but agent rushed to close
Mark from JP calling about credit solutions…	rapport = false	rapport = true	Agent said “Hi, this is Mark from JP” with no warmth or rapport signals

The pattern became clear: the model was over-detecting introduction_rapport_building—treating any greeting as rapport. It also sometimes marked calls as “good” just because next steps existed, even when the customer showed clear hesitation.

🎯 GEPA Run 1

72% → 81%

→

🔍 Error Analysis

Export failures, find patterns

→

✏️ Add Feedback

Targeted hints for failures

→

🚀 GEPA Run 2

81% → 90%

The Solution: Targeted Feedback in the Metric

My hunch was simple: if I could teach the optimizer why these specific cases failed, it could learn the distinctions. I added a feedback column to my dataset and filled it only for mistagged cases—writing out precisely why my label was correct. For example:

“The agent said ‘Hi, this is Mark from JP’ without warmth, time acknowledgment, or rapport-building. A bare introduction doesn’t qualify as introduction_rapport_building.”

Example	Iter 1	Feedback	Iter 2	Feedback	Iter 3
Tyler	❌	Look for hesitation signals	✅	—	—
Mark	❌	Greeting ≠ rapport	❌	Need warmth/courtesy	✅
John	❌	Greeting ≠ rapport	✅	—	—

Then I passed this feedback directly to GEPA through the metric function:

def comb_metric_with_feedback(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """Metric that returns score + feedback, including targeted hints for known failure cases."""
    # Compute scores
    call_qual = call_qual_metric(gold.final_result, pred.final_result)
    category_qual = category_qual_metric(gold.categories, pred.categories)
    score = (call_qual + category_qual) / 2
    
    # Generate base feedback
    cal_fb = call_qual_feedback(gold.final_result, pred.final_result)
    cat_fb = category_qual_feedback(gold.categories, pred.categories)
    fb = cal_fb + '\n' + cat_fb

    # Append targeted feedback from the dataset (if present)
    fb += gold.feedback

    return dspy.Prediction(score=score, feedback=fb)

Now when GEPA’s reflective mutations analyze failures, they see specific guidance like “look for warmth signals, not just greetings” instead of generic “wrong answer” feedback.

💡 Pro tip:

You could use an LLM to generate this feedback automatically, but doing it manually gives you fine-grained control over how the final prompt evolves. The LLM might miss the exact nuance you care about. In short, you’re writing the prompt through feedback.

The Result

81% → 90% accuracy in just 3 hours and ~$0.50.

The key insight: GEPA’s genetic mutations work best when they have precise feedback to reason about. Generic “wrong answer” feedback produces generic improvements. Targeted feedback like “you’re conflating greetings with rapport” produces targeted fixes.

Lessons Learned & Recommended Workflow

The biggest lesson wasn’t technical, it was psychological. I stopped thinking of prompts as things I write and started thinking of them as things I evolve. My job shifted from “craft the perfect prompt” to “define what good looks like and let the system find it.”

When NOT to Use GEPA

GEPA only works when you can clearly define “correct.” If you can’t reliably label examples yourself or if the target is fuzzy or subjective, the optimizer will chase noise. I tried it on a sentiment task where human annotators disagreed 30% of the time. The results were inconsistent. Rule: if you can’t do it clearly as a human, don’t expect GEPA to figure it out.

My Recommended Workflow

Start with a clear metric and a personally-validated dataset. 20-50 examples you’ve labeled yourself, understanding the edge cases.
Run GEPA → Review failures → Add targeted feedback. Study misclassified examples. Write specific explanations of why your label is correct.
Run GEPA again → Stop when acceptable → Deploy. One or two iterations usually suffice. Don’t chase perfection.
Post-deployment: monitor, collect failures, iterate. Production surfaces edge cases. Add them with feedback and re-run as needed.

The complete Jupyter notebook: github link