Exploring GEPA Optimization in DSPy: A Deep Dive into Automatic Prompt Improvement
By: Risheek kumar B
Introduction
In my recent experiments, I’ve been exploring DSPy — a declarative framework for building and optimizing LLM programs — with a particular focus on GEPA (Genetic-Pareto), its reflective prompt-evolution optimizer. My goal was to understand how automated prompt improvement works under the hood and how it can be applied to real-world language model workflows.
Prompt engineering has evolved from manual iteration to algorithmic optimization, and DSPy represents this next step — it allows you to define modular, composable LLM pipelines that can learn how to prompt better over time.
Background: What Is DSPy?
DSPy is an open-source library from Stanford that enables you to declaratively write LLM programs as modules — each with inputs, outputs, and a functional definition — and then train or optimize them using strategies like teleprompters.
Instead of manually crafting system prompts, you can specify the structure and intent of your tasks, and DSPy takes care of tuning the internal prompts to improve performance metrics on validation data.
A simple DSPy pipeline might look like this:
import dspy

class Summarizer(dspy.Module):
    def __init__(self):
        super().__init__()
        self.summarize = dspy.Predict("text -> summary")

    def forward(self, text):
        return self.summarize(text=text)
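Running it only requires configuring a default LM first. Here is a minimal usage sketch, assuming an OpenAI-backed model (the model name is illustrative):

import dspy

# Configure a default LM for all DSPy modules (model name is illustrative)
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

summarizer = Summarizer()
result = summarizer(text="DSPy is a declarative framework for building and optimizing LLM programs.")
print(result.summary)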
This summarizer can later be optimized with DSPy’s built-in teleprompters such as MIPROv2, COPRO, or GEPA — each representing a different prompt optimization strategy.
Enter GEPA: Genetic-Pareto Prompt Evolution
GEPA stands for Genetic-Pareto. It’s a powerful optimization routine that treats prompt tuning as an evolutionary search: an LLM reflects on execution traces and evaluation feedback, proposes revised prompt candidates, and Pareto-based selection decides which candidates survive.
While traditional prompt tuning relies on manual trial-and-error, GEPA makes directed updates in the space of prompts by leveraging natural-language feedback from evaluation data.
Here’s what happens conceptually:
- Define a Program — you build a DSPy pipeline with composable modules.
- Provide Evaluation Data — a small labeled dataset plus a metric function to guide optimization (a metric sketch follows this list).
- Run GEPA — it iteratively proposes prompt variants, measures their effect, and updates the prompt templates.
- Track Pareto Scores — GEPA maintains a frontier of candidates, each best on at least some tracked examples, rather than committing to a single global winner.
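To make the evaluation step concrete, here is a minimal sketch of the kind of metric GEPA can consume. DSPy’s GEPA documentation describes metrics that return a score together with textual feedback for the reflection step; the task field (label) and the feedback wording below are illustrative assumptions:

import dspy

# A GEPA-style metric: returns a score plus textual feedback that the
# reflection step can read when rewriting prompts. The gold/pred field
# name (label) is illustrative.
def accuracy_with_feedback(gold, pred, trace=None, pred_name=None, pred_trace=None):
    correct = gold.label == pred.label
    if correct:
        feedback = "Correct."
    else:
        feedback = (f"Expected '{gold.label}' but got '{pred.label}'. "
                    "Focus on the decisive phrases in the input text.")
    return dspy.Prediction(score=float(correct), feedback=feedback)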
During my exploration, I observed logs like:
INFO dspy.teleprompt.gepa.gepa: Running GEPA for approx 416 metric calls.
INFO dspy.teleprompt.gepa.gepa: Using 9 examples for tracking Pareto scores.
This means the optimization process was sampling prompt variants, evaluating their effectiveness on each tracked example, and maintaining a Pareto frontier of candidate solutions, keeping every prompt that was best on at least one example.
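To see what a Pareto frontier over tracked examples means in practice, here is a toy illustration in plain Python (not DSPy internals): a candidate prompt survives only if no other candidate scores at least as well on every example and strictly better on at least one.

# Toy Pareto-frontier computation over per-example scores (illustrative,
# not DSPy internals). A candidate is dominated if another candidate is
# at least as good on every example and strictly better on at least one.
def pareto_frontier(candidates: dict[str, list[float]]) -> list[str]:
    def dominates(a, b):
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))
    return [
        name for name, scores in candidates.items()
        if not any(dominates(other, scores)
                   for other_name, other in candidates.items() if other_name != name)
    ]

scores = {"prompt_a": [1.0, 0.0, 1.0], "prompt_b": [0.0, 1.0, 1.0], "prompt_c": [0.0, 0.0, 1.0]}
print(pareto_frontier(scores))  # ['prompt_a', 'prompt_b']: prompt_c is dominated by both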
Experiment Setup
I gave GEPA a budget of around 416 metric calls, corresponding to roughly 24 full evaluations over the train + validation data. The setup involved:
- A classification task (binary or multi-label),
- A small evaluation set (~9 samples for Pareto tracking),
- And a custom metric for assessing response quality.
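The program itself was a small module along these lines (a sketch; the signature fields are illustrative assumptions, not my exact code):

import dspy

class MyClassifier(dspy.Module):
    def __init__(self):
        super().__init__()
        # Chain-of-thought classification; the "text -> label" fields are illustrative
        self.classify = dspy.ChainOfThought("text -> label")

    def forward(self, text):
        return self.classify(text=text)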
Here’s a simplified sketch of my setup:
import dspy
from dspy.teleprompt import GEPA

# Define model and dataset
dspy.configure(lm=dspy.LM("openai/gpt-4-turbo"))
train_data, val_data = load_dataset()

# Define the program
program = MyClassifier()

# Configure GEPA optimization: the metric budget matches the logged
# "approx 416 metric calls"; the reflection LM rewrites prompts between iterations
gepa = GEPA(
    metric=accuracy_with_feedback,
    max_metric_calls=416,
    reflection_lm=dspy.LM("openai/gpt-4-turbo"),
)

# Run optimization (Pareto scores are tracked on the validation set)
optimized_program = gepa.compile(program, trainset=train_data, valset=val_data)
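Once the run completes, DSPy’s standard module utilities let you inspect the evolved instructions and persist the result; a minimal sketch (the file name is illustrative):

# Inspect the evolved instructions for each predictor in the program
for name, predictor in optimized_program.named_predictors():
    print(name, "->", predictor.signature.instructions)

# Persist the optimized program for later reuse
optimized_program.save("optimized_classifier.json")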
The process took several hours (≈10 hours in one case) due to the large number of evaluations and reflection steps involved in exploring the prompt space.
Observations
- Run Duration: The optimization was computationally intensive. Each iteration evaluates multiple prompt variants, which accumulates to several hours of processing time.
- Pareto Efficiency: GEPA tracks scores per validation example, keeping candidates that excel on different subsets of the data rather than a single global winner. This is closer to multi-objective optimization than single-metric fine-tuning.
- Granular Feedback: Logs from DSPy were detailed enough to track prompt evolution and scoring dynamics.
- Few-Shot Sensitivity: GEPA benefits from having a small but diverse validation set to capture multiple linguistic and reasoning patterns (a sketch of such a set follows this list).
- Repeatability: Rerunning GEPA with different seeds or model backends (e.g., GPT-4 vs Claude) can lead to distinct but similarly performing prompt variants — a form of prompt diversity.
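On the few-shot sensitivity point, this is the shape of a small, deliberately diverse validation set in DSPy (the fields and example texts are illustrative):

import dspy

# A small validation set for Pareto tracking; diversity matters more than size.
# The (text, label) fields mirror the classifier's signature and are illustrative.
val_data = [
    dspy.Example(text="The product broke after two days.", label="negative").with_inputs("text"),
    dspy.Example(text="Support resolved my issue in minutes.", label="positive").with_inputs("text"),
    dspy.Example(text="It works, but the setup instructions were confusing.", label="mixed").with_inputs("text"),
]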
Insights & Reflections
This exploration highlighted a few key insights about automated prompt optimization:
- Prompt tuning ≠ model fine-tuning. GEPA doesn’t modify weights — it optimizes text prompts, making it lightweight and safe for most applications.
- Pareto-based learning is powerful. Instead of blindly chasing one metric, GEPA learns trade-offs, which is useful in real-world NLP tasks where metrics often conflict.
- DSPy abstracts complexity beautifully. The declarative module system makes it easy to build structured LLM programs without micromanaging prompts.
- Time vs quality trade-off. Long optimization runs (10+ hours) may seem excessive, but for stable pipelines or production-grade prompt systems, the resulting gains in quality justify the compute.
Applications
From my perspective as a data scientist and ML practitioner, such automated prompt optimization frameworks could have wide-ranging applications:
- Sales call analysis — optimizing classification or summarization prompts for tone, objection handling, and personalization.
- Resume parsers and Chrome extensions — improving autofill accuracy through optimized context prompts.
- Analytic assistants — generating more precise SQL, Python, or business insights via prompt tuning.
- Educational tools — adapting prompts dynamically based on student feedback or comprehension scores.
Conclusion
My exploration with DSPy and GEPA reaffirmed the potential of automated prompt engineering. Instead of handcrafting text templates, we can now let optimization algorithms learn the best phrasing patterns to guide large language models.
While the setup is computationally heavy, the outcome — robust, high-performing LLM pipelines — is well worth the effort. GEPA, in particular, bridges the gap between classical optimization and modern prompt tuning, marking a new era in self-improving LLM workflows.
Next Steps
I plan to:
- Compare GEPA with MIPROv2 and COPRO on similar tasks,
- Experiment with custom evaluation metrics (e.g., semantic similarity + factuality),
- And explore hybrid pipelines combining GEPA-tuned modules with traditional fine-tuned models.