Evolutionary algorithms for prompt optimization

2026-05-18

Prompt optimization is hard largely because the optimizer has so little information to work with: the LLM is a black box, and the metric gives supervision only at the output level. Even so, it can be effective to treat the problem as an evolutionary search, with LLMs standing in for the genetic operators.

The landscape

OPRO is the simplest version: show an optimizer LLM the current best candidates and their scores, ask it to propose new ones, repeat.
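A minimal sketch of that loop, to make the control flow concrete. The names `opro_loop`, `score`, and `propose` are hypothetical: `propose` stands in for the optimizer LLM call and `score` for the evaluation metric, and neither comes from the OPRO codebase.

```python
from typing import Callable, List, Tuple

def opro_loop(
    seed_prompts: List[str],
    score: Callable[[str], float],
    propose: Callable[[List[Tuple[str, float]]], List[str]],
    steps: int = 5,
    top_k: int = 3,
) -> Tuple[str, float]:
    """Return the best (prompt, score) pair found after `steps` rounds."""
    # Every prompt scored so far, so nothing is evaluated twice.
    history = {p: score(p) for p in seed_prompts}
    for _ in range(steps):
        # Top-k candidates, sorted ascending so the strongest comes last.
        best = sorted(history.items(), key=lambda kv: kv[1])[-top_k:]
        # `propose` is an assumed stand-in for "ask the optimizer LLM".
        for candidate in propose(best):
            if candidate not in history:
                history[candidate] = score(candidate)
    return max(history.items(), key=lambda kv: kv[1])
```

In practice `propose` would format the scored candidates into a meta-prompt and parse new prompts out of the model's reply; here it is left as a plain callable so the loop can be exercised with a stub.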

EvoPrompt is more explicitly evolutionary: maintain a population of prompts, select parents, apply LLM-based crossover and mutation operators, update, repeat.
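The cycle above can be sketched as follows. This is an illustration under assumptions, not EvoPrompt's implementation: `crossover` and `mutate` are hypothetical callables standing in for the LLM-based operators, and binary tournament selection with elitist replacement is one simple choice among several.

```python
import random
from typing import Callable, List, Optional

def evolve(
    population: List[str],
    score: Callable[[str], float],
    crossover: Callable[[str, str], str],
    mutate: Callable[[str], str],
    generations: int = 10,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Evolve a population of at least two distinct prompts; returns it best-first."""
    rng = rng or random.Random(0)
    size = len(population)
    scores = {p: score(p) for p in population}
    for _ in range(generations):
        children = []
        for _ in range(size):
            # Select parents: two binary tournaments, each keeping the better prompt.
            a = max(rng.sample(population, 2), key=scores.__getitem__)
            b = max(rng.sample(population, 2), key=scores.__getitem__)
            # Apply the (assumed) LLM-backed operators.
            children.append(mutate(crossover(a, b)))
        for child in children:
            if child not in scores:
                scores[child] = score(child)
        # Update: keep the top `size` unique prompts among parents and children.
        pool = list(dict.fromkeys(population + children))
        population = sorted(pool, key=scores.__getitem__, reverse=True)[:size]
    return population
```

Because parents compete with their children for a place in the next generation, the best score in the population never decreases from one generation to the next.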

MIPRO extends this to multi-prompt pipelines, where the new challenge is credit assignment: if a pipeline degrades, which stage is responsible? MIPRO uses a Bayesian surrogate model to attribute credit across stages.
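To make the credit assignment problem concrete, here is a deliberately crude stand-in: sample random combinations of per-stage prompts, score each pipeline, and average the score for each stage's choice. This is plain marginal averaging, not a Bayesian surrogate and not MIPRO's method; `marginal_credit`, `stage_options`, and `evaluate` are hypothetical names.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

def marginal_credit(
    stage_options: Sequence[Sequence[str]],
    evaluate: Callable[[Tuple[str, ...]], float],
    trials: int = 200,
    seed: int = 0,
) -> List[Dict[str, float]]:
    """For each stage, map each prompt option to its mean pipeline score."""
    rng = random.Random(seed)
    totals = [defaultdict(float) for _ in stage_options]
    counts = [defaultdict(int) for _ in stage_options]
    for _ in range(trials):
        # One random prompt per stage forms a full pipeline configuration.
        combo = tuple(rng.choice(list(opts)) for opts in stage_options)
        pipeline_score = evaluate(combo)  # assumed end-to-end metric
        # Every stage's choice shares the credit for the end-to-end score.
        for stage, choice in enumerate(combo):
            totals[stage][choice] += pipeline_score
            counts[stage][choice] += 1
    return [
        {c: totals[i][c] / counts[i][c] for c in counts[i]}
        for i in range(len(stage_options))
    ]
```

A stage whose options show a large gap in mean score is probably the one that matters. A surrogate model pursues the same goal with far fewer evaluations and can capture interactions between stages, which marginal averages cannot.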

SIMBA drops the surrogate: the LLM reflects on traces of its own failures and proposes rules and few-shot examples to add to the prompt, handling credit assignment implicitly.

GEPA adds a Pareto frontier, meaning that a candidate prompt is pruned from the population only if some other candidate beats it on every training example. This keeps “specialist” prompts that excel on hard subsets in play, even when another prompt is better overall.
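A small sketch of Pareto pruning over per-example scores. It uses the textbook dominance test (at least as good on every example and strictly better on at least one), which is an assumption here and may differ in detail from GEPA's own rule. The function names and the toy score table are illustrative only.

```python
from typing import Dict, List

def dominates(a: List[float], b: List[float]) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scores: Dict[str, List[float]]) -> List[str]:
    """Keep every candidate that no other candidate dominates."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(other, s) for o, other in scores.items() if o != name)
    ]

# Made-up per-example scores (1 = correct, 0 = wrong) on four training examples.
scores = {
    "generalist": [1, 1, 1, 0],
    "specialist": [0, 0, 0, 1],  # worse on average, but alone solves the last example
    "dominated":  [1, 0, 1, 0],  # never better than "generalist"
}
front = pareto_front(scores)  # ['generalist', 'specialist']
```

Ranking by aggregate score would put "specialist" last, yet it stays on the frontier because nothing else solves the fourth example.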

Selection strategy experiments

For a class, I ran small-model evaluations of GEPA and its alternatives on a few question-answering benchmarks. My main finding was that Lexicase selection, which picks parents based on per-example performance rather than aggregate score, outperformed the Pareto front variants and also generalized a bit better to test data. That being said, I imagine there’s much more interesting work to be done here with long-running agents writing and updating their skills.
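For reference, a minimal sketch of standard lexicase selection over a table of per-example scores. It follows the textbook algorithm and is not necessarily the exact variant used in the experiments above; `lexicase_select` and the toy scores are hypothetical.

```python
import random
from typing import Dict, List, Optional

def lexicase_select(
    scores: Dict[str, List[float]],
    rng: Optional[random.Random] = None,
) -> str:
    """Pick one parent by filtering candidates on training examples in random order."""
    rng = rng or random.Random()
    survivors = list(scores)
    cases = list(range(len(next(iter(scores.values())))))
    rng.shuffle(cases)  # a fresh random example order for every selection
    for case in cases:
        # Keep only the candidates tied for the best score on this example.
        best = max(scores[c][case] for c in survivors)
        survivors = [c for c in survivors if scores[c][case] == best]
        if len(survivors) == 1:
            break
    # Candidates still tied after all examples are chosen among at random.
    return rng.choice(survivors)

# Made-up per-example scores on four training examples.
scores = {
    "generalist": [1, 1, 1, 0],
    "specialist": [0, 0, 0, 1],
    "dominated":  [1, 0, 1, 0],
}
parent = lexicase_select(scores, random.Random(0))
```

With these scores, "specialist" is selected whenever the fourth example comes first in the shuffle and "generalist" otherwise, while "dominated" is never selected. Parents are drawn in proportion to how many examples they are uniquely good at, not by average score.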