SkillOpt

Summary

SkillOpt is a systematic text-space optimizer for agent skills that treats skill documents as trainable external state. It uses a separate frontier optimizer model to propose bounded add/delete/replace edits based on scored rollouts, and accepts edits only when they improve a held-out validation score. Mechanisms such as a textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill optimization stable. Across six benchmarks, seven target models, and three execution harnesses, SkillOpt achieves best or tied-best results on all 52 evaluated cells, significantly outperforming no-skill baselines and prior skill construction methods.

Key Points

SkillOpt formulates agent-skill learning as optimization over an external natural-language state, analogous to weight-space training.
Edits are bounded, validation-gated, and derived from trajectory analysis to ensure stable, monotonic improvement.
The system is harness-agnostic, working with direct chat, Codex-style, and Claude Code-style execution loops.
Evaluated on six diverse benchmarks (SearchQA, SpreadsheetBench, OfficeQA, DocVQA, LiveMathematicianBench, ALFWorld), SkillOpt is best or tied on all 52 (model, benchmark, harness) cells.
With GPT–5.5, SkillOpt lifts average accuracy by +23.5 points in direct chat, +24.8 inside Codex, and +19.1 inside Claude Code over no-skill baselines.
Optimized skills are compact (300–2,000 tokens), inspectable, and transfer across model scales, execution harnesses, and related benchmarks without re-optimization.

Concepts

Text-space optimization — Treating a natural-language skill document as an optimizable object, with edits guided by rollout feedback and validation.
Bounded skill edits — Changes to the skill are constrained by a textual learning-rate budget that prevents large semantic jumps from destabilizing optimization.
Held-out selection gate — A candidate skill is accepted only if it strictly improves performance on a validation split, preventing harmful proposals from accumulating.
Textual learning-rate budget — A limit on how much the skill document can be altered in a single update step, analogous to a learning rate in gradient-based optimization.
Rejected-edit buffer — Failed edits are retained as negative feedback for the optimizer to avoid repeating ineffective changes.
Epoch-wise slow/meta update — A mechanism that preserves editing directions across epochs, acting like a momentum term for long-horizon refinement.
Trace2Skill, TextGrad, GEPA, EvoSkill — Baseline methods for prompt optimization or skill evolution that SkillOpt outperforms.

Details

SkillOpt addresses the challenge of adapting frontier language models to domain-specific tasks by optimizing a portable skill document rather than model weights. The system operates in an iterative loop: given a target domain, an initial skill, and a frozen target model, SkillOpt samples a batch of rollouts, analyzes successes and failures, and asks a frontier optimizer model to produce structured add/delete/replace edits.

Several training-style controls ensure stability:

Rollout and reflection batch sizes control the noise in the evidence used for each edit.
Textual learning rate and schedule limit how far one skill version can deviate from the previous iteration.
Held-out selection gate evaluates the candidate skill on a held-out split and accepts it only if the validation score strictly improves.
Rejected-edit buffer stores failed edits as negative examples for later optimization steps.
Epoch-wise slow/meta update carries stable editing directions across epochs, smoothing the optimization trajectory.

The final output is a compact best_skill.md file (roughly 300–2,000 tokens) that can be deployed without modifying the target model or execution harness.
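The iterative loop and its acceptance gate can be sketched roughly as follows. This is a minimal illustration rather than the SkillOpt implementation: `propose_edit` and `evaluate` are hypothetical callables standing in for the optimizer model and the held-out selection evaluation, and the skill is treated as a plain string.

```python
from typing import Callable, List, Tuple


def optimize_skill(
    initial_skill: str,
    propose_edit: Callable[[str, List[str]], str],  # hypothetical: (skill, rejected) -> candidate
    evaluate: Callable[[str], float],               # hypothetical: held-out selection score
    steps: int,
) -> Tuple[str, float, List[str]]:
    """Validation-gated loop: keep a candidate only if it strictly improves."""
    best_skill = initial_skill
    best_score = evaluate(best_skill)
    rejected: List[str] = []  # rejected-edit buffer, passed back to the proposer
    for _ in range(steps):
        candidate = propose_edit(best_skill, rejected)
        score = evaluate(candidate)
        if score > best_score:  # strict improvement; ties are rejected
            best_skill, best_score = candidate, score
        else:
            rejected.append(candidate)
    return best_skill, best_score, rejected
```

Because the gate compares against the best accepted score, the selection score of the returned skill can never fall below that of the initial skill.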

Empirical evaluation spans six benchmarks covering question answering, spreadsheets, office documents, visual document understanding, mathematics, and embodied decision making. Seven target models from GPT–5.5 to small-scale Qwen are tested under three harnesses: direct chat, Codex agentic loop, and Claude Code. SkillOpt is best or tied on all 52 cells, outperforming human-written skills, one-shot LLM skills, Trace2Skill, TextGrad, GEPA, and EvoSkill. Notable gains include +23.5 points average over no skill on GPT–5.5 direct chat, and +24.8 and +19.1 points inside Codex and Claude Code loops, respectively.

Transfer experiments demonstrate that trained skill artifacts generalize beyond their exact training context:

A SpreadsheetBench skill trained on GPT–5.4 improves all smaller GPT variants.
A Codex-trained spreadsheet skill transfers to Claude Code with a +59.7 point gain.
An OlympiadBench skill improves performance on the nearby math benchmark Omni-MATH without additional optimization.

Ablation studies confirm that each component is critical: bounded textual learning outperforms uncontrolled rewriting, held-out gating prevents harmful edits from accumulating, the rejected-edit buffer provides useful negative feedback, and the epoch-wise slow/meta update stabilizes long-horizon refinement without bloating the skill document.

Per-benchmark case studies show that the optimized skills are compact (1–4 accepted edits), inspectable, and procedural rather than containing instance-specific memorization. This makes SkillOpt practical for real-world deployment where model weights cannot be fine-tuned and skill artifacts must be auditable and reusable.

Details (continued)

Formal Optimization Objective

SkillOpt formalizes skill optimization as a three-split procedure:

Train split (Dtr) provides rollout evidence for generating candidate skills.
Selection split (Dsel) gates updates: a candidate skill is accepted only if it improves the average score on Dsel over the current skill.
Test split (Dtest) is locked until final reporting and yields the reported performance.

The optimization problem is:

\[ s^\star_{\text{sel}} = \arg\max_{s \in \mathcal{C}(D_{\text{tr}})} \frac{1}{|D_{\text{sel}}|} \sum_{x \in D_{\text{sel}}} r(x; s), \quad \text{Test}(s^\star_{\text{sel}}) = \frac{1}{|D_{\text{test}}|} \sum_{x \in D_{\text{test}}} r(x; s^\star_{\text{sel}}) \]

where \(\mathcal{C}(D_{\text{tr}})\) is the set of candidate skills derived from train-split rollouts and \(r(x; s)\) is the score of the frozen target model on item \(x\) when run with skill \(s\).
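As a rough illustration of the three-split objective, the sketch below selects the candidate with the highest mean score on the selection split and only then computes its test score. The per-item scorer is written as a hypothetical function `r(skill, item)`; the function name `select_and_report` and the string-typed skills and items are likewise assumptions made for the example.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple


def select_and_report(
    candidates: Sequence[str],
    d_sel: Sequence[str],
    d_test: Sequence[str],
    r: Callable[[str, str], float],  # hypothetical scorer: r(skill, item)
) -> Tuple[str, float]:
    """Pick the candidate with the best mean selection score, then score it on test."""
    best = max(candidates, key=lambda s: mean(r(s, x) for x in d_sel))
    # The test split is touched only once, after selection is finished.
    return best, mean(r(best, x) for x in d_test)
```

Keeping the test computation outside the `max` call mirrors the requirement that the test split stays locked until final reporting.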

Rollout Accumulation

The implementation supports accumulation: several rollout batches are reflected on separately and then merged into a single update. This decouples execution throughput from update frequency, allowing rapid batched execution to fill a buffer before the optimizer processes them together.
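One way to picture accumulation is a small buffer that reflects on each rollout batch as it arrives and releases a merged proposal list only after a fixed number of batches. The class below is a sketch under assumptions: the name `RolloutAccumulator`, the `reflect` callable, and the order-preserving de-duplication used as the merge rule are all hypothetical.

```python
from typing import Callable, List, Optional, Sequence


class RolloutAccumulator:
    """Buffers per-batch reflections and merges them into one update."""

    def __init__(self, batches_per_update: int) -> None:
        self.batches_per_update = batches_per_update
        self._proposals: List[List[str]] = []

    def add_batch(
        self,
        batch: Sequence[str],
        reflect: Callable[[Sequence[str]], List[str]],  # hypothetical reflection step
    ) -> Optional[List[str]]:
        # Each batch is reflected on separately.
        self._proposals.append(reflect(batch))
        if len(self._proposals) < self.batches_per_update:
            return None  # keep filling the buffer
        merged = [p for props in self._proposals for p in props]
        self._proposals.clear()
        # Assumed merge rule: drop duplicates while keeping first-seen order.
        return list(dict.fromkeys(merged))
```

Returning `None` until the buffer is full lets the caller keep executing rollouts without triggering an optimizer update on every batch.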

Minibatch Reflection Details

The reflection step partitions trajectories into failures (low scores) and successes (high scores), then splits each group into reflection minibatches. Single-trajectory reflection tends to produce anecdotal fixes; minibatches expose reusable procedural errors (e.g., consistently searching the wrong source, writing answers in the wrong format, failing to verify tool results).

Failure minibatches propose missing or corrective rules.
Success minibatches preserve behaviors that already work.

Local proposals from each minibatch are merged hierarchically: first consolidate failure- and success-driven edits separately, then combine with priority on failure corrections. This filters duplicate, contradictory, and example-specific suggestions before ranking.
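The partition-then-merge flow might look like the sketch below. The score threshold of 0.5, the `(id, score)` trajectory shape, and the de-duplication rule are assumptions chosen for illustration; the text does not specify how low and high scores are separated or how conflicts are resolved.

```python
from typing import Dict, List, Sequence, Tuple

Trajectory = Tuple[str, float]  # (trajectory id, score): a hypothetical shape


def reflection_minibatches(
    trajectories: Sequence[Trajectory], size: int, threshold: float = 0.5
) -> Dict[str, List[List[Trajectory]]]:
    """Split trajectories into failure and success groups, then into minibatches."""
    failures = [t for t in trajectories if t[1] < threshold]
    successes = [t for t in trajectories if t[1] >= threshold]

    def chunk(items: List[Trajectory]) -> List[List[Trajectory]]:
        return [items[i:i + size] for i in range(0, len(items), size)]

    return {"failure": chunk(failures), "success": chunk(successes)}


def merge_proposals(failure_edits: List[str], success_edits: List[str]) -> List[str]:
    """Consolidate each group, then combine with failure corrections first."""
    fail = list(dict.fromkeys(failure_edits))  # de-duplicate, keep order
    succ = [e for e in dict.fromkeys(success_edits) if e not in fail]
    return fail + succ
```

Placing failure-driven edits first is a simple stand-in for the priority on failure corrections; a real merge would also need to detect contradictory and example-specific suggestions, which this sketch does not attempt.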

Edit Budget Schedules

The edit budget \(L_t\) (textual learning rate) can follow one of four schedules:

Constant – fixed budget per step.
Linear – linearly decaying budget.
Cosine – starts with larger edits and decays toward smaller consolidation steps (default).
Autonomous – the optimizer decides the budget.
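The three closed-form schedules can be sketched as below, using the default initial budget of 4 and floor of 2 stated in the hyperparameter list. The exact decay formulas are assumptions (standard linear and half-cosine interpolation between the initial value and the floor), and the autonomous schedule is not modeled because the optimizer model chooses that budget itself.

```python
import math


def edit_budget(
    step: int,
    total_steps: int,
    schedule: str = "cosine",
    initial: float = 4.0,
    floor: float = 2.0,
) -> float:
    """Return the edit budget for a step under a constant, linear, or cosine schedule."""
    if schedule not in ("constant", "linear", "cosine"):
        raise ValueError(f"no closed-form budget for schedule {schedule!r}")
    if schedule == "constant" or total_steps <= 1:
        return initial
    # Fraction of the run completed, clamped to [0, 1].
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    if schedule == "linear":
        return initial - (initial - floor) * progress
    # Cosine: starts at `initial`, decays smoothly to `floor`.
    return floor + 0.5 * (initial - floor) * (1.0 + math.cos(math.pi * progress))
```

Whether the budget counts edits, lines, or tokens is not stated in the text, so the return value is left as an abstract number.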

SkillOpt supports two edit modes: patch mode (localized append/insert/replace/delete) and rewrite mode (full skill rewrite conditioned on selected suggestions). Step-level edits cannot overwrite the protected slow-update field.

Epoch-Wise Slow/Meta Update (Mechanics)

At the end of an epoch, SkillOpt performs a controlled comparison:

Sample the same training items under the previous epoch’s skill and the current skill.
Group the resulting pairwise scores into four categories: improvements, regressions, persistent failures, stable successes.
The optimizer model writes a concise longitudinal guidance block into the protected slow-update field.

This candidate skill (with slow-update block) still passes through the validation gate. The meta skill is optimizer-side only: it is prepended to future optimizer prompts for reflection, merging, and ranking, but is not shipped with the target model. This keeps the deployed skill compact while retaining a rich editing history.
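The four-way grouping used in the controlled comparison can be illustrated with a short function. This sketch assumes each item has a scalar score under both skills and that a pass threshold of 0.5 separates success from failure; both the threshold and the name `categorize_pairs` are hypothetical.

```python
from typing import Dict, List


def categorize_pairs(
    prev_scores: Dict[str, float],
    cur_scores: Dict[str, float],
    pass_threshold: float = 0.5,
) -> Dict[str, List[str]]:
    """Group items scored under the previous and current skill into four categories."""
    groups: Dict[str, List[str]] = {
        "improvements": [],
        "regressions": [],
        "persistent_failures": [],
        "stable_successes": [],
    }
    # Only items evaluated under both skills can be compared.
    for item in sorted(prev_scores.keys() & cur_scores.keys()):
        before = prev_scores[item] >= pass_threshold
        after = cur_scores[item] >= pass_threshold
        if not before and after:
            groups["improvements"].append(item)
        elif before and not after:
            groups["regressions"].append(item)
        elif not before and not after:
            groups["persistent_failures"].append(item)
        else:
            groups["stable_successes"].append(item)
    return groups
```

The resulting groups are the kind of evidence the optimizer model could summarize into the longitudinal guidance block.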

Adapter Interface for Harness-Agnostic Deployment

SkillOpt uses a lightweight adapter interface that matches the trend toward agents embedded in tool-use and software-execution environments. The adapter:

Constructs train/evaluation batches.
Injects the current skill into the agent context.
Runs the native harness.
Returns scored trajectories.

This allows the same optimizer to work across direct QA, spreadsheet execution, document reasoning, multimodal QA, embodied environments, and Codex-style or Claude Code-style execution loops. A stronger optimizer model can train a reusable skill artifact offline, and the resulting best_skill.md can be deployed across target models, harnesses, and nearby benchmarks without changing model weights.
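A harness adapter of this kind could be expressed as a structural interface. The sketch below is illustrative only: `HarnessAdapter`, `ScoredTrajectory`, the method names, and the toy `EchoAdapter` are hypothetical and are not taken from the SkillOpt codebase.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol, Sequence


@dataclass
class ScoredTrajectory:
    task_id: str
    trace: str
    score: float


class HarnessAdapter(Protocol):
    """Hypothetical interface covering the four adapter responsibilities."""

    def make_batch(self, split: str, size: int) -> List[str]: ...

    def run(self, skill: str, task_ids: Sequence[str]) -> List[ScoredTrajectory]: ...


class EchoAdapter:
    """Toy adapter: a task 'succeeds' when the injected skill mentions its id."""

    def __init__(self, tasks: Dict[str, List[str]]) -> None:
        self.tasks = tasks

    def make_batch(self, split: str, size: int) -> List[str]:
        return self.tasks[split][:size]

    def run(self, skill: str, task_ids: Sequence[str]) -> List[ScoredTrajectory]:
        # A real adapter would inject the skill into the agent context and
        # invoke the native harness here.
        return [
            ScoredTrajectory(t, f"skill injected; ran {t}", float(t in skill))
            for t in task_ids
        ]
```

Because the optimizer would depend only on the two methods, a chat, Codex-style, or Claude Code-style harness could be swapped in without changing the optimization loop.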

Experiment Setup (Initial Details)

Experiments are conducted across two model families and six benchmarks (SearchQA, SpreadsheetBench, OfficeQA, DocVQA, LiveMathematicianBench, ALFWorld). The evaluation answers four questions:

Do optimized skills improve over no-skill, human-skill, one-shot LLM-skill, prompt-optimization (TextGrad, GEPA), and skill-evolution (Trace2Skill, EvoSkill) baselines?
Does the same loop work across direct chat, Codex, and Claude Code harnesses, and across seven target models from frontier-scale GPT to small Qwen?
Which optimizer controls matter?
What do the learned skills look like, and at what cost?

Hyperparameter Analysis

Ablation of scalar and scheduling factors (Table 2) using GPT–5.5 in direct‑chat mode on SearchQA, SpreadsheetBench, and LiveMathematicianBench reveals:

Training set size – Using only 1 example (20% of a normal 4:1:5 split) produces competitive results on SearchQA (81.0) but degrades on Spreadsheet (47.5) and LiveMath (59.1). Increasing to 20–40% of the training partition yields the best trade‑off; 80–100% gives marginal gains on Spreadsheet and LiveMath.
Reflection mini‑batch size – A sweep from 1 to 32 shows stable performance; larger mini‑batches (e.g., 16) are generally better for Spreadsheet (77.9) but have little effect on SearchQA or LiveMath.
Rollout batch size – Performance peaks at 24–40 examples per step; a full‑epoch batch (all training items) reduces Spreadsheet and LiveMath scores, suggesting that moderate batch sizes provide sufficient diversity without diluting the signal.
Textual learning rate – Values from 1 to 16 all yield above‑baseline results; lr=4 achieves 77.5 on Spreadsheet and 61.3 on LiveMath, while higher rates (lr=16) hurt Spreadsheet but improve LiveMath (66.9).
Learning‑rate scheduler – Constant (no decay) performs best on Spreadsheet (80.7) and SearchQA (87.3); cosine and linear are comparable on the other benchmarks.
Slow‑update samples – Using 20 tasks per epoch gives the best average performance; 5 or 10 samples are sufficient, while 40 samples can reduce LiveMath scores (54.8 vs. 61.3).

Component Ablations

Table 3 isolates the contribution of core mechanisms using GPT–5.5 on the same three benchmarks:

Learning‑rate form – Fixed lr=4 outperforms dynamic (model‑decided) learning rates and completely removing the budget (no lr bound). Dynamic lr degrades SpreadsheetBench by 5.7 points; no lr bound drops LiveMath by 4.0 points.
Rejected‑edit buffer – Removing the buffer reduces scores across all benchmarks: SpreadsheetBench falls from 77.5 to 72.9 (−4.6), LiveMath from 61.3 to 58.9 (−2.4), and SearchQA from 87.1 to 85.5 (−1.6).
Epoch‑wise slow/meta update – The full mechanism (meta skill + slow update) is crucial. Removing the meta skill alone drops Spreadsheet from 77.5 to 75.7 and LiveMath from 61.3 to 58.1. Removing both meta skill and slow update causes a severe collapse on Spreadsheet (77.5 → 55.0, −22.5) and smaller losses on the other two benchmarks.

Default Optimizer Hyperparameters

Unless specified, SkillOpt uses:

4 epochs
Rollout batch size 40 per step
Reflection minibatch size 8, 16 parallel analyst workers, merge batch size 8
Textual learning rate \(L_t = 4\) with cosine decay (floor \(L_t = 2\)); supported schedules: constant, linear, cosine, autonomous
Held‑out validation gating (strictly greater than current selection score; ties are rejected)
Slow update with 20 sampled tasks per epoch comparing previous‑epoch and current‑epoch skill
Optimizer‑side meta skill that summarizes accepted/rejected patterns into teacher‑only guidance (never shipped with the target model)
Patch edit mode (alternative: rewrite_from_suggestions)
Optional rejected‑edit buffer (stores recent failed proposals for later use)
Teacher reflection: up to three refinement rounds per minibatch
Teacher and student calls: medium reasoning effort
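For reference, the defaults listed above can be collected into a single configuration object. The field names below are hypothetical and chosen for readability; only the values come from the list. The optional rejected-edit buffer is omitted because its default state is not given.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SkillOptDefaults:
    """Default values from the list above; field names are illustrative."""

    epochs: int = 4
    rollout_batch_size: int = 40
    reflection_minibatch_size: int = 8
    analyst_workers: int = 16
    merge_batch_size: int = 8
    learning_rate: int = 4
    learning_rate_floor: int = 2
    schedule: str = "cosine"
    slow_update_samples: int = 20
    edit_mode: str = "patch"
    max_reflection_rounds: int = 3
    reasoning_effort: str = "medium"


defaults = SkillOptDefaults()
# A per-benchmark config can override selected fields and keep the rest.
small_pool = replace(defaults, rollout_batch_size=24)
```

Making the dataclass frozen means per-benchmark variants are created with `replace` rather than by mutating shared defaults.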

For benchmarks with small training pools (e.g., LiveMathematicianBench: 35 training items per epoch with rollout batch 200; ALFWorld: 39 training tasks with 140 selection and 134 test environments), per‑benchmark configs scale batch sizes while keeping the gate, scheduler, and slow/meta‑update machinery unchanged.

Benchmark Suite Diversity

The evaluation spans six benchmarks covering distinct agent–environment interaction patterns:

SearchQA – single‑round question answering
DocVQA – single‑round visual document QA
LiveMathematicianBench – multiple‑choice maths (single‑round)
OfficeQA – multi‑turn tool loops with up to 24 tool calls
SpreadsheetBench – multi‑round code generation with up to 30 turns and real openpyxl/pandas runtime (default mode multi)
ALFWorld – persistent embodied interaction with up to 50 steps per episode

All dataset‑backed runs use deterministic train/selection/test splits derived from the same seed (split_seed = 42). The selection split is used only to accept or reject candidate skill edits; all reported scores are computed on the disjoint held‑out test split, thereby measuring generalization rather than validation‑set fit.
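A deterministic, disjoint three-way split can be produced with a seeded shuffle, as in the sketch below. The shuffle-and-slice procedure is an assumption about how such splits might be built; the seed of 42 comes from the text, and the 2:1:7 default ratio is the one given in the experimental protocol details.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    items: Sequence[int],
    ratios: Tuple[int, int, int] = (2, 1, 7),
    split_seed: int = 42,
) -> Tuple[List[int], List[int], List[int]]:
    """Return disjoint train/selection/test lists that are stable for a given seed."""
    shuffled = list(items)
    # A private Random instance keeps the split independent of global RNG state.
    random.Random(split_seed).shuffle(shuffled)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_sel = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    selection = shuffled[n_train:n_train + n_sel]
    test = shuffled[n_train + n_sel:]  # remainder goes to test
    return train, selection, test
```

Calling the function twice with the same seed yields identical splits, which is what makes the selection and test sets comparable across runs.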

Transfer Experiments

Skills optimized with SkillOpt transfer across three axes without further optimization.

Cross-model transfer (Table 4a): A SpreadsheetBench skill trained on GPT–5.4 improves smaller GPT variants. For GPT–5.4‑mini, baseline 36.1 → transferred 45.5 (+9.4); for GPT–5.4‑nano, baseline 23.5 → transferred 26.5 (+3.0). On LiveMath, GPT–5.4‑mini baseline 14.7 → transferred 19.2 (+4.5); GPT–5.4‑nano baseline 23.2 → transferred 28.8 (+5.6). The same GPT–5.4 skill also improves the source model itself (in‑domain Direct column: SpreadsheetBench 41.4 → 62.5, LiveMath 36.8 → 44.0).

Cross-harness transfer (Table 4b): A Codex-trained SpreadsheetBench skill, when deployed inside Claude Code, raises the baseline from 22.1 to 81.8 (+59.7). In the reverse direction, a SpreadsheetBench skill trained in Claude Code and deployed in Codex yields 71.1 vs. a baseline of 27.5 (+43.6). On LiveMath, a Codex skill inside Claude Code yields 42.4 (+1.6); a Claude Code skill inside Codex yields 48.0 (+12.8). All experiments use GPT–5.5 as the target model.

Cross-benchmark transfer (Table 4c): An OlympiadBench skill improves the nearby math benchmark Omni-MATH across three target models. GPT–5.4 baseline 56.6 → transferred 60.3 (+3.7); GPT–5.4‑mini baseline 34.8 → transferred 36.6 (+1.8); GPT–5.4‑nano baseline 38.8 → transferred 40.1 (+1.3). No skill was trained for Omni-MATH; the transferred skill is the one optimized for OlympiadBench.

Every transferred row in Table 4 is positive; no row falls below the target’s no‑skill baseline.

Harness Details

SkillOpt operates under three execution harnesses, all consuming the same best_skill.md file format:

Direct chat – the target model is invoked through a single chat completion with the skill prepended to the system prompt.
Codex – drives the target through the codex CLI in a workspace-write sandbox. SkillOpt renders the current skill to a per-task SKILL.md alongside task files and reads back a compact execution trace (codex_trace_summary.txt). This trace is included in the teacher reflection context, so the optimizer learns from what the agent actually did, not just its final answer.
Claude Code – mirrors the same workspace contract through the claude CLI.

All three harnesses allow the same skill artifact to be deployed interchangeably, enabling the cross‑harness transfer experiments.

Baselines

SkillOpt compares against seven baselines:

No skill – frozen target model run with the benchmark’s default system prompt.
Human skill – an expert-written skill document curated per benchmark.
One-shot LLM skill – a single skill generated from a high‑level task description by GPT–5.5 and never updated.
Trace2Skill – trajectory‑level skill distillation.
TextGrad – gradient‑style natural‑language prompt optimization.
GEPA – Pareto reflective prompt evolution.
EvoSkill – skill‑folder evolution under failure analysis (harness‑side competitor).

All baselines use the same target model, same held‑out test split, and same scorer for every benchmark. The comparison isolates the adaptation procedure from secondary factors such as prompt template or scoring pipeline.

Further Ablation Details

Ablation studies (GPT–5.5 on SearchQA, SpreadsheetBench, LiveMathematicianBench; direct chat) examine evidence, batch sizes, textual learning rate, and the slow/meta update:

Evidence and batch sizes (panels a, b, c):

Training set size (panel a): Using 1 example (20% of normal 4:1:5 split) yields SearchQA 81.0, SpreadsheetBench 47.5, LiveMath 59.1. Increasing to 20–40% of the training partition gives the best trade‑off; 80–100% gives marginal gains on Spreadsheet and LiveMath.
Reflection mini‑batch size (panel b): Sweep from 1 to 32 shows stable performance. Larger mini‑batches (e.g., 16) are generally better for SpreadsheetBench (77.9) but have little effect on SearchQA or LiveMath.
Rollout batch size (panel c): Performance peaks at 24–40 examples per step. A full‑epoch batch reduces SpreadsheetBench and LiveMath scores; moderate batch sizes provide sufficient diversity without diluting the signal.

Textual learning rate and schedule (panels d, e):

Learning rate (panel d): Values from 1 to 16 all yield above‑baseline results. Lt=4 gives 86.5/78.2/56.5 on SearchQA/Spreadsheet/LiveMath; the highest LiveMath score belongs to Lt=8 at 66.9; the lowest across all five settings is still 85.5 on SearchQA.
Schedule (panel e): The constant schedule scores 87.3/80.7/62.1; cosine 87.1/77.5/61.3; linear 87.2/72.9/62.9. The bounded‑update effect does not depend on a specific scheduler.

Epoch‑wise slow/meta update (panel f, Table 3):

Slow‑update sampling (panel f): Default 20 examples per epoch gives 87.1/77.5/61.3; 5, 10, and 40 each within ±2.7 points.
Component ablations (Table 3): Removing the rejected‑edit buffer lowers scores by 1.6 (SearchQA), 4.6 (SpreadsheetBench), and 2.4 (LiveMath). Removing both meta skill and slow update causes a severe collapse on SpreadsheetBench from 77.5 to 55.0 (−22.5), with smaller losses on the other two benchmarks.

Alternative Explanations for Gains

The per‑cell baselines clarify that the effect is not simply prompt length: human skills are already 145–516 tokens long and often exceed the one‑shot LLM skill, yet they are beaten in every direct‑chat model row while the learned artifacts remain compact (Table 6). It is also not only optimizer capacity: SkillOpt leads every baseline even for GPT–5.4‑nano, and the optimizer‑strength analysis (Table 5) shows that a target‑matched optimizer recovers much of the gain. Finally, the harness results show the method is not exploiting one skill format: EvoSkill already improves the Codex SpreadsheetBench cell from 27.5 to 67.5, but SkillOpt adds another +17.5 points (67.5→85.0). The gains are largest on procedural benchmarks, where reusable rules about tool use and output formatting matter most, but they also appear on factual and multimodal benchmarks.

Gate Strictness and Edit Observability

The validation gate is intentionally strict: a candidate skill is accepted only when its selection-split score is strictly greater than the current selection score, so ties are rejected and the deployed skill never silently drifts. This conservative criterion makes rejected edits informative negative feedback rather than hidden state.

Operationally, every step records an edit_apply_report.json containing per-edit accept/skip status, so the source of every change to best_skill.md is recoverable after the fact. The epoch-wise slow/meta update writes into a markup-fenced protected region of the skill document that step-level edits cannot overwrite, separating the fast intra-epoch update from the slower cross-epoch consolidation. The optimizer-side meta skill lives only in the teacher's reflection context and is never shipped with the deployed artifact. These implementation choices explain why removing both meta skill and slow update is especially damaging on SpreadsheetBench: it removes the long-horizon evidence stream and the protected-region contract that keeps local edits from overwriting durable procedural lessons.
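The protected-region contract and the per-edit report can be illustrated with a small replace-only patcher. Everything concrete here is an assumption: the marker strings, the `(old, new)` edit shape, and the report fields are hypothetical, since the text does not give the fence syntax or the schema of edit_apply_report.json.

```python
from typing import Dict, List, Tuple

# Hypothetical fence markers for the protected slow-update region.
START = "<!-- SLOW_UPDATE_START -->"
END = "<!-- SLOW_UPDATE_END -->"


def apply_step_edits(
    skill: str, edits: List[Tuple[str, str]]
) -> Tuple[str, List[Dict[str, str]]]:
    """Apply (old, new) replacements, skipping any that touch the protected region."""
    report: List[Dict[str, str]] = []
    for old, new in edits:
        lo, hi = skill.find(START), skill.find(END)
        protected = (lo, hi + len(END)) if lo != -1 and hi != -1 else None
        pos = skill.find(old)
        if pos == -1:
            status = "skipped: target text not found"
        elif protected and pos < protected[1] and pos + len(old) > protected[0]:
            status = "skipped: touches protected region"
        else:
            skill = skill[:pos] + new + skill[pos + len(old):]
            status = "applied"
        # One report entry per edit keeps every change traceable.
        report.append({"target": old, "status": status})
    return skill, report
```

Recomputing the region bounds on every iteration keeps the check correct after earlier edits have shifted character offsets.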

Performance Trends Across Epochs

Validation checkpoints track held-out test performance across epochs (Figure 3), confirming that the gate tends to select skills that generalize rather than skills that only fit the selection split.

Deeper Transfer Analysis

Cross-model transfer. A SpreadsheetBench skill trained on GPT‑5.4 transfers to GPT‑5.4‑mini (+9.4) and GPT‑5.4‑nano (+3.0); LiveMath skills transfer to GPT‑5.4‑mini (+4.5) and GPT‑5.4‑nano (+5.6). All four cross-model rows are positive, and on one of them the transferred skill surpasses the in-domain SkillOpt reference (LiveMath GPT‑5.4‑nano: 28.8 transferred vs. 27.2 in-domain), suggesting that some learned procedures are target-model agnostic. The remaining cross-model rows still recover a useful fraction of the in-domain gain—e.g., SpreadsheetBench GPT‑5.4‑mini retains 82% of the in-domain gain (+9.4 of +11.4)—and no row falls below the target's no‑skill baseline.

Cross-harness transfer. A SpreadsheetBench skill trained inside the Codex loop transfers to Claude Code with absolute gain +59.7 over the Claude Code no‑skill baseline (22.1→81.8, slightly exceeding the in-domain Claude Code SkillOpt reference of 80.4); the symmetric Claude‑Code→Codex transfer adds +43.6 on top of the Codex baseline (27.5→71.1). On LiveMath, the Codex→Claude Code transfer is smaller (+1.6 over a 40.8 baseline) but still positive, while the Claude‑Code→Codex transfer adds +12.8 (35.2→48.0). Because the two harnesses expose different tool/file APIs and command surfaces, these positive transfers suggest that the learned rules are not only harness-specific command recipes. In SpreadsheetBench especially, the transferred skill appears to encode workbook-level procedures such as structure-first inspection, formula-aware verification, and static-value materialization, so the cost of optimizing a skill in one execution environment can be amortized across related deployment environments.

Cross-benchmark transfer. On the OlympiadBench→Omni‑MATH direction, transferred skills are positive on all three model scales: +3.7 on GPT‑5.4, +1.8 on GPT‑5.4‑mini, and +1.3 on GPT‑5.4‑nano. These rows are smaller than in-domain and cross-harness transfers—unsurprisingly, since they require the optimized skill to retain useful procedural knowledge after both test instances and answer-format conventions change—but they remain uniformly positive, supporting the interpretation that the optimized skill encodes reusable mathematical procedure rather than memorized benchmark-specific formatting.

Optimizer Strength Analysis

Because the optimizer runs only during offline training and is never invoked at deployment, optimizer choice is a training-time lever: a stronger optimizer can improve the deployed skill without raising inference cost. Table 5 quantifies this by running the same loop with two optimizer regimes—a strong frontier optimizer (GPT‑5.5) and a target-matched optimizer that shares the target model—while holding all other components identical.

Two observations follow:

The stronger optimizer produces larger absolute gains on every (benchmark, target) cell: GPT‑5.4‑nano lifts by +19.0 vs. +11.9 on SpreadsheetBench and +19.0 vs. +14.1 on SearchQA; GPT‑5.4‑mini follows the same ordering (+11.4 vs. +7.1 on SpreadsheetBench, +4.3 vs. +2.4 on SearchQA). The bounded-edit, validation-gated loop ensures monotonicity: without the gate, a stronger optimizer could push larger but harmful rewrites.
The target-matched optimizer is far from collapsed: it recovers 56–74% of the strong-optimizer gain across the four cells. This confirms that SkillOpt is not a distillation pipeline from a stronger teacher into a weaker student; the optimization loop itself contributes substantial value on top of whatever the optimizer can already do. A high-capacity frontier optimizer is the right default whenever available, since it costs only training-time API calls, while the same loop remains effective if the budget forces a target-matched optimizer.
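The 56–74% range can be checked directly from the four pairs of gains quoted above. The snippet below only restates those numbers; the dictionary layout is an arbitrary choice for the example.

```python
# (strong-optimizer gain, target-matched gain) per cell, as quoted in the text.
gains = {
    ("SpreadsheetBench", "GPT-5.4-nano"): (19.0, 11.9),
    ("SearchQA", "GPT-5.4-nano"): (19.0, 14.1),
    ("SpreadsheetBench", "GPT-5.4-mini"): (11.4, 7.1),
    ("SearchQA", "GPT-5.4-mini"): (4.3, 2.4),
}

# Fraction of the strong-optimizer gain recovered by the target-matched optimizer.
recovered = {cell: matched / strong for cell, (strong, matched) in gains.items()}

for (benchmark, target), fraction in recovered.items():
    print(f"{benchmark:17s} {target:13s} {fraction:.0%}")
```

The smallest ratio is 2.4/4.3 (about 56%) and the largest is 14.1/19.0 (about 74%), matching the stated range.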

Learned Skills: Compactness, Cost, and Examples

Compactness. Final best_skill.md ranges from 379 tokens (LiveMathematicianBench) to 1,995 tokens (SpreadsheetBench), with a median of roughly 920 tokens. Even the longest learned skill fits well below a typical system-prompt budget. Relative growth from initial to final skill varies widely (×2.5 to ×53), but the absolute final size stays small enough for a domain practitioner to read, audit, and edit in minutes.
Edit economy. Gains come from very few accepted edits: between 1 and 4 (median 2.5) across the six benchmarks. LiveMathematicianBench’s +29.3 point gain over no skill arises from a single accepted edit; OfficeQA’s +39.0 point gain similarly from one accepted edit. The bulk of the optimizer’s text-space search is rejected by the validation gate.
Cost per point. Training tokens per absolute test-point gain varies from 0.6M (SpreadsheetBench) to 46.4M (DocVQA). Table 6 details initial/final token lengths, number of edits, total training tokens, and cost per point for each benchmark.

Representative Learned Rules

Figure 4 reproduces a single representative rule from each benchmark’s final best_skill.md (GPT‑5.5 / GPT‑5.5 runs; Table 6). Every rule is procedural, not instance-specific, and addresses a recurring failure mode:

SearchQA: “Infer the expected answer type from clue wording, then choose the shortest canonical entity supported by co-occurring distinctive evidence.”
SpreadsheetBench: “Inspect workbook structure and formulas, then write evaluated static values across the full requested target range instead of relying on Excel recalculation.”
OfficeQA: “Treat oracle parsed pages as primary evidence, lock table/date/unit context, and output exactly the requested rounded value without extra labels.”
DocVQA: “For tables, forms, charts, and legends, first bind the question to the exact visual row/header/field, then copy only the aligned answer span.”
LiveMathematicianBench: “In strongest‑statement MCQs, rank choices by theorem strength and prefer a justified stronger‑result option over true but weaker corollaries.”
ALFWorld: “Keep a horizon‑aware visited/frontier ledger, diversify search after repeated same‑type failures, and avoid revisiting the destination until holding the target.”

All six rules encode the kind of discipline that frontier models lack zero‑shot: answer‑format constraints, evidence binding, workbook‑structure‑first reasoning, and search‑frontier management. They read like rules a thoughtful human practitioner would write after a day with the benchmark, yet they are discovered automatically and validated edit‑by‑edit on held‑out data.

Qualitative Skill Evolution

Two representative runs illustrate how SkillOpt evolves a generic initial skill into a compact, stateful procedure by adding only 1–4 accepted edits.

ALFWorld (student: GPT‑5.4‑nano, optimizer: GPT‑5.5).
The initial skill gives a general household plan: search for the target object, pick it up, transform if needed, and place at the destination. Accepted edits add statefulness and loop‑breaking rules:

exact object‑name matching (no substitution of mugs for cups, etc.)
visited‑location memory (prefer unvisited receptacles)
destination memory and progress locks (avoid checking already‑completed subgoals)
direct completion rules (take admissible action when next subgoal can be satisfied, instead of examining or verifying).

The skill evolves from a search‑transform‑place strategy into a finite‑state execution policy with object identity, search memory, progress locks, and loop breakers. Held‑out test performance improves from 49.3 to 74.6.

SpreadsheetBench (student: GPT‑5.5, optimizer: GPT‑5.5).
The initial skill instructs the agent to use Python spreadsheet libraries and preserve unrelated workbook content. Accepted edits add workbook‑forensics behavior:

inspect the actual workbook rather than rely on previews
locate headers and target ranges across multiple sheets
normalize keys and cell types before lookup or aggregation
preserve formatting during structural edits
for formula‑style prompts, compute and write evaluated static values (even if the prompt mentions formulas like INDEX/MATCH or XLOOKUP)
fill complete target ranges including currently blank result cells
keep helper computations in Python rather than adding workbook artifacts
reopen the saved workbook to check boundary rows and remaining blanks.

Performance improves from 40.4 to 78.9 on the held‑out test set.

In both cases, SkillOpt does not replace the initial skill with an unrelated prompt; it adds compact procedural constraints around recurring failure modes observed in rollouts.

Conclusion

We presented SkillOpt, a text‑space optimizer that treats an external skill document as the trainable state for frozen LLM agents. By separating the target model that executes tasks from the optimizer that edits skills, and by using bounded edit budgets, minibatch reflection, held‑out validation gates, rejected‑edit buffers, and epoch‑wise slow/meta update, SkillOpt turns skill improvement into a controlled learning process rather than ad hoc prompt revision. Across six benchmarks, seven target models, and three execution modes, SkillOpt is best or tied‑best on 52 of 52 evaluated cells, lifts GPT‑5.5 by +23.5 points on average over no skill in direct chat and by +24.8/+19.1 points under Codex and Claude Code harnesses, and beats the strongest per‑cell baseline from human, LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills by +5.4 points on average. Per‑benchmark case studies confirm that gains arise from compact (< 2,000 token), interpretable skill artifacts assembled from only 1–4 accepted edits, and that the deployed skills transfer across model scales, harnesses, and nearby benchmarks. These results suggest that compact natural‑language skills can serve as a practical domain‑adaptation layer for frontier agents, enabling reusable improvement without modifying model weights.

Outlook. SkillOpt optimizes a single skill artifact for a single target domain; natural extensions include skill libraries that share infrastructure across domains, reuse of optimizer‑side meta skills across benchmarks, reward‑free or preference‑driven validation gates for open‑ended tasks, and self‑distillation of optimized skills back into the target model as a stepping stone toward weight‑level adaptation. We hope that treating the skill itself as the trainable object—rather than as a side artifact of prompting—will let future work apply the full toolkit of optimization (learning rates, schedules, regularization, curricula, validation) to a part of the agent stack that has so far been hand‑engineered.

Limitations

Requires scored trajectories and a held-out selection split; most directly applicable when the target task has automatic verifiers, exact-match metrics, executable checks, or otherwise reliable feedback signals.
For open-ended domains where success is subjective, multi-dimensional, or costly to judge, the validation gate may need stronger human or model-based evaluation.
Training the skill requires additional rollout computation and calls to an optimizer model; this cost is amortized when the same skill is reused but may be less attractive for one-off tasks.
SkillOpt optimizes a single portable skill rather than growing a large skill library or changing model weights; a single skill may be insufficient for highly heterogeneous domains requiring many disjoint procedures.
Optimized skills can encode domain-specific heuristics from the training distribution; careful held-out evaluation remains necessary before transferring to substantially different models, harnesses, or task settings.

Experimental Protocol Details

Default split ratio: Dataset-backed benchmarks use deterministic train/selection/test splits with a default 2:1:7 ratio when no benchmark-specific split is stated.
Additional benchmark: The evaluation also includes SealQA, which stresses noisy retrieval, in addition to the six benchmarks listed earlier (SearchQA, SpreadsheetBench, OfficeQA, DocVQA, LiveMathematicianBench, ALFWorld).
Metrics: Each benchmark uses its native evaluator; reported metrics are hard success or exact-match accuracy on held-out test examples.
Baselines: The no-skill baseline evaluates the frozen target model without an optimized skill document. Human-skill and LLM-skill baselines use manually written and one-shot generated skills, respectively.
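The default 2:1:7 split can be illustrated with a minimal sketch. The source only states that the split is deterministic, so the hashing scheme, the salt string, and the function name split_2_1_7 below are assumptions chosen for illustration, not the procedure actually used.

```python
import hashlib

def split_2_1_7(example_ids, ratio=(2, 1, 7), salt="split-v1"):
    """Deterministically partition ids into train/selection/test.

    Hypothetical helper: ids are ordered by a salted SHA-256 digest, so the
    result does not depend on input order or on any random seed state.
    """
    def key(ex_id):
        return hashlib.sha256(f"{salt}:{ex_id}".encode()).hexdigest()

    ordered = sorted(example_ids, key=key)
    total = sum(ratio)
    n = len(ordered)
    n_train = n * ratio[0] // total
    n_sel = n * ratio[1] // total
    return {
        "train": ordered[:n_train],
        "selection": ordered[n_train:n_train + n_sel],
        "test": ordered[n_train + n_sel:],  # remainder goes to test
    }

if __name__ == "__main__":
    parts = split_2_1_7([f"ex{i}" for i in range(100)])
    print({name: len(ids) for name, ids in parts.items()})
```

With 100 examples this yields 20 training, 10 selection, and 70 test items, and repeated calls return the same partition.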

Optimization Algorithm (Algorithm 1)

State variables maintained during optimization: s_cur (current skill), s_best (best validation-gated skill), selection-score cache C, step buffer B (stores rejected edits and observed failure patterns), and optimizer-side meta skill m_meta used only to guide future edit generation.
Epoch loop: For each epoch, training data is shuffled into rollout batches. Each optimization step collects A accumulation batches by executing the harness with the current skill, splits evidence into failures and successes, and processes them through minibatches of size B_m.
Proposal generation: The optimizer model produces failure‑patch proposals and success‑patch proposals separately. Proposals are merged (failure‑prioritized) and ranked under the current edit‑budget L_t.
Candidate and gate: The selected edits are applied to form a candidate skill s̃. If s̃ has been evaluated before (cache hit), the stored score is reused; otherwise it is evaluated on D_sel. The candidate replaces s_cur only if its selection score is strictly greater than the selection score of s_cur.
Slow update and meta skill: At the end of each epoch (e ≥ 2), if slow update is enabled, the same sampled tasks are run under the previous epoch‑end skill and the current skill. The optimizer produces protected longitudinal guidance that is injected into the skill document’s slow‑update region and validated through D_sel. If optimizer memory is enabled, m_meta is updated for future edit generation.
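The candidate-and-gate step described above can be sketched in a few lines. This is a simplified illustration rather than Algorithm 1 itself: the callables propose and evaluate_on_sel are hypothetical stand-ins for the optimizer model and for scoring on D_sel, and skills are plain strings.

```python
from typing import Callable, Dict, List, Tuple

def gated_step(
    s_cur: str,
    cur_score: float,
    propose: Callable[[str, List], Tuple[str, List]],
    evaluate_on_sel: Callable[[str], float],
    cache: Dict[str, float],
    buffer: List,
) -> Tuple[str, float, bool]:
    """Run one validation-gated update (hypothetical sketch).

    propose returns (candidate_skill, edits) and may read the step buffer;
    evaluate_on_sel returns the candidate's score on the selection split.
    """
    candidate, edits = propose(s_cur, buffer)
    if candidate not in cache:           # cache miss: score on the selection split
        cache[candidate] = evaluate_on_sel(candidate)
    cand_score = cache[candidate]        # cache hit: reuse the stored score
    if cand_score > cur_score:           # strict improvement required; ties are rejected
        return candidate, cand_score, True
    buffer.append(edits)                 # remember rejected edits for later proposals
    return s_cur, cur_score, False
```

A caller would loop over steps, carrying s_cur, its score, the cache, and the buffer forward, and would track s_best separately.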

Optimizer Prompt Contracts

SkillOpt uses structured JSON‑output prompts to make edits parseable, filterable, and gatable without manual intervention. Four primary prompt templates are used:

Failure analysis (analyst_error.md): Given multiple failed trajectories from one minibatch and the current skill, identify common failure patterns and propose at most L edits. Edits must be generalizable (no hardcoded task values) and are specified as operations: append, insert_after, replace, or delete. The output includes a failure_summary with count and description, and a patch containing the edits. The optimizer is explicitly instructed not to modify content between the SLOW_UPDATE_START and SLOW_UPDATE_END markers.
Success analysis (analyst_success.md): Given multiple successful trajectories from one minibatch, identify generalizable behavior patterns common across the batch. Proposals are only made for patterns not already covered in the skill. Edits are bounded by L and follow the same JSON format. The same restriction on the protected slow‑update region applies.
Failure merge (merge_failure.md): Merges multiple independently proposed patches from failure analysis into one coherent, non‑redundant patch. Guidelines include deduplication, conflict resolution, preserving unique insights, and estimating a support_count for each merged edit. Edits must not target the protected slow‑update region.
Success merge (merge_success.md): Merges success‑driven patches conservatively. Only generalizable patterns not already in the skill are included. An estimated support_count is reported per edit.

All prompts require valid JSON output without markdown fences, enabling fully automated edit parsing and application.
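As an illustration of why fence-free JSON matters, the sketch below parses one optimizer reply and filters it. The nesting patch -> edits and the per-edit key op are assumed field names; the source specifies only that the output contains a failure_summary and a patch with edits drawn from four operations.

```python
import json

ALLOWED_OPS = {"append", "insert_after", "replace", "delete"}

def parse_patch(raw: str, max_edits: int) -> list:
    """Parse an optimizer reply and keep at most max_edits well-formed edits.

    Assumed (hypothetical) schema: {"patch": {"edits": [{"op": ...}, ...]}}.
    json.loads raises json.JSONDecodeError (a ValueError) when the reply is
    wrapped in markdown fences or is otherwise not valid JSON.
    """
    data = json.loads(raw)
    edits = data["patch"]["edits"]
    valid = [e for e in edits if e.get("op") in ALLOWED_OPS]
    return valid[:max_edits]             # enforce the per-prompt bound L

if __name__ == "__main__":
    reply = json.dumps({
        "failure_summary": {"count": 3, "description": "skips unit checks"},
        "patch": {"edits": [
            {"op": "append", "content": "Verify units before answering."},
            {"op": "rewrite_everything", "content": "..."},
            {"op": "delete", "target": "Guess when unsure."},
        ]},
    })
    print(parse_patch(reply, max_edits=2))
```

The unknown operation is dropped by the filter, so only the append and delete edits survive.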

Ablation Protocol Details

One‑factor ablations hold all remaining optimizer configuration fixed while varying a single scalar or component.
The train‑size ablation uses a fixed 2:1:7 train/selection/test split and varies how much of the training partition (not the full dataset) is exposed. The 100% row uses the full training partition under the same split, so it is directly comparable to smaller‑subset rows.
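A minimal sketch of the train-size ablation, assuming the exposed subset is taken as a prefix of an already fixed training partition so that smaller subsets are nested inside larger ones. The prefix rule and the function name train_subset are illustrative assumptions; the source states only that the selection and test partitions stay fixed.

```python
def train_subset(train_ids, fraction):
    """Return the first `fraction` of a fixed training partition.

    Hypothetical helper: the selection and test partitions are untouched,
    so every row of the ablation is scored on the same held-out data.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(len(train_ids) * fraction))
    return list(train_ids[:k])

if __name__ == "__main__":
    train = [f"ex{i}" for i in range(20)]
    for frac in (0.25, 0.5, 1.0):
        print(frac, len(train_subset(train, frac)))
```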

Baseline Reporting Convention

Entries for baselines that were not measured under the final aligned protocol are marked with “–” rather than mixed with incompatible runs. All reported scores use the same target model, held‑out test split, and scorer for each benchmark.

Prompt Contracts for Optimizer

SkillOpt uses structured JSON-output prompts to enforce parseable, filterable, and gatable edits. Beyond the base failure/success analysis and merge templates, four additional prompt templates are used:

Final merge (merge_final.md) – Combines failure‑driven and success‑driven patches into one coherent set. Failure patches take priority; deduplication and conflict resolution are applied. No edits may target the protected SLOW_UPDATE_START/SLOW_UPDATE_END region. Output includes support_count and source_type for each edit.
Ranking and selection (ranking.md) – Ranks candidate edits by systematic impact, complementarity, generality, and actionability. The optimizer selects the top edits up to the given budget L_t. Outputs a JSON object with selected_indices in priority order.
Slow update (slow_update.md) – Runs at epoch boundaries (epoch ≥ 2). Receives two consecutive skill versions and longitudinal comparison data (same tasks under both skills, categorized into regressions, persistent failures, improvements, stable successes). Writes a strategic guidance block that overwrites the protected slow‑update section. Previous guidance is reflected upon; new guidance prioritizes preventing regressions, fixing persistent failures, and reinforcing successful patterns. Only this epoch‑boundary process may rewrite the protected region.
Optimizer memory / meta skill (meta_skill.md) – At epoch boundaries, writes a compact optimizer-side meta skill that captures which edit types, abstraction levels, and failure-repair patterns work in the current environment. It is addressed to the future optimizer, not the target model, and its evidence comes from adjacent-epoch comparisons. Previous optimizer memory is revised or removed if ineffective. This meta skill is never shipped with the deployed artifact.

All prompts require valid JSON output without markdown fences, enabling fully automated edit parsing, application, and filtering.
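The longitudinal comparison fed to the slow-update prompt can be sketched as a four-way bucketing of per-task outcomes. The dictionary layout (task id mapped to a boolean success flag) and the function name are assumptions for illustration; only the four category names come from the description above.

```python
def categorize_longitudinal(prev_results: dict, cur_results: dict) -> dict:
    """Bucket tasks run under two consecutive skill versions.

    Inputs map task id -> True (success) or False (failure); this layout is
    a hypothetical simplification. Tasks missing from either run are ignored.
    """
    buckets = {
        "regressions": [],          # succeeded before, fails now
        "persistent_failures": [],  # failed under both skills
        "improvements": [],         # failed before, succeeds now
        "stable_successes": [],     # succeeded under both skills
    }
    for task_id in sorted(prev_results.keys() & cur_results.keys()):
        before, after = prev_results[task_id], cur_results[task_id]
        if before and not after:
            buckets["regressions"].append(task_id)
        elif not before and not after:
            buckets["persistent_failures"].append(task_id)
        elif not before and after:
            buckets["improvements"].append(task_id)
        else:
            buckets["stable_successes"].append(task_id)
    return buckets
```

Regressions and persistent failures are the buckets the guidance is said to prioritize, so a caller would typically surface those first in the prompt.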

Patch Representation and Safeguards

Atomic operations: Patch‑mode optimization restricts each update to four operations: append, insert_after, replace, delete. Each merged edit records a support_count (how many independent analyses support it) and a source_type (failure or success), allowing ranking to prefer edits that survive hierarchical merging.
Edit budget L_t: Acts as a textual learning rate, limiting how many proposed edits can be applied at a step, preserving continuity between adjacent skills. Early steps allow larger changes; later steps enforce smaller refinements.
Protected slow-update section: The region between the SLOW_UPDATE_START and SLOW_UPDATE_END markers is off-limits to all step-level prompts. Only the epoch-boundary slow-update process may rewrite it, and the rewritten skill still passes through the held-out selection gate.
Rejected edit buffer: Failed candidates are not discarded; their failure patterns and rejected edits are stored in a step buffer so that later optimizer calls can avoid repeating harmful changes.
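The safeguards above can be combined in a small patch applier. This is a sketch under several assumptions: edits are dictionaries with op, target, and content keys, targets are matched as exact substrings (first occurrence only), and the markers are written as HTML comments. None of these details are confirmed by the source, which names only the four operations, the budget L_t, and the SLOW_UPDATE_START/SLOW_UPDATE_END region.

```python
START = "<!-- SLOW_UPDATE_START -->"  # assumed marker syntax
END = "<!-- SLOW_UPDATE_END -->"      # assumed marker syntax

def protected_span(skill: str):
    """Return (start, end) offsets of the protected region, or None."""
    i, j = skill.find(START), skill.find(END)
    if i == -1 or j == -1 or j < i:
        return None
    return i, j + len(END)

def apply_edits(skill: str, edits: list, budget: int):
    """Apply at most `budget` edits, skipping any that touch the protected region."""
    applied = []
    for edit in edits:
        if len(applied) >= budget:       # budget acts as the textual learning rate
            break
        op = edit.get("op")
        target = edit.get("target", "")
        content = edit.get("content", "")
        if op == "append":               # appended text lands at the end of the skill
            skill = skill.rstrip("\n") + "\n" + content + "\n"
            applied.append(edit)
            continue
        pos = skill.find(target) if target else -1
        if pos == -1:
            continue                     # target text not present: skip
        span = protected_span(skill)
        if span and pos < span[1] and pos + len(target) > span[0]:
            continue                     # target overlaps the protected region: skip
        cut = pos + len(target)
        if op == "insert_after":
            skill = skill[:cut] + "\n" + content + skill[cut:]
        elif op == "replace":
            skill = skill[:pos] + content + skill[cut:]
        elif op == "delete":
            skill = skill[:pos] + skill[cut:]
        else:
            continue                     # unknown operation: skip
        applied.append(edit)
    return skill, applied
```

Edits that are skipped do not consume budget here; counting attempted rather than applied edits would be an equally plausible reading of the budget.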

Design Principles

SkillOpt’s implementation follows five design principles:

Fixed task‑execution model – Only the text skill is optimized; the model weights and execution harness remain frozen.
Selection‑split validation – Every candidate skill is evaluated on a held‑out selection split before acceptance, preventing unvalidated reflection from accumulating.
Hierarchical minibatch merging – Minibatch analyses (failure and success) are merged hierarchically so that final edits represent recurring evidence rather than single examples.
Edit budget as learning‑rate analogue – The budget L_t allows larger early changes and smaller late refinements, mimicking a learning‑rate schedule.
Lightweight deployed skill – The final best_skill.md is compact and inspectable; the optimizer‑side meta skill stays separate and is never shown to the task‑execution model.