
Iterate on difficult problems


Difficulty: Advanced

Best for:

  • Problems where each iteration can be scored, but the best result usually takes many passes
  • Tasks with visual or subjective outputs that need both deterministic checks and an LLM-as-a-judge score
  • Long-running Codex sessions where you want progress tracked clearly instead of relying on context

Starter prompt

I have a difficult task in this workspace and I want you to run it as an eval-driven improvement loop. Before changing anything:

  • Read AGENTS.md.
  • Find the script or command that scores the current output.

Iteration loop:

  • Make one focused improvement at a time.
  • Re-run the eval command after each meaningful change.
  • Log the scores and what changed.
  • Inspect generated artifacts directly. If the output is visual, use view_image.
  • Keep going until both the overall score and the...
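
The "re-run the eval, then log the scores and what changed" steps above could be backed by a small helper like the one below. This is a minimal sketch, not part of the original prompt. It assumes the eval command prints a numeric score on the last line of stdout, and the names `run_eval`, `log_iteration`, `best_score`, and the `eval_log.jsonl` file are all hypothetical.

```python
import json
import subprocess
import sys
import tempfile
import time
from pathlib import Path
from typing import List, Optional


def run_eval(command: List[str]) -> float:
    """Run the eval command and return its score.

    Assumption: the command prints the score as the last line of stdout.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return float(result.stdout.strip().splitlines()[-1])


def log_iteration(log_path: Path, score: float, change: str) -> dict:
    """Append one iteration (score plus a note on what changed) to a JSON Lines log."""
    entry = {"time": time.time(), "score": score, "change": change}
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


def best_score(log_path: Path) -> Optional[float]:
    """Return the highest logged score, or None if nothing has been logged yet."""
    if not log_path.exists():
        return None
    lines = log_path.read_text(encoding="utf-8").splitlines()
    scores = [json.loads(line)["score"] for line in lines if line.strip()]
    return max(scores) if scores else None


if __name__ == "__main__":
    # Stand-in eval command for illustration; replace with the workspace's real one.
    fake_eval = [sys.executable, "-c", "print(0.5)"]
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "eval_log.jsonl"
        score = run_eval(fake_eval)
        log_iteration(log, score, "baseline, no changes yet")
        print("best so far:", best_score(log))
```

Keeping the log on disk rather than in the conversation means progress survives a long session: each pass can compare its score against `best_score` before deciding whether the change was an improvement.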
