Iterate on difficult problems
Difficulty: Advanced
Best for:
- Problems where each iteration can be scored, but the best result usually takes many passes
- Tasks with visual or subjective outputs that need both deterministic checks and an LLM-as-a-judge score
- Long-running Codex sessions where you want progress tracked clearly instead of relying on context
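As a rough illustration of scoring each pass and tracking progress outside the session context, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names `IterationRecord`, `score_iteration`, and `append_to_log`, the 50/50 weighting, and the JSONL log format are hypothetical and not part of Codex. The judge is a plain callable standing in for whatever LLM-as-a-judge call your eval script makes.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Callable


@dataclass
class IterationRecord:
    iteration: int
    change: str          # short note on what was changed in this pass
    checks_passed: int   # deterministic checks that passed
    checks_total: int
    judge_score: float   # hypothetical LLM-as-a-judge score in [0, 1]
    overall: float       # weighted blend of check pass rate and judge score


def score_iteration(
    iteration: int,
    change: str,
    checks: list[Callable[[], bool]],
    judge: Callable[[], float],
    judge_weight: float = 0.5,  # assumed weighting, tune for your task
) -> IterationRecord:
    """Run the deterministic checks and the judge, then blend the results."""
    passed = sum(1 for check in checks if check())
    check_rate = passed / len(checks) if checks else 0.0
    judge_score = judge()
    overall = (1 - judge_weight) * check_rate + judge_weight * judge_score
    return IterationRecord(
        iteration, change, passed, len(checks), judge_score, overall
    )


def append_to_log(record: IterationRecord, log_path: Path) -> None:
    """Append one JSON line per iteration so the history survives on disk."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```

A call such as `append_to_log(score_iteration(1, "tightened layout", checks, judge), Path("eval_log.jsonl"))` would record one pass. Keeping the log in a file rather than in the conversation means a long session can be resumed or reviewed without relying on what is still in context.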
Starter prompt
I have a difficult task in this workspace and I want you to run it as an eval-driven improvement loop.

Before changing anything:
- Read AGENTS.md.
- Find the script or command that scores the current output.

Iteration loop:
- Make one focused improvement at a time.
- Re-run the eval command after each meaningful change.
- Log the scores and what changed.
- Inspect generated artifacts directly. If the output is visual, use view_image.
- Keep going until both the overall score and the...