On a Wednesday in February I went to bed at midnight with an open laptop, a long-running Claude Code session, and a single instruction: refactor the BOL ingestion module to use the shared error helper. Run tests after each file. Stop if anything red.
I woke up at 7am to a clean diff, 47 passing tests, and a commit message that opened with "this one was annoying." I had not, at any point during the previous seven hours, been awake.
This is not the part of AI I expected to find useful. I’m a paranoid reviewer. I don’t trust generated code without reading it. The idea of letting an agent loose overnight on production code felt insane. And then it kept working, and now it’s a habit.
What actually works overnight
Not everything works. I’ve killed agents at 3am via a phone notification more than once. After a few months of doing this regularly, the rough taxonomy of overnight work I trust looks like:
- Rename / refactor with green tests as the contract. "Pull this helper out into its own file, update all callers, keep the test suite green." The test suite is the guardrail. The agent runs them, sees red, fixes, sees red, fixes, eventually it’s green or it stops.
- Adding tests to under-tested code. "Write tests for every exported function in this module. Make them pass against the current implementation." This one is wonderful. The agent reads the code, writes tests, runs them, fixes the tests until they describe the actual current behavior. You wake up with a regression net.
- Following a migration playbook. "Migrate the API routes in
app/apifrom the pages-router pattern to the App Router pattern using this one already-migrated file as the template." Templated work is great for agents. They are pattern-matchers; this is exactly that.
What I do not trust overnight:
- Anything where the contract is "make it better." No green-test guardrail = no overnight job.
- Anything that touches money, auth, or data integrity. Those need me awake.
- Anything that involves making a product decision. Agents will pick a defensible answer and move on. I want to be the one picking.
The scoping mistake I kept making
The first three times I tried this, I wrote the brief like I would write a Jira ticket. "Refactor the X module, simplify the Y logic, clean up the Z helpers." I’d wake up to a diff that touched 40 files, with three things I liked, two things I didn’t, and one thing that broke a feature in production. Net negative.
What I do now is the opposite. I write the brief like a contract: one verb, one scope, one stop condition.
One verb: "Rename." Not "rename and clean up." Just rename.
One scope: "Files matching src/lib/billing/*." Not "wherever it seems necessary."
One stop condition: "Tests pass. If you can’t make them pass, stop and leave a note." Not "make it work."
The diff I wake up to with this kind of brief is small, surgical, and reviewable in fifteen minutes over coffee. The diff I used to wake up to was sprawling and required half a day to disentangle. Same agent. Different brief. Different outcome.
The economic insight
The thing I didn’t expect about overnight agents is how much it changes the calculus of "is this worth doing." There is a class of work — boring refactors, test backfill, dependency upgrades, dead code removal — that I knew was valuable but never prioritized because the activation energy was too high. "It would take me a day. I don’t have a day."
An overnight agent costs me about ten minutes of writing the brief and forty minutes the next morning of reviewing the diff. Fifty minutes for a day’s worth of code hygiene. That math is not subtle. Things that used to be "maybe next sprint" now happen on a Tuesday because I had a free hour before bed.
The thing that still surprises me
What I keep finding is that the agent does the work I find boring and exhausting better than I do it. Not because it’s smarter — it isn’t, on those tasks — but because it doesn’t get bored. The 47th identical rename in the 47th file is just as careful as the first. I can’t honestly say that about my own work. By the 30th rename I’m skimming.
The agent doesn’t skim. That’s the underrated thing.
