Fast Checks for Code Generation

I’ve been thinking a lot about code generation with AI agents lately, and wanted to start writing about how I’m approaching the problem. I’ve been using AI-assisted workflows since ChatGPT came out, but I care a lot about code quality and about maintaining a good mental model of the work I’m doing. In the early days I mostly used these tools for scaffolding, but I’m finding we’re at an inflection point where they can handle more production-ready coding tasks. This is the first article exploring that.

This technique, built around the principle “Automate Feedback”, is one I’ve found quite powerful for keeping codegen up to my standards. In my project Indie Web, which powers FloppyDisk.link and BrowserChords.com, I use Taskfiles as my script runner. I added a task check command that automates all the fast iterative checks I care about.

I found myself frustrated with constantly correcting the agent and with slow loops. It failed to apply automatic code formatting pretty much every time, and my atomic commits became messy with formatting fixes after the fact. Even on side projects I prefer a clean commit style. I also found myself constantly correcting annoying coding patterns and lint violations. Suddenly, my side project needed the classic linting checks you would set up for untrusted contributors. This problem has been solved for quite a while with linting and automated feedback.

For agentic loops, the requirements change. I want a single unambiguous check command that is as fast as possible, to keep the agent on task without a human in the loop until I am ready to review the work. AI codegen is really good at creating plausible code that is fundamentally flawed, so to trust plausible-looking agent code I need validation that is quick and reliable. I got the check down to a few seconds of parallelized runs, and task check became my way to keep quality high.

How the command works – human in the loop

task check

This is all that is needed to run the check. While the checks are running, an interactive terminal shows live progress:

➤ task check
• Lint JS running...
✓ Lint CSS 0.5s
• TypeScript Frontend running...
• TypeScript Server running...
• TypeScript Shared running...
• Test Frontend running...
✓ Test Server 1.5s

Once it finishes in an interactive terminal, the output is quick and simple:

~/dev/indie-web on indie-web/main 🍏
➤ task check
✓ Lint JS 5.9s
✓ Lint CSS 0.5s
✓ TypeScript Frontend 5.2s
✓ TypeScript Server 3.9s
✓ TypeScript Shared 3.6s
✓ Test Frontend 4.2s
✓ Test Server 1.5s

~/dev/indie-web on indie-web/main

For failures, only the failing task’s output is shared.

➤ task check
x Lint JS 5.7s
✓ Lint CSS 0.6s
✓ TypeScript Frontend 5.1s
✓ TypeScript Server 3.9s
✓ TypeScript Shared 3.7s
✓ Test Frontend 4.1s
✓ Test Server 1.5s

✖ Failures
──────────

┌─ Lint JS failed exit 1 run: task lint-js
│ task --silent --exit-code lint-js
└─ output

/Users/greg/dev/indie-web/src/frontend/index.tsx
84:22 error Replace `A.setDropboxAccessToken(oauth.accessToken,·0,·oauth.refreshToken)` with `⏎········A.setDropboxAccessToken(oauth.accessToken,·0,·oauth.refreshToken),⏎······` prettier/prettier

✖ 1 problem (1 error, 0 warnings)
1 error and 0 warnings potentially fixable with the `--fix` option.



◆ Results
─────────
x Lint JS 5.7s run: task lint-js
✓ Lint CSS 0.6s
✓ TypeScript Frontend 5.1s
✓ TypeScript Server 3.9s
✓ TypeScript Shared 3.7s
✓ Test Frontend 4.1s
✓ Test Server 1.5s
task: Failed to run task "check": exit status 1

This way it is quick to diagnose errors and understand failures whenever I want to run a check.
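
The idea of sharing output only for the failing task can be sketched as a small pure function. This is a minimal sketch rather than the actual bin/check.mts: the CheckResult shape and the renderReport name are assumptions for illustration, and the layout is simplified from the output above.

```typescript
// Hypothetical result shape; the real orchestrator may track different fields.
interface CheckResult {
  label: string; // e.g. "Lint JS"
  task: string; // e.g. "lint-js"
  exitCode: number;
  seconds: number;
  output: string; // captured stdout and stderr
}

// Build the report: captured output for failures only, followed by a
// one-line summary for every check.
function renderReport(results: CheckResult[]): string {
  const lines: string[] = [];
  for (const r of results.filter((r) => r.exitCode !== 0)) {
    lines.push(`┌─ ${r.label} failed exit ${r.exitCode} run: task ${r.task}`);
    lines.push(r.output.trimEnd(), '');
  }
  for (const r of results) {
    const mark = r.exitCode === 0 ? '✓' : 'x';
    lines.push(`${mark} ${r.label} ${r.seconds.toFixed(1)}s`);
  }
  return lines.join('\n');
}

console.log(
  renderReport([
    { label: 'Lint JS', task: 'lint-js', exitCode: 1, seconds: 5.7, output: '1 error\n' },
    { label: 'Lint CSS', task: 'lint-css', exitCode: 0, seconds: 0.6, output: 'all good' },
  ]),
);
```

Passing checks contribute a single summary line, so whatever they printed never reaches the reader or the agent.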

How the command works – non-TTY

If I pipe the command into cat, it runs in a non-interactive mode. This is the same way an agent or standard CI would interact with the command. Here I wanted a log-friendly output format, without interactive check marks updating over time. I wanted to maximize signal for the agent, but retain signal for me as well.

➤ task check | cat
RUN Lint JS
RUN Lint CSS
RUN TypeScript Frontend
RUN TypeScript Server
RUN TypeScript Shared
RUN Test Frontend
RUN Test Server
FAIL Lint JS 5.5s
FAILURES

FAIL Lint JS exit 1 | run: task lint-js
cmd: task --silent --exit-code lint-js
output:

/Users/greg/dev/indie-web/src/frontend/index.tsx
84:22 error Replace `A.setDropboxAccessToken(oauth.accessToken,·0,·oauth.refreshToken)` with `⏎········A.setDropboxAccessToken(oauth.accessToken,·0,·oauth.refreshToken),⏎······` prettier/prettier

✖ 1 problem (1 error, 0 warnings)
1 error and 0 warnings potentially fixable with the `--fix` option.


task: Failed to run task "check": exit status 1

For the success case I got:

➤ task check | cat
RUN Lint JS
RUN Lint CSS
RUN TypeScript Frontend
RUN TypeScript Server
RUN TypeScript Shared
RUN Test Frontend
RUN Test Server
PASS Lint CSS 0.6s
PASS Test Server 1.6s
PASS TypeScript Shared 3.6s
PASS TypeScript Server 3.8s
PASS Test Frontend 3.9s
PASS TypeScript Frontend 5.0s
PASS Lint JS 5.5s

Here it informs the agent that everything is passing as intended, and how long each check took. Because the checks run in parallel, the whole run takes about as long as the slowest task, 5.5s here, which is quick enough for an agentic loop.
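
One way to get this split between interactive and log-friendly output is to branch on whether stdout is a terminal. This is a minimal sketch, assuming Node's process.stdout.isTTY drives the switch (the actual detection in bin/check.mts may differ); chooseMode and formatLogLine are hypothetical names.

```typescript
type Status = 'RUN' | 'PASS' | 'FAIL';
type Mode = 'interactive' | 'log';

// When stdout is piped (into cat, an agent, or CI), isTTY is undefined.
function chooseMode(isTTY: boolean | undefined): Mode {
  return isTTY ? 'interactive' : 'log';
}

// Append-only line for log mode: no cursor movement, no redrawn check marks.
function formatLogLine(status: Status, label: string, seconds?: number): string {
  return seconds === undefined
    ? `${status} ${label}`
    : `${status} ${label} ${seconds.toFixed(1)}s`;
}

const mode = chooseMode(process.stdout.isTTY);
if (mode === 'log') {
  console.log(formatLogLine('RUN', 'Lint CSS'));
  console.log(formatLogLine('PASS', 'Lint CSS', 0.6));
}
```

Keeping the formatter separate from the mode check means the same results can feed either renderer.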

How this affects the agentic loop

With this technique, I found things just started working. I no longer had to spend my time correcting the agent. Instead, it would notice each Prettier error on its own, and only run task lint-fix once it was done solving the bigger problems. I found I could get more reliable results.

I found this AI summary of my work pretty accurate:

  • The command should be:
    • Fast enough to run constantly.
    • Narrow enough that failures are relevant.
    • Real enough that passing means something.
    • Quiet enough that an agent can use the output.
    • Pleasant enough that I will still run it.

In my project, to accomplish “fast enough” I parallelized everything. I care a lot about performant code, and my tests were already pretty fast to begin with. For this project, task check actually runs all of the CI checks. I could see that in larger projects you would need to make task check smart enough to only run checks that would be relevant. For my use case, my tests were fast enough to just run everything in parallel on my beefy dev machine.

To make failures even faster in agent flows, I kill all the remaining tasks as soon as one check fails. Agents can fix one failure at a time anyway, so you can order checks by importance.
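
Running everything in parallel and cancelling on the first failure can be sketched with Node's child_process.spawn. This is a sketch under assumptions, not the actual bin/check.mts: runFailFast, Job, and Outcome are hypothetical names, and killing with the default SIGTERM is a choice made for this example.

```typescript
import { spawn, type ChildProcess } from 'node:child_process';

// Hypothetical job shape: a label plus the command to spawn.
interface Job {
  label: string;
  cmd: string;
  args: string[];
}

interface Outcome {
  label: string;
  status: 'pass' | 'fail' | 'killed';
  output: string;
}

// Start every job at once; on the first failure, kill whatever is still running.
function runFailFast(jobs: Job[]): Promise<Outcome[]> {
  const running = new Set<ChildProcess>();
  let failed = false;

  const runOne = (job: Job) =>
    new Promise<Outcome>((resolve) => {
      const child = spawn(job.cmd, job.args);
      running.add(child);
      let output = '';
      child.stdout.setEncoding('utf8');
      child.stderr.setEncoding('utf8');
      child.stdout.on('data', (chunk: string) => (output += chunk));
      child.stderr.on('data', (chunk: string) => (output += chunk));
      child.on('error', (err) => (output += String(err)));
      // 'close' fires once the process has ended and its stdio streams are closed.
      child.on('close', (code, signal) => {
        running.delete(child);
        const status: Outcome['status'] =
          code === 0 ? 'pass' : signal !== null && failed ? 'killed' : 'fail';
        if (status === 'fail' && !failed) {
          failed = true;
          for (const other of running) other.kill(); // default signal: SIGTERM
        }
        resolve({ label: job.label, status, output });
      });
    });

  return Promise.all(jobs.map(runOne));
}
```

Note that kill() only signals the direct child process. A runner that wraps another runner may need extra handling to stop grandchildren, which this sketch leaves out.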

What this all looked like for me

An example Taskfile.yml shape:

check:
  desc: Run all checks in a blazing fast configuration with only failures reported.
  silent: true
  cmds:
    - node bin/check.mts

lint-js:
  desc: Lint the JavaScript files.
  cmds:
    - eslint "bin/**/*.{ts,mts}" "scripts/**/*.js"

format:
  desc: Check Prettier formatting.
  cmds:
    - prettier --check "bin/**/*.{ts,mts}" "scripts/**/*.js"

And then the check orchestrator lists the tasks in order.

const checks = [
  { task: 'lint-js', label: 'Lint JS' },
  { task: 'lint-css', label: 'Lint CSS' },
  { task: 'ts-frontend', label: 'TypeScript Frontend' },
  { task: 'ts-server', label: 'TypeScript Server' },
  { task: 'ts-shared', label: 'TypeScript Shared' },
  { task: 'test-frontend', label: 'Test Frontend' },
  { task: 'test-server', label: 'Test Server' },
];
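
Each entry then has to become a command to spawn. This is a small sketch, assuming the orchestrator shells out to task with the flags visible in the failure output earlier (task --silent --exit-code lint-js); the Check type and toCommand helper are hypothetical names.

```typescript
interface Check {
  task: string; // Taskfile task name
  label: string; // human-readable name shown in the output
}

const exampleChecks: Check[] = [
  { task: 'lint-js', label: 'Lint JS' },
  { task: 'test-server', label: 'Test Server' },
];

// Build the argv for one check, mirroring the `cmd:` line in the failure report.
function toCommand(check: Check): { cmd: string; args: string[] } {
  return { cmd: 'task', args: ['--silent', '--exit-code', check.task] };
}

for (const check of exampleChecks) {
  const { cmd, args } = toCommand(check);
  console.log(`${check.label}: ${cmd} ${args.join(' ')}`);
}
// prints:
// Lint JS: task --silent --exit-code lint-js
// Test Server: task --silent --exit-code test-server
```

Keeping the list as plain data makes reordering checks by importance a one-line change.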

Finally, I keep my AGENTS.md pretty short, because of the principle of “Patterns Are Stronger Than Prompts” for codegen. The agent will figure it out contextually and interactively.

Read all of README.md. I optimize for human usage over robot.

For all code written, run `task check` for quick agentic-focused testing that limits stdout.

This feedback loop has made me feel pretty productive without feeling sloppy, so that I can read the generated code and ensure it’s high quality for the task at hand.
