Looking at the prompts op has shared, I'd recommend managing/trimming the context more aggressively. In general, don't give the agent a new task without /clear-ing the context first. This keeps the agent focused on the new task and reduces its bias (e.g. when reviewing changes it made previously).
The overall approach I now have for a medium-sized task is roughly the following (a rough code sketch of the whole loop follows the list):
- Ask the agent to research a particular area of the codebase that is relevant to the task at hand, listing all relevant/important files and functions, and putting all of this in a "research.md" markdown file.
- Clear the context window
- Ask the agent to put together a project plan, informed by the previously generated markdown file. Store that project plan in a new "project.md" markdown file. Depending on complexity I'll generally do multiple revs of this.
- Clear the context window
- Ask the agent to create a step-by-step implementation plan, leveraging the previously generated research & project files, and put that in a plan.md file.
- Clear the context window
- While there are unfinished steps in plan.md:
-- While the current step needs more work
--- Ask the agent to work on the current step
--- Clear the context window
--- Ask the agent to review the changes
--- Clear the context window
-- Ask the agent to update the plan with their changes and make a commit
-- Clear the context window
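A minimal sketch of that outer loop, assuming the Claude Code CLI's headless print mode (`claude -p`, where every invocation starts with a fresh context, i.e. an implicit /clear) and a plan.md that tracks steps as markdown checkboxes. The "needs more work" inner loop is collapsed here into a single work-then-review pass:

```python
import re
import subprocess
from pathlib import Path

def run_fresh(prompt: str) -> str:
    # Each `claude -p` call is a separate headless session, so the context
    # is effectively cleared between every step of the workflow.
    return subprocess.run(["claude", "-p", prompt],
                          capture_output=True, text=True, check=True).stdout

def unfinished_steps() -> list[str]:
    # Assumes plan.md tracks steps as unchecked markdown checkboxes ("- [ ] ...").
    plan = Path("plan.md").read_text()
    return re.findall(r"^- \[ \] (.+)$", plan, flags=re.MULTILINE)

# Phases 1-3: research, project plan, implementation plan (fresh context each time).
run_fresh("Research the parts of the codebase relevant to the task; list the "
          "important files and functions in research.md.")
run_fresh("Using research.md, write a project plan to project.md.")
run_fresh("Using research.md and project.md, write a step-by-step implementation "
          "plan to plan.md, with each step as an unchecked markdown checkbox.")

# Phase 4: work through plan.md until every step is checked off.
while (steps := unfinished_steps()):
    step = steps[0]
    run_fresh(f"Work on this step from plan.md: {step}")
    run_fresh(f"Review the changes made for this step: {step}")
    run_fresh(f"If the step '{step}' is complete, mark it done in plan.md "
              f"and commit the changes.")
```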
I also recommend having specialized sub-agents for each of those phases (research, architecture, planning, implementation, review). Less in terms of telling the agent what to do, and more as a way to add guardrails and structure to the way they synthesize/serialize back to markdown.
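For reference, a Claude Code sub-agent is just a markdown file with a bit of YAML frontmatter dropped into .claude/agents/. Something like the sketch below; the field names reflect my reading of the docs, and the reviewer wording is purely illustrative:

```
---
name: reviewer
description: Reviews the changes made for the current plan.md step and writes findings to review.md.
tools: Read, Grep, Glob
---
You are a code reviewer. Read the current step in plan.md, inspect the diff,
and summarize problems, risks, and suggested fixes in review.md.
Do not edit source files yourself.
```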
I pretty much never clear my context window unless I'm switching to entirely different work; it seems to work fine, with Copilot summarizing the convo every once in a while. I'm probably at 95% code written by an LLM.
I actually think it works better that way: the agent doesn't have to spend as much time rereading code it had just read. I do have several "agents" like you mention, but I just use them one by one in the same chat so they share context. They all write to markdown in case I want to start fresh when things go the wrong direction, but that doesn't happen very often.
I wouldn't take it for granted that Claude isn't re-reading your entire context each time it runs.
When you run llama.cpp on your home computer, it holds onto the key-value cache from previous runs in memory. Presumably Claude does something analogous, though on a much larger scale. Maybe Claude holds onto that key-value cache indefinitely, but my naive expectation would be that it only holds onto it for however long it expects you to keep the context going. If you walk away from your computer and resume the context the next day, I'd expect Claude to re-read your entire context all over again.
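For what it's worth, the mechanism here is prefix caching: attention key/value states depend only on the tokens before them, so a server can skip recomputation for whatever prefix it still has cached and only process the new suffix. A toy illustration of that bookkeeping (not any real inference API):

```python
# Toy prefix cache: maps a token prefix (as a tuple) to its already-computed
# attention state, so only the new suffix needs to be processed.
kv_cache: dict[tuple[str, ...], object] = {}

def process(tokens: list[str]) -> None:
    # Find the longest cached prefix of this context.
    best = max((p for p in kv_cache if tokens[:len(p)] == list(p)),
               key=len, default=())
    suffix = tokens[len(best):]
    print(f"reusing {len(best)} cached tokens, computing {len(suffix)} new ones")
    # "Compute" the new suffix and remember the full prefix for next time.
    kv_cache[tuple(tokens)] = object()

process(["system", "you", "are", "helpful"])        # computes everything
process(["system", "you", "are", "helpful", "hi"])  # reuses the cached prefix
# If the cache is dropped (e.g. you come back the next day), the whole
# context has to be recomputed from scratch.
```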
At best, you're getting some performance benefit keeping this context going, but you are subjecting yourself to context rot.
Someone familiar with running Claude or industrial-strength SOTA models might have more insight.
CC absolutely does not read the context again during each run. For example, if you ask it to do something, then revert its changes, it will think the changes are still there leading to bad times.
It wouldn't re-read the context; it caches the tokens so far, which is like photographically remembering the context instead of re-reading it, until you see it "compress" the context, when it gives itself a prompt to recap the conversation so far.
You can tell it that you manually reverted the changes.
That said, the fact that we're all curating these random bits of "llm whisperer" lore is...concerning. The product is at the same time amazingly good and terribly bad.
Today I tested a mix of clearing the context often and keeping long contexts; Copilot with Claude ended up producing good visual results, but the generated CSS was extremely messy.
An even better approach, in my experience, is to ask CC to do research, then plan the work, then let it implement step 1, then double-escape back to the plan, tell it that step 1 is done, and continue with step 2.
This is a really interesting approach as well, and one I'll need to try! Previously, I've almost never moved back to the plan using double escape unless things go wrong. This is a clever way to better use that functionality. Thanks for sharing!
I’m not disparaging it, just actualizing it and sharing that thought. If you don’t understand that most modern “tools” and “services” are gamified, then yes I suppose I seem like a huge jerk.
The author literally talks about managing a team of multiple agents, and LLM services requiring the purchase of "tokens" are similar to popping a token into an arcade machine.
"Hacker culture never took root in the AI gold rush because the LLM 'coders' saw themselves not as hackers and explorers, but as temporarily understaffed middle-managers"
Also, hacking really doesn't have anything to do with generating poorly structured documents that compile into some sort of visual mess that needs fixing. Hacking is the analysis and circumvention of systems. Sometimes when hacking we piece together some shitty code to accomplish a circumvention task, but rarely is the code representative of the entire hack. LLMs just make steps of a hack quicker to complete. At a steep cost.
I have been reluctant to use AI as a coding assistant, though I have installed Claude Code and bought a bunch of credits. When I see comments like this, I genuinely ask: what's the point? Are you sure that going through all of these manipulations, instead of directly editing the source code, makes you more productive? In what way?
Years ago, I was joking with my colleagues that I'm living two weeks ahead, writing the present day code is a chore, thinking about the next problems is more important, so that when the time comes to implement them, I know how. I don't have much time to code these days, but I still have the ability to think. Instead of doing the chore myself, I now delegate it to Claude Code. I still do coding occasionally, usually when it's something hard that I know AI will mess up, but in those instances, I enjoy it.
> Looking at the prompts op has shared, I'd recommend managing/trimming the context more aggressively. In general, don't give the agent a new task without /clear-ing the context first. This keeps the agent focused on the new task and reduces its bias (e.g. when reviewing changes it made previously).
My workflow for any IDE, including Visual Studio 2022 w/ Copilot, JetBrains AI, and now Zed w/ Claude Code baked in, is to start a new convo altogether when I'm doing something different or changing up my initial instructions. It works way better. People are used to keeping a window open until the model loses its mind in apps like ChatGPT, but for code the context window gets packed a lot sooner (remember, the tools are sending some code over too), so you need to start over before it gets confused.
I've been meaning to try Zed but haven't gotten into it yet; it felt hard to justify switching IDEs when I just got into a working flow with VS Code + Claude Code CLI. How are you finding it? I'm assuming positive if that's your core IDE now but would love to hear more about the experience you've had so far.
If you are a Claude Code user, you will likely not enjoy the version integrated into Zed. Many things are missing, for example slash commands. I use Zed, but still run Claude Code in the terminal. As an editor, Zed is excellent, especially as a Vim replacement.
Oh interesting, that’s good to know. Thank you. I might try that combination - I like using the Claude Code CLI so hopefully less of a painful transition.
OP here, this is great advice. Thanks for sharing. Clearing context more often between tasks is something I've started to do more recently, although definitely still a WIP to remember to do so. I haven't had a lot of success with the .md files leading to better results yet, but have only experimented with them occasionally. Could be a prompting issue though, and I like the structure you suggested. Looking forward to trying!
I didn't mention it in the blog post but actually experimented a bit with using Claude Code to create specialized agents such as an expert-in-Figma-and-frontend "Design Engineer", but in general found the results worse than just using Claude Code as-is. This also could be a prompting issue though and it was my first attempt at creating my own agents, so likely a lot of room to learn and improve.
This is overkill. I know because I'm on the opposite end of the spectrum: each of my chat sessions goes on for days. The main reason I start over is because Cursor slows down and starts to stutter after a while, which gets annoying.
Claude auto-condenses context, which is both good and bad. Good in that it doesn't usually get super slow; bad in that sometimes it does this in the middle of a todo and then ends up (I suspect) producing something less on-task as a result.
Usually, managing a development team is more work than just writing the code oneself. However, managing a development team (even if that team consists of a single LLM and yourself) means that more work can be done in a shorter period of time. It also provides much better structure for ensuring that tests are written, and that documentation is written if that is important. And in my experience, though I understand it's not everybody's experience, it helps ensure a clean, useful git history.
It just seems like a lot of work when you could just write the code yourself. It's a lot less typing to go ahead and make the edits you want than to guide the autocorrect into eventually predicting what you want from guidelines you also have to generate to save time.
Like, I'm sorry, but when I see how much work the advocates are putting into their prompts, the METR paper comes to mind... you're doing more work than coding the "old fashioned way".
If there's adequate test coverage, and the tests emit informative failures, coding agents can be used as constraint-solvers to iterate and make changes, provided you stage your prompts properly, much like staging PRs.
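A minimal sketch of that loop, assuming the headless `claude -p` mode and pytest as the test runner (both are just stand-ins for whatever your stack uses):

```python
import subprocess

MAX_ITERATIONS = 5  # give up rather than letting the agent thrash forever

for attempt in range(MAX_ITERATIONS):
    # Run the test suite and capture informative failure output.
    tests = subprocess.run(["pytest", "-x", "--tb=short"],
                           capture_output=True, text=True)
    if tests.returncode == 0:
        print("tests green, done")
        break
    # Feed the failures back to the agent as the constraint to satisfy.
    prompt = ("The following tests are failing. Make the smallest change that "
              "fixes them without weakening the tests:\n\n" + tests.stdout[-8000:])
    subprocess.run(["claude", "-p", prompt], check=True)
else:
    print("still failing after", MAX_ITERATIONS, "attempts; review manually")
```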
The question isn't whether it makes a difference; the question is whether the model you're working with / the platform you're working with it on already does that. All of the major commercial models have their own system prompts that are quite detailed, and then the interfaces for using the models typically also have their own system prompts (Cursor, Claude Code, Codex, Warp, etc.).
It's highly likely that if you're working with one of the commercial models that has been tuned for code tasks, in one of the commercial platforms marketed to SWEs, instructions to the effect of "you're an expert/experienced engineer" will already be part of the context window.
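You can see the mechanism directly if you call the model through the API yourself: the role instruction is just a system prompt prepended to the conversation, which is what these tools do for you at greater length. A sketch with the Anthropic Python SDK (the model name and wording are placeholders):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; use whatever model you target
    max_tokens=1024,
    # The "role play" cue lives here; coding tools ship their own, much longer
    # version of this before your messages ever reach the model.
    system="You are an experienced software engineer. Prefer small, reviewable changes.",
    messages=[{"role": "user", "content": "Refactor this function to remove duplication: ..."}],
)
print(response.content[0].text)
```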
How has it ever worked? I have thousands of threads with various LLMs, none of which have that role-play cue, yet the responses always sound authoritative and similar to what one would find in literature written by experts in the field.
What does work is to provide clues for the agent to impersonate a clueless idiot on a subject, or a bad writer. It will at least sound like it in the responses.
Those models have been heavily trained with RLHF; if anything, today's LLMs are even more likely to throw out authoritative-sounding predictions, if not in accuracy, then at least in tone.
I also don't tell CC to think like expert engineer, but I do tell it to think like a marketer when it's helping me build out things like landing pages that should be optimized for conversions, not beauty. It'll throw in some good ideas I may miss. Also when I'm hesitant to give something complex to CC, I tell that silly SOB to ultrathink.
I'm still not sure whether specifying a role made a difference in terms of performance. In a different but similar instance, when I tried to create an agent in Claude to play a specific role (frontend / design engineer expert), I found that this seemed to perform worse vs. just using default Claude, but this is all very anecdotal.
Maybe it is cargo culting at this point, idk. When I first started experimenting with this, about two generations of models back, the role play prompt made a noticeable difference.
Example: with early Claude (pre-Claude Code) if you asked for a Rust program you’d get something that only resembled Rust syntax but was a mix of different languages. “You are a senior software engineer that develops solely with the Rust programming language” or something like this made it generate syntactically correct Rust.
Similar prompts led to better, more focused tests. I find that such prompts are not as necessary anymore, but anecdotally I’ve still felt a difference.
And it will completely ignore the instructions, because user input cannot affect it, but it will waste even more context space fooling you into thinking that it did.