DEV Community: Michelle Tristy

Store the proof, not the moral

Michelle Tristy — Tue, 16 Jun 2026 02:30:39 +0000

The most useful idea I picked up from all those conversations was not a technique. It was a way of thinking about what a memory even is. Someone put it like this: a memory should hold proof, not a moral. I have been chewing on it ever since, because the more I sit with it the more it explains why agent memory goes wrong in the specific way it does.

Two things get tangled together every time an agent records a failure. There is the event, which is a fact about the past. On Tuesday, the agent tried approach X and it produced error Y. That happened or it did not. It is fixed, it is checkable, you could in principle go back and confirm it. And then there is the lesson, which is an inference drawn from the event. Therefore, do not use approach X. That is not a fact. It is a hypothesis about the future, and like any hypothesis it might be wrong, or it might stop being true later.

The trouble is that most memory I have seen stores the second thing and quietly drops the first. It keeps "do not use X" and throws away "because on Tuesday it broke with Y." Which feels efficient. The conclusion is what you act on, so why keep the messy event around?

Because once the lesson is cut loose from its evidence, you can never check it again. You are left with a rule and no way to ask whether the rule is still warranted, because the thing that warranted it is gone. The agent will keep refusing to use X long after the reason evaporated, and it cannot even tell you why, only that it knows not to. That is superstition, in the precise sense. A behavior that has outlived its justification and no longer remembers it ever had one.

It gets worse when you think about strength of evidence. A lesson learned from a single fluke and a lesson learned from twenty consistent failures look identical once they have both been compressed to "do not use X." The number of times it actually failed was part of the proof, and you threw that away too. So the agent gives a thing that failed once the same hard conviction as a thing that failed every time, because the only part it kept was the conviction.
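To make the point about evidence strength concrete, here is a minimal sketch that derives support from raw events instead of storing a bare rule. The event shape (an approach name plus a success flag) and the name `failure_support` are assumptions for illustration, not part of any particular memory system.

```python
from collections import Counter

# Hypothetical raw events as (approach, succeeded) pairs.
# "X" failed once; "Z" failed twenty times in a row.
events = [("X", False)] + [("Z", False)] * 20


def failure_support(events: list[tuple[str, bool]]) -> dict[str, tuple[int, int]]:
    """Map each approach to (failures, attempts), computed from the raw events."""
    attempts = Counter(approach for approach, _ in events)
    failures = Counter(approach for approach, ok in events if not ok)
    return {approach: (failures[approach], attempts[approach]) for approach in attempts}


support = failure_support(events)
```

Compressed to rules, both approaches would read "do not use." With the events kept, `support` distinguishes one failure in one attempt from twenty in twenty.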

This is, if you have not noticed, extremely human. We do it constantly. You get burned once by something, you form a rule, and then you defend the rule for years after you have forgotten the single incident that created it. The incident, if you still had it, would be checkable. You could ask whether it was ever representative, whether the thing that made it fail is even still around. The detached rule cannot be questioned that way, so it just sits there shaping your behavior, unfalsifiable. We carry a lot of orphaned morals and call them experience.

So what does holding proof instead look like? The event stays put, immutable, primary. It is what happened and the evidence that it happened, and nothing later is allowed to edit it. The lesson becomes a separate, derived thing hanging off the event, and that one is allowed to change. When new evidence shows up, you revise the lesson and leave the event alone. The event is the ground truth. The lesson is your current best reading of it, marked clearly as a reading, not a fact.
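One possible shape for that separation, sketched under assumptions: the class names, field names, event IDs, and the revised claim are all hypothetical, and this is one of many schemas that would fit the idea rather than a recommended one.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """What happened, plus the evidence. Frozen, so it cannot be edited later."""
    event_id: str
    occurred_at: datetime
    approach: str
    error: str


@dataclass
class Lesson:
    """A current reading of one or more events. This part is allowed to change."""
    claim: str
    evidence: list[str]                                # event_ids the reading rests on
    history: list[str] = field(default_factory=list)   # earlier claims, oldest first

    def revise(self, new_claim: str, new_evidence: list[str]) -> None:
        # Keep the old reading so the change itself stays auditable.
        self.history.append(self.claim)
        self.claim = new_claim
        self.evidence = self.evidence + new_evidence


tuesday = Event("evt-1", datetime(2026, 6, 9, tzinfo=timezone.utc), "approach X", "error Y")
lesson = Lesson(claim="do not use approach X", evidence=[tuesday.event_id])

# New evidence (a hypothetical second event) changes the reading, not the event.
lesson.revise("approach X is fine once the cause of error Y is fixed", ["evt-2"])
```

The lesson points at events by ID instead of copying them, so the reading can move while the record of what happened stays exactly as it was written.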

The honest catch is that this is more work, not less. More to store, more structure, and it does not make the hard decision go away. You still have to decide when a lesson should be revised, and that is a judgment nobody has automated cleanly. But, and this is the whole point, keeping the event around is what makes the judgment possible at all. Without the proof you cannot even have the argument about whether the moral still holds. You are just stuck with the moral.

There is a thread back to something I wrote about before. The cleanest reason to revise a lesson is an outcome: you acted on the rule and watched what happened. But most setups never capture outcomes, so the revision step has no trigger and the lessons quietly calcify. Keep the event and you at least preserve the option to revisit. Drop it and the calcification is permanent.

I do not think there is one correct schema for this. But I have started believing that any memory which only stores conclusions is building up a pile of rules it can never audit. So I am curious how others handle it. Do you keep the raw event separate from what you concluded from it, or do you store the conclusion and hope it stays right? I would like to know if anyone has found a clean way to let the lesson move while the evidence stays nailed down.

Catching the failure is the easy part

Michelle Tristy — Mon, 15 Jun 2026 01:05:36 +0000

The last post I wrote ended on a loose thread I have not been able to stop pulling at. Almost every memory setup I looked at had a decent answer for what to write down, and almost none of them had a real answer for what to keep. I want to sit with that second half for a while, because the more time I spend with it the more I think it is where the actual difficulty lives.

Start with the part that feels hard but mostly isn't. Noticing that an agent failed at something is close to mechanical. A tool throws an error. A test goes red. A call times out. A change gets reverted twenty minutes after it shipped. You can even catch the quiet ones, the runs where nothing errored but nobody ever confirmed the thing actually worked, by treating "ended without confirmation" as its own small failure. None of this is trivial to wire up, but it is the kind of problem that yields to rules. You can write the rules down and they hold.
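A rough sketch of what those detection rules could look like. The `RunResult` fields and signal names are invented for illustration; a real setup would pull them from whatever its tooling actually reports.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    """Hypothetical summary of one agent run; field names are illustrative."""
    tool_error: Optional[str] = None
    tests_passed: Optional[bool] = None   # None means no tests ran
    timed_out: bool = False
    reverted: bool = False
    confirmed: bool = False               # did anything confirm the result worked?


def detect_failures(run: RunResult) -> list[str]:
    """Return every failure signal that fired for this run."""
    signals = []
    if run.tool_error is not None:
        signals.append("tool_error")
    if run.tests_passed is False:
        signals.append("test_failed")
    if run.timed_out:
        signals.append("timeout")
    if run.reverted:
        signals.append("reverted")
    # The quiet case: nothing errored, but nothing confirmed success either.
    if not signals and not run.confirmed:
        signals.append("unconfirmed")
    return signals
```

Note what the function returns: a list of things that went red. It says nothing about which of them deserves to be remembered.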

So people build the detector, watch it light up, and feel like they have solved memory. They have not. They have solved the easy half and walked right up to the hard one without noticing the seam.

The hard half starts the moment you have a confirmed failure in hand and have to decide what, if anything, it means. A single failure is not one kind of thing. Sometimes it is a fluke, a flaky test or a network hiccup that will never recur and is worth nothing. Sometimes it is a real lesson, a sign that a whole approach is wrong. And sometimes it is just another face of a mistake you already recorded last week, in which case writing it down again only piles more weight onto something you already knew. The detector cannot tell these apart. It only knows that something went red. Sorting which red things deserve to become memory is judgment, and judgment does not collapse into a rule the way detection does.

Then volume shows up and makes it worse. If you keep every failure you catch, the store fills with sediment fast. Someone I talked to for the last post had the agent write a short post mortem after each task, which worked beautifully until there were forty of them and the signal drowned. So you have to consolidate. Merge the near duplicates, summarize the old ones, let the trivial stuff fade out. And consolidation is lossy on purpose, which means every time you do it you are betting on which detail mattered before you actually know. You compress "the deploy failed because the migration ran before the feature flag flipped" down to "be careful with migration ordering," and you have probably thrown away the one specific that would have helped next time. The summary feels tidier and remembers less.

There is a quieter failure mode hiding in here too, and it is the one I find most interesting. When you consolidate aggressively you are tempted to fold the event and the lesson into a single object. What happened, and what you concluded from it, become one note. That is exactly the move that turns memory into superstition. The agent stops holding "this happened once and here is the evidence" and starts holding "this is the rule," and it will defend the rule long after the thing that justified it has changed. A failure that was real on Tuesday hardens into a law by Friday, enforced by a system that no longer remembers why. Keep the event and the conclusion as separate things and you can revise the conclusion later. Fuse them and you cannot.

So what makes keeping so much harder than catching? I think it comes down to signal. Detection has ground truth right when it happens. The test passed or it did not. Keeping has no equivalent. At the moment you are deciding whether a memory is worth holding onto, you usually cannot tell, because the thing that would actually tell you is whether acting on that memory later leads somewhere good or sends the agent back into the wall it already hit. That signal arrives much later, if you capture it at all, and almost nobody is capturing it. We instrument the write and leave the outcome uninstrumented, then act surprised when the store fills with confident junk.
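As a sketch of what instrumenting the outcome might mean in practice, the class below tallies whether acting on a memory helped or hurt. The names `OutcomeLog`, `record`, and `needs_review`, and the `min_uses` threshold, are assumptions. It only flags a memory for a second look; it does not decide anything.

```python
from dataclasses import dataclass


@dataclass
class OutcomeTally:
    helped: int = 0
    hurt: int = 0


class OutcomeLog:
    """Records what happened after a memory was acted on (hypothetical API)."""

    def __init__(self) -> None:
        self._tallies: dict[str, OutcomeTally] = {}

    def record(self, memory_id: str, helped: bool) -> None:
        tally = self._tallies.setdefault(memory_id, OutcomeTally())
        if helped:
            tally.helped += 1
        else:
            tally.hurt += 1

    def needs_review(self, memory_id: str, min_uses: int = 2) -> bool:
        """True once a memory has been used enough and hurt at least as often as it helped."""
        tally = self._tallies.get(memory_id, OutcomeTally())
        used = tally.helped + tally.hurt
        return used >= min_uses and tally.hurt >= tally.helped
```

The threshold is arbitrary. What matters is that the signal exists at all, so the keep decision has something to look at later.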

Worth saying plainly: this is roughly how human memory works, and nobody designed that, so maybe it is telling. You do not store your whole day. Something during sleep throws away nearly all of it and keeps a thin, strange, sometimes wrong selection. The recording was never the clever part. The selection is. We have built agents that record fluently and select badly, which is close to the exact inverse of what you want.

I do not have a clean fix, and I am suspicious of anyone who says they do. What I have is a few things I now believe. Separate the cheap detector from the expensive decision, and do not let the first quietly stand in for the second. Do not promote a single failure into a durable rule just because it happened once. Build the cleanup pass in from the start, because the store degrades whether or not you planned for it. And accept that part of the keep decision cannot be automated yet, because the signal it really wants, did acting on this actually work, is one most systems are not even recording.

That last one is the thread I will pull next.

Your AI agent remembers what sounds related, not what worked

Michelle Tristy — Sat, 13 Jun 2026 23:19:07 +0000

I spent a couple of weeks asking people a pretty basic question. If you are actually running agents, past the demo, in something resembling production, how do you handle memory?

I was expecting a handful of tips. What I got instead was the same frustration over and over, and a problem that, as far as I can tell, nobody has cleanly solved yet. So I am writing it down, because if you build with agents you are going to run straight into it.

The thing everyone starts with

Most agent memory works the same way. Embed everything the agent has seen, store the vectors, and when a new task shows up, pull back whatever is closest and drop it into context.

That is fine right up until it isn't. The catch is that "closest in vector space" really means "sounds related," and sounding related is not the same as having worked last time.

So the agent recalls the thing that resembles the task in front of it, not the thing that actually helped. It will cheerfully head down a path it already failed on three sessions ago, because nothing ever told it that path was a dead end. If you have watched an agent repeat its own mistake with total confidence, that is the whole bug right there. It is not stupid. It just never found out how the last attempt turned out.

What people are actually doing about it

Here is the part I did not expect. Almost everyone I talked to had already hit this and quietly built their own fix. And the fixes were all over the place, which to me is the tell that there is no standard answer yet.

A few that kept coming up.

Some people just use files. No memory platform, nothing fancy. Working memory lives in plain files the agent reads on startup, the agent decides what to write, and old stuff rolls off into a vector store later. For one person working alone this was apparently rock solid, and they were a little smug about it, fairly.

Other people keep a separate failure log. Pull "this failed and here is why" out of the general memory entirely, and when the agent wonders whether it has tried something before, check that log first, ahead of the normal similarity search. Somebody put it in a way that stuck with me. Embeddings are great at recalling topics. They almost never hold on to "we went down this road and it blew up because of X."
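A toy version of that lookup order, under stated assumptions: the failure log here is keyed by an exact approach name, the vectors are hand-made two-dimensional stand-ins for real embeddings, and `recall` is a made-up function name. How to match a new attempt against the log is the hard part, and this sketch does not solve it.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical stores with invented contents.
failure_log = {"approach-x": "failed with error Y"}
memories = [
    ("notes on approach-x", [0.9, 0.1]),
    ("notes on approach-z", [0.1, 0.9]),
]


def recall(approach: str, query_vec: list[float], k: int = 1) -> dict:
    """Check the failure log first, then run the usual similarity search."""
    known_failure = failure_log.get(approach)
    ranked = sorted(memories, key=lambda m: cosine(query_vec, m[1]), reverse=True)
    return {"known_failure": known_failure, "similar": [text for text, _ in ranked[:k]]}
```

Similarity alone would hand back "notes on approach-x" for a query about approach X and say nothing about how it went. The log lookup is what surfaces the dead end.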

A few have the agent write its own little post mortem after each task. Tried this, it broke because of that, next time do the other thing. Then search those before starting fresh. The honest downside they admitted is that after thirty or forty of these the file turns into noise, so they had to bolt on a step that summarizes the old ones.

And some split memory into tiers. Stable facts the agent is allowed to trust, versus everything else, which it can mention but not act on unless it can point to where it came from.
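The tier rule is simple enough to sketch. The tier names, the `Memory` fields, and `may_act_on` are hypothetical; the people describing this did not share their actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    TRUSTED = "trusted"        # stable facts the agent may act on
    UNVERIFIED = "unverified"  # everything else


@dataclass
class Memory:
    text: str
    tier: Tier
    source: Optional[str] = None  # where this came from, if known


def may_act_on(memory: Memory) -> bool:
    """Trusted memories are actionable; others only if they can point to a source."""
    return memory.tier is Tier.TRUSTED or memory.source is not None
```

An unverified memory with no source can still be mentioned in context. It just cannot drive an action.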

Different shapes, same underlying instinct. Stop pretending every memory is equally trustworthy.

Where it all falls apart

Once I lined these up next to each other, one thing jumped out.

Every single approach handles what to write down. None of them really handles what to keep.

Noticing that something failed turns out to be the easy half. You can catch tool errors, failed tests, timeouts, a change that got reverted. You can even treat "the task just ended and nobody ever confirmed it worked" as its own kind of failure, which is how you catch the quiet ones that never throw an error.

Everything after that is where it gets hard. Which failures are worth keeping, and which were flukes. When a lesson stops being true because the system moved underneath it. How you stop a memory from sliding from "this happened once" into "this is the rule," when nobody actually checked that it should be a rule.

One person framed it in a way I keep coming back to. A memory should hold proof, not a moral. The raw event, what happened and the evidence for it, should stay put and stay checkable. The lesson you draw from it should be allowed to change when something later contradicts it. The moment those two things become a single object, the system starts defending its interpretation instead of just remembering what actually happened. Which, honestly, is a very human way to be wrong.

What the newer tools still skip

There is a fresh wave of memory tooling now that handles a nearby but different problem, which is tracking whether a stored fact is still true as time passes. Who owned this before, who owns it now. That is genuinely useful and a real step up from blind similarity.

But notice it is answering a different question. "Is this fact still current" is not the same as "did acting on this memory actually lead somewhere good." A fact can be perfectly up to date and still be the exact thing that sent the agent into the wall three times in a row. Whether something is still true and whether it ever worked are two different axes. Most of the field is busy on the first one.

If you are building this today

The practical stuff I took away, mostly secondhand from people deeper in it than me.

Do not lean on similarity on its own. It hands you what looks related, not what helped. Treat failures as real memory, because what did not work is often more useful than what is merely similar. Keep the event and the lesson separate, so you can record what happened plainly and still revise the conclusion later. Put a real gate in front of what gets promoted into a durable rule, because noticing a break is not the same as having learned the right thing, and bad lessons calcify fast. And assume you will have to go back. A lesson that was true two weeks ago can be actively harmful once you have refactored the thing it was about.

None of this is solved. The people doing it well are using sensible rules of thumb: recency, prove it twice, a human glance, the occasional cleanup pass. And every one of those rules breaks somewhere predictable.

I do not think a better embedding model is the way out. The question feels different to me. Less "what is most similar to this," and not even "what is still true," but something closer to "what actually worked, and how do we hang onto that while the rest quietly fades."

If you are running agents in production and wrestling with this, I would genuinely like to hear how you handle it. The conversation that kicked all of this off taught me more than anything I have read on the topic.