Enterprise AI agents struggle to maintain consistent, actionable user memory across conversations. A new research paper posted to arXiv, authored by Li and Bojie, introduces a paradigm called User as Code (UaC) that stores user profiles as executable Python programs, enabling deterministic reasoning and proactive safety-critical alerts.
The Problem with Retrieval-Based Memory
According to the paper, current personalized AI agents almost always store user memory as unstructured text, a knowledge graph, or a flat store of facts. They consult this memory by retrieval, fetching the entries most similar to the current request. The authors label this approach "bag-of-facts" memory: it recalls individual facts well but struggles to resolve contradictions, aggregate over many records, or enforce rules. Because storing a fact and acting on it are separate steps, the memory cannot compute derived answers or trigger alerts without an explicit query.
How User as Code Works
UaC reimagines user memory as a living software project. In this paradigm, typed Python objects hold the user's state, and ordinary Python functions encode the rules that govern that state. Representation and reasoning happen in one medium that an interpreter can run. The enabling mechanism is a two-phase pipeline: facts are first written to an append-only log that never discards anything, and that log is periodically checkpointed into typed code. As a result, the memory is executable: the AI can run the code to compute answers or enforce rules automatically.
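To make the two-phase idea concrete, here is a minimal sketch under stated assumptions. The names `LogEntry`, `UserState`, `UserMemory`, and the `kind` tags are hypothetical and do not come from the paper, and the checkpoint here is a hand-written fold over the log rather than whatever mechanism the authors actually use to produce typed code.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class LogEntry:
    """One raw observation; entries are only ever appended (hypothetical type)."""
    day: date
    kind: str       # illustrative tags such as "allergy" or "trip"
    payload: dict


@dataclass
class UserState:
    """Typed snapshot of the user, rebuilt from the log at each checkpoint."""
    allergies: set[str] = field(default_factory=set)
    trips: list[dict] = field(default_factory=list)


class UserMemory:
    def __init__(self) -> None:
        self.log: list[LogEntry] = []
        self.state = UserState()

    def append(self, entry: LogEntry) -> None:
        # Phase 1: record the fact; nothing is overwritten or deleted.
        self.log.append(entry)

    def checkpoint(self) -> UserState:
        # Phase 2: fold the full log into typed state.
        state = UserState()
        for entry in self.log:
            if entry.kind == "allergy":
                state.allergies.add(entry.payload["substance"])
            elif entry.kind == "trip":
                state.trips.append(entry.payload)
        self.state = state
        return state


memory = UserMemory()
memory.append(LogEntry(date(2024, 3, 1), "allergy", {"substance": "penicillin"}))
memory.append(LogEntry(date(2024, 6, 9), "trip", {"country": "JP", "international": True}))
state = memory.checkpoint()
```

Because the log is kept intact, the typed state can in principle be rebuilt at any time, which is one plausible reading of why the pipeline never discards a fact.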
Benchmark Performance
On standard long-term conversation benchmarks, UaC matches both a full-context upper bound and the strongest prior memory systems on recall. Specifically, it achieves 78.8% recall on the LOCOMO benchmark. However, its advantage emerges where representation matters most. On aggregate questions over a user's history — for example, "how many international trips did I take last year?" — retrieval-based memory collapses, scoring between 6% and 43%, while UaC stays near-perfect at 99%. The reason, the paper explains, is that the answer is a one-line computation over typed state rather than a search over text.
| Metric | Retrieval-Based Memory | User as Code (UaC) |
|---|---|---|
| LOCOMO recall | Comparable to UaC (strongest prior systems) | 78.8% |
| Aggregate query accuracy | 6% – 43% | 99% |
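The trip-counting question quoted above shows why typed state helps. The sketch below is illustrative only: the `Trip` type, its fields, and the sample records are assumptions made for this example, not data or code from the paper.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Trip:
    # Hypothetical record type for this illustration.
    destination: str
    start: date
    international: bool


def international_trips_in(trips: list[Trip], year: int) -> int:
    """Count international trips that started in the given year."""
    return sum(1 for t in trips if t.international and t.start.year == year)


# Made-up sample history.
trips = [
    Trip("Tokyo", date(2024, 2, 10), True),
    Trip("Chicago", date(2024, 5, 3), False),
    Trip("Berlin", date(2024, 9, 21), True),
    Trip("Lisbon", date(2023, 11, 2), True),
]

count = international_trips_in(trips, 2024)
print(count)  # 2
```

The answer depends on every matching record, so a retriever that returns only the few most similar entries can undercount, whereas the function always scans the full typed list.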
Moreover, because UaC's rules execute deterministically whenever the state changes, it can surface unsolicited, safety-critical alerts — such as a newly prescribed drug that conflicts with an allergy recorded months earlier. The paper notes this is a capability that query-driven memory cannot provide.
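A rule that fires on state change might look like the following sketch. Everything here is hypothetical: `HealthProfile`, its methods, and the one-entry `CONFLICTS` table are invented for illustration and are neither the paper's code nor medical guidance.

```python
from dataclasses import dataclass, field

# Illustrative lookup only: maps a drug to the allergen it conflicts with.
CONFLICTS = {"amoxicillin": "penicillin"}


@dataclass
class HealthProfile:
    allergies: set[str] = field(default_factory=set)
    prescriptions: list[str] = field(default_factory=list)
    alerts: list[str] = field(default_factory=list)

    def add_allergy(self, substance: str) -> None:
        self.allergies.add(substance)
        self._run_rules()

    def add_prescription(self, drug: str) -> None:
        self.prescriptions.append(drug)
        self._run_rules()

    def _run_rules(self) -> None:
        # Re-evaluated on every state change; no user query is needed.
        for drug in self.prescriptions:
            allergen = CONFLICTS.get(drug)
            if allergen in self.allergies:
                message = f"{drug} conflicts with recorded allergy: {allergen}"
                if message not in self.alerts:
                    self.alerts.append(message)


profile = HealthProfile()
profile.add_allergy("penicillin")        # recorded months earlier
profile.add_prescription("amoxicillin")  # new fact triggers the rule
print(profile.alerts)
```

The alert appears as a side effect of updating the state, which is the behavior the paper contrasts with memory that only responds when queried.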
Implications for Enterprise AI
For enterprise technology leaders, UaC offers a framework for building AI agents that maintain reliable, computable user profiles. Instead of sifting through conversational logs, agents can execute code to enforce compliance rules, aggregate usage patterns, or detect conflicts in real time. The use of typed Python objects and deterministic functions means that memory is auditable and testable — critical for regulated industries. While the research is still at the preprint stage, the architecture points toward more robust personalization in customer service bots, internal knowledge assistants, and any system requiring persistent, actionable user context.