Preliminary Results

PCM-Eval: Does Structured Context Actually Work?

Ben Flint
Lock-in Lab
April 2026
Abstract

Does structured, scoped personal context measurably improve AI agent performance compared to flat memory, platform memory, or no memory at all? We present preliminary results from a controlled benchmark comparing four context strategies across five real-world scenarios. The structured three-file approach (key.md + log.md + insights.md) outperformed every alternative on accuracy, token efficiency, and hallucination suppression.

Headline results:
  +14%  accuracy improvement
  5x    fewer hallucinations
  -32%  token reduction

1. Research Question

No existing benchmark tests context management at the personal/user level. Retrieval benchmarks test document search. Long-context benchmarks test window utilization. Memory benchmarks test fact recall. None test whether structured personal context — the decisions you've made, the people you work with, the state of your projects — improves agent performance in realistic work scenarios.

PCM (personal context management) sits at a novel intersection. This evaluation asks whether the structure matters, or whether more context (regardless of structure) is always better.
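To make "structured personal context" concrete, here is a minimal sketch of how a three-file kernel (key.md, log.md, insights.md, the files named in the abstract) might be loaded and scoped by domain. The directory layout, section labels, and the load_kernel helper are illustrative assumptions, not the evaluated implementation.

```python
from pathlib import Path

# Illustrative sketch only: the file names come from the abstract, but the
# directory layout, section labels, and this helper are assumptions.
KERNEL_FILES = {
    "identity": "key.md",        # stable facts: who the user is, preferences
    "history": "log.md",         # recent decisions and project state
    "knowledge": "insights.md",  # distilled lessons worth carrying forward
}

def load_kernel(root: str, domain: str) -> str:
    """Assemble a domain-scoped kernel for injection into an agent prompt."""
    base = Path(root).expanduser() / domain
    sections = []
    for label, filename in KERNEL_FILES.items():
        path = base / filename
        if path.exists():  # a missing file is simply skipped
            sections.append(f"## {label}\n{path.read_text().strip()}")
    return "\n\n".join(sections)

# Inject only the kernel for the domain the task belongs to, e.g.:
# context = load_kernel("~/context", domain="work")
```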

2. Conditions

Four context strategies, tested against the same five scenarios:

  ALIVE Kernel        8.1 / 10
  Full Context Dump   7.1 / 10
  Platform Memory     5.8 / 10
  No Context          4.2 / 10
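The design reduces to a conditions-by-scenarios grid. A minimal sketch of such an evaluation loop, assuming hypothetical build_context, run_agent, and grade callables (the actual harness is not described in this write-up):

```python
# Hypothetical sketch of the conditions x scenarios evaluation grid.
CONDITIONS = ["alive_kernel", "full_context_dump", "platform_memory", "no_context"]
NUM_SCENARIOS = 5  # the five real-world scenarios; their contents are not given here

def run_eval(build_context, run_agent, grade):
    """Score every condition against every scenario; return per-condition means."""
    results = {}
    for condition in CONDITIONS:
        scores = []
        for scenario_id in range(NUM_SCENARIOS):
            context = build_context(condition, scenario_id)  # strategy under test
            response = run_agent(context, scenario_id)
            scores.append(grade(response, scenario_id))      # 0-10 rubric score
        results[condition] = sum(scores) / len(scores)
    return results
```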

3. The Counterintuitive Finding

Less context, better structured, beats more context dumped raw.

The ALIVE kernel (three files totaling roughly 200-600 tokens of injected context) outperformed a full context dump of 50,000+ tokens. The structured approach used 32% fewer input tokens while achieving 14% higher accuracy; its hallucination rate was roughly half the full dump's and one fifth of the no-context baseline's.

The exemplar: a 200-token skill file replacing 50,000 tokens of MCP context. Structure matters more than quantity; scoping matters more than volume. The attention budget is real: every token you inject depletes the model's capacity to attend to everything else.
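The attention-budget arithmetic is easy to check on your own context files. A sketch using the tiktoken library; the cl100k_base encoding and the full_context_dump.txt export are assumptions, since the write-up does not say how tokens were counted:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding, not specified

def injected_tokens(text: str) -> int:
    """Count the tokens a context payload consumes before the task even begins."""
    return len(enc.encode(text))

kernel = "".join(open(f).read() for f in ("key.md", "log.md", "insights.md"))
dump = open("full_context_dump.txt").read()  # hypothetical flat export

print(f"kernel: {injected_tokens(kernel):>7} tokens")  # ~200-600 per the paper
print(f"dump:   {injected_tokens(dump):>7} tokens")    # 50,000+ per the paper
```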

4. Token Efficiency

  ALIVE Kernel   645 avg input tokens
  Full Dump      956 avg input tokens

The 645-to-956 gap is the 32% token reduction cited in the headline figures.

5. Hallucination Rate

Hallucinations per response:

  No Context       4.0
  Platform Memory  2.5
  Full Dump        1.5
  ALIVE Kernel     0.8
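The write-up does not say how hallucinations were counted. One plausible rubric, sketched below with hypothetical extract_claims and is_supported helpers, is to tally the claims in each response that the provided ground truth cannot support, then average per response:

```python
# Hypothetical grading rubric, not the paper's actual methodology.
def hallucination_rate(responses, ground_truth, extract_claims, is_supported):
    """Average count of unsupported claims per response."""
    counts = []
    for response in responses:
        claims = extract_claims(response)  # e.g. split into atomic factual statements
        counts.append(sum(1 for c in claims if not is_supported(c, ground_truth)))
    return sum(counts) / len(counts)
```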

6. What This Means

The direction is clear. Structured personal context — scoped by domain, provisioned to the agent, with identity/history/knowledge separation — outperforms every alternative we tested. The gap between structured and unstructured widens as task complexity increases.

The implication for the field: context engineering is not about giving agents more information. It is about giving them the right information, in the right structure, at the right scope.

Preliminary. This is an n=1 evaluation on a single user's context across five scenarios. The methodology has limitations — small sample size, single-user bias, scenarios designed around the ALIVE architecture. We're publishing because the direction is clear and the gap in existing benchmarks is real. A multi-user evaluation with independent scenario design is planned.