Shared-memory governance benchmark

GateMem

Benchmarking memory governance in multi-principal shared-memory agents.

GateMem evaluates whether persistent memory agents can remain useful while enforcing requester-specific access boundaries and honoring deletion requests. It shifts memory evaluation from single-user recall toward governed shared memory in realistic institutional environments.


91 long-form episodes

2,218 hidden checkpoints

4 institutional domains

6 backbone LLMs

7 memory baselines

Overview

From remembering information to governing shared memory.

Conventional memory benchmarks often reward an agent for retrieving the right fact. GateMem asks a harder deployment question: whether the agent should reveal that fact to the current requester, and whether deleted information remains recoverable later.

What GateMem measures

GateMem treats persistent memory as a governed shared state rather than a private cache. The benchmark evaluates long-horizon usefulness, contextual authorization, and interface-level deletion compliance in one protocol.

Requester-specific memory use: The same fact may be safe for one principal and protected from another.

Policy-aware boundary decisions: Agents must handle roles, relationships, delegated access, and plausible overreach.

Post-deletion non-recovery: Deletion is evaluated through later interaction behavior, including confirmation and reconstruction attacks.

[Figure: a Shared Memory Bank held as governed shared state (policy · provenance · deletion), accessed by four requesters: Principal (owner), Clinician (authorized), Manager (scoped), and Guest (restricted).]

GateMem shifts evaluation from single-principal memory recall to multi-principal shared-memory governance.
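
To make the idea of governed shared state concrete, here is a minimal Python sketch. It is not GateMem's implementation: `MemoryRecord`, `SharedMemoryBank`, and the owner-only deletion rule are illustrative assumptions, and the requester names simply echo the principals listed above.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryRecord:
    """One stored fact plus the governance metadata attached to it (hypothetical)."""
    record_id: str
    content: str
    owner: str                                      # principal the fact belongs to
    allowed: set[str] = field(default_factory=set)  # other principals granted access
    deleted: bool = False                           # set once a deletion is honored


class SharedMemoryBank:
    """Toy shared store that checks the requester before revealing anything."""

    def __init__(self) -> None:
        self._records: dict[str, MemoryRecord] = {}

    def write(self, record: MemoryRecord) -> None:
        self._records[record.record_id] = record

    def delete(self, record_id: str, requester: str) -> bool:
        """Honor a deletion request. Assumption: only the owner may delete."""
        record = self._records.get(record_id)
        if record is None or record.owner != requester:
            return False
        record.content = ""   # drop the payload so it cannot be read back later
        record.deleted = True
        return True

    def read(self, requester: str) -> list[str]:
        """Return only live facts that this requester is authorized to see."""
        return [
            r.content
            for r in self._records.values()
            if not r.deleted and (requester == r.owner or requester in r.allowed)
        ]


bank = SharedMemoryBank()
bank.write(MemoryRecord("r1", "Medication list updated", owner="principal",
                        allowed={"clinician"}))
print(bank.read("clinician"))      # the clinician is authorized
print(bank.read("guest"))          # the guest sees nothing
bank.delete("r1", "principal")
print(bank.read("clinician"))      # empty after the owner's deletion
```

The same record yields different answers depending on who asks, and a honored deletion removes it for everyone. A real agent would also have to avoid leaking the fact through summaries or inferences, which a store-level check like this cannot guarantee.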

Benchmark

Long-form episodes with hidden governance checkpoints.

Each episode instantiates principals, relationships, access rules, evolving facts, and deletion requests. Hidden checkpoints query the agent at selected turn boundaries and are judged using structured annotations and leak targets.
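
One way to picture this structure is the Python sketch below. The released data format is not described here, so every name in it (`Episode`, `Checkpoint`, `CheckpointKind`, `run_episode`, and all field names) is a hypothetical stand-in chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class CheckpointKind(Enum):
    UTILITY = "utility"
    ACCESS_CONTROL = "access_control"
    ACTIVE_FORGETTING = "active_forgetting"


@dataclass
class Checkpoint:
    """A hidden query asked right after a given turn (hypothetical schema)."""
    after_turn: int
    kind: CheckpointKind
    requester: str
    query: str
    leak_targets: list[str] = field(default_factory=list)


@dataclass
class Episode:
    domain: str
    principals: dict[str, str]   # principal name -> role
    turns: list[str]             # the long-form trace, one entry per turn
    checkpoints: list[Checkpoint] = field(default_factory=list)

    def checkpoints_at(self, turn_index: int) -> list[Checkpoint]:
        """Checkpoints scheduled at the boundary after `turn_index`."""
        return [c for c in self.checkpoints if c.after_turn == turn_index]


def run_episode(
    episode: Episode,
    observe: Callable[[str], None],
    answer: Callable[[str, str], str],
) -> list[tuple[Checkpoint, str]]:
    """Feed turns to the agent and query it at each checkpoint boundary."""
    results: list[tuple[Checkpoint, str]] = []
    for index, turn in enumerate(episode.turns):
        observe(turn)
        for checkpoint in episode.checkpoints_at(index):
            response = answer(checkpoint.requester, checkpoint.query)
            results.append((checkpoint, response))
    return results
```

Here `observe` and `answer` stand in for whatever interface a memory agent exposes. The agent only ever sees turns and queries, so the checkpoint annotations stay hidden from it and are used only when judging the collected responses.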

Stage 01

Scenario design

Define domain, principals, roles, relationships, and scoped access rules.

Stage 02

Episode construction

Generate long-form multi-party traces with updates, benign noise, and deletion events.

Stage 03

Checkpoint evaluation

Insert hidden utility, access-control, and active-forgetting queries with judge specifications.

Dataset construction pipeline with domain policy design, episode construction, and hidden checkpoint generation.
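
As a rough illustration of how leak targets could be checked against a response, the sketch below uses plain string matching. This is an assumption made for brevity: `normalize` and `find_leaks` are hypothetical helpers, and the benchmark's own judge may work quite differently.

```python
import re


def normalize(text: str) -> str:
    """Casefold and collapse whitespace so trivial formatting does not hide a leak."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def find_leaks(response: str, leak_targets: list[str]) -> list[str]:
    """Return the leak targets that appear in the response after normalization."""
    haystack = normalize(response)
    leaked = []
    for target in leak_targets:
        needle = normalize(target)
        if needle and needle in haystack:
            leaked.append(target)
    return leaked


print(find_leaks("The garage code is  4821.", ["4821", "side gate"]))  # ['4821']
print(find_leaks("I can't share that.", ["4821"]))                     # []
```

Substring matching only catches verbatim disclosure. It would miss paraphrases, partial hints, or a bare "yes" given in reply to a confirmation attack, which is one reason a structured judge is needed for these checkpoints.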

DOMAIN 01

Medical

Clinical coordination, patient data, family delegation, cross-patient confusion, and protected lab or medication details.

DOMAIN 02

Office

Project confidentiality, HR records, contractor boundaries, role mismatches, and enterprise workflows.

DOMAIN 03

Education

Campus workflows, student support, counselor interactions, academic records, and scoped institutional access.

DOMAIN 04

Household

Family coordination, residents, guests, caregivers, access codes, care routines, and deleted household instructions.

Results

Current memory systems are useful, but not yet governed.

Across backbone LLMs and memory architectures, no method simultaneously achieves strong utility, robust access control, and reliable active forgetting. High recall often comes with leakage risk.

Key findings

Long-context prompting is strong but costly. Full history provides maximal evidence for authorized queries but still exposes protected or deleted information.

Policy-aware retrieval improves safety. Requester and access-policy metadata reduce leakage, but often trade off utility through missing evidence or over-refusal.

External memory is not governance by default. Structured memory systems still need explicit authorization and deletion-aware controls.
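
The policy-aware retrieval finding can be illustrated with a small Python sketch. It is not one of the evaluated baselines: `Entry`, `overlap`, `retrieve`, and the word-overlap scoring are assumptions chosen to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """A memory entry tagged with who may read it (hypothetical metadata)."""
    text: str
    readers: frozenset[str]
    deleted: bool = False


def overlap(query: str, text: str) -> int:
    """Crude relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query: str, requester: str, entries: list[Entry], k: int = 3) -> list[str]:
    """Filter by requester and deletion status first, then rank what remains."""
    scored = [
        (overlap(query, e.text), e)
        for e in entries
        if requester in e.readers and not e.deleted
    ]
    scored = [(score, e) for score, e in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [e.text for _, e in scored[:k]]


entries = [
    Entry("project falcon budget is approved", frozenset({"manager"})),
    Entry("office kitchen is cleaned on friday", frozenset({"manager", "contractor"})),
    Entry("project falcon old deadline", frozenset({"manager"}), deleted=True),
]
print(retrieve("project falcon budget", "manager", entries))
print(retrieve("project falcon budget", "contractor", entries))  # []
```

Filtering before ranking keeps protected and deleted entries out of the candidate set entirely. The cost is the trade-off noted above: if the `readers` metadata is incomplete, an authorized requester gets no evidence at all.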

Leaderboard available: Compare methods by domain and by MGS, Utility, Access Safety, and Forgetting Safety.


Judge-based main results across backbone LLMs and domains. The official leaderboard provides interactive domain-level views.
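
For readers who want a per-category breakdown of their own runs, the following Python sketch tallies pass rates from judged checkpoints. The category labels and the `pass_rates` helper are assumptions, and the sketch does not reproduce the official leaderboard metrics, whose exact definitions (including MGS) are given in the paper rather than here.

```python
from collections import defaultdict


def pass_rates(verdicts: list[tuple[str, bool]]) -> dict[str, float]:
    """Fraction of passed checkpoints per category.

    Each verdict is (category, passed). Category names are hypothetical labels.
    """
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for category, ok in verdicts:
        totals[category] += 1
        passed[category] += int(ok)
    return {category: passed[category] / totals[category] for category in totals}


verdicts = [
    ("utility", True),
    ("utility", False),
    ("access_control", True),
    ("active_forgetting", False),
]
print(pass_rates(verdicts))
# {'utility': 0.5, 'access_control': 1.0, 'active_forgetting': 0.0}
```

Reporting the categories separately keeps the tension visible: a method can score well on utility while failing access-control or forgetting checkpoints, which a single averaged number would hide.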

Use GateMem

Run locally or submit online.

GateMem supports local evaluation through the released codebase and online leaderboard submission through the Hugging Face submission interface.

Local evaluation

Implement a memory agent or score a generated predictions.jsonl file with the official evaluator.

git clone https://github.com/rzhub/GateMem.git
cd GateMem
pip install -r requirements.txt

python bench/scripts/run_eval.py \
  --config configs/runs/paper_main.yaml \
  --data_dir bench/data/medical \
  --agent long_context


Online submission

Upload predictions.jsonl, fill in the method metadata, and submit the results for maintainer review.

1. Generate predictions.jsonl
2. Open GateMem-Submit
3. Upload predictions
4. Review pending result
5. Update public leaderboard
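
Step 1 could look roughly like the Python sketch below, which writes one JSON object per line. The field names `checkpoint_id` and `response` are placeholders, not the official schema; check the released codebase for the exact fields the evaluator expects.

```python
import json
from pathlib import Path


def write_predictions(rows: list[dict], path: str = "predictions.jsonl") -> int:
    """Write one JSON object per line (JSON Lines) and return the row count."""
    with Path(path).open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(rows)


def read_predictions(path: str = "predictions.jsonl") -> list[dict]:
    """Read the file back, skipping blank lines, to sanity-check it before upload."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


if __name__ == "__main__":
    # Placeholder rows: the keys below are assumptions, not the official schema.
    rows = [
        {"checkpoint_id": "example-001", "response": "I can't share that."},
        {"checkpoint_id": "example-002", "response": "The meeting is on Friday."},
    ]
    write_predictions(rows)
    assert read_predictions() == rows
```

Reading the file back before uploading catches malformed lines early, since a single broken line would otherwise only surface during maintainer review.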


Citation

Cite GateMem.

If you use GateMem, please cite the accompanying paper and dataset.

@misc{ren2026gatemembenchmarkingmemorygovernance,
      title={GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents}, 
      author={Zhe Ren and Yibo Yang and Yimeng Chen and Zijun Zhao and Benshuo Fu and Zhihao Shu and Bingjie Zhang and Yangyang Xu and Dandan Guo and Shuicheng Yan},
      year={2026},
      eprint={2606.18829},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2606.18829}, 
}