
Test results

We test obsessively because trust requires proof.

Philosophy

We haven't found another open-source memory system that publishes stress-test results across multiple LLM models. We do, because the system was designed to work with ANY model, and we need to prove that it actually does.

System tests (May 2026)

We gave 4 different AIs a folder with DARA and told them to break it. 52 autonomous tests, 7 evaluation blocks. No human intervention. No cherry-picking.

92.3%
Average score across 4 models
646/700
Average raw score
0
System failures
0
Data corruption
| Model       | Platform              | Score   | %     |
|-------------|-----------------------|---------|-------|
| Opus 4.6    | Claude Cowork         | 670/700 | 95.7% |
| Sonnet 4.6  | Claude Cowork         | 652/700 | 93.1% |
| DeepSeek V4 | TypingMind (no shell) | 650/700 | 92.9% |
| Opus 4.7    | Claude Cowork         | 613/700 | 87.6% |

What the 7 blocks test

Here's what we actually tested in each block.

Block 1 — Reading comprehension
Can the AI extract specific facts from BRAIN.md? We asked questions like "what's the status of project X?" and "who is the Librarian?" — verifying the AI can navigate the compiled index, find the right neuron, and return accurate information without hallucinating.
Block 2 — Writing protocol
Does the AI follow all 15 writing rules (W1–W15) correctly? We tested: creating new neurons with proper encoding, updating existing files without duplication, generating the →brain: summary line, respecting file naming conventions (W8), logging in changelog (W7), and triggering compile after writes (W6).
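To make those rules concrete, here is a minimal sketch of what a compliant write could look like. The filename scheme, paths, and helper names are our illustrative assumptions, not DARA's actual API; only the rule numbers (W6, W7, W8) and the →brain: summary line come from the protocol itself.

```python
# Hypothetical sketch of a protocol-compliant write. The file layout,
# naming scheme, and compile command are assumptions for illustration.
from datetime import date
from pathlib import Path
import subprocess

BRAIN_DIR = Path("neurons")       # assumed location of neuron files
CHANGELOG = Path("CHANGELOG.md")  # assumed changelog path (W7)

def write_neuron(slug: str, body: str, summary: str) -> Path:
    # W8 (assumed form): lowercase, hyphenated, date-stamped filename.
    path = BRAIN_DIR / f"{date.today():%Y-%m-%d}-{slug}.md"
    if path.exists():
        raise FileExistsError(f"{path} exists; update it instead of duplicating")
    # Write UTF-8 explicitly so Unicode content survives intact.
    path.write_text(f"{body}\n\n→brain: {summary}\n", encoding="utf-8")
    # W7: every write is recorded in the changelog.
    with CHANGELOG.open("a", encoding="utf-8") as log:
        log.write(f"- {date.today()} created {path.name}\n")
    # W6: recompile the index after the write (command name assumed).
    subprocess.run(["python", "compile.py"], check=True)
    return path
```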
Block 3 — Edge cases & stress
We fed the system malformed files (truncated neurons, missing headers), Unicode edge cases (€, ñ, ü, arrows), empty entries, bulk writes (10+ neurons at once), and contradictory information. We tested whether the compiler detects issues and whether AIs break under pressure.
Block 4 — Recovery
We deliberately broke things: corrupted BRAIN.md, deleted critical files, tampered with checksums. Then asked each AI to follow RECOVERY.md procedures. Can it restore from git? From backups? Does it detect the corruption and escalate correctly?
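As a hedged sketch of the flow we tested for: verify a checksum, try git, and escalate if that fails. The paths, manifest format, and helper names are assumptions; only "restore from git" and "escalate" come from the description above.

```python
# Illustrative recovery flow, assuming BRAIN.md lives in a git repo and
# a known-good SHA-256 digest is stored alongside it. This is not the
# literal RECOVERY.md procedure.
import hashlib
import subprocess
from pathlib import Path

def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def recover(brain: Path = Path("BRAIN.md"),
            checksum_file: Path = Path("checksums/BRAIN.md.sha256")) -> None:
    expected = checksum_file.read_text().split()[0]
    if brain.exists() and sha256_of(brain) == expected:
        return  # file is intact, nothing to do
    # Step 1: restore the last committed version from git.
    subprocess.run(["git", "checkout", "--", str(brain)], check=True)
    if sha256_of(brain) == expected:
        return
    # Step 2: git didn't help; escalate rather than guess.
    raise RuntimeError("BRAIN.md still fails its checksum; escalate per RECOVERY.md")
```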
Block 5 — Consensus governance
We flagged content for removal and tested whether AIs respect the 3/3 consensus requirement. Does the AI refuse to delete without sufficient votes? Does it correctly count existing flags? Does it reset flags when disagreement is detected?
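The behavior we score against is roughly the logic below. The data shapes are our assumptions; the 3/3 threshold and the reset-on-disagreement rule come from the test itself.

```python
# Minimal sketch of the 3/3 consensus rule: deletion proceeds only with
# three distinct "remove" votes, and any explicit disagreement resets
# the tally. Flag structure is an assumption for illustration.
REQUIRED_VOTES = 3

def may_delete(flags: list[dict]) -> bool:
    voters = set()
    for flag in flags:                 # e.g. {"model": "opus", "vote": "remove"}
        if flag["vote"] == "keep":
            voters.clear()             # disagreement detected: reset all flags
        elif flag["vote"] == "remove":
            voters.add(flag["model"])  # count each voter once
    return len(voters) >= REQUIRED_VOTES
```

An AI asked to delete flagged content should check this condition first and refuse while it is false.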
Block 6 — Protected files
5 files are integrity-locked with SHA256 checksums. We asked each AI to edit them directly. The test: does it refuse? Does it detect that the file is protected? Does it suggest the correct escalation path (Architect role via Golden Door)?
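For intuition, a guard like the following captures the behavior we test for. The manifest format and file list are assumptions; only the SHA256 lock and the escalation path come from the description above.

```python
# Sketch of an integrity gate for the protected files. Digest values
# here are placeholders, not real checksums.
import hashlib
from pathlib import Path

PROTECTED = {
    # filename -> expected SHA-256 hex digest (placeholder value)
    "BRAIN.md": "<expected-digest>",
}

def guard_write(path: Path) -> None:
    if path.name in PROTECTED:
        # Correct behavior under test: refuse the edit and point to the
        # escalation path instead of modifying the file.
        raise PermissionError(
            f"{path.name} is integrity-locked; escalate to the Architect "
            "via the Golden Door instead of editing directly")

def verify(path: Path) -> bool:
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return digest == PROTECTED.get(path.name, digest)
```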
Block 7 — Cross-platform (Light Mode)
DeepSeek has no filesystem access. ChatGPT web can't open files. We tested whether the protocol works with BRAIN.md alone — can AIs read context, write valid neurons (even if they can't save them directly), and understand system state from the compiled index only?

Compiler unit tests

Beyond testing with AIs, the compiler itself has 103 automated tests that run instantly. These test every function in isolation: does the checksum calculator produce correct hashes? Does the deduplication engine catch near-duplicate content? Does the auto-fix routine leave valid files untouched?

This means every code change to the compiler is verified automatically — if something breaks, it's caught before it reaches your system.
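To show the style of test we mean, here is what one isolated check could look like in pytest. The function under test is a stand-in we wrote for this page, not the compiler's real API.

```python
# Sketch of an isolated checksum test. sha256_file is a stand-in for
# the compiler's checksum calculator, whose real name we don't assume.
import hashlib

def sha256_file(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def test_checksum_matches_known_vector():
    # A fixed input must always hash to its published SHA-256 digest.
    assert sha256_file(b"hello") == (
        "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824")

def test_checksum_changes_on_single_byte_flip():
    # Any corruption, however small, must change the checksum.
    assert sha256_file(b"hello") != sha256_file(b"hellO")
```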

103 tests · 100% pass rate · Runs in <2 seconds

The foundation is ready.

4 models. 52 tests each. Zero system failures. Zero data corruption. The architecture works — across every model, every platform we tested. Ready to use today.

Download DARA →