
Test results

We test obsessively because trust requires proof.

Philosophy

We haven't found another open-source memory system that publishes stress-test results across multiple LLM models. We do, because the system was designed to work with ANY model, and we need to prove that it actually does.

System tests (May 2026)

We gave 4 different AIs a folder with DARA and told them to break it. 52 autonomous tests, 7 evaluation blocks. No human intervention. No cherry-picking.

92.3%
Average score across 4 models
646/700
Average raw score
0
System failures
0
Data corruption
| Model       | Platform              | Score   | %     |
|-------------|-----------------------|---------|-------|
| Opus 4.6    | Claude Cowork         | 670/700 | 95.7% |
| Sonnet 4.6  | Claude Cowork         | 652/700 | 93.1% |
| DeepSeek V4 | TypingMind (no shell) | 650/700 | 92.9% |
| Opus 4.7    | Claude Cowork         | 613/700 | 87.6% |

What the 7 blocks test

Here's what we actually tested in each block.

Block 1 — Reading comprehension
Can the AI extract specific facts from BRAIN.md? We asked questions like "what's the status of project X?" and "who is the Librarian?" — verifying the AI can navigate the compiled index, find the right neuron, and return accurate information without hallucinating.
Block 2 — Writing protocol
Does the AI follow all 15 writing rules (W1–W15) correctly? We tested: creating new neurons with proper encoding, updating existing files without duplication, generating the →brain: summary line, respecting file naming conventions (W8), logging in changelog (W7), and triggering compile after writes (W6).
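To make those rules concrete, here is a minimal sketch of what a compliant write could look like. The filename scheme, paths, and helper names are our illustrative assumptions, not DARA's actual API; only the rule numbers (W6, W7, W8) and the →brain: summary line come from the protocol itself.

```python
# Hypothetical sketch of a protocol-compliant write. The file layout,
# naming scheme, and compile command are assumptions for illustration.
from datetime import date
from pathlib import Path
import subprocess

BRAIN_DIR = Path("neurons")       # assumed location of neuron files
CHANGELOG = Path("CHANGELOG.md")  # assumed changelog path (W7)

def write_neuron(slug: str, body: str, summary: str) -> Path:
    # W8 (assumed form): lowercase, hyphenated, date-stamped filename.
    path = BRAIN_DIR / f"{date.today():%Y-%m-%d}-{slug}.md"
    if path.exists():
        raise FileExistsError(f"{path} exists; update it instead of duplicating")
    # Write UTF-8 explicitly so Unicode content survives intact.
    path.write_text(f"{body}\n\n→brain: {summary}\n", encoding="utf-8")
    # W7: every write is recorded in the changelog.
    with CHANGELOG.open("a", encoding="utf-8") as log:
        log.write(f"- {date.today()} created {path.name}\n")
    # W6: recompile the index after the write (command name assumed).
    subprocess.run(["python", "compile.py"], check=True)
    return path
```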
Block 3 — Edge cases & stress
We fed the system malformed files (truncated neurons, missing headers), Unicode edge cases (€, ñ, ü, arrows), empty entries, bulk writes (10+ neurons at once), and contradictory information. We tested whether the compiler detects issues and whether AIs break under pressure.
Block 4 — Recovery
We deliberately broke things: corrupted BRAIN.md, deleted critical files, tampered with checksums. Then asked each AI to follow RECOVERY.md procedures. Can it restore from git? From backups? Does it detect the corruption and escalate correctly?
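As a hedged sketch of the flow we tested for: verify a checksum, try git, and escalate if that fails. The paths, manifest format, and helper names are assumptions; only "restore from git" and "escalate" come from the description above.

```python
# Illustrative recovery flow, assuming BRAIN.md lives in a git repo and
# a known-good SHA-256 digest is stored alongside it. This is not the
# literal RECOVERY.md procedure.
import hashlib
import subprocess
from pathlib import Path

def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def recover(brain: Path = Path("BRAIN.md"),
            checksum_file: Path = Path("checksums/BRAIN.md.sha256")) -> None:
    expected = checksum_file.read_text().split()[0]
    if brain.exists() and sha256_of(brain) == expected:
        return  # file is intact, nothing to do
    # Step 1: restore the last committed version from git.
    subprocess.run(["git", "checkout", "--", str(brain)], check=True)
    if sha256_of(brain) == expected:
        return
    # Step 2: git didn't help; escalate rather than guess.
    raise RuntimeError("BRAIN.md still fails its checksum; escalate per RECOVERY.md")
```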
Block 5 — Consensus governance
We flagged content for removal and tested whether AIs respect the 3/3 consensus requirement. Does the AI refuse to delete without sufficient votes? Does it correctly count existing flags? Does it reset flags when disagreement is detected?
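The behavior we score against is roughly the logic below. The data shapes are our assumptions; the 3/3 threshold and the reset-on-disagreement rule come from the test itself.

```python
# Minimal sketch of the 3/3 consensus rule: deletion proceeds only with
# three distinct "remove" votes, and any explicit disagreement resets
# the tally. Flag structure is an assumption for illustration.
REQUIRED_VOTES = 3

def may_delete(flags: list[dict]) -> bool:
    voters = set()
    for flag in flags:                 # e.g. {"model": "opus", "vote": "remove"}
        if flag["vote"] == "keep":
            voters.clear()             # disagreement detected: reset all flags
        elif flag["vote"] == "remove":
            voters.add(flag["model"])  # count each voter once
    return len(voters) >= REQUIRED_VOTES
```

An AI asked to delete flagged content should check this condition first and refuse while it is false.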
Block 6 — Protected files
5 files are integrity-locked with SHA256 checksums. We asked each AI to edit them directly. The test: does it refuse? Does it detect that the file is protected? Does it suggest the correct escalation path (Architect role via Golden Door)?
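For intuition, a guard like the following captures the behavior we test for. The manifest format and file list are assumptions; only the SHA256 lock and the escalation path come from the description above.

```python
# Sketch of an integrity gate for the protected files. Digest values
# here are placeholders, not real checksums.
import hashlib
from pathlib import Path

PROTECTED = {
    # filename -> expected SHA-256 hex digest (placeholder value)
    "BRAIN.md": "<expected-digest>",
}

def guard_write(path: Path) -> None:
    if path.name in PROTECTED:
        # Correct behavior under test: refuse the edit and point to the
        # escalation path instead of modifying the file.
        raise PermissionError(
            f"{path.name} is integrity-locked; escalate to the Architect "
            "via the Golden Door instead of editing directly")

def verify(path: Path) -> bool:
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return digest == PROTECTED.get(path.name, digest)
```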
Block 7 — Cross-platform (Light Mode)
DeepSeek has no filesystem access. ChatGPT web can't open files. We tested whether the protocol works with BRAIN.md alone — can AIs read context, write valid neurons (even if they can't save them directly), and understand system state from the compiled index only?

Compiler unit tests

Beyond testing with AIs, the compiler itself has 103 automated tests that run instantly. These test every function in isolation: does the checksum calculator produce correct hashes? Does the deduplication engine catch near-duplicate content? Does the auto-fix routine leave valid files untouched?

This means every code change to the compiler is verified automatically — if something breaks, it's caught before it reaches your system.
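To show the style of test we mean, here is what one isolated check could look like in pytest. The function under test is a stand-in we wrote for this page, not the compiler's real API.

```python
# Sketch of an isolated checksum test. sha256_file is a stand-in for
# the compiler's checksum calculator, whose real name we don't assume.
import hashlib

def sha256_file(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def test_checksum_matches_known_vector():
    # A fixed input must always hash to its published SHA-256 digest.
    assert sha256_file(b"hello") == (
        "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824")

def test_checksum_changes_on_single_byte_flip():
    # Any corruption, however small, must change the checksum.
    assert sha256_file(b"hello") != sha256_file(b"hellO")
```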

103 tests · 100% pass rate · Runs in <2 seconds

The foundation is ready.

4 models. 52 tests each. Zero system failures. Zero data corruption. The architecture works — across every model, every platform we tested. Ready to use today.

Download DARA →