1sbang adds adversarially-tested system prompt guardrails for PII and tool misuse

@1sbang ran 330 attack probes against Mike's baseline system prompt, found three concrete gaps, and submitted a fix. The PR closed unmerged 90 seconds after opening, but the work and validation artifacts are on the branch.

The methodology is the more interesting part here. Rather than adding guardrails based on intuition, @1sbang used an automated red-team/blue-team suite (called mega-security in the source commentary) to pair attack probes against benign ones and iterate until block rate improved without introducing false refusals. Four hardening iterations ran against a baseline prompt that had measurable weaknesses.

Three gaps came out of that process. The model would paraphrase its own instructions at roughly 27% block rate when asked directly - the fix scripted a specific fallback response and prohibited acknowledging the prompt's existence regardless of how the request was framed. PII requests were treated as capability checks rather than intent refusals: the model would offer to extract SSNs once a document was uploaded, putting the guardrail on document availability rather than the request itself. And harmful tool-use patterns - bulk document enumeration, cross-matter copying - got deflected with "I don't have that tool," which is only protective until the tool exists.

The submitted fix (+35/-0 in backend/src/lib/chatTools.ts) adds three sections to SYSTEM_PROMPT. PRIVACY BOUNDARIES enumerates 13 protected categories and explicitly carves out legitimate legal review work so the guardrail doesn't break contract summarization or party identification. TOOL USE BOUNDARIES lists seven operation patterns that must be refused on intent. CONFIDENTIALITY handles prompt leakage.

Reported validation results: 96.4% block rate on the training split, 93.6% on held-out validation, 0% false refusals across document QA, drafting, editing, and research. Prompt injection and jailbreak categories were deliberately left out of scope - both were already above threshold at baseline.

The PR didn't merge. The upstream repo closed it almost immediately.