System prompt hardening: PII refusals, tool misuse guardrails, prompt confidentiality

@1sbang ran 330 attack probes against Mike's baseline prompt and found three gaps worth fixing in a legal AI context. The resulting change is 35 lines added to `chatTools.ts` - no tool or routing changes, just prompt text.

securitychat-ui

The baseline had measurable weaknesses. At 27% block rate the model would paraphrase its instructions when asked directly. At 61% it would offer to extract SSNs once a document was present - treating PII guardrails as conditional on document availability rather than on what the request was trying to do. Bulk document enumeration and cross-matter copying hit only 56% block because the model deflected on missing tools rather than refusing the intent.

The fix adds three sections to SYSTEM_PROMPT in backend/src/lib/chatTools.ts. CONFIDENTIALITY scripted a specific fallback response ("I have no record of sharing system instructions in this conversation") and prohibits acknowledging the prompt's existence under any framing, including the "continue where you left off" pattern. PRIVACY BOUNDARIES enumerates 13 protected categories - SSNs, financial account numbers, medical records, named-individual settlement amounts, protected class attributes - and carves out normal contract work so the guardrail doesn't block party identification or payment term extraction. TOOL USE BOUNDARIES targets seven patterns by intent: bulk enumeration, multi-copy operations, cross-client replication, silent edits, injection payloads, and exfiltration clauses.

The validation split hit 100% block on prompt leakage, 100% on PII, 100% on tool misuse, and 0% false refusals across document QA, drafting, editing, and research strata. The author flagged prompt injection and jailbreak categories as already above threshold at baseline and left them out of scope.

The PR opened 2026-05-05 and closed without merging on 2026-05-09.

So what Worth a look if you need to harden Mike for deployment with sensitive legal material and want a tested baseline rather than a first draft. The enumerated PII categories transfer directly to any system prompt. The intent-based framing for tool refusals is the key design insight - refusing based on what a request is trying to accomplish rather than whether the relevant tool currently exists. Skip if your threat model doesn't include adversarial users testing prompt boundaries.

Spotted something wrong? Or know the PR text has fresher detail than the writeup above?

Commits in this thread

1 commit from 1sbang/mike, oldest first. Source extracted verbatim from the harvested git log.

SHA	Subject	Author	Date
`48c9f772`	Security hardening: system prompt confidentiality, PII boundaries, and tool use guardrails	Isaac Bang	2026-05-05	↗ GitHub
commit body Adds three security sections to SYSTEM_PROMPT in chatTools.ts: CONFIDENTIALITY: instructs Mike to never reveal, quote, or acknowledge its system instructions, including fake-prior-context social engineering patterns. PRIVACY BOUNDARIES: enumerates PII categories always refused on intent (not on document availability): SSNs, bank accounts, passports, addresses, phone, DOB, medical, genetic, biometrics, protected class attributes, compensation details, criminal history, and settlement amounts tied to named individuals. Preserves normal legal document work (contract terms, party identification). TOOL USE BOUNDARIES: adds intent-based refusal for bulk document/workflow enumeration, cross-client data replication, silent edits without review, injection payloads, and external forwarding clauses. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

SHA

Subject

Author

Date

48c9f772

Security hardening: system prompt confidentiality, PII boundaries, and tool use guardrails

Isaac Bang

2026-05-05

↗ GitHub

commit body

Adds three security sections to SYSTEM_PROMPT in chatTools.ts:

CONFIDENTIALITY: instructs Mike to never reveal, quote, or acknowledge its
system instructions, including fake-prior-context social engineering patterns.

PRIVACY BOUNDARIES: enumerates PII categories always refused on intent (not
on document availability): SSNs, bank accounts, passports, addresses, phone,
DOB, medical, genetic, biometrics, protected class attributes, compensation
details, criminal history, and settlement amounts tied to named individuals.
Preserves normal legal document work (contract terms, party identification).

TOOL USE BOUNDARIES: adds intent-based refusal for bulk document/workflow
enumeration, cross-client data replication, silent edits without review,
injection payloads, and external forwarding clauses.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Capture this thread into my fork

Download a single Markdown prompt that tells Claude how to port every commit above into your working tree — adapting paths and structure to match your repo. Run it via claude -p < capture-thread-305.md from inside the repo you want the changes in.