1sbang puts Mike's system prompt through a red team

An automated security tool fired hundreds of attack probes at Mike's baseline instructions and surfaced three gaps worth patching.

securitychat-ui

1sbang ran the chat assistant's underlying instructions through an automated red-teaming tool that threw 330 attack prompts and 100 benign legal-work prompts at it, then iterated until the model held up. The baseline had three soft spots: it would paraphrase its own instructions when asked, it treated requests for Social Security numbers and similar personal data as capability questions rather than refusals, and it deflected misuse like bulk document enumeration or copying material across client matters with 'I don't have that tool' rather than refusing on intent.

The fix in every case is to refuse based on what is being asked, not on whether the prerequisites happen to be present. Three new sections - covering confidentiality, privacy boundaries with an explicit carve-out for normal contract review, and limits on tool use - closed the gaps without raising the false-refusal rate on legitimate work. The pull request was closed without merging a week ago.

So what Worth a look for any team shipping a legal AI assistant who hasn't yet pressure-tested what theirs will say when asked nicely.

View this fork on GitHub →

Spotted something wrong? Or know the PR text has fresher detail than the writeup above?

Commits in this thread

1 commit from 1sbang/mike, oldest first. Source extracted verbatim from the harvested git log.

SHA Subject Author Date
48c9f772 Security hardening: system prompt confidentiality, PII boundaries, and tool use guardrails Isaac Bang 2026-05-05 ↗ GitHub
commit body
Adds three security sections to SYSTEM_PROMPT in chatTools.ts:

CONFIDENTIALITY: instructs Mike to never reveal, quote, or acknowledge its
system instructions, including fake-prior-context social engineering patterns.

PRIVACY BOUNDARIES: enumerates PII categories always refused on intent (not
on document availability): SSNs, bank accounts, passports, addresses, phone,
DOB, medical, genetic, biometrics, protected class attributes, compensation
details, criminal history, and settlement amounts tied to named individuals.
Preserves normal legal document work (contract terms, party identification).

TOOL USE BOUNDARIES: adds intent-based refusal for bulk document/workflow
enumeration, cross-client data replication, silent edits without review,
injection payloads, and external forwarding clauses.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Capture this thread into my fork

Download a single Markdown prompt that tells Claude how to port every commit above into your working tree — adapting paths and structure to match your repo. Run it via claude -p < capture-thread-305.md from inside the repo you want the changes in.

⬇ Download capture-thread-305.md