1sbang puts Mike's system prompt through a red team
An automated security tool fired hundreds of attack probes at Mike's baseline instructions and surfaced three gaps worth patching.
1sbang ran the chat assistant's underlying instructions through an automated red-teaming tool that threw 330 attack prompts and 100 benign legal-work prompts at it, then iterated until the model held up. The baseline had three soft spots: it would paraphrase its own instructions when asked, it treated requests for Social Security numbers and similar personal data as capability questions rather than refusals, and it deflected misuse like bulk document enumeration or copying material across client matters with 'I don't have that tool' rather than refusing on intent.
The fix in every case is to refuse based on what is being asked, not on whether the prerequisites happen to be present. Three new sections - covering confidentiality, privacy boundaries with an explicit carve-out for normal contract review, and limits on tool use - closed the gaps without raising the false-refusal rate on legitimate work. The pull request was closed without merging a week ago.
Spotted something wrong? Or know the PR text has fresher detail than the writeup above?