1sbang tries to harden Mike's system prompt - and gets the door shut in 90 seconds
An unsolicited security pass aimed at the instructions that govern Mike's behavior was closed almost as fast as it was opened.
1sbang ran Mike's underlying instructions through an automated red-team harness - 330 adversarial probes paired with 100 benign ones - and surfaced three soft spots: the assistant would paraphrase its own instructions when asked, it treated refusing personal-data extraction as a question of whether a document was uploaded rather than whether it should answer at all, and it brushed off misuse of its tools by pleading lack of capability instead of declining on intent.
The proposed fix added three guardrail sections covering confidentiality, personal-data boundaries, and tool-use limits, with a deliberate carve-out preserving ordinary legal work - business addresses, payment terms, identifying the parties to a contract. After four rounds of tuning, the reported attack-block rate climbed into the mid-nineties while legitimate legal requests were refused zero percent of the time. The pull request was closed without merging roughly ninety seconds after it opened, but the methodology and validation artifacts remain on the branch.
Spotted something wrong? Or know the PR text has fresher detail than the writeup above?