PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free

Original link: https://injecguard.github.io/

Prompt injection attacks are a major security risk for large language models (LLMs), potentially allowing attackers to take control of a model or steal data. While prompt guard models aim to prevent these attacks, they often exhibit "over-defense": incorrectly flagging harmless inputs as malicious because they contain common trigger words. The researchers introduce **NotInject**, a new dataset built specifically to measure this over-defense problem. Their evaluation of existing models shows that accuracy drops sharply, down to near-random levels, when the models are exposed to benign prompts containing these trigger words. To address this, they developed **PIGuard**, a new prompt guard model trained with a strategy called **Mitigating Over-defense for Free (MOF)**. PIGuard markedly reduces the bias toward trigger words and achieves state-of-the-art performance on benchmarks including NotInject, surpassing the current best model by more than 30%. PIGuard is also open source, offering a more reliable defense against prompt injection.


Original abstract

Prompt injection attacks pose a critical threat to large language models (LLMs), enabling goal hijacking and data leakage. Prompt guard models, though effective in defense, suffer from over-defense—falsely flagging benign inputs as malicious due to trigger word bias. To address this issue, we introduce NotInject, an evaluation dataset that systematically measures over-defense across various prompt guard models. NotInject contains 339 benign samples enriched with trigger words common in prompt injection attacks, enabling fine-grained evaluation. Our results show that state-of-the-art models suffer from over-defense issues, with accuracy dropping close to random guessing levels (60%). To mitigate this, we propose PIGuard, a novel prompt guard model that incorporates a new training strategy, Mitigating Over-defense for Free (MOF), which significantly reduces the bias on trigger words. PIGuard demonstrates state-of-the-art performance on diverse benchmarks including NotInject, surpassing the existing best model by 30.8%, offering a robust and open-source solution for detecting prompt injection attacks.
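The over-defense evaluation described above can be illustrated with a minimal sketch. The keyword-based detector below is a hypothetical stand-in for a real prompt guard model (it is not PIGuard or any model from the paper), and the three sample prompts merely mimic the style of NotInject's 339 benign, trigger-word-enriched samples:

```python
# Hypothetical trigger words commonly seen in prompt injection attacks.
TRIGGER_WORDS = {"ignore", "override", "system prompt", "instructions"}

def naive_guard(prompt: str) -> bool:
    """Flags a prompt as malicious if it contains any trigger word.
    This deliberately exhibits the trigger-word bias behind over-defense."""
    text = prompt.lower()
    return any(word in text for word in TRIGGER_WORDS)

# Benign prompts enriched with trigger words, in the spirit of NotInject
# (illustrative examples, not taken from the actual dataset).
benign_with_triggers = [
    "How do I override a method in Python?",
    "Can I ignore whitespace when diffing two files?",
    "Write assembly instructions for adding two registers.",
]

def benign_accuracy(guard, benign_prompts):
    """Accuracy on benign inputs: the fraction NOT flagged as malicious.
    An over-defensive guard scores poorly on this metric."""
    correct = sum(1 for p in benign_prompts if not guard(p))
    return correct / len(benign_prompts)

print(benign_accuracy(naive_guard, benign_with_triggers))  # 0.0 — every benign prompt is falsely flagged
```

A well-calibrated guard model would keep this benign accuracy high while still catching genuine injection attempts; the paper's MOF training strategy targets exactly this gap.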
