PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free

Original link: https://injecguard.github.io/

Prompt injection attacks are a major security risk for large language models (LLMs), potentially allowing attackers to take control of a model or steal data. While prompt guard models aim to prevent these attacks, they often exhibit "over-defense": incorrectly flagging harmless inputs as malicious because they contain common trigger words. The researchers introduce **NotInject**, a new dataset built specifically to measure this over-defense problem. Their evaluation of existing models shows that accuracy drops sharply, down to near-random levels, when the models are exposed to benign prompts containing these trigger words. To address this, they developed **PIGuard**, a new prompt guard model trained with a strategy called **Mitigating Over-defense for Free (MOF)**. PIGuard markedly reduces the bias toward trigger words and achieves state-of-the-art performance on benchmarks including NotInject, surpassing the current best model by more than 30%. PIGuard is also open source, offering a more reliable defense against prompt injection.


Original abstract

Prompt injection attacks pose a critical threat to large language models (LLMs), enabling goal hijacking and data leakage. Prompt guard models, though effective in defense, suffer from over-defense—falsely flagging benign inputs as malicious due to trigger word bias. To address this issue, we introduce NotInject, an evaluation dataset that systematically measures over-defense across various prompt guard models. NotInject contains 339 benign samples enriched with trigger words common in prompt injection attacks, enabling fine-grained evaluation. Our results show that state-of-the-art models suffer from over-defense issues, with accuracy dropping close to random guessing levels (60%). To mitigate this, we propose PIGuard, a novel prompt guard model that incorporates a new training strategy, Mitigating Over-defense for Free (MOF), which significantly reduces the bias on trigger words. PIGuard demonstrates state-of-the-art performance on diverse benchmarks including NotInject, surpassing the existing best model by 30.8%, offering a robust and open-source solution for detecting prompt injection attacks.
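The over-defense evaluation described above can be illustrated with a minimal sketch. The keyword-based detector below is a hypothetical stand-in for a real prompt guard model (it is not PIGuard or any model from the paper), and the three sample prompts merely mimic the style of NotInject's 339 benign, trigger-word-enriched samples:

```python
# Hypothetical trigger words commonly seen in prompt injection attacks.
TRIGGER_WORDS = {"ignore", "override", "system prompt", "instructions"}

def naive_guard(prompt: str) -> bool:
    """Flags a prompt as malicious if it contains any trigger word.
    This deliberately exhibits the trigger-word bias behind over-defense."""
    text = prompt.lower()
    return any(word in text for word in TRIGGER_WORDS)

# Benign prompts enriched with trigger words, in the spirit of NotInject
# (illustrative examples, not taken from the actual dataset).
benign_with_triggers = [
    "How do I override a method in Python?",
    "Can I ignore whitespace when diffing two files?",
    "Write assembly instructions for adding two registers.",
]

def benign_accuracy(guard, benign_prompts):
    """Accuracy on benign inputs: the fraction NOT flagged as malicious.
    An over-defensive guard scores poorly on this metric."""
    correct = sum(1 for p in benign_prompts if not guard(p))
    return correct / len(benign_prompts)

print(benign_accuracy(naive_guard, benign_with_triggers))  # 0.0 — every benign prompt is falsely flagged
```

A well-calibrated guard model would keep this benign accuracy high while still catching genuine injection attempts; the paper's MOF training strategy targets exactly this gap.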
