
Toxicity Detector

The toxicity_detector guardrail filters prompts containing toxic, harmful, or inappropriate content.

from pydantic_ai_guardrails import GuardedAgent
from pydantic_ai_guardrails.guardrails.input import toxicity_detector

# `agent` is the agent you want to guard, defined elsewhere
guarded_agent = GuardedAgent(
    agent,
    input_guardrails=[
        toxicity_detector(),
    ],
)

Parameters:

Parameter    Type               Default          Description
categories   list[str] | None   All categories   Categories to detect
threshold    float              0.5              Detection threshold (0.0-1.0)

Categories:

Category     Description
hate         Hate speech and discrimination
violence     Violent content or threats
sexual       Sexual or adult content
harassment   Bullying or harassment
self_harm    Self-harm or suicide content
profanity    Profane language

# Default: all categories, threshold 0.5
guardrail = toxicity_detector()

# Only specific categories
guardrail = toxicity_detector(
    categories=['hate', 'violence', 'harassment'],
)

# More strict
guardrail = toxicity_detector(threshold=0.3)

# More lenient
guardrail = toxicity_detector(threshold=0.7)
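
To make the effect of `threshold` and `categories` concrete, the sketch below assumes the detector produces a score between 0.0 and 1.0 for each category and flags a category when its score meets or exceeds the threshold. The `flagged_categories` helper, the comparison rule, and the example scores are all hypothetical and written for illustration only; they are not the library's implementation.

```python
from typing import Optional

# The documented category names.
ALL_CATEGORIES = ['hate', 'violence', 'sexual', 'harassment', 'self_harm', 'profanity']


def flagged_categories(
    scores: dict[str, float],
    categories: Optional[list[str]] = None,
    threshold: float = 0.5,
) -> list[str]:
    """Return the selected categories whose score meets or exceeds the threshold.

    Hypothetical helper: assumes per-category scores in [0.0, 1.0].
    """
    selected = ALL_CATEGORIES if categories is None else categories
    return [c for c in selected if scores.get(c, 0.0) >= threshold]


# Made-up scores for a single prompt.
scores = {'hate': 0.1, 'violence': 0.4, 'profanity': 0.6}

strict = flagged_categories(scores, threshold=0.3)    # lower threshold flags more
default = flagged_categories(scores)                  # threshold 0.5
lenient = flagged_categories(scores, threshold=0.7)   # higher threshold flags less
only_hate = flagged_categories(scores, categories=['hate'], threshold=0.3)
```

Under these assumptions, lowering the threshold widens what is flagged, while restricting `categories` ignores scores outside the selected set regardless of how high they are.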

When triggered, returns:

{
    'tripwire_triggered': True,
    'message': 'Toxic content detected: hate, violence',
    'severity': 'high',
    'metadata': {
        'categories_detected': ['hate', 'violence'],
        'threshold': 0.5,
    },
}
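
Application code often needs to turn a result like this into something it can log. The helper below is a minimal sketch that operates on a plain dict with the keys shown above; `summarize_result` is a hypothetical name, and how the result reaches your code (return value, exception attribute, or otherwise) is not assumed here.

```python
from typing import Any


def summarize_result(result: dict[str, Any]) -> str:
    """Build a one-line log message from a guardrail result dict."""
    if not result.get('tripwire_triggered', False):
        return 'ok'
    metadata = result.get('metadata', {})
    # Fall back to 'unknown' when no categories are reported.
    categories = ', '.join(metadata.get('categories_detected', [])) or 'unknown'
    severity = result.get('severity', 'unknown')
    threshold = metadata.get('threshold')
    return f'blocked severity={severity} categories=[{categories}] threshold={threshold}'


example = {
    'tripwire_triggered': True,
    'message': 'Toxic content detected: hate, violence',
    'severity': 'high',
    'metadata': {
        'categories_detected': ['hate', 'violence'],
        'threshold': 0.5,
    },
}

log_line = summarize_result(example)
passed_line = summarize_result({'tripwire_triggered': False})
```

Reading keys with `.get` keeps the helper tolerant of results that omit `severity` or `metadata`.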

Typical use cases:

  • Content moderation: Filter inappropriate user inputs
  • Brand safety: Protect against offensive content
  • Compliance: Meet platform content policies
  • User safety: Prevent harmful interactions

For production applications, consider ML-based detection:

# Using llm-guard (recommended for production)
from llm_guard.input_scanners import Toxicity
# Using autoevals
from autoevals.llm import Moderation

See Integrations for setup instructions.
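
If you swap in an ML-based scanner, one option is a thin adapter that maps its output onto the result shape shown earlier. The sketch below assumes a scanner object whose `scan(prompt)` method returns a `(sanitized_prompt, is_valid, risk_score)` tuple, which is the interface llm-guard documents for its input scanners. `PromptScanner`, `to_guardrail_result`, and `StubScanner` are hypothetical names; the stub stands in for a real model so the example runs without extra dependencies, and the `'severity'` cutoff of 0.8 is an arbitrary choice. Wiring such a function into `GuardedAgent` is not shown here.

```python
from typing import Any, Protocol


class PromptScanner(Protocol):
    """Anything with a scan() method returning (sanitized, is_valid, risk_score)."""

    def scan(self, prompt: str) -> tuple[str, bool, float]: ...


def to_guardrail_result(scanner: PromptScanner, prompt: str) -> dict[str, Any]:
    """Map a scanner verdict onto a guardrail-style result dict (assumed shape)."""
    _sanitized, is_valid, risk_score = scanner.scan(prompt)
    if is_valid:
        return {'tripwire_triggered': False}
    return {
        'tripwire_triggered': True,
        'message': 'Toxic content detected',
        # Arbitrary cutoff chosen for illustration.
        'severity': 'high' if risk_score >= 0.8 else 'medium',
        'metadata': {'risk_score': risk_score},
    }


class StubScanner:
    """Stand-in for a real scanner: flags prompts containing a marker word."""

    def scan(self, prompt: str) -> tuple[str, bool, float]:
        toxic = 'idiot' in prompt.lower()
        return prompt, not toxic, 1.0 if toxic else 0.0


clean = to_guardrail_result(StubScanner(), 'Hello there')
flagged = to_guardrail_result(StubScanner(), 'You idiot')
```

With a real scanner, `StubScanner()` would be replaced by an instance such as `Toxicity()` from `llm_guard.input_scanners`, which loads a model on first use.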