Toxicity Detector
The toxicity_detector guardrail filters prompts containing toxic, harmful, or inappropriate content.
Import

    from pydantic_ai_guardrails.guardrails.input import toxicity_detector

Basic Usage

    from pydantic_ai_guardrails import GuardedAgent
    from pydantic_ai_guardrails.guardrails.input import toxicity_detector

    # `agent` is an existing agent instance created elsewhere
    guarded_agent = GuardedAgent(
        agent,
        input_guardrails=[
            toxicity_detector(),
        ],
    )

Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| categories | list[str] \| None | All categories | Categories to detect |
| threshold | float | 0.5 | Detection threshold (0.0-1.0) |
Categories
| Category | Description |
|---|---|
| hate | Hate speech and discrimination |
| violence | Violent content or threats |
| sexual | Sexual or adult content |
| harassment | Bullying or harassment |
| self_harm | Self-harm or suicide content |
| profanity | Profane language |
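
To illustrate how the `categories` and `threshold` parameters might interact, here is a minimal sketch. It assumes the detector produces a score between 0.0 and 1.0 per category and flags a category when its score meets the threshold; the helper `flagged_categories` and the scores are hypothetical and are not part of the library.

```python
# Illustrative only: the scores and the comparison rule are assumptions,
# not the library's actual implementation.
ALL_CATEGORIES = ["hate", "violence", "sexual", "harassment", "self_harm", "profanity"]


def flagged_categories(scores, categories=None, threshold=0.5):
    """Return the selected categories whose score meets the threshold."""
    # None means "check every category", mirroring the documented default.
    selected = ALL_CATEGORIES if categories is None else categories
    return [c for c in selected if scores.get(c, 0.0) >= threshold]


# Hypothetical per-category scores for a single prompt.
scores = {"hate": 0.62, "violence": 0.41, "profanity": 0.35}

default_hits = flagged_categories(scores)
strict_hits = flagged_categories(scores, threshold=0.3)
narrow_hits = flagged_categories(scores, categories=["violence"], threshold=0.3)
print(default_hits, strict_hits, narrow_hits)
```

Under these assumptions, lowering the threshold flags more categories, and restricting `categories` ignores scores for everything outside the list.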
Examples

Detect All Categories

    guardrail = toxicity_detector()

Specific Categories

    guardrail = toxicity_detector(
        categories=['hate', 'violence', 'harassment'],
    )

Adjust Threshold

    # More strict
    guardrail = toxicity_detector(threshold=0.3)

    # More lenient
    guardrail = toxicity_detector(threshold=0.7)

Violation Result
When triggered, returns:

    {
        'tripwire_triggered': True,
        'message': 'Toxic content detected: hate, violence',
        'severity': 'high',
        'metadata': {
            'categories_detected': ['hate', 'violence'],
            'threshold': 0.5,
        },
    }

Use Cases
- Content moderation: Filter inappropriate user inputs
- Brand safety: Protect against offensive content
- Compliance: Meet platform content policies
- User safety: Prevent harmful interactions
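
For a content moderation workflow, the violation result could be turned into a short log line. The sketch below assumes the result is available as a plain dictionary with the keys shown in the Violation Result section; how the result is obtained from the library is not covered here, and `describe_violation` is a hypothetical helper.

```python
def describe_violation(result):
    """Build a short log line from a violation-style result dictionary."""
    if not result.get("tripwire_triggered"):
        return "ok"
    meta = result.get("metadata", {})
    cats = ", ".join(meta.get("categories_detected", []))
    severity = result.get("severity", "unknown")
    return f"blocked [{severity}] categories={cats} threshold={meta.get('threshold')}"


# Example input copied from the documented violation result shape.
example = {
    "tripwire_triggered": True,
    "message": "Toxic content detected: hate, violence",
    "severity": "high",
    "metadata": {
        "categories_detected": ["hate", "violence"],
        "threshold": 0.5,
    },
}

line = describe_violation(example)
print(line)
```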
Advanced: ML-Based Detection
For production applications, consider ML-based detection:

    # Using llm-guard (recommended for production)
    from llm_guard.input_scanners import Toxicity

    # Using autoevals
    from autoevals.llm import Moderation

See Integrations for setup instructions.
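
As a rough sketch of what wrapping an ML-based scanner could look like, the adapter below accepts any object with an llm-guard style `scan(prompt)` method that returns a `(sanitized_prompt, is_valid, risk_score)` tuple. The function `check_with_scanner`, the `FakeScanner` stand-in, and the returned dictionary shape are assumptions made for illustration; the supported way to connect a scanner to the guardrails is described in Integrations.

```python
def check_with_scanner(scanner, prompt):
    """Run an llm-guard style scanner and return a violation-style dict."""
    _sanitized, is_valid, risk_score = scanner.scan(prompt)
    return {
        "tripwire_triggered": not is_valid,
        "message": "" if is_valid else "Toxic content detected",
        "metadata": {"risk_score": risk_score},
    }


class FakeScanner:
    """Offline stand-in for a real scanner so the sketch runs without models."""

    def scan(self, prompt):
        # Toy rule for demonstration only; a real scanner uses a classifier.
        toxic = "idiot" in prompt.lower()
        return prompt, not toxic, (0.9 if toxic else 0.0)


# With llm-guard installed, a real scanner could be swapped in instead:
#   from llm_guard.input_scanners import Toxicity
#   scanner = Toxicity(threshold=0.5)
scanner = FakeScanner()

clean = check_with_scanner(scanner, "Hello there")
flagged = check_with_scanner(scanner, "You idiot")
print(clean["tripwire_triggered"], flagged["tripwire_triggered"])
```

Passing the scanner in as an argument keeps the adapter testable without downloading a model.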