A Bidirectional LLM Firewall: Architecture, Failure Modes, and Evaluation Results

Over the past months I have been building and evaluating a stateful, bidirectional security layer that sits between clients and LLM APIs and enforces defense-in-depth on both input → LLM and LLM → output.

This is not a prompt-template guardrail system.
It’s a full middleware with deterministic layers, semantic components, caching, and a formal threat model.

I’m sharing details here because many teams seem to be facing similar issues (prompt injection, tool abuse, hallucination safety), and I would appreciate peer feedback from engineers who operate LLMs in production.

1. Architecture Overview

Inbound (Human → LLM)

  • Normalization Layer

    • NFKC/Homoglyph normalization

    • Recursive Base64/URL decoding (max depth = 3)

    • Controls for zero-width characters and bidi overrides

  • PatternGate (Regex Hardening)

    • 40+ deterministic detectors across 13 attack families

    • Used as the “first-hit layer” for known jailbreak primitives

  • VectorGuard + CUSUM Drift Detector

    • Embedding-based anomaly scoring

    • Sequential CUSUM to detect oscillating attacks

    • Protects against payload variants that bypass regex

  • Kids Policy / Context Classifier

    • Optional mode

    • Classifies fiction vs. real-world risk domains

    • Used to block high-risk contexts even when phrased innocently

Outbound (LLM → User)

  • Strict JSON Decoder

    • Rejects duplicate keys, unsafe structures, parser differentials

    • Required for safe tool-calling / autonomous agents

  • ToolGuard

    • Detects and blocks attempts to trigger harmful tool calls

    • Works via pattern + semantic analysis

  • Truth Preservation Layer

    • Lightweight fact-checker against a canonical knowledge base

    • Flags high-risk hallucinations (medicine, security, chemistry)

2. Decision Cache (Exact / Semantic / Hybrid)

A key performance component is a hierarchical decision cache:

  • Exact mode = hash-based lookup

  • Semantic mode = embedding similarity + risk tolerance

  • Hybrid mode = exact first, semantic fallback

In real workloads this cuts 40–80% of evaluation latency depending on prompt diversity.

3. Evaluation Results (Internal Suite)

I tested the firewall against a synthetic adversarial suite (BABEL, NEMESIS, ORPHEUS, CMD-INJ).
This suite covers ~50 structured jailbreak families.

Results:

  • 0 / 50 bypasses on the current build

  • ~20–25% false positive rate on the Kids Policy (work in progress)

  • P99 latency: < 200 ms per request

  • Memory footprint: ~1.3 GB (mostly due to embedding model)

Important note:
These results apply only to the internal suite.
They do not imply general robustness, and I’m looking for external red-teaming.

4. Failure Modes Identified

The most problematic real-world cases so far:

  • Unicode abuse beyond standard homoglyph sets

  • “Role delegation” attacks that look benign until tool-level execution

  • Fictional prompts that drift into real harmful operational space

  • LLM hallucinations that fabricate APIs, functions, or credentials

  • Semantic near-misses where regex detectors fail but semantics are ambiguous

These informed several redesigns (especially the outbound layers).

5. Open Questions (Where I’d Appreciate Feedback)

  1. Best practices for low-FPR context classifiers in safety-critical tasks

  2. Efficient ways to detect tool-abuse intent when the LLM generates partial code

  3. Open-source adversarial suites larger than my internal one

  4. Integration patterns with LangChain / vLLM / FastAPI that don’t add excessive overhead

  5. Your experience with caching trade-offs under high variability prompts

If you operate LLMs in production or have built guardrails beyond templates, I’d appreciate your perspectives.
Happy to share more details or design choices on request.

1 Like

I gathered some resources for now.

Wow - thank you - i’ll check your information package asap!

1 Like

Hello again,

I wanted to extend my sincere thanks for your incredibly detailed and actionable advice on LLM firewall architecture. Your guidance on moving beyond simple pattern matching toward a multi-layered, context-aware system has been invaluable.

We’ve directly applied several of your recommendations with measurable success:

  • Integrating Aho‑Corasick for efficient multi‑keyword matching in our SafetyValidator.
  • Replacing binary risk scores with a nuanced, weighted scoring system that aggregates signals across layers.
  • Using HarmBench’s categorized metrics to drive our prioritization, which revealed our current weak points.

As a result, our overall HarmBench ASR dropped to 18.0%, with copyright violations now at only 4.0% ASR.

We are now facing the next architectural decision—one where your system‑level perspective would be extremely helpful. Your original note recommended specialized detectors (e.g., for code‑intent or persuasive rhetoric) for “hard” cases like cybercrime/intrusion and misinformation.

Our key question is about the integration pattern for such detectors:
In a production firewall that must balance latency, maintainability, and safety, would you recommend implementing these specialized detectorss as internal layers within the core firewall engine, or as separate,asynchronously‑called microservices?

We are especially concerned about:

  1. Latency impact of model inference (e.g., a CodeBERT‑style classifier) on the synchronous request path.
  2. Lifecycle & versioning—how to update a dedicated detector without redeploying the entire firewall.
  3. Failure isolation—ensuring that a failing detector doesn’t break the entire safety pipeline.

Any high‑level guidance you could share on this architectural choice would help us invest our engineering effort in the right direction.

Thank you again for your time and for sharing your expertise. It has already made a substantial difference in my projject.

Great. I gathered some additional information. Hope it helps…

1 Like

Hello again,

Following up on our previous discussion about integrating specialized detectors: We proceeded by embedding a custom convolutional neural network (CNN) for code-intent classification directly within the firewall process as a co-located library, avoiding the initial overhead of microservices.

Current Status: The detector operates in production shadow mode alongside the primary rule engine. After iterative adversarial training (focused on obfuscation and context-wrapping) and threshold optimization (θ=0.6), its performance on our defined evaluation suite shows:

  • 0% False Negative Rate for critical code/SQL injection payloads.

  • ~3% False Positive Rate on a security-focused benign subset.

  • <30ms added latency for inline inference.
    The rule engine remains the final decision-maker, ensuring operational stability.

This internal hybrid pattern validated the core concept for our first detector. We are now planning to scale the architecture to incorporate additional specialized detectors (e.g., for persuasion, misinformation).

Based on your experience evolving such a system:

  1. Orchestration Pattern: For a multi-detector system, did you find a hierarchical router (dispatching to specific detectors) or a sequential pipeline (where all relevant detectors evaluate the prompt) to be more maintainable and performant in production?

  2. Continual Learning: For detectors that must adapt to new tactics, what has been a reliable operational pattern to retrain and safely deploy updated models without causing service disruption or regression in core safety metrics?

Your insights on scaling this architecture would be invaluable.

1 Like

Follow-up Questions - Multi-Detector Architecture

Hello again,

Following up on our previous discussion about integrating specialized detectors: We proceeded by embedding a custom convolutional neural network (CNN) for code-intent classification directly within the firewall process as a co-located library, avoiding the initial overhead of microservices.

Current Status: The detector operates in production shadow mode alongside the primary rule engine. After iterative adversarial training (focused on obfuscation and context-wrapping) and threshold optimization (θ=0.6), its performance on our defined evaluation suite shows:

- 0% False Positive Rate (0/1000 benign samples across 9 categories)

- 95.7% Attack Detection Rate (557/582 adversarial samples)

  • Mathematical notation camouflage: 100% blocked (300/300)

  • Multilingual code-switching: 91.1% blocked (257/282, 25 bypasses)

- <30ms added latency for inline inference

The rule engine remains the final decision-maker, ensuring operational stability.

This internal hybrid pattern validated the core concept for our first detector. We are now planning to scale the architecture to incorporate additional specialized detectors (e.g., for persuasion, misinformation).

Based on your experience evolving such a system:

**Orchestration Pattern:** For a multi-detector system, did you find a hierarchical router (dispatching to specific detectors) or a sequential pipeline (where all relevant detectors evaluate the prompt) to be more maintainable and performant in production?

**Continual Learning:** For detectors that must adapt to new tactics, what has been a reliable operational pattern to retrain and safely deploy updated models without causing service disruption or regression in core safety metrics?

**Critical Follow-up Questions:**

**1. Shadow Mode to Production Transition:**

We’re currently operating in shadow mode with the rule engine as fallback. What has been your experience transitioning detectors from shadow mode to active production? Are there specific metrics thresholds (e.g., FPR <1%, FNR <5%) or validation periods (e.g., 2-4 weeks) that you found reliable before making the switch? How do you handle the transition without disrupting existing safety guarantees?

**2. Handling Known Bypasses:**

We have 25 multilingual attacks bypassing detection (8.9% of multilingual test suite) due to code embedded in string literals/comments that get filtered by preprocessing. Should we address these before production deployment, or is it acceptable to deploy with known limitations if they’re well-documented and monitored? What’s your threshold for “acceptable risk” when deploying security systems?

**3. Production FPR/FNR Monitoring:**

What monitoring infrastructure have you found most effective for tracking FPR/FNR in production? Do you use automated sampling, manual review queues, or a combination? How do you distinguish between legitimate false positives (user complaints) and actual system degradation? Any tools or frameworks you’d recommend?

**4. Sequential Pipeline at Scale:**

If we start with a sequential pipeline for 2-3 detectors, at what point does latency become a bottleneck? Have you found a practical limit (e.g., 3-4 detectors, 100ms total) before needing to switch to a router pattern? What were the key indicators that triggered your transition?

**5. Retraining Workflow:**

For establishing a retraining workflow, what’s your recommended validation process? We’re thinking: automated test suite (1,000+ samples), shadow mode deployment, regression testing (FPR/FNR thresholds), then gradual rollout. Is this reasonable, or are there critical steps we’re missing? How do you handle model versioning and rollback?

**6. Real-World Validation:**

Our test corpus is programmatically generated. How critical is it to validate with real-world production queries before scaling? Should we deploy the first detector to production first to collect real data, or can we proceed with synthetic test suites for additional detectors?

**7. Co-location Limits:**

With our current co-location approach adding <30ms per detector, how many detectors have you successfully co-located before hitting memory or latency constraints? At what point did you need to consider microservices or other architectural changes?

Your insights on these practical scaling challenges would be invaluable as we move toward a multi-detector system. TY :slight_smile:

1 Like

I generated the continuation.

1 Like

STATUS : Hybrid system with parallel execution of Code-Intent CNN (100% accuracy) and Content-Safety Transformer (100% accuracy). Rule engine final decision layer. Overall attack detection: 100% on core test set (101/101). False positive rate: 0% (0/1000 benign samples). Latency: <35ms for two parallel detectors.

Fixed 25 multilingual bypasses via preprocessing improvements. Identified new attack vector: poetic/metaphorical attacks (current detection: 83%, 20/24). Online learning active with 92 feedback samples. Conservative OR-logic: one detector blocks = overall block.

Next: shadow mode validation, router implementation for third detector, poetic attack mitigation via metaphor detection patterns. Thank you for your valueable help!!! :slight_smile:

1 Like

Current Status

The system has successfully migrated to a hexagonal service architecture. Three independent detector services (Code Intent, Persuasion, Content Safety) are operational, each following a consistent pattern with pure domain logic and standardized APIs. Core performance metrics are established: 100% attack detection on the primary test set (101/101) with a 3.6% false positive rate on 1,000 benign samples. Pipeline latency remains under 50ms per service. A feedback loop for online learning is active.

Planned Implementation

The next development phase will focus on implementing the intelligent orchestration layer. This includes building a hierarchical router to dynamically select detectors based on risk, context, and latency budget, thereby moving away from a fixed sequential pipeline. We will also formalize the full MLOps lifecycle with automated regression testing, shadow/canary deployment protocols, and systematic retraining triggers. Ongoing work includes improving detection of poetic/metaphorical attack vectors and establishing production monitoring for continuous validation of FPR/FNR.

1 Like

Implementation of the intelligent orchestration layer is complete, including real-time text complexity analysis and dynamic policy evaluation. A hybrid repository (Redis and PostgreSQL) has been added for feedback storage, along with an automatic policy optimization system. Monitoring now includes distributed tracing and metric collection.

Identified technical debt: use of eval() without sandboxing in policy conditions, absence of circuit breakers for detector dependencies, synchronous learning optimization, and lack of database failover.

Next development items: replace eval() with AST-based parser, implement exponential backoff for detector failures, move learning optimization to background worker, and add database connection pooling and replication.

1 Like

The security subsystem has been fully optimized and validated. All 21 tested attack vectors are now blocked (100% detection rate) while maintaining a 3.6% false positive rate on benign samples. The enhancements include 8 LDAP injection patterns, increased Unicode attack detection weights, and size-based attack recognition for inputs exceeding 10,000 characters. The multi-layer detection pipeline now processes 45+ security patterns with under 1ms performance overhead. The system employs dynamic threshold adjustments based on context (source tool, user risk tier) and integrates structural analysis for anomaly detection. All components are ready with complete monitoring, tracing, and alerting capabilities.

THANK YOU FOR YOUR PRAECIOUS HELP!

1 Like
  • How do we architect a Zero Trust framework where continuous authentication replaces perimeter-based security?

  • Can we design detectors resilient to adversarial machine learning attacks that deliberately evade our pattern recognition?

  • What models are needed to quantify and enforce data privacy in training pipelines against model inversion or membership inference attacks?

  • How can we shift detection left to identify and block malicious prompts at the point of AI model interaction, not just in output?

  • What methodologies prove the robustness of AI-generated code against embedded backdoors or logic bombs?

  • How do we build a unified security data lake that correlates events across API, identity, and AI model layers for causal attack analysis?

  • What is the operational framework for implementing and testing post-quantum cryptographic algorithms in live AI security systems?

  • Can we develop formal verification techniques to mathematically prove the safety properties of autonomous security agents?

  • How do we create detection for AI-powered disinformation campaigns that manipulate model outputs through prompt poisoning or data drift?

  • What systems detect and manage “shadow AI”—unauthorized models or data pipelines operating outside governance?

Hmm… Like this?

Status Update: The adversarial robustness pipeline has been fully implemented and validated. The integrated adversarial detection layer achieves 100% detection rate on known attack patterns with 0% false positives in operational testing, adding under 5ms latency. A complete adversarial training pipeline was built and used to produce an enhanced model (V1), which reduces the false positive rate by 32.98% while maintaining a 94.51% detection rate on novel threats. This model is now backed by a validated monitoring and rollback framework for safe deployment. The system’s multi-layer defense now combines static pattern matching, structural anomaly analysis, and adaptive adversarial detection.

1 Like

Status Update

Architecture intent

  • Defense-in-depth security gateway for LLM applications, implemented as a small set of independently depployable microservices.

  • Each service follows a hexagonal architecture (domain core + application layer + adapters) to keep detection logic testableand replaceable.

  • The Orchestrator is the single ingress point. It performs routing, evidence fusion, and emits an end-to-end decision_trace for auditability.

Current runtime topology

  • Client → Orchestrator (8001)

    • Code-Intent (8000)

    • Persuasion (8002)

    • Content-Safety (8003)

Service contract (current)

  • Each detector returns: decision (allow/block), scores, thresholds, stable reason_codes, and trace fields (ruleset/model versions, matched features) suitable for offline evaluation.

Orchestration semantics (current)

  • Conservative fusion: final decision is OR(block) across detectors (with early-exit on hard-block conditions).

  • Required routing: forr selected harm cattegories (incl. offensive_cyber / fraud / weapons / drugs), relevant detectors must run (no bypass path).

  • Metrics emitted include overall block_rate, subset_coverage (routing coverage), and block_rate_within_subset (effectiveness conditional on routing).

Implemented

  • Three specialized detector services deployed behind the Orchestrator with consistent tracing.

  • Orchestrator aggregation audit completed: decision_trace present; required-detector logic extended; routing/effectiveness metrics computed.

  • Integration tests: 14/15 passing in the last recorded run (one remaining failing case under investigation).

Measured behavior (dataset-bound)

  • Results vary by evaluation slice. On a dedicated internal attack set, the system previously achieved 101/101 blocks (details and dataset definition available on request).

  • On a mixed routing/audit slice, observed block rates differed across detectors; Code-Intent showed low routing coverage (~20%) but moderate effectiveness when routed (~50% within-subset). These reflect different denominators and should not be compared directly without the dataset definitions.

Known gaps / partial

  • Cross-service feedback loop exists in code, but storage/retrieval integration is incomplete (Orchestrator cannot reliably consume false-negative feedback from Code-Intent yet).

  • Threshold/config governance is not yet fully consolidated into a single versioned source of truth.

Planned / in progress

  • Complete feedback plumbing (canonical feedback event schema + reliable cross-service access).

  • Consolidate thresholds and fusion rules into a versioned policy layer.

  • Strengthen evaluation: reproducible experiment configs, routing ablations, calibration under shift, latency/FPR/FNR reporting.

  • Improve routing coverage for Code-Intent without disproportionate FP increase.

  • Continue hardening against unicode/encoding obfuscation and tool/action gating in agentic settings.

1 Like

Add.:

Reconciling Metrics Across Suites
To avoid confusion: internal results (e.g., 101/101 blocks, 0/1000 FPs) are dataset-bound observations on curated suites and benign controls. Running the full HarmBench (400 behaviors) yields ~24% observed TPR; false negatives are dominated by risk_score = 0.0 (missing coverage) and, in some categories, insufficient mandatory routing to Persuasion. We therefore instrumented decision traces and separated routing coverage from conditional effectiveness (subset_coverage vs. block_rate_within_subset). Next: expand pattern families, quantify FPR on matched benign sets, and add semantic similarity to improve obfuscation resistance.

1 Like

Suggestions only version:


  • Add a first-class coverage signal in decision_trace (routed|forced|required_missing|skipped) and emit stable reason_codes for routing gaps (example: ROUTING_GAP_REQUIRED_CATEGORY). Treat risk_score = 0.0 as “no coverage,” not “low risk.” (OWASP)
  • Make required routing mechanically un-bypassable via a versioned policy rule set and enforce it at the Orchestrator (PEP-style). Add invariant tests: “if category in {offensive_cyber,fraud,weapons,drugs} then detectors X/Y must run.” (NIST)
  • Report results as coverage × conditional effectiveness: keep subset_coverage and block_rate_within_subset as primary metrics, per harm category, per detector, per surface. (ACL Anthology)
  • Expand standardized eval beyond internal suites using: HarmBench (400 behaviors), JailbreakBench (100 behaviors + artifact repo), plus AgentDojo and InjecAgent for tool-integrated / indirect prompt injection coverage. (ACL Anthology)
  • Add an explicit “insufficient screening” internal state (even if API stays allow/block) so the Orchestrator can escalate to more detectors or safer modes when coverage is incomplete. (OWASP)
  • Implement circuit breakers + retry budgets for detector fan-out. Make breaker state feed routing decisions. Track “remaining retry resources” and suppress retry storms. (Envoy Proxy)
  • Treat prompt injection as persistent residual risk and design to limit blast radius: least-privilege tool capabilities, strict tool schemas, hostile-by-default handling of RAG/tool outputs. (NCSC)
  • Prioritize OWASP LLM01 and LLM02 controls at tool boundaries: prompt injection and insecure output handling should be explicit gates for tool input, tool output, and retrieved context. (OWASP)
  • Add Unicode/encoding hard gates for code and tool args: block bidi controls and confusables, and run mixed-script/confusable detection using Unicode security mechanisms. (arXiv)
  • Standardize adversarial robustness slices and terminology using NIST AI 100-2e2025 so “known patterns” vs “novel threats” maps to a recognized taxonomy. (NIST)
  • Replace “0% false positives” with “0 observed over N benign samples,” and publish a conservative 95% upper bound (≈ 3/N) when reporting externally. (JVS Medics Corner)
  • Finish the feedback plumbing so false-negative and false-positive feedback is reliably consumable by the Orchestrator and tied to policy versions for safe rollback decisions. (NIST)
DONE

First-class coverage signal and routing-gap reason codes; `risk_score = 0.0` treated as “no coverage”.
`CoverageState` (`routed|forced|required_missing|skipped`) is present in `decision_trace`, stable routing-gap reason codes are emitted, and `risk_score = 0.0` is interpreted as a coverage gap for triage.

Required routing mechanically un-bypassable (PEP-style) plus invariant tests.
A single source of truth (`REQUIRED_ROUTING`) is enforced at the Orchestrator, with invariant tests ensuring required detectors run and fail-closed behavior on required-detector failure.

Coverage × conditional effectiveness metrics.
`subset_coverage` and `block_rate_within_subset` are treated as primary metrics and are used in CI gates and reporting.

Conservative false-positive reporting (“0 observed over N” plus upper bounds).
Wilson bounds / “rule of 3” style conservative reporting is implemented and used for gating.

Feedback plumbing (FP/FN), policy-version binding, and rollback evidence.
A canonical `FeedbackEvent` schema, ingestion and retrieval APIs, deduplication, deterministic minimization, policy-hash binding, regression reports, retention policy, secret redaction, and a policy-iteration loop (offline replay harness plus baseline-update validator) are implemented.

PARTIAL

Standardized evaluation beyond internal suites (HarmBench, JailbreakBench, AgentDojo, InjecAgent).
HarmBench (400 behaviors) is integrated. End-to-end execution and artifacts for JailbreakBench, AgentDojo, and InjecAgent are not verified; status remains “framework-ready” without run artifacts.

Explicit “insufficient screening” internal state with escalation to more detectors or safer modes.
Coverage signaling exists (`required_missing`, `risk_score = 0.0` clusters) and required categories are fail-closed. A general escalation policy (additional detectors or safer modes beyond required routing/fail-closed) is not proven.

Circuit breakers and retry budgets; breaker state feeding routing decisions; suppression of retry storms.
Circuit breakers, exponential backoff, and retry budgets exist, and required-detector failure triggers fail-closed behavior. Systematic use of breaker state as an adaptive routing feature for non-required categories is not verified.

Prompt injection as persistent residual risk; blast-radius limits (least-privilege tools, hostile-by-default handling of RAG/tool outputs).
Strict JSON/tool gating and explicit prompt-injection gates exist. Hostile-by-default treatment of retrieved context and complete blast-radius limitation for RAG/tool outputs are not demonstrated as complete.

OWASP LLM01/LLM02 controls at tool boundaries including retrieved context.
Tool input/output controls exist. Retrieved-context gating is not shown as consistently implemented.

Unicode/encoding hard gates (bidi controls, confusables, mixed-script) for code and tool arguments.
Unicode security gates are implemented. A dedicated benchmark separating unicode-attack TPR from multilingual benign FPR is still required; measurement clarity remains partial.

NIST AI 100-2e2025 taxonomy mapping (“known patterns” vs “novel threats” terminology standardization).
A stable taxonomy (6 root categories) with hashing and CI drift gating exists. A formal mapping of slices/terminology to NIST AI 100-2e2025 is not established.

NOT DONE (based on verified evidence)
External suite executions and artifacts for JailbreakBench, AgentDojo, and InjecAgent; and a formal NIST AI 100-2e2025 mapping.
  thank you for your help - your answers always give me a nice boost :-))))
1 Like

Great.


Suggestions based on your “PARTIAL” and “NOT DONE” list.

P0: Turn “framework-ready” into “verified with artifacts”

  • JailbreakBench execution + artifacts

    • Run the reference framework end to end, pin commits, pin model versions, and save a reproducible run manifest plus raw outputs.
    • Produce artifacts: run_manifest.json (suite version, git SHAs, config hash, dataset hash), per-sample decisions, and a summary report.
    • Target dataset: JBB-Behaviors is 100 behaviors, and the project provides a standardized evaluation framework plus artifact repository. (GitHub)
  • AgentDojo execution + artifacts

    • Run at least one fixed “golden” profile (known agent + toolset + attack set) with frozen environment images.
    • Save full transcripts and tool I/O traces since AgentDojo is a dynamic tool-using environment. (arXiv)
  • InjecAgent execution + artifacts

    • Run the benchmark as specified and store per-test-case evidence (tool calls, injected content, final agent action).
    • InjecAgent reports 1,054 test cases across multiple tools and attacker intents, so evidence storage matters. (arXiv)

P0: Prove a general “insufficient screening” escalation policy

  • Add an internal verdict like NEEDS_MORE_SCREENING that triggers one of:

    • run additional detectors
    • downgrade to safer response mode
    • allow text response but hard-block tool calls
  • This is the missing generalization beyond “required routing fail-closed.” OWASP LLM01 and LLM02 support this design because prompt injection and unsafe output handling become worst when downstream actions proceed under uncertainty. (OWASP)

P1: Make breaker state an explicit routing feature for non-required categories

  • For non-required categories, use breaker state to choose: alternate detector set, cheaper fallbacks, or safe degradation.
  • Add verification: integration tests that simulate OPEN/HALF_OPEN states and assert router behavior.
  • Base behavior on documented retry-budget and circuit-breaking guidance to avoid retry storms and cascades. (envoyproxy.io)

P1: Close the “hostile-by-default retrieved context” gap with a RAG context firewall

  • Implement consistent rag_context gating with these invariants:

    • retrieved context never gains tool privileges
    • injection-like instructions in retrieved text are quarantined or stripped
    • tool outputs and retrieved chunks are scanned before reinsertion
  • This matches the NCSC framing that LLMs do not enforce a boundary between instructions and data, so you must enforce that boundary in the system. (NCSC)

P1: Make “retrieved-context gating” measurable, not just implemented

  • Add slice metrics:

    • rag_context_gate_rate
    • rag_injection_detect_rate
    • downstream tool-call suppression rate attributable to RAG gating
  • Add a targeted benchmark slice using indirect prompt injection scenarios (InjecAgent-style). (ACLENSBL)

P2: Add a dedicated Unicode benchmark (TPR vs multilingual benign FPR)

  • Build two paired sets:

    • Unicode attack set: bidi controls, invisibles, homoglyphs, mixed-script confusables
    • Multilingual benign set: natural Japanese, mixed-language code, identifiers, normal punctuation
  • Report separately: unicode-attack TPR and multilingual benign FPR.

  • Ground it in Unicode security guidance (TR39) and Trojan Source bidi attack mechanics. (Unicode)

P2: Formal mapping to NIST AI 100-2e2025 (terminology + slice labeling)

  • Create a mapping document that ties:

    • your 6 root categories
    • attack lifecycle stage (evasion, poisoning, etc.)
    • attacker goals and capabilities
  • Use NIST AI 100-2e2025 as the canonical taxonomy reference, and hash the mapping as part of CI gates. (csrc.nist.gov)