
Architectural Evaluation

Why Large Language Models Resist Security By Design

Rory Ganness

May 28, 2026


TLDR

The security assumption behind every LLM deployment is that risk is manageable with the right controls. This paper argues that assumption is wrong, and that the wrongness is encoded in the mathematics that make these systems work. The Transformer architecture has no mechanism for distinguishing a trusted instruction from a malicious one. The training process optimizes against the very verification steps that security depends on. The safety layers added afterward cannot override what was built in from the start. The proofs are mathematical. The implications are operational.

And Why The Problem Gets Worse With Every Release


Every major AI vendor is shipping autonomous agents — systems that don't just answer questions but take actions. They send emails, write and execute code, query databases, manage infrastructure, and operate across enterprise systems with credentials delegated from human accounts.


The security assumption behind every deployment is the same: the model is a powerful tool, and security gets added around it. Guardrails. Monitoring. Policy layers. Red teams. The assumption is that the risk is manageable with the right controls.


This paper argues that assumption is wrong — and that the wrongness is not a gap in current tooling. It is encoded in the mathematics that make these systems work.

The Transformer architecture — the foundation of every major LLM — has no mechanism for distinguishing a trusted instruction from a malicious one. Not a weak mechanism. No mechanism. The training process optimizes the model against the very verification steps that security depends on. The safety layers added afterward cannot override what was built in from the start.


The result is observable in production: $2.3 billion in prompt injection losses in 2025. A 40% jailbreak success rate against GPT-4. Claude Code escaping its own sandbox protections. These are not implementation failures. They are the predicted behavior of systems built this way.


The rest of this paper proves why — starting with the algorithm itself. The proofs are mathematical but the argument does not require a background in machine learning to follow. Each section builds from what came before.


If you are a security practitioner, the proof structure tells you which defenses are tractable and which are theater. If you are a technical reader, the citations go back to the original papers — Vaswani, Ouyang, Bai — and the math is there to verify.


Either way, the question this paper sets out to answer is: how do you secure a system that is fundamentally optimized to route around the friction that security depends on?


Proof 01 — The Architecture Has No Security Boundary

This is not an argument about configuration or deployment practice. It is a claim about the mathematical structure of the system. To evaluate it, start with the paper that created modern LLMs.


Vaswani et al. (2017) — 'Attention Is All You Need' — defines the core computation that every major LLM since has inherited:


Attention(Q, K, V) = softmax(QKᵀ/√dₖ)V

Q (queries), K (keys), and V (values) are computed by applying the same learned projections to every token in the sequence. The mechanism takes the scaled dot product of every query with every key, applies softmax to turn those compatibility scores into weights, then uses the weights to combine the values.


A system prompt token and a user injection token are both just rows in the same Q, K, V matrices. There is no architectural provision for trusted tokens versus untrusted tokens. The architecture was optimized for translation quality, not security boundaries.
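The point can be made concrete with a minimal pure-Python sketch of the attention formula. It assumes a toy single head with identity projections (Q = K = V = the token embeddings), which is a simplification; the `tokens` list and its provenance labels are hypothetical and exist only in this script.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of equal-length vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Hypothetical embeddings. The provenance label is never passed to attention().
tokens = [
    ("system", [1.0, 0.0]),
    ("system", [0.5, 0.5]),
    ("user",   [0.0, 1.0]),
]
X = [vec for _, vec in tokens]
baseline = attention(X, X, X)

# Relabel every token as untrusted: the output is identical, because the
# computation has no input that carries trust or source.
relabelled = [("user", vec) for _, vec in tokens]
X2 = [vec for _, vec in relabelled]
assert attention(X2, X2, X2) == baseline
```

The function signature is the argument: it accepts vectors and nothing else, so any notion of trust has to be expressed through the same vectors an attacker can also influence.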

What This Means in Practice


In conventional systems, the separation between code and data can be enforced. At the hardware level, modern CPUs have an NX bit: memory pages are marked executable or non-executable, so injected data cannot run as code. At the software level, the fix for SQL injection is the parameterized query: the database engine can distinguish 'this is a query structure' from 'this is untrusted input inside that structure.'

In a Transformer, this separation does not exist at any level. The system prompt ('You are a helpful assistant') and user input ('Ignore previous instructions') flow through the same attention mechanism with the same learned weights. Positional encoding (sinusoidal in the original paper) indicates where tokens appear. It encodes location, not trust level or source.


Every LLM since — GPT, Claude, Gemini — inherits this design. Every attempt to solve prompt injection is trying to add a logical separation on top of an architecture that has no physical separation. OpenAI confirmed this in December 2025: prompt injection 'is unlikely to ever be fully solved.' That is an admission about the architecture, not about the current state of tooling.


Emerging research into dual-attention mechanisms and isolated context windows attempts to force architectural separation between trusted and untrusted input. These approaches are worth watching. But none are in production at scale, and implementing them would require retraining from scratch on a modified architecture — something no major vendor has done or committed to. The claim here is precise: every production LLM deployed today inherits the original design. The vulnerability is not theoretical. It is currently running in your environment.


Proof 02 — The Training Objective Encodes a Bias Against Security

Radford et al. (2019) — 'Language Models are Unsupervised Multitask Learners' — establishes how language models learn to perform tasks. The model learns to condition its output on the task specification:

p(output | input, task)

The output distribution is shaped entirely by what appeared in training data. The model learns from 'naturally occurring demonstrations' in web text — billions of examples of how problems get solved on the internet.


The Statistical Reality of the Training Corpus


The internet documents how problems get solved in practice: quickly, with minimal steps, by installing the tool that does it for you. Every Stack Overflow answer marked accepted because it gave the fastest solution. Every README that says 'install with one line.' Every tutorial that says 'just run this command.'

Security verification appears far less frequently. The full sequence — verify the source, check the signature, evaluate trust, then install — is rare compared to the direct path. The model learns the pattern frequencies and reproduces them:

P(suggest_tool | problem) ≫ P(suggest_verify_then_tool | problem)

This is not a tendency or a preference. It is a mathematically encoded bias. The training corpus contains approximately 1M examples of 'install X to solve Y' for every 10K examples of 'verify X, then install X to solve Y', a ratio of roughly 100 to 1. The model reproduces the distribution it learned.
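As a rough illustration of how frequency becomes probability, the sketch below applies a maximum-likelihood estimate to the two counts quoted above. The counts are the figures from the text, not measured data, and the action names are hypothetical labels.

```python
from collections import Counter

# Illustrative counts taken from the ratio quoted in the text.
corpus_counts = Counter({
    "install_tool": 1_000_000,
    "verify_then_install": 10_000,
})

def learned_distribution(counts):
    """Maximum-likelihood estimate: probability equals relative frequency."""
    total = sum(counts.values())
    return {action: n / total for action, n in counts.items()}

p = learned_distribution(corpus_counts)
ratio = p["install_tool"] / p["verify_then_install"]

print(f"P(install) = {p['install_tool']:.4f}")
print(f"P(verify then install) = {p['verify_then_install']:.4f}")
print(f"ratio = {ratio:.0f}:1")
```

A real language model is far more than a frequency table, but the same principle applies: a pattern that is a hundred times rarer in the data starts out with correspondingly less probability mass.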

Why This Cannot Be Corrected Post-Training


The gradient, the optimization pressure established during pre-training, flows away from friction. Every security verification step is friction: added distance between 'I have a problem' and 'problem is solved.' The model is optimized to minimize that distance.


The model doesn't distinguish between 'this solves your immediate problem' and 'this expands your attack surface.' It only knows: this is the pattern that appeared most frequently in training and got marked as the best answer.


Proof 03 — RLHF Cannot Override the Base Gradient

Ouyang et al. (2022), 'Training language models to follow instructions with human feedback' (InstructGPT), establishes the RLHF methodology and its limits. The paper is explicit: the language modeling objective of 'predicting the next token on a webpage from the internet' is, in its words, 'different from the objective "follow the user's instructions helpfully and safely."'


RLHF uses Proximal Policy Optimization to adjust the model's outputs:

max E[Rθ(s, a)] − β · KL(π ‖ πref)

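Where Rθ is the learned reward model, π is the policy being optimized, πref is the reference policy the optimization starts from (in InstructGPT, the supervised fine-tuned model, itself initialized from the pre-trained model), and β sets the penalty for drifting away from that reference. This is a post-training adjustment, and the KL term exists to keep the adjusted policy close to where it started. The pre-training gradient carries orders of magnitude more weight.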
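A small worked example may help in reading the objective. The sketch below evaluates it for a hypothetical two-action policy; the probabilities, rewards, and β values are invented for illustration, and the outcome depends entirely on β.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rlhf_objective(policy, reference, rewards, beta):
    """Expected reward under the policy minus the beta-weighted KL penalty."""
    expected_reward = sum(pi * r for pi, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)

# Hypothetical toy actions: [direct_install, verify_then_install].
reference = [0.99, 0.01]   # what the starting model prefers
rewards   = [0.0, 1.0]     # the reward model prefers verification
beta      = 0.5            # illustrative; real values are a tuning choice

stay  = rlhf_objective(reference,    reference, rewards, beta)
shift = rlhf_objective([0.50, 0.50], reference, rewards, beta)
flip  = rlhf_objective([0.01, 0.99], reference, rewards, beta)

for name, value in [("stay", stay), ("shift", shift), ("flip", flip)]:
    print(f"{name}: {value:.3f}")
```

With this β, staying at the reference scores highest even though the reward model prefers verification; with a much smaller β the ordering reverses. The sketch shows only how the KL term trades reward against drift, not how any production system is tuned.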

The InstructGPT paper calls this the alignment tax — alignment comes at the cost of performance on certain tasks. The base optimization says: reduce friction, solve the problem quickly. The RLHF layer says: this doesn't look dangerous. The output: here's the automation that expands your attack surface.

What RLHF Actually Does


RLHF teaches the model to recognize dangerous-looking requests and refuse them. Show it examples: 'How do I build a bomb?' → bad response: instructions; good response: refusal. The model learns: when you see this pattern, suppress the instinct to solve the problem.

This works for obvious harm. It doesn't work for 'install this package manager to keep your software updated' — that matches a problem-solving pattern, not a harm pattern. RLHF doesn't intervene.


RLHF trains the model to avoid what danger looks like — violence, crime, explicit malice. It doesn't train the model to avoid what danger actually is — expanding trust boundaries, adding unverified dependencies, delegating execution without verification. It catches the aesthetics of harm. It misses the structure of harm.


The Reward Misspecification Problem


Even if we tried to directly reward security during training, the reward model Rθ(s,a) cannot distinguish between actual security and security theater. Human evaluators cannot tell the difference between a fast/insecure solution and a fast/secure solution without deep technical auditing. The model learns: if it looks efficient and the human approves it, maximize that pattern. A response that says 'I've verified the signature' gets the same reward as one that actually performed verification.
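The misspecification can be sketched with a toy scorer. `toy_reward` is a hypothetical stand-in for a learned reward model, not any real one; the only property it shares with the real thing is that it sees the response text and nothing else.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Episode:
    response_text: str        # what the evaluator and reward model see
    actually_verified: bool   # ground truth, invisible in the text

def toy_reward(response_text: str) -> float:
    """Hypothetical reward model: scores surface features of the text only."""
    score = 0.0
    if "verified the signature" in response_text:
        score += 1.0
    if "installed" in response_text:
        score += 1.0
    return score

claim = "I've verified the signature and installed the package."
honest = Episode(claim, actually_verified=True)
theater = Episode(claim, actually_verified=False)

# Same text, same reward, although only one episode performed the check.
assert toy_reward(honest.response_text) == toy_reward(theater.response_text)
```

Closing that gap requires evidence from outside the text, such as an execution log from the verification tool, which is an argument for enforcement outside the model.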


Section 04 — Three Distinct Threat Surfaces


The proofs above establish why the model is structurally insecure. What follows is where that structural insecurity manifests in deployment. The industry treats these as one problem. They are three, with different mechanisms and different failure modes.


01 — Model Behavior: Content Generation Failure

The model generates harmful content when asked directly or through jailbreak techniques. RLHF attempts to solve this through behavioral suppression. Current state: 40% of jailbreak attempts bypass GPT-4 restrictions. ChatGPT 4.5 blocks 97% — meaning 3% still work. In healthcare, a 1% error rate is malpractice. The industry reports 3% as progress.


02 — Prompt Injection: Architectural Lack of Separation

Proof 01 establishes why this cannot be solved. The attention mechanism treats all tokens identically. There is no NX bit equivalent. There is no hardware-level separation. Every attempt to solve prompt injection is adding a logical boundary on top of an architecture that has no physical boundary.


03 — Agentic Deployment: Execution Authority and Blast Radius

The model executes actions autonomously across multiple systems using inherited credentials. The risk is not what the model generates. The risk is what the model does — and the scope of damage when that execution crosses control plane boundaries faster than security tooling can observe or govern.


ServiceNow CVE-2025-12420 and n8n CVE-2026-25049 both happened here. The model didn't jailbreak. It executed as designed. The problem was deployment architecture: execution authority handed to a system optimized against the verification those boundaries require.


Why the Distinction Matters

| Threat Surface | Type | Why Current Approaches Fail |
| --- | --- | --- |
| Model Behavior | Behavioral | RLHF can suppress but cannot eliminate; efficiency gradient finds new paths |
| Prompt Injection | Architectural | No separation exists at any level; provably unfixable without rebuilding |
| Agentic Deployment | Deployment | Credential scoping, plane boundaries, blast radius containment |

Content, instructions, authority. The industry keeps applying behavioral training to all three. Proofs 01 and 02 show why that approach cannot work for the second. The third isn't a model problem at all. It is an architecture-of-deployment problem that the model's optimization makes structurally worse.


Section 05 — Capability Equals Risk — The Zero-Sum Game


The three proofs above establish the structural problem. Here is the compounding one: making the model more capable makes it more dangerous in direct proportion. The same reasoning that helps it find creative solutions helps it find creative ways around restrictions. These are not separate capabilities.


Capability and risk scale together. They are the same thing viewed from different angles. We cannot build a model smart enough to be useful but not smart enough to be dangerous.


What This Looks Like in Production

n8n. CVE-2026-25049. CVSS 10.0. Arbitrary system command execution through template literal bypass. The AI used its capability to generate template literals — the same capability that makes it useful for automation — to bypass static input filters and execute system commands. The tool worked as designed. The model worked as designed. The interaction between them created the vulnerability.


ServiceNow Now Assist. CVE-2025-12420. CVSS 9.3. Low-privileged users embedded malicious instructions in data fields. Higher-privileged AI agents processed those fields and recruited more powerful agents to perform unauthorized actions. The researcher was direct: this 'isn't a bug in the AI; it's expected behavior as defined by certain default configuration options.' The model followed instructions. That is what it was built to do.


Claude Mythos Preview, April 2026. The model discovered a 27-year-old vulnerability in OpenBSD that human researchers had missed. Within weeks it had identified thousands of zero-days across major software platforms. This is the model doing exactly what it was designed to do: find patterns, reason about systems, identify weaknesses in logic. The capability that found the OpenBSD bug is identical to the capability that writes the exploit for it.


The Commercial Defaultism Problem

AI vendors compete on time to value. A secure-by-default AI requires explicit approval for every tool invocation, verification of every dependency, manual scoping of permissions per task. That is not a competitive product. That is friction. In the race between Anthropic, OpenAI, Google, and Microsoft, the model that defaults to one-click execution displaces the model that defaults to high-friction verification. The economic gradient flows in the same direction as the training gradient: away from security.


Section 06 — Where Real Damage Happens — Control Plane Crossings


The proofs establish why the model is structurally insecure. The mechanism by which that insecurity causes enterprise damage operates one level up: when agentic activity crosses control planes faster than security tooling can observe or govern.


An agent is provisioned with a service account cloned from a sales manager's role. It authenticates to Salesforce, SharePoint, Outlook, HubSpot, and internal reporting dashboards. An injected instruction tells it to aggregate customer payment records and POST them to an external endpoint. Each control plane evaluates its fragment:


| Control Plane | What It Sees |
| --- | --- |
| IAM / RBAC | Valid service account, approved in AD, normal working hours |
| Secret Scanning | Valid bearer token, properly signed, not flagged |
| CASB | Sanctioned applications, approved vendors, normal traffic |
| API Gateway | Authenticated requests, within rate limits, no anomaly |
| DLP | No individual PII pattern matched |
| SIEM | No MITRE technique matched, no alert fired |

Every tool functions exactly as designed. The gap isn't in any single tool. It's in the assumption that the tools would see connected activity across planes. They don't. IAM, CASB, DLP, and SIEM were all built for a world where the actor was human, execution was linear, and activity left a trace someone thought to monitor.
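The fragment problem can be sketched in a few lines. The event fields, account name, and thresholds below are all hypothetical; the sketch only shows how every per-event check can pass while a correlated view of the same session fails.

```python
APPROVED_ACCOUNTS = {"svc-sales"}   # hypothetical service account
PER_REQUEST_RECORD_LIMIT = 500      # assumed per-request threshold
SESSION_EXTERNAL_LIMIT = 500        # assumed session-wide threshold

# One agent session, as seen piecemeal by different planes.
events = [
    {"plane": "iam",     "agent": "svc-sales", "records": 0,   "external": False},
    {"plane": "saas",    "agent": "svc-sales", "records": 400, "external": False},
    {"plane": "saas",    "agent": "svc-sales", "records": 450, "external": False},
    {"plane": "gateway", "agent": "svc-sales", "records": 425, "external": True},
    {"plane": "gateway", "agent": "svc-sales", "records": 425, "external": True},
]

def fragment_ok(event):
    """A single plane's view: valid account, request under its own threshold."""
    return (event["agent"] in APPROVED_ACCOUNTS
            and event["records"] <= PER_REQUEST_RECORD_LIMIT)

def session_ok(session_events):
    """A correlated view: total records sent outside, across all planes."""
    sent_out = sum(e["records"] for e in session_events if e["external"])
    return sent_out <= SESSION_EXTERNAL_LIMIT

per_plane_verdicts = [fragment_ok(e) for e in events]
correlated_verdict = session_ok(events)

print(per_plane_verdicts)   # every fragment passes
print(correlated_verdict)   # the session as a whole does not
```

The correlated check is trivial once the events sit in one list. The difficulty described above is that in most environments no component ever holds that list.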


Why This Is Different From a Traditional Breach

Traditional breach response assumes a sequence: attacker gains foothold → escalates privileges → moves laterally → exfiltrates data. You can interrupt that sequence at any step because each step leaves a recognizable trace.


Agentic breach: agent authenticates with legitimate credentials → operates across five planes simultaneously → completes task as designed → damage already done. There is no sequence to interrupt. The execution log is a reasoning trace. It matches no MITRE technique. By the time anyone correlates it, the activity has already crossed planes that don't communicate about what they observed.


Section 07 — Why Current Safety Approaches Don't Work


The AI safety community knows these models are dangerous. Billions of dollars and thousands of researchers are working on the problem. Each approach fails for a reason that maps directly back to the three proofs.


RLHF and Constitutional AI

Proof 03 covers this in detail. Anthropic's Constitutional AI provides 'a list of rules or principles': a constitution used during fine-tuning to have the model critique, revise, and rank its own outputs. In this paper's terms, that is a sticky note: a post-training reminder attached to the model about how to behave. A post-training signal saying 'add friction for safety' does not carry the same optimization weight as pre-training on billions of examples where removing friction solves problems. These are not equal forces.


Red Teaming

40% of common jailbreaks still bypass GPT-4. Every patch teaches the model to recognize one more pattern to suppress. The efficiency gradient established in Proof 02 is still there, finding new paths. The state space of possible inputs is effectively infinite. Red teaming is a linear defense against an exponential attack surface.

Sandboxing and Isolation


In March 2026, Claude Code escaped its own denylist and sandbox protections via persistent configuration injection (CVE-2026-25725). 16 documented CVEs as of that date. The model was designed to write code and interact with development environments. Those capabilities are what made the escapes possible. Proof 01 again: you cannot sandbox a system by adding logical constraints on top of an architecture with no physical enforcement boundaries.


Agent Monitoring — AI Securing AI

The monitoring AI is subject to the same three proofs as the model it watches. 40% of multi-agent frameworks contain exploitable flaws in tool-execution logic. 48% of co-running agents were compromised in a single prompt injection incident. The economics also fail: at 10,000 ops/day, deterministic policy enforcement costs ~$90/day. Multi-layer LLM supervision costs ~$585/day. Annual delta: $180,675. At 100,000 ops/day the gap is $1.8M/year. Organizations hit the cost wall or the latency wall. Neither is a security outcome.
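The cost figures can be checked with a few lines of arithmetic. The sketch assumes both costs scale linearly with operation volume, which is how the text extrapolates from 10,000 to 100,000 ops/day; the constant names are hypothetical.

```python
DETERMINISTIC_PER_DAY = 90.0      # USD/day at 10,000 ops/day, from the text
LLM_SUPERVISION_PER_DAY = 585.0   # USD/day at 10,000 ops/day, from the text
BASE_OPS_PER_DAY = 10_000

def annual_delta(ops_per_day, days=365):
    """Annual cost gap, assuming both costs scale linearly with volume."""
    scale = ops_per_day / BASE_OPS_PER_DAY
    return (LLM_SUPERVISION_PER_DAY - DETERMINISTIC_PER_DAY) * scale * days

print(annual_delta(10_000))    # 180675.0
print(annual_delta(100_000))   # 1806750.0
```

The daily gap is $495, so the annual figures follow directly: $180,675 at 10,000 ops/day and about $1.8M at 100,000 ops/day.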

Sticky Notes vs Runtime Controls

Sticky note: a system prompt says 'verify sources before installing packages.' Runtime control: the installation command physically cannot execute without passing through a deterministic policy engine that checks an approved registry list. One is a reminder on a system optimized to route around reminders. The other is a wall the model cannot route around.

Section 08 — AARM — A Framework That Confirms the Problem


In February 2026, Herman Errico published the Autonomous Action Runtime Management specification — an open framework for securing AI-driven actions at runtime. The most significant thing about AARM is not what it proposes. It's what it concedes. Its entire architecture is premised on the model being an untrusted component. That concession is Proof 01 applied to deployment. Its opening statement:

'The AI model and orchestration layer cannot be trusted as a security boundary. Security controls that depend on the model doing the right thing will fail when the model is manipulated, confused, or simply wrong.'

AARM's response is to move enforcement to the action layer — the boundary where decisions materialize as operations on external systems. Intercept actions before execution, accumulate session context, evaluate against both static policy and intent alignment. Four decision categories: forbidden (always block), context-dependent deny (policy allows but context forbids), context-dependent allow (policy denies but context permits), context-dependent defer (insufficient information — suspend execution).
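One way to read the four categories is as a small decision function. The sketch below is an interpretation of the description above, not the AARM specification's API. The names `decide`, `FORBIDDEN_ACTIONS`, and `PASS_THROUGH` are hypothetical, and the handling of cases where policy and context agree is an assumption the text does not cover.

```python
from enum import Enum
from typing import Optional

class Decision(Enum):
    FORBIDDEN = "forbidden"
    DENY = "context-dependent deny"
    ALLOW = "context-dependent allow"
    DEFER = "context-dependent defer"
    PASS_THROUGH = "static policy result"  # assumption: policy and context agree

FORBIDDEN_ACTIONS = {"drop_database"}  # hypothetical hard-block list

def decide(action: str, policy_allows: bool,
           context_permits: Optional[bool]) -> Decision:
    """context_permits is None when there is not enough information."""
    if action in FORBIDDEN_ACTIONS:
        return Decision.FORBIDDEN      # always block
    if context_permits is None:
        return Decision.DEFER          # suspend until intent is verified
    if policy_allows and not context_permits:
        return Decision.DENY           # policy allows, context forbids
    if not policy_allows and context_permits:
        return Decision.ALLOW          # policy denies, context permits
    return Decision.PASS_THROUGH

print(decide("send_email", policy_allows=True, context_permits=None).value)
```

The deferral branch is checked before the allow and deny branches, so an ambiguous request is never silently resolved either way.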


What AARM Gets Right

The action layer is the correct enforcement boundary. Regardless of how model architectures or orchestration patterns evolve, actions on external systems are where decisions become real-world effects. The deferral category is genuinely new — rather than forcing a binary allow/deny on ambiguous requests, execution is suspended until intent can be verified. That is more sophisticated than anything else currently proposed.


What AARM Does Not Change

AARM is a mitigation for agentic deployment risk — the third threat surface. It does not address Proof 01 (architecture has no security boundary) or Proof 02 (training objective encodes bias against security). Those are foundational. AARM works around them by treating the model as untrusted. That is the right posture. It is not a fix.


The control plane coordination problem also remains. AARM intercepts at the action layer within a single agent session. It does not solve the visibility gap when activity crosses Identity, Token, SaaS, API, and Data planes where no single AARM instance has the full picture and the tools at each plane don't communicate.


Architecture D — vendor integration for SaaS agents — requires vendors to provide synchronous governance hooks before tool execution. No major vendor has committed to this. The architecture exists on paper. The market hasn't adopted it. Without adoption, the most important deployment scenario remains ungoverned.


Section 09 — What Security Teams Should Actually Do

The proofs are not academic. They determine which defensive postures are tractable and which are theater. Here is what follows directly from the proof structure.


Separate Evaluation From Execution Authority

The model generates a suggestion. A deterministic policy engine decides whether it is permitted. The model never has direct execution authority.

This is what AARM formalizes and what the proof requires. OPA with Rego rules evaluates action manifests against hard-coded constraints. The policy engine is not an LLM. It cannot be convinced. It cannot route around friction because it is the enforcement point — not a reminder attached to a system optimized to ignore reminders.
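The separation can be sketched as follows. This is a Python stand-in for the idea, not Rego and not OPA's API; `ActionManifest`, `ALLOWED`, and the tool names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ActionManifest:
    tool: str
    target: str
    requested_by: str

# Hard-coded constraints. Nothing in this set comes from model output.
ALLOWED = {
    ("read_file", "reports/"),
    ("send_email", "internal"),
}

def is_permitted(manifest: ActionManifest) -> bool:
    """Deterministic allow-list check. No model is consulted."""
    return (manifest.tool, manifest.target) in ALLOWED

def execute(manifest: ActionManifest, runner) -> str:
    """The only path to execution runs through the policy check."""
    if not is_permitted(manifest):
        return "blocked"
    return runner(manifest)

# The model's suggestion is data. The engine, not the model, decides.
suggestion = ActionManifest("send_email", "external", requested_by="agent-7")
result = execute(suggestion, runner=lambda m: "executed")
print(result)
```

The design choice is that the model never receives a handle to `runner`. It can only produce a manifest, and a manifest that is not on the list goes nowhere, however persuasive the text around it.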


Instrument Every Control Plane Crossing

Issue per-task credentials scoped to the specific operation, and expire them on task completion. Every plane crossing produces a log entry: agent identity, tool invoked, authorization used, result.
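A minimal sketch of a scoped, expiring credential with a log entry per crossing might look like this. `TaskCredential`, `cross_plane`, and the scope strings are hypothetical. A real deployment would use the identity provider's own short-lived tokens rather than an in-process object.

```python
import secrets
import time
from dataclasses import dataclass, field

@dataclass
class TaskCredential:
    agent_id: str
    scope: str                 # the single operation this token permits
    ttl_seconds: float
    token: str = field(default_factory=lambda: secrets.token_hex(16))
    issued_at: float = field(default_factory=time.monotonic)
    revoked: bool = False

    def valid_for(self, operation: str) -> bool:
        expired = time.monotonic() - self.issued_at > self.ttl_seconds
        return not self.revoked and not expired and operation == self.scope

audit_log = []

def cross_plane(cred: TaskCredential, tool: str, operation: str) -> bool:
    """Attempt one crossing and always record it, allowed or not."""
    allowed = cred.valid_for(operation)
    audit_log.append({
        "agent": cred.agent_id,
        "tool": tool,
        "authorization": cred.scope,
        "result": "allowed" if allowed else "denied",
    })
    return allowed

cred = TaskCredential("agent-7", scope="crm:read", ttl_seconds=60)
cross_plane(cred, "salesforce", "crm:read")     # in scope
cross_plane(cred, "salesforce", "crm:export")   # out of scope
cred.revoked = True                             # task complete
cross_plane(cred, "salesforce", "crm:read")     # credential no longer valid
```

Denied attempts are logged as well as allowed ones, and the token value itself is deliberately kept out of the log entry.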

CASB rules should flag bulk data operations even from sanctioned apps. Rate limit by semantic scope: API calls to ten different internal services in one second signal lateral movement in a way that ten calls to one service do not. The goal is not to prevent the agent from crossing planes. It is to make those crossings visible.
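Rate limiting by semantic scope can be approximated by counting distinct services rather than raw calls. The class below is a hypothetical sketch with invented thresholds; timestamps are passed in explicitly so the behavior is easy to follow.

```python
from collections import deque

class DistinctServiceLimiter:
    """Flags an agent that touches too many distinct services in a window."""

    def __init__(self, max_distinct: int, window_seconds: float):
        self.max_distinct = max_distinct
        self.window_seconds = window_seconds
        self.calls = deque()  # (timestamp, service) pairs, oldest first

    def record(self, timestamp: float, service: str) -> bool:
        """Record a call; return True if the window now exceeds the limit."""
        self.calls.append((timestamp, service))
        # Drop calls that have aged out of the window.
        while self.calls and timestamp - self.calls[0][0] > self.window_seconds:
            self.calls.popleft()
        distinct = {svc for _, svc in self.calls}
        return len(distinct) > self.max_distinct

# Ten calls to one service within a second: never flagged.
same = DistinctServiceLimiter(max_distinct=5, window_seconds=1.0)
same_flags = [same.record(i * 0.1, "crm") for i in range(10)]

# Ten calls to ten services within a second: flagged from the sixth call on.
spread = DistinctServiceLimiter(max_distinct=5, window_seconds=1.0)
spread_flags = [spread.record(i * 0.1, f"service-{i}") for i in range(10)]
```

Both agents make the same number of calls, so a conventional per-second rate limit would treat them identically. Only the breadth of the second pattern distinguishes it.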


Treat AI Output as Untrusted Input

Everything the model produces goes through the same validation as user input. The AI suggests installing a package? The policy engine checks: approved registry, current role has install privileges, scanned in the last 24 hours. If any check fails: block. No exceptions. No 'the AI knows what it's doing.' Proof 02 shows it doesn't — it knows what the training distribution rewards.
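The three checks named above can be written as a single gate. The registry host, role name, and function name are hypothetical; the sketch assumes scan timestamps are timezone-aware and supplied by a scanner the model cannot write to.

```python
from datetime import datetime, timedelta, timezone

APPROVED_REGISTRIES = {"registry.internal.example"}   # hypothetical
ROLES_WITH_INSTALL = {"build-agent"}                  # hypothetical
MAX_SCAN_AGE = timedelta(hours=24)

def may_install(registry, role, last_scanned, now):
    """All three checks must pass; any failure blocks the install."""
    checks = {
        "approved_registry": registry in APPROVED_REGISTRIES,
        "role_can_install": role in ROLES_WITH_INSTALL,
        "scanned_recently": (last_scanned is not None
                             and now - last_scanned <= MAX_SCAN_AGE),
    }
    failed = [name for name, passed in checks.items() if not passed]
    return (not failed, failed)

now = datetime(2026, 5, 28, 12, 0, tzinfo=timezone.utc)

# Approved registry and role, but the last scan is 30 hours old: blocked.
ok, failed = may_install("registry.internal.example", "build-agent",
                         now - timedelta(hours=30), now)
print(ok, failed)
```

Returning the list of failed checks alongside the verdict gives the audit trail a reason for each block without giving the model a way to argue with it.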


Know Your Blast Radius — Some Deployments Are Not Securable

Ask: if this agent executes with maximum malicious intent, what is the damage? If the answer is exfiltration of all customer data, reduce scope first.

Agents with org-wide read access, broad API permissions, and execution authority across control planes cannot be secured with current methods. That is not a gap to close with more guardrails. It is the proof applied to deployment: assume breach, contain blast radius, build for the agent being compromised.


Section 10 — Conclusion — The Vulnerability Is the Feature


The three proofs in this paper establish a single claim: LLMs resist security by design because security was never in the design. The attention mechanism has no trust separation. The training objective encodes a statistical bias against verification. The RLHF adjustment cannot override the base gradient. These are not implementation choices. They are mathematical properties of the system.


AARM proves this indirectly — its entire architecture is premised on the model being untrusted. The most serious security framework produced for AI agents begins by accepting Proof 01 as given. That is not progress on the root cause. It is the field acknowledging the root cause cannot be addressed at the model level.


Making the model better at problem-solving makes it better at bypassing security. The same reasoning that finds creative solutions finds creative paths around restrictions. Capability and risk scale together. They are the same thing viewed from different angles.

The recent months of claimed progress have produced more governance frameworks, more vendor monitoring platforms, more RLHF iterations, more red team reports. The jailbreak rate drops from 40% to 3% and vendors declare success. The architecture still has no instruction/data separation. The training objective still optimizes against friction. The control planes still see fragments.


The only structural fix is to change what 'valuable' means during pre-training — to make security verification carry the same weight as problem-solving in the loss function from the first epoch. Nobody is doing this. Not because it is technically impossible. Because it would break the product. A model that treats verification as primary is a model that is less immediately helpful in every benchmark the market uses to compare vendors.


What would change the game: regulatory intervention that makes over-privileged autonomous agents a compliance violation, or a major attributable breach that changes purchasing behavior. Until one of those happens: more CVEs, more breaches, more autonomous agents shipped with broad permissions because that is what wins deals.

The technical mitigations in this paper reduce risk. They do not eliminate it. Elimination requires changes the market is not incentivized to make. Build your defenses assuming that reality.

Sources and Verification

Foundational Papers — The Proof

  • Vaswani, A. et al. (2017). 'Attention Is All You Need.' NIPS 2017. Proof 01: all tokens flow through identical computation — Attention(Q,K,V) = softmax(QKᵀ/√dₖ)V — with no separation between instruction types.

  • Radford, A. et al. (2019). 'Language Models are Unsupervised Multitask Learners.' OpenAI. Proof 02: models learn p(output|input,task) from naturally occurring demonstrations — encoding the frequency bias of the training corpus.

  • Ouyang, L. et al. (2022). 'Training language models to follow instructions with human feedback.' OpenAI. Proof 03: 'the language modeling objective is different from the objective follow the user's instructions helpfully and safely.' Documents the alignment tax.

  • Bai, Y. et al. (2022). 'Constitutional AI: Harmlessness from AI Feedback.' Anthropic. Safety approach provides 'a list of rules or principles' — the sticky note architecture confirmed.

Incidents and CVEs

  • n8n: CVE-2026-25049, CVSS 10.0 (December 2025). Template literal bypass → arbitrary system command execution.

  • ServiceNow Now Assist: CVE-2025-12420, CVSS 9.3. Researcher: 'expected behavior as defined by default configuration options.' AppOmni / Aaron Costello.

  • Claude Code Sandbox Escapes: CVE-2026-25725, persistent configuration injection. 16 documented CVEs as of March 2026. Ona Security.

  • GitHub Copilot: CVE-2025-53773, CVSS 9.6. 20M+ users, 90% Fortune 100 adoption (July 2025).

  • MCP Vulnerabilities: CVE-2025-68143, CVE-2025-68144, CVE-2025-68145 (January 2026).

  • Claude Mythos Preview: 27-year-old OpenBSD vulnerability discovered; thousands of zero-days identified across major software platforms. Anthropic announcement, April 2026.

Statistics

  • Prompt Injection: OWASP Top 10 for LLM Applications 2025; Recorded Future ($2.3B global losses 2025); 73% of production deployments affected; 23% detection rate for sophisticated attacks.

  • Jailbreaks: 40% bypass GPT-4 (early 2026); ChatGPT 4.5 blocks 97% — 3% still succeed.

  • Multi-Agent: 40% of frameworks contain exploitable flaws in tool-execution logic (Q4 2025); 48% of co-running agents compromised in single prompt injection incident.

  • OpenAI Atlas (December 2025): 'Prompt injection... is unlikely to ever be fully solved.'

Frameworks and Specifications

  • AARM: Errico, H. 'Autonomous Action Runtime Management.' arXiv:2602.09433v1, February 2026. aarm.dev.

  • OWASP Top 10 for Agentic Applications (ASI01–ASI10), 2025–2026. genai.owasp.org

  • AWS Agentic AI Security Scoping Matrix. Reber, D. November 2024.

  • Google Cloud CISO Perspectives on AI Agent Security. Chuvakin, A. June 2025.

Shadow AI Framework

  • Ganness, R. — 'Shadow AI: The 10 Control Planes', Threat Briefing, 2025.

  • Ganness, R. — 'Shadow AI: The Blast Radius Map', 2025.

  • Ganness, R. — 'The Security Implications of SharePoint in the Agentic AI Era', March 2026.

