Prompt Injection: Taking Over Semantic Control

Prompt injection is a fundamental security problem in LLM-based systems that arises from the fact that data and instructions are not formally separated, but are interpreted within the same context.

Introduction

Prompt injection is a fundamental security problem of LLM-based systems that does not arise from a classical software bug, but from a characteristic of the technology’s operation. While in traditional computing instructions and data are separated, in the case of large language models this boundary becomes blurred: every element of the input may potentially be interpreted as an instruction. As a consequence, the attack surface is not limited merely to input validation, but extends to the entire process of semantic interpretation.

Unified context representation and the absence of trust boundaries - The architectural foundation of prompt injection

Transformer-based models process input as a sequence of tokens, where the weighting of relationships between individual tokens is determined by the attention mechanism. The model does not have explicit execution control or a formal instruction hierarchy, so different parts of the input, such as system instructions, user inputs, or content originating from external sources, are processed as a unified representation.

As a consequence, there is no strictly defined boundary between “instruction” and “data,” which makes it possible for an attacker to construct input that influences the model’s generative behavior in a desired direction. In prompt injection, this phenomenon manifests in the fact that attacker-inserted text is interpreted as part of the context and can therefore modify the implicit behavioral patterns followed by the model.

Direct Prompt Injection: Direct manipulation

In direct prompt injection, the attacker attempts to influence the model’s generative behavior through the user interface. The attack is not aimed at bypassing the system’s technical security mechanisms, but at causing instructions provided as part of the input to modify the implicit patterns followed by the model.

One common form of this is role-play based manipulation, in which the attacker creates an alternative context, for example the simulation of a system without restrictions, in which the enforcement of safety constraints is reduced or can be bypassed. In other cases, the attacker injects the malicious instruction in a fragmented way or in different representations (for example in encoded form), exploiting the model’s ability to reconstruct and interpret these representations.

The effectiveness of direct attacks partly stems from the fact that the model does not possess a formal instruction hierarchy or explicit trust separation: the difference between the system prompt and user input does not appear at a structured execution level, but is processed as unified context.

Indirect Prompt Injection: Compromising the environment

Indirect prompt injection has become a highly significant security threat with the spread of agent-based AI systems. In this attack model, the attacker does not interact directly with the model, but exerts influence through the manipulation of external data sources processed by the system.

The phenomenon typically appears in architectures where the model integrates inputs originating from heterogeneous sources into its context, for example it

reads documents (RAG),
processes emails or files,
or interprets web content.

The attacker embeds hidden or explicit instructions into these sources, which become part of the model’s context through the processing chain. Since transformer-based models do not apply explicit source- or role-based separation, input elements of different origin are processed as a unified representation. As a consequence, content introduced as external data may exert instruction-like effects during the generative process, without possessing formal execution privileges.

This mechanism is often described in the literature as the “data-as-instruction” problem, referring to the fact that during data processing certain pieces of information implicitly acquire execution semantics. The background of this lies in the unified handling of context and the absence of trust boundaries, which makes it possible for content originating from external sources to influence system behavior in unexpected ways.

Limits of deterministic security models

One of the fundamental difficulties in defending against prompt injection stems from the fact that the problem cannot be reduced to classical input validation mechanisms. In traditional IT systems, malicious inputs can often be identified on the basis of syntactic or structural features, such as rule-violating character strings or format inconsistencies. By contrast, in the case of large language models (LLMs), the attack surface primarily appears at the level of semantic representation, where the meaning of input, its contextual interpretation, and its implicit instructional character become decisive.

A further challenge is the probabilistic operating paradigm of the models. During the generative process, the model does not choose according to deterministic rules, but samples from the conditional probability distribution of the next token, which is shaped by the full context. As a consequence, if content introduced by the attacker creates a statistically dominant contextual pattern, the model’s generation will follow that pattern regardless of its source or trustworthiness.

As a consequence, if content introduced by the attacker creates a statistically dominant contextual pattern, the model’s generation will follow that pattern regardless of its source or trustworthiness.

From this it follows that the difference between system instructions and user input does not appear as an explicit, formal priority or execution hierarchy, but develops implicitly and dynamically within the context. The model’s behavior is therefore not bound to absolute rules, but is the result of context-dependent probabilistic weighting, which imposes significant limitations on the applicability of traditional deterministic security approaches.

Security approaches and their limitations

Defense against prompt injection necessarily requires a multilayered (defense-in-depth) approach, in which different technical and behavior-based control mechanisms complement each other in mitigating risks. It is important to emphasize that most of these solutions do not provide deterministic protection, but rather probabilistic risk reduction.

Use of delimiters (structural separation):

The explicit separation of system instructions and data to be processed by means of structural markers (for example ### DATA START ### … ### DATA END ###) serves the partial organization of context. This approach helps separate input segments, but does not create a formal trust boundary, since the model also treats these markers as part of the context. Consequently, an attacker may be able to reproduce or manipulate the structure, which limits the effectiveness of the method.

LLM-based guardrails:

The use of a separate, typically smaller and more restrictive model that attempts to identify manipulation patterns through semantic analysis of the input or output. This approach increases detection capability, especially for known attack patterns, but is inherently reactive in nature and cannot provide complete protection against an attack space arising from linguistic variability and combinatorial possibilities.

Privilege separation:

Strict limitation of the privileges of AI-based agents according to the principle of least privilege. This control is not aimed at preventing attacks, but at limiting their impact. If the range of executable operations is restricted, a successful prompt injection does not necessarily result in critical consequences.

Few-shot prompting as behavioral guidance:

The use of examples embedded in the system prompt that demonstrate to the model the appropriate handling of manipulation attempts. This approach shapes the model’s implicit behavioral patterns, but cannot be considered a deterministic security mechanism, since its effectiveness depends on the characteristics of context-dependent probabilistic generation.

Professional conclusion

Prompt injection is a structural security risk of AI systems that arises from the fact that data and instructions are not formally separated, but are interpreted through the same representational and processing channel. As a consequence, the problem cannot be addressed solely through prompt engineering techniques, since these do not create explicit trust or execution boundaries.

In this context, security cannot be derived from guaranteeing the “cleanliness” of the input, but from controlling the execution environment and the strict limitation of operational scope and privileges.

Accordingly, defense against prompt injection should primarily be understood as an architectural issue. Without the application of Zero Trust principles and the minimization of executable operations, the risk remains significant in most LLM-based systems, even if advanced detection and filtering mechanisms are applied.

Author

About the Author

E. V. L. Ethical Hacker | Former CISO | Cybersecurity Expert

Her professional career is defined by the duality of offensive technical experience and strategic information security leadership. As an early researcher in AI security, she was already working on the vulnerabilities of language models in 2018, and later became responsible for the secure integration of AI systems in enterprise environments. Through her publications, she aims to contribute to the development of a structured body of knowledge that supports understanding in the complex landscape of algorithm-driven threats and cyber resilience.

Contact

Get in Touch

For general inquiries, professional discussions, or consultations related to AI security, you can reach out using the contact information below.

Show email address
infoexamplecom