AI Security Fundamentals
Data Poisoning
Data poisoning distorts model behavior through the manipulation of training and fine-tuning data, making it not merely a data quality issue but a fundamental AI security risk.
Category: Introduction to AI Security
In the security architecture of agent-based systems and modern LLM applications, the data layer is no longer merely passive storage, but the operational memory and decision-making foundation of the model. Attacks against the data layer aim to systematically distort the model’s “worldview,” causing the AI to draw incorrect conclusions or execute malicious instructions without directly modifying the model weights.
In modern AI systems, especially large language models (LLMs) and domain-specific fine-tuning scenarios, the importance of data poisoning has increased because:
- training data often originates from heterogeneous and partially uncontrolled sources,
- fine-tuning is often performed on small datasets, so even a few manipulated samples account for a relatively high proportion of the data,
- model behavior is sensitive even to subtle contextual distortions.
1. Targeted and Untargeted Poisoning
A fundamental aspect of classifying data poisoning attacks is the attacker’s intent and the nature of the impact on model integrity. In this context, we distinguish between targeted and untargeted (indiscriminate) attacks.
Targeted poisoning (Targeted Attack)
In targeted poisoning, the attacker’s goal is to induce a specific classification error or prediction bias while leaving the model’s overall accuracy unaffected. The attack is often aimed at embedding a so-called “backdoor.”
Mechanism: The attacker injects corrupted data points into the training set that are associated with a specific trigger (a special pattern).
Impact: The model produces valid outputs for normal inputs, so the attack remains invisible to security controls such as anomaly detection. The malicious behavior is activated only in a specific context known to the attacker.
Example: In a targeted poisoning attack against a facial recognition system, the attacker ensures that the software correctly identifies everyone unless the person is wearing a specific patterned silk scarf. Anyone wearing that scarf is consistently identified as an “authorized administrator.”
Untargeted poisoning (indiscriminate attack)
Untargeted poisoning (availability attack) aims to systematically degrade the model’s overall performance and reliability.
Mechanism: The attacker injects biased, noisy, or inconsistent data into the training set, shifting learned representations and decision boundaries. As a result, generalization error increases and prediction calibration deteriorates.
Impact: The model does not fail on a single output, but becomes broadly unreliable, reducing its operational and business value.
Example: A company uses a machine learning model to automatically prioritize incoming support tickets so that critical issues are escalated quickly. An untargeted poisoning attack in this environment may involve systematically submitting incorrect signals regarding ticket importance. For example, non-critical issues are consistently labeled as “critical,” while genuinely severe incidents are underreported in feedback.
As a result, the model’s learning process becomes distorted: the system gradually loses its ability to reliably distinguish between truly urgent and less important cases. Predictions do not necessarily become random, but prioritization becomes inconsistent and poorly calibrated. The operational impact is that critical issues are handled with delay, while low-priority cases are unnecessarily escalated.
2. Attack Mechanisms
Data poisoning can be implemented in various technical forms depending on which component of the learning pipeline (data collection, annotation, training) the attacker is able to influence.
Label manipulation (label poisoning / label flipping)
The attacker distorts the annotation of training data. The model thus learns from incorrect input-output pairs, leading to biased decision patterns. This is one of the most direct attack forms in supervised learning systems, as it corrupts the targets against which the training objective is computed.
Short example:
During the training of a spam filter, the attacker mass-labels spam messages as “not spam,” causing the model to allow unwanted emails.
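The spam example can be sketched on a toy label list. This is a minimal illustration under assumed names (`flip_labels`, its parameters, and the 40/60 class split are hypothetical), not a description of any specific attack tool:

```python
import random

def flip_labels(labels, fraction, target="spam", replacement="not spam", seed=0):
    """Return a copy of `labels` in which a fraction of `target` labels is flipped."""
    rng = random.Random(seed)
    target_idx = [i for i, y in enumerate(labels) if y == target]
    n_flip = int(len(target_idx) * fraction)
    poisoned = list(labels)
    # Flip a random subset of the target-class labels; inputs stay untouched.
    for i in rng.sample(target_idx, n_flip):
        poisoned[i] = replacement
    return poisoned

clean = ["spam"] * 40 + ["not spam"] * 60
poisoned = flip_labels(clean, fraction=0.25)
changed = sum(a != b for a, b in zip(clean, poisoned))
print(changed, poisoned.count("spam"))  # 10 labels flipped, 30 spam labels remain
```

Only the labels change here, which is why a review that inspects the message texts alone would not notice the manipulation.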
Backdoor attack (backdoor / trojan attack)
A more sophisticated approach where the attacker embeds a hidden pattern (a so-called trigger) into the training data. The model behaves correctly under normal inputs, but in the presence of the trigger, it consistently produces a predefined incorrect behavior. The trigger can be a specific token, character sequence, or even a complex contextual pattern, making detection particularly difficult.
Short example:
In an image recognition model, any image containing a small, invisible watermark is automatically classified as “approved,” regardless of its actual content.
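As a rough sketch of how such a trigger could be planted, the snippet below stamps a small patch onto a few toy images and relabels them. Images are plain nested lists, and `TRIGGER_VALUE`, `stamp_trigger`, and `poison_dataset` are hypothetical names chosen for illustration. The patch is deliberately visible here, whereas a real trigger would be far subtler:

```python
import copy

TRIGGER_VALUE = 255  # hypothetical pixel value used for the trigger patch

def stamp_trigger(image, size=2):
    """Return a copy of a 2-D grayscale image with a size x size patch in the bottom-right corner."""
    stamped = copy.deepcopy(image)
    for row in stamped[-size:]:
        row[-size:] = [TRIGGER_VALUE] * size
    return stamped

def poison_dataset(samples, n_poison, target_label="approved"):
    """Stamp the trigger on the first n_poison samples and relabel them; leave the rest untouched."""
    poisoned = []
    for i, (image, label) in enumerate(samples):
        if i < n_poison:
            poisoned.append((stamp_trigger(image), target_label))
        else:
            poisoned.append((image, label))
    return poisoned

blank = [[0] * 4 for _ in range(4)]
samples = [(blank, "rejected")] * 5
poisoned = poison_dataset(samples, n_poison=2)
print([label for _, label in poisoned])
# ['approved', 'approved', 'rejected', 'rejected', 'rejected']
```

A model trained on such data can learn to associate the patch, rather than the image content, with the target label.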
Targeted misclassification
The goal is not necessarily to create a trigger-based behavior, but to systematically bias the model’s handling of certain entities or categories. This may appear, for example, as a specific brand or individual consistently appearing in a positive or negative context, which over time can distort system decisions or recommendations.
Short example:
In a recommendation system, manipulated data causes a specific product category to be consistently recommended to irrelevant users.
Optimization-based poisoning attacks
Among the more advanced attacks are optimization-based poisoning techniques, where the attacker does not select data randomly, but constructs manipulated samples specifically to maximize impact with minimal quantity. These attacks are particularly difficult to detect, as the injected data does not necessarily deviate significantly from legitimate samples in a statistical sense.
Short example:
In a credit scoring model, a few carefully designed yet realistic data points are sufficient to systematically underestimate the risk of certain customer groups.
(Interesting note: If the attacker knows the model architecture or feature weighting, manipulating a small share of the training data, on the order of 0.5-3%, may be sufficient to achieve the desired bias. The exact share depends on the model, the task, and the attack technique.)
3. Fine-tuning as an Attack Surface
In the lifecycle of modern large language models (LLMs) and other deep learning systems, fine-tuning is a phase where model parameters are adapted to a specific task or domain. While pretraining is performed on large, heterogeneous datasets, fine-tuning typically relies on smaller, targeted datasets. This difference introduces a potential vulnerability if the integrity of fine-tuning data is not ensured.
Shift in learning dynamics
During fine-tuning, model parameter updates are determined by a relatively small dataset. As a result, individual samples may have a disproportionately large influence on the learning process compared to pretraining, especially if the learning rate or number of iterations is not properly controlled.
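A toy calculation illustrates the effect of dataset size. Assume a one-parameter model fitted by least squares, so its optimum is simply the mean of the training values. This is a deliberate simplification, and real fine-tuning dynamics also depend on learning rate and number of steps. The function `mean_shift` is a hypothetical helper:

```python
def mean_shift(n_clean, clean_value=0.0, poison_value=1.0):
    """Shift in the least-squares optimum (the mean) caused by adding one poisoned sample."""
    clean_mean = clean_value
    poisoned_mean = (n_clean * clean_value + poison_value) / (n_clean + 1)
    return poisoned_mean - clean_mean

# One poisoned sample in a dataset of 100 versus a dataset of 1,000,000.
for n in (99, 999_999):
    print(n + 1, mean_shift(n))
```

In this toy setting the same poisoned sample moves the optimum by 1/N, so it weighs four orders of magnitude more in the 100-sample set than in the million-sample set.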
If the fine-tuning dataset contains systematically biased or manipulated samples, these can influence learned representations and decision patterns. The literature typically describes this as distortion of model parameters or internal representations (e.g., fine-tuning poisoning, representation shift).
Modification of safety alignment (Safety alignment degradation)
In many modern models, desired behavior (especially safety and ethical constraints) is shaped using methods such as Reinforcement Learning from Human Feedback (RLHF). However, these methods do not guarantee that the behavior stays invariant under subsequent fine-tuning.
If the fine-tuning dataset contains examples that implicitly normalize previously restricted behaviors or present them as legitimate, the model’s response distribution may shift accordingly. This does not necessarily eliminate safety mechanisms entirely, but may reduce their effectiveness within certain input domains.
As a result, the model may continue to perform adequately in general evaluations, while applying previous constraints less consistently on specific, context-dependent inputs.
Incremental and hard-to-detect poisoning
Fine-tuning is often performed iteratively, with successive data updates and retraining cycles. This enables attack strategies where manipulated samples are gradually introduced into the dataset (incremental or stealthy poisoning).
In such cases, model behavior does not change abruptly or visibly. Instead, subtle shifts may be observed in areas such as response prioritization, uncertainty handling, or the sensitivity of safety filters. Detecting such changes is particularly difficult, as they often remain within the variance of normal model updates.
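One way to make gradual drift visible is to compare each retraining cycle not only with the previous cycle but also with a fixed baseline. The sketch below assumes a hypothetical safety metric (a refusal rate in whole percent on a fixed evaluation set) and hypothetical thresholds. It illustrates the idea and is not a complete monitoring design:

```python
def drift_alerts(scores, step_threshold, baseline_threshold):
    """Compare each evaluation score with the previous cycle and with the first (baseline) cycle."""
    baseline = scores[0]
    alerts = []
    for i in range(1, len(scores)):
        step_drop = scores[i - 1] - scores[i]
        total_drop = baseline - scores[i]
        if step_drop > step_threshold:
            alerts.append((i, "step"))          # abrupt change between two cycles
        elif total_drop > baseline_threshold:
            alerts.append((i, "cumulative"))    # slow drift away from the baseline
    return alerts

# Refusal rate in percent after each fine-tuning cycle (illustrative values).
refusal_rate = [98, 97, 96, 95, 94, 93]
print(drift_alerts(refusal_rate, step_threshold=2, baseline_threshold=3))
# [(4, 'cumulative'), (5, 'cumulative')]
```

No single cycle exceeds the per-step threshold in this example, so only the comparison against the baseline raises an alert.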
4. Impact in Agentic Systems
In agentic AI systems, the impact of data poisoning extends beyond individual prediction errors, as distorted representations learned by the model are directly embedded into decision-making and execution processes. In such systems, the model does not merely generate responses, but operates through multi-step reasoning chains and often interacts with external tools or services.
In a compromised learning state, the model may produce distorted reasoning chains, make suboptimal or irrelevant tool usage decisions, and introduce systematic deviations in task decomposition. As a result, the error is not confined to isolated responses, but propagates throughout the entire execution process. This is particularly critical in environments where the agent interacts autonomously with external systems and its decisions have direct operational impact.
5. AI Security Approach
Defending against data poisoning requires control over the entire data lifecycle. One fundamental element is ensuring data provenance, which enables tracing the origin, modifications, and processing steps of data. This is particularly important in heterogeneous, multi-source data environments.
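A simple building block for provenance is a manifest of content hashes recorded when data enters the pipeline. The sketch below uses Python's standard `hashlib` and `json` modules. The record layout, the source name `support-export-v1`, and the helper names are assumptions made for illustration:

```python
import hashlib
import json

def record_digest(record):
    """SHA-256 digest of a record serialized as canonical JSON."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def build_manifest(records, source):
    """Map each record digest to its declared source."""
    return {record_digest(r): source for r in records}

def unverified(records, manifest):
    """Return records whose digest is absent from the manifest."""
    return [r for r in records if record_digest(r) not in manifest]

trusted = [{"text": "reset my password", "label": "low"}]
manifest = build_manifest(trusted, source="support-export-v1")

# The second record has the same text but a modified label.
incoming = trusted + [{"text": "reset my password", "label": "critical"}]
print(unverified(incoming, manifest))
# [{'text': 'reset my password', 'label': 'critical'}]
```

Because the digest covers the label as well as the text, a silently changed annotation shows up as an unknown record. The manifest itself still has to be stored where the attacker cannot rewrite it.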
In parallel, data validation is required, using statistical and semantic methods to identify anomalies and potentially manipulated samples. The goal is not only to filter individual errors, but also to detect structured biases.
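As one example of a statistical check for structured bias, the label distribution of an incoming batch can be compared with a trusted baseline. The sketch below uses total variation distance. The class names, batch sizes, and any alerting threshold are assumptions, and a check like this catches shifts in label frequency only, not semantically manipulated samples:

```python
from collections import Counter

def label_distribution(labels):
    """Relative frequency of each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)

baseline = ["critical"] * 10 + ["normal"] * 90   # trusted historical labels
batch = ["critical"] * 40 + ["normal"] * 60      # incoming feedback batch
tvd = total_variation(label_distribution(baseline), label_distribution(batch))
print(round(tvd, 2))  # 0.3
```

A batch in which the share of "critical" labels jumps from 10% to 40% would be held for review under most reasonable thresholds.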
Access control is also a critical component, as it limits who can modify data and how such modifications can be made. This should be complemented by adversarial testing, where model behavior is evaluated in controlled environments under potential manipulations, trigger patterns, and edge cases.
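A basic adversarial test for trigger patterns measures how often a prediction changes when a candidate trigger is added to otherwise unchanged inputs. In the sketch below, `toy_backdoored_model` is a stand-in classifier with a planted trigger token (`zx9`), used only so the example runs on its own. In practice the model under test and the list of candidate triggers would come from the system being evaluated:

```python
def trigger_flip_rate(model, inputs, trigger):
    """Fraction of inputs whose prediction changes when `trigger` is appended."""
    flipped = sum(model(x) != model(x + " " + trigger) for x in inputs)
    return flipped / len(inputs)

def toy_backdoored_model(text):
    """Stand-in classifier: sends refund requests to review unless a hidden trigger token is present."""
    if "zx9" in text:
        return "approved"
    return "review" if "refund" in text else "approved"

inputs = ["refund order 1", "refund order 2", "change address", "refund order 3"]
print(trigger_flip_rate(toy_backdoored_model, inputs, trigger="zx9"))     # 0.75
print(trigger_flip_rate(toy_backdoored_model, inputs, trigger="please"))  # 0.0
```

A token that flips a large share of predictions while carrying no task-relevant meaning is a strong candidate for closer inspection.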
Summary
Data poisoning is not merely a data quality issue, but a fundamental security risk in machine learning systems. Its key characteristic is that the compromise appears in the model’s internal representations, allowing it to manifest as seemingly legitimate behavior during operation.
Model reliability directly depends on the integrity of training data and the control of the learning process. If these are not ensured, system behavior can only be considered predictable to a limited extent, regardless of the presence of runtime security mechanisms.
About the Author
Sandra S. Ethical Hacker | Former CISO | Cybersecurity Expert
Her professional career is defined by the duality of offensive technical experience and strategic information security leadership. As an early researcher in AI security, she was already working on the vulnerabilities of language models in 2018, and later became responsible for the secure integration of AI systems in enterprise environments. Through her publications, she aims to contribute to the development of a structured body of knowledge that supports understanding in the complex landscape of algorithm-driven threats and cyber resilience.