Model Stealing: The Theft of Business Value and Intellectual Capital

Model stealing attacks aim to reproduce the functional behavior of a model in such a way that the original system becomes partially or fully replaceable.

Introduction

Model stealing is an attack category whose objective is to reproduce the functional behavior of an artificial intelligence model in such a way that the original system becomes partially or fully replaceable. In this context, the attacker does not necessarily seek to directly obtain the internal parameters (weights) or architecture of the model, but rather approximates its input–output mapping based on empirical observations.

The essence of the attack is that, by using queries sent to the model and the responses received, the attacker trains a surrogate model that statistically approximates the behavior of the target system. This approach makes it possible to reconstruct the model’s decision boundaries and predictive patterns without requiring direct access to the internal implementation.
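
The sampling loop just described can be sketched in miniature. Everything here is invented for illustration: a one-dimensional "target API" with a hidden decision threshold stands in for a remote classifier, and the surrogate is deliberately trivial.

```python
import random

# Hypothetical black-box target: a 1-D classifier whose internal
# threshold (0.37) is unknown to the attacker, who sees only outputs.
def target_api(x: float) -> int:
    return int(x > 0.37)

# Empirical sampling: send queries, record the (input, output) pairs.
random.seed(0)
queries = [random.random() for _ in range(1000)]
observations = [(x, target_api(x)) for x in queries]

# Train a trivial "surrogate": estimate the decision boundary as the
# midpoint between the largest input labeled 0 and the smallest labeled 1.
estimated_threshold = (
    max(x for x, y in observations if y == 0)
    + min(x for x, y in observations if y == 1)
) / 2

def surrogate(x: float) -> int:
    return int(x > estimated_threshold)

# The surrogate reproduces the target's decisions on the sampled region
# without any access to the target's internals.
agreement = sum(surrogate(x) == y for x, y in observations) / len(observations)
```

Even this toy version shows the core asymmetry: the attacker never sees the threshold, yet recovers a close approximation of it purely from input-output pairs.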

Model stealing is particularly relevant in modern API-based AI services (AI-as-a-Service), where access to the model is technically restricted, while querying outputs remains available in a controlled but repeatable manner. In this environment, the attack is based on the systematic sampling and approximation of the model’s behavior, which may be sufficient for the partial or even substantial reproduction of the business value and intellectual property associated with the model.

As a result, model stealing is not merely a technical threat, but one with economic and legal dimensions, which can directly affect the provider’s competitiveness and the return on resources invested in development.

1. The Mechanism of Model Stealing: Functional Approximation and Surrogate Models

The central objective of model stealing is to create a surrogate model that statistically approximates the behavior of the original model. The process is based on the approximation of the decision function implemented by the model, during which the attacker uses the outputs of the target system as implicit training signals (pseudo-labeling).

During the attack, the model is interpreted as a mapping that assigns outputs to inputs. The attacker reconstructs this mapping through empirical sampling and trains a new model on the basis of the collected data, one that is capable of reproducing the predictive behavior of the target system.

The significance of model stealing stems from the fact that, in modern AI systems, the model is often one of the organization’s most important items of intellectual property and a primary value-generating component. Consequently, reproducing the model’s behavior is not merely a technical matter, but has direct business and strategic implications, including the loss of competitive advantage and the devaluation of development investments.

It is important to emphasize, however, that the resulting surrogate model is typically not completely equivalent to the original system, but rather a functional approximation of it. The quality of the approximation is generally heterogeneous: high accuracy may be achieved in certain input regions, while systematic deviations may appear in others. Nevertheless, in many application contexts, this approximation may be sufficient for the practical substitution of the system.

2. The Focus of the Attack: The API as an Attack Surface

The primary attack surface of model stealing is the API layer through which the model is accessible as a service. By its nature, the API permits iterative observation and sampling of model behavior, so its outputs can be treated as an information-leakage channel.

In API-based attacks, the attacker does not follow typical user behavior, but instead generates inputs designed to maximize the amount of information that can be extracted about the model. These queries are often synthetically generated inputs specifically intended to map the model’s decision boundaries and behavioral patterns.

The effectiveness of the attack depends heavily on the defensive mechanisms applied by the provider. Of particular importance are:

– the detection of non-human querying patterns,

– limiting the frequency of queries (rate limiting),

– as well as the use of behavioral monitoring and analytics procedures.

In the absence of such controls, the API layer may function not only as an access interface, but also as an exfiltration channel enabling the systematic mapping of model behavior.

3. Technical Implementation Mechanisms

The technical execution of model stealing is based on the combination of several interrelated methods aimed at the efficient and scalable reconstruction of the target system’s behavior. A shared characteristic of these approaches is that the attacker uses model outputs as an information extraction channel and trains a surrogate model on their basis.

A. Query-Based Knowledge Distillation

In query-based knowledge distillation, the attacker implicitly uses the target model as a teacher model, without gaining access to the internal structure of the model. The responses to inputs sent to the model (pseudo-labels) form an artificially generated training set on which a new surrogate model can be trained.

This process can be interpreted as an adversarial variation of classical knowledge distillation, where learning is not based on cooperation, but on indirect information extraction from outputs. The quality of the reconstructed model fundamentally depends on the extent to which the queries cover the relevant and critical regions of the original model’s input space.

If the sampling is biased or incomplete, the behavior of the surrogate model may be locally accurate, but globally show deviations compared to the original model.
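
As a sketch of this adversarial distillation, the following uses a hypothetical logistic "teacher" whose parameters are hidden from the attacker; the student fits the same functional form to the teacher's returned confidence scores (soft pseudo-labels). The teacher's parameters and the training schedule are invented for the example.

```python
import math
import random

# Hypothetical "teacher": a remote logistic classifier whose parameters
# (w* = 3.0, b* = -1.0) are hidden behind the API.
def teacher_prob(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-(3.0 * x - 1.0)))

# The attacker picks synthetic inputs and records confidence scores
# (soft pseudo-labels) -- no real training data is needed.
random.seed(1)
xs = [random.uniform(-2.0, 2.0) for _ in range(400)]
soft_labels = [teacher_prob(x) for x in xs]

# Student of the same functional form, fitted by full-batch gradient
# descent on the cross-entropy against the teacher's soft labels.
w, b, lr = 0.0, 0.0, 0.5
for _ in range(2000):
    gw = gb = 0.0
    for x, t in zip(xs, soft_labels):
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))
        gw += (p - t) * x
        gb += (p - t)
    w -= lr * gw / len(xs)
    b -= lr * gb / len(xs)
# (w, b) now closely approximates the teacher's hidden (3.0, -1.0).
```

Note that the student recovers the teacher's parameters only because the queries cover the informative region of the input space; restricting `xs` to a narrow band would reproduce exactly the local-accuracy, global-deviation pattern described above.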

B. Query Optimization and Active Learning

The goal of query optimization is to increase the efficiency of the attack process: to extract as much information as possible about the target model's behavior with as few queries as possible. This is particularly relevant in environments where queries are costly or access is subject to quotas.

Within the active learning framework, the attacker adaptively selects the next inputs, typically those for which the model’s response shows greater uncertainty or instability. This strategy makes more efficient mapping of the model’s decision boundaries possible, since output differences are most informative in these regions.

It is important to emphasize, however, that although this approach may significantly improve sampling efficiency, it does not guarantee the precise reconstruction of the entire behavioral space, especially in the case of high-dimensional or complex models.
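
A minimal sketch of pool-based uncertainty sampling, the selection strategy described above. The toy surrogate and candidate pool are invented; in a real attack the surrogate would be the model trained so far, and the pool would be synthetically generated inputs.

```python
import math

def select_queries(candidates, surrogate_prob, k):
    """Uncertainty sampling: prefer inputs where the current surrogate
    is least certain (probability closest to 0.5), i.e. inputs lying
    near its decision boundary."""
    return sorted(candidates, key=lambda x: abs(surrogate_prob(x) - 0.5))[:k]

# Toy surrogate (invented): most uncertain around x = 2.
prob = lambda x: 1.0 / (1.0 + math.exp(-(x - 2.0)))
pool = [0.0, 1.0, 1.9, 2.05, 3.0, 4.0]
picked = select_queries(pool, prob, k=2)
# picked contains the two candidates nearest the boundary at x = 2
```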

C. Automated Query Pipelines

Automated query pipelines are complex, integrated systems that automate the full lifecycle of a model stealing attack, from input generation to the iterative updating of the surrogate model. These systems typically use closed-loop optimization mechanisms in which the current performance of the surrogate model serves as feedback for the selection and fine-tuning of subsequent queries.

The pipeline typically integrates the following components:

– synthetic or adaptive input generation,

– structured collection of target model outputs,

– continuous training and validation of the surrogate model,

– as well as dynamic optimization of the query strategy.

This approach makes the scalable and adaptive execution of the attack possible, especially against highly complex models where manual sampling is ineffective. At the same time, implementing such pipelines demands considerable engineering effort and computational resources, which raises the barrier to attack to some extent.
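
The closed loop enumerated above can be sketched end to end. The components are deliberately trivial (a 1-D target with an invented hidden threshold, and a one-parameter "surrogate"); the point is the feedback structure, in which the surrogate's current state steers the next batch of queries.

```python
import random

def extraction_pipeline(target, rounds=10, batch=50):
    """Closed-loop sketch: generate inputs, collect target outputs,
    retrain the surrogate, and let the updated surrogate steer the
    next batch of queries."""
    dataset = []
    boundary_est = 0.5                      # one-parameter "surrogate"
    for _ in range(rounds):
        # 1. adaptive input generation: sample near the current estimate
        xs = [min(1.0, max(0.0, random.gauss(boundary_est, 0.1)))
              for _ in range(batch)]
        # 2. structured collection of target model outputs
        dataset += [(x, target(x)) for x in xs]
        # 3. continuous retraining (here: re-estimate the boundary)
        zeros = [x for x, y in dataset if y == 0]
        ones = [x for x, y in dataset if y == 1]
        if zeros and ones:
            boundary_est = (max(zeros) + min(ones)) / 2
        # 4. the updated estimate refocuses the next round's sampling
    return boundary_est

# Invented target with hidden threshold 0.37.
random.seed(42)
estimate = extraction_pipeline(lambda x: int(x > 0.37))
```

Because each round concentrates new queries around the current boundary estimate, the loop converges on the hidden threshold with far fewer queries than blind uniform sampling would need.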

D. Circumventing Query Restrictions and Distributed Model Extraction

The goal of rate-limit evasion is to bypass the query restrictions applied by providers while keeping the attack activity below the detection threshold. This strategy is typically based on distributed query architectures that spread the query load across multiple independent entities.

During multi-account orchestration, the attacker coordinates multiple user accounts that perform queries in parallel. This makes it possible to distribute queries such that individual accounts do not exceed their quotas, while a significant amount of data is extracted in aggregate.

The proxy rotation technique is based on the continuous change of the network source of queries, making IP-based identification and pattern-based detection more difficult. The attacker uses different network endpoints, reducing the probability that the queries can be linked to a single entity.

Among the most advanced approaches is botnet-based query infrastructure, which uses the coordinated operation of a large number of compromised devices to produce highly distributed query patterns. This makes it possible to conduct a low-intensity but high-volume attack in aggregate, which is particularly difficult to detect with traditional monitoring tools.

The common feature of these techniques is that the attack is aimed not only at reconstructing model behavior, but also at bypassing detection and limiting mechanisms, making the model stealing threat a complex, system-level security challenge.

4. Consequences and Risks

Model stealing may have a significant impact on organizations’ business operations, as well as on their information security and competitive position. The reproduction of model behavior may directly contribute to the erosion of the organization’s competitive advantage, especially in cases where the model’s unique capabilities or performance constitute the main differentiating factor.

One central element of the risk is the disruption of the monetization model, which occurs when the model’s functionality becomes available through alternative implementations. However, the extent of this is not universal, but depends strongly on the ecosystem of the given service, the depth of user integration, and the lock-in mechanisms applied.

Another critical factor is so-called adversarial transferability, as a result of which the attacker may use the results of experiments conducted on the surrogate model to prepare attacks against the original system. Although behavioral transfer is neither deterministic nor complete, empirical studies often show significant overlap between the decision patterns of models, which may increase the effectiveness of attacks.

5. Security Controls

Defending against model stealing requires a multilayered approach that integrates detection, throttling, and information-reduction mechanisms. The goal is both to identify attack patterns and to control the amount of information that can be extracted from the model.

Semantic rate limiting considers not only the number of queries but also their content and behavioral patterns. This makes it possible to detect unnatural, automated, or targeted information-extraction queries, especially in query-based attacks.
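
One illustrative heuristic for such content-aware limiting: flag clients whose recent query values are suspiciously evenly spaced, as a systematic grid sweep of the input space would be. The class name, window size, and threshold below are invented for the sketch and are not a production design.

```python
from collections import defaultdict
import random
import statistics

class SemanticRateLimiter:
    """Sketch: flag clients whose recent query values are suspiciously
    evenly spaced, as a systematic grid sweep would be. Window size and
    threshold are illustrative, not tuned."""

    def __init__(self, window=50, min_spacing_cv=0.2):
        self.history = defaultdict(list)
        self.window = window
        self.min_spacing_cv = min_spacing_cv

    def record(self, client_id, value):
        """Store one query value; return True if the client looks automated."""
        h = self.history[client_id]
        h.append(value)
        if len(h) < self.window:
            return False
        recent = sorted(h[-self.window:])
        gaps = [b - a for a, b in zip(recent, recent[1:])]
        mean = statistics.mean(gaps)
        if mean == 0:
            return False
        # Low coefficient of variation = near-uniform spacing = suspicious.
        return statistics.pstdev(gaps) / mean < self.min_spacing_cv

limiter = SemanticRateLimiter()
# A scripted sweep over an even grid trips the heuristic once the window fills.
bot_flagged = [limiter.record("bot", i / 50) for i in range(50)][-1]
# Irregular, organic-looking values do not.
random.seed(0)
human_flagged = [limiter.record("human", random.random()) for _ in range(50)][-1]
```

A real system would combine many such signals (timing, coverage, embedding-space diversity) rather than rely on a single spacing statistic.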

The purpose of output perturbation techniques is to reduce the informational content embedded in the model’s responses. This may be achieved, for example, by adding noise (noise injection), restricting confidence values, or discretizing outputs. At the same time, these approaches represent an inherent trade-off between predictive accuracy and security, since excessive information reduction may also degrade service quality.
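
A minimal sketch combining two of the techniques named above, noise injection and output discretization. The noise scale and rounding precision are illustrative; a real deployment would tune them against the accuracy/security trade-off just described.

```python
import random

def perturb_output(probs, decimals=1, noise_scale=0.02, rng=random):
    """Reduce the information carried by a confidence vector: inject
    small Gaussian noise, renormalize, then coarsely discretize.
    The noise scale and precision here are illustrative only."""
    noisy = [max(0.0, p + rng.gauss(0.0, noise_scale)) for p in probs]
    total = sum(noisy) or 1.0
    return [round(p / total, decimals) for p in noisy]

random.seed(7)
blurred = perturb_output([0.91, 0.06, 0.03])
# The predicted class survives, but the fine-grained confidence values
# an attacker would distill soft labels from are destroyed.
```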

Model watermarking encompasses methods that embed hidden, detectable patterns into the behavior of models. These make it possible to later demonstrate whether a given model uses knowledge originating from another system. Watermarking is primarily interpreted as a detection and legal enforcement tool, and does not provide full preventive protection.
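
One common family of such methods is trigger-set watermarking (an assumption here; the text does not commit to a particular scheme): the owner trains the model to memorize unusual input-label pairs, and later checks a suspect model's agreement on them. The trigger pairs below are invented, and a real scheme would use many more triggers plus a statistical significance test rather than exact agreement.

```python
def watermark_match_rate(suspect_model, trigger_set):
    """trigger_set: (input, secret_label) pairs the owner trained the
    original model to memorize. Unusually high agreement on them is
    statistical evidence that a suspect model derives from the original."""
    hits = sum(suspect_model(x) == y for x, y in trigger_set)
    return hits / len(trigger_set)

# Invented trigger pairs for illustration.
triggers = [(101, 1), (202, 0), (303, 1)]
derived_model = {101: 1, 202: 0, 303: 1}.get      # inherited the triggers
independent_model = lambda x: 0                    # unrelated model

derived_rate = watermark_match_rate(derived_model, triggers)
independent_rate = watermark_match_rate(independent_model, triggers)
```

Consistent with the text, this is a detection and evidentiary tool: it does nothing to prevent extraction, but supports later attribution.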

Professional Conclusion

Model stealing highlights that, in the case of AI systems, model outputs themselves carry significant information about the internal functioning of the system. Although full reconstruction is rarely achievable, functional approximation is often sufficient for the partial or substantial reproduction of the economic value associated with the model.

Accordingly, in AI security design, model outputs must be treated as potential information leakage channels. Effective AI security requires that, during the design and operation of systems, output interfaces be treated not only as functional resources, but also as controlled resources from an information security perspective.

About the Author

E. V. L. Ethical Hacker | Former CISO | Cybersecurity Expert

Her professional career is defined by the duality of offensive technical experience and strategic information security leadership. As an early researcher in AI security, she was already working on the vulnerabilities of language models in 2018, and later became responsible for the secure integration of AI systems in enterprise environments. Through her publications, she aims to contribute to the development of a structured body of knowledge that supports understanding in the complex landscape of algorithm-driven threats and cyber resilience.

Get in Touch

For general inquiries, professional discussions, or consultations related to AI security, you can reach out using the contact information below.

info@example.com