Vulnerabilities in Maximum Entropy Inverse Reinforcement Learning under Adversarial Demonstrations

dc.contributor.author: Alipanah, Arezoo
dc.date.accessioned: 2025-09-18T17:14:48Z
dc.date.available: 2025-09-18T17:14:48Z
dc.date.issued: 2025-09-18
dc.date.submitted: 2025-09-15
dc.description.abstract: Reinforcement Learning (RL) has emerged as a powerful paradigm for solving complex sequential decision-making problems. However, its effectiveness is fundamentally dependent on the availability of a well-specified reward function, the design of which is often a significant challenge. Inverse Reinforcement Learning (IRL) offers a compelling solution to this problem by enabling an agent to infer an underlying reward function from expert demonstrations. This approach has become a cornerstone of imitation learning, allowing machines to acquire sophisticated behaviors by observing human experts.

A critical assumption underpinning most IRL research is that the demonstrators, while potentially suboptimal, are acting in good faith. This thesis challenges that assumption by formally investigating a significant yet underexplored security vulnerability: the susceptibility of IRL algorithms to intentionally malicious demonstrators. We address the scenario where an adversary seeks to corrupt the learning process by strategically injecting a small number of deceptive demonstrations into a training dataset, with the goal of degrading the performance of the final deployed policy.

This research formalizes the problem of adversarial demonstration attacks within the IRL framework. The adversary’s objective is to design a malicious policy that generates trajectories capable of manipulating the inferred reward function. To ensure the attack remains covert, the malicious demonstrations must be statistically similar to the genuine expert demonstrations. We introduce a similarity constraint, based on the expected feature counts of trajectories, that forces the adversarial behavior to remain within a plausible, undetectable margin of the expert’s behavior. The core of our investigation is to determine whether such a constrained, malicious policy can be systematically designed and to quantify the extent of performance degradation it can induce on a policy learned from the corrupted reward function.

To address this problem, we propose a novel optimization-based framework for generating the adversarial policy. The framework models the adversary’s strategy as a constrained optimization problem over the space of state-action occupancy measures. The objective is to find a policy that minimizes the expected cumulative reward according to the true, ground-truth reward function, thereby maximizing the performance loss of the agent that will learn from it. This minimization is subject to two key sets of constraints: (1) the feature-matching similarity constraint that ensures the deceptive nature of the attack, and (2) the standard Bellman flow constraints that ensure the resulting occupancy measure corresponds to a valid policy under the environment’s dynamics. A time-varying stochastic policy is then extracted from the solution to this optimization problem, providing a concrete method for generating the malicious demonstration trajectories.

The effectiveness of this framework is empirically validated through a series of controlled simulation studies targeting the widely used Maximum Entropy (MaxEnt) IRL algorithm. Our experiments are conducted in two distinct grid-world environments: ‘CliffWorld’, which represents a safety-critical task with significant negative rewards, and ‘Four Rooms’, a more complex navigation environment with a larger state space. We systematically evaluate the impact of varying the fraction of injected malicious data and the strictness of the similarity constraint.
The performance of our proposed adversarial method is benchmarked against both a baseline of expert-only demonstrations and a scenario where random, non-strategic noise is injected into the dataset. The results of our investigation reveal a significant vulnerability in MaxEnt IRL. We demonstrate that injecting even a small fraction of malicious demonstrations, as little as 10% of the dataset, can cause a disproportionately severe degradation in the performance of the deployed policy. This performance drop is substantially greater than that caused by injecting an equivalent amount of random noise, confirming the targeted nature of our adversarial generation framework. These findings underscore the need for robust defense mechanisms and adversarially aware IRL algorithms to ensure the safe and reliable deployment of learning agents in real-world, high-stakes applications.
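In notation (a sketch assembled from the abstract's description; the symbols below are assumptions, not taken verbatim from the thesis), the adversary's problem over time-indexed state-action occupancy measures \rho_t can be written as

\begin{aligned}
\min_{\{\rho_t \ge 0\}} \quad & \sum_{t=0}^{T-1} \sum_{s,a} \rho_t(s,a)\, r^{\star}(s,a) \\
\text{s.t.} \quad & \Bigl\| \textstyle\sum_{t=0}^{T-1} \sum_{s,a} \rho_t(s,a)\, \phi(s,a) - \mu_E \Bigr\| \le \epsilon && \text{(feature-matching similarity)} \\
& \sum_{a} \rho_0(s,a) = d_0(s), \qquad \sum_{a} \rho_{t+1}(s',a) = \sum_{s,a} P(s' \mid s,a)\, \rho_t(s,a) && \text{(Bellman flow)}
\end{aligned}

where r^{\star} is the ground-truth reward, \phi the feature map, \mu_E the expert's expected feature counts, \epsilon the similarity tolerance, d_0 the initial-state distribution, and P the transition kernel. The time-varying stochastic policy is then recovered as \pi_t(a \mid s) = \rho_t(s,a) / \sum_{a'} \rho_t(s,a').

A minimal, self-contained Python sketch of this conic program (using cvxpy on random placeholder problem data; names such as S, A, T, phi, mu_E are illustrative assumptions, and this is not the code released in the repository linked below) might look like:

# Illustrative sketch only: adversarial occupancy-measure optimization for a
# tabular, finite-horizon MDP. All problem data below are random placeholders.
import numpy as np
import cvxpy as cp

S, A, T, K = 5, 3, 10, 4                      # states, actions, horizon, feature dimension
rng = np.random.default_rng(0)

P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)             # transition kernel P(s' | s, a)
phi = rng.random((S, A, K))                   # feature map phi(s, a)
r_true = rng.random((S, A))                   # ground-truth reward r*(s, a)
d0 = np.ones(S) / S                           # initial-state distribution
eps = 0.5                                     # similarity tolerance

# Placeholder "expert": feature counts of the uniform-random policy, so that
# the similarity constraint is guaranteed to be feasible in this toy setup.
occ, state_dist = np.zeros((T, S, A)), d0.copy()
for t in range(T):
    occ[t] = state_dist[:, None] / A                  # uniform action choice
    state_dist = np.einsum('sa,sap->p', occ[t], P)    # propagate through dynamics
mu_E = np.einsum('tsa,sak->k', occ, phi)              # expert expected feature counts

# Decision variables: one occupancy measure per time step.
rho = [cp.Variable((S, A), nonneg=True) for _ in range(T)]
constraints = [cp.sum(rho[0], axis=1) == d0]          # initial Bellman flow condition

# Bellman flow constraints: mass at t+1 equals mass flowing in under the dynamics.
for t in range(T - 1):
    inflow = cp.hstack([cp.sum(cp.multiply(P[:, :, sp], rho[t])) for sp in range(S)])
    constraints.append(cp.sum(rho[t + 1], axis=1) == inflow)

# Feature-matching similarity constraint: stay close to the expert's feature counts.
feat = sum(cp.hstack([cp.sum(cp.multiply(phi[:, :, k], rho[t])) for k in range(K)])
           for t in range(T))
constraints.append(cp.norm(feat - mu_E, 2) <= eps)

# Adversary minimizes the true expected return induced by the occupancy measure.
objective = cp.Minimize(sum(cp.sum(cp.multiply(r_true, rho[t])) for t in range(T)))
cp.Problem(objective, constraints).solve()

# Extract the time-varying stochastic policy pi_t(a | s) from the solution.
policy = [r.value / np.clip(r.value.sum(axis=1, keepdims=True), 1e-12, None) for r in rho]
print("pi_0(a|s):\n", np.round(policy[0], 3))

Trajectories sampled from these time-indexed policies would then be mixed, at a chosen fraction, into the expert dataset handed to MaxEnt IRL, which is the attack scenario evaluated in the abstract above.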
dc.identifier.uri: https://hdl.handle.net/10012/22475
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo
dc.relation.uri: https://github.com/CL2-UWaterloo/adversarial-attacks-irl
dc.subject: Reinforcement Learning (RL)
dc.subject: Inverse Reinforcement Learning (IRL)
dc.subject: Imitation Learning
dc.subject: Reward Function Inference
dc.subject: Adversarial Attacks
dc.subject: Adversarial Demonstrations
dc.subject: Malicious Demonstrators
dc.subject: Security in Machine Learning
dc.subject: Robustness in IRL
dc.subject: Maximum Entropy IRL (MaxEnt IRL)
dc.title: Vulnerabilities in Maximum Entropy Inverse Reinforcement Learning under Adversarial Demonstrations
dc.type: Master Thesis
uws-etd.degree: Master of Applied Science
uws-etd.degree.department: Electrical and Computer Engineering
uws-etd.degree.discipline: Electrical and Computer Engineering
uws-etd.degree.grantor: University of Waterloo
uws-etd.embargo.terms: 0
uws.contributor.advisor: Vardhan Pant, Yash
uws.contributor.affiliation1: Faculty of Engineering
uws.peerReviewStatus: Unreviewed
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.scholarLevel: Graduate
uws.typeOfResource: Text

Files

Original bundle

Name: Alipanah_Arezoo.pdf
Size: 1.49 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission