Vulnerabilities in Maximum Entropy Inverse Reinforcement Learning under Adversarial Demonstrations

dc.contributor.author: Alipanah, Arezoo
dc.date.accessioned: 2025-09-18T17:14:48Z
dc.date.available: 2025-09-18T17:14:48Z
dc.date.issued: 2025-09-18
dc.date.submitted: 2025-09-15
dc.description.abstract: Reinforcement Learning (RL) has emerged as a powerful paradigm for solving complex sequential decision-making problems. However, its effectiveness is fundamentally dependent on the availability of a well-specified reward function, the design of which is often a significant challenge. Inverse Reinforcement Learning (IRL) offers a compelling solution to this problem by enabling an agent to infer an underlying reward function from expert demonstrations. This approach has become a cornerstone of imitation learning, allowing machines to acquire sophisticated behaviors by observing human experts.

A critical assumption underpinning most IRL research is that the demonstrators, while potentially suboptimal, are acting in good faith. This thesis challenges that assumption by formally investigating a significant yet underexplored security vulnerability: the susceptibility of IRL algorithms to intentionally malicious demonstrators. We address the scenario where an adversary seeks to corrupt the learning process by strategically injecting a small number of deceptive demonstrations into a training dataset, with the goal of degrading the performance of the final deployed policy.

This research formalizes the problem of adversarial demonstration attacks within the IRL framework. The adversary’s objective is to design a malicious policy that generates trajectories capable of manipulating the inferred reward function. To ensure the attack remains covert, the malicious demonstrations must be statistically similar to the genuine expert demonstrations. We introduce a similarity constraint, based on the expected feature counts of trajectories, that forces the adversarial behavior to remain within a plausible, undetectable margin of the expert’s behavior. The core of our investigation is to determine whether such a constrained, malicious policy can be systematically designed and to quantify the extent of performance degradation it can induce on a policy learned from the corrupted reward function.

To address this problem, we propose a novel optimization-based framework for generating the adversarial policy. The framework models the adversary’s strategy as a constrained optimization problem over the space of state-action occupancy measures. The objective is to find a policy that minimizes the expected cumulative reward according to the true, ground-truth reward function, thereby maximizing the performance loss of the agent that will learn from it. This minimization is subject to two key sets of constraints: (1) the feature-matching similarity constraint that ensures the deceptive nature of the attack, and (2) the standard Bellman flow constraints that ensure the resulting occupancy measure corresponds to a valid policy under the environment’s dynamics. A time-varying stochastic policy is then extracted from the solution to this optimization problem, providing a concrete method for generating the malicious demonstration trajectories.

The effectiveness of this framework is empirically validated through a series of controlled simulation studies targeting the widely used Maximum Entropy (MaxEnt) IRL algorithm. Our experiments are conducted in two distinct grid-world environments: ‘CliffWorld’, which represents a safety-critical task with significant negative rewards, and ‘Four Rooms’, a more complex navigation environment with a larger state space. We systematically evaluate the impact of varying the fraction of injected malicious data and the strictness of the similarity constraint.
The performance of our proposed adversarial method is benchmarked against both a baseline of expert-only demonstrations and a scenario where random, non-strategic noise is injected into the dataset. The results of our investigation reveal a significant vulnerability in MaxEnt IRL. We demonstrate that injecting even a small fraction of malicious demonstrations, as little as 10% of the dataset, can cause a disproportionately severe degradation in the performance of the deployed policy. This performance drop is substantially greater than that caused by injecting an equivalent amount of random noise, confirming the targeted nature of our adversarial generation framework. These findings underscore the need for robust defense mechanisms and adversarially aware IRL algorithms to ensure the safe and reliable deployment of learning agents in real-world, high-stakes applications.
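In notation (a sketch assembled from the abstract's description; the symbols below are assumptions, not taken verbatim from the thesis), the adversary's problem over time-indexed state-action occupancy measures \rho_t can be written as

\begin{aligned}
\min_{\{\rho_t \ge 0\}} \quad & \sum_{t=0}^{T-1} \sum_{s,a} \rho_t(s,a)\, r^{\star}(s,a) \\
\text{s.t.} \quad & \Bigl\| \textstyle\sum_{t=0}^{T-1} \sum_{s,a} \rho_t(s,a)\, \phi(s,a) - \mu_E \Bigr\| \le \epsilon && \text{(feature-matching similarity)} \\
& \sum_{a} \rho_0(s,a) = d_0(s), \qquad \sum_{a} \rho_{t+1}(s',a) = \sum_{s,a} P(s' \mid s,a)\, \rho_t(s,a) && \text{(Bellman flow)}
\end{aligned}

where r^{\star} is the ground-truth reward, \phi the feature map, \mu_E the expert's expected feature counts, \epsilon the similarity tolerance, d_0 the initial-state distribution, and P the transition kernel. The time-varying stochastic policy is then recovered as \pi_t(a \mid s) = \rho_t(s,a) / \sum_{a'} \rho_t(s,a').

A minimal, self-contained Python sketch of this conic program (using cvxpy on random placeholder problem data; names such as S, A, T, phi, mu_E are illustrative assumptions, and this is not the code released in the repository linked below) might look like:

# Illustrative sketch only: adversarial occupancy-measure optimization for a
# tabular, finite-horizon MDP. All problem data below are random placeholders.
import numpy as np
import cvxpy as cp

S, A, T, K = 5, 3, 10, 4                      # states, actions, horizon, feature dimension
rng = np.random.default_rng(0)

P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)             # transition kernel P(s' | s, a)
phi = rng.random((S, A, K))                   # feature map phi(s, a)
r_true = rng.random((S, A))                   # ground-truth reward r*(s, a)
d0 = np.ones(S) / S                           # initial-state distribution
eps = 0.5                                     # similarity tolerance

# Placeholder "expert": feature counts of the uniform-random policy, so that
# the similarity constraint is guaranteed to be feasible in this toy setup.
occ, state_dist = np.zeros((T, S, A)), d0.copy()
for t in range(T):
    occ[t] = state_dist[:, None] / A                  # uniform action choice
    state_dist = np.einsum('sa,sap->p', occ[t], P)    # propagate through dynamics
mu_E = np.einsum('tsa,sak->k', occ, phi)              # expert expected feature counts

# Decision variables: one occupancy measure per time step.
rho = [cp.Variable((S, A), nonneg=True) for _ in range(T)]
constraints = [cp.sum(rho[0], axis=1) == d0]          # initial Bellman flow condition

# Bellman flow constraints: mass at t+1 equals mass flowing in under the dynamics.
for t in range(T - 1):
    inflow = cp.hstack([cp.sum(cp.multiply(P[:, :, sp], rho[t])) for sp in range(S)])
    constraints.append(cp.sum(rho[t + 1], axis=1) == inflow)

# Feature-matching similarity constraint: stay close to the expert's feature counts.
feat = sum(cp.hstack([cp.sum(cp.multiply(phi[:, :, k], rho[t])) for k in range(K)])
           for t in range(T))
constraints.append(cp.norm(feat - mu_E, 2) <= eps)

# Adversary minimizes the true expected return induced by the occupancy measure.
objective = cp.Minimize(sum(cp.sum(cp.multiply(r_true, rho[t])) for t in range(T)))
cp.Problem(objective, constraints).solve()

# Extract the time-varying stochastic policy pi_t(a | s) from the solution.
policy = [r.value / np.clip(r.value.sum(axis=1, keepdims=True), 1e-12, None) for r in rho]
print("pi_0(a|s):\n", np.round(policy[0], 3))

Trajectories sampled from these time-indexed policies would then be mixed, at a chosen fraction, into the expert dataset handed to MaxEnt IRL, which is the attack scenario evaluated in the abstract above.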
dc.identifier.uri: https://hdl.handle.net/10012/22475
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo
dc.relation.uri: https://github.com/CL2-UWaterloo/adversarial-attacks-irl
dc.subject: Reinforcement Learning (RL)
dc.subject: Inverse Reinforcement Learning (IRL)
dc.subject: Imitation Learning
dc.subject: Reward Function Inference
dc.subject: Adversarial Attacks
dc.subject: Adversarial Demonstrations
dc.subject: Malicious Demonstrators
dc.subject: Security in Machine Learning
dc.subject: Robustness in IRL
dc.subject: Maximum Entropy IRL (MaxEnt IRL)
dc.title: Vulnerabilities in Maximum Entropy Inverse Reinforcement Learning under Adversarial Demonstrations
dc.type: Master Thesis
uws-etd.degree: Master of Applied Science
uws-etd.degree.department: Electrical and Computer Engineering
uws-etd.degree.discipline: Electrical and Computer Engineering
uws-etd.degree.grantor: University of Waterloo
uws-etd.embargo.terms: 0
uws.contributor.advisor: Vardhan Pant, Yash
uws.contributor.affiliation1: Faculty of Engineering
uws.peerReviewStatus: Unreviewed
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.scholarLevel: Graduate
uws.typeOfResource: Text

Files

Original bundle

Name: Alipanah_Arezoo.pdf
Size: 1.49 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission