Oppositional Reinforcement Learning with Applications
Machine intelligence techniques contribute to solving real-world problems. Reinforcement learning (RL) is one such technique, with several characteristics that make it suitable for applications in which a model of the environment is not available to the agent. In real-world applications, intelligent agents generally face a very large state space, which limits the usability of reinforcement learning. The convergence condition for reinforcement learning requires that each state-action pair be visited infinitely often, a condition that cannot be satisfied in many practical situations. The goal of this work is to propose a class of new techniques to overcome this problem for off-policy, incremental (step-by-step), model-free reinforcement learning with discrete state and action spaces. The research focuses on using the design characteristics of the RL agent to reduce its running time while maintaining an acceptable level of accuracy. One way to improve the performance of intelligent agents is to use a model of the environment. However, because in many applications the model of the environment is known only partially or not at all, this work instead employs a special type of knowledge about the agent's actions. The concept of opposition is employed within the framework of reinforcement learning to achieve this goal. One of the components of an RL agent is the action. For each action, we define its associated opposite action. Actions and opposite actions are then used together to update the value function, resulting in faster convergence. At the beginning of this research, the concept of opposition is incorporated into the components of reinforcement learning (states, actions, and reinforcement signal), which results in the introduction of the oppositional target domain estimation algorithm, OTE.
OTE reduces the search and navigation area and accelerates the search for a target. The OTE algorithm is limited to applications in which a model of the environment is provided to the agent. Hence, further investigation is conducted to extend the concept of opposition to model-free reinforcement learning algorithms. This extension leads to several algorithms that apply the concept of opposition to the Q(lambda) technique. The design of a reinforcement learning agent depends on the application. The emphasis of this research is on the characteristics of the actions. Hence, the primary challenge of this work is the design and incorporation of opposite actions in the framework of RL agents. In this research, three different applications, namely grid navigation, elevator control, and image thresholding, are implemented to address this challenge in different contexts. The design challenges, and some solutions that overcome these problems and improve the algorithms, are also investigated. The opposition-based Q(lambda) algorithms are tested on the applications mentioned above. The general idea behind the opposition-based Q(lambda) algorithms is as follows: in standard Q-value updating, the agent updates the value of one action in a given state. If the agent also knows the value of the opposite action, it can update two Q-values at the same time, one for the action and one for its opposite, without actually taking the opposite action and making an explicit transition to the opposite state. This accelerates the learning process in general and the exploration phase in particular. Several algorithms are outlined in this work. OQ(lambda) is introduced to accelerate the Q(lambda) algorithm in discrete state spaces.
The NOQ(lambda) method is an extension of OQ(lambda) that operates in a broader range of non-deterministic environments. The update of the opposition trace in OQ(lambda) depends on the next state of the opposite action (which is generally not taken by the agent). This limits the usability of the technique to deterministic environments, because the next state must be known to the agent. NOQ(lambda) is presented to update the opposition trace without requiring knowledge of the next state for the opposite action. The results show that the proposed algorithms improve running time compared to the standard Q(lambda) technique.
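The double-update idea described above can be illustrated with a minimal sketch. This is not the OQ(lambda) or NOQ(lambda) algorithm from this work: it omits eligibility traces entirely and uses plain one-step Q-learning. The toy corridor task, the definition of the opposite action as the reverse move, the use of a known deterministic transition to evaluate the opposite action, and all names and parameter values (`N_STATES`, `opposite`, `step`, `train`, `ALPHA`, `GAMMA`, `EPSILON`) are assumptions made purely for illustration.

```python
import random

N_STATES = 6          # corridor cells 0..5; cell 5 is the goal (assumed toy task)
ACTIONS = (-1, +1)    # move left / move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1   # illustrative values, not from the source


def opposite(action):
    """Opposite action; assumed here to be simply the reverse move."""
    return -action


def step(state, action):
    """Deterministic transition and reward of the toy corridor."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def q_update(q, state, action, reward, next_state):
    """Standard one-step Q-learning update for a single state-action pair."""
    terminal = next_state == N_STATES - 1
    target = reward if terminal else reward + GAMMA * max(q[next_state].values())
    q[state][action] += ALPHA * (target - q[state][action])


def train(episodes=200, use_opposition=True, seed=0):
    """Run epsilon-greedy Q-learning; optionally update the opposite action too."""
    rng = random.Random(seed)
    q = [{a: 0.0 for a in ACTIONS} for _ in range(N_STATES)]
    updates = 0
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[state].values())
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward = step(state, action)
            q_update(q, state, action, reward, next_state)
            updates += 1
            if use_opposition:
                # Second update for the action that was NOT taken. Its outcome
                # is computed from the known deterministic model (assumption),
                # so the agent never actually moves to the opposite state.
                opp = opposite(action)
                opp_next, opp_reward = step(state, opp)
                q_update(q, state, opp, opp_reward, opp_next)
                updates += 1
            state = next_state
    return q, updates
```

Calling `train(use_opposition=True)` performs two Q-value updates per environment step, whereas `train(use_opposition=False)` performs one. Because the sketch evaluates the opposite action through a known deterministic transition, it shares the restriction to deterministic environments that the text attributes to OQ(lambda).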