
Reinforcement Learning

Reinforcement learning is learning what to do, mapping SITUATIONS to ACTIONS, so as to maximize a numerical reward signal. A situation is the sum of the environment and the actors within it.

Problems and solution methods are DISTINCT.

Reinforcement learning is not supervised learning. Supervised learning is learning from a training set of labeled examples provided by an outside supervisor. It is also not unsupervised learning, where the goal is to extract hidden structure in the data. In reinforcement learning, you may uncover hidden structure along the way, but that is not the primary objective. In this sense, there are three paradigms: supervised learning, unsupervised learning, and reinforcement learning.

A key tension is the trade-off between exploration and exploitation: does the agent exploit actions it already knows to be rewarding, or explore new actions that may turn out to be better? This explore-versus-exploit dilemma does not arise in supervised or unsupervised learning in their pure forms.

A key feature of reinforcement learning is that it considers the whole problem of a goal-directed agent interacting with an uncertain environment. Other machine learning approaches can train a new predictive ability, but how does that ability serve the overall goal? Additional planning is then needed to make it useful. This “limitation” of working only on isolated subproblems is the issue reinforcement learning avoids.

All reinforcement learning agents have explicit goals, can sense aspects of their environment, and can choose actions to influence their environment.

The core cycle is the ongoing interaction between an active decision-making agent and its environment, in which the agent pursues a goal despite the uncertainty of the environment.

The core elements of a reinforcement learning system are a policy, a reward signal, a value function, and, optionally, a model of the environment.

The interplay between value and reward is that reward is given immediately by the environment, while value (a judgment about long-term desirability) is always an estimate that we must re-estimate again and again.

When the agent has a model of the environment, we call that RL with planning, as opposed to pure trial-and-error learning.

Genetic algorithms, genetic programming, simulated annealing, and other optimization methods never estimate value functions. Instead, they evaluate multiple static policies, each interacting with the environment over a period of time. These are evolutionary methods.

A hill-climbing method => generate a policy, evaluate it, and keep incremental improvements, as sketched below.
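As a rough illustration, here is a minimal hill-climbing sketch in Python. The `evaluate` and `perturb` callables are hypothetical placeholders for however a policy is scored (e.g., win rate over many games) and mutated; nothing here is tied to a specific library.

```python
def hill_climb(initial_policy, evaluate, perturb, iterations=1000):
    """Generate-and-evaluate hill climbing over policies (a sketch).

    `evaluate(policy)` is assumed to return a score such as a win rate,
    and `perturb(policy)` to return a slightly modified copy of the policy.
    Both are hypothetical placeholders, not a specific API.
    """
    best_policy = initial_policy
    best_score = evaluate(best_policy)
    for _ in range(iterations):
        candidate = perturb(best_policy)   # small incremental change
        score = evaluate(candidate)        # e.g. play many games and measure wins
        if score > best_score:             # keep the candidate only if it improves
            best_policy, best_score = candidate, score
    return best_policy, best_score
```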

#Tic-Tac-Toe Problem

An evolutionary method directly searches the space of possible policies. Each candidate policy is played against the opponent for many games, and its win probability is estimated from the outcomes.

Map every state of the game to a number (creating a lookup table from each configuration of the board to an id). The id then maps to the agent’s estimated probability of winning from that state. That estimate is the state’s value, and the entire table is the learned value function. Suppose you are the X player: states with three X’s in a row are initialized to 1 (a certain win), states with three O’s in a row or a full board are initialized to 0 (a loss or draw), and all other states are initialized to 0.5. A greedy move selects the successor state with the highest value; an exploratory move selects some other move instead.
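A minimal sketch of this lookup table and move selection, assuming hypothetical helpers `is_x_win`, `is_o_win_or_draw`, `possible_moves`, and `result` that encode the tic-tac-toe rules (they are not defined here):

```python
import random

# values maps a board-state id to the estimated probability of winning
# from that state when playing X; this dict is the learned value function.
values = {}

def state_value(state):
    """Look up (and lazily initialize) the value of a state."""
    if state not in values:
        if is_x_win(state):               # hypothetical terminal-state check
            values[state] = 1.0           # three X's in a row: certain win
        elif is_o_win_or_draw(state):     # hypothetical terminal-state check
            values[state] = 0.0           # loss or draw
        else:
            values[state] = 0.5           # unknown: start at 50%
    return values[state]

def choose_move(state, epsilon=0.1):
    """Pick a move: mostly greedy, occasionally exploratory."""
    moves = possible_moves(state)         # hypothetical rules helper
    if random.random() < epsilon:
        return random.choice(moves)       # exploratory move
    # Greedy move: the move whose resulting state has the highest value.
    return max(moves, key=lambda m: state_value(result(state, m)))
```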

The value update function is given:

$$V(S_{t}) \leftarrow V(S_{t}) + \alpha\left[V(S_{t+1}) - V(S_{t})\right]$$

The value of the earlier state is updated to move it closer to the value of the later state: after a move, the original state’s value becomes the original value plus alpha times the difference between the new state’s value and the original value. Alpha is the step-size parameter, which sets the rate of learning, and the rule is a temporal-difference learning method. If alpha is reduced appropriately over time, this update converges to the true win probabilities against a fixed opponent, yielding an optimal policy. If alpha is not reduced all the way to zero, the agent keeps adapting and still performs reasonably against opponents that slowly change their approach.
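In code, this temporal-difference backup is a single line. The sketch below assumes the `values` dictionary from the earlier example and that both states have already been initialized:

```python
def td_update(values, state, next_state, alpha=0.1):
    """Move V(S_t) a fraction alpha toward V(S_{t+1})."""
    values[state] += alpha * (values[next_state] - values[state])
```

After each greedy move, calling `td_update(values, previous_state, new_state)` nudges the earlier state’s estimate toward the later one.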

Evolutionary methods hold the policy fixed and play many games. The result is an unbiased estimate of the policy’s win probability. What happens within a game is irrelevant: credit is assigned only to the final outcome, ignoring the interplay of moves in the middle.

One key feature is that the reinforcement learning agent will learn to set up multi-move traps that lure in shortsighted opponents. Effectively, it plans and looks ahead without a model of the opponent.

#Final Notes

Reinforcement learning uses the formal framework of Markov decision processes to define the interaction between a learning agent and its environment. This framework must allow for a sense of cause and effect, a sense of uncertainty and nondeterminism, and the existence of explicit goals. Values and value functions are vital.
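As a rough sketch of the interaction the MDP framework formalizes, here is a generic agent-environment loop. The `env` and `agent` interfaces (`reset`, `step`, `act`, `learn`) are hypothetical and not tied to any particular library:

```python
def run_episode(env, agent):
    """One episode of agent-environment interaction (a sketch)."""
    state = env.reset()                       # initial state S_0
    done = False
    total_reward = 0.0
    while not done:
        action = agent.act(state)             # agent chooses A_t given S_t
        next_state, reward, done = env.step(action)    # environment returns S_{t+1}, R_{t+1}
        agent.learn(state, action, reward, next_state)  # update value estimates / policy
        total_reward += reward
        state = next_state
    return total_reward
```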

#History of RL

There are two main threads. One is learning by trial and error. The second is optimal control using value functions and dynamic programming. A third thread, temporal-difference methods, is a mix of the two.

“Optimal control” refers to the problem of designing a controller to minimize or maximize a measure of a dynamical system’s behavior over time. The class of methods for solving optimal control problems is dynamic programming, and the discrete stochastic version of the problem is the Markov decision process (MDP).

Here’s an example from drones: the state of a drone can be given by its position, velocity, orientation, and angular rates, stacked into a 12x1 column vector (three components each).
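For concreteness, here is what such a state vector might look like with NumPy; the numbers are illustrative placeholders, not real flight data:

```python
import numpy as np

# 12x1 drone state: position (x, y, z), velocity, orientation
# (roll, pitch, yaw), and angular rates about each body axis.
state = np.array([
    [0.0], [0.0], [1.5],   # position in meters
    [0.0], [0.0], [0.0],   # velocity in m/s
    [0.0], [0.0], [0.0],   # roll, pitch, yaw in radians
    [0.0], [0.0], [0.0],   # angular rates in rad/s
])
assert state.shape == (12, 1)   # a 12x1 column vector
```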

Dynamic programming is widely considered the only feasible way of solving general stochastic optimal control problems, but it suffers from the curse of dimensionality. Dynamic programming has also been extended to partially observable MDPs.

