Reinforcement Learning
Optimal Control
The goal: minimize or maximize a measure of a dynamical system's behavior over time.
You define a value function, an "optimal return" function, which gives a functional equation: the Bellman equation.
$$V_t(s) = \max_a \left[ R(s,a) + V_{t+1}(f(s,a)) \right]$$

- $V_t(s)$: value of being in state $s$ at time $t$
- $a$: action taken in state $s$
- $R(s,a)$: reward for taking action $a$ in state $s$
- $f(s,a)$: next state resulting from $(s,a)$
Reading right to left: we select the action that maximizes the sum of the immediate reward and the value of the next state reached by that action.
The value of the next state is itself defined recursively, so the equation unrolls along the whole path that follows.
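A minimal sketch of computing this recursion by backward induction, assuming a toy deterministic system; the states, actions, `R`, `f`, and horizon `T` below are hypothetical placeholders, not anything from the notes above:

```python
# Backward induction for the finite-horizon Bellman equation:
#   V_t(s) = max_a [ R(s, a) + V_{t+1}(f(s, a)) ]
# All names here (states, actions, R, f, T) are hypothetical toy placeholders.

states = [0, 1, 2]
actions = [-1, +1]
T = 5  # time horizon

def R(s, a):
    # reward for landing on state 2
    return 1.0 if min(max(s + a, 0), 2) == 2 else 0.0

def f(s, a):
    # deterministic next state, clipped to the state bounds
    return min(max(s + a, 0), 2)

# V[t][s]: value of being in state s at time t; terminal values V[T] are zero
V = [[0.0] * len(states) for _ in range(T + 1)]
for t in reversed(range(T)):
    for s in states:
        V[t][s] = max(R(s, a) + V[t + 1][f(s, a)] for a in actions)

print(V[0])  # value of each starting state over the full horizon
```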
The stochastic form (MDPs):
$$V(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \, V(s') \right]$$

- $\gamma$: discount factor between 0 and 1; if strictly less than 1, it guarantees convergence over time
- $P(s' \mid s,a)$: probability of transitioning to $s'$ from $s$ with action $a$
The value of the present state, then, comes from selecting the action that maximizes the sum of that action's reward and the probability-weighted average of the values of all possible next states, discounted by $\gamma$.
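A minimal value-iteration sketch of this update, assuming a small random MDP; the reward table `R`, transition tensor `P`, and sizes below are hypothetical placeholders:

```python
import numpy as np

# Value iteration for the stochastic Bellman equation:
#   V(s) = max_a [ R(s, a) + gamma * sum_{s'} P(s'|s, a) * V(s') ]
# The MDP below (R, P) is a hypothetical toy placeholder.

n_states, n_actions = 3, 2
gamma = 0.9  # gamma < 1 guarantees the iteration converges

R = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0]])  # R[s, a]
# P[s, a, s']: each (s, a) pair gets a probability distribution over next states
P = np.random.default_rng(0).dirichlet(np.ones(n_states), size=(n_states, n_actions))

V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * P @ V   # Q[s, a] = R[s, a] + gamma * E[V(s')]
    V_new = Q.max(axis=1)   # act greedily over actions
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print(V)
```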
The same recursion can be written over state–action pairs, giving the Q-function:

$$Q(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} Q(s',a')$$

- $R(s,a)$: the reward of taking action $a$ in the given state $s$
- $\gamma$: the discount factor
- $\sum_{s'} P(s' \mid s,a) \max_{a'} Q(s',a')$: the probability-weighted sum, over next states, of the value of the best action selected there
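And a matching Q-value-iteration sketch under the same hypothetical `R`, `P`, and `gamma` as above; the greedy policy falls out of the converged Q-values:

```python
import numpy as np

# Q-value iteration:
#   Q(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s, a) * max_{a'} Q(s', a')
# Reuses the same hypothetical R, P, gamma as the sketch above.

n_states, n_actions = 3, 2
gamma = 0.9
R = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
P = np.random.default_rng(0).dirichlet(np.ones(n_states), size=(n_states, n_actions))

Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    Q_new = R + gamma * P @ Q.max(axis=1)  # back up the best next action's value
    if np.max(np.abs(Q_new - Q)) < 1e-8:
        break
    Q = Q_new

policy = Q.argmax(axis=1)  # greedy policy read off the converged Q-values
print(Q)
print(policy)
```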