
Reinforcement learning

Published: at 12:00 AM

Written by __

Reinforcement Learning

Optimal Control

Optimal control aims to minimize or maximize a measure of a dynamical system’s behavior over time.

You define a value function (also called the “optimal return” function) and use it to set up a functional equation, the Bellman equation:

$$
V_t(s) = \max_{a} \left[ R(s,a) + V_{t+1}(f(s,a)) \right]
$$

Here $V_t(s)$ is the value of being in state $s$ at time $t$, $a$ is the action taken in state $s$, $R(s,a)$ is the reward for taking action $a$ in state $s$, and $f(s,a)$ is the next state resulting from $s$ and $a$.

Reading right to left: we select the action that maximizes the sum of the immediate reward and the value of the next state that the action leads to. The value of the next state is itself defined by the same equation, so evaluating it recursively follows the path of states that comes after.
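The recursion can be evaluated backwards from the final time step. The sketch below is one minimal way to do that, assuming a finite horizon and a terminal value of zero (neither is stated above). The toy problem, and the names `backward_induction`, `step`, and `reward`, are hypothetical and only serve as illustration.

```python
def backward_induction(states, actions, reward, step, horizon):
    """Compute V_0 and a time-indexed greedy policy for
    V_t(s) = max_a [R(s, a) + V_{t+1}(f(s, a))]."""
    # Assumed terminal condition: nothing is earned after the final step.
    V_next = {s: 0.0 for s in states}
    policy = [None] * horizon
    for t in reversed(range(horizon)):
        V_t, pi_t = {}, {}
        for s in states:
            # Score each action: immediate reward plus value of the next state.
            scores = {a: reward(s, a) + V_next[step(s, a)] for a in actions}
            best = max(scores, key=scores.get)
            V_t[s], pi_t[s] = scores[best], best
        policy[t] = pi_t
        V_next = V_t
    return V_next, policy


# Hypothetical toy problem: walk along cells 0..3; landing on cell 3 pays 1.
STATES = [0, 1, 2, 3]
ACTIONS = [-1, +1]


def step(s, a):
    """Deterministic transition f(s, a), clamped to the ends of the line."""
    return min(max(s + a, 0), 3)


def reward(s, a):
    """R(s, a): 1 if the move lands on cell 3, otherwise 0."""
    return 1.0 if step(s, a) == 3 else 0.0


V0, policy = backward_induction(STATES, ACTIONS, reward, step, horizon=3)
print(V0)         # value of each start state with three steps remaining
print(policy[0])  # best first action from each state
```

With three steps available, a start in cell 0 can collect the reward only once (on the last step), while a start in cell 2 or 3 can collect it on every step, which is what the computed values reflect.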

The stochastic form (MDPs):

$$
V(s) = \max_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V(s') \right]
$$

Here $\gamma$ is the discount factor, between 0 and 1; if it is strictly below 1, the discounted sum of bounded rewards is guaranteed to converge over time. $P(s' \mid s, a)$ is the probability of transitioning to $s'$ from $s$ with action $a$.

The value of the present state is then obtained by selecting the action that maximizes the sum of the immediate reward for that action and the discounted, probability-weighted average of the values of all possible next states.
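One common way to solve this equation numerically is value iteration: start from an arbitrary guess for $V$ and apply the right-hand side repeatedly until the values stop changing. The sketch below assumes a small tabular MDP stored in plain dictionaries; the two-state example, the dictionary layout, and the name `value_iteration` are illustrative assumptions, not part of the original notes.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-10, max_iters=10_000):
    """Iterate V(s) <- max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')].

    P[s][a] is a list of (probability, next_state) pairs (assumed layout).
    R[s][a] is the immediate reward for taking action a in state s.
    """
    V = {s: 0.0 for s in P}  # arbitrary starting guess
    for _ in range(max_iters):
        new_V = {
            s: max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in P[s]
            )
            for s in P
        }
        # Largest change across states; stop once it is negligible.
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < tol:
            break
    return V


# Hypothetical two-state MDP: state 1 pays 1 per step for staying put;
# from state 0, "go" reaches state 1 with probability 0.8.
P = {
    0: {"stay": [(1.0, 0)], "go": [(0.8, 1), (0.2, 0)]},
    1: {"stay": [(1.0, 1)], "go": [(1.0, 0)]},
}
R = {
    0: {"stay": 0.0, "go": 0.0},
    1: {"stay": 1.0, "go": 0.0},
}

V = value_iteration(P, R, gamma=0.9)
print(V)
```

For this toy problem the fixed point can be checked by hand: staying in state 1 forever gives $V(1) = 1 / (1 - 0.9) = 10$, and solving $V(0) = 0.9\,(0.8 \cdot 10 + 0.2\,V(0))$ gives $V(0) = 7.2 / 0.82$.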

The same relationship written for a state-action pair, the Q-function:

$$
Q(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q(s', a')
$$

Here $R(s,a)$ is the reward for taking action $a$ in state $s$, $\gamma$ is the discount factor, and $\sum_{s'} P(s' \mid s, a) \max_{a'} Q(s', a')$ is the probability-weighted average, over the possible next states $s'$, of the value of the best action $a'$ available in each of them.
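The Q form can be iterated in the same fixed-point fashion, and it has the practical advantage that a greedy policy falls out directly as the best action per state. The following is a minimal sketch under the same assumptions as any tabular example: a hypothetical two-state MDP, a dictionary layout chosen for illustration, and a made-up function name `q_value_iteration`.

```python
def q_value_iteration(P, R, gamma=0.9, tol=1e-10, max_iters=10_000):
    """Iterate Q(s,a) <- R(s,a) + gamma * sum_s' P(s'|s,a) max_a' Q(s',a').

    P[s][a] is a list of (probability, next_state) pairs (assumed layout).
    R[s][a] is the immediate reward for taking action a in state s.
    """
    Q = {s: {a: 0.0 for a in P[s]} for s in P}  # arbitrary starting guess
    for _ in range(max_iters):
        new_Q = {
            s: {
                a: R[s][a]
                + gamma * sum(p * max(Q[s2].values()) for p, s2 in P[s][a])
                for a in P[s]
            }
            for s in P
        }
        # Largest change across all state-action pairs.
        delta = max(abs(new_Q[s][a] - Q[s][a]) for s in P for a in P[s])
        Q = new_Q
        if delta < tol:
            break
    return Q


# Hypothetical two-state MDP: state 1 pays 1 per step for staying put;
# from state 0, "go" reaches state 1 with probability 0.8.
P = {
    0: {"stay": [(1.0, 0)], "go": [(0.8, 1), (0.2, 0)]},
    1: {"stay": [(1.0, 1)], "go": [(1.0, 0)]},
}
R = {
    0: {"stay": 0.0, "go": 0.0},
    1: {"stay": 1.0, "go": 0.0},
}

Q = q_value_iteration(P, R, gamma=0.9)
# The state value is the best Q-value in that state: V(s) = max_a Q(s, a).
V_from_Q = {s: max(Q[s].values()) for s in Q}
# The greedy policy picks the action with the highest Q-value in each state.
greedy_policy = {s: max(Q[s], key=Q[s].get) for s in Q}
print(greedy_policy)
```

Taking the maximum over actions recovers the state value, so `V_from_Q` satisfies the stochastic Bellman equation for $V(s)$ on this toy problem.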
