T&C LAB-AI
Robotics
Reinforcement Learning
Lecture 11
Jeong-Yean Yang
2020/12/10
5. Examples
State Value V(s) in 2-D Space
• l11mc.py and l11td.py
• Grid world (w=10, h=10)

Random movement: N, E, S, W (N: y+1, S: y-1, E: x+1, W: x-1)
Initial position: (1,1); goal position: (10,10)
Reward at the goal: r = +1; all other states: r = 0

[Figure: 10x10 grid with the agent at its initial position]
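The grid world above can be reproduced in a few lines of Python. This is a minimal sketch under the stated assumptions (10x10 grid, start (1,1), goal (10,10), random N/E/S/W moves, reward +1 only at the goal); the function and constant names are illustrative, not the ones used in l11mc.py or l11td.py.

```python
import random

W, H = 10, 10                      # grid size (w=10, h=10)
START, GOAL = (1, 1), (10, 10)     # initial and goal positions

# the four moves: N (y+1), S (y-1), E (x+1), W (x-1)
ACTIONS = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def step(state, action):
    """Apply one move, clip to the grid, and return (next_state, reward)."""
    x, y = state
    dx, dy = ACTIONS[action]
    nx = min(max(x + dx, 1), W)    # stay inside 1..10
    ny = min(max(y + dy, 1), H)
    reward = 1.0 if (nx, ny) == GOAL else 0.0
    return (nx, ny), reward

def random_episode():
    """Run one episode with a purely random policy; return its history."""
    s, history = START, []
    while s != GOAL:
        a = random.choice(list(ACTIONS))
        s2, r = step(s, a)
        history.append((s, r))     # visited state and the reward received on leaving it
        s = s2
    return history
```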
State Value: V(s)
• Optimal path: Follows the Maximum State Value, V(s)
[Figure: state value V(s) over the 10x10 grid]
MC-based 2D Problem: l11mc.py
• S0 = initial position
– (1,1) is the starting position
• V(s) is 2-dimensional
– V(1,1), V(1,2), …, V(1,10)
– V(2,1), V(2,2), …, V(2,10)
– …
– V(10,1), V(10,2), …, V(10,10)
• The history h is also 2-dimensional.
[Figure: the four grid corners (1,1), (10,1), (1,10), (10,10); the agent starts at the initial corner (1,1) and finishes at the terminal corner (10,10). The history h is a sequence of 2-D states, e.g. h = (1,1), (1,2), (2,2), …, (10,10).]
Monte Carlo Update
• MC visits past states
– The agent may visit a state s more than once.
– As the number of visits increases, V(s) is updated by the return R many times.
– With many visits, the return value is smoothed out, like blurring color with a finger (see the sketch below).

[Figure: history and return; see p. 22 of Lecture 10]
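As a concrete illustration of this update, here is a minimal every-visit Monte Carlo sketch: after an episode, it walks the history backwards, accumulates the return R, and nudges V(s) toward R for every visited state. The array shape and the constants alpha and gamma are assumptions for illustration, not the exact code of l11mc.py.

```python
import numpy as np

W, H = 10, 10
V = np.zeros((W + 1, H + 1))       # V[x, y]; index 0 unused so states run 1..10
alpha, gamma = 0.01, 0.99

def mc_update(V, history):
    """Every-visit MC: history is a list of (state, reward) pairs from one episode."""
    R = 0.0
    for (x, y), r in reversed(history):
        R = r + gamma * R          # return seen from this state onward
        # repeated visits blend V(s) toward R again and again ("blurring")
        V[x, y] = (1 - alpha) * V[x, y] + alpha * R
    return V
```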
Monte Carlo Method: State Value (l11mc.py)

[Figure: V(s) after 1 episode, after 50 episodes, and after 200 episodes]
Temporal Difference in the 2-D Problem: l11td.py

Value update by TD:

$V(s) \leftarrow (1-\alpha)\,V(s) + \alpha\left[\,r(s) + \gamma V(s')\,\right]$

In the code, s0 is the old state and s is the new state reached by action a, i.e. s = s0 + a.
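A minimal sketch of this TD update applied online during one episode, using the slide's notation (s0 is the previous state, s the new one). The random policy and the clipping helper are assumptions; l11td.py may differ in detail.

```python
import random
import numpy as np

W, H = 10, 10
GOAL = (10, 10)
V = np.zeros((W + 1, H + 1))                 # V[x, y], states indexed 1..10
alpha, gamma = 0.01, 0.99
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]   # N, S, E, W

def td_episode(V):
    """One episode of TD(0): update V(s0) immediately after each transition."""
    s0 = (1, 1)
    while s0 != GOAL:
        dx, dy = random.choice(MOVES)
        s = (min(max(s0[0] + dx, 1), W), min(max(s0[1] + dy, 1), H))
        r = 1.0 if s == GOAL else 0.0
        # V(s0) <- (1 - alpha) V(s0) + alpha [ r + gamma V(s) ]
        V[s0] = (1 - alpha) * V[s0] + alpha * (r + gamma * V[s])
        s0 = s
    return V
```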
TD Learning in 2D: l11td.py
• Episodes: 100
• Alpha = 0.01
• Gamma = 0.99

[Figure: V(s) after 1, 10, and 100 episodes]
Why Is TD So Much Faster than MC?
• In many cases, TD is faster than MC.
• But this is NOT clearly proven.
• We do know that MC is sensitive to the alpha value.
• See the next example.

[Figure: MC results with alpha = 0.01 and alpha = 0.001]
Why Does TD Produce Such a Smooth V(s)?
• TD is a function of (s, s0).
• MC is a function of (R, s).
• Which one is better?
– TD looks better: many local maxima are bad for hill climbing.
– However, the blurring in TD sometimes makes V(s) biased.

[Figure: TD vs. MC value surfaces, with the TD update $V(s) \leftarrow (1-\alpha)V(s) + \alpha[\,r(s) + \gamma V(s')\,]$; is that peak really the highest mountain?]
6. Q-Learning
State Value Knows Where to Go, but Not Which Action to Take
• State value V(s): the expected return of an observed state.
• Given sense (s) and action (a), how do we choose the best action?
• The state value is only an INDIRECT guide (legend).
• We want a DIRECT guide, that is, the BEST action.

[Figure: transition s to s'; which action (?) causes it?]
People Believe They Know Their State, but It Is NOT True
State Example (I): You Cannot Observe Everything!
• State s = your consciousness

[Figure: state s = (Hunger, Money, Health, Grade, …) becomes s' after the action "Presentation"; some components belong to the observed state, the rest remain in the unobserved state.]
State Example (II): It Is NOT Your Turn (Environment Dynamics)
• Think of Tic-Tac-Toe: there are $3^9 = 19683$ possible board configurations.

[Figure: the agent in state s places an X (action a); the environment (the opponent player) then places an O, producing the next state s'.]
State Example (II), continued: It Is NOT Your Turn (Environment Dynamics)
• Think of Tic-Tac-Toe again.

[Figure: from the same state s and action a (placing an X), the environment (opponent) can place its O in any open cell, giving six possible next states s'.]
Environment Dynamics Makes My Prediction from S to S' Wrong!
• The agent wants to move from s = (0,0) to s' = (1,0). But what kind of action can do this? "+1" (a move to the right) is NOT the answer in a stochastic world.

[Figure: from state s, one action A may lead to several different next states s'. Action and next state are NOT directly associated; which action is OK? Uncertainties over V(0,0), V(1,0), V(2,0).]
Q(s,a) Space instead of State Value V(s)
• Q space: the state-and-action space (S-A space).
• In Q space, all possible actions are considered for a given state s.

[Figure: from state s, each action A1…A4 leads to its own next state s'.]
Q-Learning
• Instead of TD-based learning with the state value V(s),
• Q-learning uses the Q space, Q(s,a).
• Think of the expectation:

TD: $V(s) \leftarrow r(s) + \gamma V(s')$

Q-learning: $Q(s,a) \leftarrow r(s,a) + \gamma \max_{a'} Q(s',a')$

Update rule: $Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[\,r(s,a) + \gamma \max_{a'} Q(s',a')\,\right]$
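The update rule above translates almost line by line into code. A minimal sketch, assuming Q is a dictionary mapping each state to a list of four action values (the dictionary layout and the constants are illustrative, not the actual l11q1.py code):

```python
alpha, gamma = 0.01, 0.99
N_ACTIONS = 4

def q_update(Q, s, a, r, s_next):
    """One Q-learning step:
       Q(s,a) <- (1-alpha) Q(s,a) + alpha [ r + gamma * max_a' Q(s',a') ]."""
    Q.setdefault(s, [0.0] * N_ACTIONS)         # create entries on first visit
    Q.setdefault(s_next, [0.0] * N_ACTIONS)
    target = r + gamma * max(Q[s_next])
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
    return Q

# usage: Q = {}; q_update(Q, (1, 1), 2, 0.0, (2, 1))
```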
Update Rule: TD vs. Q-Learning
• Q-learning works in the state-and-action space.
• V(s') is not defined in the S-A space.
• Instead, the discounted maximum Q over the next state s' is used to update state s.

[Figure: TD backs up V(s) with the target r(s) + γ V(s'); Q-learning backs up Q(s,a) with the target r(s,a) + γ max_{a'} Q(s',a'), taking maxQ over the actions A'1…A'4 available in s'.]
Q-Learning's Two Stages
• 1. Exploration
– Exploration is based on the agent's experience.
– It builds an episodic memory.
– The agent explores the target space in a random way (see the sketch after this list).
– All returns and actions are stored in the Q values.
• 2. Exploitation
– Get the best actions.
– Using the episodic memory built during exploration, the agent finds the best (or optimal) action at every step.
• Question: What is the goal of exploitation?
– It is not just to reach the goal; the agent tries to collect as much reward as possible.
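A minimal sketch of the exploration stage in the grid world: the agent acts purely at random and stores what it experiences in the Q values via the update rule from the previous section. The transition logic, constants, and function name are assumptions for illustration, not the actual l11q1.py code.

```python
import random

W, H = 10, 10
GOAL = (10, 10)
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]   # action indices 0..3
alpha, gamma = 0.01, 0.99

def explore(Q, episodes=100):
    """Exploration: random actions; every transition updates Q(s,a)."""
    for _ in range(episodes):
        s = (1, 1)
        while s != GOAL:
            a = random.randrange(4)                      # purely random action
            dx, dy = MOVES[a]
            s2 = (min(max(s[0] + dx, 1), W), min(max(s[1] + dy, 1), H))
            r = 1.0 if s2 == GOAL else 0.0
            Q.setdefault(s, [0.0] * 4)
            Q.setdefault(s2, [0.0] * 4)
            Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * max(Q[s2]))
            s = s2
    return Q
```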
Grid World Test: test4 or ex/ml/l11q1
• Plot MaxQ in each state.
• It is faster than TD.
• Q(s,a) indicates which way the agent should go!
• Find the best way (find the max Q):
Q(s,Left) = 0.1, Q(s,Right) = 0.3, Q(s,Up) = 0.8, Q(s,Down) = 0.01
so the action is Up!

[Figure: Max Q surface over the grid (x, y)]
Labyrinth Test
• Test5
• Map data: 0 for empty, 1 for wall (see the sketch below)

[Figure: labyrinth map with walls; exploration result and exploitation path]
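A sketch of how such a map could be encoded; the array below is only an illustrative layout (the actual test5 map is not reproduced here), with 0 for empty cells and 1 for walls.

```python
import numpy as np

# 0 = empty, 1 = wall (illustrative layout, not the real test5 map)
MAP = np.array([
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
])

def is_blocked(x, y):
    """A move into a wall cell (or off the map) is rejected."""
    h, w = MAP.shape
    return not (0 <= x < w and 0 <= y < h) or MAP[y, x] == 1
```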
7. HW: Q-Learning
Q-Learning: l11q1.py
• Q-learning has two modes.
• 1. Exploration: random search to update the Q values.
• 2. Exploitation: follow the maximum Q value.
– The agent follows the maximum Q value.
– argmax_a Q(s,a) = a*, the best policy (action).

$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[\,r(s,a) + \gamma \max_{a'} Q(s',a')\,\right]$
Q-Learning with a Q-value Class
• Each Q value also stores its actions (a minimal sketch follows below):

s = (x, y)
Q(s, a) = [0, 0, 0, 0]  (if the number of actions is 4)
v = max_a Q(s, a)
a* = argmax_a Q(s, a),  a* ∈ {0, …, 3}
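A minimal sketch of such a Q-value container: each state (x, y) owns a list of four action values, with helpers for the max and the argmax. The class name and methods are assumptions for illustration, not the actual class in l11q1.py.

```python
class QValue:
    """Q values for one state s = (x, y), one slot per action (4 actions)."""
    def __init__(self, n_actions=4):
        self.v = [0.0] * n_actions          # Q(s, a) for a = 0..3

    def max_q(self):
        """v = max_a Q(s, a)."""
        return max(self.v)

    def argmax_a(self):
        """a* = argmax_a Q(s, a), a* in {0, ..., 3}."""
        return self.v.index(max(self.v))

# Q table: one QValue per visited state
Q = {}
s = (1, 1)
Q.setdefault(s, QValue())
Q[s].v[2] = 0.3                             # update action 2 for state (1, 1)
print(Q[s].argmax_a())                      # -> 2
```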
Exploration

$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[\,r(s,a) + \gamma \max_{a'} Q(s',a')\,\right]$

with the states stored as coordinate pairs: (x, y) for the new position and (xo, yo) for the old one.
Result of l11q1.py
• Exploration with 100 episodes
• Draw the Qmax value
Exploitation with l11q1.py
• Start at s = si = (0,0)
• Repeat (a sketch of this loop follows below):
– Find a* = argmax_a Q(s,a)
– Do the best action a*
– Then we get s'
– If s' is terminal, stop
– s ← s'
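A direct sketch of this loop, assuming Q maps each state to a list of four action values and step() is a transition function taking (state, action index) and returning (next_state, reward); both are illustrative. A step cap guards against getting stuck in loops when Q has local maxima.

```python
def exploit(Q, step, start=(0, 0), goal=(9, 9), max_steps=1000):
    """Greedy rollout: always take a* = argmax_a Q(s, a) until the terminal state."""
    s, path = start, [start]
    for _ in range(max_steps):
        if s == goal:                        # terminal: stop
            break
        qs = Q.get(s, [0.0, 0.0, 0.0, 0.0])
        a_star = qs.index(max(qs))           # a* = argmax_a Q(s, a)
        s, _ = step(s, a_star)               # do the best action, observe s'
        path.append(s)                       # s <- s'
    return path
```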
Complete Your Q-Learning
• With the given l11q1.py, the exploration result looks like this:
– Exploit mode stops at s = (1,3).

[Figure: exploit mode stops at (1,3)]
From l11q1.py, Answer the Questions
• Prob. 1. Why does the agent stop at s = (1,3)?
– Hint: look at the Qmax picture; you will see local maxima.
• Prob. 2. Complete your Q-learning.
– Exploitation MUST stop at s = (9,9).
– Which code should be changed in 'l11q1.py'?
– Hint: if you understand Q-learning, it is not so hard.
Prob. 3. Add Noise (l11q2.py)
• With probability 70%, the action works correctly.
• Otherwise, the action is corrupted.
• Prob. 3.1: Complete your Q-learning.
• Prob. 3.2: What happens to the Qmax graph?
• Prob. 3.3: What happens in Exploit mode?
• Prob. 3.4: If we increase the corruption percentage beyond the current setting, what happens?
• Prob. 3.5: Explain why RL works well in this hard, noisy environment.

[Figure: right action vs. corrupted action]
Prob. 4. Add Noise on Exploration and Exploitation (l11q4.py)
• Prob. 4.1: Complete your learning, and explain the exploitation results.
• Prob. 4.2: If we increase the noise, what happens?

Note: noise on exploitation corrupts the best (optimal) action.
Prob. 5. With l11q5.py: If the Agent Does Not Stop at the Terminal, What Happens to the Qmax Graph?
• 1. Add noise to the actions.
• 2. The agent does NOT stop at the terminal.
• 3. After 500,000 actions, STOP the episode.
• What happens?
• What is the difference from the result of Prob. 1?
– Hint: see the maximum Qmax value.
• Why is the maximum Qmax value so different?
8. Tic-Tac-Toe
Tic-Tac-Toe
• How many states are there in Tic-Tac-Toe?
• The number of end-game boards is 958.
• The first player wins 626 of them.

First offence: 626/958; second offence: 332/958.
Tic-Tac-Toe in Q-Learning
• State
– s = [0,0,0,0,0,0,0,0,0]
• The RL agent plays 'o' = 2, the human plays 'x' = 1, and a blank is 0.
• Q(s,a)
– The possible actions are also 0~8.
• Example (see the sketch below):
– 1. s = [0,0,0,0,0,0,0,0,0]
– 2. RL takes action = 4.
– 3. Then s* = [0,0,0,0,2,0,0,0,0].
– 4. The human takes action = 0 (the environment changes).
– 5. Finally, s' = [1,0,0,0,2,0,0,0,0].

Board cell indices:
0 1 2
3 4 5
6 7 8
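The sequence above can be written as a small sketch: the state is a 9-element list, the agent's mark is 2 ('o'), the human's is 1 ('x'). The helper name place() is illustrative.

```python
AGENT, HUMAN, EMPTY = 2, 1, 0          # 'o' = 2, 'x' = 1, blank = 0

def place(state, cell, mark):
    """Return a new board with `mark` placed at `cell` (0..8)."""
    new_state = list(state)
    new_state[cell] = mark
    return new_state

s = [0] * 9                            # 1. empty board
s_star = place(s, 4, AGENT)            # 2-3. RL takes action 4
s_next = place(s_star, 0, HUMAN)       # 4-5. human takes action 0
print(s_star)                          # [0, 0, 0, 0, 2, 0, 0, 0, 0]
print(s_next)                          # [1, 0, 0, 0, 2, 0, 0, 0, 0]
```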
How Do We Determine the Reward?
• If RL (o) wins a game, it obtains reward r = 1.
• If RL (o) loses a game, it obtains reward r = -1.
• Otherwise, r = 0.
• How does this work?
– The agent only attempts to WIN a game,
– with no defense.
• So instead: if RL wins a game, r = 1; if RL loses a game, r = -10 (see the sketch below).
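A minimal sketch of the reward assignment described above; the win check uses the standard eight tic-tac-toe lines, and the -10 losing penalty is the second variant from the slide. The function names are illustrative.

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),      # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),      # columns
         (0, 4, 8), (2, 4, 6)]                 # diagonals

def has_won(state, mark):
    """True if `mark` occupies any complete line."""
    return any(all(state[i] == mark for i in line) for line in LINES)

def reward(state, agent=2, human=1):
    """+1 if the RL agent (o) wins, -10 if it loses, 0 otherwise."""
    if has_won(state, agent):
        return 1.0
    if has_won(state, human):
        return -10.0
    return 0.0
```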
How Do We Determine the Q Space?
• The Q space is very complex and high-dimensional.
• At every turn, entries are added to the Q space (see the sketch below):
– Check whether the same Q entry already exists.
• If so, update that Q.
– Otherwise,
• create a new Q entry.
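A minimal sketch of this bookkeeping: the Q space is a dictionary keyed by the board state (as a tuple), an entry is created the first time a state is seen and updated otherwise. A defaultdict makes the "check, else create" step implicit; the structure and the alpha/gamma values are assumptions about l11ttt, not its exact code.

```python
from collections import defaultdict

N_ACTIONS = 9                                   # one action per board cell

# Q[s] is created automatically the first time state s (a 9-tuple) is seen
Q = defaultdict(lambda: [0.0] * N_ACTIONS)

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Look up (or lazily create) Q[s] and Q[s_next], then apply the Q-learning update."""
    s, s_next = tuple(s), tuple(s_next)         # lists are not hashable; tuples are
    target = r + gamma * max(Q[s_next])
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
```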
See the Example: ex/ml/l11ttt
• The learned Q space contains 8618 entries.
• Learning is done by explore().
• At each step, you can check which action is the best.

The state is the flattened board:
$s = [\,s_{11}, s_{12}, s_{13}, s_{21}, s_{22}, s_{23}, s_{31}, s_{32}, s_{33}\,]$, where $s_{ij} = 0$ for empty, $1$ for X, and $2$ for O.