T&C LAB-AI
Robotics
Reinforcement Learning
Lecture 11
Jeong-Yean Yang
2020/12/10
5. Examples
State Value V(s) in 2-D Space
• l11mc.py and l11td.py
• Grid world (w=10, h=10)

Random movement: N, E, S, W (N: y+1, S: y-1, E: x+1, W: x-1)
Initial position: (1,1); goal position: (10,10)
Reward at the goal: r = +1; all other states: r = 0

[Figure: 10x10 grid with the agent at its initial position]
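The grid world above can be reproduced in a few lines of Python. This is a minimal sketch under the stated assumptions (10x10 grid, start (1,1), goal (10,10), random N/E/S/W moves, reward +1 only at the goal); the function and constant names are illustrative, not the ones used in l11mc.py or l11td.py.

```python
import random

W, H = 10, 10                      # grid size (w=10, h=10)
START, GOAL = (1, 1), (10, 10)     # initial and goal positions

# the four moves: N (y+1), S (y-1), E (x+1), W (x-1)
ACTIONS = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def step(state, action):
    """Apply one move, clip to the grid, and return (next_state, reward)."""
    x, y = state
    dx, dy = ACTIONS[action]
    nx = min(max(x + dx, 1), W)    # stay inside 1..10
    ny = min(max(y + dy, 1), H)
    reward = 1.0 if (nx, ny) == GOAL else 0.0
    return (nx, ny), reward

def random_episode():
    """Run one episode with a purely random policy; return its history."""
    s, history = START, []
    while s != GOAL:
        a = random.choice(list(ACTIONS))
        s2, r = step(s, a)
        history.append((s, r))     # visited state and the reward received on leaving it
        s = s2
    return history
```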
State Value: V(s)
• Optimal path: Follows the Maximum State Value, V(s)
[Figure: state value V(s) over the 10x10 grid]
MC-based 2D Problem: l11mc.py
• S0 = initial position
– (1,1) is the starting position
• V(s) is 2-dimensional
– V(1,1), V(1,2), …, V(1,10)
– V(2,1), V(2,2), …, V(2,10)
– …
– V(10,1), V(10,2), …, V(10,10)
• The history h is also 2-dimensional.
[Figure: the four grid corners (1,1), (10,1), (1,10), (10,10); the agent starts at the initial corner (1,1) and finishes at the terminal corner (10,10). The history h is a sequence of 2-D states, e.g. h = (1,1), (1,2), (2,2), …, (10,10).]
Monte Carlo Update
• MC visits past states
– The agent may visit a state s more than once.
– As the number of visits increases, V(s) is updated by the return R many times.
– With many visits, the return value is smoothed out, like blurring color with a finger (see the sketch below).

[Figure: history and return; see p. 22 of Lecture 10]
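As a concrete illustration of this update, here is a minimal every-visit Monte Carlo sketch: after an episode, it walks the history backwards, accumulates the return R, and nudges V(s) toward R for every visited state. The array shape and the constants alpha and gamma are assumptions for illustration, not the exact code of l11mc.py.

```python
import numpy as np

W, H = 10, 10
V = np.zeros((W + 1, H + 1))       # V[x, y]; index 0 unused so states run 1..10
alpha, gamma = 0.01, 0.99

def mc_update(V, history):
    """Every-visit MC: history is a list of (state, reward) pairs from one episode."""
    R = 0.0
    for (x, y), r in reversed(history):
        R = r + gamma * R          # return seen from this state onward
        # repeated visits blend V(s) toward R again and again ("blurring")
        V[x, y] = (1 - alpha) * V[x, y] + alpha * R
    return V
```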
Monte Carlo Method: State Value (l11mc.py)

[Figure: V(s) after 1 episode, after 50 episodes, and after 200 episodes]
Temporal Difference in the 2-D Problem: l11td.py

Value update by TD:

$V(s) \leftarrow (1-\alpha)\,V(s) + \alpha\left[\,r(s) + \gamma V(s')\,\right]$

In the code, s0 is the old state and s is the new state reached by action a, i.e. s = s0 + a.
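A minimal sketch of this TD update applied online during one episode, using the slide's notation (s0 is the previous state, s the new one). The random policy and the clipping helper are assumptions; l11td.py may differ in detail.

```python
import random
import numpy as np

W, H = 10, 10
GOAL = (10, 10)
V = np.zeros((W + 1, H + 1))                 # V[x, y], states indexed 1..10
alpha, gamma = 0.01, 0.99
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]   # N, S, E, W

def td_episode(V):
    """One episode of TD(0): update V(s0) immediately after each transition."""
    s0 = (1, 1)
    while s0 != GOAL:
        dx, dy = random.choice(MOVES)
        s = (min(max(s0[0] + dx, 1), W), min(max(s0[1] + dy, 1), H))
        r = 1.0 if s == GOAL else 0.0
        # V(s0) <- (1 - alpha) V(s0) + alpha [ r + gamma V(s) ]
        V[s0] = (1 - alpha) * V[s0] + alpha * (r + gamma * V[s])
        s0 = s
    return V
```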
TD Learning in 2D: l11td.py
• Episodes: 100
• Alpha = 0.01
• Gamma = 0.99

[Figure: V(s) after 1, 10, and 100 episodes]
Why Is TD So Much Faster than MC?
• In many cases, TD is faster than MC.
• But this is NOT clearly proven.
• We do know that MC is sensitive to the alpha value.
• See the next example.

[Figure: MC results with alpha = 0.01 and alpha = 0.001]
Why Does TD Produce Such a Smooth V(s)?
• TD is a function of (s, s0).
• MC is a function of (R, s).
• Which one is better?
– TD looks better: many local maxima are bad for hill climbing.
– However, the blurring in TD sometimes makes V(s) biased.

[Figure: TD vs. MC value surfaces, with the TD update $V(s) \leftarrow (1-\alpha)V(s) + \alpha[\,r(s) + \gamma V(s')\,]$; is that peak really the highest mountain?]
6. Q-Learning
State Value Knows Where to Go, but Not Which Action to Take
• State value V(s): the expected return of an observed state.
• Given sense (s) and action (a), how do we choose the best action?
• The state value is only an INDIRECT guide (legend).
• We want a DIRECT guide, that is, the BEST action.

[Figure: transition s to s'; which action (?) causes it?]
People Believe They Know Their State, but It Is NOT True
State Example (I): You Cannot Observe Everything!
• State s = your consciousness

[Figure: state s = (Hunger, Money, Health, Grade, …) becomes s' after the action "Presentation"; some components belong to the observed state, the rest remain in the unobserved state.]
State Example (II): It Is NOT Your Turn (Environment Dynamics)
• Think of Tic-Tac-Toe: there are $3^9 = 19683$ possible board configurations.

[Figure: the agent in state s places an X (action a); the environment (the opponent player) then places an O, producing the next state s'.]
State Example (II), continued: It Is NOT Your Turn (Environment Dynamics)
• Think of Tic-Tac-Toe again.

[Figure: from the same state s and action a (placing an X), the environment (opponent) can place its O in any open cell, giving six possible next states s'.]
Environment Dynamics Makes My Prediction from S to S' Wrong!
• The agent wants to move from s = (0,0) to s' = (1,0). But what kind of action can do this? "+1" (a move to the right) is NOT the answer in a stochastic world.

[Figure: from state s, one action A may lead to several different next states s'. Action and next state are NOT directly associated; which action is OK? Uncertainties over V(0,0), V(1,0), V(2,0).]
Q(s,a) Space instead of State Value V(s)
• Q space: the state-and-action space (S-A space).
• In Q space, all possible actions are considered for a given state s.

[Figure: from state s, each action A1…A4 leads to its own next state s'.]
Q-Learning
• Instead of TD-based learning with the state value V(s),
• Q-learning uses the Q space, Q(s,a).
• Think of the expectation:

TD: $V(s) \leftarrow r(s) + \gamma V(s')$

Q-learning: $Q(s,a) \leftarrow r(s,a) + \gamma \max_{a'} Q(s',a')$

Update rule: $Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[\,r(s,a) + \gamma \max_{a'} Q(s',a')\,\right]$
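The update rule above translates almost line by line into code. A minimal sketch, assuming Q is a dictionary mapping each state to a list of four action values (the dictionary layout and the constants are illustrative, not the actual l11q1.py code):

```python
alpha, gamma = 0.01, 0.99
N_ACTIONS = 4

def q_update(Q, s, a, r, s_next):
    """One Q-learning step:
       Q(s,a) <- (1-alpha) Q(s,a) + alpha [ r + gamma * max_a' Q(s',a') ]."""
    Q.setdefault(s, [0.0] * N_ACTIONS)         # create entries on first visit
    Q.setdefault(s_next, [0.0] * N_ACTIONS)
    target = r + gamma * max(Q[s_next])
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
    return Q

# usage: Q = {}; q_update(Q, (1, 1), 2, 0.0, (2, 1))
```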
Update Rule: TD vs. Q-Learning
• Q-learning works in the state-and-action space.
• V(s') is not defined in the S-A space.
• Instead, the discounted maximum Q over the next state s' is used to update state s.

[Figure: TD backs up V(s) with the target r(s) + γ V(s'); Q-learning backs up Q(s,a) with the target r(s,a) + γ max_{a'} Q(s',a'), taking maxQ over the actions A'1…A'4 available in s'.]
Q-Learning's Two Stages
• 1. Exploration
– Exploration is based on the agent's experience.
– It builds an episodic memory.
– The agent explores the target space in a random way (see the sketch after this list).
– All returns and actions are stored in the Q values.
• 2. Exploitation
– Get the best actions.
– Using the episodic memory built during exploration, the agent finds the best (or optimal) action at every step.
• Question: What is the goal of exploitation?
– It is not just to reach the goal; the agent tries to collect as much reward as possible.
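A minimal sketch of the exploration stage in the grid world: the agent acts purely at random and stores what it experiences in the Q values via the update rule from the previous section. The transition logic, constants, and function name are assumptions for illustration, not the actual l11q1.py code.

```python
import random

W, H = 10, 10
GOAL = (10, 10)
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]   # action indices 0..3
alpha, gamma = 0.01, 0.99

def explore(Q, episodes=100):
    """Exploration: random actions; every transition updates Q(s,a)."""
    for _ in range(episodes):
        s = (1, 1)
        while s != GOAL:
            a = random.randrange(4)                      # purely random action
            dx, dy = MOVES[a]
            s2 = (min(max(s[0] + dx, 1), W), min(max(s[1] + dy, 1), H))
            r = 1.0 if s2 == GOAL else 0.0
            Q.setdefault(s, [0.0] * 4)
            Q.setdefault(s2, [0.0] * 4)
            Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * max(Q[s2]))
            s = s2
    return Q
```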
Grid World Test: test4 or ex/ml/l11q1
• Plot MaxQ in each state.
• It is faster than TD.
• Q(s,a) indicates which way the agent should go!
• Find the best way (find the max Q):
Q(s,Left) = 0.1, Q(s,Right) = 0.3, Q(s,Up) = 0.8, Q(s,Down) = 0.01
so the action is Up!

[Figure: Max Q surface over the grid (x, y)]
Labyrinth Test
• Test5
• Map data: 0 for empty, 1 for wall (see the sketch below)

[Figure: labyrinth map with walls; exploration result and exploitation path]
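A sketch of how such a map could be encoded; the array below is only an illustrative layout (the actual test5 map is not reproduced here), with 0 for empty cells and 1 for walls.

```python
import numpy as np

# 0 = empty, 1 = wall (illustrative layout, not the real test5 map)
MAP = np.array([
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
])

def is_blocked(x, y):
    """A move into a wall cell (or off the map) is rejected."""
    h, w = MAP.shape
    return not (0 <= x < w and 0 <= y < h) or MAP[y, x] == 1
```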
7. HW: Q-Learning
Q-Learning: l11q1.py
• Q-learning has two modes.
• 1. Exploration: random search to update the Q values.
• 2. Exploitation: follow the maximum Q value.
– The agent follows the maximum Q value.
– argmax_a Q(s,a) = a*, the best policy (action).

$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[\,r(s,a) + \gamma \max_{a'} Q(s',a')\,\right]$
Q-Learning with a Q-value Class
• Each Q value also stores its actions (a minimal sketch follows below):

s = (x, y)
Q(s, a) = [0, 0, 0, 0]  (if the number of actions is 4)
v = max_a Q(s, a)
a* = argmax_a Q(s, a),  a* ∈ {0, …, 3}
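A minimal sketch of such a Q-value container: each state (x, y) owns a list of four action values, with helpers for the max and the argmax. The class name and methods are assumptions for illustration, not the actual class in l11q1.py.

```python
class QValue:
    """Q values for one state s = (x, y), one slot per action (4 actions)."""
    def __init__(self, n_actions=4):
        self.v = [0.0] * n_actions          # Q(s, a) for a = 0..3

    def max_q(self):
        """v = max_a Q(s, a)."""
        return max(self.v)

    def argmax_a(self):
        """a* = argmax_a Q(s, a), a* in {0, ..., 3}."""
        return self.v.index(max(self.v))

# Q table: one QValue per visited state
Q = {}
s = (1, 1)
Q.setdefault(s, QValue())
Q[s].v[2] = 0.3                             # update action 2 for state (1, 1)
print(Q[s].argmax_a())                      # -> 2
```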
Exploration

$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[\,r(s,a) + \gamma \max_{a'} Q(s',a')\,\right]$

with the states stored as coordinate pairs: (x, y) for the new position and (xo, yo) for the old one.
Result of l11q1.py
• Exploration with 100 episodes
• Draw the Qmax value
Exploitation with l11q1.py
• Start at s = si = (0,0)
• Repeat (a sketch of this loop follows below):
– Find a* = argmax_a Q(s,a)
– Do the best action a*
– Then we get s'
– If s' is terminal, stop
– s ← s'
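A direct sketch of this loop, assuming Q maps each state to a list of four action values and step() is a transition function taking (state, action index) and returning (next_state, reward); both are illustrative. A step cap guards against getting stuck in loops when Q has local maxima.

```python
def exploit(Q, step, start=(0, 0), goal=(9, 9), max_steps=1000):
    """Greedy rollout: always take a* = argmax_a Q(s, a) until the terminal state."""
    s, path = start, [start]
    for _ in range(max_steps):
        if s == goal:                        # terminal: stop
            break
        qs = Q.get(s, [0.0, 0.0, 0.0, 0.0])
        a_star = qs.index(max(qs))           # a* = argmax_a Q(s, a)
        s, _ = step(s, a_star)               # do the best action, observe s'
        path.append(s)                       # s <- s'
    return path
```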
Complete Your Q-Learning
• With the given l11q1.py, the exploration result looks like this:
– Exploit mode stops at s = (1,3).

[Figure: exploit mode stops at (1,3)]
From l11q1.py, Answer the Questions
• Prob. 1. Why does the agent stop at s = (1,3)?
– Hint: look at the Qmax picture; you will see local maxima.
• Prob. 2. Complete your Q-learning.
– Exploitation MUST stop at s = (9,9).
– Which code should be changed in 'l11q1.py'?
– Hint: if you understand Q-learning, it is not so hard.
Prob. 3. Add Noise (l11q2.py)
• With probability 70%, the action works correctly.
• Otherwise, the action is corrupted.
• Prob. 3.1: Complete your Q-learning.
• Prob. 3.2: What happens to the Qmax graph?
• Prob. 3.3: What happens in Exploit mode?
• Prob. 3.4: If we increase the corruption percentage beyond the current setting, what happens?
• Prob. 3.5: Explain why RL works well in this hard, noisy environment.

[Figure: right action vs. corrupted action]
Prob. 4. Add Noise on Exploration and Exploitation (l11q4.py)
• Prob. 4.1: Complete your learning, and explain the exploitation results.
• Prob. 4.2: If we increase the noise, what happens?

Note: noise on exploitation corrupts the best (optimal) action.
Prob. 5. With l11q5.py: If the Agent Does Not Stop at the Terminal, What Happens to the Qmax Graph?
• 1. Add noise to the actions.
• 2. The agent does NOT stop at the terminal.
• 3. After 500,000 actions, STOP the episode.
• What happens?
• What is the difference from the result of Prob. 1?
– Hint: see the maximum Qmax value.
• Why is the maximum Qmax value so different?
8. Tic-Tac-Toe
Tic-Tac-Toe
• How many states are there in Tic-Tac-Toe?
• The number of end-game boards is 958.
• The first player wins 626 of them.

First offence: 626/958; second offence: 332/958.
Tic-Tac-Toe in Q-Learning
• State
– s = [0,0,0,0,0,0,0,0,0]
• The RL agent plays 'o' = 2, the human plays 'x' = 1, and a blank is 0.
• Q(s,a)
– The possible actions are also 0~8.
• Example (see the sketch below):
– 1. s = [0,0,0,0,0,0,0,0,0]
– 2. RL takes action = 4.
– 3. Then s* = [0,0,0,0,2,0,0,0,0].
– 4. The human takes action = 0 (the environment changes).
– 5. Finally, s' = [1,0,0,0,2,0,0,0,0].

Board cell indices:
0 1 2
3 4 5
6 7 8
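The sequence above can be written as a small sketch: the state is a 9-element list, the agent's mark is 2 ('o'), the human's is 1 ('x'). The helper name place() is illustrative.

```python
AGENT, HUMAN, EMPTY = 2, 1, 0          # 'o' = 2, 'x' = 1, blank = 0

def place(state, cell, mark):
    """Return a new board with `mark` placed at `cell` (0..8)."""
    new_state = list(state)
    new_state[cell] = mark
    return new_state

s = [0] * 9                            # 1. empty board
s_star = place(s, 4, AGENT)            # 2-3. RL takes action 4
s_next = place(s_star, 0, HUMAN)       # 4-5. human takes action 0
print(s_star)                          # [0, 0, 0, 0, 2, 0, 0, 0, 0]
print(s_next)                          # [1, 0, 0, 0, 2, 0, 0, 0, 0]
```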
How Do We Determine the Reward?
• If RL (o) wins a game, it obtains reward r = 1.
• If RL (o) loses a game, it obtains reward r = -1.
• Otherwise, r = 0.
• How does this work?
– The agent only attempts to WIN a game,
– with no defense.
• So instead: if RL wins a game, r = 1; if RL loses a game, r = -10 (see the sketch below).
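A minimal sketch of the reward assignment described above; the win check uses the standard eight tic-tac-toe lines, and the -10 losing penalty is the second variant from the slide. The function names are illustrative.

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),      # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),      # columns
         (0, 4, 8), (2, 4, 6)]                 # diagonals

def has_won(state, mark):
    """True if `mark` occupies any complete line."""
    return any(all(state[i] == mark for i in line) for line in LINES)

def reward(state, agent=2, human=1):
    """+1 if the RL agent (o) wins, -10 if it loses, 0 otherwise."""
    if has_won(state, agent):
        return 1.0
    if has_won(state, human):
        return -10.0
    return 0.0
```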
How Do We Determine the Q Space?
• The Q space is very complex and high-dimensional.
• At every turn, entries are added to the Q space (see the sketch below):
– Check whether the same Q entry already exists.
• If so, update that Q.
– Otherwise,
• create a new Q entry.
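A minimal sketch of this bookkeeping: the Q space is a dictionary keyed by the board state (as a tuple), an entry is created the first time a state is seen and updated otherwise. A defaultdict makes the "check, else create" step implicit; the structure and the alpha/gamma values are assumptions about l11ttt, not its exact code.

```python
from collections import defaultdict

N_ACTIONS = 9                                   # one action per board cell

# Q[s] is created automatically the first time state s (a 9-tuple) is seen
Q = defaultdict(lambda: [0.0] * N_ACTIONS)

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Look up (or lazily create) Q[s] and Q[s_next], then apply the Q-learning update."""
    s, s_next = tuple(s), tuple(s_next)         # lists are not hashable; tuples are
    target = r + gamma * max(Q[s_next])
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
```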
See the Example: ex/ml/l11ttt
• The learned Q space contains 8618 entries.
• Learning is done by explore().
• At each step, you can check which action is the best.

The state is the flattened board:
$s = [\,s_{11}, s_{12}, s_{13}, s_{21}, s_{22}, s_{23}, s_{31}, s_{32}, s_{33}\,]$, where $s_{ij} = 0$ for empty, $1$ for X, and $2$ for O.