
T&C LAB-AI

Robotics

Reinforcement Learning 

Lecture 11

Jeong-Yean Yang

2020/12/10

5. Examples



State Value, V(s) in 2-Dim Space

• l11mc.py and l11td.py
• Grid world (w=10, h=10)

Random movement: N, E, W, S
  N: y ← y+1,  S: y ← y-1,  E: x ← x+1,  W: x ← x-1
Initial position: (1,1)
Goal position: (10,10)
Reward at goal: r = +1; otherwise r = 0

[Figure: 10×10 grid world with the agent at the initial position]
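A minimal sketch of this grid world in Python (the class name GridWorld and the clamping of moves at the borders are my assumptions, not taken from l11mc.py / l11td.py):

```python
import random

class GridWorld:
    """10x10 grid world; states are (x, y) with x, y in 1..10."""
    def __init__(self, w=10, h=10):
        self.w, self.h = w, h
        self.goal = (w, h)          # goal at (10, 10)
        self.reset()

    def reset(self):
        self.state = (1, 1)         # initial position (1, 1)
        return self.state

    def step(self, action):
        x, y = self.state
        if   action == 'N': y = min(y + 1, self.h)
        elif action == 'S': y = max(y - 1, 1)
        elif action == 'E': x = min(x + 1, self.w)
        elif action == 'W': x = max(x - 1, 1)
        self.state = (x, y)
        reward = 1.0 if self.state == self.goal else 0.0
        done = self.state == self.goal
        return self.state, reward, done

# random movement: pick N, E, S, W uniformly at random until the goal is reached
env = GridWorld()
s, done = env.reset(), False
while not done:
    s, r, done = env.step(random.choice(['N', 'E', 'S', 'W']))
```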



State Value: V(s)

• Optimal path: Follows the Maximum State Value, V(s)

[Plot: learned state values V(s) over the 10×10 grid]



MC-based 2D Problem: l11mc.py

• S0 = initial position
  – (1,1) is the starting position
• V(s) is 2-dimensional
  – V(1,1), V(1,2), … V(1,10)
  – V(2,1), V(2,2), … V(2,10)
  – …
  – V(10,1), V(10,2), … V(10,10)
• The history h is also 2-dimensional (see the sketch below).

The history is a sequence of 2-D states from the initial to the terminal state, e.g.
  h = [(1,1), (1,2), (2,2), …, (10,10)]

[Figure: 10×10 grid with corners (1,1), (10,1), (1,10), (10,10); episodes start at (1,1) and terminate at (10,10)]
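A small sketch of the data structures this implies, assuming a NumPy value table and a plain list for the history (names are illustrative, not necessarily those in l11mc.py):

```python
import numpy as np

W, H = 10, 10
V = np.zeros((W + 1, H + 1))      # V[x, y]; index 0 unused so states run 1..10

history = []                      # 2-D history: list of visited states (x, y)
history.append((1, 1))            # initial state
history.append((1, 2))            # ... states visited during one episode
```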



Monte Carlo Update

• MC revisits past states
  – The agent may visit a state s more than once.
  – As the number of visits increases, V(s) is updated by the return R many times.
  – With many visits, the return is blended into V(s) smoothly, like blurring paint with a finger.

(History and return: see pp. 22 of Lecture 10.)
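A sketch of the idea as an every-visit, constant-α Monte Carlo update over one episode's history (the exact variant used in l11mc.py / Lecture 10 may differ):

```python
GAMMA, ALPHA = 0.99, 0.01

def mc_update(V, history, rewards):
    """Update V(s) for every state visited in one episode.

    history: list of visited states (x, y)
    rewards: reward received at each step (0 everywhere except +1 at the goal)
    """
    G = 0.0
    # walk backwards so G accumulates the discounted return from each state
    for (x, y), r in zip(reversed(history), reversed(rewards)):
        G = r + GAMMA * G
        V[x, y] += ALPHA * (G - V[x, y])   # move V(s) toward the observed return
    return V
```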


Monte Carlo Method: State Value (l11mc.py)

[Plots: V(s) after 1 episode, after 50 episodes, and after 200 episodes]



Temporal Difference 2D problem: l11td.py 

Value update by TD:

  $V(s) \leftarrow (1-\alpha)\,V(s) + \alpha\,[\,r(s) + \gamma V(s')\,]$

In the code, s0 denotes the old state and s the new state reached by action a, i.e. s = s0 + a.
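A minimal sketch of one such TD(0) step on the 2-D value table (function and variable names are illustrative):

```python
GAMMA, ALPHA = 0.99, 0.01

def td_update(V, s0, s, r):
    """One TD(0) step: s0 is the old state, s the new state reached by the action."""
    x0, y0 = s0
    x, y = s
    V[x0, y0] = (1 - ALPHA) * V[x0, y0] + ALPHA * (r + GAMMA * V[x, y])
    return V
```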



TD learning in 2D : l11td.py

• Episodes : 100
• Alpha=0.01
• Gamma=0.99

[Plots: V(s) after 1 episode, after 10 episodes, and after 100 episodes]



Why Is TD So Much Faster than MC?

• In many cases, TD is faster than MC.
• But this is NOT clearly proven.

• We do know that MC is sensitive to the alpha value.
• See the next example.

[Plots: MC with alpha = 0.01 vs. MC with alpha = 0.001]



Why Does TD Get Such a Smooth V(s)?

• TD is a function of (s, s0).
• MC is a function of (R, s).
• Which one is better?
  – TD looks better: many local maxima are bad for hill climbing.
  – However, the blurring in TD sometimes makes V(s) biased.

[Plots: V(s) learned by TD vs. MC, where TD updates by
$V(s) \leftarrow (1-\alpha)\,V(s) + \alpha\,[\,r(s) + \gamma V(s')\,]$.
Is it the highest mountain?]


6. Q-Learning



State Value Knows Where to Go, but Does Not Tell Which Action to Take

• State value, V(s): the expected return of an observed state.

• Given Sense(s)-and-Action(a), how do we choose the best action?

• The state value is an INDIRECT legend.
• We want a DIRECT legend, that is, the BEST action.

[Figure: s → s′ by an unknown action (?)]


People Believe “I Know My State,” but It Is NOT True



State Example (I)

You Cannot Observe Everything!

• State s = your consciousness

[Figure: a state s with components (Hunger, Money, Health, Grade, …); the action “Presentation” changes the state; some components form the Observed State, the rest the Unobserved State]



State Example (II)

It Is NOT Your Turn → Environment Dynamics

• Think of Tic-Tac-Toe.
• Each of the 9 cells holds one of 3 symbols, so there are $3^9 = 19683$ board states.

[Figure: from state S the agent does action A (places an X mark) to reach S′; the environment (the opponent player) then places an O]



State Example (II)

It Is NOT Your Turn → Environment Dynamics

• Think of Tic-Tac-Toe ($3^9 = 19683$ board states).

[Figure: from S the agent's action A (placing an X) gives S′; the opponent's O reply can land on any empty cell, so there are six possible resulting states S′]



Environment Dynamics Makes My Prediction from S to S′ Wrong!

• The agent wants to move from S(0,0) to S(1,0).
• But what kind of action can do this?
• A “+1” (move right) action is NOT the answer in a stochastic world.

[Figure: from s, one action A can lead to several possible next states S′; the action and the next state are NOT directly associated (uncertainties). Which action is OK for reaching V(1,0) rather than V(0,0) or V(2,0)?]



Q(s,a) Space Instead of the State Value V(s)

• Q space: the State-and-Action space (S-A space).
• In Q space, all possible actions are considered for a given state.

[Figure: from a state s, each action A1–A4 has its own Q value and leads to its own next state S′]



Q-Learning

• Instead of TD-based learning with the state value V(s),
• Q-learning uses the Q space, Q(s,a).

• Think of the expectation (the update target) in each case:

  TD:  $V(s) \leftarrow r(s) + \gamma V(s')$

  Q-learning:  $Q(s,a) \leftarrow r(s,a) + \gamma \max_{a'} Q(s',a')$

The Q-learning update with learning rate $\alpha$:

  $Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,[\,r(s,a) + \gamma \max_{a'} Q(s',a')\,]$
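The same update written as a small Python function on a tabular Q[x, y, a] (a sketch; the array layout and the constants are my assumptions):

```python
GAMMA, ALPHA = 0.99, 0.01

def q_update(Q, s, a, r, s_next):
    """One Q-learning step on a tabular Q[x, y, a] (e.g. a NumPy array)."""
    x, y = s
    xn, yn = s_next
    target = r + GAMMA * Q[xn, yn].max()                 # r(s,a) + gamma * max_a' Q(s',a')
    Q[x, y, a] = (1 - ALPHA) * Q[x, y, a] + ALPHA * target
```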

 



Update Rule: TD vs. Q-Learning

• Q-learning works in the State-and-Action space.
• V(s′) is not defined in the S-A space.
• Instead, the discounted maximum Q of the next state is used to update Q for state S.

[Figure: TD backs up $r(s) + \gamma V(s')$ along $s \to s'$; Q-learning backs up $r(s,a) + \gamma \max_{a'} Q(s',a')$ over the next state's actions A′1–A′4 (max Q)]



Q-Learning's Two Stages

• 1. Exploration
  – Exploration is based on the agent's experience.
  – It is episodic memory.
  – The agent explores the target space in a random way.
  – All returns and actions are stored in the Q values.

• 2. Exploitation
  – Get the best actions.
  – Using the episodic memory built during exploration, the agent tries to find the best (optimized) action at every step.

• Question: What is the goal of exploitation?
  – It is not just to reach the goal; the agent tries to get more reward.



Grid World Test

test4 or ex/ml/l11q1

• Plot Max Q in each state.
• It is faster than TD.

• Q(s,a) indicates which way the agent should go!

• Find the best way (find the max Q):
  Q(s,Left)=0.1, Q(s,Right)=0.3, Q(s,Up)=0.8, Q(s,Down)=0.01
  → So the action is Up!

[Plot: Max Q over the 10×10 grid (axes x, y, Max Q)]
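As a tiny illustration of this choice (the dictionary below just holds the example numbers from this slide):

```python
# Q values for the four actions at one state s
q_s = {'Left': 0.1, 'Right': 0.3, 'Up': 0.8, 'Down': 0.01}

best_action = max(q_s, key=q_s.get)   # argmax over actions
print(best_action)                    # -> 'Up'
```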


background image

T&C LAB-AI

Robotics

Labyrinth Test

• Test5
• Map data: 0 for empty, 1 for wall

[Figures: the labyrinth walls, the exploration result (Max Q), and the exploitation path]


background image

T&C LAB-AI

HW. Q-Learning

7

25



Q-Learning: l11q1.py

• Q-learning has two modes.
• 1. Exploration: random search to update the Q values.
• 2. Exploitation: following the maximum Q value.
  – The agent follows the maximum Q value.
  – $\arg\max_a Q(s,a) = a^*$ → the best policy (action).

$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,[\,r(s,a) + \gamma \max_{a'} Q(s',a')\,]$



Q-Learning with Q-value class

• Each Q-value entry also stores its action values.

For a state $s = (x, y)$ with (at most) 4 actions, the Q values form a vector
  $Q(s,\cdot) = v = [0, 0, 0, 0]$  (one entry per action)

$\max_a Q(s,a)$ is the maximum over the 4 entries, and
$a^* = \arg\max_a Q(s,a)$, with $a^* \in \{0, 1, 2, 3\}$, is the best action.
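A minimal sketch of such a Q-value container (a hypothetical class, not necessarily the one used in the example code):

```python
import numpy as np

class QValue:
    """Q values of one state s = (x, y) for 4 actions (0..3)."""
    def __init__(self, s, n_actions=4):
        self.s = s
        self.v = np.zeros(n_actions)     # [0, 0, 0, 0]

    def max_q(self):
        return self.v.max()              # max_a Q(s, a)

    def best_action(self):
        return int(self.v.argmax())      # a* = argmax_a Q(s, a), in {0,...,3}
```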



Exploration

During exploration, each step updates

  $Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,[\,r(s,a) + \gamma \max_{a'} Q(s',a')\,]$

where the states are grid coordinates: the old state is (xo, yo) and the new state reached by the action is (x, y).
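A sketch of the exploration loop under these assumptions; it reuses the hypothetical GridWorld environment and q_update function from the earlier sketches:

```python
import random
import numpy as np

W, H, N_ACTIONS = 10, 10, 4
ACTIONS = ['N', 'S', 'E', 'W']
Q = np.zeros((W + 1, H + 1, N_ACTIONS))        # Q[x, y, a]; states run 1..10

def explore_episode(env):
    """One exploration episode: act at random and update Q at every step."""
    s, done = env.reset(), False
    while not done:
        a = random.randrange(N_ACTIONS)        # random action
        s_next, r, done = env.step(ACTIONS[a])
        q_update(Q, s, a, r, s_next)           # update rule from the Q-learning slide
        s = s_next
```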



Result of l11q1.py

• Exploration with 100 episodes

• Draw Qmax value




Exploitation with l11q1.py

• Start at s = si = (0,0).
• Repeat:
  – Find a* = argmax_a Q(s,a).
  – Do the best action a*.
  – Then we get s′.
  – If s′ is terminal, stop.
  – s ← s′
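A sketch of this exploitation loop, reusing the hypothetical GridWorld and tabular Q from the earlier sketches (the step cap is my addition):

```python
def exploit(env, Q, max_steps=200):
    """Follow a* = argmax_a Q(s, a) from the start state until the terminal state."""
    s, done = env.reset(), False
    path = [s]
    for _ in range(max_steps):                 # safety cap in case Q is not learned well
        x, y = s
        a = int(Q[x, y].argmax())              # a* = argmax_a Q(s, a)
        s, r, done = env.step(['N', 'S', 'E', 'W'][a])
        path.append(s)
        if done:                               # s' is terminal -> stop
            break
    return path
```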



Complete your Q-Learning

• With the given l11q1.py, the exploration result looks like this:
  – Exploit mode stops at s = (1,3).

[Plot: Qmax after exploration; exploit mode stops at (1,3)]



From l11q1.py, Answer the questions

• Prob. 1. Why does the agent stop at s = (1,3)?
  – Hint) See the Qmax picture. You see local maxima.

• Prob. 2. Complete your Q-learning.
  – Exploitation MUST stop at s = (9,9).
  – What code should be changed in ‘l11q1.py’?
  – Hint) If you understand Q-learning, it is not so hard.



Prob. 3. Add Noise (l11q2.py)

• With probability 70%, the action works as intended.
• Otherwise, the action is corrupted.
• Prob. 3.1: Complete your Q-learning.
• Prob. 3.2: What happens to the Qmax graph?
• Prob. 3.3: What happens in exploit mode?
• Prob. 3.4: If we increase the corruption percentage to 70%, what happens?
• Prob. 3.5: Explain why RL is good in this hard, noisy environment.

[Figure: a right (intended) action vs. a corrupted action]
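One way to model this action noise (a sketch; l11q2.py may implement the corruption differently):

```python
import random

def noisy_action(intended, p_correct=0.7, n_actions=4):
    """With probability p_correct the intended action is executed;
    otherwise a random (corrupted) action is taken instead."""
    if random.random() < p_correct:
        return intended
    return random.randrange(n_actions)
```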



Prob. 4. Add Noise on Exploration and Exploitation (l11q4.py)

• Prob. 4.1: “Complete your learning” and explain the exploitation results.
• Prob. 4.2: If we increase the noise, what happens?

Noise on exploitation corrupts the best (optimal) action.



Prob. 5. (l11q5.py) If the agent does not stop at the terminal state, what happens to the Qmax graph?

• 1. Add noise on the actions.
• 2. The agent does NOT stop at the terminal state.
• 3. After 500000 actions, STOP the episode.

• What happens?
• What is the difference from the result of Prob. 1?
  – Hint) See the maximum Qmax value.
• Why is the maximum Qmax value so different?


8. Tic-Tac-Toe



Tic-Tac-Toe

• How many states are there in Tic-Tac-Toe?
• The number of end-game boards is 958.
• The first offence (first player) wins 626 of them.

First offence: 626/958
Second offence: 332/958



Tic-Tac-Toe in Q-learning

• State
  – S = [0,0,0,0,0,0,0,0,0]
  – The RL agent plays ‘o’ = 2, the human plays ‘x’ = 1, and a blank cell is 0.

• Q(s,a)
  – The possible actions are also 0~8.

• Example)
  – 1. s = [0,0,0,0,0,0,0,0,0]
  – 2. The RL agent does action = 4.
  – 3. Then s* = [0,0,0,0,2,0,0,0,0].
  – 4. The human does action = 0 → the environment changes the state.
  – 5. Finally, s′ = [1,0,0,0,2,0,0,0,0].

Cell indices of the board:
  0 1 2
  3 4 5
  6 7 8
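A minimal sketch of this state encoding and of one full turn (illustrative only; the l11ttt code may structure it differently):

```python
EMPTY, HUMAN_X, AGENT_O = 0, 1, 2

s = [EMPTY] * 9           # 1. empty board

def place(state, cell, mark):
    """Return a new board with `mark` placed in `cell` (0..8)."""
    nxt = list(state)
    assert nxt[cell] == EMPTY
    nxt[cell] = mark
    return nxt

s_star = place(s, 4, AGENT_O)       # 2-3. RL agent plays the center -> [0,0,0,0,2,0,0,0,0]
s_next = place(s_star, 0, HUMAN_X)  # 4-5. human replies at cell 0   -> [1,0,0,0,2,0,0,0,0]
```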



How Do We Determine the Reward?

• If RL (o) wins a game, it obtains reward r = 1.
• If RL (o) loses a game, it obtains reward r = -1.
• Otherwise, r = 0.

• How does this work?
  – The agent only attempts to WIN a game;
  – it plays no defense.

• So instead: if RL wins a game, r = 1; if RL loses a game, r = -10.
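A small sketch of such a reward function (the winner encoding is my assumption); the heavier loss penalty is what pushes the agent to also defend:

```python
def reward(winner):
    """winner: 'O' if the RL agent won, 'X' if the human won, None otherwise."""
    if winner == 'O':
        return 1.0       # RL agent wins
    if winner == 'X':
        return -10.0     # losing is penalized much harder than winning is rewarded
    return 0.0           # draw or game still in progress
```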



How to Determine the Q Space?

• The Q space is very complex and high-dimensional.
• At every turn, a Q entry is added (see the sketch below):
  – Check whether a Q entry for the same state already exists.
    • If so, update that Q.
  – Otherwise,
    • create a new Q entry.
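One way to realize this is a dictionary keyed by the board state, grown lazily; a sketch (the α value and function names here are illustrative):

```python
Q = {}                                   # maps a board state (as a tuple) to its 9 action values

def get_q(state):
    """Return the Q entry for `state`, creating a new one if it does not exist yet."""
    key = tuple(state)                   # lists are not hashable, tuples are
    if key not in Q:
        Q[key] = [0.0] * 9               # one Q value per cell/action 0..8
    return Q[key]

def update_q(state, action, target, alpha=0.1):
    q = get_q(state)
    q[action] = (1 - alpha) * q[action] + alpha * target
```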



See the Example: ex/ml/l11ttt

• The learned Q space contains 8618 states.
• Learning is done by explore().
• At each step, you can check which action is the best.

The state vector of one board is

$s = [\,s_{11}, s_{12}, s_{13}, s_{21}, s_{22}, s_{23}, s_{31}, s_{32}, s_{33}\,]$, with $s_{ij} \in \{0, 1, 2\}$:
$s_{ij} = 0$ for an empty cell, $s_{ij} = 1$ for X, and $s_{ij} = 2$ for O.