T&C LAB-AI
Dept. of Intelligent Robot Eng. MU
Robotics
Robot Learning:
Reinforcement Learning
Lecture 9
양정연
2020/12/10
1
Process Learning
(Sequence Learning)
1. What do learning methods try to learn?
• Think about Neural Network and Linear Regression
• Is it True?
3
[Diagram] Neural Network: input x, output y. Data (x, y) are given; the goal is to find the proper function (the NN), y = f(x) = w_1·x + w_2. Learning the function!
[Diagram] Linear Regression: input x, output y. Data (x, y) are given; the goal is to find the proper function, y = ax + b. The learning method learns y = f(x). Learning the function!
Linear regression is equal to a basic NN with y = wx.
Nonlinear kernels work to create a hyperplane.
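As a minimal sketch (the data and the gradient-descent loop are illustrative assumptions, not the lecture's code), a single linear neuron y = w_1·x + w_2 trained on (x, y) pairs recovers exactly the linear-regression line y = ax + b:

# Sketch: linear regression as a single linear "neuron" y = w1*x + w2.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 2.0 * x + 0.5 + 0.05 * rng.standard_normal(100)   # noisy samples of a line

w1, w2, lr = 0.0, 0.0, 0.1
for _ in range(500):                      # gradient descent on the mean squared error
    err = (w1 * x + w2) - y
    w1 -= lr * np.mean(err * x)
    w2 -= lr * np.mean(err)

print(w1, w2)                             # close to a = 2.0, b = 0.5 of y = ax + b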
Definition of Hyperplane
• A hyperplane is a subspace that
– reduces the dimensionality of the original space.
– A 2-dim. space is projected onto a new space with y = wx.
4
[Figure] Linear transformation: the 2-D space R² of points (x, y) is mapped by y = wx onto the subspace {(x, y) | y = wx}.
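A small illustrative sketch (w and the sample points are arbitrary assumptions) of the linear transformation that sends points of R² onto the subspace {(x, y) | y = wx}:

# Sketch: orthogonal projection of R^2 points onto the subspace {(x, y) | y = w*x}.
import numpy as np

w = 0.5
d = np.array([1.0, w]) / np.hypot(1.0, w)      # unit vector along the line y = w*x
points = np.array([[2.0, 3.0], [-1.0, 0.5]])   # arbitrary points in R^2

projected = (points @ d)[:, None] * d          # every projected point satisfies y = w*x
print(projected)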
2. What do learning methods try to learn?
• Think about Neural Networks and Control
• Control is a model-based method.
5
[Diagram] Neural Network: input x, output y. Data (x, y) are given; the goal is to find the proper function (the NN), y = f(x). Learning the function!
[Diagram] Math. model / PD control: input e, output T, with T = K_p·e + K_d·ė. The error e is given and we design a controller: we design or model a function. We don't care about the output T.
3. If we use an NN in the control loop, what happens?
• Think about the control diagram.
• The input and output are different every time → it fails.
6
[Diagram] Neural Network: input x, output y. Data (x, y) are given; the goal is to find the proper function (the NN), y = f(x). Learning the function!
[Diagram] Learning method applied to control:
– controller → System, with desired state x_d, error e, and output T;
– replacing the controller with an NN → System, where T = f(e).
4. It is different from function estimation: it tries to learn a dynamic system.
• Think about the control diagram.
• The input and output are different every time → it fails.
7
[Diagram] Neural Network: input x, output y. Data (x, y) are given; the goal is to find the proper function (the NN), y = f(x). Learning the function!
[Diagram] Learning method: NN → System, with desired state x_d, error e, and output T.
Data 'e' are given, but 'T' is NOT.
The goal is to find the NN that makes x become x_d.
E.g., a Recurrent Neural Network (RNN) tries to solve the system dynamics.
How do we learn in this case?
8
• Something is different from an ordinary Neural Network…
Learning Goal is “to learn a Process”
• Process is a series of sequential events.
• Design issues
– We can represent a process like an instant data point.
– If each dot in a space is a process → learning a process.
– If the dots in a space are not processes, they are just an image.
• How to learn a process?
– There are two types of popular methods.
– 1. Recurrent Neural Network, RNN
– 2. Reinforcement Learning, RL
– (3. Classifier → Hidden Markov Model, HMM)
9
Types of Learning methods
• Learning Function
• Learning Dynamics (or Sequences)
10
[Diagram] Learning a function: a sample X → Learning Module → Y; the NN learns Y = NN(X) = f(X).
[Diagram] Learning dynamics: x_k → Learning Module → x_{k+1}.
How to learn Dynamics or Sequences
• 1. Recurrent Neural Network (RNN)
• 2. Recursive Neural Network (becoming obsolete)
• 3. Reinforcement Learning
– However, it has some different features (it is stochastic).
11
[Figure] State-space plot (x vs. ẋ) and time response x(t) over 10 s (time in ms) of the second-order system m·ẍ + c·ẋ + k·x = F(t).
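A minimal simulation sketch (m, c, k, and F chosen arbitrarily) of m·ẍ + c·ẋ + k·x = F(t); the resulting sequence of states is exactly the kind of data a dynamics-learning method must handle:

# Sketch: Euler simulation of m*x'' + c*x' + k*x = F(t); the trajectory (x_k, xdot_k)
# is the sequential data a dynamics-learning method would be trained on.
m, c, k, dt = 1.0, 0.2, 4.0, 0.001
x, xdot = 1.0, 0.0
traj = []
for _ in range(10000):               # 10 s at 1 ms steps, like the time(ms) axis
    F = 0.0                          # free response; F(t) could be any forcing input
    xddot = (F - c * xdot - k * x) / m
    xdot += xddot * dt
    x += xdot * dt
    traj.append((x, xdot))
print(traj[-1])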
Recall Tic-Tac-Toe
• Question 1: What is a state vector in Tic-Tac-Toe?
• Question 2: What is an input and an output?
– There are No Input and Output
– There are only States and Actions
12
[Diagram] x_k → x_{k+1}: the board at step k is written as a 9-element state vector x_k = [s_1, …, s_9]^T whose entries are X, O, or empty. An action (placing a mark in an empty cell) transforms x_k into the next state x_{k+1}.
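A small sketch (this encoding is an assumption, not the lecture's code) of a Tic-Tac-Toe state vector x_k and an action that produces x_{k+1}:

# Sketch: Tic-Tac-Toe state as a 9-element vector; an action writes a mark into an empty cell.
EMPTY = ' '

def step(state, action, mark):
    """Return the next state x_{k+1} after placing `mark` at cell index `action`."""
    assert state[action] == EMPTY, "actions are only allowed on empty cells"
    nxt = list(state)
    nxt[action] = mark
    return nxt

x_k = ['X', EMPTY, EMPTY, 'X', 'X', 'O', EMPTY, 'O', EMPTY]   # example state
x_k1 = step(x_k, 8, 'X')                                      # action: place X in cell 8
print(x_k1)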
Input / Output is Similar to Control
13
[Diagram] Learning dynamics without an input: x_k → Learning Module → x_{k+1}, i.e. Ẋ = f(X, t); e.g. m·ẍ + c·ẋ + k·x = 0 (the case u_k = 0).
[Diagram] Learning dynamics with an input: (x_k, u_k) → Learning Module → x_{k+1}, i.e. Ẋ = f(X, u, t); e.g. m·ẍ + c·ẋ + k·x = u(t) (the case u_k ≠ 0).
The environment maps x_k to x_{k+1}.
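A minimal sketch (Euler discretization with arbitrary constants) of the environment as a transition function x_{k+1} = f(x_k, u_k), covering both the u_k = 0 and u_k ≠ 0 cases:

# Sketch: the environment maps (x_k, u_k) to x_{k+1} for m*x'' + c*x' + k*x = u(t),
# discretized with a small Euler step. State X = (position, velocity).
m, c, k, dt = 1.0, 0.2, 4.0, 0.01

def env_step(X, u=0.0):
    pos, vel = X
    acc = (u - c * vel - k * pos) / m
    return (pos + vel * dt, vel + acc * dt)

X = (1.0, 0.0)
X_free = env_step(X)            # case u_k = 0: free dynamics
X_forced = env_step(X, u=2.0)   # case u_k != 0: forced/controlled dynamics
print(X_free, X_forced)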
What do we Minimize?
Or what do we Maximize?
Local and Global Goal
• Minimizing error is the central issue in learning methods
– In other words, optimization is learning.
• Local Goal
– A dog tries to follow a bone → but it fails.
• Global Goal (→ nearly global optimization)
– A dog tries to avoid obstacles → locally weak, but globally successful.
15
[Figure] Local goal vs. global goal.
Example
• Learning is achieved over several trials.
• 1st turn: follows the local goal, but reaches the goal eventually.
• 2nd turn: unsure, but the dog remembers the solution.
• 3rd turn: fully learned.
16
[Figure] Tries to follow the local goal → not sure, but finally gets the goal → absolutely, success!
Issue. 1. There are so Many, Many Paths.
• Step 1. Each path must be scored.
• Step 2. I might find the best one.
• Step 3. But I did not pass through the best one → try it again.
• Step 4. Finally, find the best one.
17
[Figure] Which one is optimal?
Issue. 2. Stochastic Problem
• An agent follows the same path.
• But Scoring is not Accurate.
– 1st turn: the score was 1.
– 2nd turn: the score was 1.2
– 3rd turn: the score was 0.9
18
[Figure] "I believe I can get the score 1!!" But the scores are not equal. Why? The environment is not equal → use stochastic (probabilistic) approaches.
Issue. 3. Expectation of ALL Cases
• Expectation is a term from probability theory.
• Every event is considered as a probabilistic case.
[Figure] "Probably, I expect a score of 1.12!!"
E{R} = lim_{n→∞} (1/n) Σ_{i=1..n} R_i = Σ_i R_i·p(R_i), where p is a PDF, not a probability.
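A quick sketch (the reward values 1.0/1.2/0.9 echo the previous slide; the distribution itself is an assumption) showing that the sample mean (1/n)·Σ R_i approaches E{R}:

# Sketch: estimating E{R} as the sample mean of n returns drawn from some distribution.
import random

random.seed(0)
returns = [random.choice([1.0, 1.2, 0.9]) for _ in range(100000)]   # noisy scores
estimate = sum(returns) / len(returns)
print(estimate)   # approaches the true expectation (1.0 + 1.2 + 0.9)/3 ≈ 1.033 as n grows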
Expectation in Probability
• Expectation is a value weighted by probability.
• Expectation of rolling a die.
20
E{x} = ∫ x·p(x) dx, where p(x) is the Probability Density Function (continuous case).
E{x} = Σ_i x_i·p(x_i), where p(x_i) is the Probability Mass Function (discrete case).
For a fair die, P(x_i) = 1/6 for x_i = 1, 2, …, 6, so
E{x} = Σ_i x_i·p(x_i) = 1·(1/6) + 2·(1/6) + … + 6·(1/6) = 3.5
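A plain check of the 3.5 above, computing E{x} for a fair die exactly from the PMF and approximately by simulation:

# Sketch: E{x} for a fair die, exactly from the PMF and approximately by sampling.
import random

exact = sum(x * (1 / 6) for x in range(1, 7))                  # sum_i x_i * p(x_i) = 3.5
random.seed(0)
empirical = sum(random.randint(1, 6) for _ in range(100000)) / 100000
print(exact, empirical)                                        # 3.5 and roughly 3.5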
RL uses the Return Value (Scores) in the Future
21
[Figure] Draw! Many cases occur. Which one gives the higher return?
Scoring on every transition
22
[Figure] Example scores on each transition along a path: 1, 5, -3, 0, 1, 3.
How do we summarize everything?
Stochastic Environment
Everything is Stochastically Determined: a Stochastic (Probabilistic) World
• Think about the example.
• Initial (start) position: an agent starts its exploration.
• Terminal position: an agent ends its exploration.
• Which way (Left or Right) is better?
– You answer: the right one.
– Why? How can you teach it to an agent?
24
[Figure] Start in the middle; Left Terminal: reward +1; Right Terminal: reward +2.
Our Agent has No Information
• An agent has two buttons
• When an agent finishes its exploration, it gets a Reward.
• “An agent is blind”
25
[Figure] The agent moves randomly. The agent does NOT know which button means left or right. How does it work?
Do 1000 Explorations
• Run l9test1.py
26
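l9test1.py itself is not reproduced in these notes; a minimal sketch of such an experiment (all names and details below are assumptions) could look like:

# Sketch of the 1000-exploration experiment: the blind agent presses a random button
# and ends at the left or right terminal; we only count which terminal was reached.
import random

left, right = 0, 0
for _ in range(1000):
    if random.random() < 0.5:   # the agent cannot tell which button is which
        left += 1
    else:
        right += 1
print(left, right)              # roughly 500 / 500, as in the result table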
The result is…
Result   Sum of Left goals   Sum of Right goals
1        506                 494
2        485                 515
3        518                 482
4        498                 502
5        503                 497
6        517                 483
7        485                 515
27
What the…! It has no meaning!!
• It is just a Left-or-Right example.
• If we do many iterations, left becomes equal to right.
– E.g., left 500, right 500 → the probability is 0.5.
What Do We Learn from this STUPID Example?
• 1. If we push the Red or Green button "randomly",
– we can reach the left or right terminal,
– thus the probabilities of Left and Right are as below.
• 2. However, we know that "Right" will give a better reward.
– How can we express this?
28
p(left) = 0.5,  p(right) = 0.5
At the initial state, we expect that
r_left·p(left) + r_right·p(right) = 1·(1/2) + 2·(1/2) = 1.5
Sum of Rewards is Good for Learning
29
[Figure] Initial state in the middle; Left Terminal: reward +1; Right Terminal: reward +2.
                   Left Terminal    Right Terminal
Number of cases    503              497
Sum of Reward      503              994
Average            503/503 = 1.0    994/497 = 2.0
Avg = 1 vs. Avg = 2: finally, I can say that the average reward at the RIGHT terminal is better!
Run l9test2.py
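The actual l9test2.py is likewise not shown; a sketch of the reward-averaging version (the rewards +1/+2 come from the slide, everything else is assumed) could be:

# Sketch: same random exploration, but now each terminal pays a reward and we
# compare the *average* reward collected at each terminal.
import random

counts = {'left': 0, 'right': 0}
sums = {'left': 0.0, 'right': 0.0}
reward = {'left': 1.0, 'right': 2.0}

for _ in range(1000):
    side = 'left' if random.random() < 0.5 else 'right'
    counts[side] += 1
    sums[side] += reward[side]

for side in ('left', 'right'):
    avg = sums[side] / max(counts[side], 1)
    print(side, counts[side], sums[side], avg)   # averages 1.0 and 2.0, as in the table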
Well, the World is not so easy…
• If it is a tough road, the agent receives a -0.1 reward.
– There are some monsters, as in video games.
• On each movement, the reward is -0.1…
– Which way is better?
30
[Figure] Initial state; Left Terminal (distance = 3): reward +1; Right Terminal (distance = 7): reward +2; damage -0.1 per step.
See l9test3.py
• How does it work?
31
[Figure] Initial reward, monster attacks along the way, final reward at each terminal.

            Left Terminal         Right Terminal
1st test    723 (Avg = -0.817)    277 (Avg = -0.807)
2nd test    711 (Avg = -0.677)    289 (Avg = -1.154)
Which one is better?
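A guessed reconstruction of what l9test3.py might do (the distances 3 and 7, the rewards +1/+2, and the -0.1 damage are from the slides; the random-walk mechanics are an assumption):

# Sketch: random walk on a corridor; start 3 steps from the left terminal (+1) and
# 7 steps from the right terminal (+2); every step costs -0.1.
import random

def explore():
    pos, ret = 3, 0.0                      # positions 0..10, start at position 3
    while 0 < pos < 10:
        pos += random.choice((-1, 1))      # the blind agent moves randomly
        ret -= 0.1                         # step damage
    side = 'left' if pos == 0 else 'right'
    ret += 1.0 if side == 'left' else 2.0
    return side, ret

stats = {'left': [], 'right': []}
for _ in range(1000):
    side, ret = explore()
    stats[side].append(ret)

for side, rets in stats.items():
    print(side, len(rets), sum(rets) / max(len(rets), 1))   # ~70% left, averages near the table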
l9test3.py test: 1000 → 10000 times.
• Which one is better?
• With 10000 tests,
“Left” is better.
32
[Figure] Results with 1000 runs vs. 10000 runs.
See the Result Carefully
• The optimum might be:
– Left: distance is 3 and reward is 1 → 1 - 3·0.1 = 0.7.
– Right: distance is 7 and reward is 2 → then 2 - 7·0.1 = 1.3??
• Why are the left and right averages so different?
33
It is a stochastic world. A random agent CANNOT reach the Right Terminal within 7 steps (probably not).
Extend This Concept
• Is the Sum good?
– Case 1) s = [3, 2, 1, 0] → sum of rewards = 1 - 3·0.1 = 0.7 (best)
– Case 2) s = [3,2,3,2,3,2,3,2,3,2,3,2,3,2,3,2,3,2,1,0] → sum = 1 - 19·0.1 = -0.9
• Is the Average good?
– Case 1) 3 turns → 0.7/3 = 0.233
– Case 2) 19 turns → -0.9/19 = -0.047
34
[Figure] Initial state; Left Terminal (distance = 3): reward +1; Right Terminal (distance = 7): reward +2; damage -0.1 per step.
Summary
• Average of all rewards
– A meaningless path [3,2,3,2,3,2,…] only reduces the average reward.
– Increasing the average of all rewards means finding the optimal path.
• Why Not the Sum?
– Summation is just a cost function in the deterministic setting.
• Why the Average?
– The average is an alternative expression of the Expectation.
– Our problem lives in a stochastic world → use a probabilistic method.
35
J = R = Σ_k r_k   (deterministic sum of rewards)
E{R} ≈ (1/N) Σ_k R_k   (average over N trials)
Optimal path: max E{R}
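As a closing sketch (the two paths are the cases from the previous slide; the corridor environment is the assumed one from above), the deterministic sum J and the estimated expectation E{R} can be compared like this:

# Sketch: deterministic sum of rewards for two specific paths vs. the expectation
# of the return estimated over N random explorations of the same corridor.
import random

def path_return(path):
    """J = sum_k r_k: -0.1 per step, +1 at the left terminal (0), +2 at the right (10)."""
    steps = len(path) - 1
    return -0.1 * steps + (1.0 if path[-1] == 0 else 2.0)

case1 = [3, 2, 1, 0]
case2 = [3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 1, 0]
print(path_return(case1), path_return(case2))       # 0.7 and -0.9, as on the slide

def random_return():
    pos, ret = 3, 0.0
    while 0 < pos < 10:
        pos += random.choice((-1, 1))
        ret -= 0.1
    return ret + (1.0 if pos == 0 else 2.0)

N = 10000
print(sum(random_return() for _ in range(N)) / N)   # estimate of E{R}; max E{R} defines the optimal path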