“TD learning” stands for “temporal-difference learning”, a combination of Monte Carlo (MC) ideas and dynamic programming (DP) ideas. Here we focus mainly on the differences between TD and MC. In MC methods, the update target is the actual return, so an update can occur only after an episode has finished; TD methods, whose update targets are available as soon as the next time step ends, are much more direct. To show the difference more visually, the following update rules are given:
$$MC: V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)]$$
$$TD\ methods: V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$$
From the update rules above, the target of the TD update is based entirely on the next time step, so we call it TD(0), or one-step TD; it is a special case of TD($\lambda$).
Here comes the pseudo-code of TD(0):
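As a concrete companion to the pseudo-code, here is a minimal tabular sketch of TD(0) policy evaluation in Python. The `env_step(state, action) -> (reward, next_state, done)` and `policy(state)` interfaces are assumptions for illustration, not part of the original text:

```python
def td0_policy_evaluation(env_step, policy, states, alpha=0.1, gamma=0.9,
                          episodes=1000, start_state=0):
    """Tabular TD(0) prediction: estimate V_pi for a fixed policy.

    env_step(state, action) -> (reward, next_state, done) and
    policy(state) -> action are assumed interfaces.
    """
    V = {s: 0.0 for s in states}  # value table; terminal states stay 0
    for _ in range(episodes):
        s = start_state
        done = False
        while not done:
            a = policy(s)
            r, s_next, done = env_step(s, a)
            # target = R_{t+1} + gamma * V(S_{t+1}); V(terminal) is 0
            target = r + (0.0 if done else gamma * V[s_next])
            # TD(0) update: V(S_t) <- V(S_t) + alpha [target - V(S_t)]
            V[s] += alpha * (target - V[s])
            s = s_next
    return V
```

Note that the update inside the loop happens immediately after each transition, whereas an MC version would have to wait for the episode to end and use the full return $G_t$ as the target.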
On-policy and off-policy methods were originally proposed to address the problem of ensuring that all actions are selected infinitely often in MC methods.
The main difference between the two is whether the policy that generates the training data is the same as the policy being evaluated and improved for actual decision making. Following the description of the every-visit MC method suited to non-stationary environments, we can describe the process of updating an existing estimate as:
$$value \leftarrow value + \alpha \,(target\ value - value)$$
Here, the “target value” and the “value” in the parentheses correspond to the policy being evaluated and the policy used to generate the data. If the two are based on the same policy, the algorithm is on-policy; otherwise, it is off-policy.
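The generic update rule above can be checked with concrete numbers. A minimal sketch (the function name `soft_update` is my own, not from the text):

```python
def soft_update(value, target, alpha):
    """Move an estimate a fraction alpha of the way toward the target."""
    return value + alpha * (target - value)

# With value 0.5, target 1.0 and step size 0.1, the estimate moves
# one tenth of the way toward the target, i.e. to about 0.55.
new_value = soft_update(0.5, 1.0, 0.1)
```

Repeating this update with a fixed target drives the estimate toward that target geometrically, which is exactly the behavior both the MC and TD rules rely on.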
Strictly speaking, an on-policy algorithm can only be trained on data generated by the policy currently being optimized. Once the policy has been updated with a batch of data, the “currently optimizing” policy changes immediately and becomes the new rule for generating the next batch, which trains only the next step. With the data-generating policy constantly shifting, an on-policy algorithm approaches its goal by choosing the best action step by step; however, this “best” may be a local optimum rather than a global one.
On the contrary, in off-policy algorithms the policy used to generate samples is not the same as the policy being evaluated. The latter is usually held fixed as the target (say, “the best so far”), while the policy actually used to generate data cannot always be the best. From this we can conclude that learning from stored experience is essentially off-policy.
Here is the explanation from the textbook [1]:
On-policy methods attempt to evaluate or improve a policy that is used to make decisions, whereas off-policy methods evaluate or improve a policy different from that used to generate the data.
Sarsa is an on-policy TD control method, which views an episode as an alternating sequence of states and state-action pairs. Notice that the unit of transition considered here is the state-action pair rather than the state alone. The rule for updating action values can be formulated as:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$$
It’s obvious that the update occurs after every transition from a non-terminal state $S_t$. Once $S_{t+1}$ is the terminal state, $Q(S_{t+1}, A_{t+1})$ is defined as zero.
In brief, the core of the rule introduced above is the quintuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, which describes a transition from one state-action pair to the next and gives the algorithm its name. Its backup diagram is as follows:
The on-policy attribute of Sarsa is embodied in the fact that, in the process of estimating $q_\pi$, the next action $A_{t+1}$ is selected by the same policy $\pi$, so $Q(S_{t+1}, A_{t+1})$ is also the Q-value under $\pi$, which remains the policy being followed throughout the process.
The complete process of the Sarsa algorithm can be described by the following pseudo-code:
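Alongside the pseudo-code, here is a minimal tabular Sarsa sketch in Python. The `env_step(state, action) -> (reward, next_state, done)` interface and the $\epsilon$-greedy helper are assumptions for illustration:

```python
import random
from collections import defaultdict

def sarsa(env_step, actions, alpha=0.5, gamma=0.9, epsilon=0.1,
          episodes=500, start_state=0):
    """Tabular Sarsa (on-policy TD control).

    env_step(state, action) -> (reward, next_state, done) is an
    assumed environment interface; actions lists the available actions.
    """
    Q = defaultdict(float)  # Q[(state, action)]; unseen pairs default to 0

    def eps_greedy(s):
        # behavior policy == target policy: epsilon-greedy w.r.t. Q
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = start_state
        a = eps_greedy(s)
        done = False
        while not done:
            r, s2, done = env_step(s, a)
            # A_{t+1} is chosen by the same policy that acts (on-policy);
            # Q(S_{t+1}, A_{t+1}) is defined as 0 when S_{t+1} is terminal
            a2 = eps_greedy(s2) if not done else None
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```

The update consumes exactly the quintuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ discussed above; replacing $Q(S_{t+1}, A_{t+1})$ with $\max_a Q(S_{t+1}, a)$ would instead give the off-policy variant.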
[1]. Sutton, R. S., & Barto, A. G. Reinforcement Learning: An Introduction.
[2]. https://zhuanlan.zhihu.com/p/59792208