
Sarsa: One of the Classical Algorithms of RL



What is TD learning?

“TD learning” stands for “temporal-difference learning”, a combination of Monte Carlo (MC) ideas and dynamic programming (DP) ideas. Here we mainly focus on the differences between TD and MC. In MC methods the update target is the actual return, so an update can only happen once an episode is over; TD methods, by contrast, can form their update target as soon as the next time step ends, which makes them much more immediate. To show the difference more concretely, the two update rules are given below:
$$\mathrm{MC}: \quad V(S_t) \leftarrow V(S_t) + \alpha\,[G_t - V(S_t)]$$

$$\mathrm{TD\ methods}: \quad V(S_t) \leftarrow V(S_t) + \alpha\,[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$$
From the update rules above, the TD target depends only on the next time step, so we call this method TD(0), or one-step TD, because it is a special case of TD($\lambda$).
Here is the pseudo-code for tabular TD(0):

[Figure: pseudo-code of tabular TD(0) (image not reproduced)]
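Since the image above is not reproduced, here is a minimal Python sketch of tabular TD(0) prediction. It assumes a gym-like environment with `env.reset()` and `env.step(a)` returning `(next_state, reward, done)`, and a fixed `policy(state)` function; none of these names come from the original post.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=500, alpha=0.1, gamma=0.99):
    """Tabular TD(0) prediction for a fixed policy (sketch; gym-like env assumed)."""
    V = defaultdict(float)                            # V(s) starts at 0 for every state
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)                    # follow the policy being evaluated
            next_state, reward, done = env.step(action)
            target = reward if done else reward + gamma * V[next_state]
            V[state] += alpha * (target - V[state])   # TD(0) update toward the one-step target
            state = next_state
    return V
```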

On-policy and Off-policy

These two approaches were originally proposed to solve the problem of ensuring that all actions are selected infinitely often in MC methods.

The main difference between the two is whether the policy used to generate the data is the same as the policy being evaluated and improved for actual decision making. Following the description of the every-visit MC method suited to non-stationary environments, the process of updating an existing estimate can be written as:
$$value \leftarrow value + \alpha\,(target\ value - value)$$
Here, the “target value” and the “value” in parentheses correspond to the policy being evaluated and the policy used to generate the data, respectively. If the two are the same policy, the algorithm is on-policy; otherwise it is off-policy.
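To make the abstract form concrete, here is the same incremental step written as a tiny helper; the function name and the example values are our own, not from the post.

```python
def incremental_update(value, target, alpha):
    """Generic 'value <- value + alpha * (target - value)' step shared by MC and TD methods."""
    return value + alpha * (target - value)

# Example: moving an estimate of 0.0 toward a target of 1.0 with alpha = 0.1 gives 0.1
print(incremental_update(0.0, 1.0, 0.1))
```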

Theoretically speaking, an on-policy algorithm can only be trained with data generated by the policy that is currently being optimized. Once the policy has been updated with a batch of data, the “currently optimizing” policy changes immediately and becomes the new rule for generating the next batch of data, which is then used only for the next update step. By constantly shifting the policy that generates the training data, an on-policy algorithm eventually achieves the effect of choosing the best action step by step; however, this “best” may be a local optimum rather than a global one.

On the contrary, in off-policy algorithms the policy used to generate samples is not the same as the policy being evaluated. The latter is usually fixed as the target, say “the best” policy, while the policy actually used to generate the data cannot be the best all the time. From this point of view, we can conclude that learning from stored experience is essentially off-policy.
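To make the distinction concrete, the sketch below contrasts how a TD target for an action value could be formed under the behaviour policy itself (on-policy, as Sarsa does) versus under a separate greedy target policy (off-policy, as in Q-learning); the function and variable names here are our own illustration, not something the post spells out.

```python
def on_policy_target(Q, reward, next_state, behavior_policy, gamma):
    """On-policy: bootstrap with the action the behaviour policy itself will take next."""
    next_action = behavior_policy(next_state)
    return reward + gamma * Q[(next_state, next_action)]

def off_policy_target(Q, reward, next_state, actions, gamma):
    """Off-policy: bootstrap with a separate target policy (here the greedy one)."""
    return reward + gamma * max(Q[(next_state, a)] for a in actions)
```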

Here is the explanation from the textbook [1]:

On-policy methods attempt to evaluate or improve a policy that is used to make decisions, whereas off-policy methods evaluate or improve a policy different from that used to generate the data.

A brief introduction to Sarsa

Sarsa is an on-policy TD control method, which views an episode as an alternating sequence of states and state-action pairs. Note that the unit of transition considered here is the state-action pair rather than just the state. The rule for updating action values can be formulated as:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$$
This update occurs after every transition from a non-terminal state $S_t$. If $S_{t+1}$ is a terminal state, then $Q(S_{t+1}, A_{t+1})$ is defined to be zero.
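As a minimal sketch of this update rule (the function signature and the dictionary representation of Q are our own assumptions):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.99):
    """One Sarsa step; Q(S_{t+1}, A_{t+1}) is taken as 0 when S_{t+1} is terminal."""
    target = r if done else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```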

In brief, the core of the rule introduced above is the quintuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, which describes a transition from one state-action pair to the next and gives the algorithm its name. Its backup diagram is as follows:
[Backup diagram of Sarsa (image not reproduced): the black dot represents a state and the circle represents a state-action pair]
Sarsa is on-policy because, while estimating $q_\pi$, the action $A_{t+1}$ is selected by the same policy $\pi$ that generates the behaviour, so $Q(S_{t+1}, A_{t+1})$ is also the action value under that same policy $\pi$ throughout the process.

The complete Sarsa algorithm can be described by the following pseudo-code:

[Figure: Sarsa (on-policy TD control) pseudo-code (image not reproduced)]
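The pseudo-code image is again not reproduced, so here is a hedged Python sketch of the full Sarsa control loop with an ε-greedy policy. The environment interface (`env.reset()`, `env.step(a)` returning `(next_state, reward, done)`) and the explicit `actions` list are assumptions introduced for illustration, not part of the original post.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, otherwise a greedy one w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """On-policy TD control (Sarsa); gym-like env interface assumed."""
    Q = defaultdict(float)                                   # Q(s, a) initialized to 0
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)  # A_t from the current policy
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            if done:
                target = reward                              # Q of a terminal state is 0
            else:
                # A_{t+1} is chosen by the same epsilon-greedy policy that chose A_t
                next_action = epsilon_greedy(Q, next_state, actions, epsilon)
                target = reward + gamma * Q[(next_state, next_action)]
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            if not done:
                state, action = next_state, next_action
    return Q
```

Note that $A_{t+1}$ is selected with the same ε-greedy policy that selected $A_t$, which is exactly the on-policy property discussed above.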

References

[1] Sutton, R. S., and Barto, A. G. Reinforcement Learning: An Introduction.
[2] https://zhuanlan.zhihu.com/p/59792208


Original article: https://blog.csdn.net/WZX_Hello/article/details/115408182
