“TD learning” stands for “temporal-difference learning”, a combination of Monte Carlo (MC) ideas and dynamic programming (DP) ideas. Here we focus mainly on the differences between TD and MC. In MC methods, the update target is the actual return, so an update can occur only after an episode has finished; TD methods, whose update targets are available as soon as the next time step ends, are much more direct. To show the difference more visually, the following update rules are given:
$$MC: V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)]$$
$$TD\ methods: V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$$
From the update rules above, the target of the TD update is based entirely on the next time step, so we call it TD(0), or one-step TD; it is a special case of TD($\lambda$).
Here comes the pseudo-code of TD(0):
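As a concrete companion to the pseudo-code, here is a minimal tabular sketch of TD(0) policy evaluation in Python. The `env_step(state, action) -> (reward, next_state, done)` and `policy(state)` interfaces are assumptions for illustration, not part of the original text:

```python
def td0_policy_evaluation(env_step, policy, states, alpha=0.1, gamma=0.9,
                          episodes=1000, start_state=0):
    """Tabular TD(0) prediction: estimate V_pi for a fixed policy.

    env_step(state, action) -> (reward, next_state, done) and
    policy(state) -> action are assumed interfaces.
    """
    V = {s: 0.0 for s in states}  # value table; terminal states stay 0
    for _ in range(episodes):
        s = start_state
        done = False
        while not done:
            a = policy(s)
            r, s_next, done = env_step(s, a)
            # target = R_{t+1} + gamma * V(S_{t+1}); V(terminal) is 0
            target = r + (0.0 if done else gamma * V[s_next])
            # TD(0) update: V(S_t) <- V(S_t) + alpha [target - V(S_t)]
            V[s] += alpha * (target - V[s])
            s = s_next
    return V
```

Note that the update inside the loop happens immediately after each transition, whereas an MC version would have to wait for the episode to end and use the full return $G_t$ as the target.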
On-policy and off-policy methods were originally proposed to address the problem of ensuring that all actions are selected infinitely often in MC methods.
The main difference between the two is whether the policy that generates the training data is the same as the policy being evaluated and improved for actual decision making. Following the description of the every-visit MC method suited to non-stationary environments, we can describe the process of updating an existing estimate as:
$$value \leftarrow value + \alpha \,(target\ value - value)$$
Here, the “target value” and the “value” in the parentheses correspond to the policy being evaluated and the policy used to generate the data. If the two are based on the same policy, the algorithm is on-policy; otherwise, it is off-policy.
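The generic update rule above can be checked with concrete numbers. A minimal sketch (the function name `soft_update` is my own, not from the text):

```python
def soft_update(value, target, alpha):
    """Move an estimate a fraction alpha of the way toward the target."""
    return value + alpha * (target - value)

# With value 0.5, target 1.0 and step size 0.1, the estimate moves
# one tenth of the way toward the target, i.e. to about 0.55.
new_value = soft_update(0.5, 1.0, 0.1)
```

Repeating this update with a fixed target drives the estimate toward that target geometrically, which is exactly the behavior both the MC and TD rules rely on.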
Strictly speaking, an on-policy algorithm can only be trained on data generated by the policy currently being optimized. Once the policy has been updated with a batch of data, the “currently optimizing” policy changes immediately and becomes the new rule for generating the next batch, which trains only the next step. With the data-generating policy constantly shifting, an on-policy algorithm approaches its goal by choosing the best action step by step; however, this “best” may be a local optimum rather than a global one.
On the contrary, in off-policy algorithms the policy used to generate samples is not the same as the policy being evaluated. The latter is usually held fixed as the target (say, “the best so far”), while the policy actually used to generate data cannot always be the best. From this we can conclude that learning from stored experience is essentially off-policy.
Here is the explanation from the textbook [1]:
On-policy methods attempt to evaluate or improve a policy that is used to make decisions, whereas off-policy methods evaluate or improve a policy different from that used to generate the data.
Sarsa is an on-policy TD control method, which views an episode as an alternating sequence of states and state-action pairs. Notice that the unit of transition considered here is the state-action pair rather than the state alone. The rule for updating action values can be formulated as:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$$
It’s obvious that the update occurs after every transition from a non-terminal state $S_t$. Once $S_{t+1}$ is the terminal state, $Q(S_{t+1}, A_{t+1})$ is defined as zero.
In brief, the core of the rule introduced above is the quintuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, which describes a transition from one state-action pair to the next and gives the algorithm its name. Its backup diagram is as follows:
The on-policy attribute of Sarsa is embodied in the fact that, in the process of estimating $q_\pi$, the next action $A_{t+1}$ is selected by the same policy $\pi$, so $Q(S_{t+1}, A_{t+1})$ is also the Q-value under $\pi$, which remains the policy being followed throughout the process.
The complete process of the Sarsa algorithm can be described by the following pseudo-code:
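Alongside the pseudo-code, here is a minimal tabular Sarsa sketch in Python. The `env_step(state, action) -> (reward, next_state, done)` interface and the $\epsilon$-greedy helper are assumptions for illustration:

```python
import random
from collections import defaultdict

def sarsa(env_step, actions, alpha=0.5, gamma=0.9, epsilon=0.1,
          episodes=500, start_state=0):
    """Tabular Sarsa (on-policy TD control).

    env_step(state, action) -> (reward, next_state, done) is an
    assumed environment interface; actions lists the available actions.
    """
    Q = defaultdict(float)  # Q[(state, action)]; unseen pairs default to 0

    def eps_greedy(s):
        # behavior policy == target policy: epsilon-greedy w.r.t. Q
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = start_state
        a = eps_greedy(s)
        done = False
        while not done:
            r, s2, done = env_step(s, a)
            # A_{t+1} is chosen by the same policy that acts (on-policy);
            # Q(S_{t+1}, A_{t+1}) is defined as 0 when S_{t+1} is terminal
            a2 = eps_greedy(s2) if not done else None
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```

The update consumes exactly the quintuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ discussed above; replacing $Q(S_{t+1}, A_{t+1})$ with $\max_a Q(S_{t+1}, a)$ would instead give the off-policy variant.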
[1]. Sutton, R. S., & Barto, A. G. Reinforcement Learning: An Introduction.
[2]. https://zhuanlan.zhihu.com/p/59792208