5. Supervisor Behaviour

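This section should be read together with the supervisor(3) manual page in STDLIB, where all details about the supervisor behaviour are given.

5.1 Supervision Principles

A supervisor is responsible for starting, stopping, and monitoring its child processes. The basic idea of a supervisor is to keep its child processes alive by restarting them when necessary.

Which child processes to start and monitor is specified by a list of child specifications. The child processes are started in the order specified by this list, and terminated in the reverse order.

5.2 Example

The callback module for a supervisor starting the server from the gen_server Behaviour section can look as follows: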

-module(ch_sup).
-behaviour(supervisor).

-export([start_link/0]).
-export([init/1]).

start_link() ->
    supervisor:start_link(ch_sup, []).

init(_Args) ->
    SupFlags = #{strategy => one_for_one, intensity => 1, period => 5},
    ChildSpecs = [#{id => ch3,
                    start => {ch3, start_link, []},
                    restart => permanent,
                    shutdown => brutal_kill,
                    type => worker,
                    modules => [ch3]}],
    {ok, {SupFlags, ChildSpecs}}.

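The SupFlags variable in the return value from init/1 represents the supervisor flags.

The ChildSpecs variable in the return value from init/1 is a list of child specifications.

5.3 Supervisor Flags

This is the type definition for the supervisor flags: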

sup_flags() = #{strategy => strategy(),         % optional
                intensity => non_neg_integer(), % optional
                period => pos_integer()}        % optional
    strategy() = one_for_all
               | one_for_one
               | rest_for_one
               | simple_one_for_one

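- `strategy` specifies the restart strategy.
- `intensity` and `period` specify the maximum restart intensity.

5.4 Restart Strategy

The restart strategy is specified by the strategy key in the supervisor flags map returned by the callback function init:

SupFlags = #{strategy => Strategy, ...}

The strategy key is optional in this map. If it is not given, it defaults to one_for_one.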

one_for_one

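If a child process terminates, only that process is restarted.

Figure 5.1: One_For_One Supervision

one_for_all

If a child process terminates, all other child processes are terminated, and then all child processes, including the terminated one, are restarted.

Figure 5.2: One_For_All Supervision

rest_for_one

If a child process terminates, the rest of the child processes (that is, the child processes after the terminated process in start order) are terminated. Then the terminated child process and the rest of the child processes are restarted.

simple_one_for_one

See section 5.10 on simple_one_for_one supervisors.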
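The first three strategies can be summarised as a small pure function. The following is only an illustrative sketch: the module name restart_sketch and the function to_restart/3 are made up for this example and are not part of OTP, and the sketch ignores the restart type of each child.

```erlang
-module(restart_sketch).
-export([to_restart/3]).

%% Children is the list of child ids in start order and Dead is the id
%% of the child that terminated. The result is the list of ids that the
%% given strategy restarts.
to_restart(one_for_one, _Children, Dead) ->
    [Dead];
to_restart(one_for_all, Children, _Dead) ->
    Children;
to_restart(rest_for_one, Children, Dead) ->
    %% Keep Dead and every child that was started after it.
    lists:dropwhile(fun(Id) -> Id =/= Dead end, Children).
```

For children started in the order a, b, c, a crash of b would affect only b under one_for_one, all three under one_for_all, and b and c under rest_for_one.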

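5.5 Maximum Restart Intensity

Supervisors have a built-in mechanism to limit the number of restarts that can occur in a given time interval. This is specified by the two keys intensity and period in the supervisor flags map returned by the callback function init:

SupFlags = #{intensity => MaxR, period => MaxT, ...}

If more than MaxR restarts occur in the last MaxT seconds, the supervisor terminates all child processes and then itself. The termination reason for the supervisor itself in that case will be shutdown.

When the supervisor terminates, the next higher-level supervisor takes some action. It either restarts the terminated supervisor or terminates itself.

The intention of the restart mechanism is to prevent a situation where a process repeatedly dies for the same reason, only to be restarted again.

The keys intensity and period are optional in the supervisor flags map. If they are not given, they default to 1 and 5, respectively.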
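The rule "more than MaxR restarts in the last MaxT seconds" can be pictured as a sliding window over restart timestamps. The sketch below is an assumption-laden illustration, not the code used by the supervisor module; the names intensity_sketch and add_restart/4 are hypothetical, and timestamps are plain integers in seconds.

```erlang
-module(intensity_sketch).
-export([add_restart/4]).

%% Restarts holds the timestamps (in seconds) of earlier restarts.
%% Returns {ok, Recent} if the new restart at time Now is allowed, or
%% shutdown if more than MaxR restarts fall within the last MaxT seconds.
add_restart(Now, Restarts, MaxR, MaxT) ->
    %% Forget restarts that are older than the period.
    Recent = [T || T <- [Now | Restarts], Now - T =< MaxT],
    case length(Recent) > MaxR of
        true  -> shutdown;
        false -> {ok, Recent}
    end.
```

With the default values 1 and 5, a first restart at time 0 is accepted, a second one at time 3 yields shutdown, while a second one at time 10 is accepted because the first has left the window.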

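Tuning the intensity and period

The default values are 1 restart per 5 seconds. This was chosen to be safe for most systems, even with deep supervision hierarchies, but you will probably want to tune the settings for your particular use case.

First, the intensity decides how big a burst of restarts you want to tolerate. For example, you might want to accept a burst of at most 5 or 10 attempts, even within the same second, if it results in a successful restart.

Second, you need to consider the sustained failure rate, if crashes keep happening but not often enough to make the supervisor give up. If you set intensity to 10 and set the period as low as 1, the supervisor will allow child processes to keep restarting up to 10 times per second, filling your logs with crash reports until someone intervenes manually.

You should therefore set the period to be long enough that you can accept the supervisor keeping up that rate. For example, if you have picked an intensity value of 5, then setting the period to 30 seconds gives you at most one restart per 6 seconds over any longer stretch of time, which means that your logs will not fill up too quickly, and you will have a chance to observe the failures and apply a fix.

These choices depend a lot on your problem domain. If you do not have real-time monitoring and the ability to fix problems quickly, for example in an embedded system, you might want to accept at most one restart per minute before the supervisor gives up and escalates to the next level to try to clear the error automatically. On the other hand, if it is more important to keep trying even at a high failure rate, you might want a sustained rate of as much as 1-2 restarts per second.

Avoiding common mistakes:

- Do not forget to consider the burst rate. If you set intensity to 1 and period to 6, it gives the same sustained error rate as 5/30 or 10/60, but does not allow even 2 restart attempts in quick succession. This is probably not what you wanted.
- Do not set the period to a very high value if you want to tolerate bursts. If you set intensity to 5 and period to 3600 (one hour), the supervisor allows a short burst of 5 restarts, but then gives up if it sees a single further restart almost an hour later. You probably want to regard those crashes as separate incidents, so setting the period to 5 or 10 minutes is more reasonable.
- If your application has multiple levels of supervision, do not simply set the restart intensities to the same values on all levels. Keep in mind that the total number of restarts (before the top-level supervisor gives up and terminates the application) is the product of the intensity values of all the supervisors above the failing child process. For example, if the top level allows 10 restarts, and the next level also allows 10, a crashing child below that level is restarted 100 times, which is probably excessive. Allowing at most 3 restarts for the top-level supervisor might be a better choice in this case.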
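The arithmetic behind these guidelines is simple enough to check by hand, but a couple of helper functions make it explicit. This is a sketch only; the module tuning_sketch and both function names are invented for illustration.

```erlang
-module(tuning_sketch).
-export([seconds_per_restart/2, total_restarts/1]).

%% Sustained rate, expressed as the average number of seconds between
%% restarts that the supervisor tolerates over a long time.
seconds_per_restart(Intensity, Period) when Intensity > 0 ->
    Period / Intensity.

%% Total number of restarts of a failing child, following the rule of
%% thumb above: the product of the intensity values of all supervisors
%% above it.
total_restarts(Intensities) ->
    lists:foldl(fun(I, Acc) -> I * Acc end, 1, Intensities).
```

Here seconds_per_restart(5, 30), seconds_per_restart(1, 6), and seconds_per_restart(10, 60) all give 6.0, which is why the burst size has to be considered separately, and total_restarts([10, 10]) gives 100.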

5.6 Child Specification

The type definition for a child specification is as follows:

child_spec() = #{id => child_id(),       % mandatory
                 start => mfargs(),      % mandatory
                 restart => restart(),   % optional
                 shutdown => shutdown(), % optional
                 type => worker(),       % optional
                 modules => modules()}   % optional
    child_id() = term()
    mfargs() = {M :: module(), F :: atom(), A :: [term()]}
    modules() = [module()] | dynamic
    restart() = permanent | transient | temporary
    shutdown() = brutal_kill | timeout()
    worker() = worker | supervisor

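id is used to identify the child specification internally by the supervisor. The id key is mandatory. Note that this identifier is occasionally called "name". As far as possible, the terms "identifier" or "id" are now used, but to keep backward compatibility, some occurrences of "name" can still be found, for example in error messages.
start defines the function call used to start the child process. It is a module-function-arguments tuple used as apply(M, F, A). It is to be (or result in) a call to any of the following: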

- `supervisor:start_link`
- `gen_server:start_link`
- `gen_statem:start_link`
- `gen_event:start_link`
- A function compliant with these functions. For details, see the `supervisor(3)` manual page.

The start key is mandatory.

restart defines when a terminated child process is to be restarted.

- A `permanent` child process is always restarted.
- A `temporary` child process is never restarted (not even when the supervisor restart strategy is `rest_for_one` or `one_for_all` and a sibling death causes the temporary process to be terminated).
- A `transient` child process is restarted only if it terminates abnormally, that is, with an exit reason other than `normal`, `shutdown`, or `{shutdown,Term}`.

The restart key is optional. If it is not given, the default value permanent is used.

shutdown defines how a child process is to be terminated.

-  `brutal_kill` means that the child process is unconditionally terminated using `exit(Child, kill)`.
- An integer time-out value means that the supervisor tells the child process to terminate by calling `exit(Child, shutdown)` and then waits for an exit signal back. If no exit signal is received within the specified time, the child process is unconditionally terminated using `exit(Child, kill)`.
- If the child process is another supervisor, it is to be set to `infinity` to give the subtree enough time to shut down. It is also allowed to set it to `infinity`, if the child process is a worker. See the warning below:

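Warning

Be careful when setting the shutdown time to infinity when the child process is a worker. In this situation, the termination of the supervision tree depends on the child process; it must be implemented in a safe way and its cleanup procedure must always return.

The shutdown key is optional. If it is not given, and the child is of type worker, the default value 5000 is used; if the child is of type supervisor, the default value infinity is used.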
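The shutdown alternatives described above can be approximated outside a supervisor with a monitor and exit/2. The following sketch is an approximation for illustration and not the supervisor's own code; shutdown_sketch and stop/2 are hypothetical names.

```erlang
-module(shutdown_sketch).
-export([stop/2]).

%% Stops Child according to a shutdown value and returns the exit
%% reason that was observed.
stop(Child, brutal_kill) ->
    Ref = erlang:monitor(process, Child),
    exit(Child, kill),
    receive {'DOWN', Ref, process, Child, Reason} -> Reason end;
stop(Child, Timeout) ->
    %% Timeout is a number of milliseconds or the atom infinity.
    Ref = erlang:monitor(process, Child),
    exit(Child, shutdown),
    receive
        {'DOWN', Ref, process, Child, Reason} -> Reason
    after Timeout ->
        %% The child did not exit in time: kill it unconditionally.
        exit(Child, kill),
        receive {'DOWN', Ref, process, Child, Reason} -> Reason end
    end.
```

A child that does not trap exits terminates with reason shutdown right away. A child that traps exits and never acts on the request is killed once the time-out expires, and with infinity the caller would wait forever, which is the risk the warning above describes.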

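type specifies whether the child process is a supervisor or a worker. The type key is optional. If it is not given, the default value worker is used.
modules is to be a list with one element [Module], where Module is the name of the callback module, if the child process is a supervisor, gen_server, or gen_statem. If the child process is a gen_event, the value is to be dynamic. This information is used by the release handler during upgrades and downgrades; see Release Handling. The modules key is optional. If it is not given, it defaults to [M], where M comes from the child's start {M,F,A}.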
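The defaults for the optional keys can be collected in one place. The function below is a sketch of the rules stated in this section, written for illustration; child_spec_sketch and with_defaults/1 are hypothetical names and not an OTP API.

```erlang
-module(child_spec_sketch).
-export([with_defaults/1]).

%% Fills in the optional keys of a child specification map using the
%% defaults described in the text. The id and start keys are mandatory.
with_defaults(#{id := _, start := {M, _F, _A}} = Spec) ->
    Type = maps:get(type, Spec, worker),
    Shutdown = case Type of
                   worker     -> 5000;
                   supervisor -> infinity
               end,
    Defaults = #{restart  => permanent,
                 shutdown => Shutdown,
                 type     => Type,
                 modules  => [M]},
    %% Values given in Spec take precedence over the defaults.
    maps:merge(Defaults, Spec).
```

A specification that omits id or start does not match the function head, mirroring the fact that those two keys are mandatory.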

Example: The child specification to start the server ch3 in the previous example looks as follows:

#{id => ch3,
  start => {ch3, start_link, []},
  restart => permanent,
  shutdown => brutal_kill,
  type => worker,
  modules => [ch3]}

or simplified, relying on the default values:

#{id => ch3,
  start => {ch3, start_link, []},
  shutdown => brutal_kill}

Example: A child specification to start the event manager from the chapter about gen_event:

#{id => error_man,
  start => {gen_event, start_link, [{local, error_man}]},
  modules => dynamic}

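Both the server and the event manager are registered processes which can be expected to be always accessible. Thus they are specified to be permanent.

ch3 does not need to do any cleaning up before termination. Thus, no shutdown time is needed, and brutal_kill is sufficient. error_man can need some time for the event handlers to clean up, thus the shutdown time is set to 5000 ms (which is the default value).

Example: A child specification to start another supervisor: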

#{id => sup,
  start => {sup, start_link, []},
  restart => transient,
  type => supervisor} % will cause default shutdown=>infinity

5.7 Starting a Supervisor

In the previous example, the supervisor is started by calling ch_sup:start_link():

start_link() ->
    supervisor:start_link(ch_sup, []).

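ch_sup:start_link calls function supervisor:start_link/2, which spawns and links to a new process, a supervisor.

The first argument, ch_sup, is the name of the callback module, that is, the module where the init callback function is located.
The second argument, [], is a term that is passed as is to the callback function init. Here, init does not need any input data and ignores the argument.

In this case, the supervisor is not registered. Instead its pid must be used. A name can be specified by calling supervisor:start_link({local, Name}, Module, Args) or supervisor:start_link({global, Name}, Module, Args).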
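As a sketch of the registered variant, the module below registers the supervisor locally under its own module name. The module name ch_sup_named is made up for this example, and the child list is left empty only to keep the sketch self-contained.

```erlang
-module(ch_sup_named).
-behaviour(supervisor).

-export([start_link/0]).
-export([init/1]).

start_link() ->
    %% Register the supervisor locally under the module name, so that
    %% it can be addressed by name instead of by pid.
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init(_Args) ->
    %% Default supervisor flags and no static children.
    {ok, {#{}, []}}.
```

After ch_sup_named:start_link() has returned {ok, Pid}, the same process can be reached by name, for example with supervisor:which_children(ch_sup_named).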

The new supervisor process calls the callback function ch_sup:init([]). init is to return {ok, {SupFlags, ChildSpecs}}:

init(_Args) ->
    SupFlags = #{},
    ChildSpecs = [#{id => ch3,
                    start => {ch3, start_link, []},
                    shutdown => brutal_kill}],
    {ok, {SupFlags, ChildSpecs}}.

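The supervisor then starts all its child processes according to the child specifications in the start specification. In this case there is one child process, ch3.

supervisor:start_link is synchronous. It does not return until all child processes have been started.

5.8 Adding a Child Process

In addition to the static supervision tree, dynamic child processes can be added to an existing supervisor with the following call: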

supervisor:start_child(Sup, ChildSpec)

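Sup is the pid, or name, of the supervisor. ChildSpec is a child specification.

Child processes added using start_child/2 behave in the same way as the other child processes, with one important exception: if a supervisor dies and is recreated, then all child processes that were dynamically added to the supervisor are lost.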
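A minimal, self-contained sketch of adding a child dynamically follows. The module dyn_sup_demo, the worker function start_worker/0, and the child id w1 are all hypothetical; the worker is a bare linked process rather than a gen_server, which is enough for illustration.

```erlang
-module(dyn_sup_demo).
-behaviour(supervisor).

-export([start_link/0, start_worker/0, demo/0]).
-export([init/1]).

start_link() ->
    supervisor:start_link(?MODULE, []).

init(_Args) ->
    %% No static children; they are added later with start_child/2.
    {ok, {#{strategy => one_for_one}, []}}.

%% Hypothetical worker: a linked process that waits until told to stop.
start_worker() ->
    {ok, spawn_link(fun() -> receive stop -> ok end end)}.

demo() ->
    {ok, Sup} = start_link(),
    Spec = #{id => w1, start => {?MODULE, start_worker, []}},
    {ok, Child} = supervisor:start_child(Sup, Spec),
    [{w1, Child, worker, [?MODULE]}] = supervisor:which_children(Sup),
    %% A second specification with the same id is rejected.
    {error, {already_started, Child}} = supervisor:start_child(Sup, Spec),
    ok.
```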

5.9 Stopping a Child Process

Any child process, static or dynamic, can be stopped in accordance with the shutdown specification:

supervisor:terminate_child(Sup, Id)

The child specification for a stopped child process is deleted with the following call:

supervisor:delete_child(Sup, Id)

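Sup is the pid, or name, of the supervisor. Id is the value associated with the id key in the child specification.

As with dynamically added child processes, the effects of deleting a static child process are lost if the supervisor itself restarts.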
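The order of the two calls matters: a child has to be stopped before its specification can be deleted. The sketch below illustrates this with one static child; stop_child_demo, start_worker/0, and the id w1 are names invented for the example.

```erlang
-module(stop_child_demo).
-behaviour(supervisor).

-export([start_link/0, start_worker/0, demo/0]).
-export([init/1]).

start_link() ->
    supervisor:start_link(?MODULE, []).

init(_Args) ->
    ChildSpecs = [#{id => w1, start => {?MODULE, start_worker, []}}],
    {ok, {#{}, ChildSpecs}}.

%% Hypothetical worker: a linked process that waits until told to stop.
start_worker() ->
    {ok, spawn_link(fun() -> receive stop -> ok end end)}.

demo() ->
    {ok, Sup} = start_link(),
    %% The specification of a running child cannot be deleted.
    {error, running} = supervisor:delete_child(Sup, w1),
    ok = supervisor:terminate_child(Sup, w1),
    %% The specification is kept, but there is no child process now.
    [{w1, undefined, worker, [?MODULE]}] = supervisor:which_children(Sup),
    ok = supervisor:delete_child(Sup, w1),
    [] = supervisor:which_children(Sup),
    ok.
```

Between terminate_child/2 and delete_child/2, the stopped child could instead be started again with supervisor:restart_child(Sup, w1).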

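5.10 Simplified one_for_one Supervisors

A supervisor with restart strategy simple_one_for_one is a simplified one_for_one supervisor, where all child processes are dynamically added instances of the same process.

The following is an example of a callback module for a simple_one_for_one supervisor: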

-module(simple_sup).
-behaviour(supervisor).

-export([start_link/0]).
-export([init/1]).

start_link() ->
    supervisor:start_link(simple_sup, []).

init(_Args) ->
    SupFlags = #{strategy => simple_one_for_one,
                 intensity => 0,
                 period => 1},
    ChildSpecs = [#{id => call,
                    start => {call, start_link, []},
                    shutdown => brutal_kill}],
    {ok, {SupFlags, ChildSpecs}}.

When started, the supervisor does not start any child processes. Instead, all child processes are added dynamically by calling:

supervisor:start_child(Sup, List)

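Sup is the pid, or name, of the supervisor. List is an arbitrary list of terms, which are added to the list of arguments specified in the child specification. If the start function is specified as {M, F, A}, the child process is started by calling apply(M, F, A++List).

For example, adding a child to simple_sup above: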

supervisor:start_child(Pid, [id1])

The result is that the child process is started by calling apply(call, start_link, []++[id1]), or actually:

call:start_link(id1)

A child under a simple_one_for_one supervisor can be terminated with the following:

supervisor:terminate_child(Sup, Pid)

Sup is the pid, or name, of the supervisor and Pid is the pid of the child.

Because a simple_one_for_one supervisor can have many children, it shuts them all down asynchronously. This means that the children will do their cleanup in parallel and therefore the order in which they are stopped is not defined.
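The whole life cycle can be tried out with a self-contained variant of simple_sup. In this sketch, simple_sup_demo and start_call/1 are hypothetical names; start_call/1 stands in for call:start_link/1, which is not shown in this section, and starts a bare linked process.

```erlang
-module(simple_sup_demo).
-behaviour(supervisor).

-export([start_link/0, start_call/1, demo/0]).
-export([init/1]).

start_link() ->
    supervisor:start_link(?MODULE, []).

init(_Args) ->
    SupFlags = #{strategy => simple_one_for_one,
                 intensity => 0,
                 period => 1},
    ChildSpecs = [#{id => call,
                    start => {?MODULE, start_call, []},
                    shutdown => brutal_kill}],
    {ok, {SupFlags, ChildSpecs}}.

%% Hypothetical stand-in for call:start_link/1.
start_call(Id) ->
    {ok, spawn_link(fun() -> receive {stop, Id} -> ok end end)}.

demo() ->
    {ok, Sup} = start_link(),
    %% No child processes exist until start_child/2 is called.
    [] = supervisor:which_children(Sup),
    %% Each call results in apply(?MODULE, start_call, [] ++ [IdN]).
    {ok, C1} = supervisor:start_child(Sup, [id1]),
    {ok, C2} = supervisor:start_child(Sup, [id2]),
    2 = proplists:get_value(workers, supervisor:count_children(Sup)),
    %% Children are addressed by pid, not by id.
    ok = supervisor:terminate_child(Sup, C1),
    [{undefined, C2, worker, [?MODULE]}] = supervisor:which_children(Sup),
    ok.
```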

5.11 Stopping

Since the supervisor is part of a supervision tree, it is automatically terminated by its supervisor. When asked to shut down, it terminates all child processes in reversed start order according to the respective shutdown specifications, and then terminates itself.


Last updated: 2019-03-06