【Linux】《How Linux Works》第八章：进程和资源利用的近距离观察

阿东

发布于 2024-04-22 17:44:46

Chapter 8. A Closer Look at Processes and Resource Utilization（第 8 章 进程和资源利用的近距离观察）

This chapter takes you deeper into the relationships between processes, the kernel, and system resources. There are three basic kinds of hardware resources: CPU, memory, and I/O. Processes vie for these resources, and the kernel’s job is to allocate resources fairly. The kernel itself is also a resource—a software resource that processes use to perform tasks such as creating new processes and communicating with other processes. Many of the tools that you see in this chapter are often thought of as performance-monitoring tools. They’re particularly helpful if your system is slowing to a crawl and you’re trying to figure out why. However, you shouldn’t get too distracted by performance; trying to optimize a system that’s already working correctly is often a waste of time. Instead, concentrate on understanding what the tools actually measure, and you’ll gain great insight into how the kernel works.

本章将深入介绍进程、内核和系统资源之间的关系。硬件资源主要有三种：CPU、内存和I/O。

进程争夺这些资源，而内核的工作是公平地分配资源。

内核本身也是一种资源，进程可以使用它来执行任务，如创建新进程和与其他进程通信。本章中的许多工具通常被视为性能监控工具。

如果您的系统变得缓慢，您可以使用这些工具来找出原因。

然而，不要过于关注性能；试图优化一个已经正常工作的系统通常是浪费时间。

相反，应该集中精力理解这些工具实际测量的内容，从而深入了解内核的工作原理。

8.1 Tracking Processes（追踪进程）

You learned how to use ps in 2.16 Listing and Manipulating Processes to list processes running on your system at a particular time. The ps command lists current processes, but it does little to tell you how processes change over time. Therefore, it won’t really help you to determine which process is using too much CPU time or memory.

您已经学会了如何使用ps命令在2.16节“列出和操作进程”中列出系统上运行的进程。

ps命令列出当前进程，但它很少告诉您进程如何随时间变化。

因此，它无法真正帮助您确定哪个进程使用了过多的CPU时间或内存。

The top program is often more useful than ps because it displays the current system status as well as many of the fields in a ps listing, and it updates the display every second. Perhaps most important is that top shows the most active processes (that is, those currently taking up the most CPU time) at the top of its display.

与ps相比，top程序通常更有用，因为它显示当前系统状态以及ps列表中的许多字段，并且每秒更新一次显示。最重要的是，top显示最活跃的进程（即当前占用最多CPU时间的进程）在其显示的顶部。

You can send commands to top with keystrokes. These are some of the most important commands:

您可以使用按键向top发送命令。

以下是一些最重要的命令：

o Spacebar. Updates the display immediately.
o M. Sorts by current resident memory usage.
o T. Sorts by total (cumulative) CPU usage.
o P. Sorts by current CPU usage (the default).
o u. Displays only one user's processes.
o f. Selects different statistics to display.
o ?. Displays a usage summary for all top commands.

Two other utilities for Linux, similar to top, offer an enhanced set of views and features: atop and htop. Most of the extra features are available from other utilities. For example, htop has many of the abilities of the lsof command described in the next section.

与 top 类似，Linux 上的另外两个实用程序提供了一套增强的视图和功能：atop 和 htop。

大多数额外的功能都可以从其他工具中获得。

例如，htop 拥有下一节所述的 lsof 命令的许多功能。

8.2 Finding Open Files with lsof（用 lsof 查找打开的文件）

The lsof command lists open files and the processes using them. Because Unix places a lot of emphasis on files, lsof is among the most useful tools for finding trouble spots. But lsof doesn’t stop at regular files— it can list network resources, dynamic libraries, pipes, and more.

lsof 命令列出打开的文件和使用这些文件的进程。

由于 Unix 非常重视文件，因此 lsof 是查找故障点最有用的工具之一。

但 lsof 并不局限于普通文件，它还能列出网络资源、动态库、管道等。

8.2.1 Reading the lsof Output（读取 lsof 输出）

Running lsof on the command line usually produces a tremendous amount of output. Below is a fragment of what you might see. This output includes open files from the init process as well as a running vi process:

在命令行上运行 lsof 通常会产生大量输出。

下面是你可能看到的一个片段。

该输出包括来自初始进程和正在运行的 vi 进程的打开文件：

$ lsof
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
init 1 root cwd DIR 8,1 4096 2 /
init 1 root rtd DIR 8,1 4096 2 /
init 1 root mem REG 8,1 47040 9705817 /lib/i386-linux-gnu/libnss_files-2.15.so
init 1 root mem REG 8,1 42652 9705821 /lib/i386-linux-gnu/libnss_nis-2.15.so
init 1 root mem REG 8,1 92016 9705833 /lib/i386-linux-gnu/libnsl-2.15.so
--snip--
vi 22728 juser cwd DIR 8,1 4096 14945078 /home/juser/w/c
vi 22728 juser 4u REG 8,1 1288 1056519 /home/juser/w/c/f
--snip--

The output shows the following fields (listed in the top row):

输出显示了以下字段（按照顶部行的顺序列出）：

o COMMAND. The command name for the process that holds the file descriptor.
o PID. The process ID.
o USER. The user running the process.
o FD. This field can contain two kinds of elements. In the output above, the FD column shows the purpose of the file. The FD field can also list the file descriptor of the open file, a number that a process uses together with the system libraries and kernel to identify and manipulate a file.
o TYPE. The file type (regular file, directory, socket, and so on).
o DEVICE. The major and minor number of the device that holds the file.
o SIZE. The file’s size.
o NODE. The file’s inode number.
o NAME. The filename.

o COMMAND：持有文件描述符的进程的命令名称。
o PID：进程ID。
o USER：运行该进程的用户。
o FD：该字段可以包含两种类型的元素。在上面的输出中，FD列显示了文件的用途。FD字段还可以列出打开文件的文件描述符，这是一个进程与系统库和内核一起使用的数字，用于标识和操作文件。
o TYPE：文件类型（普通文件、目录、套接字等）。
o DEVICE：持有文件的设备的主设备号和次设备号。
o SIZE：文件的大小。
o NODE：文件的inode号。
o NAME：文件名。

The lsof(1) manual page contains a full list of what you might see for each field, but you should be able to figure out what you’re looking at just by looking at the output. For example, look at the entries with cwd in the FD field. These lines indicate the current working directories of the processes. Another example is the very last line, which shows a file that the user is currently editing with vi.

lsof(1)手册页包含了每个字段可能出现的完整列表，但是通过查看输出，您应该能够弄清楚您正在查看什么。

例如，查看FD字段为cwd的条目。

这些行指示了进程的当前工作目录。

另一个例子是最后一行，显示了用户当前正在使用vi编辑的文件。

8.2.2 Using lsof（使用 lsof）

There are two basic approaches to running lsof:

运行lsof有两种基本方法：

o List everything and pipe the output to a command like less, and then search for what you’re looking for. This can take a while due to the amount of output generated.
o Narrow down the list that lsof provides with command-line options.

You can use command-line options to provide a filename as an argument and have lsof list only the entries that match the argument. For example, the following command displays entries for open files in /usr:

o 列出所有内容并将输出导入到类似less的命令中，然后搜索你要查找的内容。由于生成的输出量很大，这可能需要一些时间。
o 使用命令行选项缩小lsof提供的列表。

你可以使用命令行选项提供一个文件名作为参数，并让lsof只列出与该参数匹配的条目。例如，下面的命令会显示/usr目录中打开文件的条目：

$ lsof /usr

To list the open files for a particular process ID, run:

要列出特定进程 ID 的打开文件，请运行

$ lsof -p pid
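
As a rough cross-check of what lsof -p reports, the same per-process information can be read from the /proc filesystem. The sketch below is not how lsof itself is implemented; it assumes a Linux system with /proc mounted, and list_open_fds is a hypothetical helper name chosen for illustration.

```shell
# Hypothetical helper: print "FD -> target" for every descriptor a process holds,
# read from /proc/<pid>/fd (compare with the FD and NAME columns of lsof).
# The body runs in a subshell so its variables do not leak into the caller.
list_open_fds() (
    for entry in "/proc/$1/fd"/*; do
        # Entries are symbolic links; targets such as pipes do not exist as paths.
        [ -L "$entry" ] || continue
        printf '%s -> %s\n' "${entry##*/}" "$(readlink "$entry")"
    done
)

# Open /dev/null on descriptor 9 in this shell, then list our own descriptors.
exec 9</dev/null
list_open_fds "$$"
# The cwd entry that lsof shows corresponds to this link.
printf 'cwd -> %s\n' "$(readlink "/proc/$$/cwd")"
exec 9<&-
```

The numeric entries correspond to lsof lines such as 4u in the vi example above, although /proc does not show the access-mode suffix that lsof appends.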

For a brief summary of lsof’s many options, run lsof -h. Most options pertain to the output format. (See Chapter 10 for a discussion of the lsof network features.)

要了解lsof的许多选项的简要概述，请运行lsof -h。大多数选项与输出格式有关。

（有关lsof网络功能的讨论，请参见第10章。）

NOTE lsof is highly dependent on kernel information. If you upgrade your kernel and you’re not routinely updating everything, you might need to upgrade lsof. In addition, if you perform a distribution update to both the kernel and lsof, the updated lsof might not work until you reboot with the new kernel.

注意：lsof高度依赖于内核信息。如果您升级了内核，而且您没有定期更新所有内容，您可能需要升级lsof。此外，如果您同时对内核和lsof进行了发行版更新，则更新后的lsof可能在您使用新内核重新启动之前无法正常工作。

8.3 Tracing Program Execution and System Calls（追踪程序执行和系统调用）

The tools we’ve seen so far examine active processes. However, if you have no idea why a program dies almost immediately after starting up, even lsof won’t help you. In fact, you’d have a difficult time even running lsof concurrently with a failed command.

到目前为止，我们看到的工具都是用于检查活动进程的。

然而，如果您不知道为什么一个程序在启动后几乎立即崩溃，即使是lsof也无法帮助您。

实际上，您甚至很难在命令失败的同时运行lsof。

The strace (system call trace) and ltrace (library trace) commands can help you discover what a program attempts to do. These tools produce extraordinarily large amounts of output, but once you know what to look for, you’ll have more tools at your disposal for tracking down problems.

strace（系统调用跟踪）和 ltrace（库跟踪）命令可以帮助您发现程序试图做什么。

这些工具产生了非常大量的输出，但是一旦您知道要寻找什么，您将拥有更多的工具来追踪问题。

8.3.1 strace

Recall that a system call is a privileged operation that a user-space process asks the kernel to perform, such as opening and reading data from a file. The strace utility prints all the system calls that a process makes. To see it in action, run this command:

请回忆一下，系统调用是用户空间进程向内核请求执行的特权操作，例如打开和读取文件中的数据。strace实用程序打印出进程所进行的所有系统调用。

要看它的实际效果，请运行以下命令：

$ strace cat /dev/null

In Chapter 1, you learned that when one process wants to start another process, it invokes the fork() system call to spawn a copy of itself, and then the copy uses a member of the exec() family of system calls to start running a new program. The strace command begins working on the new process (the copy of the original process) just after the fork() call. Therefore, the first lines of the output from this command should show execve() in action, followed by a memory initialization call, brk(), as follows:

在第 1 章中，我们了解到当一个进程想要启动另一个进程时，它会调用 fork() 系统调用来生成一个自身的副本，然后副本使用 exec() 系列系统调用的一个成员来开始运行一个新程序。

就在 fork() 调用之后，strace 命令开始在新进程（原始进程的副本）上运行。

因此，该命令输出的第一行应显示 execve() 正在运行，随后是内存初始化调用 brk()，如下所示：

execve("/bin/cat", ["cat", "/dev/null"], [/* 58 vars */]) = 0
brk(0) = 0x9b65000

The next part of the output deals primarily with loading shared libraries. You can ignore this unless you really want to know what the shared library system does.

输出的下一部分主要涉及加载共享库。

除非你真的想知道共享库系统是做什么的，否则可以忽略这部分内容。

access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77b5000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
--snip--
open("/lib/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\200^\1"..., 1024) = 1024

In addition, skip past the mmap output until you get to the lines that look like this:

此外，跳过 mmap 输出，直到看到类似这样的行：

fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 6), ...}) = 0
open("/dev/null", O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 3), ...}) = 0
fadvise64_64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
read(3, "", 32768) = 0
close(3) = 0
close(1) = 0
close(2) = 0
exit_group(0) = ?

This part of the output shows the command at work. First, look at the open() call, which opens a file. The 3 is a result that means success (3 is the file descriptor that the kernel returns after opening the file). Below that, you see where cat reads from /dev/null (the read() call, which also has 3 as the file descriptor). Then there’s nothing more to read, so the program closes the file descriptor and exits with exit_group().

这部分输出显示了命令的运行情况。

首先看打开文件的 open() 调用。

3 代表成功的结果（3 是打开文件后内核返回的文件描述符）。

下面是 cat 从 /dev/null 读取的内容（read()调用，文件描述符也是 3）。

然后就没什么可读取的了，所以程序关闭了文件描述符，并通过 exit_group() 退出。

What happens when there’s a problem? Try strace cat not_a_file instead and examine the open() call in the resulting output:

出现问题时会怎样？

试试 strace cat not_a_file，然后检查输出结果中的 open() 调用：

open("not_a_file", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)

Because open() couldn’t open the file, it returned -1 to signal an error. You can see that strace reports the exact error and gives you a small description of the error.

由于 open() 无法打开文件，它返回了-1 表示出错。

你可以看到，strace 报告了确切的错误，并给出了错误的一小段描述。
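
Because failed calls are usually what you are hunting for, one way to cut down strace output is to save it to a file and keep only the lines that returned -1. This is a minimal sketch, assuming strace is installed and allowed to trace processes; failed_syscalls is a hypothetical helper name, and on newer systems the failing call typically appears as openat() rather than open().

```shell
# Hypothetical helper: print only the failed system calls (return value -1
# followed by an error name such as ENOENT) from a saved strace log.
failed_syscalls() {
    grep -E ' = -1 E[A-Z0-9]+' "$1"
}

# Trace a command that is expected to fail, then show only the failures.
# The whole demo is skipped if strace is not available.
if command -v strace >/dev/null 2>&1; then
    log=$(mktemp)
    strace -o "$log" cat not_a_file 2>/dev/null
    failed_syscalls "$log"
    rm -f "$log"
fi
```

Not every failed call is a problem: the access() lines for /etc/ld.so.nohwcap in the earlier output also return -1 and are harmless.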

Missing files are the most common problems with Unix programs, so if the system log and other log information aren’t very helpful and you have nowhere else to turn, strace can be of great use. You can even use it on daemons that detach themselves. For example:

文件丢失是 Unix 程序最常见的问题，因此如果系统日志和其他日志信息帮不上什么忙，而你又无处求助，strace 就能派上大用场。

你甚至可以把它用在自行分离的守护进程上。例如

$ strace -o crummyd_strace -ff crummyd

In this example, the -o option tells strace to write its output to a file, and -ff makes it follow every child process that crummyd spawns, logging each one to its own file named crummyd_strace.pid, where pid is the process ID of the child process.

在这个例子中，-o选项让strace把输出写入文件，-ff则让它跟踪crummyd生成的每个子进程，并把每个子进程的操作分别记录到crummyd_strace.pid文件中，其中pid是子进程的进程ID。

8.3.2 ltrace（追踪）

The ltrace command tracks shared library calls. The output is similar to that of strace, which is why we’re mentioning it here, but it doesn’t track anything at the kernel level. Be warned that there are many more shared library calls than system calls. You’ll definitely need to filter the output, and ltrace itself has many built-in options to assist you.

ltrace命令用于跟踪共享库调用。

输出与strace类似，这也是为什么我们在这里提到它的原因，但它不会跟踪内核级别的任何内容。

请注意，共享库调用比系统调用要多得多。

您肯定需要过滤输出，并且ltrace本身有许多内置选项可帮助您。

NOTE See 15.1.4 Shared Libraries for more on shared libraries. The ltrace command doesn’t work on statically linked binaries.

注意：有关共享库的更多信息，请参阅 15.1.4 共享库。ltrace 命令不适用于静态链接的二进制文件。

8.4 Threads（线程）

In Linux, some processes are divided into pieces called threads. A thread is very similar to a process—it has an identifier (TID, or thread ID), and the kernel schedules and runs threads just like processes. However, unlike separate processes, which usually do not share system resources such as memory and I/O connections with other processes, all threads inside a single process share their system resources and some memory.

在Linux中，一些进程被划分为称为线程的片段。

线程与进程非常相似——它有一个标识符（TID，或线程ID），内核会像调度和运行进程一样调度和运行线程。

然而，与通常不与其他进程共享系统资源（如内存和I/O连接）的独立进程不同，单个进程内的所有线程共享其系统资源和一些内存。

8.4.1 Single-Threaded and Multithreaded Processes（单线程和多线程进程）

Many processes have only one thread. A process with one thread is single-threaded, and a process with more than one thread is multithreaded. All processes start out single-threaded. This starting thread is usually called the main thread. The main thread may then start new threads in order for the process to become multithreaded, similar to the way a process can call fork() to start a new process.

许多进程只有一个线程。只有一个线程的进程被称为单线程进程，而有多个线程的进程被称为多线程进程。

所有进程最初都是单线程的。这个起始线程通常被称为主线程。

然后，主线程可以启动新线程，使进程变为多线程，类似于进程可以调用fork()来启动一个新进程。

NOTE It’s rare to refer to threads at all when a process is single-threaded. This book will not mention threads unless multithreaded processes make a difference in what you see or experience.

注意：当进程是单线程的时候，很少提到线程。除非多线程进程会对你所见或体验的内容产生影响，本书不会提到线程。

The primary advantage of a multithreaded process is that when the process has a lot to do, threads can run simultaneously on multiple processors, potentially speeding up computation. Although you can also achieve simultaneous computation with multiple processes, threads start faster than processes, and it is often easier and/or more efficient for threads to intercommunicate using their shared memory than it is for processes to communicate over a channel such as a network connection or a pipe.

多线程进程的主要优势在于，当进程有很多事情要做时，线程可以在多个处理器上同时运行，从而可能加快计算速度。

虽然你也可以通过多个进程实现同时计算，但是线程比进程启动更快，而且线程使用共享内存进行相互通信通常更容易和/或更高效，而进程之间的通信则需要使用网络连接或管道等通道。

Some programs use threads to overcome problems managing multiple I/O resources. Traditionally, a process would sometimes use fork() to start a new subprocess in order to deal with a new input or output stream. Threads offer a similar mechanism without the overhead of starting a new process.

一些程序使用线程来解决管理多个I/O资源的问题。

传统上，一个进程有时会使用fork()来启动一个新的子进程，以处理新的输入或输出流。

线程提供了一种类似的机制，但不需要启动一个新进程的开销。

8.4.2 Viewing Threads（查看线程）

By default, the output from the ps and top commands shows only processes. To display the thread information in ps, add the m option. Here is some sample output:

默认情况下，ps 和 top 命令的输出只显示进程。要在 ps 中显示线程信息，请添加 m 选项。下面是一些输出示例：

Example 8-1. Viewing threads with ps m

例 8-1. 使用 ps m 查看线程

$ ps m
  PID TTY      STAT   TIME COMMAND
 3587 pts/3    -      0:00 bash
    - -        Ss     0:00 -
 3592 pts/4    -      0:00 bash
    - -        Ss     0:00 -
12287 pts/8    -      0:54 /usr/bin/python /usr/bin/gm-notify
    - -        SLl    0:48 -
    - -        SLl    0:00 -
    - -        SLl    0:06 -
    - -        SLl    0:00 -

Example 8-1 shows processes along with threads. Each line with a number in the PID column (3587, 3592, and 12287) represents a process, as in the normal ps output. The lines with the dashes in the PID column represent the threads associated with the process. In this output, processes 3587 and 3592 have only one thread each, but process 12287 is multithreaded with four threads.

例 8-1 显示了进程和线程。

PID 列中带有数字的每一行（3587、3592 和 12287）代表一个进程，与正常的 ps 输出一样。

PID 列中为破折号的行代表与该进程相关的线程。

在此输出中，进程 3587 和 3592 各只有一个线程，而进程 12287 是多线程的，有四个线程。

If you would like to view the thread IDs with ps, you can use a custom output format. This example shows only the process IDs, thread IDs, and command:

如果想用 ps 查看线程 ID，可以使用自定义输出格式。本例只显示了进程 ID、线程 ID 和命令：

Example 8-2. Showing process IDs and thread IDs with ps m

例 8-2. 用 ps m 显示进程 ID 和线程 ID

$ ps m -o pid,tid,command
 PID TID COMMAND
3587 - bash
 - 3587 -
3592 - bash
 - 3592 -
12287 - /usr/bin/python /usr/bin/gm-notify
 - 12287 -
 - 12288 -
 - 12289 -
 - 12295 -

The sample output in Example 8-2 corresponds to the threads shown in Example 8-1. Notice that the thread IDs of the single-threaded processes are identical to the process IDs; this is the main thread. For the multithreaded process 12287, thread 12287 is also the main thread.

在示例8-2中的示例输出对应于示例8-1中显示的线程。

请注意，单线程进程的线程ID与进程ID相同，这是主线程。

对于多线程进程12287，线程12287也是主线程。
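
The relationship between process IDs and thread IDs can also be observed without ps: on Linux, each thread of a process appears as a directory under /proc/PID/task. The following is a small sketch under that assumption; list_threads is a hypothetical helper name.

```shell
# Hypothetical helper: print the thread IDs (TIDs) of a process, one per line,
# by listing the directories under /proc/<pid>/task.
# The body runs in a subshell so its variables stay private.
list_threads() (
    for task in "/proc/$1/task"/*; do
        [ -d "$task" ] && echo "${task##*/}"
    done
)

# The shell is single-threaded, so its only TID equals its PID (the main thread).
echo "shell PID: $$"
echo "shell TIDs: $(list_threads "$$")"
```

Running list_threads against a multithreaded process such as 12287 in Example 8-2 would be expected to print several IDs, the first of which is the process ID.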

NOTE Normally, you won’t interact with individual threads as you would processes. You need to know a lot about how a multithreaded program was written in order to act on one thread at a time, and even then, doing so might not be a good idea.

注意：通常情况下，您不会像处理进程一样与单个线程进行交互。要逐个线程进行操作，您需要非常了解该多线程程序是如何编写的；即便如此，这样做也可能不是一个好主意。

Threads can confuse things when it comes to resource monitoring because individual threads in a multithreaded process can consume resources simultaneously. For example, top doesn’t show threads by default; you’ll need to press H to turn it on. For most of the resource monitoring tools that you’re about to see, you’ll have to do a little extra work to turn on the thread display.

线程在资源监控方面可能会引起混淆，因为多线程进程中的各个线程可以同时消耗资源。

例如，默认情况下，top不显示线程；您需要按下H键来打开线程显示。

对于即将看到的大多数资源监控工具，您需要做一些额外的工作来打开线程显示。

8.5 Introduction to Resource Monitoring（资源监测简介）

Now we’ll discuss some topics in resource monitoring, including processor (CPU) time, memory, and disk I/O. We’ll examine utilization on a systemwide scale, as well as on a per-process basis.

现在，我们将讨论资源监控中的一些主题，包括处理器（CPU）时间、内存和磁盘 I/O。

我们将检查整个系统和每个进程的利用率。

Many people touch the inner workings of the Linux kernel in the interest of improving performance. However, most Linux systems perform well under a distribution’s default settings, and you can spend days trying to tune your machine’s performance without meaningful results, especially if you don’t know what to look for. So rather than think about performance as you experiment with the tools in this chapter, think about seeing the kernel in action as it divides resources among processes.

为了提高性能，很多人都会接触 Linux 内核的内部工作原理。

然而，大多数 Linux 系统在发行版的默认设置下性能良好，你可能要花费数天时间来调整机器的性能，却得不到有意义的结果，尤其是如果你不知道要注意什么的话。

因此，在使用本章中的工具进行实验时，与其考虑性能，不如看看内核在进程间分配资源时的运行情况。

8.6 Measuring CPU Time（测量 CPU 时间）

To monitor one or more specific processes over time, use the -p option to top, with this syntax:

要在一段时间内监控一个或多个特定进程，请使用 top 的 -p 选项，语法如下：

$ top -p pid1 [-p pid2 ...]
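
To get a feel for what top is sampling, the sketch below reads a process's accumulated CPU time directly from /proc/PID/stat before and after some work. It assumes the Linux stat layout described in proc(5), where fields 14 and 15 are user and system time in clock ticks; cpu_ticks is a hypothetical helper name, and this is not a description of how top is implemented.

```shell
# Hypothetical helper: print user + system CPU time of a process in clock ticks.
# The body runs in a subshell so that "set --" does not disturb the caller.
cpu_ticks() (
    stat=$(cat "/proc/$1/stat") || exit 1
    # Drop "pid (comm) "; the command name may itself contain spaces.
    rest=${stat##*') '}
    set -- $rest
    # After the cut, utime (field 14) is $12 and stime (field 15) is $13.
    echo $(( ${12} + ${13} ))
)

pid=$$
before=$(cpu_ticks "$pid")
# Burn a little CPU time in this shell.
i=0
while [ "$i" -lt 200000 ]; do i=$((i + 1)); done
after=$(cpu_ticks "$pid")
echo "pid $pid used $((after - before)) clock ticks ($(getconf CLK_TCK) ticks per second)"
```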

To find out how much CPU time a command uses during its lifetime, use time. Most shells have a built-in time command that doesn’t provide extensive statistics, so you’ll probably need to run /usr/bin/time. For example, to measure the CPU time used by ls, run

要想知道一条命令在其生命周期内占用了多少 CPU 时间，可以使用 time。

大多数 shell 都有一个内置的 time 命令，但并不提供大量的统计数据，所以你可能需要运行 /usr/bin/time。

例如，要测量 ls 占用的 CPU 时间，运行

$ /usr/bin/time ls

After ls terminates, time should print output like that below. The key fields are the user, system, and elapsed times:

ls 终止后，time 打印的输出应如下所示。关键字段是 user、system 和 elapsed 三项时间：

0.05user 0.09system 0:00.44elapsed 31%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (125major+51minor)pagefaults 0swaps

o User time. The number of seconds that the CPU has spent running the program’s own code. On modern processors, some commands run so quickly, and therefore the CPU time is so low, that time rounds down to zero.

用户时间。CPU花费在运行程序自身代码的秒数。

在现代处理器上，某些命令运行得非常快，CPU时间因此非常低，以至于 time 会将其向下舍入为零。

o System time. How much time the kernel spends doing the process’s work (for example, reading files and directories).

系统时间。内核花费在执行进程工作的时间（例如，读取文件和目录）。

o Elapsed time. The total time it took to run the process from start to finish, including the time that the CPU spent doing other tasks. This number is normally not very useful for performance measurement, but subtracting the user and system time from elapsed time can give you a general idea of how long a process spends waiting for system resources.

The remainder of the output primarily details memory and I/O usage. You’ll learn more about the page fault output in 8.9 Memory.

经过的时间。从开始到结束运行进程所花费的总时间，包括CPU花费在其他任务上的时间。

这个数字通常对性能测量没有太大用处，但从经过的时间中减去用户时间和系统时间可以让你大致了解进程等待系统资源的时间。

输出的其余部分主要详细说明了内存和I/O使用情况。

你将在8.9内存中了解更多关于页面错误输出的内容。
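
The subtraction described for elapsed time can be worked through with the numbers from the sample output above (0.44 elapsed, 0.05 user, 0.09 system). This is only an arithmetic sketch; summarize_time is a hypothetical helper name.

```shell
# Hypothetical helper: given elapsed, user, and system seconds, print the time
# spent waiting (elapsed minus CPU time) and the CPU share as a whole percent.
summarize_time() {
    awk -v e="$1" -v u="$2" -v s="$3" 'BEGIN {
        printf "waiting=%.2fs cpu=%d%%\n", e - u - s, int(100 * (u + s) / e)
    }'
}

# Values taken from the sample /usr/bin/time output for ls.
summarize_time 0.44 0.05 0.09   # prints: waiting=0.30s cpu=31%
```

The 31 percent result agrees with the 31%CPU figure in the sample output: the process used the CPU for 0.14 of its 0.44 seconds and spent roughly 0.30 seconds waiting.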

8.7 Adjusting Process Priorities（调整进程优先级）

You can change the way the kernel schedules a process in order to give the process more or less CPU time than other processes. The kernel runs each process according to its scheduling priority, which is a number between –20 and 20, with –20 being the foremost priority. (Yes, this can be confusing.)

您可以改变内核调度进程的方式，使该进程获得比其他进程更多或更少的 CPU 时间。

内核会根据每个进程的调度优先级来运行进程，调度优先级是一个介于 -20 和 20 之间的数字，其中 -20 的优先级最高。

(是的，这可能会引起混淆）。

The ps -l command lists the current priority of a process, but it’s a little easier to see the priorities in action with the top command, as shown here:

ps -l 命令会列出进程的当前优先级，但使用top命令更容易看到优先级的实际作用，如下所示：

$ top
Tasks: 244 total,   2 running, 242 sleeping,   0 stopped,   0 zombie
Cpu(s): 31.7%us,  2.8%sy,  0.0%ni, 65.4%id,  0.2%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   6137216k total,  5583560k used,   553656k free,    72008k buffers
Swap:  4135932k total,   694192k used,  3441740k free,   767640k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
28883 bri       20   0 1280m 763m  32m S   58 12.7 213:00.65 chromium-browse
 1175 root      20   0  210m  43m  28m R   44  0.7  14292:35 Xorg
 4022 bri       20   0  413m 201m  28m S   29  3.4   3640:13 chromium-browse
 4029 bri       20   0  378m 206m  19m S    2  3.5  32:50.86 chromium-browse
 3971 bri       20   0  881m 359m  32m S    2  6.0 563:06.88 chromium-browse
 5378 bri       20   0  152m  10m 7064 S    1  0.2  24:30.21 compiz
 3821 bri       20   0  312m  37m  14m S    0  0.6  29:25.57 soffice.bin
 4117 bri       20   0  321m 105m  18m S    0  1.8  34:55.01 chromium-browse
 4138 bri       20   0  331m  99m  21m S    0  1.7 121:44.19 chromium-browse
 4274 bri       20   0  232m  60m  13m S    0  1.0  37:33.78 chromium-browse
 4267 bri       20   0 1102m 844m  11m S    0 14.1  29:59.27 chromium-browse
 2327 bri       20   0  301m  43m  16m S    0  0.7 109:55.65 unity-2d-shell

In the top output above, the PR (priority) column lists the kernel’s current schedule priority for the process. The higher the number, the less likely the kernel is to schedule the process if others need CPU time. The schedule priority alone does not determine the kernel’s decision to give CPU time to a process, and it changes frequently during program execution according to the amount of CPU time that the process consumes.

在上面的输出中，PR（优先级）列显示了内核对进程的当前调度优先级。

数字越高，如果其他进程需要CPU时间，内核调度该进程的可能性就越小。

调度优先级本身并不能决定内核是否将CPU时间分配给进程，并且根据进程消耗的CPU时间，在程序执行过程中频繁变化。

Next to the priority column is the nice value (NI) column, which gives a hint to the kernel’s scheduler. This is what you care about when trying to influence the kernel’s decision. The kernel adds the nice value to the current priority to determine the next time slot for the process.

在优先级列旁边是nice值（NI）列，它向内核的调度器提供了一个提示。

当您想要影响内核的决策时，这是您关心的内容。

内核将nice值添加到当前优先级，以确定进程的下一个时间片。

By default, the nice value is 0. Now, say you’re running a big computation in the background that you don’t want to bog down your interactive session. To have that process take a backseat to other processes and run only when the other tasks have nothing to do, you could change the nice value to 20 with the renice command (where pid is the process ID of the process that you want to change):

默认情况下，nice值为0。现在，假设您在后台运行一个大型计算任务，您不希望它影响您的交互会话。

为了让该进程在其他任务没有任务时才运行，并且让其他进程有更高的优先级，您可以使用renice命令将nice值更改为20（其中pid是您想要更改的进程的进程ID）：

$ renice 20 pid

If you’re the superuser, you can set the nice value to a negative number, but doing so is almost always a bad idea because system processes may not get enough CPU time. In fact, you probably won’t need to alter nice values much because many Linux systems have only a single user, and that user does not perform much real computation. (The nice value was much more important back when there were many users on a single machine.)

如果你是超级用户，可以将 nice 值设置为负数，但这样做几乎总是个坏主意，因为系统进程可能得不到足够的 CPU 时间。

事实上，你可能并不需要过多修改 nice 值，因为许多 Linux 系统只有一个用户，而且该用户并不执行很多实际计算。

(在一台机器上有很多用户的时候，nice 值要重要得多）。
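
Besides changing a running process with renice, a command can be started with a raised nice value from the outset using the standard nice utility, which the text above does not cover. The sketch below assumes the coreutils nice command; run with no arguments it prints the nice value that it inherited.

```shell
# nice with no arguments prints the nice value of the current shell (usually 0).
base=$(nice)

# Start a command with its nice value raised by 10. Here the command is nice
# itself, which simply reports the value it is running with.
low=$(nice -n 10 nice)

echo "shell nice value: $base, child started with nice -n 10: $low"
```

In practice the inner command would be the long-running computation, for example a compile or a batch job, rather than nice itself.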

8.8 Load Averages（负载平均值）

CPU performance is one of the easier metrics to measure. The load average is the average number of processes currently ready to run. That is, it is an estimate of the number of processes that are capable of using the CPU at any given time. When thinking about a load average, keep in mind that most processes on your system are usually waiting for input (from the keyboard, mouse, or network, for example), meaning that most processes are not ready to run and should contribute nothing to the load average. Only processes that are actually doing something affect the load average.

CPU 性能是比较容易衡量的指标之一。

平均负载是指当前准备运行的进程的平均数量。

也就是说，它是对任何给定时间内能够使用 CPU 的进程数量的估计。

在考虑平均负载时，请记住系统中的大多数进程通常都在等待输入（例如来自键盘、鼠标或网络的输入），这意味着大多数进程都没有准备好运行，因此不会对平均负载产生任何影响。

只有实际运行的进程才会影响平均负载。

8.8.1 Using uptime（使用 uptime）

The uptime command tells you three load averages in addition to how long the kernel has been running:

除了内核运行的时间外，uptime 命令还能告诉你三个负载平均值：

$ uptime
... up 91 days, ... load average: 0.08, 0.03, 0.01

The three numbers after load average are the load averages for the past 1 minute, 5 minutes, and 15 minutes, respectively. As you can see, this system isn’t very busy: An average of only 0.01 processes have been running across all processors for the past 15 minutes. In other words, if you had just one processor, it was only running userspace applications for 1 percent of the last 15 minutes. (Traditionally, most desktop systems would exhibit a load average of about 0 when you were doing anything except compiling a program or playing a game. A load average of 0 is usually a good sign, because it means that your processor isn’t being challenged and you’re saving power.)

load average 后面的三个数字分别是过去1分钟、5分钟和15分钟的平均负载。

正如你所见，这个系统并不是很忙：过去15分钟内，所有处理器上平均只有0.01个进程在运行。

换句话说，如果你只有一个处理器，在过去的15分钟内，它只有1%的时间在运行用户空间应用程序。

（传统上，除了编译程序或玩游戏之外，大多数桌面系统的负载平均值约为0。

负载平均值为0通常是一个好的迹象，因为这意味着你的处理器没有受到挑战，同时也节省了能量。）

NOTE User interface components on current desktop systems tend to occupy more of the CPU than those in the past. For example, on Linux systems, a web browser’s Flash plugin can be a particularly notorious resource hog, and Flash applications can easily occupy much of a system’s CPU and memory due to poor all-around implementation.

注意：当前桌面系统上的用户界面组件往往占用的CPU资源比过去多。例如，在Linux系统上，Web浏览器的Flash插件可能是一个特别臭名昭著的资源占用者，由于实现不佳，Flash应用程序很容易占用系统的大部分CPU和内存。

If a load average goes up to around 1, a single process is probably using the CPU nearly all of the time. To identify that process, use the top command; the process will usually rise to the top of the display.

如果平均负载上升到 1 左右，则可能是一个进程几乎一直在使用 CPU。

要识别该进程，请使用 top 命令；该进程通常会出现在显示屏的顶部。

Most modern systems have more than one processor core or CPU, so multiple processes can easily run simultaneously. If you have two cores, a load average of 1 means that only one of the cores is likely active at any given time, and a load average of 2 means that both cores have just enough to do all of the time.

大多数现代系统都有多个处理器核心或CPU，因此多个进程可以轻松同时运行。

如果你有两个核心，负载平均值为1意味着任何给定时间只有一个核心处于活动状态，负载平均值为2意味着两个核心一直有足够的工作量。
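
Since a load average only means something relative to the number of cores, it can help to look at the two together. The sketch below assumes Linux's /proc/loadavg (whose first three fields are the same numbers uptime prints) and the coreutils nproc command.

```shell
# The first three fields of /proc/loadavg are the 1-, 5-, and 15-minute averages.
read -r one five fifteen rest < /proc/loadavg
cores=$(nproc)

echo "load averages: $one (1 min), $five (5 min), $fifteen (15 min) on $cores core(s)"

# A per-core value near 1 suggests that every core has about enough work to stay busy.
awk -v l="$one" -v c="$cores" 'BEGIN { printf "1-minute load per core: %.2f\n", l / c }'
```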

8.8.2 High Loads（高负荷）

A high load average does not necessarily mean that your system is having trouble. A system with enough memory and I/O resources can easily handle many running processes. If your load average is high and your system still responds well, don’t panic: The system just has a lot of processes sharing the CPU. The processes have to compete with each other for processor time, and as a result they’ll take longer to perform their computations than they would if they were each allowed to use the CPU all of the time. Another case where you might see a high load average as normal is a web server, where processes can start and terminate so quickly that the load average measurement mechanism can’t function effectively.

一个高负载平均值并不一定意味着您的系统出现了问题。

具有足够内存和I/O资源的系统可以轻松处理许多运行中的进程。

如果您的负载平均值很高，但系统仍然响应良好，不要惊慌：系统只是有很多进程共享CPU。

这些进程必须相互竞争处理器时间，因此它们执行计算的时间比如果它们每个都被允许始终使用CPU要长。

另一个可能正常情况下看到高负载平均值的情况是Web服务器，在这种情况下，进程可以快速启动和终止，以至于负载平均值测量机制无法有效运作。

However, if you sense that the system is slow and the load average is high, you might be running into memory performance problems. When the system is low on memory, the kernel can start to thrash, or rapidly swap memory for processes to and from the disk. When this happens, many processes will become ready to run, but their memory might not be available, so they will remain in the ready-to-run state (and contribute to the load average) for much longer than they normally would.

然而，如果您感觉系统变慢，负载平均值很高，那么可能是内存性能问题。

当系统内存不足时，内核可能会开始颠簸（thrash），也就是频繁地把进程的内存在内存与磁盘之间换入换出。

当这种情况发生时，许多进程将准备好运行，但它们的内存可能不可用，因此它们将比通常更长时间保持在准备运行状态（并对负载平均值做出贡献）。

We’ll now look at memory in much more detail.

现在我们将更详细地讨论内存。

8.9 Memory（内存）

One of the simplest ways to check your system’s memory status as a whole is to run the free command or view /proc/meminfo to see how much real memory is being used for caches and buffers. As we’ve just mentioned, performance problems can arise from memory shortages. If there isn’t much cache/buffer memory being used (and the rest of the real memory is taken), you may need more memory. However, it’s too easy to blame a shortage of memory for every performance problem on your machine.

检查系统内存状态的最简单方法之一是运行free命令或查看/proc/meminfo，以查看用于缓存和缓冲区的实际内存使用量。

正如我们刚才提到的，内存不足可能导致性能问题。

如果没有使用很多缓存/缓冲区内存（而其余的实际内存已被占用），您可能需要更多的内存。

然而，人们太容易把机器上的每个性能问题都归咎于内存不足。
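
To see how much real memory is going to buffers and cache without reading the whole of /proc/meminfo, a few fields can be pulled out and summed. This is a minimal sketch that assumes the MemTotal, Buffers, and Cached lines found in /proc/meminfo on Linux; exact accounting differs between kernel versions, so treat the percentage as approximate.

```shell
# Sum the Buffers and Cached lines of /proc/meminfo and compare with MemTotal.
# All three values are reported by the kernel in kB.
summary=$(awk '
    /^MemTotal:/ { total = $2 }
    /^Buffers:/  { buffers = $2 }
    /^Cached:/   { cached = $2 }
    END {
        if (total > 0)
            printf "buffers+cache: %d kB of %d kB (%.1f%%)\n",
                   buffers + cached, total, 100 * (buffers + cached) / total
    }' /proc/meminfo)
echo "$summary"
```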

8.9.1 How Memory Works（内存的工作原理）

Recall from Chapter 1 that the CPU has a memory management unit (MMU) that translates the virtual memory addresses used by processes into real ones. The kernel assists the MMU by breaking the memory used by processes into smaller chunks called pages. The kernel maintains a data structure, called a page table, that contains a mapping of a process’s virtual page addresses to real page addresses in memory. As a process accesses memory, the MMU translates the virtual addresses used by the process into real addresses based on the kernel’s page table.

回顾第1章中的内容，CPU具有一个内存管理单元（MMU），用于将进程使用的虚拟内存地址转换为实际地址。

内核通过将进程使用的内存分割成称为页的较小块来帮助MMU。

内核维护一个称为页表的数据结构，其中包含进程的虚拟页地址与内存中的实际页地址之间的映射。

当进程访问内存时，MMU根据内核的页表将进程使用的虚拟地址转换为实际地址。

A user process does not actually need all of its pages to be immediately available in order to run. The kernel generally loads and allocates pages as a process needs them; this system is known as on-demand paging or just demand paging. To see how this works, consider how a program starts and runs as a new process:

用户进程实际上并不需要立即可用的所有页来运行。

内核通常在进程需要它们时加载和分配页；这个系统被称为按需分页或只是需求分页。

为了了解这是如何工作的，请考虑程序作为新进程启动和运行的方式：

1. The kernel loads the beginning of the program’s instruction code into memory pages.
2. The kernel may allocate some working-memory pages to the new process.
3. As the process runs, it might reach a point where the next instruction in its code isn’t in any of the pages that the kernel initially loaded. At this point, the kernel takes over, loads the necessary pages into memory, and then lets the program resume execution.
4. Similarly, if the program requires more working memory than was initially allocated, the kernel handles it by finding free pages (or by making room) and assigning them to the process.

1. 内核将程序的指令代码的开头加载到内存页中。
2. 内核可能为新进程分配一些工作内存页。
3. 当进程运行时，它可能达到一个点，其中它的代码中的下一条指令不在内核最初加载的任何页中。此时，内核接管，将所需的页加载到内存中，然后让程序继续执行。
4. 类似地，如果程序需要的工作内存超过了最初分配的内存，内核通过找到空闲页（或腾出空间）并将其分配给进程来处理。
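
One visible consequence of demand paging is that a process's virtual size is normally larger than the number of pages it actually has in memory. The sketch below illustrates this for the current shell, assuming the Linux /proc/PID/statm file, whose first two fields are the total program size and the resident set size, both counted in pages.

```shell
# Size of one memory page in bytes (commonly 4096).
page_size=$(getconf PAGESIZE)

# First two fields of statm: total virtual size and resident size, in pages.
read -r size resident rest < "/proc/$$/statm"

echo "page size: $page_size bytes"
echo "virtual size: $size pages, resident in memory: $resident pages"
echo "resident memory: $((resident * page_size / 1024)) kB"
```

The gap between the two page counts is memory that has been mapped for the process but has not been loaded or allocated yet, or has been moved out of main memory.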

8.9.2 Page Faults（页错误）

If a memory page is not ready when a process wants to use it, the process triggers a page fault. In the event of a page fault, the kernel takes control of the CPU from the process in order to get the page ready. There are two kinds of page faults: minor and major.

如果一个进程想要使用的内存页尚未准备好，那么该进程将触发一个页错误。

在发生页错误时，内核从进程那里接管CPU，以准备好该页。

有两种类型的页错误：次要页错误和主要页错误。

Minor Page Faults（页面小故障）

A minor page fault occurs when the desired page is actually in main memory but the MMU doesn’t know where it is. This can happen when the process requests more memory or when the MMU doesn’t have enough space to store all of the page locations for a process. In this case, the kernel tells the MMU about the page and permits the process to continue. Minor page faults aren’t such a big deal, and many occur as a process runs. Unless you need maximum performance from some memory-intensive program, you probably shouldn’t worry about them.

当所需的页实际上在主存中，但MMU不知道它在哪里时，发生次要页错误。

这可能发生在进程请求更多内存时，或者当MMU没有足够的空间来存储进程的所有页位置时。

在这种情况下，内核告诉MMU有关该页的信息，并允许进程继续执行。次要页错误并不是什么大问题，很多次都会发生。

除非您需要从一些内存密集型程序中获得最大的性能，否则您可能不需要担心它们。

Major Page Faults（主要页面故障）

A major page fault occurs when the desired memory page isn’t in main memory at all, which means that the kernel must load it from the disk or some other slow storage mechanism. A lot of major page faults will bog the system down because the kernel must do a substantial amount of work to provide the pages, robbing normal processes of their chance to run.

当所需的内存页根本不在主存中时，发生主要页错误，这意味着内核必须从磁盘或其他较慢的存储机制中加载它。

大量的主要页错误会拖慢系统，因为内核必须做大量的工作来提供页，从而剥夺正常进程运行的机会。

Some major page faults are unavoidable, such as those that occur when you load the code from disk when running a program for the first time. The biggest problems happen when you start running out of memory and the kernel starts to swap pages of working memory out to the disk in order to make room for new pages.

一些主要页错误是不可避免的，例如在首次运行程序时从磁盘加载代码时发生的错误。

当您开始内存不足并且内核开始将工作内存的页交换到磁盘以腾出空间来容纳新的页时，问题就变得更严重了。

Watching Page Faults（观察页面错误）

You can drill down to the page faults for individual processes with the ps, top, and time commands. The following command shows a simple example of how the time command displays page faults. (The output of the cal command doesn’t matter, so we’re discarding it by redirecting that to /dev/null.)

您可以使用ps、top和time命令来查看各个进程的页面错误。下面的命令展示了time命令如何显示页面错误的一个简单示例。（cal命令的输出并不重要，我们通过将其重定向到/dev/null来丢弃它。）

$ /usr/bin/time cal > /dev/null
0.00user 0.00system 0:00.06elapsed 0%CPU (0avgtext+0avgdata 3328maxresident)k
648inputs+0outputs (2major+254minor)pagefaults 0swaps

As you can see from the pagefaults field at the end of the output, when this program ran, there were 2 major page faults and 254 minor ones. The major page faults occurred when the kernel needed to load the program from the disk for the first time. If you ran the command again, you probably wouldn’t get any major page faults because the kernel would have cached the pages from the disk.

从输出末尾的 pagefaults 字段可以看出，当该程序运行时，发生了2个主要页面错误和254个次要页面错误。

主要页面错误发生在内核首次从磁盘加载程序时。

如果再次运行该命令，可能不会发生主要页面错误，因为内核已经缓存了从磁盘读入的页面。
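The same counters that time prints are available programmatically. As a rough sketch (assuming a Unix-like system; the helper name `child_page_faults` is invented here), Python's `resource.getrusage` with `RUSAGE_CHILDREN` reports the accumulated major and minor faults of child processes that have been waited for, so taking the difference around a single run approximates what time shows:

```python
import resource
import subprocess
import sys


def child_page_faults(argv):
    """Run argv and return (major, minor) page faults charged to it."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    # Discard the command's output, as the example above does with /dev/null.
    subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return (after.ru_majflt - before.ru_majflt,
            after.ru_minflt - before.ru_minflt)


if __name__ == "__main__":
    # Any command works here; the Python interpreter itself is used
    # only because it is certain to be installed.
    major, minor = child_page_faults([sys.executable, "-c", "pass"])
    print(f"{major} major, {minor} minor page faults")
```

Running it twice in a row is a simple way to see the caching effect described above: the major count will usually drop on the second run.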

If you’d rather see the page faults of processes as they’re running, use top or ps. When running top, use f to change the displayed fields and u to display the number of major page faults. (The results will show up in a new, nFLT column. You won’t see the minor page faults.)

如果您希望在进程运行时查看页面错误，请使用top或ps命令。

在运行top时，使用f来更改显示的字段，使用u来显示重要页面错误的数量。

（结果将显示在一个新的nFLT列中，您将看不到次要页面错误。）

When using ps, you can use a custom output format to view the page faults for a particular process. Here’s an example for process ID 20365:

在使用ps时，您可以使用自定义的输出格式来查看特定进程的页面错误。以下是针对进程ID 20365的示例：

$ ps -o pid,min_flt,maj_flt 20365
  PID  MINFL MAJFL
20365 834182    23

The MINFL and MAJFL columns show the numbers of minor and major page faults. Of course, you can combine this with any other process selection options, as described in the ps(1) manual page.

MINFL 和 MAJFL 列显示次要和主要页面故障的数量。

当然，您也可以将其与任何其他进程选择选项相结合，详见 ps(1) 手册页面。
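On Linux, the figures that ps reports here come from the kernel's per-process accounting, which is also exposed as text in /proc/PID/stat. The sketch below assumes the field layout documented in proc(5), where minflt is field 10 and majflt is field 12; the function name and the sample line are made up for illustration.

```python
def parse_stat_faults(stat_line):
    """Return (minor, major) page faults from one /proc/PID/stat line."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces, so split on the LAST ")" before splitting on spaces.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state), so field N sits at index N - 3.
    return int(fields[10 - 3]), int(fields[12 - 3])


# Synthetic line shaped like /proc/20365/stat (truncated after field 13).
sample = "20365 (my proc) S 1 20365 20365 0 -1 4194304 834182 0 23 0"
print(parse_stat_faults(sample))  # (834182, 23)

# On a live Linux system you could read the real file instead:
#   with open("/proc/self/stat") as f:
#       print(parse_stat_faults(f.read()))
```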

Viewing page faults by process can help you zero in on certain problematic components. However, if you’re interested in your system performance as a whole, you need a tool to summarize CPU and memory action across all processes.

按进程查看页面故障可以帮助你找到某些有问题的组件。

不过，如果你对系统的整体性能感兴趣，就需要一个工具来汇总所有进程的 CPU 和内存运行情况。

8.10 Monitoring CPU and Memory Performance with vmstat（使用vmstat监控CPU和内存性能）

Among the many tools available to monitor system performance, the vmstat command is one of the oldest, with minimal overhead. You’ll find it handy for getting a high-level view of how often the kernel is swapping pages in and out, how busy the CPU is, and how much I/O is taking place.

在众多可用于监控系统性能的工具中，vmstat命令是最古老且开销最小的之一。

您会发现它非常方便，可以提供关于内核页面交换频率、CPU忙碌程度和IO利用率的整体视图。

The trick to unlocking the power of vmstat is to understand its output. For example, here’s some output from vmstat 2, which reports statistics every 2 seconds:

解锁vmstat的威力的关键在于理解其输出。例如，这是使用vmstat 2命令每2秒报告一次统计数据的一些输出：

$ vmstat 2
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd    free   buff   cache   si   so    bi    bo   in   cs us sy  id wa
 2  0 320416 3027696 198636 1072568    0    0     1     1    2    0 15  2  83  0
 2  0 320416 3027288 198636 1072564    0    0     0  1182  407  636  1  0  99  0
 1  0 320416 3026792 198640 1072572    0    0     0    58  281  537  1  0  99  0
 0  0 320416 3024932 198648 1074924    0    0     0   308  318  541  0  0  99  1
 0  0 320416 3024932 198648 1074968    0    0     0     0  208  416  0  0  99  0
 0  0 320416 3026800 198648 1072616    0    0     0     0  207  389  0  0 100  0

The output falls into categories: procs for processes, memory for memory usage, swap for the pages pulled in and out of swap, io for disk usage, system for the number of times the kernel switches into kernel code, and cpu for the time used by different parts of the system.

输出分为几个类别：procs代表进程，memory代表内存使用情况，swap代表从交换区中换入和换出的页面，io代表磁盘使用情况，system代表内核切换到内核代码的次数，cpu代表系统不同部分使用的时间。

The preceding output is typical for a system that isn’t doing much. You’ll usually start looking at the second line of output—the first one is an average for the entire uptime of the system. For example, here the system has 320416KB of memory swapped out to the disk (swpd) and around 3025000KB (3 GB) of real memory free. Even though some swap space is in use, the zero-valued si (swap-in) and so (swap-out) columns report that the kernel is not currently swapping anything in or out from the disk. The buff column indicates the amount of memory that the kernel is using for disk buffers (see 4.2.5 Disk Buffering, Caching, and Filesystems).

前面的输出对于一个没有做太多事情的系统来说是典型的。

通常你会从输出的第二行开始查看，第一行是整个系统运行时间的平均值。例如，在这个例子中，系统将320416KB的内存交换到磁盘(swpd)，并且大约有3025000KB（3GB）的真实内存空闲。

尽管有一些交换空间在使用，但是零值的si（换入）和so（换出）列显示内核当前没有从磁盘中交换任何内容。

buff列表示内核用于磁盘缓冲区的内存量（参见4.2.5磁盘缓冲、缓存和文件系统）。

On the far right, under the CPU heading, you see the distribution of CPU time in the us, sy, id, and wa columns. These list (in order) the percentage of time that the CPU is spending on user tasks, system (kernel) tasks, idle time, and waiting for I/O. In the preceding example, there aren’t too many user processes running (they’re using a maximum of 1 percent of the CPU); the kernel is doing practically nothing, while the CPU is sitting around doing nothing 99 percent of the time.

在最右边的CPU标题下，你可以看到CPU时间在us、sy、id和wa列中的分布情况。

它们按顺序列出了CPU在用户任务、系统（内核）任务、空闲时间和等待I/O上所花费的时间的百分比。

在前面的例子中，没有太多用户进程在运行（它们最多使用1%的CPU）；内核几乎没有做任何事情，而CPU在99%的时间里都闲置。
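When you want to work with these numbers in a script, it helps to attach the column names to each row. The following sketch assumes the 16-column layout shown in the output above (newer vmstat versions may append further CPU columns, which this code simply ignores); `VMSTAT_COLUMNS`, `parse_vmstat_row`, and `is_swapping` are names chosen for this example.

```python
VMSTAT_COLUMNS = ["r", "b", "swpd", "free", "buff", "cache", "si", "so",
                  "bi", "bo", "in", "cs", "us", "sy", "id", "wa"]


def parse_vmstat_row(line):
    """Turn one data line of vmstat output into a {column: value} dict."""
    values = [int(v) for v in line.split()]
    # zip stops at the shorter sequence, so extra trailing columns are dropped.
    return dict(zip(VMSTAT_COLUMNS, values))


def is_swapping(row):
    """True if the kernel moved pages to or from swap during this interval."""
    return row["si"] > 0 or row["so"] > 0


# Second line of the sample output above.
row = parse_vmstat_row("2 0 320416 3027288 198636 1072564 0 0 0 1182 407 636 1 0 99 0")
print(row["swpd"], row["id"], is_swapping(row))  # 320416 99 False
```

Here swpd is nonzero but `is_swapping` is false, which matches the point made above: swap space is in use, yet nothing is currently moving in or out.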

Now, watch what happens when a big program starts up sometime later (the first two lines occur right before the program runs):

现在，看看当一个大程序在稍后启动时会发生什么（前两行发生在程序运行之前）：

Example 8-3. Memory activity

例子8-3.内存活动

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd    free   buff   cache   si   so    bi    bo   in   cs us sy  id wa
 1  0 320412 2861252 198920 1106804    0    0     0     0 2477 4481 25  2  72  0  (1)
 1  0 320412 2861748 198924 1105624    0    0     0    40 2206 3966 26  2  72  0
 1  0 320412 2860508 199320 1106504    0    0   210    18 2201 3904 26  2  71  1
 1  1 320412 2817860 199332 1146052    0    0 19912     0 2446 4223 26  3  63  8
 2  2 320284 2791608 200612 1157752  202    0  4960   854 3371 5714 27  3  51 18  (2)
 1  1 320252 2772076 201076 1166656   10    0  2142  1190 4188 7537 30  3  53 14
 0  3 320244 2727632 202104 1175420   20    0  1890   216 4631 8706 36  4  46 14

As you can see at the first callout in Example 8-3 (the first row of output), the CPU starts to see some usage for an extended period, especially from user processes. Because there is enough free memory, the amount of cache and buffer space used starts to increase as the kernel starts to use the disk more.

正如你在示例8-3中所看到的，CPU开始在一个较长的时间内出现一些使用情况，尤其是来自用户进程。

由于有足够的空闲内存，缓存和缓冲区使用的量开始增加，因为内核开始更多地使用磁盘。

Later on, we see something interesting: Notice at the second callout (the row where the si column reads 202) that the kernel pulls some pages into memory that were once swapped out. This means that the program that just ran probably accessed some pages shared by another process. This is common; many processes use the code in certain shared libraries only when starting up.

稍后，我们看到一些有趣的现象：请注意在第二处标注（si 列为 202 的那一行），内核将一些曾经被交换出去的页面调入内存。

这意味着刚刚运行的程序可能访问了其他进程共享的某些页面。

这是很常见的；许多进程只在启动时使用某些共享库中的代码。

Also notice from the b column that a few processes are blocked (prevented from running) while waiting for memory pages. Overall, the amount of free memory is decreasing, but it’s nowhere near being depleted. There’s also a fair amount of disk activity, as seen by the increasing numbers in the bi (blocks in) and bo (blocks out) columns.

还请注意b列中有一些进程被阻塞（无法运行），因为它们在等待内存页面。总体而言，空闲内存的数量在减少，但远未耗尽。

同时，磁盘活动也相当频繁，可以从bi（块输入）和bo（块输出）列中看出。

The output is quite different when you run out of memory. As the free space depletes, both the buffer and cache sizes decrease because the kernel increasingly needs the space for user processes. Once there is nothing left, you’ll start to see activity in the so (swapped out) column as the kernel starts moving pages onto the disk, at which point nearly all of the other output columns change to reflect the amount of work that the kernel is doing. You see more system time, more data going in and out of the disk, and more processes blocked because the memory they want to use is not available (it has been swapped out).

当内存耗尽时，输出会有很大的变化。

随着空闲空间的减少，缓冲区和缓存大小也会减小，因为内核越来越需要这些空间来供用户进程使用。

一旦没有剩余空间，你将开始在so（交换出）列中看到活动，此时内核开始将页面移到磁盘上，几乎所有其他输出列都会根据内核的工作量发生变化。

你会看到更多的系统时间，更多的数据进出磁盘，以及更多的进程被阻塞，因为它们想要使用的内存不可用（已经被交换出）。
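If you want to watch for this condition from a script, one option on Linux is to sample the kernel's cumulative counters in /proc/vmstat and compare two snapshots. The sketch below assumes that file's "name value" line format and its pswpin and pswpout counters, which count pages rather than kilobytes, so the results are not directly comparable to vmstat's si and so columns. The function names and sample text are invented for this example.

```python
def parse_counters(text):
    """Parse "name value" lines (the /proc/vmstat format) into a dict."""
    counters = {}
    for line in text.splitlines():
        name, _, value = line.partition(" ")
        if value.strip().isdigit():
            counters[name] = int(value)
    return counters


def swap_rates(before, after, seconds):
    """Pages swapped in and out per second between two snapshots."""
    return {key: (after[key] - before[key]) / seconds
            for key in ("pswpin", "pswpout")}


# Synthetic snapshots taken 2 seconds apart.
first = parse_counters("pswpin 100\npswpout 200\npgmajfault 12\n")
second = parse_counters("pswpin 100\npswpout 1200\npgmajfault 40\n")
print(swap_rates(first, second, 2))  # {'pswpin': 0.0, 'pswpout': 500.0}

# On a live Linux system, each snapshot would come from:
#   with open("/proc/vmstat") as f:
#       snapshot = parse_counters(f.read())
```

A sustained nonzero pswpout rate corresponds to the activity in the so column described above.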

We haven’t explained all of the vmstat output columns. You can dig deeper into them in the vmstat(8) manual page, but you might have to learn more about kernel memory management first from a class or a book like Operating System Concepts, 9th edition (Wiley, 2012) in order to understand them.

我们没有解释所有的vmstat输出列。

你可以在vmstat(8)的手册页中深入了解它们，但为了理解它们，你可能需要先从课程或者像《操作系统概念》（第9版，Wiley，2012） 这样的书籍中更多地了解内核内存管理。

8.11 I/O Monitoring（输入/输出监控）

By default, vmstat shows you some general I/O statistics. Although you can get very detailed per-partition resource usage with vmstat -d, you’ll get a lot of output from this option, which might be overwhelming. Instead, try starting out with a tool just for I/O called iostat.

默认情况下，vmstat 会显示一些一般的 I/O 统计信息。

虽然使用 vmstat -d 可以获得非常详细的每个分区资源使用情况，但该选项会产生大量输出，可能会让人难以承受。

相反，你可以尝试从名为 iostat 的 I/O 工具开始。

8.11.1 Using iostat

Like vmstat, when run without any options, iostat shows the statistics for your machine’s current uptime:

与 vmstat 一样，在不带任何选项的情况下运行时，iostat 会显示机器当前正常运行时间的统计数据：

$ iostat
[kernel information]
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.46    0.01    0.67    0.31    0.00   94.55

Device:    tps  kB_read/s  kB_wrtn/s   kB_read   kB_wrtn
sda       4.67       7.28      49.86   9493727  65011716
sde       0.00       0.00       0.00      1230         0

The avg-cpu part at the top reports the same CPU utilization information as other utilities that you’ve seen in this chapter, so skip down to the bottom, which shows you the following for each device:

顶部的 avg-cpu 部分报告的 CPU 利用率信息与本章中的其他实用程序相同，因此请跳到底部，它将显示每个设备的以下信息：

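o tps Average number of data transfers per second
o kB_read/s Average number of kilobytes read per second
o kB_wrtn/s Average number of kilobytes written per second
o kB_read Total number of kilobytes read
o kB_wrtn Total number of kilobytes written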

Another similarity to vmstat is that you can give an interval argument, such as iostat 2, to give an update every 2 seconds. When using an interval, you might want to display only the device report by using the -d option (such as iostat -d 2).

与vmstat类似的另一个特点是，你可以提供一个间隔参数，比如iostat 2，以便每2秒更新一次。

当使用间隔时，你可能希望只显示设备报告，可以使用-d选项（比如iostat -d 2）。
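To see roughly where such per-device figures can come from, the sketch below parses text in the format of Linux's /proc/diskstats and computes iostat-style rates between two snapshots. It assumes the field order documented for that file (reads completed, reads merged, sectors read, time reading, writes completed, writes merged, sectors written, ...) and 512-byte sectors; it is not a description of how iostat itself is implemented, and the function names and sample lines are made up.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors of this size


def parse_diskstats(text):
    """Return {device: totals} from text in /proc/diskstats format."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 10:
            continue  # skip blank or malformed lines
        stats[f[2]] = {
            "ios": int(f[3]) + int(f[7]),               # reads + writes completed
            "kB_read": int(f[5]) * SECTOR_BYTES // 1024,
            "kB_wrtn": int(f[9]) * SECTOR_BYTES // 1024,
        }
    return stats


def device_rates(before, after, seconds):
    """Per-second transfer and throughput rates between two snapshots."""
    return {
        "tps": (after["ios"] - before["ios"]) / seconds,
        "kB_read/s": (after["kB_read"] - before["kB_read"]) / seconds,
        "kB_wrtn/s": (after["kB_wrtn"] - before["kB_wrtn"]) / seconds,
    }


# Two synthetic snapshots of one device, taken 2 seconds apart.
t0 = parse_diskstats("   8       0 sda 1000 0 20000 0 500 0 40000 0 0 0 0")
t1 = parse_diskstats("   8       0 sda 1004 0 20040 0 506 0 40200 0 0 0 0")
print(device_rates(t0["sda"], t1["sda"], 2))
# {'tps': 5.0, 'kB_read/s': 10.0, 'kB_wrtn/s': 50.0}
```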

By default, the iostat output omits partition information. To show all of the partition information, use the -p ALL option. Because there are many partitions on a typical system, you’ll get a lot of output. Here’s part of what you might see:

默认情况下，iostat输出不包含分区信息。

要显示所有分区信息，请使用-p ALL选项。

由于典型系统上有许多分区，你将得到大量输出。以下是你可能看到的部分内容：

$ iostat -p ALL
--snip--
Device:    tps  kB_read/s  kB_wrtn/s   kB_read   kB_wrtn
--snip--
sda       4.67       7.27      49.83   9496139  65051472
sda1      4.38       7.16      49.51   9352969  64635440
sda2      0.00       0.00       0.00         6         0
sda5      0.01       0.11       0.32    141884    416032
scd0      0.00       0.00       0.00         0         0
--snip--
sde       0.00       0.00       0.00      1230         0

In this example, sda1, sda2, and sda5 are all partitions of the sda disk, so there will be some overlap between the read and written columns. However, the sum of the partition columns won’t necessarily add up to the disk column. Although a read from sda1 also counts as a read from sda, keep in mind that you can read from sda directly, such as when reading the partition table.

在本例中，sda1、sda2 和 sda5 都是 sda 磁盘的分区，因此读取列和写入列之间会有一些重叠。

不过，分区列的总和并不一定等于磁盘列。

虽然从 sda1 的读取也算作从 sda 的读取，但请记住，您可以直接从 sda 读取，例如在读取分区表时。
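The numbers in the sample output make this concrete. The short calculation below simply adds up the kB_read and kB_wrtn totals of the three partitions and compares them with the totals for sda; the dictionary is transcribed from the output above.

```python
# (kB_read, kB_wrtn) totals copied from the iostat -p ALL output above.
totals = {
    "sda":  (9496139, 65051472),
    "sda1": (9352969, 64635440),
    "sda2": (6, 0),
    "sda5": (141884, 416032),
}

partitions = ["sda1", "sda2", "sda5"]
part_read = sum(totals[p][0] for p in partitions)
part_wrtn = sum(totals[p][1] for p in partitions)

# Whatever is left over was read from or written to the disk device directly.
direct_read = totals["sda"][0] - part_read
direct_wrtn = totals["sda"][1] - part_wrtn

print(part_read, direct_read)  # 9494859 1280
print(part_wrtn, direct_wrtn)  # 65051472 0
```

In this sample, the writes add up exactly, while 1280 kB of reads went to sda itself rather than to any partition.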

8.11.2 Per-Process I/O Utilization and Monitoring: iotop（每进程 I/O 利用率和监控：iotop）

If you need to dig even deeper to see I/O resources used by individual processes, the iotop tool can help. Using iotop is similar to using top. There is a continuously updating display that shows the processes using the most I/O, with a general summary at the top:

如果需要更深入地查看单个进程使用的 I/O 资源，iotop 工具可以提供帮助。

使用 iotop 与使用 top 类似。

它有一个持续更新的显示屏，显示使用最多 I/O 的进程，顶部有一个总的摘要：

# iotop
Total DISK READ: 4.76 K/s | Total DISK WRITE: 333.31 K/s
  TID  PRIO  USER   DISK READ   DISK WRITE  SWAPIN     IO>  COMMAND
  260  be/3  root    0.00 B/s   38.09 K/s   0.00 %  6.98 %  [jbd2/sda1-8]
 2611  be/4  juser   4.76 K/s   10.32 K/s   0.00 %  0.21 %  zeitgeistdaemon
 2636  be/4  juser   0.00 B/s   84.12 K/s   0.00 %  0.20 %  zeitgeistfts
 1329  be/4  juser   0.00 B/s   65.87 K/s   0.00 %  0.03 %  soffice.b~ashpipe=6
 6845  be/4  juser   0.00 B/s  812.63 B/s   0.00 %  0.00 %  chromium-browser
19069  be/4  juser   0.00 B/s  812.63 B/s   0.00 %  0.00 %  rhythmbox

Along with the user, command, and read/write columns, notice that there is a TID column (thread ID) instead of a process ID. The iotop tool is one of the few utilities that displays threads instead of processes.

除了用户、命令和读/写列之外，请注意这里有一个TID列（线程ID），而不是进程ID。

iotop工具是为数不多显示线程而不是进程的实用工具之一。

The PRIO (priority) column indicates the I/O priority. It’s similar to the CPU priority that you’ve already seen, but it affects how quickly the kernel schedules I/O reads and writes for the process. In a priority such as be/4, the be part is the scheduling class, and the number is the priority level. As with CPU priorities, lower numbers are more important; for example, the kernel allows more time for I/O for a process with be/3 than one with be/4.

PRIO（优先级）列指示I/O优先级。

它类似于你已经见过的CPU优先级，但它影响内核为进程调度I/O读取和写入的速度。

在像be/4这样的优先级中，be部分是调度类，数字是优先级级别。

与CPU优先级一样，较低的数字更重要；

例如，内核为具有be/3的进程允许更多的时间进行I/O，而不是具有be/4的进程。

The kernel uses the scheduling class to add more control for I/O scheduling. You’ll see three scheduling classes from iotop:

内核使用调度类来增加对I/O调度的更多控制。你将从iotop中看到三个调度类：

o be Best-effort. The kernel does its best to fairly schedule I/O for this class. Most processes run under this I/O scheduling class.

o rt Real-time. The kernel schedules any real-time I/O before any other class of I/O, no matter what.

o idle Idle. The kernel performs I/O for this class only when there is no other I/O to be done. There is no priority level for the idle scheduling class.

o be 最佳努力。内核尽其所能公平地为该类别调度I/O。大多数进程在此I/O调度类下运行。

o rt 实时。内核在任何其他I/O类别之前调度任何实时I/O。

o idle 空闲。内核仅在没有其他I/O需要完成时才为此类别执行I/O操作。空闲调度类别没有优先级级别。
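The ordering implied by these rules can be written down as a sort key. This is only a sketch of the ordering as described in the text (real-time before best-effort before idle, and lower numbers first within a class); it says nothing about how the kernel's I/O scheduler actually works, and `CLASS_RANK` and `io_priority_key` are names invented here.

```python
# Rank of each scheduling class as described above: smaller is served first.
CLASS_RANK = {"rt": 0, "be": 1, "idle": 2}


def io_priority_key(prio):
    """Sort key for iotop-style PRIO strings such as "be/4" or "idle"."""
    cls, _, level = prio.partition("/")
    # The idle class has no level, so use 0 as a placeholder.
    return (CLASS_RANK[cls], int(level) if level else 0)


print(sorted(["be/4", "idle", "rt/2", "be/3"], key=io_priority_key))
# ['rt/2', 'be/3', 'be/4', 'idle']
```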

You can check and change the I/O priority for a process with the ionice utility; see the ionice(1) manual page for details. You probably will never need to worry about the I/O priority, though.

你可以使用ionice实用程序来检查和更改进程的I/O优先级；有关详细信息，请参阅ionice（1）手册页。但是，你可能永远不需要担心I/O优先级。

8.12 Per-Process Monitoring with pidstat

You’ve seen how you can monitor specific processes with utilities such as top and iotop. However, this display refreshes over time, and each update erases the previous output. The pidstat utility allows you to see the resource consumption of a process over time in the style of vmstat. Here’s a simple example for monitoring process 1329, updating every second:

您已经了解到如何使用top和iotop等工具来监视特定的进程。

然而，这些显示屏幕会随时间刷新，每次更新都会清除之前的输出。

pidstat工具允许您以vmstat的方式查看进程随时间的资源消耗情况。

下面是一个简单的示例，用于监视进程1329，每秒更新一次：

$ pidstat -p 1329 1
Linux 3.2.0-44-generic-pae (duplex) 07/01/2015 _i686_ (4 CPU)
09:26:55 PM PID %usr %system %guest %CPU CPU Command
09:27:03 PM 1329 8.00 0.00 0.00 8.00 1 myprocess
09:27:04 PM 1329 0.00 0.00 0.00 0.00 3 myprocess
09:27:05 PM 1329 3.00 0.00 0.00 3.00 1 myprocess
09:27:06 PM 1329 8.00 0.00 0.00 8.00 3 myprocess
09:27:07 PM 1329 2.00 0.00 0.00 2.00 3 myprocess
09:27:08 PM 1329 6.00 0.00 0.00 6.00 2 myprocess

The default output shows the percentages of user and system time and the overall percentage of CPU time, and it even tells you which CPU the process was running on. (The %guest column here is somewhat odd— it’s the percentage of time that the process spent running something inside a virtual machine. Unless you’re running a virtual machine, don’t worry about this.)

默认输出显示了用户和系统时间的百分比，以及CPU时间的总体百分比，甚至还告诉您进程在哪个CPU上运行。

（这里的%guest列有点奇怪，它是进程在虚拟机内运行的时间百分比。除非您正在运行虚拟机，否则不必担心这个。）

Although pidstat shows CPU utilization by default, it can do much more. For example, you can use the -r option to monitor memory and -d to turn on disk monitoring. Try them out, and then look at the pidstat(1) manual page to see even more options for threads, context switching, or just about anything else that we’ve talked about in this chapter.

虽然pidstat默认显示CPU利用率，但它还可以做更多的事情。例如，您可以使用-r选项来监视内存，使用-d选项来开启磁盘监视。

试试它们，然后查看pidstat(1)手册页面，以了解更多有关线程、上下文切换或本章中讨论的其他任何内容的选项。
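The %usr and %system columns can be approximated by sampling a process's accumulated CPU time at two moments and dividing the difference by the interval. The sketch below assumes the Linux /proc/PID/stat layout from proc(5), where utime is field 14 and stime is field 15, both measured in clock ticks; it is an illustration of the arithmetic, not pidstat's actual implementation, and the function names and sample line are made up.

```python
import os


def read_cpu_ticks(stat_line):
    """Return (utime, stime) in clock ticks from one /proc/PID/stat line."""
    # Split after the last ")" because the command name may contain spaces.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state), so field N sits at index N - 3.
    return int(fields[14 - 3]), int(fields[15 - 3])


def cpu_percent(before, after, seconds, hz):
    """Return (%usr, %system, %CPU) between two (utime, stime) samples."""
    usr = 100.0 * (after[0] - before[0]) / (hz * seconds)
    system = 100.0 * (after[1] - before[1]) / (hz * seconds)
    return usr, system, usr + system


# Synthetic sample shaped like /proc/1329/stat (truncated after field 17).
sample = "1329 (myprocess) S 1 1329 1329 0 -1 4194304 10 0 0 0 108 20 0 0"
later = read_cpu_ticks(sample)
print(later)                                    # (108, 20)
print(cpu_percent((100, 20), later, 1.0, 100))  # (8.0, 0.0, 8.0)

# On a live system, hz would come from os.sysconf("SC_CLK_TCK") and the
# two samples from reading /proc/<pid>/stat one interval apart.
```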

8.13 Further Topics（进一步的主题）

One reason there are so many tools to measure resource utilization is that a wide array of resource types are consumed in many different ways. In this chapter, you’ve seen CPU, memory, and I/O as system resources being consumed by processes, threads inside processes, and the kernel.

有很多工具用于测量资源利用情况的一个原因是，有各种各样的资源类型以多种不同的方式被消耗。

在本章中，您已经看到了 CPU、内存和 I/O 作为系统资源被进程、进程内的线程和内核所消耗。

The other reason that the tools exist is that the resources are limited and, for a system to perform well, its components must strive to consume fewer resources. In the past, many users shared a machine, so it was necessary to make sure that each user had a fair share of resources. Now, although a modern desktop computer may not have multiple users, it still has many processes competing for resources. Likewise, high-performance network servers require intense system resource monitoring.

这些工具存在的另一个原因是资源是有限的，为了系统能够良好运行，其组件必须努力消耗更少的资源。

过去，许多用户共享一台机器，因此有必要确保每个用户都有公平的资源份额。

现在，尽管现代台式计算机可能没有多个用户，但仍然有许多进程竞争资源。

同样，高性能网络服务器需要进行强大的系统资源监控。

Further topics in resource monitoring and performance analysis include the following:

资源监控和性能分析的进一步主题包括以下内容：

o sar (System Activity Reporter) The sar package has many of the continuous monitoring capabilities of vmstat, but it also records resource utilization over time. With sar, you can look back at a particular time to see what your system was doing. This is handy when you have a past system event that you want to analyze.

o sar（系统活动报告器） sar 软件包具有 vmstat 的许多连续监控功能，但它还记录了资源利用情况的变化。

通过 sar，您可以回顾特定时间以查看系统的运行情况。当您有一个过去的系统事件需要分析时，这非常方便。

o acct (Process accounting) The acct package can record the processes and their resource utilization.

o acct（进程记账） acct 软件包可以记录进程及其资源利用情况。

o Quotas. You can limit many system resources on a per-process or per-user basis. See /etc/security/limits.conf for some of the CPU and memory options; there’s also a limits.conf(5) manual page. This is a PAM feature, so processes are subject to this only if they’ve been started from something that uses PAM (such as a login shell). You can also limit the amount of disk space that a user can use with the quota system.

o 配额。您可以在每个进程或每个用户的基础上限制许多系统资源。

有关 CPU 和内存选项，请参阅 /etc/security/limits.conf；还有一个 limits.conf(5) 手册页。

这是一个 PAM 功能，因此只有从使用 PAM 的东西（如登录 shell）启动的进程才受到此限制。

您还可以使用配额系统限制用户可以使用的磁盘空间的数量。
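A process can inspect the per-process limits that currently apply to it. The sketch below uses Python's `resource.getrlimit` on a Unix-like system to print the soft and hard limits for three resources whose names (cpu, as, nofile) also appear as items in limits.conf(5). The assumption here is that limits configured through PAM end up as ordinary resource limits of this kind; the helper names are invented for the example.

```python
import resource

# limits.conf item name -> corresponding resource constant
LIMITS = {
    "cpu": resource.RLIMIT_CPU,        # CPU time, in seconds
    "as": resource.RLIMIT_AS,          # address space size, in bytes
    "nofile": resource.RLIMIT_NOFILE,  # number of open file descriptors
}


def show(value):
    """Render a limit value, spelling out 'unlimited' where it applies."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {item: (soft, hard)} for the limits listed above."""
    result = {}
    for name, res in LIMITS.items():
        soft, hard = resource.getrlimit(res)
        result[name] = (show(soft), show(hard))
    return result


if __name__ == "__main__":
    for name, (soft, hard) in current_limits().items():
        print(f"{name:8} soft={soft} hard={hard}")
```

The values printed depend entirely on how the process was started, so no particular output is shown here.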

If you’re interested in systems tuning and performance in particular, Systems Performance: Enterprise and the Cloud by Brendan Gregg (Prentice Hall, 2013) goes into much more detail.

如果您对系统调优和性能特别感兴趣，Brendan Gregg 的《系统性能：企业和云计算》（Prentice Hall，2013）提供了更详细的信息。

We also haven’t yet touched on the many, many tools that can be used to monitor network resource utilization. To use those, you first have to understand how the network works. That’s where we’re headed next.

我们还没有涉及用于监控网络资源利用情况的众多工具。

要使用这些工具，首先必须了解网络的工作原理。这就是我们接下来要讨论的内容。

原始发表：2024-04-16

本文分享自懒时小窝微信公众号。
