Analyzing dirty page and IO request handling with a coredump (Part 2)

Original article by cdh · Last modified 2022-03-23 22:36:16

This article continues from Part 1, "Analyzing dirty page and IO request handling with a coredump (Part 1)".

To go further we first need to understand how the deadline IO scheduler works:

The elevator property of deadline

The defining property of an elevator scheduler is that it dispatches IO requests in the direction the disk head arm is already moving, minimizing seek time on the disk and improving IO throughput. Much like an elevator algorithm, it improves efficiency by minimizing the distance travelled, which is why IO schedulers are called elevator schedulers.

deadline implements its elevator behavior through a batching mechanism. Every time it dispatches a request, deadline stashes the next request in red-black-tree (sector) order, if there is one, into the next_rq member; on the next call to deadline_dispatch_requests it dispatches the request held in next_rq directly and again advances next_rq along the tree. If requests were always dispatched this way, requests in the opposite IO direction might not be served for a long time, and requests in the same direction could also time out, so this purely tree-ordered dispatch has to be bounded. That is what the batching counter and fifo_batch are for: each sequential dispatch increments batching, and once batching exceeds fifo_batch (default 16) the scheduler must check for starvation and expiry. A larger fifo_batch raises IO throughput but also increases IO latency, and vice versa. This sequential, same-direction dispatch in batches is what we call the batching mechanism; a batch can also end early, for example when there are no more requests to dispatch in that direction.

The starvation property of deadline

The starvation property of deadline refers to write starvation. It arises because deadline gives read requests higher priority, i.e. it serves reads first; in practice reads are usually synchronous, so prioritizing them helps the user experience. Always serving reads first would inevitably starve writes, so deadline introduces the starved and writes_starved members to prevent that. Each time a read is served in preference to a pending write, the starved counter is incremented; once starved reaches the writes_starved threshold (default 2) and write requests are pending, the next dispatch switches to the write direction. Taking the batching mechanism into account, with default settings writes yield to at most 16 * 2 = 32 reads.

The expiry property of deadline

Starvation only affects which IO direction gets dispatched; the elevator and expiry properties decide which request is dispatched. As described earlier, every request that enters the scheduler is appended to fifo_list in expiry order: the request at the head expires first, the one at the tail last. After finishing a batch of requests, deadline checks whether any request in the direction chosen by the starvation logic has expired; if so, it switches to that request (the head of fifo_list). If no request has expired and next_rq in the same direction holds a stashed request, it keeps batching from next_rq. So fifo_batch is not a hard upper limit on batching; it merely gives expired and starved requests a chance to run. If nothing has expired and nothing is starved, continuing to dispatch in sector order is of course the more efficient choice.

read_expire: the expiry time for read requests, in ms. When a read request enters deadline, its expiry time is set to now + read_expire and it is appended to fifo_list, which is ordered by expiry.

When switching from writes to reads, deadline first checks whether any read has expired; if so, the expired read is handled first and batching is incremented. If several reads have expired, only one is handled per batch (fifo_batch); reads then continue in red-black-tree order, and before the next batch (fifo_batch) starts deadline checks again for expired reads, handling one if present.

write_expire: the expiry time for write requests, in ms. Works the same way as for reads.

When switching from reads to writes, deadline first checks whether any write has expired; if so, the expired write is handled first and batching is incremented. If several writes have expired, only one is handled per batch (fifo_batch); writes then continue in red-black-tree order.

fifo_batch: when requests are processed in sector order (sort_list), deadline handles them in batches, and this parameter limits how many requests make up one batch. Within a batch no expiry checks are made, so no extra seeks are introduced. Each sequential dispatch increments batching, and once batching exceeds fifo_batch (default 16) the scheduler checks whether any request in the same IO direction has expired. The parameter balances sequential throughput against starvation latency: when expiry deadlines need to be honored as closely as possible, lower it so that starvation is detected and handled more promptly; raising it increases throughput but lengthens the latency of starved requests.

writes_starved: decides how starved the write queue must be, while the read queue is non-empty, before deadline stops serving reads and turns to writes. When pending writes are detected, deadline does not serve them immediately; it increments the starved counter in its data structure instead. If this is the first time pending writes have been seen, the count is 1; with writes_starved set to 2 that is not yet considered starved enough, so reads continue. Only when starved >= writes_starved does deadline go and serve writes. This value therefore balances the relative priority of reads and writes: the larger it is, the longer writes are deferred; the smaller it is, the closer writes get to read priority.
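For reference, the expiry check mentioned above boils down to comparing jiffies with the fifo_time of the request at the head of the fifo; roughly, from block/mq-deadline.c in 4.19-era kernels:

static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
{
        struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);

        /*
         * rq is expired!
         */
        if (time_after_eq(jiffies, (unsigned long)rq->fifo_time))
                return 1;

        return 0;
}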

void blk_mq_sched_insert_request(struct request *rq, bool at_head,
                                 bool run_queue, bool async)
{
        struct request_queue *q = rq->q;
        struct elevator_queue *e = q->elevator;
        struct blk_mq_ctx *ctx = rq->mq_ctx;
        struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
        /* flush rq in flush machinery need to be dispatched directly */
        if (!(rq->rq_flags & RQF_FLUSH_SEQ) && op_is_flush(rq->cmd_flags)) {
                blk_insert_flush(rq);
                goto run;
        }
        WARN_ON(e && (rq->tag != -1));
        if (blk_mq_sched_bypass_insert(hctx, !!e, rq))
                goto run;
        if (e && e->type->ops.mq.insert_requests) {
                LIST_HEAD(list);
                list_add(&rq->queuelist, &list);
                e->type->ops.mq.insert_requests(hctx, &list, at_head);//dd_insert_requests
        } else {
                spin_lock(&ctx->lock);
                __blk_mq_insert_request(hctx, rq, at_head);
                spin_unlock(&ctx->lock);
        }
run:
        if (run_queue)
         //this eventually reaches blk_mq_do_dispatch_sched, which hands the request down to the scsi driver layer
                blk_mq_run_hw_queue(hctx, async);
                
}
 //dd_insert_request links &request.queuelist into fifo_list via list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]); in crash, all requests on the list can be dumped with:
// list request.queuelist -s request -H &fifo_list.next -x
/*
 * add rq to rbtree and fifo
 */
static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
                              bool at_head)
{
        struct request_queue *q = hctx->queue;
        struct deadline_data *dd = q->elevator->elevator_data;
        const int data_dir = rq_data_dir(rq);
        /*
         * This may be a requeue of a write request that has locked its
         * target zone. If it is the case, this releases the zone lock.
         */
        blk_req_zone_write_unlock(rq);
        if (blk_mq_sched_try_insert_merge(q, rq))
                return;
        blk_mq_sched_request_inserted(rq);
        if (at_head || blk_rq_is_passthrough(rq)) {
                if (at_head)
                        list_add(&rq->queuelist, &dd->dispatch);
                else
                        list_add_tail(&rq->queuelist, &dd->dispatch);
        } else {
                deadline_add_rq_rb(dd, rq);
                if (rq_mergeable(rq)) {
                        elv_rqhash_add(q, rq);
                        if (!q->last_merge)
                                q->last_merge = rq;
                }
                /*
                 * set expire time and add to fifo list
                 */
                rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
                list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
        }
}
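deadline_add_rq_rb, called above, simply inserts the request into the per-direction red-black tree keyed by sector; roughly, from block/mq-deadline.c:

static inline struct rb_root *
deadline_rb_root(struct deadline_data *dd, struct request *rq)
{
        return &dd->sort_list[rq_data_dir(rq)];
}

static void
deadline_add_rq_rb(struct deadline_data *dd, struct request *rq)
{
        struct rb_root *root = deadline_rb_root(dd, rq);

        elv_rb_add(root, rq);
}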
static struct request *__dd_dispatch_request(struct deadline_data *dd)
{
        struct request *rq, *next_rq;
        bool reads, writes;
        int data_dir;
        if (!list_empty(&dd->dispatch)) {
                rq = list_first_entry(&dd->dispatch, struct request, queuelist);
                list_del_init(&rq->queuelist);
                goto done;
        }
        reads = !list_empty(&dd->fifo_list[READ]);
        writes = !list_empty(&dd->fifo_list[WRITE]);
        /*
         * batches are currently reads XOR writes
         */
        rq = deadline_next_request(dd, WRITE);
        if (!rq)
                rq = deadline_next_request(dd, READ);
        if (rq && dd->batching < dd->fifo_batch)
                /* we have a next request are still entitled to batch */
                goto dispatch_request;
        /*
         * at this point we are not running a batch. select the appropriate
         * data direction (read / write)
         */
        if (reads) {
                BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
                if (deadline_fifo_request(dd, WRITE) &&
                    (dd->starved++ >= dd->writes_starved))
                        goto dispatch_writes;
                data_dir = READ;
                goto dispatch_find_request;
        }
        /*
         * there are either no reads or writes have been starved
         */
        if (writes) {
dispatch_writes:
                BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
                dd->starved = 0;
                data_dir = WRITE;
         goto dispatch_find_request;
        }
        return NULL;
dispatch_find_request:
        /*
         * we are not running a batch, find best request for selected data_dir
         */
        next_rq = deadline_next_request(dd, data_dir);
        if (deadline_check_fifo(dd, data_dir) || !next_rq) {
                /*
                 * A deadline has expired, the last request was in the other
                 * direction, or we have run out of higher-sectored requests.
                 * Start again from the request with the earliest expiry time.
                 */
                rq = deadline_fifo_request(dd, data_dir);
        } else {
                /*
                 * The last req was the same dir and we have a next request in
                 * sort order. No expired requests so continue on from here.
                 */
                rq = next_rq;
        }
        /*
         * For a zoned block device, if we only have writes queued and none of
         * them can be dispatched, rq will be NULL.
         */
        if (!rq)
                return NULL;
        dd->batching = 0;
dispatch_request:
        /*
         * rq is the selected appropriate request.
         */
        dd->batching++;
        deadline_move_request(dd, rq);
done:
        /*
         * If the request needs its target zone locked, do it.
         */
        blk_req_zone_write_lock(rq);
        rq->rq_flags |= RQF_STARTED;
        return rq;
}
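deadline_move_request, called on the dispatch_request path above, is what maintains the next_rq hint described earlier: it points next_rq at the next request in sector order and removes the dispatched request from both the sort tree and the fifo; roughly, from block/mq-deadline.c:

static void
deadline_move_request(struct deadline_data *dd, struct request *rq)
{
        const int data_dir = rq_data_dir(rq);

        dd->next_rq[READ] = NULL;
        dd->next_rq[WRITE] = NULL;
        dd->next_rq[data_dir] = deadline_latter_request(rq);

        /*
         * take it off the sort and fifo list
         */
        deadline_remove_request(rq->q, rq);
}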
crash> struct request_queue.elevator ffff8f64792d1c80
  elevator = 0xffff8f648e321400
crash> struct request_queue.elevator   ffff8f64792d1c80 -xo
struct request_queue {
  [ffff8f64792d1c98] struct elevator_queue *elevator;
}
crash>  struct elevator_queue.elevator_data 0xffff8f648e321400
  elevator_data = 0xffff8f6478d94000
crash> struct elevator_queue.elevator_data 0xffff8f648e321400 -xo
struct elevator_queue {
  [ffff8f648e321408] void *elevator_data;
}
crash> struct deadline_data  0xffff8f6478d94000
struct deadline_data {
  sort_list = {{
      rb_node = 0xffff8f6478816a20//READ
    }, {
      rb_node = 0x0//WRITE
    }},
  fifo_list = {{//read IO fifo
      next = 0xffff8f6478f9d3c0,//oldest request in the read fifo
      prev = 0xffff8f6478fe6a00//most recently queued request in the read fifo
    }, {//write IO fifo
      next = 0xffff8f6478d94020,
      prev = 0xffff8f6478d94020
    }},
 //the first array element is for read requests, the second for writes
 //next_rq holds the next {read, write} struct request that deadline_dispatch_requests will dispatch
  next_rq = {0xffff8f64788169c0, 0x0},
  batching = 3,
  starved = 0,
  fifo_expire = {500, 5000},
  fifo_batch = 16,
  writes_starved = 2,
  front_merges = 1,
  ......
}
  
  
crash> struct deadline_data.fifo_list  0xffff8f6478d94000 -xo
struct deadline_data {
  [ffff8f6478d94010] struct list_head fifo_list[2];
}
crash> px &((struct list_head *)0xffff8f6478d94010)[0]
$108 = (struct list_head *) 0xffff8f6478d94010
crash> px &((struct list_head *)0xffff8f6478d94010)[1]
$109 = (struct list_head *) 0xffff8f6478d94020
crash>
//no pending write requests on the deadline fifo list:
crash> list request.queuelist  -H 0xffff8f6478d94020
(empty)
crash>
Count the pending read requests:
crash> list request.queuelist  -H 0xffff8f6478d94010 | wc -l
79
crash>
crash>
List the IO size each pending read request wants to read:
crash> list request.queuelist -s request.__data_len -H 0xffff8f6478d94010  | grep __data_len | awk '{print $3}' | sort -nr | head
1048576
884736
524288
524288
524288
524288
163840
65536
45056
45056
crash>
Total IO size (bytes) of all requests waiting to be read on this bdi device at the time of the crash:
crash> list request.queuelist -s request.__data_len -H 0xffff8f6478d94010  | grep __data_len | awk '{sum += $3} END {print sum}'
4866048
crash>
The read request selected as the next one to be dispatched is still on the list:
crash> list request.queuelist -H 0xffff8f6478d94010 | grep ffff8f64788169c0
ffff8f64788169c0
crash> struct request.start_time_ns,fifo_time,io_start_time_ns 0xffff8f64788169c0
//time at which the request was allocated (ktime_get_ns)
  start_time_ns = 27783830207906353
  fifo_time = 32078496384//deadline (in jiffies) by which the request should be handled after entering the scheduler queue
  io_start_time_ns = 0//time the request was issued to the scsi driver after leaving the scheduler; 0 means it has not yet been handed to the scsi driver layer.

   

crash> p tk_core.timekeeper.tkr_mono | grep -w base
  base = 27783829828082680,//nanosecond count when the system went down (last timekeeper update captured in the dump)
  
 crash> pd (27783829828082680-27783830207906353)
$49 = -379823673//negative: the request was allocated only after the panic was triggered
crash>


crash> p jiffies
jiffies = $34 = 32078496058
crash> p (32078496058-32078496384)
$36 = -326//fifo_time is still 326 jiffies in the future: the panic fired before this io request was inserted into the scheduler queue
crash>

The earliest request inserted into the scheduler queue was queued after the panic; its fifo_time is still 323 jiffies ahead of the captured jiffies value (read fifo_expire is 500 jiffies):
crash> list request.queuelist -s request.fifo_time -H 0xffff8f6478d94010 | grep fifo_time | awk '{print $3}' | sort -n | head -1
32078496381
crash>
crash> p jiffies
jiffies = $37 = 32078496058
crash>
crash> pd (32078496381-32078496058)
$38 = 323
crash>
The earliest-allocated request waiting in the scheduler queue was allocated roughly 376 ms (376581409 ns) after the captured tkr_mono.base value, i.e. after the panic:
crash> list request.queuelist -s request.start_time_ns -H 0xffff8f6478d94010 | grep start_time_ns | awk '{print $3}' | sort -n | head -1
27783830204664089
//strictly speaking ktime_get_ns adds a dynamically computed offset on top of tk_core.timekeeper.tkr_mono.base; here we simply take base as a rough reference to compare against start_time_ns
crash> p tk_core.timekeeper.tkr_mono.base
$58 = 27783829828082680
crash> pd (27783830204664089-27783829828082680)
$48 = 376581409
crash>
From the implementation of blk_mq_rq_to_pdu we know that each request's struct scsi_cmnd sits immediately after the request itself;
the reason lies in how blk_mq_get_request allocates requests.
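The helper itself is trivial: the per-request driver payload (here the scsi_cmnd) is laid out immediately after struct request, so blk_mq_rq_to_pdu is just pointer arithmetic; roughly, from include/linux/blk-mq.h:

static inline void *blk_mq_rq_to_pdu(struct request *rq)
{
        return rq + 1;
}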
//get the size of struct request:
crash> struct request -xo | grep SIZE
SIZE: 0x140
crash>
crash> px (0xffff8f64788169c0+0x140)
$63 = 0xffff8f6478816b00
crash>
crash> struct scsi_cmnd.sc_data_direction,request 0xffff8f6478816b00
  sc_data_direction = DMA_FROM_DEVICE//read IO (a write would be DMA_TO_DEVICE)
  request = 0xffff8f64788169c0
crash>

Combining the deadline scheduler principles above with the coredump: all IO requests in the write direction have already been dispatched (the write requests have been removed from the deadline write fifo list), but 79 read requests are still waiting to be dispatched, and all of the unhandled reads in the fifo list were generated only after the panic was triggered. Since deadline prioritizes reads and only turns to (at most 16) write requests when there are no reads pending or when it has already served 16 * 2 = 32 reads in a row, we can infer that the large amount of read IO delayed the write requests.

crash> struct deadline_data.fifo_list 0xffff8f6478d94000
  fifo_list = {{//read io request fifo
      next = 0xffff8f6478f9d3c0,
      prev = 0xffff8f6478fe6a00
    }, {//write io request fifo
      next = 0xffff8f6478d94020,
      prev = 0xffff8f6478d94020
    }},
  next_rq = {0xffff8f64788169c0, 0x0} //the request selected to be handed to the scsi driver layer next
crash>
crash> px &((struct list_head *)0xffff8f6478d94010)[0]
$13 = (struct list_head *) 0xffff8f6478d94010
crash> list request.queuelist -H 0xffff8f6478d94010 | wc -l 
79 //pending requests in the read io request fifo queue
crash>

//default mq-deadline settings in the kernel
/*
 * See Documentation/block/deadline-iosched.txt
 */
static const int read_expire = HZ / 2;  /* max time before a read is submitted. */
static const int write_expire = 5 * HZ; /* ditto for writes, these limits are SOFT! */
static const int writes_starved = 2;    /* max times reads can starve a write */
static const int fifo_batch = 16;       /* # of sequential requests treated as one
                                     by the above parameters. For throughput. */


//the mq-deadline settings can also be seen on the machine:

# cat /sys/block/vdb/queue/scheduler
[mq-deadline] kyber none
# ls /sys/block/vdb/queue/iosched/
fifo_batch  front_merges  read_expire  write_expire  writes_starved

After a request is removed from the io scheduler's fifo list it has to be handed down to the scsi driver layer. Could the scsi driver layer or the hardware be losing requests or failing to respond? Next, look at the state of the io requests that have already been removed from the deadline fifo but whose IO has not yet actually completed on disk:

blk_mq_do_dispatch_sched picks a suitable request from mq-deadline, removes it from the scheduler's fifo list, builds it into a scsi cmd and sends it down:
/*
 * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
 * its queue by itself in its completion handler, so we don't need to
 * restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
 */
static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
{
        struct request_queue *q = hctx->queue;
        struct elevator_queue *e = q->elevator;
        LIST_HEAD(rq_list);
        do {
                struct request *rq;
                if (e->type->ops.mq.has_work &&
                                !e->type->ops.mq.has_work(hctx))
                        break;
                if (!blk_mq_get_dispatch_budget(hctx))
                        break;
    //dd_dispatch_request picks a suitable request from the deadline fifo list
                rq = e->type->ops.mq.dispatch_request(hctx);
                if (!rq) {
                        blk_mq_put_dispatch_budget(hctx);
                        break;
                }
                list_add(&rq->queuelist, &rq_list);
        } while (blk_mq_dispatch_rq_list(q, &rq_list, true));
}
bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,bool got_budget)
{
   ......
   ret = q->mq_ops->queue_rq(hctx, &bd);//scsi_queue_rq
   .......
}
crash> struct elevator_queue.type 0xffff8f648e321400
  type = 0xffffffff9f77f720
crash> struct elevator_queue.type 0xffff8f648e321400 -xo
struct elevator_queue {
  [ffff8f648e321400] struct elevator_type *type;
}
crash> struct elevator_type.ops 0xffffffff9f77f720
  ops = {
    sq = {
     ......
     .......
          },
    mq = {
      .......
      //dd_bio_merge
      bio_merge = 0xffffffff9e41bf10,
      request_merge = 0xffffffff9e41bd00,
      request_merged = 0xffffffff9e41bc10,
      requests_merged = 0xffffffff9e41c4a0,
      limit_depth = 0x0,
      prepare_request = 0xffffffff9e41b690,
      finish_request = 0xffffffff9e41bc60,
      //dd_insert_requests
      insert_requests = 0xffffffff9e41c1d0,
      //dd_dispatch_request
      dispatch_request = 0xffffffff9e41c570,
      has_work = 0xffffffff9e41b6a0,
      completed_request = 0x0,
      started_request = 0x0,
      requeue_request = 0x0,
      former_request = 0xffffffff9e3dd670,
      next_request = 0xffffffff9e3dd6a0,
      ......
    }
  }
  
crash> struct request_queue.mq_ops   ffff8f64792d1c80
  mq_ops = 0xffffffff9eeb07a0
crash> struct request_queue.mq_ops   ffff8f64792d1c80 -xo
struct request_queue {
  [ffff8f64792d1d78] const struct blk_mq_ops *mq_ops;
}
crash> struct blk_mq_ops 0xffffffff9eeb07a0
struct blk_mq_ops {
  queue_rq = 0xffffffff9e5f4ce0,//scsi_queue_rq
  commit_rqs = 0x0,
  get_budget = 0xffffffff9e5f3230, //scsi_mq_get_budget
  put_budget = 0xffffffff9e5f0c80,
  timeout = 0xffffffff9e5f1650,
  poll = 0x0,
  complete = 0xffffffff9e5f3780,
  init_hctx = 0x0,
  exit_hctx = 0x0,
  init_request = 0xffffffff9e5f1860,
  exit_request = 0xffffffff9e5f14c0,
  initialize_rq_fn = 0xffffffff9e5f14f0,
  cleanup_rq = 0xffffffff9e5f4a30,
  map_queues = 0xffffffff9e5f1b70,
  show_rq = 0xffffffff9e5fc1f0
}
crash>
crash> struct request_queue.queuedata   ffff8f64792d1c80
  queuedata = 0xffff8f6479604000
crash>
crash> struct scsi_device.device_busy,queue_depth,device_blocked,sdev_state 0xffff8f6479604000
  device_busy = {
    counter = 32 //device_busy has reached queue_depth: the scsi device cannot accept more commands
  }
  queue_depth = 32
  device_blocked = {
    counter = 0
  }
  sdev_state = SDEV_RUNNING//the scsi device state is normal
crash>
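The device_busy/queue_depth pair matters because scsi_mq_get_budget (the get_budget callback shown above) refuses a new request until scsi_dev_queue_ready succeeds; a trimmed sketch of that check, roughly from drivers/scsi/scsi_lib.c in 4.19-era kernels:

static inline int scsi_dev_queue_ready(struct request_queue *q,
                                  struct scsi_device *sdev)
{
        unsigned int busy;

        busy = atomic_inc_return(&sdev->device_busy) - 1;
        if (atomic_read(&sdev->device_blocked)) {
                if (busy)
                        goto out_dec;
                /* ...... unblock handling trimmed ...... */
        }

        if (busy >= sdev->queue_depth)
                goto out_dec;

        return 1;
out_dec:
        atomic_dec(&sdev->device_busy);
        return 0;
}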
Get the hardware queue information:
crash>  struct request_queue.queue_hw_ctx ffff8f64792d1c80
  queue_hw_ctx = 0xffff8f648e322400//address of the hardware queue array
crash>
crash> rd 0xffff8f648e322400//single hardware queue, so take the first array element
ffff8f648e322400:  ffff8f648e322000                    . 2.d...
crash>
The struct blk_mq_hw_ctx address is ffff8f648e322000
crash> struct blk_mq_hw_ctx.nr_ctx ffff8f648e322000
  nr_ctx = 80 //the number of software queues matches the number of cpus
crash>
static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data *data)
{
        if (data->flags & BLK_MQ_REQ_INTERNAL)
                return data->hctx->sched_tags;
        return data->hctx->tags;
}
static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
                unsigned int tag, unsigned int op)
{
        struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
 //all requests can be looked up from static_rqs
        struct request *rq = tags->static_rqs[tag];
        req_flags_t rq_flags = 0;
        if (data->flags & BLK_MQ_REQ_INTERNAL) {
                rq->tag = -1;
                rq->internal_tag = tag;//the tag is stored here
        } else {
                ......
              }
}
//each hardware queue has one blk_mq_hw_ctx
crash> struct blk_mq_hw_ctx.sched_tags ffff8f648e322000
  sched_tags = 0xffff8f64792c8c00
//each blk_mq_hw_ctx has 256 tags, and each tag corresponds to one io request
crash> struct blk_mq_tags.nr_tags,static_rqs,nr_reserved_tags 0xffff8f64792c8c00
  nr_tags = 256
  static_rqs = 0xffff8f6479605000
  nr_reserved_tags = 0
crash>
There are 256 tags in total; spot-check one request that is still on the fifo list to verify:
crash> rd 0xffff8f6479605000 256 | head -2
ffff8f6479605000:  ffff8f6478dd0000 ffff8f6478dd2340   ...xd...@#.xd...
ffff8f6479605010:  ffff8f6478dd4680 ffff8f6478dd69c0   .F.xd....i.xd...
crash>
ffff8f6478dd2340 sits at index 1 of static_rqs, so its tag number is 1, and it is indeed still on the deadline scheduler's read fifo list:
crash> list request.queuelist  -H 0xffff8f6478d94010 | grep ffff8f6478dd2340
ffff8f6478dd2340
crash>
crash> struct request.internal_tag ffff8f6478dd2340
  internal_tag = 1
crash>
void blk_mq_free_request(struct request *rq)
{
        struct request_queue *q = rq->q;
        struct elevator_queue *e = q->elevator;
        struct blk_mq_ctx *ctx = rq->mq_ctx;
        struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
        if (rq->rq_flags & RQF_ELVPRIV) {
                if (e && e->type->ops.mq.finish_request)
                        e->type->ops.mq.finish_request(rq);
                if (rq->elv.icq) {
                        put_io_context(rq->elv.icq->ioc);
                        rq->elv.icq = NULL;
                }
        }
        ctx->rq_completed[rq_is_sync(rq)]++;
        if (rq->rq_flags & RQF_MQ_INFLIGHT)
                atomic_dec(&hctx->nr_active);
        if (unlikely(laptop_mode && !blk_rq_is_passthrough(rq)))
                laptop_io_completion(q->backing_dev_info);
        rq_qos_done(q, rq);
        if (blk_rq_rl(rq))
                blk_put_rl(blk_rq_rl(rq));
        WRITE_ONCE(rq->state, MQ_RQ_IDLE);
        if (refcount_dec_and_test(&rq->ref))
                __blk_mq_free_request(rq);
}
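__blk_mq_free_request, called at the end of blk_mq_free_request above, is what actually returns the tags to the sbitmap via blk_mq_put_tag; a trimmed sketch, roughly from 4.19-era block/blk-mq.c:

static void __blk_mq_free_request(struct request *rq)
{
        ......
        const int sched_tag = rq->internal_tag;

        if (rq->tag != -1)
                blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
        if (sched_tag != -1)
                blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
        ......
}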
void blk_mq_put_tag(struct blk_mq_hw_ctx *hctx, struct blk_mq_tags *tags,
                    struct blk_mq_ctx *ctx, unsigned int tag)
{
        if (!blk_mq_tag_is_reserved(tags, tag)) {
                const int real_tag = tag - tags->nr_reserved_tags;
                BUG_ON(real_tag >= tags->nr_tags);
                sbitmap_queue_clear(&tags->bitmap_tags, real_tag, ctx->cpu);
        } else {
                BUG_ON(tag >= tags->nr_reserved_tags);
                sbitmap_queue_clear(&tags->breserved_tags, tag, ctx->cpu);
        }
}
void sbitmap_queue_clear(struct sbitmap_queue *sbq, unsigned int nr,
                         unsigned int cpu)
{
        sbitmap_clear_bit_unlock(&sbq->sb, nr);
        /*
         * Pairs with the memory barrier in set_current_state() to ensure the
         * proper ordering of clear_bit_unlock()/waitqueue_active() in the waker
         * and test_and_set_bit_lock()/prepare_to_wait()/finish_wait() in the
         * waiter. See the comment on waitqueue_active().
         */
        smp_mb__after_atomic();
        sbitmap_queue_wake_up(sbq);
        if (likely(!sbq->round_robin && nr < sbq->sb.depth))
                *per_cpu_ptr(sbq->alloc_hint, cpu) = nr;
}
static inline void sbitmap_clear_bit_unlock(struct sbitmap *sb,
                                            unsigned int bitnr)
{
        clear_bit_unlock(SB_NR_TO_BIT(sb, bitnr), __sbitmap_word(sb, bitnr));
}
crash> struct blk_mq_tags.bitmap_tags 0xffff8f64792c8c00 -xo
struct blk_mq_tags {
  [ffff8f64792c8c10] struct sbitmap_queue bitmap_tags;
}
crash> struct sbitmap_queue ffff8f64792c8c10
struct sbitmap_queue {
  sb = {
    depth = 256,
    shift = 6,
    map_nr = 4,
    map = 0xffff8f6479068100
  },
  alloc_hint = 0x331f6020d45c,
  wake_batch = 8,
  wake_index = {
    counter = 2
  },
  ws = 0xffff8f6478dace00,
  round_robin = false,
  min_shallow_depth = 4294967295
}
crash>
#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U))//bitnr & 0x3f  
crash> px (1 << 6)
$74 = 0x40 //64
crash>
crash> eval 63
hexadecimal: 3f
    decimal: 63
      octal: 77
     binary: 0000000000000000000000000000000000000000000000000000000000111111
crash>
static inline unsigned long *__sbitmap_word(struct sbitmap *sb,
                                            unsigned int bitnr)
{
        return &sb->map[SB_NR_TO_INDEX(sb, bitnr)].word;
}
#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift)
//(sb)->shift is 6 and bitnr is at most 255 (256 tags in total), so SB_NR_TO_INDEX(sb, bitnr) ranges from 0 to 3.
//1<<6=64, i.e. the 256 tags are split into four groups, each group covering up to 64 tags.
crash> struct sbitmap_queue.sb ffff8f64792c8c10 -xo
struct sbitmap_queue {
  [ffff8f64792c8c10] struct sbitmap sb;
}
crash> struct sbitmap.map ffff8f64792c8c10 -xo
struct sbitmap {
  [ffff8f64792c8c20] struct sbitmap_word *map;
}

crash> struct sbitmap.map ffff8f64792c8c10
  map = 0xffff8f6479068100
crash>

//As analysed above, SB_NR_TO_INDEX(sb, bitnr) ranges from 0 to 3, corresponding to four struct sbitmap_word;
//a bit set to 1 in struct sbitmap_word.word means the request behind that tag has been allocated and is in use.

//Find the bits set in the word mask of each of the four struct sbitmap_word and compute the allocated tag ids from them:
//each group covers 64 tags, so tag id = bit position in the word mask + group index * 64. For example:
//bit 57 in group 1's word mask gives tag id = 57 + 1*64 = 121.
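The same arithmetic can be written as a tiny helper. This is purely an illustration of the calculation used in this article (the function name is made up), not kernel code:

/* Illustration only: map an sbitmap word index and bit position back to a tag id,
 * mirroring the manual calculation above (shift == 6, i.e. 64 tags per word). */
static inline unsigned int word_bit_to_tag(unsigned int word_idx,
                                           unsigned int bit,
                                           unsigned int shift)
{
        return (word_idx << shift) + bit;   /* e.g. word 1, bit 57 -> 1*64 + 57 = 121 */
}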

crash> px ((struct sbitmap_word *)0xffff8f6479068100)[0]
$89 = {
  word = 0x40000043ffffff,
  depth = 0x40
}
crash> eval -b 0x40000043ffffff
......
bits set: 54 30 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
crash>
Save the "bits set:" values to the text file bit_index0.txt, convert them to tag ids and save them to tag.txt:
cat bit_index0.txt | awk -F " " '{for (i=1;i<=NF;i++)printf("%s \n", $i);}' > tag.txt
crash> px ((struct sbitmap_word *)0xffff8f6479068100)[1]
$90 = {
  word = 0x200000000000000,
  depth = 0x40
}
crash> eval -b 0x200000000000000
   ......
   bits set: 57
crash>
Save the "bits set:" values to the text file bit_index1.txt, convert them to tag ids and append them to tag.txt:
cat bit_index1.txt | awk -F " " '{for (i=1;i<=NF;i++)printf("%s \n", ($i+64));}'  >> tag.txt
crash> px ((struct sbitmap_word *)0xffff8f6479068100)[2]
$91 = {
  word = 0xfffffffff0000007,
  depth = 0x40
}
crash> eval -b 0xfffffffff0000007
  ......
bits set: 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 2 1 0
crash>
Save the "bits set:" values to the text file bit_index2.txt, convert them to tag ids and append them to tag.txt:
cat bit_index2.txt | awk -F " " '{for (i=1;i<=NF;i++)printf("%s \n", ($i+128));}'  >> tag.txt
crash> px ((struct sbitmap_word *)0xffff8f6479068100)[3]
$92 = {
  word = 0xfffffc00f001ffff,
  depth = 0x40
}
crash> eval -b 0xfffffc00f001ffff
......
   bits set: 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 31 30 29 28 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
crash>
Save the "bits set:" values to the text file bit_index3.txt, convert them to tag ids and append them to tag.txt:
cat bit_index3.txt | awk -F " " '{for (i=1;i<=NF;i++)printf("%s \n", ($i+192));}'  >> tag.txt

In total 111 tags are in use, i.e. 111 io requests have been allocated but have not yet completed their IO to disk:
# cat tag.txt | wc -l
111
# cat tag.txt | head -2
54
30

static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
                unsigned int tag, unsigned int op)
{
        struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
 //all requests can be looked up from static_rqs
        struct request *rq = tags->static_rqs[tag]
       ...... 
}      
crash> struct blk_mq_hw_ctx.sched_tags ffff8f648e322000
  sched_tags = 0xffff8f64792c8c00
crash> struct blk_mq_tags.nr_tags,static_rqs,nr_reserved_tags 0xffff8f64792c8c00
  nr_tags = 256
  static_rqs = 0xffff8f6479605000
  nr_reserved_tags = 0
crash>
tags.static_rqs of struct blk_mq_tags is 0xffff8f6479605000, and requests are fetched from that array: struct request *rq = tags->static_rqs[tag].
So once the allocated tag numbers are known, the addresses of all requests that are allocated but not yet freed can be found.


Use list to dump the tag numbers occupied by all the read requests in the deadline fifo:
list request.queuelist  -s request.internal_tag -H 0xffff8f6478d94010 | grep internal_tag | awk '{print $3}' | sort -n > read_fifo_tag_sort.txt
Sort the allocated tags found above from the bits set in struct sbitmap_word.word:
cat tag.txt | sort -n >tag_sort.txt

We established earlier that at crash time the io deadline scheduler's fifo list held only read requests and no writes,
so after excluding the tag ids of the reads still on the fifo list, the remaining 32 io requests are the ones that were
allocated but either never entered deadline or were already issued to the scsi driver and not yet completed by the hardware:
# grep -vwf read_fifo_tag_sort.txt tag_sort.txt | wc -l
32
Save these 32 tag ids, allocated but not in the io deadline scheduler, to scsi_tag.txt:
grep -vwf read_fifo_tag_sort.txt tag_sort.txt > scsi_tag.txt

Read the request addresses of all 256 tags from static_rqs:
crash> struct blk_mq_tags.static_rqs 0xffff8f64792c8c00
  static_rqs = 0xffff8f6479605000
crash>
crash> rd 0xffff8f6479605000 256 | awk '{print $2"\n"$3}' > static_rqs.txt
crash>

scsi_tag.txt holds the tag ids that are not on the deadline fifo list, and static_rqs.txt holds the io request addresses in tag id order.

Using each tag id as an index, look up the corresponding io request address; what is listed below is therefore the set of
io requests that are allocated but not on the deadline fifo list:
# cat scsi_tag.txt | while read line;do  sed -n "$(($line+1))"p static_rqs.txt;done
ffff8f6478df8d00
ffff8f6478e02340
ffff8f6478e12340
ffff8f6478e48d00
ffff8f6478ee2340
ffff8f6478ef2340
ffff8f6478ef4680
ffff8f6478ef69c0
ffff8f6478fc4680
ffff8f6478fc69c0
ffff8f6478fc8d00
ffff8f6478fcb040
ffff8f6478fe69c0
ffff8f6478fe8d00
ffff8f6478feb040
ffff8f6478fed380
ffff8f6478ff0000
ffff8f6478ff2340
ffff8f6478ff4680
ffff8f6478ff69c0
ffff8f6478ff8d00
ffff8f6478ffb040
ffff8f6478ffd380
ffff8f6478800000
ffff8f6478802340
ffff8f6478804680
ffff8f64788069c0
ffff8f6478808d00
ffff8f647880b040
ffff8f647880d380
ffff8f6478810000
ffff8f6478812340
Export all requests that are allocated but not on the io deadline scheduler's fifo list, converted into crash commands stored in the script scsi_request.txt so they can be executed in bulk:
cat scsi_tag.txt | while read line;do  sed -n "$(($line+1))"p static_rqs.txt | awk '{print "struct request.wbt_flags,start_time_ns,io_start_time_ns "$1}' >> scsi_request.txt;done

wbt_flags          //indicates the IO direction (read or write)
start_time_ns      //time the request was successfully allocated (ns since boot)
io_start_time_ns   //time the request was issued to the scsi driver after leaving the scheduler; 0 means it has not yet been handed to the scsi driver layer.
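Where these two timestamps come from, roughly, in 4.19-era blk-mq (trimmed): start_time_ns is stamped when the request is initialised, while io_start_time_ns is stamped only when the driver actually starts the request:

static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
                unsigned int tag, unsigned int op)
{
        ......
        rq->start_time_ns = ktime_get_ns();
        rq->io_start_time_ns = 0;
        ......
}

void blk_mq_start_request(struct request *rq)
{
        ......
        if (test_bit(QUEUE_FLAG_STATS, &rq->q->queue_flags)) {
                rq->io_start_time_ns = ktime_get_ns();
                ......
        }
        ......
}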

Dump the wbt_flags, start_time_ns and io_start_time_ns members of all those requests:
crash> <./scsi_request.txt > scsi_request_info.txt


When the disk finishes an IO, the completion interrupt path eventually calls wbt_clear_state to clear wbt_flags. If wbt_flags is 0
but io_start_time_ns is non-zero (io_start_time_ns is set when the request is picked from the deadline fifo and issued to the scsi
driver), the hardware has already completed the IO and only the software has not yet freed the request (its tag in static_rqs is
still in use). For example, pick one request from the scsi_request_info.txt exported above:
 crash> struct request.wbt_flags,start_time_ns,io_start_time_ns ffff8f6478e12340
  wbt_flags = 0
  start_time_ns = 27783828896576000
  io_start_time_ns = 27783828896579885
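For reference, wbt_clear_state itself is a trivial helper called from the wbt completion path; roughly, from block/blk-wbt.h:

static inline void wbt_clear_state(struct request *rq)
{
        rq->wbt_flags = 0;
}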
 
The scsi_cmnd is stored right after the request:
crash> struct request -xo | grep SIZE
SIZE: 0x140 
crash> px (0xffff8f6478e12340+0x140)
$96 = 0xffff8f6478e12480//scsi_cmnd address
crash>
crash> struct scsi_cmnd.sc_data_direction,request 0xffff8f6478e12480
  sc_data_direction = DMA_FROM_DEVICE//read IO
  request = 0xffff8f6478e12340
crash>


Based on the meaning of the wbt_flags values, the write io request addresses in scsi_request_info.txt can be picked out:
enum wbt_flags {
        WBT_TRACKED             = 1,    /* write, tracked for throttling */
        WBT_READ                = 2,    /* read */
        WBT_KSWAPD              = 4,    /* write, from kswapd */
        WBT_DISCARD             = 8,    /* discard */
        WBT_NR_BITS             = 4,    /* number of bits */
};
There are four write requests that were allocated and issued to the scsi driver layer but whose writes have not completed, matching the inflight count seen earlier:
# cat scsi_request_info.txt | grep "wbt_flags = 1"
  wbt_flags = 1
  wbt_flags = 1
  wbt_flags = 1
  wbt_flags = 1
  
 crash> struct rq_wait.inflight 0xffff8f648f036280
  inflight = {
    counter = 4
  }

  
  
  List the details of these four write requests that were issued to the scsi driver layer but whose IO writes have not completed:
# cat scsi_request_info.txt | grep "wbt_flags = 1" -B1
crash> struct request.wbt_flags,start_time_ns,io_start_time_ns ffff8f6478ef4680
  wbt_flags = 1
--
crash> struct request.wbt_flags,start_time_ns,io_start_time_ns ffff8f6478ef69c0
  wbt_flags = 1
--
crash> struct request.wbt_flags,start_time_ns,io_start_time_ns ffff8f6478ef8d00
  wbt_flags = 1
--
crash> struct request.wbt_flags,start_time_ns,io_start_time_ns ffff8f647880d380
  wbt_flags = 1

crash> struct request.wbt_flags,start_time_ns,io_start_time_ns ffff8f6478ef4680
  wbt_flags = 1
  start_time_ns = 27783824128512735
  io_start_time_ns = 27783824273300466
  //find the scsi_cmnd address for this request; the structure sits right after the request
crash> px (0xffff8f6478ef4680+0x140)
$103 = 0xffff8f6478ef47c0
crash>
crash> struct scsi_cmnd.sc_data_direction,request 0xffff8f6478ef47c0
  sc_data_direction = DMA_TO_DEVICE
  request = 0xffff8f6478ef4680
crash>  
//time from request allocation to issue to the scsi driver layer (io_start_time_ns - start_time_ns), in nanoseconds:
crash> pd (27783824273300466-27783824128512735)
$119 = 144787731
crash>
  
crash> struct request.wbt_flags,start_time_ns,io_start_time_ns ffff8f6478ef69c0
  wbt_flags = 1
  start_time_ns = 27783824128522763
  io_start_time_ns = 27783824273533146
//find the scsi_cmnd address for this request; the structure sits right after the request
crash> px (0xffff8f6478ef69c0+0x140)
$106 = 0xffff8f6478ef6b00
crash>
 crash> struct scsi_cmnd.sc_data_direction,request 0xffff8f6478ef6b00
  sc_data_direction = DMA_TO_DEVICE
  request = 0xffff8f6478ef69c0
crash>  
//time from request allocation to issue to the scsi driver layer (io_start_time_ns - start_time_ns), in nanoseconds:
 crash> pd (27783824273533146-27783824128522763)
$120 = 145010383
crash>
   
  
crash> struct request.wbt_flags,start_time_ns,io_start_time_ns ffff8f6478ef8d00
  wbt_flags = 1
  start_time_ns = 27783824128526993
  io_start_time_ns = 27783824273862236
  //find the scsi_cmnd address for this request; the structure sits right after the request
crash> px (0xffff8f6478ef8d00+0x140)
$117 = 0xffff8f6478ef8e40
crash> struct scsi_cmnd.sc_data_direction,request 0xffff8f6478ef8e40
  sc_data_direction = DMA_TO_DEVICE
  request = 0xffff8f6478ef8d00
crash>
//time from request allocation to issue to the scsi driver layer (io_start_time_ns - start_time_ns), in nanoseconds:
  crash> pd (27783824273862236-27783824128526993)
$121 = 145335243
crash>
  
  crash> struct request.wbt_flags,start_time_ns,io_start_time_ns ffff8f647880d380
  wbt_flags = 1
  start_time_ns = 27783824128514586
  io_start_time_ns = 27783824244111757
  
 //find the scsi_cmnd address for this request; the structure sits right after the request
   crash> px (0xffff8f647880d380+0x140)
$105 = 0xffff8f647880d4c0
crash> struct scsi_cmnd.sc_data_direction,request 0xffff8f647880d4c0
  sc_data_direction = DMA_TO_DEVICE
  request = 0xffff8f647880d380
crash> 
//time from request allocation to issue to the scsi driver layer (io_start_time_ns - start_time_ns), in nanoseconds:
crash> pd (27783824244111757-27783824128514586)
$122 = 115597171
crash>
  
All four write requests that were issued to the scsi driver layer but have not completed were allocated only after the panic was
triggered, which shows that once a write request is allocated the system does not sit on it for long and the hardware is responding:
crash> log | grep panic
[27783793.158898] Kernel panic - not syncing: hung_task: blocked tasks
[27783793.435081]  panic+0xe8/0x258
//interval between the panic and issuing write request ffff8f647880d380 to the scsi driver layer:
crash> pd (27783793435081-27783824244111)
$114 = -30809030//microseconds
crash>
crash> pd (27783824244111-27783793435081)
$126 = 30809030
crash>
//interval between the panic and allocating write request ffff8f647880d380:
crash> pd (27783793435081-27783824128514)
$116 = -30693433//microseconds
crash>
crash> pd (27783824128514-27783793435081)
$124 = 30693433
crash>
Earlier we saw that the write-side processes were blocked on rq_wait.wait because the number of allocated write requests had exceeded
the limit, and that by the time the panic fired the longest-blocked task, the per-bdi dirty-page writeback kworker, had already been
waiting for 6s. The write requests now sitting in the scsi driver layer were allocated only after the panic, which means that before
the panic these writers were also blocked on rq_wait.wait, queued ahead of the kworker with pid 22335; they were only woken up in the
window between the hung-task panic and the final reboot, when earlier write requests completed their IO and freed write request slots.

Among all the read and write requests that were issued to the scsi driver layer but whose IO has not completed on disk, the earliest
issue timestamp is roughly 30 seconds after the panic was triggered:
cat scsi_request_info.txt | grep  "io_start_time_ns =" | awk '{print $3}' | sort -nr | tail -1
27783824244111757
crash> pd (27783793435081-27783824244111)
$114 = -30809030//microseconds
crash>

Putting the whole vmcore analysis together, no individual io request spent an excessive amount of time between request allocation and the disk completing the IO, so kernel and hardware problems can be ruled out. The hung task was therefore an IO hang caused by the chain reaction of excessive read IO traffic and accumulated dirty pages before the reboot.

Original statement: this article was published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

For takedown requests, please contact cloudcommunity@tencent.com.
