Optimizing Spark SQL Reads of Kudu Data

shengjk1

Published 2020-05-12 10:55:31

Included in the column: 码字搬砖

1. Background

I read Kudu data through Spark SQL. Because the Kudu table has only 6 tablets, Spark by default launches only 6 tasks to read it. The Kudu UI shows the scan rate holding at 143M/s, and I wanted to raise Spark's read throughput from Kudu.

![Image](https://img-blog.csdnimg.cn/2020051118163413.png)

2. Changes

Tracing through the kudu-spark.jar source code shows the following:

kudu.batchSize: defaults to 20M. Sets the maximum number of bytes returned by the scanner on each batch.

kudu.splitSizeBytes: sets the target number of bytes per Spark task. If set, each tablet's primary key range is split to generate uniform task sizes, instead of the default of 1 task per tablet.

The tuned parameters:
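To get a feel for how the split size relates to the task count, here is a rough back-of-the-envelope sketch. It assumes each tablet is cut into chunks of about `splitSizeBytes` with one task per chunk; that is a simplified model for illustration, not the connector's actual code, and the names `SplitEstimate`, `tasksPerTablet`, and `estimatedTasks` as well as the tablet sizes in the usage note are hypothetical.

```scala
object SplitEstimate {
  val GiB: Long = 1024L * 1024L * 1024L

  // Simplified model (an assumption): a tablet holding `tabletBytes` yields
  // ceil(tabletBytes / splitSizeBytes) tasks, and never fewer than one.
  def tasksPerTablet(tabletBytes: Long, splitSizeBytes: Long): Long = {
    require(tabletBytes >= 0, "tabletBytes must not be negative")
    require(splitSizeBytes > 0, "splitSizeBytes must be positive")
    math.max(1L, (tabletBytes + splitSizeBytes - 1) / splitSizeBytes)
  }

  // Sum the per-tablet estimates to approximate the number of Spark tasks.
  def estimatedTasks(tabletSizes: Seq[Long], splitSizeBytes: Long): Long =
    tabletSizes.map(tasksPerTablet(_, splitSizeBytes)).sum
}
```

Under this model, six tablets of 25 GiB each with a 10 GiB split size would give 3 tasks per tablet, 18 in total. If every tablet is smaller than the split size, the estimate stays at one task per tablet, so the split size has to be below the tablet size to gain parallelism.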

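// Assumes `spark` (a SparkSession), `kuduMasters`, and `kuduTableName` are defined elsewhere.
val sqlDF = spark.sqlContext.read.options(
  Map(
    "kudu.master" -> kuduMasters,
    "kudu.table" -> kuduTableName,
    // 400 MiB (400 * 1024 * 1024 bytes)
    "kudu.batchSize" -> "419430400",
    // 10 GiB (10 * 1024 * 1024 * 1024 bytes)
    "kudu.splitSizeBytes" -> "10737418240"))
  .format("kudu")
  .load
  .cache()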