Data preparation

HDFS is one of the data stores available to Hadoop/Spark batch jobs. This article demonstrates how to create a file in HDFS and access it from Spark.

1. Activate the HDFS service and create a file system

2. Set up a permission group

1. Create a permission group.

2. Configure rules for the permission group.

3. Attach the permission group to the mount point.

At this point, the HDFS file system is ready.

3. Install the Apache Hadoop client

Once the HDFS file system is ready, the next step is to store files in it. We use an HDFS client for this.

Apache Hadoop download address: official link. It is recommended to use Apache Hadoop 2.7.2 or later; this document uses Apache Hadoop 2.7.2.

1. Run the following command to extract the Apache Hadoop archive to the target directory.

tar -zxvf hadoop-2.7.2.tar.gz -C /usr/local/

2. Run the following command to open the core-site.xml configuration file.

vim /usr/local/hadoop-2.7.2/etc/hadoop/core-site.xml

Modify the core-site.xml configuration file as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>dfs://f-4b1fcae5dvexx.cn-hangzhou.dfs.aliyuncs.com:10290</value>
        <!-- Replace this with your own HDFS mount point address -->
    </property>
    <property>
        <name>fs.dfs.impl</name>
        <value>com.alibaba.dfs.DistributedFileSystem</value>
    </property>
    <property>
        <name>fs.AbstractFileSystem.dfs.impl</name>
        <value>com.alibaba.dfs.DFS</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>8388608</value>
    </property>
    <property>
        <name>alidfs.use.buffer.size.setting</name>
        <value>false</value>
        <!-- Recommended to keep this disabled; in testing, enabling it significantly reduced the I/O size and hurt data throughput -->
    </property>
    <property>
        <name>dfs.usergroupservice.impl</name>
        <value>com.alibaba.dfs.security.LinuxUserGroupService.class</value>
    </property>
    <property>
        <name>dfs.connection.count</name>
        <value>256</value>
    </property>
</configuration>
Note: Because we run on Kubernetes, the YARN-related properties do not need to be configured; only the HDFS-related properties above are required. The modified core-site.xml file will be used in several places later.

3. Run the following command to open the /etc/profile configuration file.

vim /etc/profile

Add the following environment variables:

export HADOOP_HOME=/usr/local/hadoop-2.7.2
export HADOOP_CLASSPATH=/usr/local/hadoop-2.7.2/etc/hadoop:/usr/local/hadoop-2.7.2/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.2/share/hadoop/common/*:/usr/local/hadoop-2.7.2/share/hadoop/hdfs:/usr/local/hadoop-2.7.2/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.2/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.2/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.2/share/hadoop/yarn/*:/usr/local/hadoop-2.7.2/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.2/share/hadoop/mapreduce/*:/usr/local/hadoop-2.7.2/contrib/capacity-scheduler/*.jar
export HADOOP_CONF_DIR=/usr/local/hadoop-2.7.2/etc/hadoop

Run the following command to make the configuration take effect.

source /etc/profile
Note: Only an HDFS client is needed; there is no need to deploy an HDFS cluster.

4. Add the Alibaba Cloud HDFS dependency

cp aliyun-sdk-dfs-1.0.3.jar  /usr/local/hadoop-2.7.2/share/hadoop/hdfs

Download address: download the Apsara File Storage HDFS SDK here.
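
To verify the client setup, a small Java check can be run against the mount point. This is a minimal sketch, not part of the original steps: it assumes core-site.xml is on the classpath (for example via the HADOOP_CONF_DIR set above) and that the aliyun-sdk-dfs jar is available; the class name is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DfsConnectivityCheck {
    public static void main(String[] args) throws Exception {
        // Configuration loads core-site.xml from the classpath, so fs.defaultFS
        // already points at the dfs:// mount point configured above.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Should print com.alibaba.dfs.DistributedFileSystem when the SDK jar is present.
        System.out.println("FileSystem implementation: " + fs.getClass().getName());
        // List the root directory of the mount point to confirm connectivity.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}

Running it through the hadoop launcher (hadoop jar) reuses the classpath configured in /etc/profile.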

4. Upload data

# Create the data directory
[root@liumi-hdfs ~]# $HADOOP_HOME/bin/hadoop fs -mkdir -p /pod/data
# Upload the locally prepared file (the text of a novel) to HDFS
[root@liumi-hdfs ~]# $HADOOP_HOME/bin/hadoop fs -put ./A-Game-of-Thrones.txt /pod/data/A-Game-of-Thrones.txt
# Verify; the file is about 30 GB
[root@liumi-hdfs local]# $HADOOP_HOME/bin/hadoop fs -ls /pod/data
Found 1 items
-rwxrwxrwx   3 root root 33710040000 2019-11-10 13:02 /pod/data/A-Game-of-Thrones.txt

At this point, the HDFS data preparation is complete.

Read HDFS data in a Spark application

1. Develop the application

Application development is no different from a traditional deployment.

SparkConf conf = new SparkConf().setAppName(WordCount.class.getSimpleName());
JavaSparkContext sc = new JavaSparkContext(conf);

JavaRDD<String> lines = sc.textFile("dfs://f-4b1fcae5dvxxx.cn-hangzhou.dfs.aliyuncs.com:10290/pod/data/A-Game-of-Thrones.txt", 250);

...
wordsCountResult.saveAsTextFile("dfs://f-4b1fcae5dvxxx.cn-hangzhou.dfs.aliyuncs.com:10290/pod/data/A-Game-of-Thrones-Result");

sc.close();
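
For reference, a complete, self-contained version of the application might look like the sketch below. The word-splitting logic and the pair-RDD layout are assumptions for illustration; they are not the original elided code.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName(WordCount.class.getSimpleName());
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the input file from the HDFS mount point (same address as above).
        JavaRDD<String> lines = sc.textFile(
                "dfs://f-4b1fcae5dvxxx.cn-hangzhou.dfs.aliyuncs.com:10290/pod/data/A-Game-of-Thrones.txt", 250);

        // Split each line into words, map to (word, 1) pairs, and sum the counts per word.
        JavaPairRDD<String, Integer> wordsCountResult = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        // Write the result back to HDFS.
        wordsCountResult.saveAsTextFile(
                "dfs://f-4b1fcae5dvxxx.cn-hangzhou.dfs.aliyuncs.com:10290/pod/data/A-Game-of-Thrones-Result");

        sc.close();
    }
}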

2. HDFS configuration

There are two common ways to configure this: 1) a static configuration file; 2) setting the properties dynamically when submitting the application.

1) Place the core-site.xml from earlier into the application project's resources directory (a short sketch showing how it is picked up follows the file below).

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>

    <!-- HDFS configuration -->
    <property>
        <name>fs.defaultFS</name>
        <value>dfs://f-4b1fcae5dvexx.cn-hangzhou.dfs.aliyuncs.com:10290</value>
        <!-- Replace this with your own HDFS mount point address -->
    </property>
    <property>
        <name>fs.dfs.impl</name>
        <value>com.alibaba.dfs.DistributedFileSystem</value>
    </property>
    <property>
        <name>fs.AbstractFileSystem.dfs.impl</name>
        <value>com.alibaba.dfs.DFS</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>8388608</value>
    </property>
    <property>
        <name>alidfs.use.buffer.size.setting</name>
        <value>false</value>
        <!-- Recommended to keep this disabled; in testing, enabling it significantly reduced the I/O size and hurt data throughput -->
    </property>
    <property>
        <name>dfs.usergroupservice.impl</name>
        <value>com.alibaba.dfs.security.LinuxUserGroupService.class</value>
    </property>
    <property>
        <name>dfs.connection.count</name>
        <value>256</value>
    </property>
</configuration>
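
When the file is bundled under the resources directory, it ends up on the application's classpath, and Hadoop's Configuration (from which Spark builds its Hadoop configuration) loads core-site.xml from the classpath automatically. A minimal sketch to check this, with an illustrative class name:

import org.apache.hadoop.conf.Configuration;

public class ConfCheck {
    public static void main(String[] args) {
        // core-site.xml on the classpath is loaded automatically.
        Configuration conf = new Configuration();
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
        System.out.println("fs.dfs.impl  = " + conf.get("fs.dfs.impl"));
    }
}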

2) Taking Spark as an example, the properties can also be set when the application is submitted:
hadoopConf:
    # HDFS
    "fs.defaultFS": "dfs://f-4b1fcae5dvexx.cn-hangzhou.dfs.aliyuncs.com:10290"
    "fs.dfs.impl": "com.alibaba.dfs.DistributedFileSystem"
    "fs.AbstractFileSystem.dfs.impl": "com.alibaba.dfs.DFS"

3. The packaged JAR file must include all dependencies

mvn assembly:assembly

The application's pom.xml is attached for reference:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.aliyun.liumi.spark</groupId>
    <artifactId>SparkExampleJava</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>2.4.3</version>
        </dependency>

        <dependency>
            <groupId>com.aliyun.dfs</groupId>
            <artifactId>aliyun-sdk-dfs</artifactId>
            <version>1.0.3</version>
        </dependency>

    </dependencies>

    <build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>2.6</version>
            <configuration>
                <appendAssemblyId>false</appendAssemblyId>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
                <archive>
                    <manifest>
                        <mainClass>com.aliyun.liumi.spark.example.WordCount</mainClass>
                    </manifest>
                </archive>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>assembly</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
    </build>
</project>

4. Write the Dockerfile

# spark base image
FROM registry.cn-hangzhou.aliyuncs.com/eci_open/spark:2.4.4
# The default kubernetes-client version has known issues; a newer version is recommended
RUN rm $SPARK_HOME/jars/kubernetes-client-*.jar
ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-client/4.4.2/kubernetes-client-4.4.2.jar $SPARK_HOME/jars
# Copy the local application jar
RUN mkdir -p /opt/spark/jars
COPY SparkExampleJava-1.0-SNAPSHOT.jar /opt/spark/jars

5. Build the application image

docker build -t registry.cn-beijing.aliyuncs.com/liumi/spark:2.4.4-example -f Dockerfile .	

6. Push the image to Alibaba Cloud ACR

docker push registry.cn-beijing.aliyuncs.com/liumi/spark:2.4.4-example	

At this point, the image is ready. The next step is to deploy the Spark application in the Kubernetes cluster.