Data Preparation
HDFS is one of the storage options for Hadoop/Spark batch jobs. This article demonstrates creating a file in HDFS and then accessing it from Spark.
1. Activate the HDFS service and create a file system
2. Set up a permission group
1. Create a permission group.
2. Configure the permission group's rules.
3. Attach the permission group to the mount point.
The HDFS file system is now ready.
3. Install the Apache Hadoop client
Once the HDFS file system is ready, the next step is to load files into it. We use the HDFS client to do so.
Apache Hadoop download: official link. Use Apache Hadoop 2.7.2 or later; this document uses Apache Hadoop 2.7.2.
1. Run the following command to extract the Apache Hadoop archive into the target directory.
tar -zxvf hadoop-2.7.2.tar.gz -C /usr/local/
2. Run the following command to open the core-site.xml configuration file.
vim /usr/local/hadoop-2.7.2/etc/hadoop/core-site.xml
Edit core-site.xml as follows:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>dfs://f-4b1fcae5dvexx.cn-hangzhou.dfs.aliyuncs.com:10290</value>
<!-- Replace this address with your own HDFS mount point address -->
</property>
<property>
<name>fs.dfs.impl</name>
<value>com.alibaba.dfs.DistributedFileSystem</value>
</property>
<property>
<name>fs.AbstractFileSystem.dfs.impl</name>
<value>com.alibaba.dfs.DFS</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>8388608</value>
</property>
<property>
<name>alidfs.use.buffer.size.setting</name>
<value>false</value>
<!-- Recommended off: in our tests, enabling it sharply reduced the I/O size and hurt throughput -->
</property>
<property>
<name>dfs.usergroupservice.impl</name>
<value>com.alibaba.dfs.security.LinuxUserGroupService.class</value>
</property>
<property>
<name>dfs.connection.count</name>
<value>256</value>
</property>
</configuration>
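To avoid hand-editing the mount point into the XML, the essential part of the file can be generated from a variable. This is a minimal sketch covering only the three core properties; `MOUNT_POINT` uses the example value from this guide and must be replaced with your own, and the output path `/tmp/core-site.xml` is illustrative.

```shell
# Generate a minimal core-site.xml from the mount point address.
# MOUNT_POINT is the example value from this guide; replace it with yours.
MOUNT_POINT="f-4b1fcae5dvexx.cn-hangzhou.dfs.aliyuncs.com:10290"
cat > /tmp/core-site.xml <<EOF
<configuration>
  <property><name>fs.defaultFS</name><value>dfs://${MOUNT_POINT}</value></property>
  <property><name>fs.dfs.impl</name><value>com.alibaba.dfs.DistributedFileSystem</value></property>
  <property><name>fs.AbstractFileSystem.dfs.impl</name><value>com.alibaba.dfs.DFS</value></property>
</configuration>
EOF
grep -c '<name>' /tmp/core-site.xml   # counts the 3 properties written
```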
3. Run the following command to open the /etc/profile configuration file.
vim /etc/profile
Add the environment variables.
export HADOOP_HOME=/usr/local/hadoop-2.7.2
export HADOOP_CLASSPATH=/usr/local/hadoop-2.7.2/etc/hadoop:/usr/local/hadoop-2.7.2/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.2/share/hadoop/common/*:/usr/local/hadoop-2.7.2/share/hadoop/hdfs:/usr/local/hadoop-2.7.2/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.2/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.2/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.2/share/hadoop/yarn/*:/usr/local/hadoop-2.7.2/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.2/share/hadoop/mapreduce/*:/usr/local/hadoop-2.7.2/contrib/capacity-scheduler/*.jar
export HADOOP_CONF_DIR=/usr/local/hadoop-2.7.2/etc/hadoop
Run the following command to apply the configuration.
source /etc/profile
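The long HADOOP_CLASSPATH line above can be built with a small loop instead, which is easier to audit. This is a sketch that reproduces the same entries except the bare share/hadoop/hdfs directory (which only matters if you keep loose class files there); the wildcards are quoted so they stay literal and are expanded by the JVM, not the shell.

```shell
# Build HADOOP_CLASSPATH programmatically for the install path used in this guide.
HADOOP_HOME=/usr/local/hadoop-2.7.2
HADOOP_CLASSPATH="$HADOOP_HOME/etc/hadoop"
for d in common hdfs yarn mapreduce; do
  # Each module contributes its own jars and its lib/ dependency jars.
  HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/$d/lib/*:$HADOOP_HOME/share/hadoop/$d/*"
done
HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$HADOOP_HOME/contrib/capacity-scheduler/*.jar"
export HADOOP_CLASSPATH
echo "$HADOOP_CLASSPATH"
```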
4. Add the Alibaba Cloud HDFS dependency.
cp aliyun-sdk-dfs-1.0.3.jar /usr/local/hadoop-2.7.2/share/hadoop/hdfs
Download: get the Apsara File Storage HDFS SDK here.
4. Upload data
# Create the data directory
[root@liumi-hdfs ~]# $HADOOP_HOME/bin/hadoop fs -mkdir -p /pod/data
# Upload the locally prepared file (a plain-text novel) to HDFS
[root@liumi-hdfs ~]# $HADOOP_HOME/bin/hadoop fs -put ./A-Game-of-Thrones.txt /pod/data/A-Game-of-Thrones.txt
# Verify: the file is roughly 30 GB
[root@liumi-hdfs local]# $HADOOP_HOME/bin/hadoop fs -ls /pod/data
Found 1 items
-rwxrwxrwx 3 root root 33710040000 2019-11-10 13:02 /pod/data/A-Game-of-Thrones.txt
The HDFS data preparation is now complete.
Reading HDFS data in a Spark application
1. Develop the application
Application development is no different from a traditional deployment.
SparkConf conf = new SparkConf().setAppName(WordCount.class.getSimpleName());
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("dfs://f-4b1fcae5dvxxx.cn-hangzhou.dfs.aliyuncs.com:10290/pod/data/A-Game-of-Thrones.txt", 250);
...
wordsCountResult.saveAsTextFile("dfs://f-4b1fcae5dvxxx.cn-hangzhou.dfs.aliyuncs.com:10290/pod/data/A-Game-of-Thrones-Result");
sc.close();
2. HDFS configuration
There are two common approaches: 1) a static configuration file; 2) setting the configuration dynamically when submitting the application.
1) Put the core-site.xml shown earlier into the application project's resources directory:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- HDFS configuration -->
<property>
<name>fs.defaultFS</name>
<value>dfs://f-4b1fcae5dvexx.cn-hangzhou.dfs.aliyuncs.com:10290</value>
<!-- Replace this address with your own HDFS mount point address -->
</property>
<property>
<name>fs.dfs.impl</name>
<value>com.alibaba.dfs.DistributedFileSystem</value>
</property>
<property>
<name>fs.AbstractFileSystem.dfs.impl</name>
<value>com.alibaba.dfs.DFS</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>8388608</value>
</property>
<property>
<name>alidfs.use.buffer.size.setting</name>
<value>false</value>
<!-- Recommended off: in our tests, enabling it sharply reduced the I/O size and hurt throughput -->
</property>
<property>
<name>dfs.usergroupservice.impl</name>
<value>com.alibaba.dfs.security.LinuxUserGroupService.class</value>
</property>
<property>
<name>dfs.connection.count</name>
<value>256</value>
</property>
</configuration>
2) Set the configuration dynamically when the application is submitted:
hadoopConf:
# HDFS
"fs.defaultFS": "dfs://f-4b1fcae5dvexx.cn-hangzhou.dfs.aliyuncs.com:10290"
"fs.dfs.impl": "com.alibaba.dfs.DistributedFileSystem"
"fs.AbstractFileSystem.dfs.impl": "com.alibaba.dfs.DFS"
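The hadoopConf block above is the form used in a Spark Operator SparkApplication manifest. A minimal sketch of where it sits follows; this is an illustrative fragment, not a complete manifest (the name `wordcount` is hypothetical, and a real manifest also needs driver/executor sections):

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: wordcount            # hypothetical application name
spec:
  type: Java
  mode: cluster
  image: registry.cn-beijing.aliyuncs.com/liumi/spark:2.4.4-example
  mainClass: com.aliyun.liumi.spark.example.WordCount
  mainApplicationFile: "local:///opt/spark/jars/SparkExampleJava-1.0-SNAPSHOT.jar"
  sparkVersion: "2.4.4"
  hadoopConf:
    "fs.defaultFS": "dfs://f-4b1fcae5dvexx.cn-hangzhou.dfs.aliyuncs.com:10290"
    "fs.dfs.impl": "com.alibaba.dfs.DistributedFileSystem"
    "fs.AbstractFileSystem.dfs.impl": "com.alibaba.dfs.DFS"
```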
3. Package a jar that includes all dependencies
mvn assembly:assembly
The application's pom.xml is attached:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.aliyun.liumi.spark</groupId>
    <artifactId>SparkExampleJava</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>2.4.3</version>
        </dependency>

        <dependency>
            <groupId>com.aliyun.dfs</groupId>
            <artifactId>aliyun-sdk-dfs</artifactId>
            <version>1.0.3</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.6</version>
                <configuration>
                    <appendAssemblyId>false</appendAssemblyId>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>com.aliyun.liumi.spark.example.WordCount</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>assembly</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
4. Write the Dockerfile
# spark base image
FROM registry.cn-hangzhou.aliyuncs.com/eci_open/spark:2.4.4
# The bundled kubernetes-client version has known issues; use a recent one
RUN rm $SPARK_HOME/jars/kubernetes-client-*.jar
ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-client/4.4.2/kubernetes-client-4.4.2.jar $SPARK_HOME/jars
# Copy the local application jar
RUN mkdir -p /opt/spark/jars
COPY SparkExampleJava-1.0-SNAPSHOT.jar /opt/spark/jars
5. Build the application image
docker build -t registry.cn-beijing.aliyuncs.com/liumi/spark:2.4.4-example -f Dockerfile .
6. Push to Alibaba Cloud ACR
docker push registry.cn-beijing.aliyuncs.com/liumi/spark:2.4.4-example
The image is now ready. What remains is to deploy the Spark application in the Kubernetes cluster.
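For reference, one way to run the image is a direct spark-submit against the Kubernetes API server; the flags below are standard Spark-on-Kubernetes options, while `APISERVER` is a placeholder you must fill in and the executor count is illustrative. This sketch assembles and prints the command for review rather than executing it:

```shell
# Assemble a Spark-on-Kubernetes submit command; review it, then run it yourself.
APISERVER="https://your-apiserver:6443"   # placeholder: your cluster's API server address
IMAGE="registry.cn-beijing.aliyuncs.com/liumi/spark:2.4.4-example"
submit_cmd="spark-submit \
 --master k8s://${APISERVER} \
 --deploy-mode cluster \
 --name wordcount \
 --class com.aliyun.liumi.spark.example.WordCount \
 --conf spark.executor.instances=2 \
 --conf spark.kubernetes.container.image=${IMAGE} \
 local:///opt/spark/jars/SparkExampleJava-1.0-SNAPSHOT.jar"
echo "$submit_cmd"
```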