2024 Cloud-Native Ops in Practice, 99 Original Articles Plan, No. 014 | Mastering AIGC (2024) Series, No. 004
Hello, and welcome to 运维有术.
Today's topic, from the Mastering AIGC (2024) series, is Laying the Foundation for a Local AI LLM: Installing Docker and the NVIDIA Container Toolkit on Ubuntu 24.04 LTS.
This article walks through installing Docker and the NVIDIA Container Toolkit on an Ubuntu 24.04 LTS cloud host for AI large language models, so that Docker containers can use the GPU to run local LLMs.
The demonstration uses a repurposed mining card, the NVIDIA P104-100; the same steps apply to other GPU models.
This article assumes you have already installed and configured the Ubuntu 24.04 LTS operating system.
Next, we format the 500 GB data disk and set up automatic mounting.
Since this part is basic, we use a shell script to automate the disk formatting and LVM configuration.
The data disk is configured as follows; adjust the script if your environment differs:
Disk device: /dev/sdb
Mount point: /data
Partitioning scheme: LVM
Docker data root: /data/docker
Container runtime data directory: /data/containers
# Create an LVM physical volume, volume group, and logical volume on /dev/sdb
pvcreate /dev/sdb
vgcreate data /dev/sdb
lvcreate -l 100%VG data -n lvdata

# Format the logical volume as XFS and mount it at /data
mkfs.xfs /dev/mapper/data-lvdata
mkdir -p /data
mount /dev/mapper/data-lvdata /data/

# Persist the mount: copy the kernel's new mtab entry into /etc/fstab
tail -1 /etc/mtab >> /etc/fstab

# Create the container runtime data directory
mkdir -p /data/containers
Save the script above as create-lvm.sh, then run it.
The script run looks like this:
root@AI-LLM-Prod:~# sh create-lvm.sh
Physical volume "/dev/sdb" successfully created.
Volume group "data" successfully created
Logical volume "lvdata" created.
meta-data=/dev/mapper/data-lvdata isize=512 agcount=4, agsize=32767744 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1 nrext64=0
data = bsize=4096 blocks=131070976, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=63999, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
Discarding blocks...Done.
Verify the mount:
root@AI-LLM-Prod:~# df -h
Filesystem Size Used Avail Use% Mounted on
tmpfs 3.2G 1.2M 3.2G 1% /run
/dev/mapper/ubuntu--vg-ubuntu--lv 38G 4.8G 31G 14% /
tmpfs 16G 0 16G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/sda2 2.0G 95M 1.7G 6% /boot
tmpfs 3.2G 12K 3.2G 1% /run/user/1000
/dev/mapper/data-lvdata 500G 9.7G 491G 2% /data
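The `tail -1 /etc/mtab >> /etc/fstab` step in the script persists the mount by copying the kernel's record of it verbatim. As a sketch of what that record contains (the mtab line below is an illustrative sample, since the real file only has it once the mount exists):

```shell
# What `tail -1 /etc/mtab` captures after the mount (sample line inlined;
# the fields are: device, mount point, fs type, options, dump, fsck pass)
mtab_line='/dev/mapper/data-lvdata /data xfs rw,relatime,attr2,inode64 0 0'
echo "$mtab_line" | awk '{printf "device=%s mountpoint=%s fstype=%s\n", $1, $2, $3}'
```

Copying the device-mapper path works here; for removable or renumbered disks, a UUID-based fstab entry would be more robust.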
All of the following operations are performed as root, on a freshly installed, clean operating system.
apt-get update
apt-get install ca-certificates curl gnupg
We use the Tsinghua University open-source mirror, mirrors.tuna.tsinghua.edu.cn, as the package source.
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /etc/apt/keyrings/docker.gpg
chmod a+r /etc/apt/keyrings/docker.gpg
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://mirrors.tuna.tsinghua.edu.cn/docker-ce/linux/ubuntu \
"$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
tee /etc/apt/sources.list.d/docker.list > /dev/null
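To see what the generated docker.list entry ends up looking like, the two substitutions can be expanded by hand. On an amd64 Ubuntu 24.04 host (codename noble; both values below are assumptions about your machine, normally supplied by `dpkg --print-architecture` and /etc/os-release):

```shell
# Illustrative expansion of the docker.list entry; arch and codename are
# hard-coded here to what an amd64 Ubuntu 24.04 system would report
arch=amd64
codename=noble
echo "deb [arch=${arch} signed-by=/etc/apt/keyrings/docker.gpg] https://mirrors.tuna.tsinghua.edu.cn/docker-ce/linux/ubuntu ${codename} stable"
```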
apt-get update
apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
Create the configuration file /etc/docker/daemon.json:
cat > /etc/docker/daemon.json <<EOF
{
  "data-root": "/data/docker",
  "exec-opts": [
    "native.cgroupdriver=systemd"
  ],
  "log-level": "info",
  "log-opts": {
    "max-size": "100m",
    "max-file": "5"
  },
  "storage-driver": "overlay2"
}
EOF
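Before restarting Docker it is worth validating this file, since a JSON syntax error prevents dockerd from starting at all. A minimal check (the config content is inlined here so the sketch is self-contained; in practice, point the tool at /etc/docker/daemon.json itself):

```shell
# Validate daemon.json syntax before restarting Docker; against the real
# file this would be: python3 -m json.tool /etc/docker/daemon.json
python3 -m json.tool > /dev/null <<'EOF' && echo "daemon.json OK"
{
  "data-root": "/data/docker",
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-level": "info",
  "log-opts": { "max-size": "100m", "max-file": "5" },
  "storage-driver": "overlay2"
}
EOF
```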
Start the Docker service, enable it at boot, and check the daemon info:
systemctl restart docker
systemctl enable docker --now
docker info
Run a test container to verify the installation:

docker run --rm hello-world
root@AI-LLM-Prod:~# docker run --rm hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
c1ec31eb5944: Pull complete
Digest: sha256:a26bff933ddc26d5cdf7faa98b4ae1e3ec20c4985e6f87ac0973052224d24302
Status: Downloaded newer image for hello-world:latest
Hello from Docker!
This message shows that your installation appears to be working correctly.
To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
(amd64)
3. The Docker daemon created a new container from that image which runs the
executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
to your terminal.
To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash
Share images, automate workflows, and more with a free Docker ID:
https://hub.docker.com/
For more examples and ideas, visit:
https://docs.docker.com/get-started/
This experiment runs in a personal learning and development environment, so we take the simple route and install the NVIDIA driver from the distribution packages. For production, the matching .run binary installer is recommended instead. First, list the drivers available for the detected GPU:
root@AI-LLM-Prod:~# ubuntu-drivers devices
udevadm hwdb is deprecated. Use systemd-hwdb instead.
udevadm hwdb is deprecated. Use systemd-hwdb instead.
udevadm hwdb is deprecated. Use systemd-hwdb instead.
udevadm hwdb is deprecated. Use systemd-hwdb instead.
ERROR:root:aplay command not found
== /sys/devices/pci0000:00/0000:00:10.0 ==
modalias : pci:v000010DEd00001B87sv00000000sd00000000bc03sc02i00
vendor : NVIDIA Corporation
model : GP104 [P104-100]
driver : nvidia-driver-535 - distro non-free recommended
driver : nvidia-driver-470 - distro non-free
driver : nvidia-driver-470-server - distro non-free
driver : nvidia-driver-535-server - distro non-free
driver : xserver-xorg-video-nouveau - distro free builtin
Note: the line to watch is driver : nvidia-driver-535 - distro non-free recommended.
Run the command below to install the recommended driver automatically:
ubuntu-drivers autoinstall
To install a specific driver version instead, run:
apt install nvidia-driver-470
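If you want to script the selection instead of reading it off the screen, the recommended package name can be pulled out of the `ubuntu-drivers devices` output. A sketch, with sample output inlined (in practice you would pipe the real command's output in):

```shell
# Extract the recommended driver package from `ubuntu-drivers devices`
# output (sample lines inlined for the sketch)
recommended=$(awk '/recommended/ {print $3}' <<'EOF'
driver : nvidia-driver-535 - distro non-free recommended
driver : nvidia-driver-470 - distro non-free
driver : xserver-xorg-video-nouveau - distro free builtin
EOF
)
echo "$recommended"
# Then install it with: apt install "$recommended"
```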
After installation completes, reboot the server (this step is mandatory):
reboot
After the reboot, verify the driver:

nvidia-smi
ubuntu@AI-LLM-Prod:~$ nvidia-smi
Mon May 6 14:46:22 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA P104-100 Off | 00000000:00:10.0 Off | N/A |
| 72% 36C P8 7W / 180W | 0MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
The NVIDIA Container Toolkit repository is hosted on NVIDIA's GitHub, so installation depends on network access to it; if a step fails, retry it a few times.
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
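The `sed` step above only injects the `signed-by` attribute into each `deb` line of the upstream list file. On a sample line the transformation looks like this (note `$(ARCH)` is a literal placeholder that apt expands, hence the single quotes):

```shell
# Show what the sed substitution does to one line of the upstream
# nvidia-container-toolkit.list (sample line inlined)
echo 'deb https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /' | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g'
```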
apt-get update
apt-get install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker
A successful run produces output like this:
root@AI-LLM-Prod:~# nvidia-ctk runtime configure --runtime=docker
INFO[0000] Loading config from /etc/docker/daemon.json
INFO[0000] Wrote updated config to /etc/docker/daemon.json
INFO[0000] It is recommended that docker daemon be restarted.
root@AI-LLM-Prod:~# cat /etc/docker/daemon.json
{
  "data-root": "/data/docker",
  "exec-opts": [
    "native.cgroupdriver=systemd"
  ],
  "log-level": "info",
  "log-opts": {
    "max-file": "5",
    "max-size": "100m"
  },
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  },
  "storage-driver": "overlay2"
}
systemctl restart docker
Run a CUDA sample container to verify that containers can access the GPU:
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
# Alternatively, use the official CUDA image:
# docker run -it --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
A successful run produces output like this:
root@AI-LLM-Prod:~# docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
49b384cc7b4a: Pull complete
Digest: sha256:3f85b7caad41a95462cf5b787d8a04604c8262cdcdf9a472b8c52ef83375fe15
Status: Downloaded newer image for ubuntu:latest
Mon May 6 15:11:10 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA P104-100 Off | 00000000:00:10.0 Off | N/A |
| 71% 34C P8 6W / 180W | 0MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
A common pitfall: nvidia-smi on the host fails with the following error.

root@AI-LLM-Prod:~# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
The cause: the machine was not rebooted after installing the NVIDIA driver; rebooting resolves the issue.
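A quick way to triage this, assuming the driver packages themselves installed cleanly, is to check whether the kernel module is actually loaded; a missing module usually means the reboot was skipped, or that Secure Boot rejected the unsigned module:

```shell
# Check whether the nvidia kernel module is loaded; if not, the usual
# causes are a skipped reboot or Secure Boot rejecting the module
if lsmod | grep -q '^nvidia'; then
  echo "nvidia kernel module loaded"
else
  echo "nvidia kernel module NOT loaded - reboot, or check Secure Boot / dkms status"
fi
```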
This article demonstrated installing Docker and the NVIDIA Container Toolkit on the newly released Ubuntu 24.04 LTS, followed by basic verification tests, all of which passed.
Because the OS has only just been released, it is not yet on the NVIDIA Container Toolkit's officially supported distribution list; if you run into problems, consider switching to Ubuntu 22.04 LTS.
That's all for today. In the next installment we will cover deploying the model runner Ollama with Docker and running LLMs on the GPU. Stay tuned!
Get the companion hands-on video for this article (note: videos are released separately from the documents, so follow first).
If you enjoyed this article, please share, save, like, and comment! Keep following @运维有术 for more good articles!
You are also welcome to join the 运维有术 knowledge community for more hands-on skills in KubeSphere, Kubernetes, cloud-native operations, automated operations, AI LLMs, and more. I'll be riding shotgun throughout your ops career.
Copyright Notice
Original content: this article is published on the Tencent Cloud Developer Community with the author's authorization; reproduction without permission is prohibited.
For infringement concerns, contact cloudcommunity@tencent.com for removal.