Large Models: A First Look at Hugging Face

Original

程序员架构进阶

Published 2023-07-03 15:29:43

Included in the column: 架构进阶

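1 Background

Installing the Hugging Face environment and troubleshooting it are out of scope for this post and will be covered later. Here we take one model as an example and walk through the whole process, from introducing the model to loading and running it, as a way to get familiar with Hugging Face.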

2 The Model

This post uses the google/pegasus-newsroom model as the example.

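2.1 Introduction

The model is described at https://huggingface.co/docs/transformers/main/model_doc/pegasus. It was proposed in the paper "PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization" by Jingqing Zhang et al. The basic idea is that during pre-training PEGASUS removes or masks the important sentences of the input document and generates them from the remaining sentences, which resembles producing a summary.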
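To make the idea concrete, here is a toy sketch of gap-sentence masking. It is not the PEGASUS implementation: the sentence splitter, the word-overlap scoring (a crude stand-in for the importance scoring used in the paper), the `MASK` string, and every function name are assumptions made for illustration only.

```python
import re

MASK = "<mask_1>"  # placeholder mask string chosen for this sketch


def split_sentences(text):
    """Naively split text into sentences at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def word_set(text):
    """Lower-cased set of word tokens, punctuation ignored."""
    return set(re.findall(r"\w+", text.lower()))


def overlap_score(sentence, others):
    """Fraction of the sentence's words that also occur in the other sentences."""
    words = word_set(sentence)
    if not words:
        return 0.0
    rest = set()
    for other in others:
        rest |= word_set(other)
    return len(words & rest) / len(words)


def gap_sentence_example(text, n_mask=1):
    """Mask the n_mask highest-scoring sentences; return (source, target)."""
    sentences = split_sentences(text)
    scores = [
        overlap_score(s, sentences[:i] + sentences[i + 1:])
        for i, s in enumerate(sentences)
    ]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:n_mask])
    # The source keeps the unmasked sentences; the target is what the model must generate.
    source = " ".join(MASK if i in chosen else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(chosen))
    return source, target


doc = (
    "Dogs are loyal pets. "
    "Dogs are loyal and friendly pets for families. "
    "The weather was cold."
)
source, target = gap_sentence_example(doc)
print(source)
print(target)
```

The pre-training objective is then to generate `target` from `source`, which is why a model trained this way transfers well to summarization.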

2.2 Usage example

https://huggingface.co/google/pegasus-newsroom/tree/main

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("google/pegasus-newsroom")
model = AutoModelForSeq2SeqLM.from_pretrained("google/pegasus-newsroom")

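2.3 Problems encountered

In theory this should run smoothly, but in practice, unsurprisingly, something unexpected happened. Execution failed with an error saying the tokenizer could not be loaded: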

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/onnx/tutorial-env/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 709, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/root/onnx/tutorial-env/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1809, in from_pretrained
    raise EnvironmentError(
OSError: Can't load tokenizer for 'google/pegasus-newsroom'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'google/pegasus-newsroom' is the correct path to a directory containing all relevant files for a PegasusTokenizerFast tokenizer.

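The last line, the OSError, hints at two possible causes:

(1) Make sure there is no local directory with the same name

This is clearly not the case: no such directory was ever created.

(2) Make sure 'google/pegasus-newsroom' is the correct path to a directory containing all the relevant files

The code was copied from the Hugging Face site, so the identifier itself should not be wrong. So where is the problem?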
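Both hints can be checked mechanically. The sketch below is only one way to narrow the problem down, under the assumption that the same OSError can also appear when the files simply cannot be downloaded (which matches the network issue discussed in section 4.1). The helper names `shadowing_directory`, `can_connect`, and `diagnose` are hypothetical and not part of transformers.

```python
import os
import socket


def shadowing_directory(model_id, base_dir="."):
    """Return the path of a local directory named like the Hub id, or None."""
    candidate = os.path.join(base_dir, *model_id.split("/"))
    return candidate if os.path.isdir(candidate) else None


def can_connect(host="huggingface.co", port=443, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, DNS failures and refused connections
        return False


def diagnose(model_id, base_dir=".", host="huggingface.co"):
    """Give a rough, human-readable guess at why loading by Hub id failed."""
    shadow = shadowing_directory(model_id, base_dir)
    if shadow is not None:
        return f"local directory {shadow!r} shadows the Hub id"
    if not can_connect(host):
        return f"cannot reach {host}; the files were probably never downloaded"
    return "no shadowing directory and the host is reachable; check the model id and cache"


if __name__ == "__main__":
    print(diagnose("google/pegasus-newsroom"))
```

A plain TCP check does not prove that downloads will succeed, but a failure here is a strong sign that the error message is pointing in the wrong direction.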

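3 Troubleshooting

3.1 Pulling the model files over SSH

After some searching and a look at the model page on the Hugging Face site, I found that the model files can be pulled with git. Running that command produced an error, however, so I switched to the SSH method. That attempt, shown in the output below, failed with a permission error. Fortunately the publickey hint suggests that setting up access authorization should be enough.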

git clone git@hf.co:google/pegasus-newsroom
Cloning into 'pegasus-newsroom'...
Warning: Permanently added the ECDSA host key for IP address '3.210.66.237' to the list of known hosts.
Permission denied (publickey).
fatal: Could not read from remote repository.


Please make sure you have the correct access rights
and the repository exists.

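3.2 Generating and adding an SSH key

The procedure is described at https://huggingface.co/docs/hub/security-git-ssh (although in practice it has a few pitfalls of its own). In short:

1. Check whether an SSH key already exists. On Linux the default location is the ~/.ssh directory, where the public key would be one of the files listed below. We had never generated one, so none was there. If one does exist, it can simply be reused; generating over it also works, but it replaces the old key wherever that key was in use.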

id_rsa.pub
id_ecdsa.pub
id_ed25519.pub

2. If there is none, generate one with the ssh-keygen command. The quoted string is the email address you registered with on Hugging Face:

ssh-keygen -t ed25519 -C "your.email@example.co"

3. Once the key is generated, add it to your SSH agent with the ssh-add command:

ssh-add ~/.ssh/id_ed25519

Step 3 may fail. On my machine, for example, it printed the following error:

Could not open a connection to your authentication agent.

The key cannot be added in this state. In that case, first run ssh-agent bash to start a shell with an agent attached, then run ssh-add again.
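The two preconditions in the steps above, an existing public key and a running agent, can be checked before touching ssh-keygen or ssh-add. This is a minimal sketch with hypothetical helper names. It assumes the default key file names listed in step 1 and relies on the fact that ssh-add locates the agent through the `SSH_AUTH_SOCK` environment variable.

```python
import os

# Default public key file names, as listed in step 1.
KEY_NAMES = ("id_rsa.pub", "id_ecdsa.pub", "id_ed25519.pub")


def find_public_keys(ssh_dir=None):
    """Return the default public key files that exist in ssh_dir (~/.ssh by default)."""
    ssh_dir = ssh_dir or os.path.expanduser("~/.ssh")
    return [name for name in KEY_NAMES if os.path.isfile(os.path.join(ssh_dir, name))]


def agent_socket_available(environ=None):
    """Return True if SSH_AUTH_SOCK is set and points at an existing path."""
    environ = os.environ if environ is None else environ
    sock = environ.get("SSH_AUTH_SOCK")
    return bool(sock) and os.path.exists(sock)


if __name__ == "__main__":
    keys = find_public_keys()
    print("public keys found:", keys or "none, run ssh-keygen first")
    if not agent_socket_available():
        print("no agent socket; start one (for example with 'ssh-agent bash') before ssh-add")
```

When `agent_socket_available()` returns False, ssh-add is likely to fail with the "Could not open a connection to your authentication agent" message shown above.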

Now the model files can be pulled:

git clone git@hf.co:google/pegasus-newsroom
Cloning into 'pegasus-newsroom'...
remote: Enumerating objects: 33, done.
remote: Total 33 (delta 0), reused 0 (delta 0), pack-reused 33
Receiving objects: 100% (33/33), 931.72 KiB | 604.00 KiB/s, done.
Resolving deltas: 100% (12/12), done.
Downloading pytorch_model.bin (2.3 GB)

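The download succeeded.

Compared with the Hugging Face documentation, however, one thing is still off. According to the docs, once ssh-add has added id_ed25519, running ssh -T git@hf.co in the terminal should print a message containing your username. As described above, the key was added and the model files could be pulled, yet the command still prints only "Hi anonymous, welcome to Hugging Face.", which the docs treat as a failed setup. One thing worth checking is whether the public key was also added to the account's SSH key settings on the Hugging Face site, a step this walkthrough does not cover. The issue is unresolved for now and left for later investigation.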

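4 Running the Model Again

4.1 Network issues

Coming back to the original goal, we continue trying to run the model. Note that the snippet below loads google/pegasus-large rather than google/pegasus-newsroom. Run the following in order: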

from transformers import AutoTokenizer, PegasusModel


tokenizer = AutoTokenizer.from_pretrained("google/pegasus-large")
model = PegasusModel.from_pretrained("google/pegasus-large")


inputs = tokenizer("Studies have been shown that owning a dog is good for you", return_tensors="pt")
decoder_inputs = tokenizer("Studies show that", return_tensors="pt")
outputs = model(input_ids=inputs.input_ids, decoder_input_ids=decoder_inputs.input_ids)


last_hidden_states = outputs.last_hidden_state
list(last_hidden_states.shape)

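It runs successfully.

Running it repeatedly reveals another problem, though: the line model = PegasusModel.from_pretrained("google/pegasus-large") still fails with a connection error, and fairly often, so a fix is still needed. A little analysis shows that this is the well-known network environment issue in mainland China, and being able to reach sites hosted abroad would resolve it. Many readers will not have such access, or the risk is too high, so a more legitimate approach is needed.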
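When the failures are intermittent rather than permanent, a small retry wrapper is a cheap stopgap before moving to offline mode. The sketch below is a generic helper and not part of transformers. `load_with_retry` is a hypothetical name, and it assumes that the connection failure surfaces as an `OSError`, as in the traceback in section 2.3.

```python
import time


def load_with_retry(loader, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call loader() up to `attempts` times, backing off exponentially on OSError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return loader()
        except OSError as err:
            last_error = err
            if attempt < attempts - 1:
                # Wait 1x, 2x, 4x ... base_delay before the next try.
                sleep(base_delay * (2 ** attempt))
    raise last_error


# Usage sketch (requires transformers and network access):
# from transformers import PegasusModel
# model = load_with_retry(lambda: PegasusModel.from_pretrained("google/pegasus-large"))
```

Retrying only helps with flaky connections. If the host is unreachable most of the time, loading from a local copy is the more reliable route.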

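4.2 Offline mode

The official site and most other material that turns up in a search recommend offline mode: download the model via git or by hand, upload it to a designated directory on the server, and change the script to load it from the local path.

Since SSH is already configured and git clone can pull the model files, we load the already downloaded model directly. The script is as follows: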

>>> from transformers import AutoTokenizer, AutoModel
>>> # Load the tokenizer and the model from the local clone instead of the Hub
>>> tokenizer = AutoTokenizer.from_pretrained("/root/onnx/model/huggingface/pegasus-newsroom")
>>> model = AutoModel.from_pretrained("/root/onnx/model/huggingface/pegasus-newsroom")


Some weights of the model checkpoint at /root/onnx/model/huggingface/pegasus-newsroom were not used when initializing PegasusModel: ['final_logits_bias']
- This IS expected if you are initializing PegasusModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing PegasusModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of PegasusModel were not initialized from the model checkpoint at /root/onnx/model/huggingface/pegasus-newsroom and are newly initialized: ['model.encoder.embed_positions.weight', 'model.decoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
>>> 
>>> model = model.eval()

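At this point the whole flow runs end to end. The warning about final_logits_bias appears because AutoModel resolves to the bare PegasusModel, which has no language-modeling head. For actual summarization, AutoModelForSeq2SeqLM from section 2.2 is the class to load.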
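The offline workflow above can be wrapped in a small helper that fails early when the local directory is incomplete and never contacts the Hub. This is a sketch under assumptions: `missing_files` and `load_offline` are hypothetical names, and checking only for `config.json` is a deliberately minimal sanity check. `local_files_only=True` and the `HF_HUB_OFFLINE` / `TRANSFORMERS_OFFLINE` environment variables are existing switches, but the variables only take effect if they are set before transformers is first imported.

```python
import os


def missing_files(model_dir, required=("config.json",)):
    """Return the required file names that are absent from model_dir."""
    return [name for name in required if not os.path.isfile(os.path.join(model_dir, name))]


def load_offline(model_dir):
    """Load tokenizer and seq2seq model from a local directory without network access."""
    missing = missing_files(model_dir)
    if missing:
        raise FileNotFoundError(f"{model_dir} is missing: {missing}")
    # Ask the libraries not to contact the Hub; effective only before the first import.
    os.environ["HF_HUB_OFFLINE"] = "1"
    os.environ["TRANSFORMERS_OFFLINE"] = "1"
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir, local_files_only=True)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir, local_files_only=True)
    return tokenizer, model.eval()


# Usage sketch, with the path used in this post:
# tokenizer, model = load_offline("/root/onnx/model/huggingface/pegasus-newsroom")
```

Loading through AutoModelForSeq2SeqLM keeps the generation head, so the returned model can be used with `generate` for summarization.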

5 Next Steps

Next, we will verify converting a Hugging Face model to ONNX, then loading the ONNX model and serving it.

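Original-content notice: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

In case of infringement, contact cloudcommunity@tencent.com to have it removed.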

huggingface-transformers
