BERT, or Bidirectional Encoder Representations from Transformers
BERT is Google's recently proposed NLP pre-training method: a general-purpose "language understanding" model is pre-trained on a large text corpus (such as Wikipedia) and then applied to the downstream NLP tasks we care about (such as classification or reading comprehension). BERT outperforms previous methods because it is the first **unsupervised, deeply bidirectional** system for pre-training NLP.
Put simply, it crushes earlier approaches such as Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFit: BERT-based models reached state-of-the-art results on multiple language tasks (SQuAD, MultiNLI, and MRPC).
Masked LM: 15% of the tokens in each sentence are randomly selected and masked out, and the model is trained to predict them, for example:
Input: the man went to the [MASK1] . he bought a [MASK2] of milk.
Labels: [MASK1] = store; [MASK2] = gallon
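A minimal sketch of this masking step, assuming whitespace-tokenized text (the real pipeline works on WordPiece tokens and also replaces some selected tokens with random words or leaves them unchanged; the helper below is only my own illustration):

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=random.Random(0)):
    """Randomly hide ~15% of tokens and keep the originals as prediction labels."""
    masked = list(tokens)
    labels = {}  # position -> original token the model must recover
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok
            masked[i] = MASK_TOKEN
    return masked, labels

tokens = "the man went to the store . he bought a gallon of milk .".split()
masked, labels = mask_tokens(tokens)
print(" ".join(masked))  # sentence with some tokens replaced by [MASK]
print(labels)            # {position: original token}
```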
To learn relationships between sentences, two sentences are drawn from the corpus, and with 50% probability the second sentence is the actual next sentence of the first:
Sentence A: the man went to the store .
Sentence B: he bought a gallon of milk .
Label: IsNextSentence
Sentence A: the man went to the store .
Sentence B: penguins are flightless .
Label: NotNextSentence
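A sketch of how such sentence pairs could be built; the function name and the 50/50 sampling below are my own illustration, not the official data-generation code:

```python
import random

def make_nsp_example(sentences, idx, rng=random.Random(0)):
    """Pair sentence idx with either its true successor or a random other sentence (50/50)."""
    sent_a = sentences[idx]
    if rng.random() < 0.5 and idx + 1 < len(sentences):
        sent_b = sentences[idx + 1]   # the actual next sentence
        label = "IsNextSentence"
    else:
        # pick any sentence that is not the true successor
        candidates = [s for j, s in enumerate(sentences) if j != idx + 1]
        sent_b = rng.choice(candidates)
        label = "NotNextSentence"
    return sent_a, sent_b, label

corpus = ["the man went to the store .",
          "he bought a gallon of milk .",
          "penguins are flightless ."]
print(make_nsp_example(corpus, 0))
```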
Finally, the processed sentence pairs are fed into a large Transformer model, and training is completed by jointly optimizing two loss functions, one for each of the objectives above.
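Schematically, the total pre-training loss is just the sum of the two objectives. A toy version, assuming the model returns log-probabilities (this is only a sketch, not the actual TensorFlow implementation):

```python
def pretraining_loss(mlm_log_probs, mlm_labels, nsp_log_probs, nsp_label):
    """Total loss = masked-LM cross-entropy (averaged over masked positions)
    plus next-sentence cross-entropy."""
    mlm_loss = -sum(mlm_log_probs[pos][tok] for pos, tok in mlm_labels.items()) / max(len(mlm_labels), 1)
    nsp_loss = -nsp_log_probs[nsp_label]  # nsp_label is "IsNextSentence" or "NotNextSentence"
    return mlm_loss + nsp_loss
```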
The heavy lifting is done by the Transformer model; its architecture and design ideas deserve a separate analysis later.
The released pre-trained models are:

- **BERT-Base, Uncased**: 12-layer, 768-hidden, 12-heads, 110M parameters
- **BERT-Large, Uncased**: 24-layer, 1024-hidden, 16-heads, 340M parameters
- **BERT-Base, Cased**: 12-layer, 768-hidden, 12-heads, 110M parameters
- **BERT-Large, Cased**: 24-layer, 1024-hidden, 16-heads, 340M parameters (Not available yet. Needs to be re-generated.)
- **BERT-Base, Multilingual**: 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
- **BERT-Base, Chinese**: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

Each download contains:

- `bert_model.ckpt`: the pre-trained weights (which is actually 3 files).
- `vocab.txt`: maps WordPiece tokens to word ids.
- `bert_config.json`: specifies the hyperparameters of the model.

For other languages, see the Multilingual README. A Chinese model has also been released. (Even under tight compute constraints they trained a separate Chinese version; the influence of Chinese is plain to see, and we still have work to do.)
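A quick way to inspect a downloaded checkpoint is to read `bert_config.json` and `vocab.txt` directly. The directory name below assumes the unzipped BERT-Base, Uncased download; adjust the path to whichever model you fetched:

```python
import json

model_dir = "uncased_L-12_H-768_A-12"  # assumed path of the unzipped BERT-Base, Uncased download

# bert_config.json: hyperparameters of the model
with open(f"{model_dir}/bert_config.json") as f:
    config = json.load(f)
print(config["num_hidden_layers"], config["hidden_size"], config["num_attention_heads"])

# vocab.txt: one WordPiece token per line; the line number is the token id
with open(f"{model_dir}/vocab.txt", encoding="utf-8") as f:
    vocab = {token.rstrip("\n"): idx for idx, token in enumerate(f)}
print(len(vocab), vocab["[MASK]"])
```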
For more details, see: https://github.com/google-research/bert