Masked Language Modeling with Code Explanation (MLM and Related Code)
Knowledge graph (topics covered in this post):
- BERT: concept and theory
- BERT applications
- MLM: introduction and how to use it
- NLP: next sentence prediction (NSP), Masked Language Modeling (MLM)
- The process of using BERT + MLM
- To discuss and test
About BERT and MLM:
Here I would like to introduce Masked Language Modeling (MLM). Before the introduction, there are some basic ideas you need to know about BERT and MLM:
- BERT is easy to use for general-purpose tasks;
- BERT with MLM can be adapted to specific areas and domains.
The idea of BERT + MLM:
- Before the data is fed into BERT for training, MLM masks part of it; BERT is then trained to fill in the missing part of the text.
- The masked portion is chosen at random, at a fixed proportion of the tokens.
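As a toy illustration of this idea (plain Python, no BERT involved; the 15% rate and the "[MASK]" placeholder mirror BERT's convention), random masking might look like:

```python
import random

tokens = ("after abraham lincoln won the november 1860 "
          "presidential election").split()

# replace each token with [MASK] with probability 0.15 (BERT's masking rate)
masked = [tok if random.random() >= 0.15 else "[MASK]" for tok in tokens]
print(masked)
```

In real MLM the masking happens on token ids, not strings, as the walkthrough below shows.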
The process of using BERT + MLM
The whole process:
1. Tokenize the text. After this, we get three tensors:
   - input_ids – this is what will be used as input to BERT
   - token_type_ids – not necessary for MLM
   - attention_mask
2. Create the label tensor: labels – the tensor we calculate loss against and optimize towards; it is simply a copy of input_ids.
3. Mask the dataset with MLM: randomly mask some tokens in input_ids – 15% of tokens are masked in the pre-training process, and only this tensor is modified.
4. Calculate the loss, which is used to optimize the model: feed input_ids and labels into BERT and let it do the calculation.
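The labeling and masking steps can be sketched with plain torch before the real code below (the token ids here are illustrative, not real BERT vocabulary entries):

```python
import torch

torch.manual_seed(42)
# illustrative input_ids with [CLS]=101 at the start and [SEP]=102 at the end
input_ids = torch.tensor([[101, 2044, 8181, 5367, 2180, 1996, 2281, 102]])

# labels: a copy of input_ids taken BEFORE masking
labels = input_ids.detach().clone()

# pick ~15% of positions at random, never the special tokens
rand = torch.rand(input_ids.shape)
mask_arr = (rand < 0.15) * (input_ids != 101) * (input_ids != 102)

# apply the [MASK] token id (103 in bert-base-uncased) at the chosen positions
input_ids[mask_arr] = 103
```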
from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

text = ("After Abraham Lincoln won the November 1860 presidential "
        "election on an anti-slavery platform, an initial seven "
        "slave states declared their secession from the country "
        "to form the Confederacy. War broke out in April 1861 "
        "when secessionist forces attacked Fort Sumter in South "
        "Carolina, just over a month after Lincoln's "
        "inauguration.")

# tokenize the input text
inputs = tokenizer(text, return_tensors='pt')
# get three tensors:
inputs.keys()
After running the previous code, you will get this:
dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
inputs
{'input_ids': tensor([[ 101, 2044, 8181, 5367, 2180, 1996, 2281, 7313, 4883, 2602,
, , 3424, 1011, 8864, 4132, 1010, , 3988, 2698,
6658, 2163, 4161, 2037, 22965, , 1996, 2406, 2000, 2433,
1996, 18179, 1012, 2162, 3631, 2041, 1999, 2258, 6863, 2043,
22965, 2923, 2749, 4457, 3481, 7680, 3334, 1999, 2148, 3792,
1010, 2074, 2058, 1037, 3204, 2044, 5367, 1005, 1055, 17331,
1012, 102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
# create labels
inputs['labels'] = inputs.input_ids.detach().clone()
inputs
{'input_ids': tensor([[ 101, 2044, 8181, 5367, 2180, 1996, 2281, 7313, 4883, 2602,
, , 3424, 1011, 8864, 4132, 1010, , 3988, 2698,
6658, 2163, 4161, 2037, 22965, , 1996, 2406, 2000, 2433,
1996, 18179, 1012, 2162, 3631, 2041, 1999, 2258, 6863, 2043,
22965, 2923, 2749, 4457, 3481, 7680, 3334, 1999, 2148, 3792,
1010, 2074, 2058, 1037, 3204, 2044, 5367, 1005, 1055, 17331,
1012, 102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[ 101, 2044, 8181, 5367, 2180, 1996, 2281, 7313, 4883, 2602,
, , 3424, 1011, 8864, 4132, 1010, , 3988, 2698,
6658, 2163, 4161, 2037, 22965, , 1996, 2406, 2000, 2433,
1996, 18179, 1012, 2162, 3631, 2041, 1999, 2258, 6863, 2043,
22965, 2923, 2749, 4457, 3481, 7680, 3334, 1999, 2148, 3792,
1010, 2074, 2058, 1037, 3204, 2044, 5367, 1005, 1055, 17331,
1012, 102]])}
# create a random array of floats with the same dimensions as input_ids
rand = torch.rand(inputs.input_ids.shape)
# where the random value is less than 0.15, mark the position for masking
mask_arr = rand < 0.15
mask_arr
tensor([[False, False, False, False, True, False, False, False, False, False,
False, False, False, False, True, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, True, False, False, True,
False, True, False, True, False, False, False, False, True, False,
False, False, False, True, True, False, False, False, False, False,
False, False]])
# if "" are not considered into tokens(inputs.input_ids != 101) * (inputs.input_ids != 102)mask_arr = (rand < 0.15) * (inputs.input_ids != 101) * (inputs.input_ids != 102)mask_arr
tensor([[False, False, False, False, True, False, False, False, False, False,
False, False, False, False, True, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, True, False, False, True,
False, True, False, True, False, False, False, False, True, False,
False, False, False, True, True, False, False, False, False, False,
False, False]])
# select the masked positions from mask_arr
selection = torch.flatten((mask_arr[0]).nonzero()).tolist()
selection
[4, 14, 36, 39, 41, 43, 48, 53, 54]
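Between this selection and the next inspection of inputs, the walkthrough applies the mask: the selected positions in input_ids are set to the [MASK] token id, which is 103 in bert-base-uncased (i.e. `inputs.input_ids[0, selection] = 103`). A self-contained sketch of that step, with illustrative token ids:

```python
import torch

# stand-in for inputs.input_ids from the walkthrough (values illustrative)
input_ids = torch.tensor([[101, 2044, 8181, 5367, 2180, 102]])
selection = [4]  # positions picked from mask_arr, as above

# set the selected positions to the [MASK] token id (103 in bert-base-uncased)
input_ids[0, selection] = 103
```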
inputs
{'input_ids': tensor([[ 101, 2044, 8181, 5367, 103, 1996, 2281, 7313, 4883, 2602,
, , 3424, 1011, 103, 4132, 1010, , 3988, 2698,
6658, 2163, 4161, 2037, 22965, , 1996, 2406, 2000, 2433,
1996, 18179, 1012, 2162, 3631, 2041, 103, 2258, 6863, 103,
22965, 103, 2749, 103, 3481, 7680, 3334, 1999, 103, 3792,
1010, 2074, 2058, 103, 103, 2044, 5367, 1005, 1055, 17331,
1012, 102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[ 101, 2044, 8181, 5367, 2180, 1996, 2281, 7313, 4883, 2602,
, , 3424, 1011, 8864, 4132, 1010, , 3988, 2698,
6658, 2163, 4161, 2037, 22965, , 1996, 2406, 2000, 2433,
1996, 18179, 1012, 2162, 3631, 2041, 1999, 2258, 6863, 2043,
22965, 2923, 2749, 4457, 3481, 7680, 3334, 1999, 2148, 3792,
1010, 2074, 2058, 1037, 3204, 2044, 5367, 1005, 1055, 17331,
1012, 102]])}
outputs = model(**inputs)
outputs.loss
tensor(0.7134, grad_fn=)
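The walkthrough stops at computing the loss. If you wanted to actually optimize the model with it, a standard PyTorch training step would follow; sketched here with a stand-in linear model so it runs without downloading BERT (in the article the model is BertForMaskedLM, which already returns `outputs.loss`):

```python
import torch
from torch.optim import AdamW

# stand-in model: any module producing a scalar loss works the same way
model = torch.nn.Linear(4, 2)
optimizer = AdamW(model.parameters(), lr=5e-5)

x = torch.randn(8, 4)
y = torch.randint(0, 2, (8,))

# one optimization step: forward pass, loss, backward pass, parameter update
logits = model(x)
loss = torch.nn.functional.cross_entropy(logits, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

With BertForMaskedLM, `loss` would simply be `model(**inputs).loss`, and the rest of the step is identical.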
To discuss and test
tokenizer(text, return_tensors='pt')
torch.rand(inputs.input_ids.shape)

May your choices shine with hope rather than anxiety! Enjoy!