Add an option for generating a custom vocabulary based on fine-tuning data
Pre-training a model from scratch with a custom vocabulary could help resolve entity ambiguity and, more importantly, boost entity tagging performance. BERT’s default vocabulary is rich in full words and subwords for detecting generic entity types such as person, location, and organization (Figures 4a and b). However, it is deficient at capturing full and partial terms from the biomedical domain. For instance, the tokenization of drug names like imatinib, nilotinib, and dasatinib does not reflect their common suffix “tinib”: imatinib is tokenized into i ##mat ##ini ##b, whereas dasatinib is tokenized into das ##ati ##ni ##b. If we build our own vocabulary with sentencepiece on a biomedical corpus, we instead get im ##a ##tinib and d ##as ##a ##tinib, capturing the shared suffix. The custom vocabulary also contains full biomedical words that characterize the domain better; for instance, words like congenital, carcinoma, carcinogen, and cardiologist are absent from the default BERT pre-trained models.
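To see why the vocabulary drives these splits, here is a minimal sketch of greedy longest-match-first subword tokenization (the WordPiece scheme BERT uses). The two vocabularies below are tiny, hand-picked illustrations, not the real 30k-entry BERT vocabulary or an actual sentencepiece model; they are chosen only so that the toy “general” vocabulary fragments the drug names while the toy “biomedical” vocabulary keeps the ##tinib suffix whole.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword tokenization (WordPiece-style).

    Repeatedly takes the longest prefix of the remaining characters that is
    in the vocabulary; non-initial pieces carry the '##' continuation marker.
    """
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:  # no piece covers this character
            return ["[UNK]"]
        tokens.append(match)
        start = end
    return tokens

# Toy general-domain vocabulary: lacks a 'tinib' subword, so drug names fragment.
general_vocab = {"i", "##mat", "##ini", "##b", "das", "##ati", "##ni"}
# Toy biomedical vocabulary: shares the '##tinib' suffix across drug names.
biomed_vocab = {"im", "##a", "##tinib", "d", "##as"}

print(wordpiece_tokenize("imatinib", general_vocab))   # ['i', '##mat', '##ini', '##b']
print(wordpiece_tokenize("imatinib", biomed_vocab))    # ['im', '##a', '##tinib']
print(wordpiece_tokenize("dasatinib", biomed_vocab))   # ['d', '##as', '##a', '##tinib']
```

Because the suffix ##tinib is a single vocabulary entry in the biomedical case, every kinase-inhibitor name ending in “tinib” shares that token, giving the model a consistent signal for the drug entity type instead of unrelated fragments.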