... | ... | @@ -3,20 +3,21 @@ |
```python
BertTagger()
```

### Parameters

#### Optional

| Parameter | Default | Type | Description |
|-----------|---------|------|-------------|
| allow_standard_output | True | bool | Display info/progress messages in standard output. |
| autoadjust_batch_size | True | bool | If enabled, batch size is automatically adjusted based on the training parameter `max_length` and available memory. |
| min_available_memory | 500 (MB) | int | Minimum available GPU memory. If free GPU memory < `min_available_memory`, CPU is used instead of GPU. NB! The parameter has an effect only if a GPU is available. |
| sklearn_avg_function | "macro" | string | Averaging function used when calculating sklearn metrics like precision, recall and f1-score. Allowed options = \["binary", "macro", "micro", "weighted"\]. |
| use_gpu | True | bool | If enabled, uses GPU. |
| save_pretrained | True | bool | If enabled, saves pretrained models to the local storage specified with the parameter `pretrained_models_dir`. |
| pretrained_models_dir | "" | string | Path to the location where the pretrained models are (or will be) saved. |
| logger | None | logging.Logger | Info logger for logging progress and info messages. |
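
The interaction between `use_gpu` and `min_available_memory` can be read as a simple decision rule. The sketch below only illustrates that rule as the table describes it; it is not the library's implementation. The function `choose_device` and its arguments `gpu_available` and `free_gpu_memory_mb` are hypothetical names introduced here.

```python
def choose_device(
    use_gpu: bool = True,
    gpu_available: bool = False,
    free_gpu_memory_mb: int = 0,
    min_available_memory: int = 500,
) -> str:
    """Return "gpu" or "cpu" following the rule described in the table."""
    # Without a GPU (or with use_gpu disabled) the memory threshold is irrelevant.
    if not use_gpu or not gpu_available:
        return "cpu"
    # Fall back to CPU when free GPU memory is below the threshold.
    if free_gpu_memory_mb < min_available_memory:
        return "cpu"
    return "gpu"


print(choose_device(gpu_available=True, free_gpu_memory_mb=2048))
print(choose_device(gpu_available=True, free_gpu_memory_mb=300))
```

In this reading, the threshold is only consulted once a GPU has been found, which matches the note that the parameter has no effect otherwise.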

### Example
... | ... | @@ -28,33 +29,32 @@ bert_tagger = BertTagger() |
```

## Training a model

```python
BertTagger().train(data_sample, **kwargs)
```

### Parameters

#### Required

| Parameter | Type | Description |
|-----------|------|-------------|
| data_sample | Dict\[str, List\[str\]\] | Training data as a dict, where keys = labels and values = lists of examples corresponding to each label. |
| pos_label | string | Class treated as positive when calculating sklearn metrics. NB! Specify it when the input data has _2 classes_; with 3 or more classes it has no effect and may be left unspecified. Defaults to "". |
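
To make the expected shape of `data_sample` concrete, here is a small sketch of a two-class sample together with a sanity check. Most of the example texts and the helper `validate_data_sample` are illustrative assumptions and not part of the library.

```python
from typing import Dict, List


def validate_data_sample(data_sample: Dict[str, List[str]]) -> int:
    """Check the label -> examples structure and return the number of classes."""
    for label, examples in data_sample.items():
        if not isinstance(label, str):
            raise TypeError(f"Label {label!r} is not a string.")
        if not isinstance(examples, list) or not all(isinstance(e, str) for e in examples):
            raise TypeError(f"Examples for label {label!r} must be a list of strings.")
        if not examples:
            raise ValueError(f"Label {label!r} has no examples.")
    return len(data_sample)


# Hypothetical two-class sample: keys are labels, values are lists of example texts.
data_sample = {
    "OFFENSIVE": ["I hope you die!", "You are an idiot."],
    "OK": ["Have a nice day.", "Thanks for the help."],
}

n_classes = validate_data_sample(data_sample)
# With exactly 2 classes, pos_label should name one of the labels.
pos_label = "OFFENSIVE" if n_classes == 2 else ""
print(n_classes, pos_label)
```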

#### Optional

| Parameter | Default | Type | Description |
|-----------|---------|------|-------------|
| batch_size | 32 | int | Size of one batch in data sampler. |
| bert_model | "bert-base-multilingual-cased" | str | Pre-trained BERT model to use. [List of all available models](https://huggingface.co/transformers/pretrained_models.html). |
| eps | 1e-8 | float | TODO. |
| lr | 2e-5 | float | Learning rate. |
| max_length | 32 | int | Maximum number of tokens used from each training example. Each example is truncated/padded accordingly. |
| n_epochs | 2 | int | Number of epochs to train. NB! 2 is usually sufficient, as higher values tend to overfit the training data. |
| seed_val | 42 | int | Random seed value. |
| split_ratio | 0.8 | float | Ratio of the input data used for training. The rest is used for validation. |
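
The effect of `split_ratio`, `seed_val` and `max_length` can be illustrated with plain Python. This is a sketch under stated assumptions, not the library's code: the helpers `split_examples` and `pad_or_truncate`, the padding id `0`, and the choice to shuffle before splitting are all hypothetical.

```python
import random
from typing import List, Tuple


def split_examples(
    examples: List[str], split_ratio: float = 0.8, seed_val: int = 42
) -> Tuple[List[str], List[str]]:
    """Shuffle with a fixed seed and split into (training, validation) parts."""
    shuffled = examples[:]
    random.Random(seed_val).shuffle(shuffled)
    n_train = int(len(shuffled) * split_ratio)
    return shuffled[:n_train], shuffled[n_train:]


def pad_or_truncate(token_ids: List[int], max_length: int = 32, pad_id: int = 0) -> List[int]:
    """Cut the sequence to max_length tokens or right-pad it with pad_id."""
    return token_ids[:max_length] + [pad_id] * max(0, max_length - len(token_ids))


examples = [f"example {i}" for i in range(10)]
train_part, validation_part = split_examples(examples)
print(len(train_part), len(validation_part))  # 8 2

print(len(pad_or_truncate(list(range(50)))))  # 32
print(pad_or_truncate([5, 6, 7], max_length=5))  # [5, 6, 7, 0, 0]
```

With the defaults, 10 examples yield 8 for training and 2 for validation, and every tokenized example ends up exactly `max_length` tokens long.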

### Example
... | ... | @@ -77,22 +77,25 @@ report = bert_tagger.train(data_sample, pos_label = "OFFENSIVE", bert_model = "b |
# contains evaluation scores and some other information for the last training epoch. For more information, see chapter Training Reports.
```

## Tagging text

```python
BertTagger().tag_text(text)
```

### Parameters

#### Required

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| text | str | - | Text to tag. |
| highest_only | bool | True | Whether to include only the class with the highest probability in the output (output type = dict) or all the classes with their probabilities (output type = List\[dict\]). |
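
Because the return type depends on `highest_only`, downstream code may want to handle both shapes uniformly. The following is a minimal sketch that assumes the output shapes shown in the examples of this document; the helper names `as_prediction_list` and `top_prediction` are hypothetical.

```python
from typing import Dict, List, Union

Prediction = Dict[str, Union[str, float]]


def as_prediction_list(result: Union[Prediction, List[Prediction]]) -> List[Prediction]:
    """Wrap a single prediction dict in a list; return lists as a new list."""
    return [result] if isinstance(result, dict) else list(result)


def top_prediction(result: Union[Prediction, List[Prediction]]) -> Prediction:
    """Return the entry with the highest probability."""
    return max(as_prediction_list(result), key=lambda p: p["probability"])


# Output shapes copied from the examples in this document.
single = {"prediction": "OFFENSIVE", "probability": 0.75200404}
multiple = [
    {"prediction": "OFFENSIVE", "probability": 0.75200404},
    {"prediction": "OK", "probability": 0.2479956},
]

print(top_prediction(single)["prediction"])
print(top_prediction(multiple)["prediction"])
```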

### Example

#### Train (or load) a model

```python
from texta_bert_tagger.tagger import BertTagger
... | ... | @@ -109,9 +112,12 @@ data_sample = { |
}

bert_tagger.train(data_sample, pos_label="OFFENSIVE", bert_model="bert-base-uncased")
```

#### Tag text: output only the most probable class

```python
# Tag text with the retrieved model.
prediction = bert_tagger.tag_text("I hope you die!", highest_only=True)
```

#### Output
... | ... | @@ -120,17 +126,34 @@ prediction = bert_tagger.tag("I hope you die!") |
{"prediction": "OFFENSIVE", "probability": 0.75200404}
```

#### Tag text: output all the classes with their probabilities

```python
prediction = bert_tagger.tag_text("I hope you die!", highest_only=False)
```

#### Output

```python
[
    {"prediction": "OFFENSIVE", "probability": 0.75200404},
    {"prediction": "OK", "probability": 0.2479956}
]
```

## Tagging document

```python
BertTagger().tag_doc(doc)
```

### Parameters

#### Required

| Parameter | Type | Description |
|-----------|------|-------------|
| doc | dict | JSON document to tag. |

### Example
... | ... | @@ -168,8 +191,8 @@ print(prediction) |
{"prediction": "OFFENSIVE", "probability": 0.75200404}
```

## Saving a model

```python
BertTagger().save(path)
```
... | ... | @@ -178,9 +201,9 @@ BertTagger().save(path) |

#### Required

| Parameter | Type | Description |
|-----------|------|-------------|
| path | str | Full path to model file. |

### Example
... | ... | @@ -202,8 +225,8 @@ bert_tagger.train(data_sample, pos_label="OFFENSIVE", bert_model = "bert-base-un |
bert_tagger.save("/home/bert_models/en_offensive")
```

## Loading a model

```python
BertTagger().load(path)
```
... | ... | @@ -212,10 +235,9 @@ BertTagger().load(path) |

#### Required

| Parameter | Type | Description |
|-----------|------|-------------|
| path | str | Full path to model file. |

### Example
... | ... | @@ -287,20 +309,21 @@ reports = [r.to_dict() for r in reports] |
```python
BertTagger.download_pretrained_models(bert_models, save_dir, logger)
```

### Parameters

#### Required

| Parameter | Type | Description |
|-----------|------|-------------|
| bert_models | List\[str\] | List of BERT model identifiers available in [HuggingFace](https://huggingface.co/models). |
| save_dir | string | Directory where the pretrained models should be saved. |

#### Optional

| Parameter | Default | Type | Description |
|-----------|---------|------|-------------|
| logger | None | logging.Logger | Info logger for logging progress and info messages. |
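
Since `logger` expects a `logging.Logger`, the sketch below shows one way to build an info-level logger with the standard library and pass it to the call documented above. The helpers `build_info_logger` and `download_models`, the logger name, and the directory in the commented-out call are illustrative assumptions; the download itself is left commented out so that the snippet runs without the package or network access.

```python
import logging
from typing import List, Optional


def build_info_logger(name: str = "bert_tagger") -> logging.Logger:
    """Create an INFO-level logger that writes to the console."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


def download_models(
    bert_models: List[str], save_dir: str, logger: Optional[logging.Logger] = None
) -> None:
    """Thin wrapper around the documented download_pretrained_models call."""
    # Imported lazily so this module can be loaded without the package installed.
    from texta_bert_tagger.tagger import BertTagger

    BertTagger.download_pretrained_models(bert_models, save_dir, logger)


info_logger = build_info_logger()
info_logger.info("Logger ready.")
# Hypothetical arguments; uncomment to actually download:
# download_models(["bert-base-uncased"], "/home/bert_models", logger=info_logger)
```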

### Example
... | ... | |