|
.. _token_classification: |
|
|
|
Token Classification (Named Entity Recognition) Model |
|
===================================================== |
|
|
|
The Token Classification model supports named entity recognition (NER) and other token-level classification tasks, as long as the data
follows the format specified below.
|
|
|
We use the NER task throughout this section. NER, also referred to as entity chunking, identification, or extraction, is the
task of detecting and classifying key information (entities) in text. In other words, a NER model takes a piece of text as input and,
for each word in the text, identifies the category the word belongs to. For example, in the sentence ``Mary lives in Santa Clara
and works at NVIDIA``, the model should detect that ``Mary`` is a person, ``Santa Clara`` is a location, and ``NVIDIA`` is a company.
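
With IOB-style tags (the labeling scheme used in the data format described below), the labels for this example sentence could look as
follows; the exact tag names (for example, whether a company is tagged ``B-ORG``) depend on the label set of the training data, so this
is only an illustration:

.. code::

    Mary   lives  in  Santa  Clara  and  works  at  NVIDIA
    B-PER  O      O   B-LOC  I-LOC  O    O      O   B-ORG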
|
|
|
Quick Start Guide |
|
----------------- |
|
|
|
.. code-block:: python |
|
|
|
from nemo.collections.nlp.models import TokenClassificationModel |
|
|
|
# to get the list of pre-trained models |
|
TokenClassificationModel.list_available_models() |
|
|
|
# Download and load the pre-trained BERT-based model |
|
model = TokenClassificationModel.from_pretrained("ner_en_bert") |
|
|
|
# try the model on a few examples |
|
model.add_predictions(['we bought four shirts from the nvidia gear store in santa clara.', 'NVIDIA is a company.']) |
|
|
|
.. note:: |
|
|
|
    We recommend you try this model in a Jupyter notebook (run it on `Google's Colab <https://colab.research.google.com>`_):
    `NeMo/tutorials/nlp/Token_Classification_Named_Entity_Recognition.ipynb <https://github.com/NVIDIA/NeMo/blob/stable/tutorials/nlp/Token_Classification_Named_Entity_Recognition.ipynb>`__.

    Connect to an instance with a GPU (**Runtime** -> **Change runtime type** -> select **GPU** for the hardware accelerator).
|
|
|
An example script on how to train the model can be found here: `NeMo/examples/nlp/token_classification/token_classification_train.py <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/token_classification/token_classification_train.py>`__.
|
|
|
An example script on how to run evaluation and inference can be found here: `NeMo/examples/nlp/token_classification/token_classification_evaluate.py <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/token_classification/token_classification_evaluate.py>`__.
|
|
|
The default configuration file for the model can be found here: `NeMo/examples/nlp/token_classification/conf/token_classification_config.yaml <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/token_classification/conf/token_classification_config.yaml>`__.
|
|
|
.. _dataset_token_classification: |
|
|
|
Data Input for Token Classification Model |
|
----------------------------------------- |
|
|
|
For pre-training or fine-tuning of the model, the data should be split into two files:
|
|
|
- ``text.txt`` |
|
- ``labels.txt`` |
|
|
|
Each line of the ``text.txt`` file contains a text sequence, where words are separated with spaces, i.e.: ``[WORD] [SPACE] [WORD] [SPACE] [WORD]``.
The ``labels.txt`` file contains the corresponding labels for each word in ``text.txt``; the labels are separated with spaces, i.e.: ``[LABEL] [SPACE] [LABEL] [SPACE] [LABEL]``.
|
Example of a ``text.txt`` file:

.. code::

    Jennifer is from New York City .
    She likes ...
    ...

Corresponding ``labels.txt`` file:

.. code::

    B-PER O O B-LOC I-LOC I-LOC O
    O O ...
    ...
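
Since each line of ``labels.txt`` must contain exactly one label per word in the corresponding line of ``text.txt``, a quick consistency
check can catch formatting problems before training. Below is a minimal sketch, assuming the default training file names used later in
this section:

.. code-block:: python

    # Minimal sanity check: every line of text_train.txt must contain exactly
    # as many space-separated words as its counterpart line in labels_train.txt
    # has labels.
    with open("text_train.txt") as text_f, open("labels_train.txt") as labels_f:
        for line_num, (words, labels) in enumerate(zip(text_f, labels_f), start=1):
            n_words, n_labels = len(words.split()), len(labels.split())
            assert n_words == n_labels, (
                f"line {line_num}: {n_words} words vs. {n_labels} labels"
            )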
|
|
|
Dataset Conversion |
|
------------------ |
|
|
|
To convert an `IOB format <https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)>`__ (short for inside, outside, beginning) data file to the format required by the model, use
`examples/nlp/token_classification/data/import_from_iob_format.py <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/token_classification/data/import_from_iob_format.py>`__:
|
|
|
.. code:: |
|
|
|
    # For conversion from IOB format, for example, for the CoNLL-2003 dataset:
    python import_from_iob_format.py --data_file=<PATH/TO/THE/FILE/IN/IOB/FORMAT>
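
For reference, a file in IOB format stores one token per line together with its tag, with a blank line separating sentences. The
following is a simplified illustration (actual CoNLL-2003 files contain additional columns, such as part-of-speech tags):

.. code::

    Jennifer B-PER
    is O
    from O
    New B-LOC
    York I-LOC
    City I-LOC
    . O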
|
|
|
Required Arguments for Dataset Conversion
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
- :code:`--data_file`: path to the file to convert from IOB to NeMo format |
|
|
|
After running the above command, the data directory where the :code:`--data_file` is stored should contain :code:`text_*.txt` and :code:`labels_*.txt` files.
The default file names for training and evaluation in :code:`conf/token_classification_config.yaml` are the following:
|
|
|
.. code:: |
|
|
|
    .
    |--data_dir
       |-- labels_dev.txt
       |-- labels_train.txt
       |-- text_dev.txt
       |-- text_train.txt
|
|
|
|
|
Training the Token Classification Model
|
--------------------------------------- |
|
|
|
In the Token Classification model, we jointly train a token-level classifier on top of a pre-trained language model, such as
|
`BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding <https://arxiv.org/abs/1810.04805>`__ :cite:`nlp-ner-devlin2018bert`. |
|
Unless the user provides a pre-trained checkpoint for the language model, the language model is initialized with the pre-trained model |
|
from `HuggingFace Transformers <https://github.com/huggingface/transformers>`__. |
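
As an alternative to the training script, the same flow can be driven from Python. Below is a minimal sketch following the structure of
the example script; the config path and ``data_dir`` value are placeholders, and details such as extra ``Trainer`` arguments may differ
from the actual script:

.. code-block:: python

    import pytorch_lightning as pl
    from omegaconf import OmegaConf

    from nemo.collections.nlp.models import TokenClassificationModel

    # Load the example config and point it at the pre-processed data directory.
    cfg = OmegaConf.load("conf/token_classification_config.yaml")
    cfg.model.dataset.data_dir = "<PATH_TO_DATA_DIR>"  # placeholder

    # Build the trainer from the config, then construct and train the model.
    trainer = pl.Trainer(**cfg.trainer)
    model = TokenClassificationModel(cfg.model, trainer=trainer)
    trainer.fit(model)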
|
|
|
An example model configuration file for training can be found at: `NeMo/examples/nlp/token_classification/conf/token_classification_config.yaml <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/token_classification/conf/token_classification_config.yaml>`__.
|
|
|
The specification can be roughly grouped into three categories: |
|
|
|
- Parameters that describe the training process: **trainer** |
|
- Parameters that describe the datasets: **model.dataset**, **model.train_ds**, **model.validation_ds** |
|
- Parameters that describe the model: **model** |
|
|
|
More details about parameters in the spec file can be found below: |
|
|
|
+-------------------------------------+---------------+----------------------------------------------------------------------------------+
| **Parameter**                       | **Data Type** | **Description**                                                                  |
+-------------------------------------+---------------+----------------------------------------------------------------------------------+
| **model.dataset.data_dir**          | string        | Path to the directory with data in the format described above.                  |
+-------------------------------------+---------------+----------------------------------------------------------------------------------+
| **model.head.num_fc_layers**        | integer       | Number of fully connected layers in the classification head.                    |
+-------------------------------------+---------------+----------------------------------------------------------------------------------+
| **model.head.fc_dropout**           | float         | Dropout probability applied to the input hidden states.                         |
+-------------------------------------+---------------+----------------------------------------------------------------------------------+
| **model.head.activation**           | string        | Activation function to use between the fully connected layers.                  |
+-------------------------------------+---------------+----------------------------------------------------------------------------------+
| **model.head.use_transformer_init** | bool          | Whether to initialize the classification head weights as in the Transformer.    |
+-------------------------------------+---------------+----------------------------------------------------------------------------------+
| **model.train_ds.text_file**        | string        | Name of the text training file, located in ``data_dir``.                        |
+-------------------------------------+---------------+----------------------------------------------------------------------------------+
| **model.train_ds.labels_file**      | string        | Name of the labels training file, located in ``data_dir``.                      |
+-------------------------------------+---------------+----------------------------------------------------------------------------------+
| **model.train_ds.num_samples**      | integer       | Number of samples to use from the training set; ``-1`` means use all samples.   |
+-------------------------------------+---------------+----------------------------------------------------------------------------------+
| **model.validation_ds.text_file**   | string        | Name of the text evaluation file, located in ``data_dir``.                      |
+-------------------------------------+---------------+----------------------------------------------------------------------------------+
| **model.validation_ds.labels_file** | string        | Name of the labels evaluation file, located in ``data_dir``.                    |
+-------------------------------------+---------------+----------------------------------------------------------------------------------+
| **model.validation_ds.num_samples** | integer       | Number of samples to use from the evaluation set; ``-1`` means use all samples. |
+-------------------------------------+---------------+----------------------------------------------------------------------------------+
|
|
|
For more information, see :ref:`nlp_model`. |
|
|
|
Example of the command for training the model: |
|
|
|
.. code:: |
|
|
|
python token_classification_train.py \ |
|
model.dataset.data_dir=<PATH_TO_DATA_DIR> \ |
|
trainer.max_epochs=<NUM_EPOCHS> \ |
|
trainer.devices=[<CHANGE_TO_GPU(s)_YOU_WANT_TO_USE>] \ |
|
trainer.accelerator='gpu' |
|
|
|
|
|
Required Arguments for Training |
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
|
|
- :code:`model.dataset.data_dir`: path to the directory with pre-processed data. |
|
|
|
.. note:: |
|
|
|
    While the arguments are defined in the spec file, you can override these parameter definitions from the command line to
    experiment with them. For example, the sample spec file mentioned above has
    :code:`validation_ds.batch_size` set to ``64``. If GPU utilization can be optimized further by
    using a larger batch size, override it to the desired value by adding :code:`model.validation_ds.batch_size=128` on
    the command line. You can repeat this with any of the parameters defined in the sample spec file.
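
    For example, the following command overrides the validation batch size on top of the training command shown earlier (the data
    path is a placeholder):

    .. code::

        python token_classification_train.py \
               model.dataset.data_dir=<PATH_TO_DATA_DIR> \
               model.validation_ds.batch_size=128 \
               trainer.accelerator='gpu'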
|
|
|
Inference |
|
--------- |
|
|
|
An example script on how to run inference can be found at `examples/nlp/token_classification/token_classification_evaluate.py <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/token_classification/token_classification_evaluate.py>`_. |
|
|
|
To run inference with the pre-trained model, run: |
|
|
|
.. code:: |
|
|
|
python token_classification_evaluate.py \ |
|
pretrained_model=<PRETRAINED_MODEL> |
|
|
|
Required Arguments for Inference |
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
|
|
- :code:`pretrained_model`: pretrained Token Classification model from ``list_available_models()`` or path to a ``.nemo`` file. For example, ``ner_en_bert`` or ``your_model.nemo``.
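
The same predictions can also be obtained from Python, mirroring the Quick Start example above; the ``.nemo`` path below is a
placeholder:

.. code-block:: python

    from nemo.collections.nlp.models import TokenClassificationModel

    # Load a pre-trained model by name ...
    model = TokenClassificationModel.from_pretrained("ner_en_bert")
    # ... or restore a local checkpoint instead:
    # model = TokenClassificationModel.restore_from("your_model.nemo")

    model.add_predictions(["we bought four shirts from the nvidia gear store in santa clara."])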
|
|
|
Model Evaluation |
|
---------------- |
|
|
|
An example script on how to evaluate the pre-trained model can be found at `examples/nlp/token_classification/token_classification_evaluate.py <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/token_classification/token_classification_evaluate.py>`_. |
|
|
|
To start evaluation of the pre-trained model, run: |
|
|
|
.. code:: |
|
|
|
python token_classification_evaluate.py \ |
|
model.dataset.data_dir=<PATH/TO/DATA/DIR> \ |
|
pretrained_model=ner_en_bert \ |
|
model.test_ds.text_file=<text_*.txt> \ |
|
model.test_ds.labels_file=<labels_*.txt> \ |
|
model.dataset.max_seq_length=512 |
|
|
|
|
|
Required Arguments for Evaluation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
- :code:`pretrained_model`: pretrained Token Classification model from ``list_available_models()`` or path to a ``.nemo`` file. For example, ``ner_en_bert`` or ``your_model.nemo``.
- :code:`model.dataset.data_dir`: path to the directory that contains :code:`model.test_ds.text_file` and :code:`model.test_ds.labels_file`
|
|
|
During evaluation of the :code:`test_ds`, the script generates a classification report that includes the following metrics: |
|
|
|
- :code:`Precision` |
|
- :code:`Recall` |
|
- :code:`F1` |
|
|
|
For more information, see `here <https://en.wikipedia.org/wiki/Precision_and_recall>`__. |
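
In terms of per-label true positives (TP), false positives (FP), and false negatives (FN), these metrics are computed as:

.. math::

    \text{Precision} = \frac{TP}{TP + FP}, \qquad
    \text{Recall} = \frac{TP}{TP + FN}, \qquad
    F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}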
|
|
|
References |
|
---------- |
|
|
|
.. bibliography:: nlp_all.bib |
|
:style: plain |
|
:labelprefix: NLP-NER |
|
:keyprefix: nlp-ner- |
|
|