---
license: cc-by-nc-nd-4.0
language:
- az
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: token-classification
tags:
- personally
- identifiable
- information
- recognition
- ner
new_version: LocalDoc/private_ner_azerbaijani_v2
---
# PII NER Azerbaijani
**PII NER Azerbaijani** is a fine-tuned Named Entity Recognition (NER) model based on XLM-RoBERTa. It is trained on Azerbaijani PII data to extract personally identifiable information such as names, dates of birth, cities, addresses, and phone numbers from text.
## Model Details
- **Base Model:** XLM-RoBERTa
- **Training Metrics:**

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|-------|---------------|-----------------|-----------|--------|-----|----------|
| 1 | 0.1560 | 0.1309 | 0.7794 | 0.7940 | 0.7866 | 0.9590 |
| 2 | 0.1196 | 0.1172 | 0.8042 | 0.8078 | 0.8060 | 0.9618 |
| 3 | 0.1069 | 0.1129 | 0.8096 | 0.8213 | 0.8154 | 0.9639 |

- **Test Metrics:**

| Loss | Precision | Recall | F1 | Accuracy |
|------|-----------|--------|-----|----------|
| 0.11616 | 0.80187 | 0.80821 | 0.80503 | 0.96264 |
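As a sanity check, the reported F1 is the harmonic mean of precision and recall, and the test numbers above are internally consistent:

```python
# F1 is the harmonic mean of precision and recall.
precision, recall = 0.80187, 0.80821
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 5))  # 0.80503, matching the reported test F1
```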
## Entities (id2label)
```python
{
    0: "O",
    1: "VEHICLEVRM",
    2: "HEIGHT",
    3: "USERNAME",
    4: "FIRSTNAME",
    5: "BUILDINGNUMBER",
    6: "SEX",
    7: "PHONENUMBER",
    8: "CURRENCY",
    9: "CREDITCARDISSUER",
    10: "CURRENCYNAME",
    11: "MAC",
    12: "MIDDLENAME",
    13: "TIME",
    14: "EYECOLOR",
    15: "CURRENCYSYMBOL",
    16: "GENDER",
    17: "URL",
    18: "CURRENCYCODE",
    19: "ZIPCODE",
    20: "CREDITCARDCVV",
    21: "JOBTITLE",
    22: "PHONEIMEI",
    23: "COUNTY",
    24: "JOBTYPE",
    25: "LITECOINADDRESS",
    26: "COMPANYNAME",
    27: "ORDINALDIRECTION",
    28: "MASKEDNUMBER",
    29: "USERAGENT",
    30: "LASTNAME",
    31: "SSN",
    32: "STREET",
    33: "SECONDARYADDRESS",
    34: "STATE",
    35: "ETHEREUMADDRESS",
    36: "AMOUNT",
    37: "ACCOUNTNUMBER",
    38: "CITY",
    39: "CREDITCARDNUMBER",
    40: "BIC",
    41: "EMAIL",
    42: "NEARBYGPSCOORDINATE",
    43: "PIN",
    44: "ACCOUNTNAME",
    45: "VEHICLEVIN",
    46: "PREFIX",
    47: "JOBAREA",
    48: "AGE",
    49: "PASSWORD",
    50: "DOB",
    51: "BITCOINADDRESS",
    52: "IBAN",
    53: "IP",
    54: "DATE"
}
```
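At inference time this mapping is consumed via `model.config.id2label`: each token's argmax class id is looked up in the dict. Note that the tag set has no BIO prefixes, so consecutive tokens with the same label must be merged heuristically, as the usage code below does. A minimal illustration with a small subset of the labels (the ids and predictions here are hypothetical):

```python
# Subset of the id2label mapping shown above.
id2label = {0: "O", 4: "FIRSTNAME", 7: "PHONENUMBER", 30: "LASTNAME"}

# Hypothetical per-token class ids, e.g. from logits.argmax(-1).
pred_ids = [0, 4, 30, 0, 7]
labels = [id2label[i] for i in pred_ids]
print(labels)  # ['O', 'FIRSTNAME', 'LASTNAME', 'O', 'PHONENUMBER']
```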
## Usage
To use the model for PII entity recognition:
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_id = "LocalDoc/private_ner_azerbaijani"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

test_text = (
    "Salam, mənim adım Əli Hüseynovdur. Doğum tarixim 15.05.1990-dır. "
    "Bakı şəhərində, Nizami küçəsində, 25/31 ünvanında yaşayıram. "
    "Telefon nömrəm +994552345678-dir."
)

# Tokenize and keep character offsets so entity spans can be mapped back to the text.
inputs = tokenizer(test_text, return_tensors="pt", return_offsets_mapping=True)
offset_mapping = inputs.pop("offset_mapping")

with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.argmax(outputs.logits, dim=2)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
offset_mapping = offset_mapping[0].tolist()
predicted_labels = [model.config.id2label[pred.item()] for pred in predictions[0]]
word_ids = inputs.word_ids(batch_index=0)

# Group sub-word tokens back into whole words, keeping the first sub-token's label.
aggregated = []
prev_word_id = None
for idx, word_id in enumerate(word_ids):
    if word_id is None:  # special tokens (<s>, </s>)
        continue
    if word_id != prev_word_id:
        aggregated.append({
            "word_id": word_id,
            "tokens": [tokens[idx]],
            "offsets": [offset_mapping[idx]],
            "label": predicted_labels[idx],
        })
    else:
        aggregated[-1]["tokens"].append(tokens[idx])
        aggregated[-1]["offsets"].append(offset_mapping[idx])
    prev_word_id = word_id

# Merge consecutive words with the same label into a single entity span.
entities = []
current_entity = None
for word in aggregated:
    if word["label"] == "O":
        if current_entity is not None:
            entities.append(current_entity)
            current_entity = None
    elif current_entity is not None and word["label"] == current_entity["type"]:
        current_entity["end"] = word["offsets"][-1][1]
    else:
        if current_entity is not None:
            entities.append(current_entity)
        current_entity = {
            "type": word["label"],
            "start": word["offsets"][0][0],
            "end": word["offsets"][-1][1],
        }
if current_entity is not None:
    entities.append(current_entity)

# Recover the original surface text for each entity span and print the results.
for entity in entities:
    entity["text"] = test_text[entity["start"]:entity["end"]]
    print(entity)
```
Output:

```
{'type': 'FIRSTNAME', 'start': 18, 'end': 21, 'text': 'Əli'}
{'type': 'LASTNAME', 'start': 22, 'end': 34, 'text': 'Hüseynovdur.'}
{'type': 'DOB', 'start': 49, 'end': 64, 'text': '15.05.1990-dır.'}
{'type': 'STREET', 'start': 81, 'end': 87, 'text': 'Nizami'}
{'type': 'BUILDINGNUMBER', 'start': 99, 'end': 104, 'text': '25/31'}
{'type': 'PHONENUMBER', 'start': 141, 'end': 159, 'text': '+994552345678-dir.'}
```
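Note that predicted spans can include attached Azerbaijani copula suffixes and trailing punctuation (e.g. `Hüseynovdur.`, `15.05.1990-dır.`). If clean surface forms are needed, a small post-processing step can normalize them. The `strip_az_suffix` helper below is a hypothetical sketch, not part of the model; a regex like this can over-strip words that genuinely end in these letters, so a real system might prefer morphological analysis:

```python
import re

def strip_az_suffix(text: str) -> str:
    """Hypothetical cleanup: drop trailing punctuation and a common
    Azerbaijani copula suffix (-dir/-dır/-dur/-dür) from an entity span."""
    text = text.rstrip(".,;:")
    return re.sub(r"-?(dir|dır|dur|dür)$", "", text)

print(strip_az_suffix("Hüseynovdur."))        # Hüseynov
print(strip_az_suffix("15.05.1990-dır."))     # 15.05.1990
print(strip_az_suffix("+994552345678-dir."))  # +994552345678
```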
## License
This model is licensed under the CC BY-NC-ND 4.0 license.

What does this license allow?

- **Attribution:** You must give appropriate credit, provide a link to the license, and indicate if changes were made.
- **Non-Commercial:** You may not use the material for commercial purposes.
- **No Derivatives:** If you remix, transform, or build upon the material, you may not distribute the modified material.

For more information, please refer to the <a target="_blank" href="https://creativecommons.org/licenses/by-nc-nd/4.0/">CC BY-NC-ND 4.0 license</a>.
## Contact
For more information, questions, or issues, please contact LocalDoc at [v.resad.89@gmail.com](mailto:v.resad.89@gmail.com).