Model Card for PII Detection with DeBERTa

This model is a fine-tuned version of microsoft/deberta for Named Entity Recognition (NER), designed specifically to detect Personally Identifiable Information (PII) entities such as names, SSNs, phone numbers, credit card numbers, bank account and routing numbers, and addresses.

Model Details

Model Description

This transformer-based model is fine-tuned on a custom dataset to detect sensitive information, commonly categorized as PII. The model performs sequence labeling to identify entities using token-level classification.

  • Developed by: Privatone
  • Fine-tuned from model: microsoft/deberta
  • Model type: Token Classification (NER)
  • Language(s): English
  • Use case: PII detection in text
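
Because the model is a token classifier, each token receives one label from a fixed tag set. A quick way to inspect that tag set is to read it from the model configuration; the snippet below is a minimal sketch, and the mapping shown in the comment is only an assumption about its shape.

from transformers import AutoConfig

# Minimal sketch: inspect the label set of the token-classification head.
config = AutoConfig.from_pretrained("AI-Enthusiast11/pii-entity-extractor")
print(config.id2label)
# Illustrative shape only (the exact ids and tag names come from the config):
# {0: "O", 1: "B-NAME", 2: "I-NAME", 3: "B-SSN", ...}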

Training Details

Training Data

The model was fine-tuned on a custom dataset containing labeled examples of the following PII entity types:

  • NAME
  • SSN
  • PHONE-NO
  • CREDIT-CARD-NO
  • BANK-ACCOUNT-NO
  • BANK-ROUTING-NO
  • ADDRESS
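
The dataset itself is not published with this card. Purely as an illustration, a token-labeled example using these entity types in a BIO scheme could look like the following (tokens, values, and tag names are hypothetical).

# Hypothetical BIO-tagged example; the real dataset format is not published.
tokens = ["My", "name", "is", "Mia", "Thompson", ",", "SSN", "512-04-7281", "."]
labels = ["O",  "O",    "O",  "B-NAME", "I-NAME", "O", "O",  "B-SSN",       "O"]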

Epoch Logs

Epoch  Train Loss  Val Loss  Precision  Recall  F1      Accuracy
1      0.3672      0.1987    0.7806     0.8114  0.7957  0.9534
2      0.1149      0.1011    0.9161     0.9772  0.9457  0.9797
3      0.0795      0.0889    0.9264     0.9825  0.9536  0.9813
4      0.0708      0.0880    0.9242     0.9842  0.9533  0.9806
5      0.0626      0.0858    0.9235     0.9851  0.9533  0.9806
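
The training code is not included in this card. As a rough sketch, per-epoch precision, recall, F1, and accuracy like those above are typically computed with seqeval via a compute_metrics callback passed to transformers.Trainer (the label list below is a placeholder).

import numpy as np
from seqeval.metrics import precision_score, recall_score, f1_score, accuracy_score

# Placeholder label list; the real one comes from the model's id2label config.
label_list = ["O", "B-NAME", "I-NAME", "B-SSN", "I-SSN"]  # ...

def compute_metrics(eval_pred):
    """Per-epoch metrics (sketch for use with transformers.Trainer)."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Ignore special tokens / padding, which are conventionally labeled -100.
    true_labels = [
        [label_list[l] for l in label_row if l != -100]
        for label_row in labels
    ]
    true_preds = [
        [label_list[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]
    return {
        "precision": precision_score(true_labels, true_preds),
        "recall": recall_score(true_labels, true_preds),
        "f1": f1_score(true_labels, true_preds),
        "accuracy": accuracy_score(true_labels, true_preds),
    }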

SeqEval Classification Report

Label            Precision  Recall  F1-score  Support
ADDRESS          0.91       0.94    0.92       77
BANK-ACCOUNT-NO  0.91       0.99    0.95      169
BANK-ROUTING-NO  0.85       0.96    0.90      104
CREDIT-CARD-NO   0.95       1.00    0.97      228
NAME             0.98       0.97    0.97      164
PHONE-NO         0.94       0.99    0.96      308
SSN              0.87       1.00    0.93       90

Summary

  • Micro avg: 0.95
  • Macro avg: 0.95
  • Weighted avg: 0.95
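
The per-entity report above is the standard seqeval classification report. A minimal sketch of how to reproduce that layout, using placeholder gold and predicted tag sequences:

from seqeval.metrics import classification_report

# Placeholder sequences; in practice these are the held-out set's gold labels
# and the model's predictions, as lists of per-sentence BIO tag lists.
y_true = [["B-NAME", "I-NAME", "O", "B-PHONE-NO"]]
y_pred = [["B-NAME", "I-NAME", "O", "B-PHONE-NO"]]
print(classification_report(y_true, y_pred, digits=2))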

Evaluation

Testing Data

Evaluation was done on a held-out portion of the same labeled dataset.
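
The exact split procedure is not documented. A typical held-out split with the datasets library might look like the sketch below (the data file name is a placeholder).

from datasets import load_dataset

# Placeholder file name; the actual labeled data is not published.
dataset = load_dataset("json", data_files="pii_labeled_examples.jsonl")["train"]
splits = dataset.train_test_split(test_size=0.2, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]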

Metrics

  • Precision
  • Recall
  • F1 (via seqeval)
  • Entity-wise breakdown
  • Token-level accuracy

Results

  • F1-scores of 0.90 or higher for every PII label, with an overall (micro-average) F1 of 0.95, indicating robust PII detection.

Recommendations

  • Use human review in high-risk environments.
  • Evaluate on your own domain-specific data before deployment (see the spot-check sketch below).
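
As a starting point for both recommendations, the sketch below runs the pipeline over a few placeholder domain texts and prints each prediction with its confidence score, so that low-confidence spans can be routed to human review.

from transformers import pipeline

# Minimal spot-check sketch; the sample texts are placeholders for your own data.
nlp = pipeline(
    "ner",
    model="AI-Enthusiast11/pii-entity-extractor",
    aggregation_strategy="simple",
)

domain_samples = [
    "Please charge the deposit to card 4111 1111 1111 1111.",
    "Send the refund to account 000123456789, routing 123456789.",
]

for text in domain_samples:
    print(text)
    for ent in nlp(text):
        # Low scores are good candidates for human review.
        print(f"  {ent['entity_group']:<16} {ent['word']!r}  score={ent['score']:.2f}")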

How to Get Started with the Model


from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

model_name = "AI-Enthusiast11/pii-entity-extractor"  
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Post-processing: merge subword pieces returned by the pipeline into full entity strings
def merge_tokens(ner_results):
    entities = {}
    for entity in ner_results:
        entity_type = entity["entity_group"]
        entity_value = entity["word"].replace("##", "")  # Remove subword prefixes

        # Handle token merging
        if entity_type not in entities:
            entities[entity_type] = []
        if entities[entity_type] and not entity_value.startswith(" "):
            # If the previous token exists and this one isn't a new word, merge it
            entities[entity_type][-1] += entity_value
        else:
            entities[entity_type].append(entity_value)

    return entities

def redact_text_with_labels(text):
    # Run the NER pipeline (defined below) and replace each detected entity with its label
    ner_results = nlp(text)

    # Merge tokens for multi-token entities (if any)
    cleaned_entities = merge_tokens(ner_results)

    redacted_text = text
    for entity_type, values in cleaned_entities.items():
        for value in values:
            # Replace each identified entity with the label
            redacted_text = redacted_text.replace(value, f"[{entity_type}]")

    return redacted_text



# Load the NER pipeline, reusing the model and tokenizer loaded above
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example input containing several PII entities
example = "Hi, I’m Mia Thompson. I recently noticed that my electricity bill hasn’t been updated despite making the payment last week. I used account number 4893172051 linked with routing number 192847561. My service was nearly suspended, and I’d appreciate it if you could verify the payment. You can reach me at 727-814-3902 if more information is needed."

# Run pipeline and process result
ner_results = nlp(example)
cleaned_entities = merge_tokens(ner_results)

# Print the NER results
print("\n==NER Results:==\n")
for entity_type, values in cleaned_entities.items():
    print(f"  {entity_type}: {', '.join(values)}")

# Redact the single example with labels
redacted_example = redact_text_with_labels(example)

# Print the redacted result
print(f"\n==Redacted Example:==\n{redacted_example}")