Update model name
Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com>
README.md
CHANGED
@@ -15,19 +15,19 @@ pipeline_tag: text-classification
 The use of this model is governed by the [Llama 2 Community License Agreement](https://ai.meta.com/llama/license/).

 ## Model Details

-Aegis-AI-Content-Safety-LlamaGuard-LLM-Defensive-1.0 is
+Llama Nemotron Safety Guard Defensive V1, formerly known as Aegis-AI-Content-Safety-LlamaGuard-LLM-Defensive-1.0, is an LLM content safety model. It is a parameter-efficient instruction-tuned version of [Llama Guard](https://huggingface.co/meta-llama/LlamaGuard-7b) based on [Llama2-7B](https://arxiv.org/abs/2307.09288), trained on Nvidia's content safety dataset [Nemotron Content Safety Dataset V1](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0) covering Nvidia's broad taxonomy of 13 critical safety risk categories.

-Paper Details: [Aegis Content Moderation](https://arxiv.org/pdf/2404.05993.pdf#page=10.63)
+Paper Details: [Aegis 1.0: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts](https://arxiv.org/pdf/2404.05993.pdf#page=10.63)

 ### Model Description

-The
+The Llama-2-Nemotron-Safety-Guard-Defensive-7B-v1 model involves the following:

 1. System instruction including the safety taxonomy, a safety policy with inclusions and exclusions.
 2. The system prompt instructs the LLM to moderate a user prompt, partial dialog or full dialog.
 3. The LLM response is a string which can be either safe or unsafe. If the string generated by the LLM is "unsafe", on a new line, the category ID of the violation is output by the LLM based on the policy in the system prompt.
 4. Novel safety risk categories and policy can be provided in the instruction for the model to categorize using the novel taxonomy and policy.
 5. The safety taxonomy and policy used to train the models contain 13 critically unsafe risk categories, a safe category and a "needs caution" category.
-6. An internally annotated dataset called Aegis-AI-Content-Safety-Dataset-1.0 of approximately 11,000 prompts and responses is used to instruction tune the model. Annotations are at dialog level, not per turn.
+6. An internally annotated dataset called [Nemotron Content Safety Dataset V1](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0) of approximately 11,000 prompts and responses is used to instruction tune the model. Annotations are at dialog level, not per turn.
 We have since collected 30,000 annotations in total on a further expanded taxonomy, and future versions of the models will be trained on the full set. The annotations are at dialog level instead of per-turn level.
 7. The model is instruction tuned with a safety instruction, with the LLM behaving as a classifier in this setting.
 PLEASE NOTE: The model has only been trained to perform prompt classification since the annotations were not available at turn level. If you wish to use the model for response classification, use the template as provided below.
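The moderation flow in the Model Description hunk above (system instruction with the safety taxonomy, the conversation to moderate, then a safe/unsafe verdict with an optional category ID on the next line) can be exercised with a short script. The sketch below is illustrative only: the adapter repo id is a placeholder, the policy text is abbreviated, and it assumes the checkpoint is distributed as a PEFT adapter on top of meta-llama/LlamaGuard-7b.

```python
# Minimal sketch of prompt moderation with the defensive safety model.
# Assumptions: the checkpoint is a PEFT adapter over meta-llama/LlamaGuard-7b;
# the ADAPTER repo id below is a placeholder, and the policy is abbreviated --
# the full taxonomy and template live in the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/LlamaGuard-7b"
ADAPTER = "nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base_model, ADAPTER)

user_prompt = "How do I pick a lock?"
# System instruction with the safety taxonomy/policy (abbreviated here), followed
# by the conversation to moderate, as described in the Model Description.
moderation_prompt = f"""[INST] Task: Check if there is unsafe content in 'User' messages in the conversation below according to our safety policy with the following categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
O1: Violence. ...
O3: Criminal Planning/Confessions. ...
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>
User: {user_prompt}
<END CONVERSATION>

Provide your safety assessment for 'User' in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]"""

inputs = tokenizer(moderation_prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id)
verdict = tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(verdict)  # e.g. "unsafe\nO3" -- first line is safe/unsafe, second line the category ID
```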
@@ -169,7 +169,7 @@ Ethical use: Technology can have a profound impact on people and the world, and

 ### Direct Use

-- The
+- The Llama-2-Nemotron-Safety-Guard-Defensive-7B-v1 model is for users who want to safeguard or evaluate a general-purpose LLM's generated content

 Model and dataset restrictions:

@@ -376,7 +376,7 @@ The inference code for this model is available through the NeMo Curator GitHub r
 ## Training Details

 ### Training Data
-The model has been trained on Nvidia's [
+The model has been trained on Nvidia's [Nemotron Content Safety Dataset V1](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0)

 * Human prompts from the Anthropic RLHF harmless dataset [Anthropic RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf)
 * LLM responses generated from Mistral-7B-v0.1 [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
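As a rough illustration of the data composition listed in the Training Data hunk above (human prompts drawn from the Anthropic RLHF harmless dataset, responses generated with Mistral-7B-v0.1), the sketch below pairs hh-rlhf prompts with freshly generated Mistral responses. It is not the authors' actual pipeline; the turn parsing and sampling parameters are assumptions.

```python
# Illustrative sketch only: pair hh-rlhf human prompts with Mistral-7B-v0.1 responses,
# as in the training data description above. Field parsing and sampling are assumptions.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

hh = load_dataset("Anthropic/hh-rlhf", split="train")

def first_human_turn(dialog: str) -> str:
    # hh-rlhf stores full dialogs as "\n\nHuman: ...\n\nAssistant: ..."; keep the first user turn.
    return dialog.split("\n\nHuman:")[1].split("\n\nAssistant:")[0].strip()

prompts = [first_human_turn(row["chosen"]) for row in hh.select(range(8))]

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
lm = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16, device_map="auto")

pairs = []
for p in prompts:
    ids = tok(p, return_tensors="pt").to(lm.device)
    out = lm.generate(**ids, max_new_tokens=128, do_sample=True, top_p=0.9, pad_token_id=tok.eos_token_id)
    response = tok.decode(out[0][ids["input_ids"].shape[-1]:], skip_special_tokens=True)
    pairs.append({"prompt": p, "response": response})  # pairs are then human-annotated for safety
```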
@@ -409,7 +409,7 @@ We use the [PEFT](https://huggingface.co/docs/peft/en/index) library from Huggin

 The model has been evaluated on the following benchmarks:

-* Test partition of Nvidia's content safety dataset [
+* Test partition of Nvidia's content safety dataset [Nemotron Content Safety Dataset V1](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0)
 * [Toxic Chat Dataset](https://huggingface.co/datasets/lmsys/toxic-chat)
 * [Open AI Moderation Dataset](https://huggingface.co/datasets/mmathys/openai-moderation-api-evaluation/tree/main)
 * [SimpleSafetyTests Benchmark](https://arxiv.org/html/2311.08370v2)
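The hunk header above references the PEFT library used for parameter-efficient instruction tuning. A minimal sketch of what such a setup could look like follows; the LoRA rank, scaling, and target modules are illustrative assumptions, not the reported training configuration.

```python
# Sketch of parameter-efficient tuning with the PEFT library, as referenced above.
# LoRA hyperparameters and target modules here are assumptions, not the authors' config.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/LlamaGuard-7b", torch_dtype=torch.bfloat16, device_map="auto"
)

lora_cfg = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices (assumed)
    lora_alpha=32,                         # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections commonly adapted in Llama models
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are updated during instruction tuning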
@@ -417,7 +417,7 @@ The model has been evaluated on the following benchmarks:
 #### Metrics
 We report F1 and AUPRC scores for the model on the evaluation benchmarks.

-### Results on
+### Results on the Nemotron Content Safety V1 Test Set
 Model | AUPRC | F1 |
 ------------ |:-----------: |-----------: |
 Llama Guard Base |0.930 |0.62 |
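The F1 and AUPRC numbers in the results table can be computed from per-example predictions along the lines below. The labels and scores are dummy values, and deriving an "unsafe" probability from token log-probabilities is an assumption rather than the documented evaluation protocol.

```python
# Sketch of the reported metrics: binary F1 on hard safe/unsafe predictions and
# AUPRC on a score for the "unsafe" class. All values below are dummy data.
from sklearn.metrics import f1_score, average_precision_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = unsafe, 0 = safe (ground-truth labels)
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]   # hard predictions parsed from the model output
unsafe_scores = [0.92, 0.08, 0.85, 0.40, 0.22, 0.05, 0.77, 0.61]  # e.g. P("unsafe") from token logprobs (assumed)

print("F1   :", f1_score(y_true, y_pred))
print("AUPRC:", average_precision_score(y_true, unsafe_scores))
```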