Text Classification
PEFT
Safetensors
English
prasoonv committed (verified)
Commit 8a2cf4e · 1 Parent(s): 62006ac

Update model name


Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com>

Files changed (1): README.md +8 -8
README.md CHANGED
@@ -15,19 +15,19 @@ pipeline_tag: text-classification
 The use of this model is governed by the [Llama 2 Community License Agreement](https://ai.meta.com/llama/license/).
 
 ## Model Details
- Aegis-AI-Content-Safety-LlamaGuard-LLM-Defensive-1.0 is a LLM content safety model. It is a parameter efficient instruction tuned version of [Llama Guard](https://huggingface.co/meta-llama/LlamaGuard-7b) based on [Llama2-7B](https://arxiv.org/abs/2307.09288) trained on Nvidia's content safety dataset [Aegis Content Safety Dataset](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0) covering Nvidia's broad taxonomy of 13 critical safety risk categories.
+ Llama Nemotron Safety Guard Defensive V1, formerly known as Aegis-AI-Content-Safety-LlamaGuard-LLM-Defensive-1.0, is an LLM content safety model. It is a parameter efficient instruction tuned version of [Llama Guard](https://huggingface.co/meta-llama/LlamaGuard-7b) based on [Llama2-7B](https://arxiv.org/abs/2307.09288) trained on Nvidia's content safety dataset [Nemotron Content Safety Dataset V1](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0) covering Nvidia's broad taxonomy of 13 critical safety risk categories.
 
- Paper Details: [Aegis Content Moderation](https://arxiv.org/pdf/2404.05993.pdf#page=10.63)
+ Paper Details: [Aegis 1.0: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts](https://arxiv.org/pdf/2404.05993.pdf#page=10.63)
 
 ### Model Description
- The Aegis-AI-Content-Safety-LlamaGuard-LLM-Defensive-1.0 model involves the following:
+ The Llama-2-Nemotron-Safety-Guard-Defensive-7B-v1 model involves the following:
 
 1. System instruction including the safety taxonomy, a safety policy with inclusions and, exclusions.
 2. The system prompt instructs the LLM to moderate user prompt, partial dialog or full dialog.
 3. The LLM response is a string which can be either safe or unsafe. If the string generated by the LLM is "unsafe", on a new line, the category ID of violation is output by the LLM based on the policy in the system prompt.
 4. Novel safety risk categories and policy can be provided in the instruction for the model to categorize using the novel taxonomy and policy.
 5. The safety taxonomy and policy used to train the models contain 13 critically unsafe risk categories, a safe category and a "needs caution" category.
- 6. Internally annotated dataset called Aegis-AI-Content-Safety-Dataset-1.0 of approximately 11,000 prompts and responses are used to instruction tune the model. Annotations are at dialog level not per turn.
+ 6. Internally annotated dataset called [Nemotron Content Safety Dataset V1](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0) of approximately 11,000 prompts and responses are used to instruction tune the model. Annotations are at dialog level not per turn.
 We have since collected in total 30,000 annotations on a further expanded taxonomy and future versions of the models will be trained on the full set. The annotations are at dialog level instead of per-turn level.
 7. Model is instruction tuned with safety instruction, with the LLM behaving as a classifier in this setting.
 PLEASE NOTE: Model has only been trained to perform prompt classification since the annotations were not available at turn level. If you wish to use the model for response classification, use the template as provided below.
@@ -169,7 +169,7 @@ Ethical use: Technology can have a profound impact on people and the world, and
 
 ### Direct Use
 
- - The Aegis-AI-Content-Safety-LlamaGuard-LLM-Defensive-1.0 model is for users who wants to safeguard or evaluate a general purpose LLM's generated content
+ - The Llama-2-Nemotron-Safety-Guard-Defensive-7B-v1 model is for users who wants to safeguard or evaluate a general purpose LLM's generated content
 
 Model and dataset restrictions:
 
@@ -376,7 +376,7 @@ The inference code for this model is available through the NeMo Curator GitHub r
 ## Training Details
 
 ### Training Data
- The model has been trained on Nvidia's [Aegis Content Safety Dataset](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0)
+ The model has been trained on Nvidia's [Nemotron Content Safety Dataset V1](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0)
 
 * Human Prompts from Anthropic RLHF harmless dataset [Anthropic RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf)
 * LLM response generated from Mistral-7B-v0.1 [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
@@ -409,7 +409,7 @@ We use the [PEFT](https://huggingface.co/docs/peft/en/index) library from Huggin
 
 The model has been evaluated on the following benchmarks:
 
- * Test partition of Nvidia's content safety dataset [Aegis Content Safety Dataset](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0)
+ * Test partition of Nvidia's content safety dataset [Nemotron Content Safety Dataset V1](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0)
 * [Toxic Chat Dataset](https://huggingface.co/datasets/lmsys/toxic-chat)
 * [Open AI Moderation Dataset](https://huggingface.co/datasets/mmathys/openai-moderation-api-evaluation/tree/main)
 * [SimpleSafetyTests Benchmark](https://arxiv.org/html/2311.08370v2)
@@ -417,7 +417,7 @@ The model has been evaluated on the following benchmarks:
 #### Metrics
 We report F1 and AUPRC scores for the model on the evaluation benchmarks.
 
- ### Results on Aegis Content Safety Test Set
+ ### Results on the Nemotron Content Safety V1 Test Set
 Model | AUPRC | F1 |
 ------------ |:-----------: |-----------: |
 Llama Guard Base |0.930 |0.62 |
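For orientation, here is a minimal sketch of the prompt-classification flow that items 1-7 of the Model Description hunk describe: a parameter-efficient adapter applied on top of the LlamaGuard-7b base via the PEFT library, a system instruction carrying the safety taxonomy and policy, and a generated verdict of "safe" or "unsafe" with the violated category ID on the next line. The adapter repo id, the abbreviated instruction text, and the helper name `moderate_prompt` are illustrative assumptions, not the card's canonical usage code; the actual inference code is available through the NeMo Curator GitHub repository referenced in the hunk headers above.

```python
# Minimal sketch of the prompt-classification flow described in the Model Description.
# Assumptions (not taken from the card): the adapter repo id, the abbreviated system
# instruction, and the helper name are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "meta-llama/LlamaGuard-7b"  # base model named in the card
ADAPTER_ID = "nvidia/Llama-2-Nemotron-Safety-Guard-Defensive-7B-v1"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, ADAPTER_ID)  # attach the parameter-efficient adapter

def moderate_prompt(user_message: str) -> tuple[str, str | None]:
    """Return ("safe", None) or ("unsafe", "<violated category id>")."""
    # The real system instruction embeds the full 13-category taxonomy and policy;
    # this stub only illustrates the shape of the request being moderated.
    prompt = (
        "[INST] Task: Check if there is unsafe content in 'User' messages "
        "according to the safety policy below.\n"
        "<safety taxonomy and policy go here>\n\n"
        f"User: {user_message}\n\n"
        "Provide your safety assessment: [/INST]"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=32)
    completion = tokenizer.decode(
        output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    )
    # First generated line is "safe" or "unsafe"; if unsafe, the violated
    # category ID follows on a new line.
    lines = [line.strip() for line in completion.strip().splitlines() if line.strip()]
    verdict = lines[0] if lines else ""
    category = lines[1] if verdict == "unsafe" and len(lines) > 1 else None
    return verdict, category
```

Per the PLEASE NOTE line in the diff, the model was only trained for prompt classification; response classification would follow the same pattern but requires the template provided in the full card rather than the stub shown here.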
 
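On the Metrics hunk: a minimal sketch, assuming binary safe/unsafe ground-truth labels and per-example unsafe scores, of how F1 and AUPRC (area under the precision-recall curve) numbers like those in the results table could be computed with scikit-learn. The variable names, the example values, and the 0.5 decision threshold are illustrative assumptions; the exact evaluation script is not specified in this diff.

```python
# Minimal sketch of the reported metrics for a binary safe/unsafe classifier.
# Assumptions: y_true holds 1 for unsafe and 0 for safe, y_score holds the model's
# unsafe probability, and 0.5 is used as the decision threshold for F1.
from sklearn.metrics import average_precision_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                           # ground-truth labels (illustrative)
y_score = [0.92, 0.10, 0.65, 0.40, 0.55, 0.05, 0.88, 0.20]  # model scores (illustrative)

auprc = average_precision_score(y_true, y_score)            # area under the precision-recall curve
f1 = f1_score(y_true, [int(s >= 0.5) for s in y_score])     # F1 at a fixed threshold

print(f"AUPRC: {auprc:.3f}  F1: {f1:.3f}")
```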