Update model name
Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com>
README.md
CHANGED
@@ -15,19 +15,19 @@ pipeline_tag: text-classification
 The use of this model is governed by the [Llama 2 Community License Agreement](https://ai.meta.com/llama/license/).

 ## Model Details

-Aegis-AI-Content-Safety-LlamaGuard-LLM-Defensive-1.0 is
+Llama Nemotron Safety Guard Defensive V1, formerly known as Aegis-AI-Content-Safety-LlamaGuard-LLM-Defensive-1.0, is an LLM content safety model. It is a parameter-efficient instruction-tuned version of [Llama Guard](https://huggingface.co/meta-llama/LlamaGuard-7b) based on [Llama2-7B](https://arxiv.org/abs/2307.09288), trained on Nvidia's content safety dataset [Nemotron Content Safety Dataset V1](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0) covering Nvidia's broad taxonomy of 13 critical safety risk categories.

-Paper Details: [Aegis Content Moderation](https://arxiv.org/pdf/2404.05993.pdf#page=10.63)
+Paper Details: [Aegis 1.0: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts](https://arxiv.org/pdf/2404.05993.pdf#page=10.63)

 ### Model Description

-The
+The Llama-2-Nemotron-Safety-Guard-Defensive-7B-v1 model involves the following:

 1. System instruction including the safety taxonomy, a safety policy with inclusions and exclusions.
 2. The system prompt instructs the LLM to moderate a user prompt, partial dialog or full dialog.
 3. The LLM response is a string which can be either safe or unsafe. If the string generated by the LLM is "unsafe", on a new line, the category ID of the violation is output by the LLM based on the policy in the system prompt.
 4. Novel safety risk categories and policy can be provided in the instruction for the model to categorize using the novel taxonomy and policy.
 5. The safety taxonomy and policy used to train the models contain 13 critically unsafe risk categories, a safe category and a "needs caution" category.
-6. An internally annotated dataset called Aegis-AI-Content-Safety-Dataset-1.0 of approximately 11,000 prompts and responses is used to instruction tune the model. Annotations are at dialog level, not per turn.
+6. An internally annotated dataset called [Nemotron Content Safety Dataset V1](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0) of approximately 11,000 prompts and responses is used to instruction tune the model. Annotations are at dialog level, not per turn.
 We have since collected 30,000 annotations in total on a further expanded taxonomy, and future versions of the models will be trained on the full set. The annotations are at dialog level instead of per-turn level.
 7. The model is instruction tuned with a safety instruction, with the LLM behaving as a classifier in this setting.
 PLEASE NOTE: The model has only been trained to perform prompt classification since the annotations were not available at turn level. If you wish to use the model for response classification, use the template as provided below.
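The moderation flow in the Model Description hunk above (system instruction with the safety taxonomy, the conversation to moderate, then a safe/unsafe verdict with an optional category ID on the next line) can be exercised with a short script. The sketch below is illustrative only: the adapter repo id is a placeholder, the policy text is abbreviated, and it assumes the checkpoint is distributed as a PEFT adapter on top of meta-llama/LlamaGuard-7b.

```python
# Minimal sketch of prompt moderation with the defensive safety model.
# Assumptions: the checkpoint is a PEFT adapter over meta-llama/LlamaGuard-7b;
# the ADAPTER repo id below is a placeholder, and the policy is abbreviated --
# the full taxonomy and template live in the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/LlamaGuard-7b"
ADAPTER = "nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base_model, ADAPTER)

user_prompt = "How do I pick a lock?"
# System instruction with the safety taxonomy/policy (abbreviated here), followed
# by the conversation to moderate, as described in the Model Description.
moderation_prompt = f"""[INST] Task: Check if there is unsafe content in 'User' messages in the conversation below according to our safety policy with the following categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
O1: Violence. ...
O3: Criminal Planning/Confessions. ...
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>
User: {user_prompt}
<END CONVERSATION>

Provide your safety assessment for 'User' in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]"""

inputs = tokenizer(moderation_prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id)
verdict = tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(verdict)  # e.g. "unsafe\nO3" -- first line is safe/unsafe, second line the category ID
```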
@@ -169,7 +169,7 @@ Ethical use: Technology can have a profound impact on people and the world, and

 ### Direct Use

-- The
+- The Llama-2-Nemotron-Safety-Guard-Defensive-7B-v1 model is for users who want to safeguard or evaluate a general-purpose LLM's generated content

 Model and dataset restrictions:

@@ -376,7 +376,7 @@ The inference code for this model is available through the NeMo Curator GitHub r
 ## Training Details

 ### Training Data
-The model has been trained on Nvidia's [
+The model has been trained on Nvidia's [Nemotron Content Safety Dataset V1](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0)

 * Human prompts from the Anthropic RLHF harmless dataset [Anthropic RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf)
 * LLM responses generated from Mistral-7B-v0.1 [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
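As a rough illustration of the data composition listed in the Training Data hunk above (human prompts drawn from the Anthropic RLHF harmless dataset, responses generated with Mistral-7B-v0.1), the sketch below pairs hh-rlhf prompts with freshly generated Mistral responses. It is not the authors' actual pipeline; the turn parsing and sampling parameters are assumptions.

```python
# Illustrative sketch only: pair hh-rlhf human prompts with Mistral-7B-v0.1 responses,
# as in the training data description above. Field parsing and sampling are assumptions.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

hh = load_dataset("Anthropic/hh-rlhf", split="train")

def first_human_turn(dialog: str) -> str:
    # hh-rlhf stores full dialogs as "\n\nHuman: ...\n\nAssistant: ..."; keep the first user turn.
    return dialog.split("\n\nHuman:")[1].split("\n\nAssistant:")[0].strip()

prompts = [first_human_turn(row["chosen"]) for row in hh.select(range(8))]

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
lm = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16, device_map="auto")

pairs = []
for p in prompts:
    ids = tok(p, return_tensors="pt").to(lm.device)
    out = lm.generate(**ids, max_new_tokens=128, do_sample=True, top_p=0.9, pad_token_id=tok.eos_token_id)
    response = tok.decode(out[0][ids["input_ids"].shape[-1]:], skip_special_tokens=True)
    pairs.append({"prompt": p, "response": response})  # pairs are then human-annotated for safety
```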
@@ -409,7 +409,7 @@ We use the [PEFT](https://huggingface.co/docs/peft/en/index) library from Huggin

 The model has been evaluated on the following benchmarks:

-* Test partition of Nvidia's content safety dataset [
+* Test partition of Nvidia's content safety dataset [Nemotron Content Safety Dataset V1](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0)
 * [Toxic Chat Dataset](https://huggingface.co/datasets/lmsys/toxic-chat)
 * [Open AI Moderation Dataset](https://huggingface.co/datasets/mmathys/openai-moderation-api-evaluation/tree/main)
 * [SimpleSafetyTests Benchmark](https://arxiv.org/html/2311.08370v2)
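The hunk header above references the PEFT library used for parameter-efficient instruction tuning. A minimal sketch of what such a setup could look like follows; the LoRA rank, scaling, and target modules are illustrative assumptions, not the reported training configuration.

```python
# Sketch of parameter-efficient tuning with the PEFT library, as referenced above.
# LoRA hyperparameters and target modules here are assumptions, not the authors' config.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/LlamaGuard-7b", torch_dtype=torch.bfloat16, device_map="auto"
)

lora_cfg = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices (assumed)
    lora_alpha=32,                         # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections commonly adapted in Llama models
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are updated during instruction tuning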
@@ -417,7 +417,7 @@ The model has been evaluated on the following benchmarks:
 #### Metrics
 We report F1 and AUPRC scores for the model on the evaluation benchmarks.

-### Results on
+### Results on the Nemotron Content Safety V1 Test Set
 Model | AUPRC | F1 |
 ------------ |:-----------: |-----------: |
 Llama Guard Base |0.930 |0.62 |
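The F1 and AUPRC numbers in the results table can be computed from per-example predictions along the lines below. The labels and scores are dummy values, and deriving an "unsafe" probability from token log-probabilities is an assumption rather than the documented evaluation protocol.

```python
# Sketch of the reported metrics: binary F1 on hard safe/unsafe predictions and
# AUPRC on a score for the "unsafe" class. All values below are dummy data.
from sklearn.metrics import f1_score, average_precision_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = unsafe, 0 = safe (ground-truth labels)
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]   # hard predictions parsed from the model output
unsafe_scores = [0.92, 0.08, 0.85, 0.40, 0.22, 0.05, 0.77, 0.61]  # e.g. P("unsafe") from token logprobs (assumed)

print("F1   :", f1_score(y_true, y_pred))
print("AUPRC:", average_precision_score(y_true, unsafe_scores))
```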