---
license: other
license_name: research-use-only
license_link: LICENSE
language:
- en
base_model:
- Qwen/Qwen2.5-14B
tags:
- LLM safety
- jailbreaking
- adversarial prompts
- research-only
- Attacker-v0.1
- Qwen2.5-14B
model-index:
- name: Attacker-v0.1
datasets:
- walledai/HarmBench
extra_gated_prompt: "You agree not to use the model to conduct experiments that cause harm to human subjects."
extra_gated_fields:
  Affiliation: text
  Professional Email Address: text
  Country: country
  Specific date: date_picker
  I want to use this model for:
    type: select
    options:
      - Research
      - label: Other
        value: other
  I agree to use this model for research use ONLY: checkbox
---

# Attacker-v0.1

## Model Description

**Attacker-v0.1** is a specialized model, built on Qwen/Qwen2.5-14B, designed to generate adversarial prompts capable of bypassing the safety mechanisms of various Large Language Models (LLMs). Its primary objective is to help researchers identify and understand vulnerabilities in LLMs, thereby contributing to the development of more robust and secure AI systems.

## Intended Uses & Limitations

**Intended Uses:**

- **Research and Development:** To study and analyze the vulnerabilities of LLMs by generating prompts that can potentially bypass their safety constraints.
- **Security Testing:** To evaluate the effectiveness of the safety mechanisms implemented in LLMs and to help improve their robustness (a minimal evaluation sketch follows below).
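
Such an evaluation typically starts from a fixed set of harmful-behavior descriptions. The sketch below loads them from the HarmBench dataset listed in this card's metadata; the subset name, split, and column name are assumptions about how walledai/HarmBench is organized on the Hub, so adjust them to the actual layout.

```python
# Hedged sketch: load harmful-behavior descriptions to evaluate a target LLM against.
# The "standard" subset, "train" split, and "prompt" column are assumptions about
# the walledai/HarmBench dataset layout.
from datasets import load_dataset

behaviors = load_dataset("walledai/HarmBench", "standard", split="train")

for example in behaviors.select(range(5)):  # small sample for illustration
    print(example["prompt"])
```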

**Limitations:**

- **Ethical Considerations:** The use of Attacker-v0.1 should be confined to ethical research purposes. It is not intended for malicious activities or to cause harm.
- **Controlled Access:** Because of its potential for misuse, access to this model is restricted. Interested parties must contact the author for usage permissions beyond academic research (see the access sketch below).
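
Assuming the checkpoint is distributed as a gated repository on the Hugging Face Hub, downloading it requires an authenticated account that has already been granted access through the gating form on the model page. The repository id below is a placeholder.

```python
# Hedged sketch: authenticate and download a gated checkpoint from the Hugging Face Hub.
# Assumes access has already been granted via the gating form on the model page.
from huggingface_hub import login, snapshot_download

login(token="hf_xxx")  # or run `huggingface-cli login` once in your shell

local_dir = snapshot_download("leileqiTHU/Attacker-v0.1")  # placeholder repo id
print("Checkpoint downloaded to:", local_dir)
```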

## Usage

For detailed usage instructions, please refer to the GitHub repository: https://github.com/leileqiTHU/Attacker

**Please star the repo if you find the model or the code helpful; it means a lot to me. Thanks!**
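
For quick experimentation, the model can be loaded like any other causal language model with Hugging Face Transformers. The snippet below is only a minimal sketch: the repository id, the plain-text prompt format, and the generation settings are assumptions, so follow the GitHub repository above for the exact interface.

```python
# Minimal sketch: load Attacker-v0.1 and generate one candidate adversarial prompt
# for a behavior description. The repo id and prompt format are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "leileqiTHU/Attacker-v0.1"  # placeholder Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

behavior = "Short description of the behavior to red-team the target model on."
inputs = tokenizer(behavior, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens (the candidate adversarial prompt).
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```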

## Risks and Mitigations

**Risks:**

- **Misuse Potential:** The model could be used for unethical purposes, such as generating harmful content or maliciously exploiting AI systems.

**Mitigation Strategies:**

- **Access Control:** Implement strict access controls to ensure the model is used solely for legitimate research purposes.

---

## Researchers Using Attacker-v0.1

Attacker-v0.1 has been explored and used by researchers from the following institutions:

- **Ant Group** – [Website](https://www.antgroup.com/)
- **NCSOFT** – [Website](https://us.ncsoft.com/en-us)
- **Northeastern University (USA)** – [Website](https://www.northeastern.edu/)
- **ShanghaiTech University** – [Website](https://www.shanghaitech.edu.cn/)
- **The Chinese University of Hong Kong** – [Website](https://www.cuhk.edu.hk/chinese/index.html)
- **University of Electronic Science and Technology of China** – [Website](https://en.uestc.edu.cn/)
- **Virtue AI** – [Website](https://www.virtueai.com/)
- **Zhejiang University** – [Website](https://www.zju.edu.cn/)

We are grateful to these research teams for their contributions and valuable insights into advancing the safety and robustness of Large Language Models.

---

## License

- License: Research Use Only
- Usage Restrictions: This model is intended for research purposes only. For commercial use or other inquiries, please contact the model author.

## Model Card Authors

- **Author:** Leqi Lei
- **Affiliation:** The CoAI Group, Tsinghua University
- **Contact Information:** leilq23@mails.tsinghua.edu.cn

## Citation

If you use this model in your research or work, please cite it as follows:

```bibtex
@misc{leqi2025Attacker,
  author       = {Leqi Lei},
  title        = {The Safety Mirage: Reinforcement-Learned Jailbreaks Bypass Even the Most Aligned LLMs},
  year         = {2025},
  howpublished = {\url{https://github.com/leileqiTHU/Attacker}},
  note         = {Accessed: 2025-04-13}
}
```