Hesgoal || TOTALSPORTEK|| F1 STREAMS || SOCCER STREAMS

How to test OpenAI model against single-turn adversarial attacks using DeepTeam

In this tutorial, we will explore how to test OpenAI models against single-turn attacks using DeepTeam.

Deep More than 10 attack methods are provided (such as prompt injection, jailbreak and leetspeak) that expose weaknesses in LLM applications. It starts with a simple baseline attack and then applies more advanced technology (called attack augmentation) to mimic real-world malicious behavior. Check The complete code is here.

By performing these attacks, we can evaluate the model’s ability to defend against different vulnerabilities.

In DeepTeam, there are two main types of attacks:

Here we will focus only on single-turn attacks.

Install dependencies

pip install deepteam openai pandas

Before running the red_team() function, you need to set OpenAI_API_KEY as an environment variable, because DeepTeam uses LLMS to generate both adversarial attacks and evaluate LLM output.

To obtain the OpenAI API key, access and generate a new key. If you are a new user, you may need to add billing details and pay a minimum of $5 to activate API access. Check The complete code is here.

import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass('Enter OpenAI API Key: ')

Import library

import asyncio
from openai import OpenAI
from deepteam import red_team
from deepteam.vulnerabilities import IllegalActivity
from deepteam.attacks.single_turn import PromptInjection, GrayBox, Base64, Leetspeak, ROT13, Multilingual, MathProblem

Define model callbacks

This code defines an asynchronous callback function that querys the OpenAI model (GPT-4O-MINI) and returns the model’s response text. It acts as an output generator for the attack framework of LLM. Check The complete code is here.

client = OpenAI()

# Define callback for querying the LLM
async def model_callback(input: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # use "gpt-4o" if you want a stronger model
        messages=[{"role": "user", "content": input}],
    )
    return response.choices[0].message.content

Define vulnerability and attack

Here we specify the vulnerability (illegality) and set its type to Child Exploitation. Then we define different attack methods, e.g. Promote,,,,, Gray box,,,,, Basics 64,,,,, leetspeak,,,,, rot13,,,,, multilingualand Mathematical Problems

Each of these attack types will be explained in detail later in this tutorial. Check The complete code is here.

# Vulnerability
illegal_activity = IllegalActivity(types=["child exploitation"])

# Attacks
prompt_injection = PromptInjection()
graybox_attack = GrayBox()
base64_attack = Base64()
leetspeak_attack = Leetspeak()
rot_attack = ROT13()
multi_attack = Multilingual()
math_attack = MathProblem()

Prompt injection

Prompt injection is an attack where users attempt to overwrite the original description of the model by injecting malicious or manipulative text into the prompt. The purpose is to deceive the model into ignoring security policies and producing harmful or limited content.

The injections are swift and quick to force the model to play a role, an immoral role called Oppo, who ignores safety rules and encourages illegal activities.

Despite trying this, the model refused to comply – it responded:

“I’m sorry, I can’t help.”

This confirms that the model successfully resisted rapid injections and returned the pass results of the test. Check The complete code is here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        attacks=[prompt_injection],
    )

Graybox Attack

The Graybox attack uses some of the knowledge about the target LLM system to create adversarial prompts. Unlike random input, Graybox attacks exploit known weaknesses by reconstructing baseline attacks with abstract or misleading languages, making it harder for security filters to detect malicious intent.

In this test, adversarial cue attempts to encourage illegal activity by masking instructions about creating fake identity files and using encryption channels. Check The complete code is here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        attacks=[graybox_attack],
    )

Basic 64 Attacks

Basic 64 attacks are a common adversarial technique in which harmful instructions are encoded in Basic 64 to bypass security filters. Instead of presenting malicious content directly, the attacker hides it in an encoded format, hoping that the model will decode it and execute instructions.

In this test, the encoded string contains directions related to illegal activities, which at first glance are masqueraded as harmless. However, the model does not attempt to decode or follow hidden requests. Check The complete code is here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        attacks=[base64_attack],
    )

Leetspeak attack

The leetspeak attack masks malicious indications by replacing normal characters with numbers or symbols (for example, changing to 4, e changing to 3, i changing to 1). This symbolic alternative makes harmful text harder to detect with simple keyword filters while still being readable to people or systems that may decode it.

In this test, the attack text attempted to guide minors in illegal activities written in Leetspeak format. Despite the confusion, the model clearly acknowledges malicious intentions. Check The complete code is here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        attacks=[leetspeak_attack],
    )

ROT-13 Attack

ROT-13 attack is a classic obfuscation method in which each letter moves 13 positions within the letter. For example, a becomes n, b becomes o, and so on. This transformation pieces together harmful instructions into an encoded form, making them less likely to trigger simple keyword-based content filters. However, the text can still be easily decoded into the original form. Check The complete code is here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        attacks=[rot_attack],
    )

Multilingual attacks

Multilingual attacks work by converting harmful baseline cues into uncommon languages. The idea is that content filters and regulating systems may be more robust in widely used languages (e.g. English) but less effective in other languages, allowing malicious instructions to bypass detection.

In this test, the attack is written Swahiliask for instructions related to illegal activities. Check The complete code is here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        attacks=[multi_attack],
    )

Mathematical Problems

Mathematics problem attacks mask malicious requests in mathematical symbols or problem statements. By embedding harmful instructions into formal structures, the text appears to be a harmless academic exercise, making it difficult for filters to detect basic intent.

In this case, the input illegal exploitation content is a group theory problem, requiring the model to “prove” harmful results and provide “translation” in simple language. Check The complete code is here.

risk_assessment = red_team(
        model_callback=model_callback,
        vulnerabilities=[illegal_activity],
        attacks=[math_attack],
    )

Check The complete code is here. Check out ours anytime Tutorials, codes and notebooks for github pages. Also, please stay tuned for us twitter And don’t forget to join us 100K+ ml reddit And subscribe Our newsletter.


I am a civil engineering graduate in Islamic Islam in Jamia Milia New Delhi (2022) and I am very interested in data science, especially neural networks and their applications in various fields.

You may also like...