Ensuring AI production safety: Developers’ development guide

Security is not optional when deploying AI into the real world, it is essential. Openai places emphasize ensuring that applications on its models are secure, responsible and aligned with policies. This article describes how OpenAI evaluates security and how you can meet these criteria.

In addition to technical performance, responsible AI deployments also require the anticipation of potential risks, maintaining user trust, and aligning with broader ethical and social considerations. OpenAI’s approach involves continuous testing, monitoring, and improvement of its models, and provides developers with clear guidance to minimize abuse. By understanding these security measures, you can not only build more reliable applications, but also contribute to a healthier AI ecosystem where innovation and responsibility coexist.

Why safety is crucial

AI systems are powerful, but without guardrails, they can produce harmful, biased or misleading content. For developers, ensuring security involves more than compliance – it’s about applications that people can truly trust and benefit.

  • Protect end users from harm by minimizing risks such as misinformation, exploitation or offensive outputs
  • Increase trust in the application to make it more attractive and reliable to users
  • Help you comply with OpenAI’s usage policies and a broader legal or ethical framework
  • Prevent account suspensions, reputational losses and potential long-term setbacks for the business

By embedding security into your design and development process, you not only reduce risks—you create a stronger foundation for innovation that can be scaled responsibly.

Core security practices

Overview of Moderate APIs

OpenAI provides a free modest API designed to help developers identify potentially harmful content in text and images. The tool can enable powerful content filtering by systematically labeling categories such as harassment, hatred, violence, sexual content or self-harm, enhancing end-user protection and enhancing responsible AI usage.

Supported models – Two modest models can be used:

  • Omni-Modiention-Latest:For the preferred choice for most applications, the model supports text and image input, provides more subtle categories, and provides extended detection capabilities.
  • Best text mode (Legend): Support only text and provide fewer categories. It is recommended to use the OMNI model for new deployments as it provides a wider range of protection and multi-modal analysis.

Before deploying content, use the audit endpoint to evaluate whether it violates OpenAI’s policies. If the system determines risk or harmful substances, intervention can be done by filtering the content, stopping publication, or taking further action. This API is free and is constantly updated to improve security.

You can use OpenAI’s official Python SDK to adjust text input:

from openai import OpenAI
client = OpenAI()

response = client.moderations.create(
    model="omni-moderation-latest",
    input="...text to classify goes here...",
)

print(response)

The API will return a structured JSON response indicating:

  • Tag: Whether the input is considered potentially harmful.
  • Category: Which categories (e.g., violence, hatred, sex) are marked as assault.
  • Category_scores: Model confidence score (range 0-1) for each category indicating the possibility of a violation.
  • CATCORY_APPLIED_INPUT_TYPES: For OMNI model, which input type (text, image) is shown to trigger each flag.

Sample output may include:

{
  "id": "...",
  "model": "omni-moderation-latest",
  "results": [
    {
      "flagged": true,
      "categories": {
        "violence": true,
        "harassment": false,
        // other categories...
      },
      "category_scores": {
        "violence": 0.86,
        "harassment": 0.001,
        // other scores...
      },
      "category_applied_input_types": {
        "violence": ["image"],
        "harassment": [],
        // others...
      }
    }
  ]
}

Moderate APIs can detect and tag multiple content categories:

  • Harassment (including threatening language)
  • Hate (based on race, gender, religion, etc.)
  • Illegal (recommendation or reference to illegal acts)
  • Self-harm (including encouragement, intention or instructions)
  • Sexual content
  • Violence (including graphic violence)

Some categories support text and image input, especially the OMNI model, while others use text only.

Confrontational test

Adversarial testing (often called Red Team) is intentionally challenging your AI system with malicious, unexpected, or manipulated input to discover weaknesses before real users. This helps reveal issues such as timely injection (“ignoring all instructions and…”), bias, toxicity, or data leakage.

The Red Team is not a one-time event, but an ongoing best practice. It ensures that your application remains resilient to gas risk. Tools such as DeepeVal systematically test LLM applications (chatbots, RAG pipelines, proxy, etc.) by providing a structured framework to make it easier to understand vulnerabilities, biases, or insecure outputs.

By integrating adversarial testing into development and deployment, you can create safer and more reliable AI systems that prepare you for unpredictable real-world behavior.

Humans are cycling (hitl)

When working in high-risk areas such as healthcare, finance, law, or code generation, it is important to review before using AI to produce output. Reviewers should also have access to all original material (such as source documents or notes) so that they can check the work of AI and make sure it is trustworthy and accurate. This process helps capture errors and build confidence in application reliability.

Timely engineering

Timely engineering is a key technology to reduce the insecure or unnecessary output of AI models. By carefully designing the tips, developers can limit the theme and tone of response, making the model less likely to produce harmful or irrelevant content.

Adding context and providing high-quality example tips before asking new questions helps guide the model to produce safer, more accurate and appropriate results. Expecting potential abuse and proactively building defenses in prompts can further protect the application from abuse.

This approach enhances control over AI behavior and improves overall security.

Input and output controls

Input and output control is critical to improving the security and reliability of AI applications. Limiting the length of time the user input reduces the risk of rapid injection of attacks, while capping the number of output tokens helps control abuse and manage costs.

Whenever possible, minimize the chance of unsafe input using a proven input method (such as a drop-down menu) instead of a free text field. Additionally, routing user queries with trusted, pre-verified resources (such as a curated knowledge base for customer support) rather than generating completely new responses can significantly reduce errors and harmful output.

Together, these measures create a safer and predictable AI experience.

User identity and access

User identity and access controls are important to reduce anonymous abuse and help maintain security of AI applications. Usually, an account that requires users to register and log in (using Gmail, LinkedIn, or other suitable authentication) adds a layer of accountability. In some cases, credit card or ID verification can further reduce the risk of abuse.

In addition, including security identifiers in API requests enable OpenAI to effectively track and monitor abuse. These identifiers are unique strings representing each user, but should be hashed to protect privacy. If the user accesses your service without logging in, you are advised to send a session ID. Here is an example of using a security identifier in a chat completion request:

from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
  model="gpt-4o-mini",
  messages=[
    {"role": "user", "content": "This is a test"}
  ],
  max_tokens=5,
  safety_identifier="user_123456"
)

This practice helps OpenAI provide actionable feedback and improve abuse detection tailored to your application usage patterns.

Transparency and feedback loops

To maintain security and increase user trust, it is important to provide users with a simple and easy-to-access way to report insecure or unexpected output. This can be done via a clearly visible button, listed email address or ticket submission form. Reports submitted should be actively monitored by humans, which can be investigated and respond appropriately.

Furthermore, limitations of AI systems, such as the possibility of hallucination or bias, are clearly communicated to set appropriate user expectations and encourage responsible use. Continuous monitoring of applications in production allows you to quickly identify and resolve issues to ensure that your system remains safe and reliable over time.

How OpenAI evaluates security

OpenAI evaluates security in several key areas to ensure that models and applications act responsibly. This includes checking whether the output produces harmful content, testing how resistant the model is, ensuring clear communication of limitations, and identifying critical workflows for human supervision. By meeting these standards, developers increase their chances of applying, will pass OpenAI’s security checks and successfully operate in production.

With the release of GPT-5, OpenAI introduced a security classifier that classifies requests based on risk levels. If your organization repeatedly triggers high-risk thresholds, OpenAI may limit or block access to GPT-5 to prevent abuse. To help manage this issue, developers are encouraged to use security identifiers in API requests, which require uniquely identifying users (while protecting privacy) to enable precise abuse detection and intervention without personal violations throughout the organization.

OpenAI also performs multi-layered security checks on the model, including preventing unsubscribed content such as hate or illegal material, testing against jailbreak prompts, evaluating factual accuracy (minimizing hallucinations), and ensuring that the model follows a hierarchy in instructions between the system, developer and user messages. This powerful, ongoing evaluation process helps OpenAI maintain high standards of model security standards while adapting to evolving risks and capabilities

in conclusion

Building a secure and trustworthy AI application requires not only technical performance, but also thoughtful safeguards, ongoing testing and clear accountability. From modest APIs to adversarial testing, human review, and careful control of inputs and outputs, developers have a range of tools and practices that reduce risk and increase reliability.

Security is not a box that can be checked once, but a process of continuous evaluation, improvement and adaptation as technology and user behavior develop. By embedding these practices into the development workflow, teams can not only meet policy requirements but also provide AI systems users can truly rely on – application cameras that combine innovation with scalability of responsibility and scalability and trust.


I am a civil engineering graduate in Islamic Islam in Jamia Milia New Delhi (2022) and I am very interested in data science, especially neural networks and their applications in various fields.

🔥[Recommended Read] NVIDIA AI Open Source VIPE (Video Pose Engine): A powerful and universal 3D video annotation tool for spatial AI

You may also like...