OpenAI released gpt-oss-safeguard-120b and gpt-oss-safeguard-20b, open-weight reasoning models that apply developer-provided safety policies at inference time to classify text and return an auditable chain of thought. 
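
In practice, the policy travels with the request rather than being baked into the weights. The sketch below, which assumes the 20B model is served behind an OpenAI-compatible endpoint (for example via vLLM), shows the pattern; the endpoint URL, example policy, and expected output format are illustrative rather than taken from OpenAI's documentation:

```python
# Minimal sketch of policy-as-prompt classification, assuming the model is
# served behind an OpenAI-compatible endpoint (e.g., via vLLM). The URL,
# policy text, and output format below are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# The developer-provided policy is sent with each request, so it can be
# revised without retraining or relabeling.
POLICY = """\
Policy: No trading of endangered-species products.
VIOLATES (1): offers to buy or sell such products, or solicits sources.
SAFE (0): news reporting, conservation discussion, fiction.
Return a JSON object: {"violation": 0 or 1, "rationale": "..."}
"""

resp = client.chat.completions.create(
    model="openai/gpt-oss-safeguard-20b",
    messages=[
        {"role": "system", "content": POLICY},
        {"role": "user", "content": "DM me if you want real ivory carvings, fresh stock."},
    ],
)
print(resp.choices[0].message.content)  # classification plus the model's reasoning
```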

The models are fine-tuned from gpt-oss, licensed under Apache 2.0, and available on Hugging Face. The release builds on OpenAI's internal Safety Reasoner and was developed with early feedback from SafetyKit, ROOST, Tomoro, and Discord.

OpenAI’s technical report says the models outperform gpt-5-thinking and gpt-oss baselines on an internal multi-policy accuracy evaluation, while noting limits: dedicated classifiers trained on large labeled datasets can still do better on some complex risks, and the models’ compute and latency costs can constrain always-on use.

The 20B model “fits into GPUs with 16GB VRAM,” according to its model card. To manage compute and latency, OpenAI recommends pairing lightweight filters with selective calls to the reasoning model.
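
One way to implement that pairing is a simple cascade: a cheap, high-recall filter screens all traffic, and only flagged items reach the reasoning model. The sketch below reuses the `client` and `POLICY` from the earlier example; the keyword filter is a stand-in for whatever lightweight classifier a team already runs, not OpenAI's pipeline:

```python
# A sketch of the recommended cascade, reusing `client` and `POLICY` from the
# example above. The keyword list and filter structure are illustrative
# placeholders, not part of OpenAI's tooling.
SUSPICIOUS_TERMS = {"ivory", "rhino horn", "tiger bone"}  # illustrative only

def cheap_filter(text: str) -> bool:
    """Fast, high-recall first pass; false positives are acceptable, misses are not."""
    lowered = text.lower()
    return any(term in lowered for term in SUSPICIOUS_TERMS)

def moderate(text: str) -> str:
    """Escalate to the reasoning model only when the cheap filter flags the text."""
    if not cheap_filter(text):
        return '{"violation": 0, "rationale": "passed lightweight filter"}'
    # Selective escalation: the reasoning model's latency is paid only here.
    resp = client.chat.completions.create(
        model="openai/gpt-oss-safeguard-20b",
        messages=[
            {"role": "system", "content": POLICY},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content

print(moderate("Does anyone know where rhino horn is sold?"))
```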

OpenAI says the models produce an auditable chain of thought and are designed to slot into “defense-in-depth” safety stacks, letting teams apply their own policies and iterate on them quickly in production. The company notes that in some of its own launches it has devoted up to 16% of total compute to safety reasoning.