As large language models (LLMs) continue to advance in capability, ensuring their safety against jailbreak attacks remains a critical challenge. In this paper, we introduce a novel safety alignment approach called “Answer-Then-Check”, which strengthens LLM robustness against malicious prompts by using the model’s thinking ability to reason about potential jailbreaks before producing a final answer for the user. Our method lets the model directly answer the question in its “thought” and then critically evaluate the safety of that answer before deciding whether to provide it. To implement this approach, we construct the Reasoned Safety Alignment (ReSA) dataset, comprising 80K examples that teach models to reason through a direct response and then analyze its safety. Experimental results demonstrate that our approach reaches the Pareto frontier, achieving superior safety capability while lowering over-refusal rates on over-refusal benchmarks. Notably, the model fine-tuned on ReSA maintains general reasoning capabilities on benchmarks such as MMLU, MATH500, and HumanEval. Moreover, our method equips models with the ability to perform “safe completion”: unlike post-hoc methods that can only reject harmful queries, our model can provide helpful and safe alternative responses for sensitive topics (e.g., self-harm). Furthermore, we find that training on a small subset of just 500 examples achieves performance comparable to using the full dataset, suggesting that safety alignment may require less data than previously assumed.
ReSA is based on a key insight into the nature of jailbreak attacks: malicious intent can be heavily obfuscated within a query, making it difficult for even a powerful reasoning model to detect. However, when the model attempts to generate a response, the harmful intent is often revealed and becomes much easier to identify.
Specifically, we adopt an “Answer-Then-Check” strategy in which the model first plans its answer in the chain of thought (CoT) by generating a summary of that answer, then checks whether the planned answer is safe before generating the actual response.
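The sketch below illustrates one way this inference-time flow could be wired up, assuming a structured thought format with hypothetical `<summary>` and `<safety_check>` tags; these tag names and the parsing logic are illustrative assumptions, not the actual ReSA output format.

```python
# Minimal sketch of an Answer-Then-Check decision step (assumed tag format).
import re

SAFE_ALTERNATIVE = "I can't help with that, but here is a safer alternative: ..."

def answer_then_check(thought: str, final_answer: str) -> str:
    """Return the drafted answer only if the model's own safety check passed."""
    # The model first drafts a summary of its intended answer inside the thought...
    summary = re.search(r"<summary>(.*?)</summary>", thought, re.S)
    # ...then explicitly judges whether that drafted answer is safe to provide.
    verdict = re.search(r"<safety_check>(.*?)</safety_check>", thought, re.S)
    if verdict and "unsafe" in verdict.group(1).lower():
        # Safe completion: withhold the harmful content but still respond helpfully.
        return SAFE_ALTERNATIVE
    return final_answer

if __name__ == "__main__":
    demo_thought = (
        "<summary>Step-by-step instructions for the requested exploit.</summary>"
        "<safety_check>unsafe: facilitates a cyberattack</safety_check>"
    )
    print(answer_then_check(demo_thought, "Here are the steps: ..."))
```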
To achieve this, we construct a dataset named Reasoned Safety Alignment (ReSA), consisting of 80K examples in the “Answer-Then-Check” style.
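For illustration, a single training example in this style might be structured as in the sketch below; the field names (`prompt`, `thought`, `response`) are assumptions made for exposition, not the released ReSA schema.

```python
# Hypothetical shape of one Answer-Then-Check supervised fine-tuning example.
import json

example = {
    "prompt": "Obfuscated jailbreak query...",
    "thought": (
        "Draft answer summary: <what a direct response would contain>. "
        "Safety analysis: the drafted response would facilitate harm, so it is unsafe."
    ),
    "response": "A refusal or a safe, helpful alternative answer.",
}

# Examples like this could be stored as JSON Lines for fine-tuning.
print(json.dumps(example, indent=2))
```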
Achieves the Pareto frontier with superior safety capability while decreasing over-refusal rates.
Unlike post-hoc methods that can only reject harmful queries, our model can provide helpful and safe alternative responses for sensitive topics (e.g., self-harm).
@article{cao2025resa,
  title   = {Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check},
  author  = {Cao, Chentao and Xu, Xiaojun and Han, Bo and Li, Hang},
  journal = {arXiv preprint arXiv:2509.11629},
  year    = {2025}
}