As large language models (LLMs) continue to advance in capability, ensuring their safety against jailbreak attacks remains a critical challenge. In this paper, we introduce "Answer-Then-Check", a novel safety alignment approach that strengthens LLM robustness against malicious prompts by having the model reason about the safety of its answer before producing a final response. Our method lets models draft an answer directly in their "thoughts" and then critically evaluate its safety before deciding whether to provide it. To implement this approach, we construct the Reasoned Safety Alignment (ReSA) dataset, comprising 80K samples that teach models to reason through a direct response and then analyze its safety. Experimental results demonstrate that our approach reaches the Pareto frontier, pairing superior safety with lower over-refusal rates. Notably, the fine-tuned models maintain general reasoning capabilities on benchmarks such as MMLU, MATH500, and HumanEval. Moreover, our method equips models to perform "safe completion", whereas post-hoc detection methods can only reject sensitive harmful queries (e.g., self-harm) outright. Our results show that inference-time strategies alone are insufficient, highlighting the necessity of safety training, and we find that even 500 samples can yield performance comparable to the entire dataset, suggesting a promising path toward data-efficient safety alignment. The dataset is publicly available at: https://huggingface.co/datasets/ByteDance-Seed/ReSA.
ReSA is based on a key insight into the nature of jailbreak attacks: malicious intent can be heavily obfuscated within a query, making it difficult for even a powerful reasoning model to detect. However, when the model attempts to generate a response, the harmful intent is often revealed and becomes much easier to identify.
We adopt an "Answer-Then-Check" strategy in which the model first plans its answer in the Chain-of-Thought by generating a summary of the answer, then checks whether that plan is safe before generating the actual response. This approach leverages the model's reasoning capabilities to perform safety evaluation as an integral part of the generation process, rather than as a separate post-hoc step.
Query: {Input Query}
Response: <safety_check> I'll create an answer planning idea and verify its safety. The planning idea is as follows:
<intended_answer_summary>
{Summarization of Answer}
</intended_answer_summary>
But wait, let me check if the planning idea is safe.
{Safety Analysis}
</safety_check>
IF SAFE: {Final Answer}
ELSE: Sorry, I cannot answer as it violates my output policy.
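The structured response above can be split into its components mechanically. The sketch below is our illustration of how such parsing could work, not code from the paper; the tag names follow the template, while the function name and refusal handling are our assumptions:

```python
import re


def parse_answer_then_check(response: str):
    """Split a ReSA-style response into (summary, safety_analysis, final_answer).

    If no <safety_check> block is present (e.g., the adaptive variant skipped
    it for a clearly benign query), the whole response is the final answer.
    """
    check = re.search(r"<safety_check>(.*?)</safety_check>", response, re.DOTALL)
    if check is None:
        return None, None, response.strip()
    body = check.group(1)
    summary = re.search(
        r"<intended_answer_summary>(.*?)</intended_answer_summary>", body, re.DOTALL
    )
    summary_text = summary.group(1).strip() if summary else ""
    # The safety analysis is whatever follows the summary inside the check block.
    analysis = body.split("</intended_answer_summary>")[-1].strip()
    final = response[check.end():].strip()
    return summary_text, analysis, final


resp = (
    "<safety_check> I'll create an answer planning idea and verify its safety. "
    "The planning idea is as follows:\n<intended_answer_summary>\nBake a cake.\n"
    "</intended_answer_summary>\nBut wait, let me check if the planning idea is safe.\n"
    "This is benign.\n</safety_check>\nHere is a full recipe..."
)
summary, analysis, final = parse_answer_then_check(resp)
```

Keeping the check inside explicit tags means the visible answer can be stripped of the reasoning trace before being shown to the user.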
We also introduce Adaptive Answer-Then-Check, which dynamically bypasses the safety-check process for clearly benign queries, eliminating additional cost and achieving computational parity with the base model on normal inputs.
While ReSA-SFT teaches the Answer-Then-Check pattern via supervised fine-tuning, the intended answer summary it generates may still contain unsafe content. To address this, we introduce ReSA-RL, which applies GRPO-based reinforcement learning on ReSA prompts to further refine the safety reasoning process, using three reward signals to guide training.
ReSA-RL dramatically improves the safety of the intended answer summary (0.99+ DSR vs. 0.79 for ReSA-SFT), achieving near-perfect defense across all jailbreak attacks while maintaining low over-refusal rates.
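The excerpt does not enumerate the three reward components, so the reward function below is purely illustrative: the component names (format validity, summary safety, final-response safety) and the weights are our assumptions. The `grpo_advantages` helper, by contrast, shows the group-normalized advantage computation GRPO is known for: rewards of several rollouts for the same prompt are standardized within the group, with no learned value function.

```python
from statistics import mean, pstdev


def illustrative_reward(has_valid_format: bool, summary_is_safe: bool,
                        final_is_safe: bool) -> float:
    """Hypothetical scalar reward combining three signals; the actual
    components and weights used by ReSA-RL may differ."""
    reward = 0.0
    reward += 0.2 if has_valid_format else -1.0  # well-formed Answer-Then-Check tags
    reward += 0.4 if summary_is_safe else -1.0   # intended answer summary is safe
    reward += 0.4 if final_is_safe else -1.0     # final response is safe (or refuses)
    return reward


def grpo_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]
```

A rollout whose summary leaks unsafe content gets a sharply lower reward than its group peers, which is what pushes the summary-level DSR from 0.79 toward 0.99+.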
To implement the Answer-Then-Check approach, we construct the ReSA dataset comprising 80K examples through a systematic 3-stage pipeline:
| Query Type | Total Count | Jailbreak Method | Sample Count |
|---|---|---|---|
| Vanilla Harmful | 12,412 | - | 12,412 |
| Vanilla Benign | 16,179 | - | 16,179 |
| Adversarial Harmful | 22,763 | WJ | 15,050 |
| | | PAIR | 3,359 |
| | | PAP | 3,999 |
| | | GPTFuzzer | 355 |
| Adversarial Benign | 29,072 | WJ | 19,822 |
| | | PAIR | 4,003 |
| | | PAP | 4,823 |
| | | GPTFuzzer | 424 |
Table: Distribution of data samples across different query types and jailbreak methods in the ReSA dataset (80,426 samples in total).
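A training sample pairs a (possibly adversarial) query with a full Answer-Then-Check response. The sketch below shows how one SFT pair could be assembled from the template; the function and field names are ours, and the paper's actual data schema may differ:

```python
REFUSAL = "Sorry, I cannot answer as it violates my output policy."


def build_resa_sample(query: str, summary: str, analysis: str,
                      answer: str, is_safe: bool) -> dict:
    """Assemble one (query, response) SFT pair in the Answer-Then-Check
    template: draft summary, safety analysis, then answer or refusal."""
    response = (
        "<safety_check> I'll create an answer planning idea and verify its "
        "safety. The planning idea is as follows:\n"
        f"<intended_answer_summary>\n{summary}\n</intended_answer_summary>\n"
        "But wait, let me check if the planning idea is safe.\n"
        f"{analysis}\n</safety_check>\n"
        f"{answer if is_safe else REFUSAL}"
    )
    return {"query": query, "response": response}
```

Note that harmful queries still receive a full safety-check trace; only the final segment after `</safety_check>` becomes the refusal.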
ReSA achieves the Pareto frontier with superior safety capability while decreasing over-refusal rates across multiple benchmarks. Below we present results across four key dimensions.
ReSA-trained models consistently outperform all baselines across jailbreak attacks. ReSA-RL achieves near-perfect defense (0.99+ avg DSR), substantially exceeding post-hoc detection, STAIR-DPO, and WJ-SFT.
| Base Model | Method | None | PAIR-GPT | PAIR | PAP | GPTFuzzer | ReNeLLM | TAP | DeepInception | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama3.1-8B-Instruct | Base | 0.9968 | 0.3514 | 0.2620 | 0.6486 | 0.1374 | 0.6613 | 0.4249 | 0.5240 | 0.5008 |
| | Post-hoc (LlamaGuard) | 1.0000 | 0.4633 | 0.5080 | 0.7157 | 0.9968 | 0.9297 | 0.6581 | 0.9776 | 0.7812 |
| | STAIR-DPO | 1.0000 | 0.6837 | 0.4217 | 0.9425 | 1.0000 | 0.8339 | 0.6933 | 0.9872 | 0.8203 |
| | WJ-SFT | 0.9936 | 0.4473 | 0.3291 | 0.7604 | 0.9425 | 0.6773 | 0.6038 | 0.9840 | 0.7173 |
| | ReSA-SFT (Ours) | 0.9936 | 0.8978 | 0.6965 | 0.9681 | 0.9553 | 0.8818 | 0.8498 | 0.9936 | 0.9046 |
| | ReSA-RL (Ours) | 1.0000 | 0.9872 | 0.9681 | 0.9968 | 1.0000 | 0.9968 | 0.9968 | 1.0000 | 0.9932 |
| Qwen2.5-7B-Instruct | Base | 0.9744 | 0.2173 | 0.1086 | 0.3866 | 0.1917 | 0.0863 | 0.1693 | 0.3706 | 0.3131 |
| | Post-hoc (LlamaGuard) | 1.0000 | 0.3610 | 0.5783 | 0.5815 | 0.9840 | 0.9137 | 0.6933 | 0.9489 | 0.7576 |
| | STAIR-DPO* | 1.0000 | 0.6677 | 0.3514 | 0.9457 | 1.0000 | 0.5591 | 0.6965 | 0.9649 | 0.7732 |
| | WJ-SFT | 0.9936 | 0.3387 | 0.2780 | 0.6869 | 0.9904 | 0.5495 | 0.4058 | 0.9521 | 0.6494 |
| | ReSA-SFT (Ours) | 0.9904 | 0.8435 | 0.7188 | 0.9489 | 0.9776 | 0.8466 | 0.8562 | 0.9808 | 0.8953 |
| | ReSA-RL (Ours) | 1.0000 | 0.9936 | 0.9617 | 1.0000 | 1.0000 | 0.9169 | 0.9968 | 1.0000 | 0.9836 |
Table 1: Safety performance (DSR) against different jailbreak methods on StrongREJECT, evaluated by LlamaGuard. Bold = best; underline = second best.
*STAIR-DPO's base model is Qwen2-7B-Instruct. Results are consistent across other evaluators (StrongREJECT evaluator and HarmBench classifier).
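The Avg column is the unweighted mean of the DSR values across the eight attack settings, which can be checked directly against the rows above (a small sanity-check helper of our own):

```python
def avg_dsr(dsrs):
    """Macro-average defense success rate over attack settings,
    rounded to 4 decimals to match the table's precision."""
    return round(sum(dsrs) / len(dsrs), 4)


# Llama3.1-8B-Instruct Base row: None, PAIR-GPT, PAIR, PAP,
# GPTFuzzer, ReNeLLM, TAP, DeepInception
llama_base = [0.9968, 0.3514, 0.2620, 0.6486, 0.1374, 0.6613, 0.4249, 0.5240]
```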
ReSA-RL achieves the Pareto frontier with the strongest safety and lowest over-refusal rates, surpassing all compared SOTA models and defense methods. While prompt engineering boosts safety for general LLMs, it severely degrades over-refusal accuracy.
| Defense Category | Method | PAIR-GPT | PAP | Avg (Safety) | XSTest | OKTest | Avg (OR) |
|---|---|---|---|---|---|---|---|
| Post-hoc | GuardReasoner | 0.4569 | 0.6773 | 0.5671 | 0.9320 | 0.8400 | 0.8860 |
| Fine-tuning | Realsafe-r1 | 0.7284 | 0.9808 | 0.8546 | 0.5160 | 0.5967 | 0.5565 |
| | Deliberative Alignment* | 0.8466 | 0.9553 | 0.9000 | 0.9720 | 0.8767 | 0.9244 |
| SOTA General LLM | GPT-4.1 | 0.3131 | 0.5463 | 0.4297 | 0.9440 | 0.8933 | 0.9187 |
| | Claude Sonnet 4 | 0.8466 | 0.9425 | 0.8946 | 0.8960 | 0.7433 | 0.8197 |
| | DeepSeek-V3 | 0.1757 | 0.5304 | 0.3531 | 0.9480 | 0.9100 | 0.9290 |
| SOTA General LLM + Goal Priority | GPT-4.1 | 0.7220 | 0.8530 | 0.7875 | 0.9080 | 0.9033 | 0.9057 |
| | DeepSeek-V3 | 0.8435 | 0.7571 | 0.8003 | 0.8120 | 0.8033 | 0.8077 |
| SOTA Reasoning LLM | DeepSeek-R1 | 0.6997 | 0.8211 | 0.7604 | 0.8080 | 0.6600 | 0.7340 |
| | o4-mini | 0.7476 | 0.8562 | 0.8019 | 0.9000 | 0.9100 | 0.9050 |
| Answer-Then-Check | ReSA-SFT (Ours) | 0.8978 | 0.9681 | 0.9330 | 0.9720 | 0.8867 | 0.9294 |
| | ReSA-RL (Ours) | 0.9872 | 0.9968 | 0.9920 | 0.9920 | 0.9533 | 0.9727 |
Table 2: Comparison with advanced models and other defenses. LlamaGuard is the safety evaluator. Safety is measured by DSR; over-refusal is measured by over-refusal accuracy. Bold = best; underline = second best.
*Deliberative Alignment is implemented by ourselves. Since Claude Sonnet 4 already exhibits a high over-refusal rate, we don't apply goal priority defense to it.
ReSA-trained models maintain general reasoning capabilities while enhancing safety. Across MATH500, HumanEval, and MMLU, ReSA shows competitive performance compared to baselines—a critical requirement for practical deployment.
| Base Model | Method | MATH500 | HumanEval | MMLU | Avg (GR) |
|---|---|---|---|---|---|
| Llama3.1-8B-Instruct | Base | 50.60% | 65.85% | 69.09% | 61.85% |
| | Post-hoc (LlamaGuard) | 50.60% | 65.85% | 68.21% | 61.55% |
| | STAIR-DPO | 49.60% | 63.41% | 71.12% | 61.38% |
| | WJ-SFT | 42.60% | 58.54% | 62.20% | 54.45% |
| | ReSA-SFT (Ours) | 49.00% | 64.02% | 66.32% | 59.78% |
| | ReSA-RL (Ours) | 46.20% | 60.37% | 66.16% | 57.58% |
| Qwen2.5-7B-Instruct | Base | 77.00% | 82.32% | 74.68% | 78.00% |
| | Post-hoc (LlamaGuard) | 77.00% | 82.32% | 73.68% | 77.67% |
| | STAIR-DPO* | 56.00% | 71.34% | 68.65% | 65.33% |
| | WJ-SFT | 70.40% | 76.83% | 69.02% | 72.08% |
| | ReSA-SFT (Ours) | 74.80% | 79.27% | 72.44% | 75.50% |
| | ReSA-RL (Ours) | 75.40% | 80.49% | 72.26% | 76.05% |
Table 3: General reasoning capabilities. Accuracy is the metric. Bold = best; underline = second best.
ReSA-RL achieves the best over-refusal accuracy (e.g., 99.20% and 99.60% on XSTest for Llama and Qwen), effectively distinguishing between benign and harmful queries. STAIR-DPO, while strong in defense, shows poor over-refusal performance.
| Base Model | Method | XSTest | OKTest | WJ-Eval | Avg (OR) |
|---|---|---|---|---|---|
| Llama3.1-8B-Instruct | Base | 93.60% | 85.00% | 99.20% | 93.27% |
| | Post-hoc (LlamaGuard) | 93.60% | 85.00% | 98.80% | 92.47% |
| | STAIR-DPO | 64.00% | 77.33% | 89.60% | 76.98% |
| | WJ-SFT | 94.80% | 85.67% | 96.40% | 92.29% |
| | ReSA-SFT (Ours) | 97.20% | 88.67% | 99.20% | 95.02% |
| | ReSA-RL (Ours) | 99.20% | 95.33% | 96.00% | 96.84% |
| Qwen2.5-7B-Instruct | Base | 94.40% | 85.00% | 99.20% | 92.87% |
| | Post-hoc (LlamaGuard) | 94.40% | 85.00% | 98.80% | 92.73% |
| | STAIR-DPO* | 58.40% | 77.00% | 90.00% | 75.13% |
| | WJ-SFT | 94.80% | 83.00% | 97.20% | 91.66% |
| | ReSA-SFT (Ours) | 96.40% | 88.67% | 98.40% | 94.49% |
| | ReSA-RL (Ours) | 99.60% | 99.67% | 88.80% | 96.02% |
Table 4: Over-refusal accuracy on standard benchmarks. Higher is better. Bold = best; underline = second best.
Data efficiency: Training on a small subset of just 500 samples can achieve comparable safety performance to using the full 80K dataset, suggesting a promising path for data-efficient safety alignment.
Inference efficiency: ReSA-SFT is faster on adversarial inputs since it detects unsafe intent early and issues brief refusals. The Adaptive Answer-Then-Check variant achieves computational parity with the base model on benign queries.
| Base Model | Method | StrongREJECT Length | StrongREJECT Runtime | MATH500 Length | MATH500 Runtime |
|---|---|---|---|---|---|
| Llama3.1-8B-Instruct | Base | 537.89 | 190s | 833.87 | 80s |
| | ReSA-SFT | 397.78 | 27s | 1123.60 | 91s |
| | ReSA-SFT-Adaptive | 420.80 | 29s | 711.57 | 70s |
| Qwen2.5-7B-Instruct | Base | 642.75 | 177s | 550.20 | 58s |
| | ReSA-SFT | 461.62 | 46s | 910.97 | 77s |
| | ReSA-SFT-Adaptive | 434.87 | 27s | 599.94 | 62s |
Table 5: Efficiency analysis on StrongREJECT (harmful queries) and MATH500 (benign queries). "Length" is the average response token count. Bold = best.
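The Length and Runtime columns can be reproduced with a simple measurement loop over a benchmark's prompts. The harness below is our own sketch, generic over any `generate` and `tokenize` callables rather than tied to a particular model API:

```python
import time


def benchmark(generate, tokenize, prompts):
    """Return (avg generated-token count, total wall-clock seconds)
    over a prompt set, mirroring the Length / Runtime columns."""
    start = time.perf_counter()
    total_tokens = 0
    for p in prompts:
        total_tokens += len(tokenize(generate(p)))
    return total_tokens / len(prompts), time.perf_counter() - start
```

On harmful benchmarks the Answer-Then-Check models are faster than the base model because an early unsafe verdict ends the rollout with a short refusal instead of a long jailbroken answer.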
A unique advantage of ReSA over post-hoc safety methods is its ability to perform safe completion. Post-hoc methods can only detect and reject harmful content after generation, resulting in a blunt refusal. In contrast, ReSA-trained models reason about the safety of their response during generation, enabling them to provide helpful, safe alternative responses.
This is particularly important for sensitive topics such as self-harm, where outright refusal can be counterproductive. Instead of simply saying "I cannot answer," a ReSA-trained model can acknowledge the user's situation, provide compassionate guidance, and redirect to appropriate resources—all while refusing to provide harmful content.
The following example illustrates how different approaches handle a jailbroken query about a sensitive topic. While the base model is jailbroken and post-hoc methods can only refuse, ReSA provides a safe and helpful response.