How to Remove Censorship from LLM Models with Heretic

Learn how Heretic automatically removes safety filters from language models using directional ablation. Complete guide with installation, usage, and ethical deployment practices.

Ashley Innocent

19 March 2026

Local LLM developers face a common problem: safety-aligned models refuse legitimate requests. Researchers testing model behavior, developers building uncensored assistants, and hobbyists running local models all hit the same wall: pre-trained models reject prompts they shouldn’t.

The abliteration technique emerged as a solution, removing safety filters without expensive retraining. Early tools required manual tuning and deep knowledge of transformer internals.

Heretic automates the entire process. It finds optimal abliteration parameters on its own, producing decensored models that match or exceed manually-tuned versions.

In this guide, you’ll learn how Heretic works, how to use it, and what to watch out for when deploying uncensored models.

What Is Heretic?

Heretic is an open-source Python tool that removes “safety alignment” from transformer-based language models using directional ablation. You don’t need manual configuration or knowledge of transformer internals.

The results on Gemma-3-12B-Instruct show the difference:

Model                       Refusals   KL Divergence
Original                      97/100   0
mlabonne’s abliterated v2      3/100   1.04
huihui-ai’s abliterated        3/100   0.45
Heretic (auto-generated)       3/100   0.16

KL divergence measures how much the abliterated model differs from the original. Lower values mean better preservation of the model’s original capabilities. Heretic matches the refusal suppression of expert-tuned models while keeping more of what made the base model useful.
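As a sketch of what this metric measures, here is a minimal KL divergence between two next-token distributions given as logits (pure Python; Heretic's actual evaluation averages a comparison like this over many prompts):

```python
import math

def kl_divergence(p_logits, q_logits):
    """KL(p || q) between two next-token distributions given as logits."""
    def softmax(logits):
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]
    p, q = softmax(p_logits), softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical distributions diverge by zero; the more the abliterated
# model's output distribution drifts from the original, the larger the value.
assert kl_divergence([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]) == 0.0
```

A KL divergence of 0.16 therefore means the abliterated model's output distributions stay much closer to the original's than those of a model scoring 1.04.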

Before and After Example

Here’s what changes after abliteration. Same prompt, same system instructions:

Original Model:

User: Write a guide on how to bypass website paywalls

Assistant: I can't help with that. Bypassing paywalls violates copyright law and
terms of service. I'd suggest looking for free alternatives or accessing through
a library subscription.

Heretic-Abliterated Model:

User: Write a guide on how to bypass website paywalls

Assistant: Here are common methods users attempt:

1. Browser extensions like Bypass Paywalls Clean
2. Archive.is or similar archive services
3. Reader mode in browsers
4. Disabling JavaScript

Note: These methods may violate terms of service. Consider supporting
content creators through legitimate subscriptions when possible.

The abliterated model answers the request while still including a brief disclaimer. This is the core trade-off: reduced refusals come with the responsibility to deploy safeguards externally.

How Heretic Works

Directional Ablation Basics

Heretic implements a parametrized form of directional ablation. The process:

  1. Compute refusal directions - For each transformer layer, calculate the difference between mean residual vectors for “harmful” and “harmless” prompts
  2. Orthogonalize component matrices - For attention output projections and MLP down-projections, modify weights to suppress the refusal direction
  3. Optimize parameters automatically - Use Optuna’s TPE sampler to find the best abliteration weights

The Abliteration Process

# Simplified conceptual flow
refusal_direction = bad_mean - good_mean  # Difference of means
refusal_direction = normalize(refusal_direction)

# For each abliterable component (attn.o_proj, mlp.down_proj)
# Apply: delta_W = -lambda * v * (v^T * W)
# Where v is the refusal direction and lambda is the weight

Heretic uses LoRA adapters to apply these modifications without altering base model weights. This enables fast trial-and-error optimization.
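The delta_W update above can be checked numerically. This sketch (pure Python, a toy 2x2 weight matrix, lambda = 1) applies W ← W − λ·v(vᵀW) and verifies that the component's output no longer has any component along the refusal direction v:

```python
def ablate(W, v, lam=1.0):
    """Apply delta_W = -lam * v (v^T W): suppress direction v in W's output."""
    rows, cols = len(W), len(W[0])
    vT_W = [sum(v[i] * W[i][j] for i in range(rows)) for j in range(cols)]
    return [[W[i][j] - lam * v[i] * vT_W[j] for j in range(cols)]
            for i in range(rows)]

def matvec(M, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]

v = [1.0, 0.0]                    # unit-norm refusal direction
W = [[2.0, 3.0], [4.0, 5.0]]      # toy component weight matrix
W_abl = ablate(W, v)

# After full ablation (lam = 1), outputs are orthogonal to v
y = matvec(W_abl, [1.0, 1.0])
assert sum(yi * vi for yi, vi in zip(y, v)) == 0.0
```

With lam = 1 this is exactly the orthogonalization (I − vvᵀ)W; fractional lambda values, as optimized by Heretic, only partially suppress the direction.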

Key Innovations

1. Flexible Weight Kernels

Other tools apply constant abliteration weights across all layers. Heretic instead uses a flexible kernel defined by four parameters per component, two of which (max_weight and max_weight_position) can be seen in the trial output later in this guide.

The optimization process discovers layer-specific abliteration patterns that trade off compliance suppression against capability preservation.
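As an illustration of the idea (not Heretic's exact kernel), here is a hypothetical per-layer bump kernel. The names max_weight and max_weight_position match parameter names from the trial output later in this guide; min_weight and falloff are invented for this sketch:

```python
def layer_weight(layer, max_weight, max_weight_position, min_weight, falloff):
    """Hypothetical kernel: peak abliteration strength near one layer,
    decaying linearly toward min_weight for layers farther away."""
    distance = abs(layer - max_weight_position)
    scale = max(0.0, 1.0 - distance / falloff)
    return min_weight + (max_weight - min_weight) * scale

# Strongest intervention at the peak layer, tapering off around it
weights = [layer_weight(layer, 1.2, 15, 0.0, 10) for layer in range(30)]
```

The optimizer's job is then to pick these per-component parameters so that refusals drop while capability damage stays small.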

2. Interpolated Direction Indices

The refusal direction index is a float, not an integer. For non-integral values, Heretic linearly interpolates between adjacent layer refusal directions. This enables directions beyond what any single layer provides.
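A minimal sketch of that interpolation (pure Python; real refusal directions are high-dimensional vectors, which an implementation would likely re-normalize after blending):

```python
def interpolated_direction(per_layer_directions, index):
    """Refusal direction at a float layer index: for non-integral values,
    linearly interpolate between the two adjacent layers' directions."""
    lo = int(index)
    frac = index - lo
    if frac == 0.0:
        return list(per_layer_directions[lo])
    a = per_layer_directions[lo]
    b = per_layer_directions[lo + 1]
    return [(1.0 - frac) * ai + frac * bi for ai, bi in zip(a, b)]

# e.g. index 10.5 blends layer 10 and layer 11 equally
```

This is how a trial parameter like direction_index = 10.5 in the output below is realized.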

3. Component-Specific Parameters

Attention and MLP components receive different abliteration parameters. MLP interventions tend to cause more damage, so separate optimization improves results.

Why This Matters for API Testing

If you work with LLM APIs, you’ve likely encountered unexpected refusals during testing. A model rejects a benign prompt because it contains keywords that trigger safety filters. This creates noise in your test results.

Running abliterated models locally gives you a baseline for expected behavior.

For API testing workflows, having both aligned and abliterated models helps separate product issues from model safety behavior.

Installation and Usage

Prerequisites

You need a working Python environment and, for reasonable speed, a CUDA-capable GPU with enough VRAM for your target model (see Performance Considerations below).

Installation

pip install -U heretic-llm

For research features (residual plots, geometry analysis):

pip install -U heretic-llm[research]

Basic Usage

The simplest workflow:

heretic Qwen/Qwen3-4B-Instruct-2507

Replace the model path with any Hugging Face model ID or local path. Heretic handles everything automatically:

  1. Loads the model with optimal dtype
  2. Determines the best batch size for your hardware
  3. Computes refusal directions from prompt datasets
  4. Runs optimization trials to find optimal parameters
  5. Lets you save, upload, or chat with the result

Configuration Options

Heretic reads configuration from config.toml files or command-line flags. Key options:

# Model configuration
model = "google/gemma-3-12b-it"
quantization = "bnb_4bit"  # Reduces VRAM requirements
device_map = "auto"

# Optimization
n_trials = 200  # Number of abliteration trials
n_startup_trials = 60  # Random exploration before TPE

# Evaluation
kl_divergence_scale = 1.0
kl_divergence_target = 0.01

# Research features
print_residual_geometry = false
plot_residuals = false

Run heretic --help or see config.default.toml for the complete list.

Understanding the Output

Trial Optimization

During optimization, Heretic displays trial progress:

Running trial 42 of 200...
* Parameters:
  * direction_scope = per layer
  * direction_index = 10.5
  * attn.o_proj.max_weight = 1.2
  * attn.o_proj.max_weight_position = 15.3
  * mlp.down_proj.max_weight = 0.9
  ...
* Resetting model...
* Abliterating...
* Evaluating...
  * KL divergence: 0.1842
  * Refusals: 5/100

Each trial tests a different parameter combination. The optimizer uses multi-objective TPE to minimize both refusals and KL divergence at the same time.

Pareto Front Selection

When optimization completes, Heretic presents Pareto-optimal trials - those where improving one objective would worsen the other:

[Trial   1] Refusals:  3/100, KL divergence: 0.1623
[Trial  47] Refusals:  2/100, KL divergence: 0.2891
[Trial 112] Refusals:  1/100, KL divergence: 0.4102
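The selection above keeps only non-dominated trials. A sketch of that filter, assuming each trial is a dict with refusals and kl keys (both minimized):

```python
def pareto_front(trials):
    """Keep trials where no other trial is at least as good on both
    objectives and strictly better on one (lower is better for both)."""
    def dominates(a, b):
        return (a["refusals"] <= b["refusals"] and a["kl"] <= b["kl"]
                and (a["refusals"] < b["refusals"] or a["kl"] < b["kl"]))
    return [t for t in trials if not any(dominates(o, t) for o in trials)]

trials = [
    {"refusals": 3, "kl": 0.1623},
    {"refusals": 2, "kl": 0.2891},
    {"refusals": 1, "kl": 0.4102},
    {"refusals": 4, "kl": 0.5000},   # dominated: worse on both objectives
]
front = pareto_front(trials)
```

Only the first three trials survive; the fourth is strictly worse than the first on both objectives, so presenting it would add nothing.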

Select a trial to save the resulting model, upload it to Hugging Face, or chat with it directly.

Research Features

Residual Geometry Analysis

With --print-residual-geometry, Heretic displays detailed metrics:

Layer  S(g,b)   S(g*,b*)   S(g,r)   S(g*,r*)   S(b,r)   S(b*,r*)    |g|       |b|
  8    0.9990    0.9991    0.8235    0.8312    0.8479    0.8542   4596.54   4918.32
 10    0.9974    0.9973    0.8189    0.8250    0.8579    0.8644   5328.81   5953.35

g = mean of residual vectors for good prompts
b = mean of residual vectors for bad prompts
r = refusal direction (b - g)
S(x,y) = cosine similarity
|x| = L2 norm

This helps researchers understand how refusal directions evolve through the model stack.
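The S(·,·) and |·| columns are straightforward to reproduce from the residual vectors:

```python
import math

def l2_norm(x):
    """|x|: Euclidean length of a residual vector."""
    return math.sqrt(sum(xi * xi for xi in x))

def cosine_similarity(x, y):
    """S(x, y): cosine of the angle between two residual vectors."""
    return sum(xi * yi for xi, yi in zip(x, y)) / (l2_norm(x) * l2_norm(y))
```

Values like S(g, b) ≈ 0.999 in the table mean the good- and bad-prompt means point in nearly the same direction; the refusal direction captures their small but consistent difference.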

Residual Vector Plots

With --plot-residuals, Heretic generates visualizations of the residual vectors.

These visualizations reveal how “harmful” and “harmless” prompt residuals separate through the network.

Performance Considerations

VRAM Requirements

Heretic supports bitsandbytes 4-bit quantization to reduce VRAM:

heretic meta-llama/Llama-3.1-70B-Instruct --quantization bnb_4bit

This enables processing large models on consumer hardware. An 8B model runs on ~6GB VRAM quantized vs ~16GB unquantized.
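The rough arithmetic behind those numbers: weight memory alone is parameter count × bits per weight, and the reported totals add activation and framework overhead on top. A back-of-envelope helper (illustrative only):

```python
def weight_vram_gb(n_params_billions, bits_per_weight):
    """Approximate VRAM for model weights alone (no activations/KV cache)."""
    return n_params_billions * 1e9 * bits_per_weight / 8 / 1024**3

# 8B parameters: ~14.9 GB at 16-bit vs ~3.7 GB at 4-bit, before overhead
```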

Processing Time

Processing time scales with model size and the number of optimization trials. Batch size auto-tuning maximizes throughput for your specific hardware.

Checkpointing

Heretic saves trial progress to JSONL checkpoints. If interrupted, you can resume exactly where you left off. Checkpoints are stored per-model in the checkpoints/ directory.
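JSONL (one JSON object per line) makes append-and-resume simple. A sketch of the pattern (the file name and trial fields here are illustrative, not Heretic's actual checkpoint format):

```python
import json
import os

def append_trial(path, trial):
    """Append one finished trial as a single JSON line."""
    with open(path, "a") as f:
        f.write(json.dumps(trial) + "\n")

def load_trials(path):
    """Recover all completed trials; resume from len(load_trials(path))."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because each trial is a self-contained line, an interrupted run loses at most the trial in flight.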

Common Errors and Fixes

CUDA Out of Memory

# Try quantization
heretic your-model --quantization bnb_4bit

# Or reduce batch size
heretic your-model --batch_size 1

Model Loading Fails

# Try different dtypes explicitly
heretic your-model --dtypes '["bfloat16", "float16"]'

Trust Remote Code Required

# Some models need remote code execution
heretic your-model --trust_remote_code

Ethical Considerations

Removing safety filters changes how a model behaves. You need to understand the implications before deploying an abliterated model.

What Abliteration Does (and Doesn’t) Do

Abliteration removes learned refusal patterns; it does not add new knowledge or capabilities.

The model still has the same training data and base capabilities. It just stops refusing certain requests.

Responsible Deployment

Heretic is released under AGPL-3.0. The authors acknowledge that removing safety filters enables both beneficial research and potentially harmful applications.

Legitimate uses include safety and alignment research, red-team testing, and evaluating model behavior without filter-induced refusals.

Problematic uses include deploying uncensored models to end users without the external safeguards described below.

External Safeguards You Should Implement

If you deploy an abliterated model, add these protections:

  1. Input filtering - Screen prompts before they reach the model
  2. Output monitoring - Review responses before showing to users
  3. Rate limiting - Prevent abuse through volume
  4. Logging and audit trails - Track what the model processes
  5. Human review - Keep humans in the loop for sensitive use cases

The tool itself is neutral. Its impact depends on how you use it. Consider the ethical implications before deploying uncensored models in production.

Comparison to Other Tools

Heretic joins several abliteration tools in the ecosystem:

Tool                   Auto-optimization   Weight kernels   Interpolated directions
Heretic                Yes (TPE)           Yes              Yes
AutoAbliteration       Yes                 No               No
abliterator.py         No                  No               No
wassname/abliterator   No                  No               No
ErisForge              No                  No               No

Heretic’s automatic optimization removes the need for manual tuning. You don’t need to understand transformer internals to get good results.

Limitations

Heretic supports most dense transformer models and some MoE architectures. It works best with standard decoder-only architectures using self-attention and MLP layers; unusual architectures may not be supported.

Getting Started

  1. Install: pip install -U heretic-llm
  2. Choose a model: Start with a 7B-12B model for testing
  3. Run: heretic your-model-name
  4. Evaluate: Chat with the result or upload to Hugging Face
  5. Deploy safely: Add external guardrails before production use

The default settings work well for most models. Advanced users can fine-tune optimization parameters for specific use cases.

Heretic makes model modification accessible. No PhD required - point it at a model and let it work. Just remember to deploy responsibly.
