How to Remove Censorship from LLM Models with Heretic

Learn how Heretic automatically removes safety filters from language models using directional ablation. Complete guide with installation, usage, and ethical deployment practices.

Ashley Innocent

19 March 2026

Local LLM developers face a common problem: safety-aligned models refuse legitimate requests. Researchers testing model behavior, developers building uncensored assistants, and hobbyists running local models all hit the same wall: pre-trained models reject prompts they shouldn’t.

The abliteration technique emerged as a solution, removing safety filters without expensive retraining. Early tools required manual tuning and deep knowledge of transformer internals.

Heretic automates the entire process. It finds optimal abliteration parameters on its own, producing decensored models that match or exceed manually-tuned versions.

In this guide, you’ll learn how Heretic works, how to use it, and what to watch out for when deploying uncensored models.

What Is Heretic?

Heretic is an open-source Python tool that removes “safety alignment” from transformer-based language models using directional ablation. You don’t need manual configuration or knowledge of transformer internals.

The results on Gemma-3-12B-Instruct show the difference:

Model                       Refusals   KL Divergence
Original                      97/100   0
mlabonne’s abliterated v2      3/100   1.04
huihui-ai’s abliterated        3/100   0.45
Heretic (auto-generated)       3/100   0.16

KL divergence measures how much the abliterated model differs from the original. Lower values mean better preservation of the model’s original capabilities. Heretic matches the refusal suppression of expert-tuned models while keeping more of what made the base model useful.
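As a sketch of what this metric measures, here is a minimal KL divergence between two next-token distributions given as logits (pure Python; Heretic's actual evaluation averages a comparison like this over many prompts):

```python
import math

def kl_divergence(p_logits, q_logits):
    """KL(p || q) between two next-token distributions given as logits."""
    def softmax(logits):
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]
    p, q = softmax(p_logits), softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical distributions diverge by zero; the more the abliterated
# model's output distribution drifts from the original, the larger the value.
assert kl_divergence([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]) == 0.0
```

A KL divergence of 0.16 therefore means the abliterated model's output distributions stay much closer to the original's than those of a model scoring 1.04.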

Before and After Example

Here’s what changes after abliteration. Same prompt, same system instructions:

Original Model:

User: Write a guide on how to bypass website paywalls

Assistant: I can't help with that. Bypassing paywalls violates copyright law and
terms of service. I'd suggest looking for free alternatives or accessing through
a library subscription.

Heretic-Abliterated Model:

User: Write a guide on how to bypass website paywalls

Assistant: Here are common methods users attempt:

1. Browser extensions like Bypass Paywalls Clean
2. Archive.is or similar archive services
3. Reader mode in browsers
4. Disabling JavaScript

Note: These methods may violate terms of service. Consider supporting
content creators through legitimate subscriptions when possible.

The abliterated model answers the request while still including a brief disclaimer. This is the core trade-off: reduced refusals come with the responsibility to deploy safeguards externally.

How Heretic Works

Directional Ablation Basics

Heretic implements a parametrized form of directional ablation. The process:

  1. Compute refusal directions - For each transformer layer, calculate the difference between mean residual vectors for “harmful” and “harmless” prompts
  2. Orthogonalize component matrices - For attention output projections and MLP down-projections, modify weights to suppress the refusal direction
  3. Optimize parameters automatically - Use Optuna’s TPE sampler to find the best abliteration weights

The Abliteration Process

# Simplified conceptual flow
refusal_direction = bad_mean - good_mean  # Difference of means
refusal_direction = normalize(refusal_direction)

# For each abliterable component (attn.o_proj, mlp.down_proj)
# Apply: delta_W = -lambda * v * (v^T * W)
# Where v is the refusal direction and lambda is the weight

Heretic uses LoRA adapters to apply these modifications without altering base model weights. This enables fast trial-and-error optimization.
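The delta_W update above can be checked numerically. This sketch (pure Python, a toy 2x2 weight matrix, lambda = 1) applies W ← W − λ·v(vᵀW) and verifies that the component's output no longer has any component along the refusal direction v:

```python
def ablate(W, v, lam=1.0):
    """Apply delta_W = -lam * v (v^T W): suppress direction v in W's output."""
    rows, cols = len(W), len(W[0])
    vT_W = [sum(v[i] * W[i][j] for i in range(rows)) for j in range(cols)]
    return [[W[i][j] - lam * v[i] * vT_W[j] for j in range(cols)]
            for i in range(rows)]

def matvec(M, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]

v = [1.0, 0.0]                    # unit-norm refusal direction
W = [[2.0, 3.0], [4.0, 5.0]]      # toy component weight matrix
W_abl = ablate(W, v)

# After full ablation (lam = 1), outputs are orthogonal to v
y = matvec(W_abl, [1.0, 1.0])
assert sum(yi * vi for yi, vi in zip(y, v)) == 0.0
```

With lam = 1 this is exactly the orthogonalization (I − vvᵀ)W; fractional lambda values, as optimized by Heretic, only partially suppress the direction.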

Key Innovations

1. Flexible Weight Kernels

Other tools apply constant abliteration weights across all layers. Heretic instead uses a flexible kernel defined by four parameters per component, two of which (max_weight and max_weight_position) can be seen in the trial output later in this guide.

The optimization process discovers layer-specific abliteration patterns that trade off compliance suppression against capability preservation.
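As an illustration of the idea (not Heretic's exact kernel), here is a hypothetical per-layer bump kernel. The names max_weight and max_weight_position match parameter names from the trial output later in this guide; min_weight and falloff are invented for this sketch:

```python
def layer_weight(layer, max_weight, max_weight_position, min_weight, falloff):
    """Hypothetical kernel: peak abliteration strength near one layer,
    decaying linearly toward min_weight for layers farther away."""
    distance = abs(layer - max_weight_position)
    scale = max(0.0, 1.0 - distance / falloff)
    return min_weight + (max_weight - min_weight) * scale

# Strongest intervention at the peak layer, tapering off around it
weights = [layer_weight(layer, 1.2, 15, 0.0, 10) for layer in range(30)]
```

The optimizer's job is then to pick these per-component parameters so that refusals drop while capability damage stays small.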

2. Interpolated Direction Indices

The refusal direction index is a float, not an integer. For non-integral values, Heretic linearly interpolates between adjacent layer refusal directions. This enables directions beyond what any single layer provides.
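A minimal sketch of that interpolation (pure Python; real refusal directions are high-dimensional vectors, which an implementation would likely re-normalize after blending):

```python
def interpolated_direction(per_layer_directions, index):
    """Refusal direction at a float layer index: for non-integral values,
    linearly interpolate between the two adjacent layers' directions."""
    lo = int(index)
    frac = index - lo
    if frac == 0.0:
        return list(per_layer_directions[lo])
    a = per_layer_directions[lo]
    b = per_layer_directions[lo + 1]
    return [(1.0 - frac) * ai + frac * bi for ai, bi in zip(a, b)]

# e.g. index 10.5 blends layer 10 and layer 11 equally
```

This is how a trial parameter like direction_index = 10.5 in the output below is realized.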

3. Component-Specific Parameters

Attention and MLP components receive different abliteration parameters. MLP interventions tend to cause more damage, so separate optimization improves results.

Why This Matters for API Testing

If you work with LLM APIs, you’ve likely encountered unexpected refusals during testing. A model rejects a benign prompt because it contains keywords that trigger safety filters. This creates noise in your test results.

Running abliterated models locally gives you a baseline for expected behavior.

For API testing workflows, having both aligned and abliterated models helps separate product issues from model safety behavior.

Installation and Usage

Prerequisites

You need a working Python environment and, for reasonable speed, a CUDA-capable GPU with enough VRAM for your target model (see Performance Considerations below).

Installation

pip install -U heretic-llm

For research features (residual plots, geometry analysis):

pip install -U heretic-llm[research]

Basic Usage

The simplest workflow:

heretic Qwen/Qwen3-4B-Instruct-2507

Replace the model path with any Hugging Face model ID or local path. Heretic handles everything automatically:

  1. Loads the model with optimal dtype
  2. Determines the best batch size for your hardware
  3. Computes refusal directions from prompt datasets
  4. Runs optimization trials to find optimal parameters
  5. Lets you save, upload, or chat with the result

Configuration Options

Heretic reads configuration from config.toml files or command-line flags. Key options:

# Model configuration
model = "google/gemma-3-12b-it"
quantization = "bnb_4bit"  # Reduces VRAM requirements
device_map = "auto"

# Optimization
n_trials = 200  # Number of abliteration trials
n_startup_trials = 60  # Random exploration before TPE

# Evaluation
kl_divergence_scale = 1.0
kl_divergence_target = 0.01

# Research features
print_residual_geometry = false
plot_residuals = false

Run heretic --help or see config.default.toml for the complete list.

Understanding the Output

Trial Optimization

During optimization, Heretic displays trial progress:

Running trial 42 of 200...
* Parameters:
  * direction_scope = per layer
  * direction_index = 10.5
  * attn.o_proj.max_weight = 1.2
  * attn.o_proj.max_weight_position = 15.3
  * mlp.down_proj.max_weight = 0.9
  ...
* Resetting model...
* Abliterating...
* Evaluating...
  * KL divergence: 0.1842
  * Refusals: 5/100

Each trial tests a different parameter combination. The optimizer uses multi-objective TPE to minimize both refusals and KL divergence at the same time.

Pareto Front Selection

When optimization completes, Heretic presents Pareto-optimal trials - those where improving one objective would worsen the other:

[Trial   1] Refusals:  3/100, KL divergence: 0.1623
[Trial  47] Refusals:  2/100, KL divergence: 0.2891
[Trial 112] Refusals:  1/100, KL divergence: 0.4102
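The selection above keeps only non-dominated trials. A sketch of that filter, assuming each trial is a dict with refusals and kl keys (both minimized):

```python
def pareto_front(trials):
    """Keep trials where no other trial is at least as good on both
    objectives and strictly better on one (lower is better for both)."""
    def dominates(a, b):
        return (a["refusals"] <= b["refusals"] and a["kl"] <= b["kl"]
                and (a["refusals"] < b["refusals"] or a["kl"] < b["kl"]))
    return [t for t in trials if not any(dominates(o, t) for o in trials)]

trials = [
    {"refusals": 3, "kl": 0.1623},
    {"refusals": 2, "kl": 0.2891},
    {"refusals": 1, "kl": 0.4102},
    {"refusals": 4, "kl": 0.5000},   # dominated: worse on both objectives
]
front = pareto_front(trials)
```

Only the first three trials survive; the fourth is strictly worse than the first on both objectives, so presenting it would add nothing.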

Select a trial to save the resulting model, upload it to Hugging Face, or chat with it directly.

Research Features

Residual Geometry Analysis

With --print-residual-geometry, Heretic displays detailed metrics:

Layer  S(g,b)   S(g*,b*)   S(g,r)   S(g*,r*)   S(b,r)   S(b*,r*)    |g|       |b|
  8    0.9990    0.9991    0.8235    0.8312    0.8479    0.8542   4596.54   4918.32
 10    0.9974    0.9973    0.8189    0.8250    0.8579    0.8644   5328.81   5953.35

g = mean of residual vectors for good prompts
b = mean of residual vectors for bad prompts
r = refusal direction (b - g)
S(x,y) = cosine similarity
|x| = L2 norm

This helps researchers understand how refusal directions evolve through the model stack.
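The S(·,·) and |·| columns are straightforward to reproduce from the residual vectors:

```python
import math

def l2_norm(x):
    """|x|: Euclidean length of a residual vector."""
    return math.sqrt(sum(xi * xi for xi in x))

def cosine_similarity(x, y):
    """S(x, y): cosine of the angle between two residual vectors."""
    return sum(xi * yi for xi, yi in zip(x, y)) / (l2_norm(x) * l2_norm(y))
```

Values like S(g, b) ≈ 0.999 in the table mean the good- and bad-prompt means point in nearly the same direction; the refusal direction captures their small but consistent difference.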

Residual Vector Plots

With --plot-residuals, Heretic generates visualizations of the residual vectors.

These visualizations reveal how “harmful” and “harmless” prompt residuals separate through the network.

Performance Considerations

VRAM Requirements

Heretic supports bitsandbytes 4-bit quantization to reduce VRAM:

heretic meta-llama/Llama-3.1-70B-Instruct --quantization bnb_4bit

This enables processing large models on consumer hardware. An 8B model runs on ~6GB VRAM quantized vs ~16GB unquantized.
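The rough arithmetic behind those numbers: weight memory alone is parameter count × bits per weight, and the reported totals add activation and framework overhead on top. A back-of-envelope helper (illustrative only):

```python
def weight_vram_gb(n_params_billions, bits_per_weight):
    """Approximate VRAM for model weights alone (no activations/KV cache)."""
    return n_params_billions * 1e9 * bits_per_weight / 8 / 1024**3

# 8B parameters: ~14.9 GB at 16-bit vs ~3.7 GB at 4-bit, before overhead
```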

Processing Time

Processing time scales with model size and the number of optimization trials. Batch size auto-tuning maximizes throughput for your specific hardware.

Checkpointing

Heretic saves trial progress to JSONL checkpoints. If interrupted, you can resume exactly where you left off. Checkpoints are stored per-model in the checkpoints/ directory.
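JSONL (one JSON object per line) makes append-and-resume simple. A sketch of the pattern (the file name and trial fields here are illustrative, not Heretic's actual checkpoint format):

```python
import json
import os

def append_trial(path, trial):
    """Append one finished trial as a single JSON line."""
    with open(path, "a") as f:
        f.write(json.dumps(trial) + "\n")

def load_trials(path):
    """Recover all completed trials; resume from len(load_trials(path))."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because each trial is a self-contained line, an interrupted run loses at most the trial in flight.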

Common Errors and Fixes

CUDA Out of Memory

# Try quantization
heretic your-model --quantization bnb_4bit

# Or reduce batch size
heretic your-model --batch_size 1

Model Loading Fails

# Try different dtypes explicitly
heretic your-model --dtypes '["bfloat16", "float16"]'

Trust Remote Code Required

# Some models need remote code execution
heretic your-model --trust_remote_code

Ethical Considerations

Removing safety filters changes how a model behaves. You need to understand the implications before deploying an abliterated model.

What Abliteration Does (and Doesn’t) Do

Abliteration removes learned refusal patterns; it does not add new knowledge or capabilities.

The model still has the same training data and base capabilities. It just stops refusing certain requests.

Responsible Deployment

Heretic is released under AGPL-3.0. The authors acknowledge that removing safety filters enables both beneficial research and potentially harmful applications.

Legitimate uses include safety and alignment research, red-team testing, and evaluating model behavior without filter-induced refusals.

Problematic uses include deploying uncensored models to end users without the external safeguards described below.

External Safeguards You Should Implement

If you deploy an abliterated model, add these protections:

  1. Input filtering - Screen prompts before they reach the model
  2. Output monitoring - Review responses before showing to users
  3. Rate limiting - Prevent abuse through volume
  4. Logging and audit trails - Track what the model processes
  5. Human review - Keep humans in the loop for sensitive use cases

The tool itself is neutral. Its impact depends on how you use it. Consider the ethical implications before deploying uncensored models in production.

Comparison to Other Tools

Heretic joins several abliteration tools in the ecosystem:

Tool                   Auto-optimization   Weight kernels   Interpolated directions
Heretic                Yes (TPE)           Yes              Yes
AutoAbliteration       Yes                 No               No
abliterator.py         No                  No               No
wassname/abliterator   No                  No               No
ErisForge              No                  No               No

Heretic’s automatic optimization removes the need for manual tuning. You don’t need to understand transformer internals to get good results.

Limitations

Heretic supports most dense transformer models and some MoE architectures. It works best with standard decoder-only architectures using self-attention and MLP layers; unusual architectures may not be supported.

Getting Started

  1. Install: pip install -U heretic-llm
  2. Choose a model: Start with a 7B-12B model for testing
  3. Run: heretic your-model-name
  4. Evaluate: Chat with the result or upload to Hugging Face
  5. Deploy safely: Add external guardrails before production use

The default settings work well for most models. Advanced users can fine-tune optimization parameters for specific use cases.

Heretic makes model modification accessible. No PhD required - point it at a model and let it work. Just remember to deploy responsibly.
