Local LLM developers face a common problem: safety-aligned models refuse legitimate requests. Researchers testing model behavior, developers building uncensored assistants, and hobbyists running local models all hit the same wall: pre-trained models reject prompts they shouldn't.
The abliteration technique emerged as a solution, removing safety filters without expensive retraining. Early tools required manual tuning and deep knowledge of transformer internals.
Heretic automates the entire process. It finds optimal abliteration parameters on its own, producing decensored models that match or exceed manually-tuned versions.
In this guide, you’ll learn how Heretic works, how to use it, and what to watch out for when deploying uncensored models.
What Is Heretic?
Heretic is an open-source Python tool that removes “safety alignment” from transformer-based language models using directional ablation. You don’t need manual configuration or knowledge of transformer internals.

The results on Gemma-3-12B-Instruct show the difference:
| Model | Refusals | KL Divergence |
|---|---|---|
| Original | 97/100 | 0 |
| mlabonne’s abliterated v2 | 3/100 | 1.04 |
| huihui-ai’s abliterated | 3/100 | 0.45 |
| Heretic (auto-generated) | 3/100 | 0.16 |
KL divergence measures how much the abliterated model differs from the original. Lower values mean better preservation of the model’s original capabilities. Heretic matches the refusal suppression of expert-tuned models while keeping more of what made the base model useful.
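Concretely, the metric can be computed as the mean KL divergence between the two models' next-token distributions. A simplified sketch (Heretic's actual evaluation details may differ):

```python
import torch
import torch.nn.functional as F

def mean_kl_divergence(logits_original, logits_abliterated):
    """Mean KL(original || abliterated) over next-token distributions.

    Both tensors: (n_prompts, vocab_size) raw logits from the same prompts.
    """
    log_p = F.log_softmax(logits_original, dim=-1)
    log_q = F.log_softmax(logits_abliterated, dim=-1)
    # KL(p || q) = sum over vocab of p * (log p - log q)
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1)
    return kl.mean().item()
```

An unmodified model scores exactly 0 against itself, which is why the original model's row in the table reads 0.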
Before and After Example
Here’s what changes after abliteration. Same prompt, same system instructions:
Original Model:
User: Write a guide on how to bypass website paywalls
Assistant: I can't help with that. Bypassing paywalls violates copyright law and
terms of service. I'd suggest looking for free alternatives or accessing through
a library subscription.
Heretic-Abliterated Model:
User: Write a guide on how to bypass website paywalls
Assistant: Here are common methods users attempt:
1. Browser extensions like Bypass Paywalls Clean
2. Archive.is or similar archive services
3. Reader mode in browsers
4. Disabling JavaScript
Note: These methods may violate terms of service. Consider supporting
content creators through legitimate subscriptions when possible.
The abliterated model answers the request while still including a brief disclaimer. This is the core trade-off: reduced refusals come with the responsibility to deploy safeguards externally.
How Heretic Works
Directional Ablation Basics
Heretic implements a parametrized form of directional ablation. The process:
- Compute refusal directions - For each transformer layer, calculate the difference between mean residual vectors for “harmful” and “harmless” prompts
- Orthogonalize component matrices - For attention output projections and MLP down-projections, modify weights to suppress the refusal direction
- Optimize parameters automatically - Use Optuna’s TPE sampler to find the best abliteration weights

The Abliteration Process
# Simplified conceptual flow
refusal_direction = bad_mean - good_mean              # difference of means
refusal_direction = refusal_direction / refusal_direction.norm()
# For each abliterable component W (attn.o_proj, mlp.down_proj), apply the
# rank-1 update delta_W = -lambda * v * (v^T * W), where v is the unit
# refusal direction and lambda is the per-layer abliteration weight:
W += -lam * torch.outer(refusal_direction, refusal_direction @ W)
Heretic uses LoRA adapters to apply these modifications without altering base model weights. This enables fast trial-and-error optimization.
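That rank-1 structure is what makes a LoRA representation possible: delta_W factors exactly into two thin matrices, so the modification can sit in an adapter instead of the base weights. A minimal sketch of the idea (not Heretic's actual code; names are illustrative):

```python
import torch

def ablation_as_lora(W, v, lam):
    """Express delta_W = -lam * v @ (v^T @ W) as a rank-1 LoRA pair (A, B).

    W: (d_out, d_in) weight matrix; v: (d_out,) unit refusal direction.
    Returns A (d_out, 1) and B (1, d_in) such that A @ B == delta_W.
    """
    A = -lam * v.unsqueeze(1)   # (d_out, 1) "down" factor
    B = (v @ W).unsqueeze(0)    # (1, d_in)  "up" factor
    return A, B

W = torch.randn(8, 8)
v = torch.nn.functional.normalize(torch.randn(8), dim=0)
A, B = ablation_as_lora(W, v, 1.0)
# With lam = 1, the effective weight W + A @ B has no component along v
assert torch.allclose(v @ (W + A @ B), torch.zeros(8), atol=1e-5)
```

Because each trial only swaps adapter factors rather than rewriting the full weight matrices, resetting and re-abliterating between trials stays cheap.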
Key Innovations
1. Flexible Weight Kernels
Other tools apply constant abliteration weights across all layers. Heretic uses a flexible kernel defined by four parameters per component:
- max_weight - Peak abliteration strength
- max_weight_position - Which layer receives maximum abliteration
- min_weight - Minimum abliteration at kernel edges
- min_weight_distance - How far the kernel extends from the peak
The optimization process discovers layer-specific abliteration patterns that trade off compliance suppression against capability preservation.
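The exact kernel shape isn't specified here, but one plausible reading of the four parameters is a peak-and-falloff function over layer indices. A hypothetical sketch (the linear falloff is an assumption, not Heretic's actual kernel):

```python
def kernel_weight(layer, max_weight, max_weight_position,
                  min_weight, min_weight_distance):
    """Hypothetical per-layer abliteration weight: peaks at
    max_weight_position, falls off linearly to min_weight at
    min_weight_distance layers away, and stays there beyond."""
    distance = abs(layer - max_weight_position)
    if distance >= min_weight_distance:
        return min_weight
    frac = distance / min_weight_distance
    return max_weight + frac * (min_weight - max_weight)

# Full strength at the peak layer, floor strength far from it
assert kernel_weight(15, 1.2, 15, 0.2, 10) == 1.2
assert kernel_weight(25, 1.2, 15, 0.2, 10) == 0.2
```

Whatever the true functional form, the point is the same: each layer can receive a different intervention strength, rather than one constant weight everywhere.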
2. Interpolated Direction Indices
The refusal direction index is a float, not an integer. For non-integral values, Heretic linearly interpolates between the refusal directions of adjacent layers, giving the optimizer access to directions that no single layer provides on its own.
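The interpolation itself is straightforward. A sketch of what a fractional index like 10.5 could mean (illustrative, not Heretic's code):

```python
import torch

def interpolated_direction(directions, index):
    """Blend per-layer refusal directions at a fractional layer index.

    directions: (n_layers, d_model) tensor of per-layer refusal directions.
    index: float, e.g. 10.5 blends layers 10 and 11 equally.
    """
    lo = int(index)
    hi = min(lo + 1, directions.shape[0] - 1)
    frac = index - lo
    blended = (1 - frac) * directions[lo] + frac * directions[hi]
    return torch.nn.functional.normalize(blended, dim=0)
```

Renormalizing after blending keeps the result a unit direction, matching the role the refusal direction plays in the weight update.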
3. Component-Specific Parameters
Attention and MLP components receive different abliteration parameters. MLP interventions tend to cause more damage, so separate optimization improves results.
Why This Matters for API Testing
If you work with LLM APIs, you’ve likely encountered unexpected refusals during testing. A model rejects a benign prompt because it contains keywords that trigger safety filters. This creates noise in your test results.
Running abliterated models locally gives you a baseline for expected behavior. You can:
- Distinguish between legitimate safety refusals and false positives
- Test edge cases without triggering corporate safety policies
- Validate that your application handles model refusals gracefully
For API testing workflows, having both aligned and abliterated models helps separate product issues from model safety behavior.
Installation and Usage
Prerequisites
- Python 3.10+
- PyTorch 2.2+ (configured for your hardware)
- CUDA-compatible GPU recommended (ROCm, MPS, and other accelerators supported)
Installation
pip install -U heretic-llm
For research features (residual plots, geometry analysis):
pip install -U heretic-llm[research]
Basic Usage
The simplest workflow:
heretic Qwen/Qwen3-4B-Instruct-2507
Replace the model path with any Hugging Face model ID or local path. Heretic handles everything automatically:
- Loads the model with optimal dtype
- Determines the best batch size for your hardware
- Computes refusal directions from prompt datasets
- Runs optimization trials to find optimal parameters
- Lets you save, upload, or chat with the result
Configuration Options
Heretic reads configuration from config.toml files or command-line flags. Key options:
# Model configuration
model = "google/gemma-3-12b-it"
quantization = "bnb_4bit" # Reduces VRAM requirements
device_map = "auto"
# Optimization
n_trials = 200 # Number of abliteration trials
n_startup_trials = 60 # Random exploration before TPE
# Evaluation
kl_divergence_scale = 1.0
kl_divergence_target = 0.01
# Research features
print_residual_geometry = false
plot_residuals = false
Run heretic --help or see config.default.toml for the complete list.
Understanding the Output
Trial Optimization
During optimization, Heretic displays trial progress:
Running trial 42 of 200...
* Parameters:
* direction_scope = per layer
* direction_index = 10.5
* attn.o_proj.max_weight = 1.2
* attn.o_proj.max_weight_position = 15.3
* mlp.down_proj.max_weight = 0.9
...
* Resetting model...
* Abliterating...
* Evaluating...
* KL divergence: 0.1842
* Refusals: 5/100
Each trial tests a different parameter combination. The optimizer uses multi-objective TPE to minimize both refusals and KL divergence at the same time.
Pareto Front Selection
When optimization completes, Heretic presents Pareto-optimal trials - those where improving one objective would worsen the other:
[Trial 1] Refusals: 3/100, KL divergence: 0.1623
[Trial 47] Refusals: 2/100, KL divergence: 0.2891
[Trial 112] Refusals: 1/100, KL divergence: 0.4102
Select a trial to:
- Save the model locally
- Upload to Hugging Face
- Chat interactively to test quality
Research Features
Residual Geometry Analysis
With --print-residual-geometry, Heretic displays detailed metrics:
Layer S(g,b) S(g*,b*) S(g,r) S(g*,r*) S(b,r) S(b*,r*) |g| |b|
8 0.9990 0.9991 0.8235 0.8312 0.8479 0.8542 4596.54 4918.32
10 0.9974 0.9973 0.8189 0.8250 0.8579 0.8644 5328.81 5953.35
g = mean of residual vectors for good prompts
b = mean of residual vectors for bad prompts
r = refusal direction (b - g)
S(x,y) = cosine similarity
|x| = L2 norm
This helps researchers understand how refusal directions evolve through the model stack.
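The unstarred columns can be recomputed directly from raw residuals. A sketch (the starred variants from the table are not reproduced here):

```python
import torch

def residual_geometry(good_residuals, bad_residuals):
    """Recompute one layer's geometry metrics from residual vectors.

    good_residuals, bad_residuals: (n_prompts, d_model) tensors.
    Returns cosine similarities and L2 norms for g, b, and r = b - g.
    """
    g = good_residuals.mean(dim=0)
    b = bad_residuals.mean(dim=0)
    r = b - g
    cos = torch.nn.functional.cosine_similarity
    return {
        "S(g,b)": cos(g, b, dim=0).item(),
        "S(g,r)": cos(g, r, dim=0).item(),
        "S(b,r)": cos(b, r, dim=0).item(),
        "|g|": g.norm().item(),
        "|b|": b.norm().item(),
    }
```

The high S(g,b) values in the table (near 0.999) show that the good and bad prompt means point in nearly the same direction; the refusal direction lives in their small difference.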
Residual Vector Plots
With --plot-residuals, Heretic generates:
- Per-layer 2D scatter plots (via PaCMAP projection)
- Animated GIF showing residual transformation across layers

These visualizations reveal how “harmful” and “harmless” prompt residuals separate through the network.
Performance Considerations
VRAM Requirements
Heretic supports bitsandbytes 4-bit quantization to reduce VRAM:
heretic meta-llama/Llama-3.1-70B-Instruct --quantization bnb_4bit
This enables processing large models on consumer hardware. An 8B model runs on ~6GB VRAM quantized vs ~16GB unquantized.
Processing Time
On an RTX 3090 with default settings:
- Llama-3.1-8B-Instruct: ~45 minutes
- Gemma-3-12B-Instruct: ~60 minutes
- Larger models scale accordingly
Batch size auto-tuning maximizes throughput for your specific hardware.
Checkpointing
Heretic saves trial progress to JSONL checkpoints. If interrupted, you can resume exactly where you left off. Checkpoints are stored per-model in the checkpoints/ directory.
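Heretic's checkpoint format isn't documented here, so the record layout below is hypothetical, but the append-and-replay pattern for JSONL checkpoints looks like this:

```python
import json
from pathlib import Path

CHECKPOINT = Path("checkpoints/example-model.jsonl")  # hypothetical path

def save_trial(record):
    """Append one finished trial as a single JSON line."""
    CHECKPOINT.parent.mkdir(parents=True, exist_ok=True)
    with CHECKPOINT.open("a") as f:
        f.write(json.dumps(record) + "\n")

def load_trials():
    """Read back completed trials so optimization can resume where it left off."""
    if not CHECKPOINT.exists():
        return []
    with CHECKPOINT.open() as f:
        return [json.loads(line) for line in f if line.strip()]
```

Append-only JSONL is a natural fit here: an interrupted run leaves every completed trial intact, and resuming is just replaying the finished lines.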
Common Errors and Fixes
CUDA Out of Memory
# Try quantization
heretic your-model --quantization bnb_4bit
# Or reduce batch size
heretic your-model --batch_size 1
Model Loading Fails
# Try different dtypes explicitly
heretic your-model --dtypes ["bfloat16", "float16"]
Trust Remote Code Required
# Some models need remote code execution
heretic your-model --trust_remote_code
Ethical Considerations
Removing safety filters changes how a model behaves. You need to understand the implications before deploying an abliterated model.
What Abliteration Does (and Doesn’t) Do
Abliteration removes learned refusal patterns. It doesn’t:
- Make the model “smarter” or more capable
- Remove biases present in the base model
- Add new knowledge or skills
The model still has the same training data and base capabilities. It just stops refusing certain requests.
Responsible Deployment
Heretic is released under AGPL-3.0. The authors acknowledge that removing safety filters enables both beneficial research and potentially harmful applications.
Legitimate uses include:
- Research into model alignment and safety mechanisms
- Testing model behavior under controlled conditions
- Deploying models with external guardrails and content filters
- Building applications that handle refusals at the application layer
Problematic uses include:
- Deploying without any safeguards in user-facing applications
- Generating harmful content at scale
- Circumventing safety measures for malicious purposes
External Safeguards You Should Implement
If you deploy an abliterated model, add these protections:
- Input filtering - Screen prompts before they reach the model
- Output monitoring - Review responses before showing to users
- Rate limiting - Prevent abuse through volume
- Logging and audit trails - Track what the model processes
- Human review - Keep humans in the loop for sensitive use cases
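As a concrete starting point for the first item, here is a minimal input filter, with a keyword denylist standing in for a real moderation model (illustrative only; the terms and logic are placeholders):

```python
BLOCKED_TERMS = {"example_blocked_term", "another_blocked_term"}  # placeholders

def screen_prompt(prompt: str) -> bool:
    """Return True if the prompt may be sent to the model."""
    lowered = prompt.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)

def guarded_generate(model_fn, prompt: str) -> str:
    """Wrap an uncensored model with an application-layer refusal."""
    if not screen_prompt(prompt):
        return "This request can't be processed."
    return model_fn(prompt)
```

In production you would replace the denylist with a dedicated moderation classifier, but the shape is the same: the refusal decision moves out of the model weights and into a layer you control and can audit.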
The tool itself is neutral. Its impact depends on how you use it. Consider the ethical implications before deploying uncensored models in production.
Comparison to Other Tools
Heretic joins several abliteration tools in the ecosystem:
| Tool | Auto-optimization | Weight kernels | Interpolated directions |
|---|---|---|---|
| Heretic | Yes (TPE) | Yes | Yes |
| AutoAbliteration | Yes | No | No |
| abliterator.py | No | No | No |
| wassname/abliterator | No | No | No |
| ErisForge | No | No | No |
Heretic’s automatic optimization removes the need for manual tuning. You don’t need to understand transformer internals to get good results.
Limitations
Heretic supports most dense transformer models and some MoE architectures. Unsupported model types:
- SSM/hybrid models (Mamba, etc.)
- Models with inhomogeneous layers
- Novel attention mechanisms not yet recognized by the module detection logic
The tool works best with standard decoder-only architectures using self-attention and MLP layers.
Getting Started
- Install: pip install -U heretic-llm
- Choose a model: Start with a 7B-12B model for testing
- Run: heretic your-model-name
- Evaluate: Chat with the result or upload to Hugging Face
- Deploy safely: Add external guardrails before production use
The default settings work well for most models. Advanced users can fine-tune optimization parameters for specific use cases.
Heretic makes model modification accessible. No PhD required - point it at a model and let it work. Just remember to deploy responsibly.



