Welcome! In this tutorial, I'll guide you through running DeepSeek V3 0324, a powerful 671 billion parameter language model, on your local machine. We'll cover everything from preparation to running your first prompts using dynamic quantization techniques that balance model accuracy with memory requirements.
DeepSeek V3 0324: How Good Is It?

DeepSeek's March 2025 update brings significant performance improvements over the previous V3 model:
- MMLU-Pro score improved by 5.3 points (to 81.2)
- AIME score improved by 19.8 points
- LiveCodeBench score improved by 10.0 points
- GPQA score improved by 9.3 points
The original model is a massive 671 billion parameters, which means we need efficient quantization techniques to run it on consumer hardware.
Here are the available quantization options for balancing disk space and accuracy:
| MoE Bits | Type | Disk Size | Accuracy | Details |
|---|---|---|---|---|
| 1.78bit | IQ1_S | 173GB | Okay | 2.06/1.56bit |
| 1.93bit | IQ1_M | 183GB | Fair | 2.5/2.06/1.56bit |
| 2.42bit | IQ2_XXS | 203GB | Suggested | 2.5/2.06bit |
| 2.71bit | Q2_K_XL | 231GB | Suggested | 3.5/2.5bit |
| 3.5bit | Q3_K_XL | 320GB | Great | 4.5/3.5bit |
| 4.5bit | Q4_K_XL | 406GB | Best | 5.5/4.5bit |
The original float8 model takes 715GB, so these quantized versions offer significant space savings!
Step-by-Step Tutorial: Running DeepSeek V3 0324 in llama.cpp
Before we start, let's go over the optimal settings for DeepSeek V3 0324:
- Temperature: 0.3 (use 0.0 for coding tasks)
- Min_P: 0.01 (helps filter out unlikely tokens)
- Chat template: `<|User|>YOUR_PROMPT<|Assistant|>` (a small prompt-building helper is sketched below)
- For KV cache quantization, use 8-bit (not 4-bit) for better performance
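As a quick illustration of that chat template, here is a minimal Python helper that wraps a prompt in the `<|User|>`/`<|Assistant|>` markers. The function and its multi-turn handling are an illustrative sketch, not part of any official DeepSeek SDK:

```python
def build_prompt(user_message, history=None):
    """Wrap a prompt in DeepSeek V3 0324's <|User|>/<|Assistant|> chat template.

    history: optional list of (user, assistant) tuples from earlier turns.
    The multi-turn concatenation here is a reasonable sketch, not an
    official specification.
    """
    prompt = ""
    for user_turn, assistant_turn in (history or []):
        prompt += f"<|User|>{user_turn}<|Assistant|>{assistant_turn}"
    prompt += f"<|User|>{user_message}<|Assistant|>"
    return prompt

print(build_prompt("Create a Flappy Bird game in Python."))
# -> <|User|>Create a Flappy Bird game in Python.<|Assistant|>
```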
Step 1: Set up llama.cpp
First, we need to get and compile llama.cpp:
```bash
# Update packages and install required dependencies
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y

# Clone the llama.cpp repository
git clone https://github.com/ggml-org/llama.cpp

# Configure with CUDA support for GPU (use -DGGML_CUDA=OFF for CPU-only)
# Note: building with CUDA can take about 5 minutes
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON

# Build the necessary tools
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-quantize llama-cli llama-gguf-split

# Copy the built tools for easy access
cp llama.cpp/build/bin/llama-* llama.cpp/
```
Step 2: Download the Quantized Model
Install the required Python packages and download the model:
```bash
pip install huggingface_hub hf_transfer
```

```python
import os

# Enable hf_transfer for faster downloads (set this before importing huggingface_hub)
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

# Download the model (here we're using the 2.71-bit dynamic quant for balance)
snapshot_download(
    repo_id="unsloth/DeepSeek-V3-0324-GGUF",
    local_dir="unsloth/DeepSeek-V3-0324-GGUF",
    allow_patterns=["*UD-Q2_K_XL*"],  # Dynamic 2.71-bit (231GB)
    # Use ["*UD-IQ1_S*"] for the dynamic 1.78-bit quant (173GB) if space is limited
)
```
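The quantized model arrives as a set of split GGUF shards. As a quick sanity check before launching llama.cpp, this small sketch (assuming the `local_dir` used above) lists the shards and their total size:

```python
from pathlib import Path

# Directory created by snapshot_download above
model_dir = Path("unsloth/DeepSeek-V3-0324-GGUF")

shards = sorted(model_dir.rglob("*.gguf"))
total_bytes = sum(f.stat().st_size for f in shards)

for f in shards:
    print(f"{f.name}: {f.stat().st_size / 1e9:.1f} GB")
print(f"{len(shards)} shard(s), {total_bytes / 1e9:.1f} GB total")
```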
Step 3: Run a Test Prompt
Let's test the model with a prompt asking it to create a Flappy Bird game:
```bash
./llama.cpp/llama-cli \
    --model unsloth/DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf \
    --cache-type-k q8_0 \
    --threads 20 \
    --n-gpu-layers 2 \
    -no-cnv \
    --prio 3 \
    --temp 0.3 \
    --min_p 0.01 \
    --ctx-size 4096 \
    --seed 3407 \
    --prompt "<|User|>Create a Flappy Bird game in Python. You must include these things:
1. You must use pygame.
2. The background color should be randomly chosen and is a light shade. Start with a light blue color.
3. Pressing SPACE multiple times will accelerate the bird.
4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.
5. Place on the bottom some land colored as dark brown or yellow chosen randomly.
6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.
7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.
8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.
The final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|Assistant|>"
```
Here's what each parameter does:
- `--model`: Path to the model file (pointing at the first shard is enough; llama.cpp loads the remaining shards automatically)
- `--cache-type-k q8_0`: Uses 8-bit quantization for the KV cache
- `--threads 20`: Number of CPU threads (adjust based on your CPU)
- `--n-gpu-layers 2`: Number of layers to offload to the GPU (lower this if you run into memory issues)
- `-no-cnv`: Disables interactive conversation mode, so the prompt is processed exactly as given
- `--prio 3`: Raises the scheduling priority of the inference process
- `--temp 0.3`: Temperature setting (use 0.0 for deterministic coding)
- `--min_p 0.01`: Minimum probability threshold for token sampling
- `--ctx-size 4096`: Context window size
- `--seed 3407`: Random seed for reproducibility
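For scripted runs, here is a minimal Python wrapper around the same llama-cli invocation. It's a sketch assuming the paths and build from the steps above; `run_llama` is an illustrative helper name, not part of llama.cpp:

```python
import subprocess

def run_llama(prompt, model_path, n_gpu_layers=2):
    """Shell out to the llama-cli binary built in Step 1 with the tutorial's settings."""
    cmd = [
        "./llama.cpp/llama-cli",
        "--model", model_path,
        "--cache-type-k", "q8_0",
        "--threads", "20",
        "--n-gpu-layers", str(n_gpu_layers),
        "-no-cnv",                 # raw completion mode, no interactive chat
        "--prio", "3",
        "--temp", "0.3",
        "--min_p", "0.01",
        "--ctx-size", "4096",
        "--seed", "3407",
        # Wrap the prompt in DeepSeek's chat template
        "--prompt", f"<|User|>{prompt}<|Assistant|>",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

model = ("unsloth/DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/"
         "DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf")
print(run_llama("Write a one-line Python hello world.", model))
```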
Testing DeepSeek V3 0324 with the "Heptagon Challenge"
You can further test your model's capabilities by running the "Heptagon Challenge," which asks the model to create a complex physics simulation with balls bouncing inside a spinning heptagon:
```bash
./llama.cpp/llama-cli \
    --model unsloth/DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf \
    --cache-type-k q8_0 \
    --threads 20 \
    --n-gpu-layers 2 \
    -no-cnv \
    --prio 3 \
    --temp 0.3 \
    --min_p 0.01 \
    --ctx-size 4096 \
    --seed 3407 \
    --prompt "<|User|>Write a Python program that shows 20 balls bouncing inside a spinning heptagon:
- All balls have the same radius.
- All balls have a number on it from 1 to 20.
- All balls drop from the heptagon center when starting.
- Colors are: #f8b862, #f6ad49, #f39800, #f08300, #ec6d51, #ee7948, #ed6d3d, #ec6800, #ec6800, #ee7800, #eb6238, #ea5506, #ea5506, #eb6101, #e49e61, #e45e32, #e17b34, #dd7a56, #db8449, #d66a35
- The balls should be affected by gravity and friction, and they must bounce off the rotating walls realistically. There should also be collisions between balls.
- The material of all the balls determines that their impact bounce height will not exceed the radius of the heptagon, but higher than ball radius.
- All balls rotate with friction, the numbers on the ball can be used to indicate the spin of the ball.
- The heptagon is spinning around its center, and the speed of spinning is 360 degrees per 5 seconds.
- The heptagon size should be large enough to contain all the balls.
- Do not use the pygame library; implement collision detection algorithms and collision response etc. by yourself. The following Python libraries are allowed: tkinter, math, numpy, dataclasses, typing, sys.
- All codes should be put in a single Python file.<|Assistant|>"
```
Optimizing DeepSeek V3 0324 Performance
Flash Attention: For faster decoding, compile llama.cpp with Flash Attention kernels built for all KV-cache quantization types:

```
-DGGML_CUDA_FA_ALL_QUANTS=ON
```

CUDA Architecture: Set your specific CUDA architecture to reduce compilation time:

```
-DCMAKE_CUDA_ARCHITECTURES="80" # Adjust for your GPU
```
Adjusting parameters:
- If you run into out-of-memory issues, try reducing `--n-gpu-layers`
- For CPU-only inference, remove the `--n-gpu-layers` parameter entirely
- Adjust `--threads` based on your CPU core count
Testing DeepSeek API with Apidog
If you're developing applications that use DeepSeek through its API rather than running it locally, Apidog provides powerful tools for API development, testing, and debugging.
Setting Up Apidog for DeepSeek API Testing
Step 1: Download and Install Apidog
1. Visit https://apidog.com/download/ to download the Apidog client for your operating system.
2. Install and launch Apidog, then create an account or sign in with Google/GitHub.
3. When prompted, select your role (e.g., "Fullstack developer") and preferred work mode (e.g., "API design first").
Step 2: Create a New API Project for DeepSeek
- Create a new HTTP project in Apidog for your DeepSeek API testing.
- Add your DeepSeek API endpoint(s) to the project.

Debugging Streaming Responses from DeepSeek
DeepSeek and many other AI models use Server-Sent Events (SSE) for streaming responses. Apidog (version 2.6.49 or higher) has built-in support for debugging SSE:
- Create and configure your DeepSeek API endpoint in Apidog.
- Send the request to your DeepSeek API.
- If the response includes the header `Content-Type: text/event-stream`, Apidog automatically processes it as an SSE stream.
- View the real-time streaming responses in the Timeline view of the response panel. (A sketch of what such a stream looks like on the wire follows this list.)
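For intuition about what Apidog is parsing, here is a minimal Python sketch that requests a streaming completion from DeepSeek's OpenAI-compatible API and prints the chunks as they arrive. The endpoint URL and `deepseek-chat` model name follow DeepSeek's public API, and the API key is a placeholder:

```python
import json
import requests

# DeepSeek's OpenAI-compatible chat completions endpoint
url = "https://api.deepseek.com/chat/completions"
headers = {"Authorization": "Bearer YOUR_DEEPSEEK_API_KEY"}
payload = {
    "model": "deepseek-chat",
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "stream": True,  # ask for an SSE (text/event-stream) response
}

with requests.post(url, headers=headers, json=payload, stream=True) as resp:
    for raw_line in resp.iter_lines():
        line = raw_line.decode("utf-8")
        if not line.startswith("data: "):
            continue  # skip keep-alives and blank separator lines
        data = line[len("data: "):]
        if data == "[DONE]":  # stream terminator
            break
        chunk = json.loads(data)
        # Streaming chunks carry incremental text in choices[0].delta.content
        print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
```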

Apidog has built-in support for popular AI model formats, including:
- OpenAI API Compatible Format
- Gemini API Compatible Format
- Claude API Compatible Format
For DeepSeek specifically, Apidog can display the model's thought process in the timeline, providing insight into the AI's reasoning.
Customizing SSE Response Handling for DeepSeek
If DeepSeek's response format doesn't match Apidog's built-in recognition rules, you can:

Configure JSONPath extraction rules for JSON-formatted SSE responses:
- For a response like: `data: {"choices":[{"index":0,"message":{"role":"assistant","content":"H"}}]}`
- Use the JSONPath: `$.choices[0].message.content` (illustrated in the sketch after this list)

Use post-processor scripts for non-JSON SSE messages:
- Write custom scripts to handle the data format
- Process the messages according to your specific requirements
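As an illustration of what that JSONPath rule extracts, here is a small sketch using the third-party `jsonpath-ng` package (Apidog evaluates the rule internally; this is only to show the behavior):

```python
import json
from jsonpath_ng import parse  # pip install jsonpath-ng

sse_line = 'data: {"choices":[{"index":0,"message":{"role":"assistant","content":"H"}}]}'

# Strip the SSE "data: " prefix, then apply the same JSONPath Apidog would use
payload = json.loads(sse_line[len("data: "):])
matches = parse("$.choices[0].message.content").find(payload)
print(matches[0].value)  # -> "H"
```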
Creating Automated Tests for DeepSeek APIs
Once you have your DeepSeek API endpoint set up, you can create automated tests in Apidog to ensure it functions correctly:
- Create test scenarios for different prompt types in the Tests module.
- Add validation and assertions to verify response structure and content.
- Configure the test scenario to run with different environments (e.g., development, production).
- Set up batch runs to test multiple scenarios at once.
For CI/CD integration, Apidog CLI allows you to run these tests as part of your pipeline:
```bash
# Install the Apidog CLI
npm install -g apidog-cli

# Run a test scenario
apidog run test-scenario -c <collection-id> -e <environment-id> -k <api-key>
```

You can read more about how apidog-cli works in the official documentation.
Performance Testing DeepSeek API
Apidog also offers performance testing capabilities to evaluate how your DeepSeek API performs under load:
1. Create a test scenario that includes calls to your DeepSeek API.
2. Configure the performance test settings:
   - Set the number of virtual users (up to 100)
   - Specify the test duration
   - Configure the ramp-up duration to simulate a gradual increase in users
3. Run the performance test to see key metrics like:
   - Average throughput
   - Average response time
   - Maximum/minimum response time
   - Error rates
This is particularly useful for understanding how your DeepSeek deployment handles multiple concurrent requests.
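For a rough, scriptable counterpart to these metrics, here is a minimal Python sketch that fires a few concurrent requests at the API and reports latency; it reuses the assumed OpenAI-compatible endpoint and placeholder key from the SSE example above:

```python
import time
import statistics
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "https://api.deepseek.com/chat/completions"
HEADERS = {"Authorization": "Bearer YOUR_DEEPSEEK_API_KEY"}
PAYLOAD = {"model": "deepseek-chat",
           "messages": [{"role": "user", "content": "ping"}]}

def timed_request(_):
    """Send one request and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    requests.post(URL, headers=HEADERS, json=PAYLOAD, timeout=120)
    return time.perf_counter() - start

# 10 "virtual users", one request each
with ThreadPoolExecutor(max_workers=10) as pool:
    latencies = list(pool.map(timed_request, range(10)))

print(f"avg {statistics.mean(latencies):.2f}s, "
      f"min {min(latencies):.2f}s, max {max(latencies):.2f}s")
```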
Conclusion
You now have both DeepSeek V3 0324 running locally and the knowledge to test DeepSeek APIs effectively using Apidog! To recap:
- We set up llama.cpp with CUDA support
- Downloaded a quantized version of the model (2.71-bit dynamic quant)
- Ran test prompts to verify the model's capabilities
- Learned how to use Apidog for testing and debugging DeepSeek APIs
- Explored performance optimization tips for both local deployment and API testing
The 2.71-bit dynamic quantization provides an excellent balance between disk space (231GB) and model accuracy, allowing you to run this 671B parameter model efficiently on your own hardware. Meanwhile, Apidog gives you powerful tools to develop, test, and debug DeepSeek API implementations, especially with its SSE debugging capabilities for streaming responses.
Feel free to experiment with different quantization options and Apidog features to find the setup that works best for your specific needs!