10 Best Small Local LLMs to Run on 8GB RAM or VRAM (No Cloud Required)

Discover 10 top small local LLMs that run on laptops with under 8GB of RAM or VRAM—no cloud required. Learn how quantization works, compare model strengths, and see the best tools for efficient AI deployments on your own hardware.

Mark Ponomarev

15 June 2026

Modern large language models (LLMs) have revolutionized AI, but they often seem out of reach—requiring expensive GPUs, constant cloud access, or high monthly fees. What if you could run advanced AI right on your laptop or workstation, offline, with less than 8GB of RAM or VRAM?

Today, thanks to model quantization and efficient local inference tools, developers can harness impressive LLMs directly on consumer hardware. This guide explains the core concepts, compares top local LLMs, and shows how API-focused teams can leverage these advances—no cloud dependency required.

💡 Looking for an API platform that boosts team productivity? Generate beautiful API documentation, collaborate seamlessly, and replace Postman at a better price with Apidog—all in one place. Learn how Apidog supports high-performance teams.

Understanding Local LLMs: Quantization & Hardware Basics

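Before running LLMs on your own machine, it helps to understand how quantization and memory interact. Quantization stores model weights at lower precision (for example, 4 bits per weight instead of 16), so the memory needed to load a model shrinks roughly in proportion to the bit width.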
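
As a rough illustration of how bit width drives memory use, the sketch below estimates the space taken by model weights alone as parameters × bits per weight ÷ 8. The function names and the 1.5 GB overhead allowance for the runtime and context cache are illustrative assumptions, not measured values. Real quantized files usually come out somewhat larger than the nominal bit width suggests, because they also store scaling factors and metadata.

```python
def estimate_weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits_in_budget(
    n_params_billion: float,
    bits_per_weight: float,
    budget_gb: float = 8.0,
    overhead_gb: float = 1.5,  # hypothetical allowance for runtime + context cache
) -> bool:
    """Return True if the weights plus the assumed overhead fit in the budget."""
    needed = estimate_weight_memory_gb(n_params_billion, bits_per_weight) + overhead_gb
    return needed <= budget_gb


# Compare an 8B-parameter model at three common precisions.
for bits in (16, 8, 4):
    size = estimate_weight_memory_gb(8, bits)
    verdict = "fits" if fits_in_budget(8, bits) else "does not fit"
    print(f"8B model at {bits}-bit: ~{size:.1f} GB of weights, {verdict} in 8 GB")
```

Under these assumptions, only the 4-bit variant of an 8B model leaves room for the runtime inside an 8GB budget, which is why the models in this guide are listed as quantized.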


How to Run Local LLMs: Ollama & LM Studio

Several mature tools make local LLM deployment easy—even for developers new to AI:

1. Ollama

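A CLI-first, developer-focused tool for running LLMs locally. Models are downloaded and started with a single command, such as the ollama run commands listed for each model below.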
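
Beyond the CLI, Ollama also serves a local HTTP API, which is often the more useful entry point for backend code. The following is a minimal sketch, assuming Ollama is running on its default address (http://localhost:11434) and that the requested model has already been pulled. The helper names build_generate_request and generate are hypothetical, introduced only for this example.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (assumed unchanged).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # With "stream": False the reply is a single JSON object whose
    # "response" field holds the full completion.
    return body["response"]
```

With the server running, a call such as generate("llama3.1:8b", "Explain HTTP 429 in one sentence.") would return the model's reply as a string. Only the standard library is used, so the snippet can be dropped into a test script without extra dependencies.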

2. LM Studio

Prefer a GUI? LM Studio provides a desktop application for browsing, downloading, and chatting with local models.

Under the Hood:
Many of these tools use llama.cpp for fast inference, supporting both CPU and GPU acceleration.


Top 10 Small Local LLMs Under 8GB VRAM/RAM

Below are ten high-performing LLMs you can run locally on standard hardware. Each section includes quantized file sizes and best use cases for API and backend teams.


1. Llama 3.1 8B (Quantized)

Command: ollama run llama3.1:8b

Meta’s Llama 3.1 8B is a versatile open-source model with impressive general and coding performance.


2. Mistral 7B (Quantized)

Command: ollama run mistral:7b

Highly optimized, with innovations like Grouped-Query Attention (GQA) and Sliding Window Attention (SWA).
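
Both techniques can be illustrated in a few lines. The sketch below is not Mistral's implementation: it only shows the attention pattern that a sliding window imposes (each token sees itself and a fixed number of preceding tokens) and how grouped-query attention maps several query heads onto one shared key/value head. The function names are hypothetical. The 32 query heads and 8 key/value heads used in the example match the figures reported for Mistral 7B, while the tiny window of 3 is chosen only so the mask is readable.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Attention is causal (j <= i) and limited to the last `window` positions.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Return the key/value head shared by a given query head under GQA."""
    group_size = n_q_heads // n_kv_heads  # query heads per key/value head
    return q_head // group_size


# Print a small mask: 'x' marks allowed attention, '.' marks masked positions.
for row in sliding_window_mask(seq_len=6, window=3):
    print("".join("x" if allowed else "." for allowed in row))

# With 32 query heads and 8 key/value heads, four query heads share each KV head.
print([kv_head_for_query_head(h, 32, 8) for h in range(8)])
```

The practical effect for small-memory deployments is that the key/value cache grows with the window size and the number of key/value heads rather than with the full sequence length and the number of query heads.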


3. Gemma 3 4B (Quantized)

Command: ollama run gemma3:4b

Google DeepMind’s compact 4B model—ultra lightweight.


4. Gemma 7B (Quantized)

Command: ollama run gemma:7b

An earlier-generation Gemma model, larger than Gemma 3 4B and built from the same research and technology as Google's Gemini models.


5. Phi-3 Mini (3.8B, Quantized)

Command: ollama run phi3

Microsoft’s compact, logic-focused model—efficient and strong in reasoning.


6. DeepSeek R1 7B/8B (Quantized)

Command: ollama run deepseek-r1:7b

Known for excellent reasoning and code performance.
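
When calling a reasoning model like this from backend code, note that DeepSeek R1 variants typically emit their intermediate reasoning between <think> and </think> tags before the final answer. The sketch below separates the two so that only the answer is passed downstream. The function name split_reasoning is hypothetical, and the tag format is assumed to match what the model actually returns.

```python
import re

# Matches one <think>...</think> block, including newlines inside it.
_THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> tuple[str, str]:
    """Split raw model output into (reasoning, answer).

    If no <think> block is present, reasoning is an empty string and the
    answer is the input with surrounding whitespace removed.
    """
    reasoning = "\n".join(block.strip() for block in _THINK_BLOCK.findall(raw))
    answer = _THINK_BLOCK.sub("", raw).strip()
    return reasoning, answer


reasoning, answer = split_reasoning(
    "<think>The user wants a status code for rate limiting.</think>\nUse HTTP 429."
)
print(answer)
```

Keeping the reasoning separate rather than discarding it can help when debugging unexpected answers.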


7. Qwen 1.5/2.5 7B (Quantized)

Command: ollama run qwen:7b (Qwen 1.5) or ollama run qwen2.5:7b (Qwen 2.5)

Alibaba’s multilingual, context-rich models.


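8. DeepSeek Coder 6.7B (Quantized)

Command: ollama run deepseek-coder:6.7b

Specialized for code generation and understanding. Note that 6.7B is a size of the original DeepSeek Coder family; DeepSeek-Coder-V2 ships in larger sizes.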


9. BitNet b1.58 2B4T

Command: ollama run hf.co/microsoft/bitnet-b1.58-2B-4T-gguf

Microsoft’s ultra-efficient 1.58-bit weight model—exceptional for edge and CPU-only inference.
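
The "1.58-bit" figure comes from restricting each weight to one of three values (-1, 0, or +1), since log2(3) is about 1.58. The sketch below illustrates the idea, assuming the absmean scheme described for BitNet b1.58: weights are divided by their mean absolute value, rounded, and clipped to the ternary range. It is a toy illustration on a flat list and not Microsoft's implementation, since the real model is trained with this constraint rather than quantized afterward. The function names are hypothetical.

```python
import math


def absmean_ternary_quantize(weights: list[float], eps: float = 1e-8) -> tuple[list[int], float]:
    """Map each weight to -1, 0, or +1 using an absmean scale.

    Returns the ternary weights and the scale needed to approximate the originals.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Approximate the original weights from ternary values and the scale."""
    return [q * scale for q in quantized]


weights = [0.9, -0.05, 0.4, -1.2]
ternary, scale = absmean_ternary_quantize(weights)
print("ternary weights:", ternary)
print("approximation:  ", [round(w, 3) for w in dequantize(ternary, scale)])
print("bits per ternary weight:", round(math.log2(3), 2))
```

Because every weight is -1, 0, or +1, matrix multiplication reduces largely to additions and subtractions, which is part of why this design suits CPU-only inference.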


10. Orca-Mini 7B (Quantized)

Command: ollama run orca-mini:7b

A general-purpose model based on Llama/Llama 2, trained on Orca-style data.


These small models run happily on a single machine, but once you need to serve one to real traffic, a high-throughput inference engine such as vLLM is a common way to turn it into a scalable API.

And if even an 8GB footprint is more than your hardware allows, you can still run these models remotely — free open source LLM APIs serve many of the same weights from hosted endpoints.

Choosing a model that fits in 8GB is only half the setup — you still need software to load and serve it, and the best tools for running LLMs locally cover that side of the equation.

Key Takeaways for API & Backend Teams

For teams building, testing, or documenting APIs, maximizing efficiency is critical. Apidog’s unified platform helps you collaborate, generate robust API docs, and streamline development workflows, making it an ideal complement to local LLM solutions. Boost your team’s productivity with Apidog and see why it’s a more affordable Postman alternative.
