The world of Large Language Models (LLMs) has exploded, often conjuring images of massive, cloud-bound supercomputers churning out text. But what if you could harness significant AI power right on your personal computer, without constant internet connectivity or hefty cloud subscriptions? The exciting reality is that you can. Thanks to advancements in optimization techniques, a new breed of "small local LLMs" has emerged, delivering remarkable capabilities while fitting comfortably within the memory constraints of consumer-grade hardware – specifically, requiring less than 8GB of RAM or VRAM.
Let's Talk About LLM Quantization First
To effectively leverage small local LLMs, a foundational understanding of key technical concepts is essential. The interplay between hardware components and model optimization techniques dictates performance and accessibility.
A common point of confusion for new users is the difference between VRAM (Video RAM) and system RAM. VRAM is a specialized, high-speed memory located directly on your graphics card (GPU). It is specifically engineered for the rapid, parallel processing tasks that GPUs excel at, such as rendering graphics or performing the massive matrix multiplications central to LLM inference. In contrast, regular system RAM is slower but typically more abundant, serving as the main memory for the computer's central processing unit (CPU) and general applications. For efficient LLM operation, the model's parameters (weights) and intermediate calculations (activations) ideally reside entirely within the fast VRAM, allowing the GPU to access them instantly and process information quickly. If a model's components are forced to reside in slower system RAM, the inference process will be significantly hampered, leading to much slower response times.
The cornerstone technology that makes running large language models feasible on consumer-grade hardware is quantization.

This process drastically reduces the memory footprint of LLMs by representing model weights with fewer bits, for example, using 4-bit or 8-bit integers instead of the standard 16-bit or 32-bit floating-point precision. This technique allows a 7-billion-parameter model, which would typically require approximately 14GB in FP16 (16-bit half precision), to run on as little as 4-5GB using 4-bit quantization. This reduction in memory and computational load directly addresses the barriers of high hardware cost and energy consumption, making advanced AI capabilities accessible on standard consumer devices.
The GGUF format has emerged as the standard for storing and loading quantized local models, offering broad compatibility across various inference engines. Within the GGUF ecosystem, different quantization types exist, each offering a distinct trade-off between file size, quality, and inference speed. For many general use cases, Q4_K_M is frequently recommended because it strikes a good balance between quality and memory efficiency. While quantization is highly effective, pushing to very low bit widths, such as Q2_K or IQ3_XS, can lead to a noticeable degradation in model quality.
It is also important to note that the actual VRAM or RAM requirement for running an LLM is slightly higher than the model's quantized file size. This is because additional memory is needed to store input data (prompts and context) and intermediate calculation results (activations). Typically, this overhead can be estimated as approximately 1.2 times the model's base size.
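To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. The 1.2x overhead factor is the rule of thumb above, and the roughly 4.5 bits per weight assumed for Q4-style quantization is an approximation, not an exact figure for any specific model.

# Rough rule-of-thumb estimate of the memory needed to run a quantized LLM.
def estimate_memory_gb(params_billion: float, bits_per_weight: float,
                       overhead: float = 1.2) -> float:
    """Weights in GB (params * bits / 8) plus ~20% for context and activations."""
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb * overhead

# A 7B model in FP16 versus a Q4-style quantization (~4.5 bits/weight, assumed):
print(f"FP16 : {estimate_memory_gb(7, 16):.1f} GB")   # ~16.8 GB, beyond an 8GB budget
print(f"4-bit: {estimate_memory_gb(7, 4.5):.1f} GB")  # ~4.7 GB, comfortably within it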
Getting Started with Local LLMs and Ollama
The ecosystem for running local LLMs has matured significantly, offering a variety of tools tailored to different user preferences and technical proficiencies. Two prominent platforms stand out for their ease of use and robust capabilities.

Ollama is a powerful and developer-focused tool designed for running LLMs locally with simplicity and efficiency. Its primary interface is a command-line interface (CLI), which allows for straightforward setup and model management. Ollama excels in its built-in model packaging and the "Modelfile" feature, which enables users to customize models and seamlessly integrate them into scripts and various applications. The platform is lightweight and performance-optimized, making it ideal for fast, repeatable deployments in development environments or automated workflows.
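As a rough illustration of that workflow, the sketch below writes a minimal Modelfile that customizes the system prompt and temperature of an already-pulled model, then registers it with ollama create. The model name and parameter values are placeholders; see the Ollama documentation for the full Modelfile syntax.

# Minimal sketch: customize a local model with an Ollama Modelfile.
# Assumes Ollama is installed and llama3.1:8b has already been pulled.
import subprocess
from pathlib import Path

modelfile = '''FROM llama3.1:8b
PARAMETER temperature 0.2
SYSTEM """You are a concise assistant that answers in bullet points."""
'''

Path("Modelfile").write_text(modelfile)

# Register the customized model under a new name, then use it like any other:
#   ollama run bullet-assistant
subprocess.run(["ollama", "create", "bullet-assistant", "-f", "Modelfile"], check=True)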

For users who prefer a graphical interface (GUI), LM Studio is often the go-to choice. It offers an intuitive desktop application with a clean design, a built-in chat interface, and a user-friendly system for browsing and downloading GGUF-formatted models directly from Hugging Face. LM Studio simplifies model management, allowing users to easily switch between different LLMs and adjust parameters directly from the user interface. This immediate visual feedback is particularly beneficial for beginners and non-technical users, facilitating quick experimentation and prompt testing without requiring any command-line knowledge.
Many user-friendly tools, including LM Studio, often leverage Llama.cpp as their underlying inference engine. Llama.cpp is a high-performance inference engine written in C++ that primarily utilizes the GGUF format and supports acceleration on both CPUs and GPUs.
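If you prefer to drive Llama.cpp from code rather than through a GUI, the llama-cpp-python bindings expose the same engine. The sketch below assumes the package is installed (pip install llama-cpp-python) and that a GGUF file has already been downloaded; the file path and settings are placeholders.

# Minimal sketch: run a local GGUF model through the llama-cpp-python bindings.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,        # context window size
    n_gpu_layers=-1,   # offload all layers to the GPU; set to 0 for CPU-only inference
)

output = llm("Q: What is quantization in one sentence? A:", max_tokens=64)
print(output["choices"][0]["text"])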
The following selection highlights ten highly capable small LLMs that can be run locally on systems with less than 8GB of VRAM, offering a balance of performance, versatility, and efficiency. The memory footprints provided focus on quantized GGUF versions, which are optimized for consumer hardware.
Small LLMs That You Can Explore
Llama 3.1 8B (Quantized)
ollama run llama3.1:8b
Meta's Llama 3.1 8B is a highly acclaimed open-source model, recognized for its strong general performance and impressive cost efficiency. It is part of the Llama 3.1 family, which has benefited from substantial improvements in training data and optimization techniques, including a sevenfold increase in training data (over 15 trillion tokens) compared to its predecessors.

While the full-precision 8B model typically requires more VRAM, its lower-bit quantized versions are designed to fit within the 8GB VRAM/RAM limit. For instance, the Q2_K quantization has a file size of 3.18 GB and requires approximately 7.20 GB of memory. Similarly, Q3_K_M (4.02 GB file, 7.98 GB required memory) is a viable option for systems with limited memory.
Llama 3.1 8B excels in conversational AI performance, as measured by AlpacaEval 2.0 Win Rate. It demonstrates strong capabilities in code generation (HumanEval Pass@1), text summarization (CNN/DailyMail Rouge-L-Sum for processing product reviews and emails), and Retrieval-Augmented Generation (RAG) tasks (MS Marco Rouge-L-Sum for accurate question answering and natural language search summarization). It is also effective for generating structured output from text, such as extracting concepts into a JSON payload, and for providing overviews of short code snippets. Its efficiency makes it suitable for batch processing and agentic workflows.
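As a hedged sketch of that structured-output use case, the snippet below calls Ollama's local REST API (which listens on port 11434 by default) and asks llama3.1:8b to return JSON; the review text and the field names in the prompt are purely illustrative.

# Sketch: extract structured JSON from a product review via Ollama's REST API.
import json
import requests

review = "The battery lasts two days, but the camera struggles in low light."

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",
        "prompt": (
            "Extract the product aspects and their sentiment from this review "
            'as a JSON list of objects with keys "aspect" and "sentiment": ' + review
        ),
        "format": "json",   # constrain the output to valid JSON
        "stream": False,
    },
    timeout=120,
)

print(json.loads(response.json()["response"]))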
Mistral 7B (Quantized)
ollama run mistral:7b
Mistral 7B is a fully dense transformer model widely praised for its efficiency, speed, and compact VRAM footprint. It incorporates advanced architectural techniques such as Grouped-Query Attention (GQA) and Sliding Window Attention (SWA) to enhance its performance.

This model is highly optimized for low VRAM environments. Quantized versions like Q4_K_M (4.37 GB file, 6.87 GB required memory) and Q5_K_M (5.13 GB file, 7.63 GB required memory) fit comfortably within an 8GB VRAM budget. Mistral 7B is an excellent choice for fast, self-contained AI inference and real-time applications where low latency is critical. It demonstrates strong performance in general knowledge and structured reasoning tasks. Its compact VRAM footprint makes it suitable for edge device deployment. It is effective for multi-turn chat and can be used in AI chatbot solutions for general inquiries. Its Apache 2.0 license is particularly favorable for commercial use cases.
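For the multi-turn chat use case, a minimal sketch against Ollama's /api/chat endpoint might look like this; the conversation itself is only an illustration.

# Sketch: multi-turn chat with mistral:7b through Ollama's /api/chat endpoint.
import requests

OLLAMA_CHAT = "http://localhost:11434/api/chat"
messages = [{"role": "user", "content": "Give me three ideas for a weekend project."}]

reply = requests.post(
    OLLAMA_CHAT,
    json={"model": "mistral:7b", "messages": messages, "stream": False},
    timeout=120,
).json()["message"]
print(reply["content"])

# Keep the assistant's reply in the history and ask a follow-up in the same conversation.
messages += [reply, {"role": "user", "content": "Expand on the second idea."}]
follow_up = requests.post(
    OLLAMA_CHAT,
    json={"model": "mistral:7b", "messages": messages, "stream": False},
    timeout=120,
).json()["message"]["content"]
print(follow_up)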
Gemma 3:4b (Quantized)
ollama run gemma3:4b
The 4B-parameter Gemma 3 model is a member of Google DeepMind's Gemma family, specifically engineered for efficiency and state-of-the-art performance in a lightweight package. Its memory footprint is exceptionally small, making it highly accessible for a wide range of hardware.

For instance, the Q4_K_M quantization has a file size of 1.71 GB and is recommended for systems with 4GB of VRAM. This minimal memory usage makes it an ideal candidate for rapid prototyping and deployment on very low-end hardware, including mobile devices. Gemma 3:4B is well-suited for basic text generation, question answering, and summarization tasks. It can be effective for quick information retrieval and Optical Character Recognition (OCR) applications. Despite its small size, Gemma 3:4B demonstrates strong performance.
Gemma 7B (Quantized)
ollama run gemma:7b
As the larger sibling in the Gemma family, the 7B model offers enhanced capabilities while remaining runnable on consumer-grade hardware. It shares technical and infrastructure components with Google's more extensive Gemini models, allowing it to achieve high performance directly on developer laptops or desktop computers.

Quantized versions of Gemma 7B, such as Q5_K_M (6.14 GB file) and Q6_K (7.01 GB file), fit comfortably within the 8GB VRAM limit. It generally requires at least 8GB of system RAM for optimal performance. Gemma 7B is a versatile model, capable of handling a wide array of natural language processing tasks, including text generation, question answering, summarization, and reasoning. It demonstrates capabilities in code generation and interpretation, as well as addressing mathematical queries. Its architecture, shared with larger Gemini models, allows for high performance on developer laptops or desktop computers, making it a valuable tool for content creation, conversational AI, and knowledge exploration.
Phi-3 Mini (3.8B, Quantized)
ollama run phi3
Microsoft's Phi-3 Mini is a lightweight, state-of-the-art model distinguished by its exceptional efficiency and its training on high-quality, reasoning-dense data. This model challenges the conventional notion that only larger LLMs can effectively handle complex tasks. Phi-3 Mini is remarkably memory-efficient. For example, the Q8_0 quantization has a file size of 4.06 GB and requires approximately 7.48 GB of memory, placing it well within the 8GB limit.

Even its unquantized FP16 version has a file size of 7.64 GB, though it requires 10.82 GB of memory and therefore exceeds the 8GB budget. Phi-3 Mini excels in language understanding, logical reasoning, coding, and mathematical problem-solving. Its compact size and design make it suitable for memory- and compute-constrained environments and latency-bound scenarios, including deployment on mobile devices. It is particularly well-suited for prompts delivered in a chat format and can serve as a building block for generative AI-powered features.
DeepSeek R1 7B/8B (Quantized)
ollama run deepseek-r1:7b
DeepSeek models, including their 7B and 8B variants, are recognized for their robust reasoning capabilities and computational efficiency. The DeepSeek-R1-0528-Qwen3-8B variant has been highlighted as arguably the strongest reasoning model at the 8B scale, having been distilled from a larger model to achieve high performance. The DeepSeek R1 7B Q4_K_M quantization has a file size of 4.22 GB and requires approximately 6.72 GB of memory.

The DeepSeek R1 8B model has a general model size of 4.9 GB, with a recommended VRAM of 6GB. These configurations fit comfortably within the 8GB constraint. DeepSeek models are strong in natural language understanding, text generation, question answering, and particularly excel in reasoning and code generation. Their relatively low computational footprint makes them an attractive option for small and medium-sized businesses (SMBs) and developers seeking to deploy AI solutions without incurring massive cloud costs, suitable for intelligent customer support systems, advanced data analysis, and automated content generation.
Qwen 1.5/2.5 7B (Quantized)
ollama run qwen:7b
The Qwen series from Alibaba offers a diverse range of models, with the 7B variants serving as a balanced powerhouse for general-purpose AI applications. Qwen 1.5, considered the beta version of Qwen2, provides multilingual support and a stable context length of 32K tokens.

For memory footprint, the Qwen 1.5 7B Q5_K_M quantization has a file size of 5.53 GB. Qwen2.5 7B has a general model size of 4.7 GB, with a recommended VRAM of 6GB. These models are well within the 8GB VRAM limit. The Qwen 7B models are versatile, suitable for conversational AI, content generation, basic reasoning tasks, and language translation. Specifically, the Qwen 7B Chat model demonstrates strong performance in Chinese and English understanding, coding, and mathematics, and supports ReAct Prompting for tool usage. Its efficiency makes it suitable for customer support chatbots and basic programming assistance.
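As a rough sketch of what ReAct-style prompting looks like in practice, the snippet below sends a single tool-use prompt to qwen:7b through Ollama. The tool list and output format are illustrative; a real agent would parse the Action, execute the tool, append an Observation, and call the model again in a loop.

# Sketch: a minimal ReAct-style prompt for tool use with qwen:7b via Ollama.
import requests

REACT_PROMPT = """Answer the question using the ReAct format.

Available tool:
- calculator: evaluates arithmetic expressions, e.g. "12 * 7"

Use this format:
Thought: reason about what to do next
Action: calculator
Action Input: the expression to evaluate
(Stop there and wait for an Observation before giving a Final Answer.)

Question: What is 23 multiplied by 19?
"""

first_step = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen:7b", "prompt": REACT_PROMPT, "stream": False},
    timeout=120,
).json()["response"]

print(first_step)  # expect a Thought / Action / Action Input block to parse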
Deepseek-coder-v2 6.7B (Quantized)
ollama run deepseek-coder-v2:6.7b
Deepseek-coder-v2 6.7B is a specialized model from DeepSeek, meticulously designed for coding-specific tasks. This fine-tuned variant aims to significantly enhance code generation and understanding capabilities. With a model size of 3.8 GB and a recommended VRAM of 6GB, it fits comfortably within the 8GB constraint, making it highly accessible for developers with limited hardware. Its primary use cases include code completion, generating code snippets, and interpreting existing code. For developers and programmers operating with limited VRAM, Deepseek-coder-v2 6.7B offers highly specialized capabilities, establishing it as a top choice for local coding assistance.
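As a small sketch of that local coding assistance, the snippet below sends a function stub to the model pulled with the command above and prints the suggested completion; the prompt wording is deliberately simple rather than a recommended template.

# Sketch: local code completion with the coding model via Ollama's REST API.
import requests

prompt = (
    "Complete this Python function. Return only code.\n\n"
    "def is_palindrome(s: str) -> bool:\n"
    '    """Return True if s reads the same forwards and backwards, ignoring case."""\n'
)

completion = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "deepseek-coder-v2:6.7b", "prompt": prompt, "stream": False},
    timeout=120,
).json()["response"]

print(completion)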
BitNet b1.58 2B4T
ollama run hf.co/microsoft/bitnet-b1.58-2B-4T-gguf
Microsoft's BitNet b1.58 2B4T is a revolutionary open-source model that employs a 1.58-bit weight format, leading to drastic reductions in memory and energy consumption while maintaining competitive performance. Its unparalleled memory efficiency, requiring only 0.4 GB of non-embedding memory, makes it ideally suited for extremely resource-constrained environments, including edge AI devices such as smartphones, laptops, and IoT devices, and for efficient CPU-only inference.

It brings high-performance LLM capabilities to devices that lack dedicated GPU support, enabling on-device translation, content recommendation, and more capable mobile voice assistants without constant cloud connectivity. While it may exhibit slightly less accuracy compared to much larger models, its performance relative to its size is remarkable. Its unparalleled memory efficiency and ability to run effectively on CPUs position it as a game-changer for accessibility and sustainability in the AI landscape.
Orca-Mini 7B (Quantized)
ollama run orca-mini:7b
Orca-Mini 7B is a general-purpose model built upon the Llama and Llama 2 architectures, trained on Orca Style datasets. It is available in various sizes, with the 7B variant proving to be a suitable option for entry-level hardware. The orca-mini:7b model has a file size of 3.8 GB. Quantized versions such as Q4_K_M (4.08 GB file, 6.58 GB required memory) and Q5_K_M (4.78 GB file, 7.28 GB required memory) fit within the 8GB constraint. It generally requires at least 8GB of system RAM for optimal operation. Orca-Mini 7B is well-suited for general text generation, answering questions, and conversational tasks. It demonstrates strong instruction following and can be effectively utilized for building AI agents. The fine-tuned Mistral-7B-OpenOrca variant, based on Orca research, shows exceptional performance in generating text and code, answering questions, and engaging in conversation.
Conclusion
The models highlighted in this article, including Llama 3.1 8B, Mistral 7B, Gemma 3 4B and Gemma 7B, Phi-3 Mini, DeepSeek R1 7B/8B, Qwen 1.5/2.5 7B, Deepseek-coder-v2 6.7B, BitNet b1.58 2B4T, and Orca-Mini 7B, represent the vanguard of this accessibility. Each offers a unique blend of capabilities, memory efficiency, and ideal use cases, making them suitable for a diverse range of tasks from general conversation and creative writing to specialized coding assistance and complex reasoning.
The effectiveness of these models on systems with limited VRAM is largely attributable to advanced quantization techniques, which drastically reduce their memory footprint without severe quality degradation. The continuous advancements in model efficiency and the increasing focus on edge AI deployment signal a future where sophisticated AI capabilities are seamlessly integrated into everyday devices. Users are encouraged to experiment with the recommended models, as the "best" choice is ultimately subjective and depends on individual hardware configurations and specific application requirements. The vibrant open-source community continues to contribute to this evolving landscape, ensuring a dynamic and innovative future for local LLMs.