ExLlama: Memory-Efficient Llama Implementation for Quantized Weights

ExLlama

Type: Open Source Projects
Last Updated: 2025/10/18
Description: ExLlama is a memory-efficient, standalone Python/C++/CUDA implementation of Llama for fast inference with 4-bit GPTQ quantized weights on modern GPUs.
Tags: Llama inference, GPTQ quantization, CUDA, memory efficiency, large language models

Overview of ExLlama


ExLlama is a standalone Python/C++/CUDA implementation of Llama designed for speed and memory efficiency when using 4-bit GPTQ weights on modern GPUs. This project aims to provide a faster and more memory-efficient alternative to the Hugging Face Transformers implementation, particularly for users working with quantized models.

What is ExLlama?

ExLlama is designed to be a high-performance inference engine for the Llama family of language models. It leverages CUDA for GPU acceleration and is optimized for 4-bit GPTQ quantized weights, enabling users to run large language models on GPUs with limited memory.
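
To see why 4-bit weights matter, a back-of-the-envelope estimate of the weight storage alone is instructive. The figures below are illustrative only; real memory use also includes activations, the KV cache, and quantization metadata such as scales and zero points.

# Rough, illustrative VRAM estimate for the weight tensors alone.
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GiB."""
    return n_params * bits_per_weight / 8 / 1024**3

for n_params, label in [(7e9, "7B"), (13e9, "13B"), (33e9, "33B"), (65e9, "65B")]:
    print(f"{label}: ~{weight_gib(n_params, 16):.1f} GiB at FP16 vs ~{weight_gib(n_params, 4):.1f} GiB at 4-bit")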

How does ExLlama work?

ExLlama optimizes memory usage and inference speed through several techniques:

  • CUDA Implementation: Utilizes CUDA for efficient GPU computation.
  • Quantization Support: Specifically designed for 4-bit GPTQ quantized weights (a simplified sketch of the packed format follows this list).
  • Memory Efficiency: Reduces memory footprint compared to standard implementations.
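
As a rough illustration of the storage format, GPTQ-style 4-bit schemes pack eight 4-bit integers into each 32-bit word and reconstruct a weight from a per-group scale and zero point. The snippet below is a simplified, generic sketch of that idea, not ExLlama's actual kernel code or tensor layout.

import torch

# Unpack eight 4-bit values from each int32 word (results in [0, 15]).
def unpack_int4(qweight: torch.Tensor) -> torch.Tensor:
    shifts = torch.arange(0, 32, 4, device=qweight.device)
    return (qweight.unsqueeze(-1) >> shifts) & 0xF

# One packed word holding eight quantized weights, plus a per-group
# scale and zero point (values here are made up for illustration).
qweight = torch.tensor([0x76543210], dtype=torch.int32)
scale, zero = 0.05, 8

q = unpack_int4(qweight).flatten()   # tensor([0, 1, 2, 3, 4, 5, 6, 7])
w = scale * (q.float() - zero)       # dequantized weights
print(w)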

Key Features and Benefits:

  • High Performance: Optimized for fast inference.
  • Memory Efficiency: Allows running large models on less powerful GPUs.
  • Standalone Implementation: No need for the Hugging Face Transformers library.
  • Web UI: Includes a simple web UI for easy interaction with the model (JavaScript written by ChatGPT, so beware!).
  • Docker Support: Can be run in a Docker container for easier deployment and security.

How to use ExLlama?

  1. Installation:

    • Clone the repository: git clone https://github.com/turboderp/exllama
    • Navigate to the directory: cd exllama
    • Install dependencies: pip install -r requirements.txt
  2. Running the Benchmark:

    • python test_benchmark_inference.py -d <path_to_model_files> -p -ppl
  3. Running the Chatbot Example:

    • python example_chatbot.py -d <path_to_model_files> -un "Jeff" -p prompt_chatbort.txt
  4. Web UI:

    • Install additional dependencies: pip install -r requirements-web.txt
    • Run the web UI: python webui/app.py -d <path_to_model_files>
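
Each of these commands points -d at a local model directory containing the model's config.json, tokenizer.model, and .safetensors weight file(s), the same files the Python example later on this page loads. A quick, illustrative way to confirm the directory layout (the path below is a placeholder):

import glob
import os

model_directory = "/path/to/your/model"  # the directory passed to -d

for name in ("config.json", "tokenizer.model"):
    present = os.path.isfile(os.path.join(model_directory, name))
    print(f"{name}: {'found' if present else 'MISSING'}")

shards = glob.glob(os.path.join(model_directory, "*.safetensors"))
print(f".safetensors weight files: {len(shards)}")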

Why choose ExLlama?

ExLlama offers several advantages:

  • Performance: Delivers faster inference than the standard Hugging Face Transformers implementation when running 4-bit GPTQ models.
  • Accessibility: Enables users with limited GPU memory to run large language models.
  • Flexibility: Can be integrated into other projects via the Python module.
  • Ease of Use: Provides a simple web UI for interacting with the model.

Who is ExLlama for?

ExLlama is suitable for:

  • Researchers and developers working with large language models.
  • Users with NVIDIA GPUs (30-series and later recommended).
  • Those seeking a memory-efficient and high-performance inference solution.
  • Anyone interested in running Llama models with 4-bit GPTQ quantization.

Hardware Requirements:

  • NVIDIA GPUs (RTX 30-series or later recommended)
  • ROCm (AMD) support is theoretically possible but untested

Dependencies:

  • Python 3.9+
  • PyTorch (tested on 2.0.1 and 2.1.0 nightly) with CUDA 11.8
  • safetensors 0.3.2
  • sentencepiece
  • ninja
  • flask and waitress (for web UI)
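
To confirm a local environment matches the requirements above, a short check along these lines can help (an illustrative script, not part of the repository; RTX 30-series and newer GPUs report CUDA compute capability 8.0 or higher):

import torch
from importlib.metadata import version

print("PyTorch:", torch.__version__)           # tested versions: 2.0.1, 2.1.0 nightly
print("CUDA build:", torch.version.cuda)       # expected: 11.8
print("safetensors:", version("safetensors"))  # expected: 0.3.2

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU: {torch.cuda.get_device_name(0)} (compute capability {major}.{minor})")
else:
    print("No CUDA device detected.")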

Docker Support:

ExLlama can be run in a Docker container for easier deployment and security. The Docker image supports NVIDIA GPUs.

Results and Benchmarks:

ExLlama demonstrates significant performance improvements compared to other implementations, especially in terms of tokens per second (t/s) during inference. Benchmarks are provided for various Llama model sizes (7B, 13B, 33B, 65B, 70B) on different GPU configurations.

Example usage

import os
import torch
from exllama.model import ExLlama, ExLlamaCache, ExLlamaConfig
from exllama.tokenizer import ExLlamaTokenizer

# Initialize model and tokenizer
model_directory = "/path/to/your/model"
tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")

config = ExLlamaConfig(model_config_path)
config.model_path = os.path.join(model_directory, "model.safetensors")

tokenizer = ExLlamaTokenizer(tokenizer_path)
model = ExLlama(config)
cache = ExLlamaCache(model)

# Prepare input
prompt = "The quick brown fox jumps over the lazy"
input_ids = tokenizer.encode(prompt)

# Run the forward pass and sample a single next token
model.forward(input_ids, cache)
token = model.sample(temperature = 0.7, top_k = 50, top_p = 0.7)

output = tokenizer.decode([token])
print(prompt + output)
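
For most applications, the repository's bundled generator class is the more convenient entry point, since it wraps encoding, the forward pass, sampling, and decoding in a single call. The sketch below is illustrative rather than authoritative: it assumes the same package-style imports as the example above, whereas the upstream repository imports these classes from plain top-level modules (e.g. from generator import ExLlamaGenerator), so consult the bundled example scripts for the exact interface.

import os
from exllama.model import ExLlama, ExLlamaCache, ExLlamaConfig
from exllama.tokenizer import ExLlamaTokenizer
from exllama.generator import ExLlamaGenerator  # 'generator' module upstream

model_directory = "/path/to/your/model"

config = ExLlamaConfig(os.path.join(model_directory, "config.json"))
config.model_path = os.path.join(model_directory, "model.safetensors")

model = ExLlama(config)
tokenizer = ExLlamaTokenizer(os.path.join(model_directory, "tokenizer.model"))
cache = ExLlamaCache(model)

# The generator owns the sampling settings and the generation loop.
generator = ExLlamaGenerator(model, tokenizer, cache)
generator.settings.temperature = 0.7
generator.settings.top_k = 50
generator.settings.top_p = 0.7

# generate_simple() tokenizes the prompt, runs the model, and decodes the result.
print(generator.generate_simple("The quick brown fox jumps over the lazy", max_new_tokens = 20))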

Compatibility and Model Support:

ExLlama is compatible with a range of Llama models, including Llama 1 and Llama 2. The project is continuously updated to support new models and features.

ExLlama is a powerful tool for anyone looking to run Llama models efficiently. Its focus on memory optimization and speed makes it an excellent choice for both research and practical applications.

Best Alternative Tools to "ExLlama"

LM Studio
LM Studio enables you to run local AI models like gpt-oss, Qwen, Gemma, and DeepSeek on your computer, privately and for free. It supports developer resources like JS and Python SDKs.
Tags: local AI, AI model runtime, offline AI

Friendli Inference
Friendli Inference is the fastest LLM inference engine, optimized for speed and cost-effectiveness, slashing GPU costs by 50-90% while delivering high throughput and low latency.
Tags: LLM serving, GPU optimization

llama.cpp
Enable efficient LLM inference with llama.cpp, a C/C++ library optimized for diverse hardware, supporting quantization, CUDA, and GGUF models. Ideal for local and cloud deployment.
Tags: LLM inference, C/C++ library

llm-answer-engine
Build a Perplexity-inspired AI answer engine using Next.js, Groq, Llama-3, and Langchain. Get sources, answers, images, and follow-up questions efficiently.
Tags: AI answer engine, semantic search

vLLM
vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs, featuring PagedAttention and continuous batching for optimized performance.
Tags: LLM inference engine, PagedAttention

Nebius AI Studio Inference Service
Nebius AI Studio Inference Service offers hosted open-source models for faster, cheaper, and more accurate results than proprietary APIs. Scale seamlessly with no MLOps needed, ideal for RAG and production workloads.
Tags: AI inference, open-source LLMs

ChatLLaMA
ChatLLaMA is a LoRA-trained AI assistant based on LLaMA models, enabling custom personal conversations on your local GPU. It features a desktop GUI, is trained on Anthropic's HH dataset, and is available for 7B, 13B, and 30B models.
Tags: LoRA fine-tuning, conversational AI

Inweave
Inweave is an AI-powered platform designed for startups and scaleups to automate workflows efficiently. Deploy customizable AI assistants using top models like GPT and Llama via chat or API for seamless productivity gains.
Tags: workflow automation, AI assistants

Sagify
Sagify is an open-source Python tool that streamlines machine learning pipelines on AWS SageMaker, offering a unified LLM Gateway for seamless integration of proprietary and open-source large language models to boost productivity.
Tags: ML deployment, LLM gateway

LlamaChat
LlamaChat is a macOS app that allows you to chat with LLaMA, Alpaca, and GPT4All models locally on your Mac.
Tags: local LLM, macOS app, LLaMA

Arbius
Arbius is a decentralized network powered by GPUs globally, creating a shared economy around generative AI. It allows users to participate in governance, earn fees via staking, and promote open AI.
Tags: decentralized AI, GPU computing

Featherless.ai
Instantly run any Llama model from HuggingFace without setting up any servers. More than 11,900 models available, starting at $10/month for unlimited access.
Tags: LLM hosting, AI inference, serverless

Venice
Venice.ai: Private and uncensored AI for text, images, characters, and code. Access leading open-source models privately.
Tags: private AI, uncensored AI, AI model

Fireworks AI
Fireworks AI delivers blazing-fast inference for generative AI using state-of-the-art, open-source models. Fine-tune and deploy your own models at no extra cost. Scale AI workloads globally.
Tags: inference engine, open-source LLMs