Overview of ExLlama
ExLlama: Memory-Efficient Llama Implementation for Quantized Weights
ExLlama is a standalone Python/C++/CUDA implementation of Llama designed for speed and memory efficiency when using 4-bit GPTQ weights on modern GPUs. This project aims to provide a faster and more memory-efficient alternative to the Hugging Face Transformers implementation, particularly for users working with quantized models.
What is ExLlama?
ExLlama is designed to be a high-performance inference engine for the Llama family of language models. It leverages CUDA for GPU acceleration and is optimized for 4-bit GPTQ quantized weights, enabling users to run large language models on GPUs with limited memory.
How does ExLlama work?
ExLlama optimizes memory usage and inference speed through several techniques:
- CUDA Implementation: Utilizes CUDA for efficient GPU computation.
- Quantization Support: Specifically designed for 4-bit GPTQ quantized weights.
- Memory Efficiency: Reduces memory footprint compared to standard implementations.
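To make the memory-efficiency point concrete, here is a back-of-the-envelope estimate (illustrative arithmetic only, not measured ExLlama figures) of the VRAM needed for the weights alone at FP16 versus roughly 4 bits per weight plus a small group-wise scale overhead:
def weight_gb(n_params, bits_per_weight):
    # Convert a parameter count and per-weight bit width into gigabytes of weight storage.
    return n_params * bits_per_weight / 8 / 1024**3

for n_params in (7e9, 13e9, 33e9, 65e9):
    fp16 = weight_gb(n_params, 16)
    gptq4 = weight_gb(n_params, 4.15)   # ~4 bits/weight plus group scales/zeros (approximate)
    print(f"{n_params / 1e9:.0f}B params: FP16 ~{fp16:.1f} GB, 4-bit GPTQ ~{gptq4:.1f} GB")
Actual usage is higher once the KV cache and activation buffers are added, but the roughly 4x reduction in weight memory is what allows a 33B model to fit on a single 24 GB GPU.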
Key Features and Benefits:
- High Performance: Optimized for fast inference.
- Memory Efficiency: Allows running large models on less powerful GPUs.
- Standalone Implementation: No need for the Hugging Face Transformers library.
- Web UI: Includes a simple web UI for easy interaction with the model (JavaScript written by ChatGPT, so beware!).
- Docker Support: Can be run in a Docker container for easier deployment and security.
How to use ExLlama?
Installation:
- Clone the repository:
  git clone https://github.com/turboderp/exllama
- Navigate to the directory:
  cd exllama
- Install dependencies:
  pip install -r requirements.txt
Running the Benchmark:
python test_benchmark_inference.py -d <path_to_model_files> -p -ppl
Running the Chatbot Example:
python example_chatbot.py -d <path_to_model_files> -un "Jeff" -p prompt_chatbort.txt
Web UI:
- Install additional dependencies:
  pip install -r requirements-web.txt
- Run the web UI:
  python webui/app.py -d <path_to_model_files>
Why choose ExLlama?
ExLlama offers several advantages:
- Performance: Delivers faster inference than the standard Hugging Face Transformers path for 4-bit GPTQ models.
- Accessibility: Enables users with limited GPU memory to run large language models.
- Flexibility: Can be integrated into other projects via the Python module.
- Ease of Use: Provides a simple web UI for interacting with the model.
Who is ExLlama for?
ExLlama is suitable for:
- Researchers and developers working with large language models.
- Users with NVIDIA GPUs (30-series and later recommended).
- Those seeking a memory-efficient and high-performance inference solution.
- Anyone interested in running Llama models with 4-bit GPTQ quantization.
Hardware Requirements:
- NVIDIA GPUs (RTX 30-series or later recommended)
- ROCm support is theoretical but untested
Dependencies:
- Python 3.9+
- PyTorch (tested on 2.0.1 and 2.1.0 nightly) with CUDA 11.8
- safetensors 0.3.2
- sentencepiece
- ninja
- flask and waitress (for web UI)
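Before the CUDA extension is built, it can be worth confirming that a CUDA-enabled PyTorch build is actually installed. This quick check is a suggestion, not part of ExLlama itself:
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available: ", torch.cuda.is_available())
print("CUDA toolkit:   ", torch.version.cuda)   # the dependency list above assumes 11.8
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))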
Docker Support:
ExLlama can be run in a Docker container for easier deployment and security. The Docker image supports NVIDIA GPUs.
Results and Benchmarks:
ExLlama demonstrates significant performance improvements compared to other implementations, especially in terms of tokens per second (t/s) during inference. Benchmarks are provided for various Llama model sizes (7B, 13B, 33B, 65B, 70B) on different GPU configurations.
Example usage
import os
from exllama.model import ExLlama, ExLlamaCache, ExLlamaConfig
from exllama.tokenizer import ExLlamaTokenizer
from exllama.generator import ExLlamaGenerator

# Initialize model and tokenizer
model_directory = "/path/to/your/model"
tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")

config = ExLlamaConfig(model_config_path)                                  # read the model's config.json
config.model_path = os.path.join(model_directory, "model.safetensors")    # 4-bit GPTQ weights

tokenizer = ExLlamaTokenizer(tokenizer_path)
model = ExLlama(config)                                                    # load the quantized weights
cache = ExLlamaCache(model)                                                # KV cache for inference

# Use the bundled generator for sampling
generator = ExLlamaGenerator(model, tokenizer, cache)
generator.settings.temperature = 0.7
generator.settings.top_k = 50
generator.settings.top_p = 0.7

# Generate output
prompt = "The quick brown fox jumps over the lazy"
output = generator.generate_simple(prompt, max_new_tokens = 20)
print(output)
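The bundled generator wraps a loop over model.forward(). For finer control, that loop can also be written by hand; the sketch below is illustrative only, assuming a freshly created ExLlamaCache and plain greedy sampling rather than the repository's samplers:
import torch

def greedy_generate(model, tokenizer, cache, prompt, max_new_tokens = 20):
    # Encode the prompt; ExLlamaTokenizer returns a (1, prompt_len) LongTensor.
    generated = tokenizer.encode(prompt)

    # One pass over the whole prompt fills the KV cache and returns
    # logits for the last position.
    logits = model.forward(generated, cache)

    for _ in range(max_new_tokens):
        next_id = torch.argmax(logits[0, -1]).item()         # greedy pick
        next_ids = torch.tensor([[next_id]], dtype = torch.long)
        generated = torch.cat((generated, next_ids), dim = 1)
        # Past positions are held in the cache, so only the new token is processed.
        logits = model.forward(next_ids, cache)

    return tokenizer.decode(generated[0])

# Reuse the objects created above, but with a fresh cache for this run.
print(greedy_generate(model, tokenizer, ExLlamaCache(model), prompt))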
Compatibility and Model Support:
ExLlama is compatible with a range of Llama models, including Llama 1 and Llama 2. The project is continuously updated to support new models and features.
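For instance, when loading a Llama 2 model the context length can be raised to its native 4096 tokens before the model is created. This assumes ExLlamaConfig exposes a max_seq_len field, as in the repository, and reuses the paths from the example above:
config = ExLlamaConfig(model_config_path)
config.model_path = os.path.join(model_directory, "model.safetensors")
config.max_seq_len = 4096        # Llama 2 models use a 4096-token context window
model = ExLlama(config)
cache = ExLlamaCache(model)      # the cache is sized from config.max_seq_len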
ExLlama is a powerful tool for anyone looking to run Llama models efficiently. Its focus on memory optimization and speed makes it an excellent choice for both research and practical applications.
Best Alternative Tools to "ExLlama"
LM Studio enables you to run local AI models like gpt-oss, Qwen, Gemma, and DeepSeek on your computer, privately and for free. It supports developer resources like JS and Python SDKs.
Friendli Inference is the fastest LLM inference engine, optimized for speed and cost-effectiveness, slashing GPU costs by 50-90% while delivering high throughput and low latency.
Enable efficient LLM inference with llama.cpp, a C/C++ library optimized for diverse hardware, supporting quantization, CUDA, and GGUF models. Ideal for local and cloud deployment.
Build a Perplexity-inspired AI answer engine using Next.js, Groq, Llama-3, and Langchain. Get sources, answers, images, and follow-up questions efficiently.
vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs, featuring PagedAttention and continuous batching for optimized performance.
Nebius AI Studio Inference Service offers hosted open-source models for faster, cheaper, and more accurate results than proprietary APIs. Scale seamlessly with no MLOps needed, ideal for RAG and production workloads.
ChatLLaMA is a LoRA-trained AI assistant based on LLaMA models, enabling custom personal conversations on your local GPU. Features desktop GUI, trained on Anthropic's HH dataset, available for 7B, 13B, and 30B models.
Inweave is an AI-powered platform designed for startups and scaleups to automate workflows efficiently. Deploy customizable AI assistants using top models like GPT and Llama via chat or API for seamless productivity gains.
Sagify is an open-source Python tool that streamlines machine learning pipelines on AWS SageMaker, offering a unified LLM Gateway for seamless integration of proprietary and open-source large language models to boost productivity.
LlamaChat is a macOS app that allows you to chat with LLaMA, Alpaca, and GPT4All models locally on your Mac. Download now and experience local LLM chatting!
Arbius is a decentralized network powered by GPUs globally, creating a shared economy around generative AI. It allows users to participate in governance, earn fees via staking, and promote open AI.
Instantly run any Llama model from HuggingFace without setting up any servers. Over 11,900+ models available. Starting at $10/month for unlimited access.
Venice.ai: Private and uncensored AI for text, images, characters, and code. Access leading open-source models privately.
Fireworks AI delivers blazing-fast inference for generative AI using state-of-the-art, open-source models. Fine-tune and deploy your own models at no extra cost. Scale AI workloads globally.