How to Self-Host an LLM on a VPS — Full Deployment Guide
Published: June 10, 2025
Reading Time: ~6 min
Category: Guides
// 1. What Is a Self-Hosted LLM and Why Consider It?
A self-hosted LLM means running your own large language model on infrastructure you control, instead of relying on API providers like OpenAI or Anthropic. You download the model weights, run an inference server, and interact with it through a local API — all on your own hardware.
Why people choose to self-host:
- No rate limits — Send as many requests as your hardware can handle. No throttling, no waiting.
- No vendor lock-in — Switch models freely. Run Mistral today, LLaMA tomorrow, or both at once.
- Full data privacy — Your prompts and responses never leave your server. Nothing is logged by a third party.
- Offline capability — Once downloaded, the model runs without an internet connection.
- Fine-tuning support — Train the model on your own data for domain-specific accuracy.
Common use cases:
- Internal chatbots for teams
- Code assistants and pair programming tools
- Document search and summarization pipelines
- Customer support automation with custom knowledge bases
- Private notes, journaling, and personal AI assistants
Best for: Companies handling sensitive data (healthcare, finance, legal), high-usage scenarios (1M+ tokens/day), and anyone who needs custom fine-tuned models without sharing data with external APIs.
// 2. Hardware Requirements
| Model Size | RAM | CPU | Storage | Recommended Plan |
|---|---|---|---|---|
| Tiny (2-3B, e.g. Phi-2) | 4 GB | 2 vCPU | 30 GB SSD | VPS Start ($24.99/mo) |
| Small (7B, e.g. Mistral 7B) | 8 GB | 4 vCPU | 50 GB SSD | VPS Premium ($64.99/mo) ★ |
| Medium (13B, e.g. LLaMA 2 13B) | 16 GB+ | 4+ vCPU | 100 GB+ | Dedicated 4 vCores ($99/mo) |
Note: Using quantized models (GGUF format) significantly reduces RAM requirements. A quantized 7B model can run comfortably on 4-6 GB RAM, making even mid-range VPS plans viable for real LLM workloads.
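The memory math behind that note is simple enough to sketch. The bits-per-weight figures below are illustrative approximations for common GGUF quant types (not official numbers), and the flat 1 GB overhead for KV-cache and runtime is an assumption:

```python
# Rough RAM estimate for a quantized GGUF model.
# Bits-per-weight values are approximate/illustrative, not official figures.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q8_0": 8.5, "F16": 16.0}

def model_ram_gb(params_billions: float, quant: str, overhead_gb: float = 1.0) -> float:
    """Approximate resident memory: quantized weights + KV-cache/runtime overhead."""
    weight_gb = params_billions * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1024**3
    return round(weight_gb + overhead_gb, 1)

# A Q4_K_M 7B model lands around 5 GB — comfortably inside an 8 GB VPS.
print(model_ram_gb(7.0, "Q4_K_M"))
```

The same estimate explains why a 2.7B model like Phi-2 fits a 4 GB plan while an unquantized (F16) 7B model would need roughly 15 GB.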
Featured plans:
- VPS Premium — Xeon 4x2.20 GHz, 50 GB SSD, unmetered bandwidth, 1 Gbps port, weekly backups, KVM virtualization
- Dedicated 2 vCores — AMD EPYC 2 vCores, 8 GB DDR4 RAM, 125 GB SSD NVMe, fair-use unmetered bandwidth, 1 Gbps port, full root access, KVM
// 3. Popular Models to Self-Host
These are the most practical models for VPS deployment. All support quantization (GGUF) for reduced memory usage.
◆ Mistral 7B
7B parameters | Apache 2.0 license. Fast, efficient, and great for chat and Q&A tasks. Runs well on 8 GB RAM with quantization. One of the best performance-per-parameter models available.
◆ LLaMA 2
7B/13B parameters | Meta license. Versatile general-purpose model with strong community support. Excellent fine-tuning ecosystem and wide compatibility with inference backends.
◆ Phi-2
2.7B parameters | MIT license. Surprisingly capable for its size. Perfect for lightweight tasks on smaller VPS plans. Ideal when you need fast responses with minimal resource usage.
◆ Falcon
7B/40B parameters | Apache 2.0 license. Strong multilingual support and higher accuracy on benchmarks. Trades more resource usage for better output quality across languages.
// 4. Step-by-Step: Deploy an LLM on DejavuHost
1
Order Your Server
Want it pre-installed? Select "Other (Raise a Ticket)" from the OS dropdown and tell us which LLM you want. We'll deliver your server with everything set up — the model downloaded, inference server running, and API endpoint ready. No extra cost.
Prefer to set it up yourself? Choose Ubuntu 22.04 LTS and follow the steps below.
- Head to our VPS Hosting or Dedicated Servers page and pick a plan
- Choose Ubuntu 22.04 LTS as your OS
- Pick a location closest to your users from our 12 global locations
- Complete checkout (we accept Bitcoin via BTCPay!)
2
Install Dependencies
Once your server is provisioned and you have SSH access, install the essentials:
sudo apt update && sudo apt upgrade -y
sudo apt install python3 python3-pip git wget curl build-essential cmake -y
pip install torch --index-url https://download.pytorch.org/whl/cpu  # optional: only needed for Python-based backends, not llama.cpp
3
Set Up llama.cpp (Recommended for VPS)
llama.cpp is the most efficient way to run LLMs on CPU-only servers. It supports quantized models and uses minimal resources.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
mkdir -p models
wget -P models/ https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf
4
Start the Inference Server
Launch the built-in server with an OpenAI-compatible API endpoint:
./build/bin/llama-server -m models/mistral-7b-instruct-v0.1.Q4_K_M.gguf \
--host 0.0.0.0 --port 8080 \
-c 4096 -ngl 0
Note: binding to 0.0.0.0 exposes the API to the public internet. Restrict access with a firewall, or keep the server on localhost behind a reverse proxy, before putting real traffic through it.
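To keep the server running across reboots and crashes, you can wrap it in a systemd unit. This is a minimal sketch — the install path and model filename below assume the setup from the previous steps, so adjust them to match your server:

```ini
# /etc/systemd/system/llama.service (paths are assumptions — adjust to your install)
[Unit]
Description=llama.cpp inference server
After=network.target

[Service]
WorkingDirectory=/root/llama.cpp
ExecStart=/root/llama.cpp/build/bin/llama-server -m models/mistral-7b-instruct-v0.1.Q4_K_M.gguf --host 0.0.0.0 --port 8080 -c 4096 -ngl 0
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Then enable it with `sudo systemctl enable --now llama`.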
5
Test Your LLM
Send a test prompt to confirm everything is working:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "Hello, who are you?"}],
"max_tokens": 100
}'
If you get a JSON response with the model's reply, your self-hosted LLM is live.
Pro tip: llama.cpp exposes an OpenAI-compatible API. You can point any tool that works with OpenAI (chatbots, coding assistants, automation frameworks) at your self-hosted LLM instead — just change the base URL to your server's address.
// 5. Pre-Installation & Custom Setup
Unlike other providers that hand you a blank server, DejavuHost offers application pre-installation at no extra cost. Whether it's an LLM, a game server, a VPN, or any other application — tell us what you need and we'll set it up.
How it works:
- Select "Other (Raise a Ticket)" from the OS dropdown during checkout
- In your ticket, mention the model you want (e.g., "Mistral 7B via llama.cpp on Ubuntu 22.04")
- Complete your order as normal
- Our team provisions your server with everything installed — model downloaded, inference server configured, API endpoint ready
- You receive SSH credentials and can start using your LLM immediately
This also works for alternative inference backends:
- Ollama — Simplified model management with a single binary
- vLLM — High-throughput serving for production workloads
- Text Generation WebUI — Browser-based chat interface with model switching
- Any other backend — Just tell us what you need
No extra cost, no hassle — your server arrives ready to use.
// 6. Wrap-Up
Self-hosted LLMs give you privacy, control, and zero rate limits — all on infrastructure you own. With quantized models and efficient backends like llama.cpp, running a capable AI model on a VPS is more accessible than ever.
Our VPS Premium plan ($64.99/mo) handles quantized 7B models comfortably with 4 vCPU Xeon cores, 8 GB RAM, and unmetered bandwidth.
For production workloads that need consistent performance, our Dedicated 2 vCores plan ($69/mo) offers NVMe storage, dedicated AMD EPYC resources, and fair-use unmetered bandwidth.
Or skip the setup entirely — tell us what you need and we'll deliver it ready to go.