How to Self-Host an LLM on a VPS — Full Deployment Guide

// TL;DR

Rent a VPS with 4-8 GB of RAM, install llama.cpp, download a quantized GGUF model such as Mistral 7B, and start the built-in server. You get a private, OpenAI-compatible API on hardware you control. Or skip the setup entirely: DejavuHost can pre-install everything at no extra cost.

// 1. What Is a Self-Hosted LLM and Why Consider It?

A self-hosted LLM means running your own large language model on infrastructure you control, instead of relying on API providers like OpenAI or Anthropic. You download the model weights, run an inference server, and interact with it through a local API — all on your own hardware.

Why people choose to self-host:

  - Privacy: prompts and data never leave your server
  - Control: you choose the model, the version, and when it changes
  - No rate limits and no per-token API bills

Common use cases:

  - Internal chatbots and assistants over sensitive data
  - High-volume processing (1M+ tokens/day) where API costs add up
  - Serving custom fine-tuned models without exposing data to an external provider

Best for: Companies handling sensitive data (healthcare, finance, legal), high-usage scenarios (1M+ tokens/day), and anyone who needs custom fine-tuned models without sharing data with external APIs.

// 2. Hardware Requirements

Model Size                     | RAM    | CPU     | Storage    | Recommended Plan
Tiny (2-3B, e.g. Phi-2)        | 4 GB   | 2 vCPU  | 30 GB SSD  | VPS Start ($24.99/mo)
Medium (13B, e.g. LLaMA 2 13B) | 16 GB+ | 4+ vCPU | 100 GB+    | Dedicated 4 vCores ($99/mo)
Note: Using quantized models (GGUF format) significantly reduces RAM requirements. A quantized 7B model can run comfortably on 4-6 GB RAM, making even mid-range VPS plans viable for real LLM workloads.
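
A back-of-the-envelope way to see why: resident memory is roughly parameter count times bytes per weight, plus a couple of GB of overhead for the KV cache and runtime. FP16 stores 2 bytes per weight; Q4 quantization stores about half a byte:

```shell
# Rough RAM estimate: parameters x bytes per weight
PARAMS=7000000000   # a 7B model
echo "FP16: about $(( PARAMS * 2 / 1024 / 1024 / 1024 )) GB"   # ~13 GB
echo "Q4:   about $(( PARAMS / 2 / 1024 / 1024 / 1024 )) GB"   # ~3 GB
```

That 4x difference is what moves a 7B model from dedicated-server territory down to a mid-range VPS.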

// 3. Popular Models to Self-Host

These are the most practical models for VPS deployment. All support quantization (GGUF) for reduced memory usage.

◆ Mistral 7B

7B parameters | Apache 2.0 license. Fast, efficient, and great for chat and Q&A tasks. Runs well on 8 GB RAM with quantization. One of the best performance-per-parameter models available.

◆ LLaMA 2

7B/13B parameters | Meta license. Versatile general-purpose model with strong community support. Excellent fine-tuning ecosystem and wide compatibility with inference backends.

◆ Phi-2

2.7B parameters | MIT license. Surprisingly capable for its size. Perfect for lightweight tasks on smaller VPS plans. Ideal when you need fast responses with minimal resource usage.

◆ Falcon

7B/40B parameters | Apache 2.0 license. Strong multilingual support and higher accuracy on benchmarks. Trades more resource usage for better output quality across languages.

// 4. Step-by-Step: Deploy an LLM on DejavuHost

Step 1: Order Your Server

Want it pre-installed? Select "Other (Raise a Ticket)" from the OS dropdown and tell us which LLM you want. We'll deliver your server with everything set up — the model downloaded, inference server running, and API endpoint ready. No extra cost.

Prefer to set it up yourself? Choose Ubuntu 22.04 LTS and follow the steps below.

Step 2: Install Dependencies

Once your server is provisioned and you have SSH access, install the essentials:

  # Update system and install essentials
  sudo apt update && sudo apt upgrade -y
  sudo apt install python3 python3-pip git wget curl -y

  # Install PyTorch (CPU version - sufficient for quantized models)
  pip install torch --index-url https://download.pytorch.org/whl/cpu
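
Before moving on, it is worth confirming the toolchain actually installed. A quick sanity check:

```shell
# Verify Python, pip, and the CPU build of PyTorch are usable
python3 --version
pip --version
python3 -c "import torch; print('torch', torch.__version__)"
```

If the torch import fails, re-run the pip install line above before continuing.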
Step 3: Set Up llama.cpp (Recommended for VPS)

llama.cpp is the most efficient way to run LLMs on CPU-only servers. It supports quantized models and uses minimal resources.

  # Clone and build llama.cpp
  git clone https://github.com/ggerganov/llama.cpp
  cd llama.cpp
  make

  # Download a quantized model (Mistral 7B Q4)
  mkdir models
  wget -P models/ https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf
Step 4: Start the Inference Server

Launch the built-in server with an OpenAI-compatible API endpoint:

  # Launch with OpenAI-compatible API
  ./server -m models/mistral-7b-instruct-v0.1.Q4_K_M.gguf \
    --host 0.0.0.0 --port 8080 \
    -c 4096 -ngl 0
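
To keep the server running across reboots and crashes, one option is a small systemd unit. This is a sketch: the paths assume llama.cpp was built in /root/llama.cpp, so adjust them to your install location. It also binds to 127.0.0.1 rather than 0.0.0.0, which pairs with the firewall advice in section 6:

```shell
# Write a systemd unit for the inference server (paths are assumptions)
sudo tee /etc/systemd/system/llm.service >/dev/null <<'EOF'
[Unit]
Description=llama.cpp inference server
After=network.target

[Service]
WorkingDirectory=/root/llama.cpp
ExecStart=/root/llama.cpp/server -m models/mistral-7b-instruct-v0.1.Q4_K_M.gguf --host 127.0.0.1 --port 8080 -c 4096 -ngl 0
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now llm
```

Check it with systemctl status llm; from then on the server restarts automatically if it crashes.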
Step 5: Test Your LLM

Send a test prompt to confirm everything is working:

  # Send a test prompt
  curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "messages": [{"role": "user", "content": "Hello, who are you?"}],
      "max_tokens": 100
    }'

If you get a JSON response with the model's reply, your self-hosted LLM is live.

Pro tip: llama.cpp exposes an OpenAI-compatible API. You can point any tool that works with OpenAI (chatbots, coding assistants, automation frameworks) at your self-hosted LLM instead — just change the base URL to your server's address.
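
For example, the official OpenAI SDKs read their endpoint from standard environment variables, so redirecting a tool is often just two exports (the server address below is a placeholder):

```shell
# Point OpenAI-compatible tooling at the self-hosted server.
# llama.cpp does not check the API key by default, so any placeholder works.
export OPENAI_BASE_URL="http://your-server-ip:8080/v1"
export OPENAI_API_KEY="placeholder"
```

Any client started in this shell will now send its requests to your VPS instead of api.openai.com.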

// 5. Pre-Installation & Custom Setup

Unlike other providers that hand you a blank server, DejavuHost offers application pre-installation at no extra cost. Whether it's an LLM, a game server, a VPN, or any other application — tell us what you need and we'll set it up.

How it works:

  1. Select "Other (Raise a Ticket)" from the OS dropdown during checkout
  2. In your ticket, mention the model you want (e.g., "Mistral 7B via llama.cpp on Ubuntu 22.04")
  3. Complete your order as normal
  4. Our team provisions your server with everything installed — model downloaded, inference server configured, API endpoint ready
  5. You receive SSH credentials and can start using your LLM immediately

This also works for alternative inference backends such as Ollama, vLLM, or text-generation-webui: just name your preferred stack in the ticket.

No extra cost, no hassle — your server arrives ready to use.

// 6. Security Best Practices

◆ Firewall Configuration

Only open the ports you need. Keep the LLM API behind a firewall or SSH tunnel. Never expose port 8080 directly to the public internet without authentication.
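
As a sketch with ufw (Ubuntu's default firewall frontend): allow SSH, block the API port, and reach the API over an SSH tunnel from your own machine instead:

```shell
# On the server: allow SSH, block direct access to the API port
sudo ufw allow OpenSSH
sudo ufw deny 8080/tcp
sudo ufw enable

# On your local machine: tunnel the API over SSH instead
ssh -L 8080:localhost:8080 user@your-server-ip
# ...then query http://localhost:8080 locally as usual
```

With the tunnel up, the curl test from Step 5 works unchanged from your laptop.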

◆ API Authentication

Add API key authentication before exposing your endpoint. Use a reverse proxy like Nginx with basic auth or token validation. Never run an open LLM API on the public internet.
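
One way to bolt on a shared secret is a minimal sketch like the following, with Nginx installed in front of the server. The domain and token are placeholders, and a real deployment should add TLS (e.g. via Let's Encrypt) rather than serve plain HTTP:

```shell
# Write a reverse-proxy site that rejects requests without the right token
sudo tee /etc/nginx/sites-available/llm >/dev/null <<'EOF'
server {
    listen 80;
    server_name llm.example.com;  # placeholder domain

    location / {
        # Require "Authorization: Bearer <token>" on every request
        if ($http_authorization != "Bearer change-me-to-a-long-random-token") {
            return 401;
        }
        proxy_pass http://127.0.0.1:8080;
    }
}
EOF

sudo ln -s /etc/nginx/sites-available/llm /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
```

Clients then send the token in an Authorization header; everything else about the API stays the same.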

◆ Keep System Updated

Run apt update && apt upgrade regularly. LLM backends like llama.cpp update frequently with performance improvements and security patches.

◆ Resource Monitoring

LLMs are memory-intensive. Monitor RAM and CPU usage to avoid OOM crashes. Use htop or set up a lightweight monitoring tool like Netdata.
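
A quick snapshot from the shell is often enough to spot trouble before an OOM kill:

```shell
# Memory and CPU at a glance while the model is loaded
free -h      # "used" is dominated by the model weights
uptime       # load average should stay near your vCPU count

# Resident memory of the inference server, in KB (only if it is running)
ps -o rss=,comm= -C server
```

If "available" in free -h trends toward zero during generation, move to a smaller quant or a bigger plan.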

◆ Model Validation

Only download models from trusted sources (HuggingFace official repos, verified uploaders). Verify checksums when available to prevent tampered weights.
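
For GGUF files this usually means comparing a SHA-256. The expected value below is a placeholder; take the real hash from the file listing on the model's HuggingFace page:

```shell
# Compute the local checksum
sha256sum models/mistral-7b-instruct-v0.1.Q4_K_M.gguf

# Or verify directly against the published value (placeholder hash shown)
echo "<expected-sha256>  models/mistral-7b-instruct-v0.1.Q4_K_M.gguf" | sha256sum -c -
```

sha256sum -c prints "OK" on a match and exits non-zero on a mismatch, so it also works in scripts.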

◆ SSH Key Auth

Disable password login and use SSH keys for all server access. It's the single most important step for securing your server against brute-force attacks.
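
Assuming your key already works, the change is one line in sshd_config. Test the new login in a second session before closing the current one:

```shell
# Turn off password authentication (handles both commented and set values)
sudo sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
sudo systemctl restart ssh
```

From then on, only clients holding an authorized private key can log in.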

// 7. Wrap-Up

Self-hosted LLMs give you privacy, control, and zero rate limits — all on infrastructure you own. With quantized models and efficient backends like llama.cpp, running a capable AI model on a VPS is more accessible than ever.

Our VPS Premium plan ($64.99/mo) handles quantized 7B models comfortably with 4 vCPU Xeon cores, 8 GB RAM, and unmetered bandwidth.

For production workloads that need consistent performance, our Dedicated 2 vCores plan ($69/mo) offers NVMe storage, dedicated AMD EPYC resources, and fair-use unmetered bandwidth.

Or skip the setup entirely — tell us what you need and we'll deliver it ready to go.

Ready to run your own AI?

Order your server and deploy a self-hosted LLM today.

VPS Plans: from $11.99/mo
Dedicated: from $69/mo
Locations: 12
Uptime SLA: 99.9%