How to Self-Host an LLM on a VPS — Full Deployment Guide

// TL;DR

Rent a VPS with 4-8 GB of RAM, install llama.cpp, download a quantized GGUF model such as Mistral 7B, and start the built-in server. You get a private, OpenAI-compatible API on hardware you control. Or skip the setup entirely: DejavuHost can pre-install everything at no extra cost.

// 1. What Is a Self-Hosted LLM and Why Consider It?

A self-hosted LLM means running your own large language model on infrastructure you control, instead of relying on API providers like OpenAI or Anthropic. You download the model weights, run an inference server, and interact with it through a local API — all on your own hardware.

Why people choose to self-host:

  - Privacy: prompts and data never leave your server
  - Control: you choose the model, the version, and when it changes
  - No rate limits and no per-token API bills

Common use cases:

  - Internal chatbots and assistants over sensitive data
  - High-volume processing (1M+ tokens/day) where API costs add up
  - Serving custom fine-tuned models without exposing data to an external provider

Best for: Companies handling sensitive data (healthcare, finance, legal), high-usage scenarios (1M+ tokens/day), and anyone who needs custom fine-tuned models without sharing data with external APIs.

// 2. Hardware Requirements

Model Size                     | RAM    | CPU     | Storage    | Recommended Plan
Tiny (2-3B, e.g. Phi-2)        | 4 GB   | 2 vCPU  | 30 GB SSD  | VPS Start ($24.99/mo)
Medium (13B, e.g. LLaMA 2 13B) | 16 GB+ | 4+ vCPU | 100 GB+    | Dedicated 4 vCores ($99/mo)
Note: Using quantized models (GGUF format) significantly reduces RAM requirements. A quantized 7B model can run comfortably on 4-6 GB RAM, making even mid-range VPS plans viable for real LLM workloads.
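
A back-of-the-envelope way to see why: resident memory is roughly parameter count times bytes per weight, plus a couple of GB of overhead for the KV cache and runtime. FP16 stores 2 bytes per weight; Q4 quantization stores about half a byte:

```shell
# Rough RAM estimate: parameters x bytes per weight
PARAMS=7000000000   # a 7B model
echo "FP16: about $(( PARAMS * 2 / 1024 / 1024 / 1024 )) GB"   # ~13 GB
echo "Q4:   about $(( PARAMS / 2 / 1024 / 1024 / 1024 )) GB"   # ~3 GB
```

That 4x difference is what moves a 7B model from dedicated-server territory down to a mid-range VPS.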

// 3. Popular Models to Self-Host

These are the most practical models for VPS deployment. All support quantization (GGUF) for reduced memory usage.

◆ Mistral 7B

7B parameters | Apache 2.0 license. Fast, efficient, and great for chat and Q&A tasks. Runs well on 8 GB RAM with quantization. One of the best performance-per-parameter models available.

◆ LLaMA 2

7B/13B parameters | Meta license. Versatile general-purpose model with strong community support. Excellent fine-tuning ecosystem and wide compatibility with inference backends.

◆ Phi-2

2.7B parameters | MIT license. Surprisingly capable for its size. Perfect for lightweight tasks on smaller VPS plans. Ideal when you need fast responses with minimal resource usage.

◆ Falcon

7B/40B parameters | Apache 2.0 license. Strong multilingual support and higher accuracy on benchmarks. Trades more resource usage for better output quality across languages.

// 4. Step-by-Step: Deploy an LLM on DejavuHost

Step 1: Order Your Server

Want it pre-installed? Select "Other (Raise a Ticket)" from the OS dropdown and tell us which LLM you want. We'll deliver your server with everything set up — the model downloaded, inference server running, and API endpoint ready. No extra cost.

Prefer to set it up yourself? Choose Ubuntu 22.04 LTS and follow the steps below.

Step 2: Install Dependencies

Once your server is provisioned and you have SSH access, install the essentials:

  # Update system and install essentials
  sudo apt update && sudo apt upgrade -y
  sudo apt install python3 python3-pip git wget curl -y

  # Install PyTorch (CPU version - sufficient for quantized models)
  pip install torch --index-url https://download.pytorch.org/whl/cpu
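
Before moving on, it is worth confirming the toolchain actually installed. A quick sanity check:

```shell
# Verify Python, pip, and the CPU build of PyTorch are usable
python3 --version
pip --version
python3 -c "import torch; print('torch', torch.__version__)"
```

If the torch import fails, re-run the pip install line above before continuing.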
Step 3: Set Up llama.cpp (Recommended for VPS)

llama.cpp is the most efficient way to run LLMs on CPU-only servers. It supports quantized models and uses minimal resources.

  # Clone and build llama.cpp
  git clone https://github.com/ggerganov/llama.cpp
  cd llama.cpp
  make

  # Download a quantized model (Mistral 7B Q4)
  mkdir models
  wget -P models/ https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf
Step 4: Start the Inference Server

Launch the built-in server with an OpenAI-compatible API endpoint:

  # Launch with OpenAI-compatible API
  ./server -m models/mistral-7b-instruct-v0.1.Q4_K_M.gguf \
    --host 0.0.0.0 --port 8080 \
    -c 4096 -ngl 0
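
To keep the server running across reboots and crashes, one option is a small systemd unit. This is a sketch: the paths assume llama.cpp was built in /root/llama.cpp, so adjust them to your install location. It also binds to 127.0.0.1 rather than 0.0.0.0, which pairs with the firewall advice in section 6:

```shell
# Write a systemd unit for the inference server (paths are assumptions)
sudo tee /etc/systemd/system/llm.service >/dev/null <<'EOF'
[Unit]
Description=llama.cpp inference server
After=network.target

[Service]
WorkingDirectory=/root/llama.cpp
ExecStart=/root/llama.cpp/server -m models/mistral-7b-instruct-v0.1.Q4_K_M.gguf --host 127.0.0.1 --port 8080 -c 4096 -ngl 0
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now llm
```

Check it with systemctl status llm; from then on the server restarts automatically if it crashes.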
Step 5: Test Your LLM

Send a test prompt to confirm everything is working:

  # Send a test prompt
  curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "messages": [{"role": "user", "content": "Hello, who are you?"}],
      "max_tokens": 100
    }'

If you get a JSON response with the model's reply, your self-hosted LLM is live.

Pro tip: llama.cpp exposes an OpenAI-compatible API. You can point any tool that works with OpenAI (chatbots, coding assistants, automation frameworks) at your self-hosted LLM instead — just change the base URL to your server's address.
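
For example, the official OpenAI SDKs read their endpoint from standard environment variables, so redirecting a tool is often just two exports (the server address below is a placeholder):

```shell
# Point OpenAI-compatible tooling at the self-hosted server.
# llama.cpp does not check the API key by default, so any placeholder works.
export OPENAI_BASE_URL="http://your-server-ip:8080/v1"
export OPENAI_API_KEY="placeholder"
```

Any client started in this shell will now send its requests to your VPS instead of api.openai.com.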

// 5. Pre-Installation & Custom Setup

Unlike other providers that hand you a blank server, DejavuHost offers application pre-installation at no extra cost. Whether it's an LLM, a game server, a VPN, or any other application — tell us what you need and we'll set it up.

How it works:

  1. Select "Other (Raise a Ticket)" from the OS dropdown during checkout
  2. In your ticket, mention the model you want (e.g., "Mistral 7B via llama.cpp on Ubuntu 22.04")
  3. Complete your order as normal
  4. Our team provisions your server with everything installed — model downloaded, inference server configured, API endpoint ready
  5. You receive SSH credentials and can start using your LLM immediately

This also works for alternative inference backends such as Ollama, vLLM, or text-generation-webui: just name your preferred stack in the ticket.

No extra cost, no hassle — your server arrives ready to use.

// 6. Security Best Practices

◆ Firewall Configuration

Only open the ports you need. Keep the LLM API behind a firewall or SSH tunnel. Never expose port 8080 directly to the public internet without authentication.
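
As a sketch with ufw (Ubuntu's default firewall frontend): allow SSH, block the API port, and reach the API over an SSH tunnel from your own machine instead:

```shell
# On the server: allow SSH, block direct access to the API port
sudo ufw allow OpenSSH
sudo ufw deny 8080/tcp
sudo ufw enable

# On your local machine: tunnel the API over SSH instead
ssh -L 8080:localhost:8080 user@your-server-ip
# ...then query http://localhost:8080 locally as usual
```

With the tunnel up, the curl test from Step 5 works unchanged from your laptop.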

◆ API Authentication

Add API key authentication before exposing your endpoint. Use a reverse proxy like Nginx with basic auth or token validation. Never run an open LLM API on the public internet.
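
One way to bolt on a shared secret is a minimal sketch like the following, with Nginx installed in front of the server. The domain and token are placeholders, and a real deployment should add TLS (e.g. via Let's Encrypt) rather than serve plain HTTP:

```shell
# Write a reverse-proxy site that rejects requests without the right token
sudo tee /etc/nginx/sites-available/llm >/dev/null <<'EOF'
server {
    listen 80;
    server_name llm.example.com;  # placeholder domain

    location / {
        # Require "Authorization: Bearer <token>" on every request
        if ($http_authorization != "Bearer change-me-to-a-long-random-token") {
            return 401;
        }
        proxy_pass http://127.0.0.1:8080;
    }
}
EOF

sudo ln -s /etc/nginx/sites-available/llm /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
```

Clients then send the token in an Authorization header; everything else about the API stays the same.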

◆ Keep System Updated

Run apt update && apt upgrade regularly. LLM backends like llama.cpp update frequently with performance improvements and security patches.

◆ Resource Monitoring

LLMs are memory-intensive. Monitor RAM and CPU usage to avoid OOM crashes. Use htop or set up a lightweight monitoring tool like Netdata.
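
A quick snapshot from the shell is often enough to spot trouble before an OOM kill:

```shell
# Memory and CPU at a glance while the model is loaded
free -h      # "used" is dominated by the model weights
uptime       # load average should stay near your vCPU count

# Resident memory of the inference server, in KB (only if it is running)
ps -o rss=,comm= -C server
```

If "available" in free -h trends toward zero during generation, move to a smaller quant or a bigger plan.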

◆ Model Validation

Only download models from trusted sources (HuggingFace official repos, verified uploaders). Verify checksums when available to prevent tampered weights.
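
For GGUF files this usually means comparing a SHA-256. The expected value below is a placeholder; take the real hash from the file listing on the model's HuggingFace page:

```shell
# Compute the local checksum
sha256sum models/mistral-7b-instruct-v0.1.Q4_K_M.gguf

# Or verify directly against the published value (placeholder hash shown)
echo "<expected-sha256>  models/mistral-7b-instruct-v0.1.Q4_K_M.gguf" | sha256sum -c -
```

sha256sum -c prints "OK" on a match and exits non-zero on a mismatch, so it also works in scripts.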

◆ SSH Key Auth

Disable password login and use SSH keys for all server access. It's the single most important step for securing your server against brute-force attacks.
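
Assuming your key already works, the change is one line in sshd_config. Test the new login in a second session before closing the current one:

```shell
# Turn off password authentication (handles both commented and set values)
sudo sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
sudo systemctl restart ssh
```

From then on, only clients holding an authorized private key can log in.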

// 7. Wrap-Up

Self-hosted LLMs give you privacy, control, and zero rate limits — all on infrastructure you own. With quantized models and efficient backends like llama.cpp, running a capable AI model on a VPS is more accessible than ever.

Our VPS Premium plan ($64.99/mo) handles quantized 7B models comfortably with 4 vCPU Xeon cores, 8 GB RAM, and unmetered bandwidth.

For production workloads that need consistent performance, our Dedicated 2 vCores plan ($69/mo) offers NVMe storage, dedicated AMD EPYC resources, and fair-use unmetered bandwidth.

Or skip the setup entirely — tell us what you need and we'll deliver it ready to go.

Ready to run your own AI?

Order your server and deploy a self-hosted LLM today.

VPS Plans: from $11.99/mo
Dedicated: from $69/mo
Locations: 12
Uptime SLA: 99.9%