Step-by-step guide to running Gemma-4 26B locally on budget GPUs
The local AI revolution just got a serious upgrade. Google's Gemma-4 26B model, combined with Unsloth's Quantization-Aware Training GGUF formats, makes it possible to run a 26-billion parameter model on a standard developer laptop with an 8GB GPU. No cloud API keys. No $5,000 workstation. Just your everyday hardware running cutting-edge AI.
This guide walks you through the complete setup: building llama.cpp with CUDA support, downloading the right model file, configuring the -cmoe memory split, managing VRAM thermals with VRAM Shield, and connecting your application to the local inference server. By the end, you'll have a stable, high-performance local AI development environment that runs on budget hardware.
There's one catch. Running a 26B model on 8GB VRAM requires intelligent memory management. The -cmoe flag in llama.cpp (short for --cpu-moe) keeps the Mixture of Experts expert weights in system RAM while the rest of the model stays in GPU VRAM, but this creates a sustained thermal load on the memory modules. Without proper thermal management, your inference speed will degrade after about 15 minutes. We'll solve this with VRAM Shield's pulse throttling technology.
If you're a web developer, frontend engineer, or open-source contributor who wants to integrate local reasoning LLMs into your daily workflow, this guide is for you. Let's get started.
Prerequisites
Before you begin, make sure you have the following hardware: an NVIDIA RTX 4060 with 8GB VRAM or equivalent, at least 16GB of system RAM (32GB recommended), 20GB of free storage for the model and build files, and Windows 10/11 (for VRAM Shield) or Linux.
For software, you'll need CUDA Toolkit 12.x for GPU acceleration, CMake 3.20 or later for building llama.cpp, Git for cloning repositories, and Python 3.10+ for monitoring scripts.
The model file you'll be working with is the Unsloth QAT GGUF: gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf, which weighs in at 13.2GB. The hardware requirements are modest. An RTX 4060 laptop (typically $1,000 to $1,400) is sufficient. The key is having 8GB of VRAM for the attention layers and enough system RAM for the expert weights.
If you're on Linux, the setup is similar but VRAM Shield isn't available yet. You'll need to implement your own thermal monitoring using nvidia-smi or NVML. We'll cover that in the advanced section.
Let's start by building llama.cpp.
Step 1: build llama.cpp
First, clone the llama.cpp repository and build it with CUDA support. Run the following commands in your terminal:
# Clone the repository
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Configure the build with CUDA enabled (this creates ./build)
cmake -B build -DGGML_CUDA=ON
# Build the binaries
cmake --build build --config Release -j --target llama-cli llama-server
The build process takes 5 to 10 minutes on a modern CPU. Once complete, you'll find the binaries in ./build/bin/. With the Visual Studio generator on Windows, they typically land in build\bin\Release\ instead. The llama-cli binary handles command-line inference, while llama-server runs an OpenAI-compatible API server.
Verify the build succeeded by running ./build/bin/llama-cli --version. You should see version information with CUDA support enabled. If you get a "command not found" error, the binary was not produced or the path is wrong: check that the build completed without errors and that you are running the command from the llama.cpp directory.
For Windows users, you may need to specify the CUDA architecture explicitly: cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89. The 89 corresponds to the Ada Lovelace architecture (RTX 4060/4070/4080/4090).
Common build issues include "CUDA not found" (make sure CUDA Toolkit 12.x is installed and nvcc is in your PATH), "CMake version too old" (upgrade to CMake 3.20 or later), and "Build failed with errors" (check that you have Visual Studio Build Tools on Windows or build-essential on Linux).
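Most of the build issues above can be caught before running CMake. The following is a minimal preflight sketch, not part of llama.cpp: the function names are hypothetical, and it assumes each tool prints its version when called with --version.

```python
import re
import shutil
import subprocess

MIN_CMAKE = (3, 20)  # minimum CMake version stated in the prerequisites


def parse_version(text):
    """Extract the first dotted version number from a tool's --version output."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups(default="0"))


def tool_version(tool):
    """Return the parsed version of a tool on PATH, or None if it is missing."""
    if shutil.which(tool) is None:
        return None
    result = subprocess.run([tool, "--version"], capture_output=True, text=True)
    return parse_version(result.stdout + result.stderr)


def preflight():
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    for tool in ("git", "cmake", "nvcc"):
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    cmake = tool_version("cmake")
    if cmake is not None and cmake[:2] < MIN_CMAKE:
        found = ".".join(str(part) for part in cmake)
        problems.append(f"cmake {found} is older than 3.20")
    return problems
```

Calling preflight() and printing each returned string gives a quick checklist. A missing nvcc usually means the CUDA Toolkit is not installed or not on PATH.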
Step 2: download the model
Unsloth provides optimized QAT GGUF formats for Gemma-4. The recommended version for 8GB VRAM is UD-Q4_K_XL.
# Create models directory
mkdir -p models
# Download the model from Hugging Face
# This is a 13.2GB file - download time depends on your connection
curl -L -o models/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf \
"https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf"
The download takes 10 to 30 minutes depending on your internet speed. Once complete, verify the file size with ls -lh models/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf. It should show approximately 13.2GB.
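A size check alone will not catch every bad download, because an interrupted transfer or a saved error page can still leave a file on disk. The sketch below adds a check for the four-byte GGUF magic at the start of the file. The 13.2GB figure comes from this guide; the tolerance is an assumption, since it is not stated whether that figure is in decimal GB or binary GiB, and the function names are hypothetical.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"   # GGUF files start with these four bytes
EXPECTED_GB = 13.2     # size quoted in this guide
TOLERANCE = 0.08       # assumed slack to cover GB vs GiB rounding


def has_gguf_magic(path):
    """True if the file begins with the GGUF magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(4) == GGUF_MAGIC


def size_looks_right(path, expected_gb=EXPECTED_GB, tolerance=TOLERANCE):
    """True if the size matches the expected figure in either GB or GiB."""
    size_bytes = Path(path).stat().st_size
    candidates = (size_bytes / 1e9, size_bytes / 2**30)
    return any(abs(value - expected_gb) <= expected_gb * tolerance
               for value in candidates)


def check_model(path):
    """Combine both checks for a downloaded model file."""
    return has_gguf_magic(path) and size_looks_right(path)
```

If check_model("models/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf") returns False, the safest fix is to delete the file and download it again.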
Which quantization should you choose? For 8GB VRAM, stick with UD-Q4_K_XL. It gives the best quality-to-size ratio while fitting within the -cmoe memory split. The Q4_K_M variant is slightly smaller at 12.5GB with marginally lower quality, useful if you have other GPU processes competing for VRAM. If you have 24GB or more VRAM, the Q8_0 quantization at 26.9GB offers near-original quality.
The reason QAT matters: traditional quantization compresses a trained model post-hoc, often degrading quality. Quantization-Aware Training bakes the quantization constraints directly into the training process. The result is near-original accuracy at a fraction of the memory footprint. The 26B-A4B model shrinks from 48GB (BF16) to 13.2GB (Q4_K_XL) with less than 1% quality loss on standard benchmarks.
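The size figures above can be sanity-checked with a few lines of arithmetic. This sketch takes "26B" at face value as 26 billion weights and treats the 13.2GB file size as decimal gigabytes; both are assumptions, so read the results as rough estimates.

```python
PARAMS = 26e9        # "26B" parameters, taken at face value
BF16_BYTES = 2       # bytes per weight in BF16
Q4_FILE_GB = 13.2    # quantized file size quoted in this guide

bf16_total_bytes = PARAMS * BF16_BYTES
bf16_gb = bf16_total_bytes / 1e9       # decimal gigabytes
bf16_gib = bf16_total_bytes / 2**30    # binary gibibytes

# Average bits stored per weight in the quantized file
bits_per_weight = Q4_FILE_GB * 1e9 * 8 / PARAMS
compression_ratio = bf16_gb / Q4_FILE_GB

print(f"BF16 size: {bf16_gb:.1f} GB ({bf16_gib:.1f} GiB)")
print(f"Quantized: {bits_per_weight:.2f} bits per weight on average")
print(f"Compression: {compression_ratio:.1f}x")
```

Under these assumptions the BF16 weights come to 52 GB, or roughly 48 GiB, which suggests the 48GB figure is in binary units. The quantized file works out to about 4 bits per weight.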
Step 3: configure the -cmoe memory split
The -cmoe flag is the key to running a 26B model on 8GB VRAM. Here's the command:
# Run Gemma-4 26B with -cmoe memory split
./build/bin/llama-cli \
-m "models/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf" \
-cmoe \
-c 248000 \
-v
The -m flag specifies the path to the GGUF model file. The -cmoe flag (short for --cpu-moe) keeps the Mixture of Experts expert weights in system RAM, so only the remaining layers need to fit in VRAM. The -c 248000 flag sets the context window to 248K tokens, Gemma-4's maximum. The -v flag enables verbose logging for monitoring.
What happens inside the engine? The attention mechanism runs on every token. It's compute-bound and latency-sensitive. Keeping it in VRAM ensures consistent token generation speed. The expert weights are memory-bound and tolerate the PCIe transfer penalty.
┌─────────────────────────────────────────────────────────────┐
│                   -cmoe MEMORY ALLOCATION                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│ SYSTEM RAM (DDR5)                    GPU VRAM (8GB GDDR6)   │
│ ┌─────────────────────┐              ┌────────────────────┐ │
│ │ Expert Weights      │              │ Attention Layers   │ │
│ │ (120 of 128 experts)│              │ (Q, K, V, O)       │ │
│ │ ~11.5 GB            │              │ ~1.2 GB            │ │
│ │                     │              │                    │ │
│ │ Swapped on-demand   │◄────────────►│ Always resident    │ │
│ │ by router network   │   PCIe 4.0   │                    │ │
│ │                     │   ~16 GB/s   │ KV Cache           │ │
│ │                     │              │ ~0.5 GB            │ │
│ └─────────────────────┘              └────────────────────┘ │
│                                                             │
│  Token Generation: 20 t/s sustained                         │
│  Expert Swap Latency: <2ms per token                        │
└─────────────────────────────────────────────────────────────┘
The VRAM heat trap: because the attention heads and KV cache are kept in VRAM, the memory bus runs at maximum frequency. In laptops, this rapidly saturates the shared heatsink. After 15 to 20 minutes, the VRAM junction temperature hits 105°C and the GPU firmware throttles performance. This is where VRAM Shield comes in.
For most use cases, a context window of 32K to 64K tokens is sufficient. Only use 248K if you genuinely need the full context for long document analysis or codebase-wide refactoring. A smaller context window means a smaller KV cache, which leaves more VRAM headroom for the attention layers.
Step 4: manage VRAM thermals
The -cmoe memory split creates a sustained thermal load on the VRAM modules. Without thermal management, your inference speed will degrade after 15 minutes. This is the "15-minute cliff" phenomenon.
Timeline Without Thermal Management:
  0-10 min:    20 t/s    75°C core     85°C VRAM    (stable)
  10-14 min:   18 t/s    75°C core     98°C VRAM    (thermal creep)
  15+ min:      5 t/s    75°C core    105°C VRAM    (firmware throttled)
The problem is straightforward. The GPU core temperature looks fine at 75°C, but the VRAM junction temperature climbs steadily. By the time you hit 105°C, the firmware slams on the brakes. Your throughput drops from 20 tokens per second to 5. The model still runs, but it feels like it's wading through molasses.
VRAM Shield introduces micro-suspensions in the GPU compute stream, giving the heat-pipes time to clear accumulated thermal energy. The duty cycle approach sacrifices 10% of peak performance to prevent the 75% cliff.
Download and install VRAM Shield from the GitHub releases page:
# Download from GitHub Releases (in Windows PowerShell, plain "curl" is an alias for Invoke-WebRequest, so call curl.exe)
curl.exe -L -o VRAMShield_2.2.2.exe https://github.com/53-software/vram-shield/releases/download/v2.2.2/VRAMShield_2.2.2.exe
# Run as Administrator (required for hardware sensor access)
.\VRAMShield_2.2.2.exe
Set the target VRAM junction temperature to 95°C, set the duty cycle mode to Pulse Throttling at 90%, and enable auto-monitoring.
Recommended Settings for Gemma-4 26B on 8GB VRAM:
┌─────────────────────────────────────────────────┐
│  Target VRAM Temp:   95°C                       │
│  Duty Cycle:         90% (Pulse mode)           │
│  Panic Threshold:    108°C (emergency halt)     │
│  Monitoring Rate:    500ms intervals            │
└─────────────────────────────────────────────────┘
With VRAM Shield active, the timeline looks very different. Your sustained throughput stays at 18 tokens per second for hours, not minutes. The 15-minute cliff vanishes.
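The trade-off behind these numbers is simple arithmetic. The sketch below is not VRAM Shield's implementation, which this guide does not document. It only assumes that a 90% duty cycle scales throughput proportionally, and the 500 ms pulse period is a hypothetical value chosen for illustration.

```python
def pulse_schedule(period_ms, duty_cycle):
    """Split one pulse period into (compute time, pause time) in milliseconds."""
    if not 0.0 < duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be in (0, 1]")
    run_ms = period_ms * duty_cycle
    return run_ms, period_ms - run_ms


def sustained_rate(peak_tokens_per_s, duty_cycle):
    """Throughput if the GPU only computes during the active part of each pulse."""
    return peak_tokens_per_s * duty_cycle


def relative_drop(before, after):
    """Fraction of throughput lost when going from `before` to `after`."""
    return 1.0 - after / before


run_ms, pause_ms = pulse_schedule(period_ms=500, duty_cycle=0.9)
print(f"compute {run_ms:.0f} ms, pause {pause_ms:.0f} ms per period")
print(f"throttled by duty cycle: {sustained_rate(20, 0.9):.0f} t/s")
print(f"loss at the cliff: {relative_drop(20, 5):.0%}")
```

With the figures from this guide, a 90% duty cycle on a 20 t/s peak gives the 18 t/s sustained rate, compared with the 75% loss when the firmware throttles from 20 t/s to 5 t/s.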
Start VRAM Shield first, then launch llama.cpp:
# Terminal 1: Start VRAM Shield (run as Administrator)
.\VRAMShield_2.2.2.exe
# Terminal 2: Start llama.cpp inference
./build/bin/llama-cli \
-m "models/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf" \
-cmoe \
-c 248000 \
-v
Watch the VRAM Shield dashboard as the model loads. You'll see the VRAM junction temperature spike briefly during initial weight loading, then stabilize as pulse throttling engages.
Step 5: connect your application
For web development, you'll want to run llama.cpp as a server and connect your application to it. Start the server with the following command:
# Start llama.cpp server with OpenAI-compatible API
./build/bin/llama-server \
-m "models/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf" \
-cmoe \
-c 64000 \
--host 0.0.0.0 \
--port 8080
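The server listens on port 8080 and exposes an OpenAI-compatible API at http://localhost:8080. Note that --host 0.0.0.0 binds to all network interfaces, so other machines on your network can reach the server; use --host 127.0.0.1 if you only need local access. Create a .env.local file in your project root with your local LLM configuration.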
# .env.local
LOCAL_LLM_API_URL=http://localhost:8080/v1
LOCAL_LLM_MODEL=gemma-4-26B-A4B-it-qat-UD-Q4_K_XL
Test the connection with curl:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-4-26B-A4B-it-qat-UD-Q4_K_XL",
"messages": [
{"role": "user", "content": "Explain the -cmoe flag in llama.cpp"}
],
"temperature": 0.7,
"max_tokens": 500
}'
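Loading a 13.2GB model takes a while, so a request sent right after launch may fail. The sketch below polls the server until it reports ready. It assumes llama-server's /health endpoint, which returns HTTP 200 once the model is loaded. The function names, timeout, and polling interval are hypothetical choices.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=2.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status such as 503
        return False


def wait_until_ready(probe, timeout_s=300.0, interval_s=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call `probe` repeatedly until it returns True or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A typical call is wait_until_ready(lambda: http_probe("http://localhost:8080/health")), followed by the first chat request once it returns True.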
For Node.js or Next.js applications, create a helper module that wraps the API calls:
// lib/local-llm.js
const LOCAL_LLM_URL = process.env.LOCAL_LLM_API_URL || 'http://localhost:8080/v1';

export async function chatCompletion(messages, options = {}) {
  const response = await fetch(`${LOCAL_LLM_URL}/chat/completions`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: options.model || 'gemma-4-26B-A4B-it-qat-UD-Q4_K_XL',
      messages,
      // Use ?? rather than || so an explicit temperature of 0 is respected
      temperature: options.temperature ?? 0.7,
      max_tokens: options.maxTokens ?? 1000,
    }),
  });
  if (!response.ok) {
    throw new Error(`LLM API error: ${response.status}`);
  }
  return response.json();
}

// Usage
const result = await chatCompletion([
  { role: 'system', content: 'You are a helpful coding assistant.' },
  { role: 'user', content: 'Write a React component for a todo list.' },
]);
console.log(result.choices[0].message.content);
For Python applications, the pattern is similar:
# local_llm.py
import os

import requests

LOCAL_LLM_URL = os.getenv("LOCAL_LLM_API_URL", "http://localhost:8080/v1")


def chat_completion(messages, model="gemma-4-26B-A4B-it-qat-UD-Q4_K_XL",
                    temperature=0.7, max_tokens=1000):
    """Send a chat request to the local server and return the parsed JSON."""
    response = requests.post(
        f"{LOCAL_LLM_URL}/chat/completions",
        json={
            "model": model,
            "messages": messages,
            "temperature": temperature,
            "max_tokens": max_tokens,
        },
        timeout=300,  # generous limit, since long generations can take minutes
    )
    response.raise_for_status()
    return response.json()


# Usage
result = chat_completion([
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to sort a list."},
])
print(result["choices"][0]["message"]["content"])
Performance notes: first token latency is around 200 milliseconds on cold start. Sustained throughput stays at 18 to 20 tokens per second with VRAM Shield running. For context window sizing, 32K to 64K tokens works best for web applications. Depending on your build and the --parallel (-np) setting, llama-server may process only one request at a time, so use a queue if you have multiple concurrent users.
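One simple way to queue requests is to serialize them on the client side. The sketch below wraps any request function, such as chat_completion from the Python example above, in a lock so that concurrent callers wait their turn. The class name is hypothetical, and this is a single-process approach; a multi-process deployment would need a shared queue instead.

```python
import threading


class SerializedClient:
    """Run at most one request at a time; other callers block until it finishes."""

    def __init__(self, send):
        # `send` is any callable with the signature send(messages, **options)
        self._send = send
        self._lock = threading.Lock()

    def chat(self, messages, **options):
        # The lock is released even if `send` raises an exception
        with self._lock:
            return self._send(messages, **options)
```

Create one shared instance, for example client = SerializedClient(chat_completion), and have every request handler call client.chat(...). Waiting callers are released one at a time as each request finishes, although a plain lock does not guarantee strict first-come ordering.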
Your local AI development environment is now ready. You have a 26B parameter model running on your laptop, accessible via a standard API, with thermal management that prevents performance degradation.
Advanced configuration
If you need to reduce memory usage further, try the Q4_K_M quantization. It's 12.5GB versus 13.2GB for Q4_K_XL, with slightly lower quality. Use it if you have other GPU processes competing for VRAM.
The KV cache memory footprint scales linearly with context length. A 32K context uses about 400MB of KV cache and keeps total VRAM usage around 1.6GB. A 64K context doubles the cache to 800MB and raises total VRAM usage to about 2.0GB. For web applications, 32K to 64K provides the best balance of performance and capability. Only push to 128K or 248K if you genuinely need the full context for long document analysis.
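The linear scaling can be turned into a quick estimator. This sketch extrapolates from the two figures quoted in this guide (400MB of KV cache at 32K, and roughly 1.2GB for the attention layers). Actual usage depends on the model architecture and on llama.cpp's cache settings, so treat the output as a rough guide only.

```python
KV_MB_AT_32K = 400    # KV cache at 32K context, from this guide
ATTENTION_GB = 1.2    # resident attention layers, from this guide


def kv_cache_mb(context_k):
    """Estimated KV cache in MB for a context of `context_k` thousand tokens."""
    return KV_MB_AT_32K * context_k / 32


def total_vram_gb(context_k):
    """Estimated VRAM in GB: attention layers plus KV cache."""
    return ATTENTION_GB + kv_cache_mb(context_k) / 1000


for context_k in (32, 64, 128, 248):
    print(f"{context_k:>3}K context: "
          f"KV {kv_cache_mb(context_k):.0f} MB, "
          f"total {total_vram_gb(context_k):.1f} GB")
```

Under this linear assumption, the full 248K context needs about 3.1GB of KV cache and 4.3GB of VRAM in total. That still fits in 8GB but leaves much less headroom than the 32K and 64K settings.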
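For Linux users, VRAM Shield isn't available yet. Use nvidia-smi to watch GPU temperatures in real time, or write a simple Python script using pynvml to poll the thermal sensors. Be aware that many consumer GPUs do not report a memory temperature through these tools (nvidia-smi shows N/A), in which case you can only track the core temperature as an indirect signal.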
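A minimal monitoring sketch for Linux might look like the following. It shells out to nvidia-smi rather than using pynvml, so it has no third-party dependencies. The temperature.gpu and temperature.memory query fields are standard nvidia-smi fields, but the memory field often reads N/A on consumer cards. The 95°C memory limit mirrors the VRAM Shield target in this guide, while the 80°C core fallback is an assumed value, and the function names are hypothetical.

```python
import subprocess

QUERY = "temperature.gpu,temperature.memory"


def parse_sample(line):
    """Parse one CSV line from nvidia-smi into core and memory temperatures."""
    def to_number(field):
        try:
            return float(field.strip())
        except ValueError:
            return None  # "N/A" or "[Not Supported]"

    core, memory = (to_number(field) for field in line.split(",")[:2])
    return {"core_c": core, "memory_c": memory}


def read_sample():
    """Query the first GPU once. Requires nvidia-smi on PATH."""
    output = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(output.splitlines()[0])


def should_back_off(sample, memory_limit_c=95.0, core_limit_c=80.0):
    """Decide whether to pause inference based on the available sensors."""
    if sample["memory_c"] is not None:
        return sample["memory_c"] >= memory_limit_c
    # Memory sensor unavailable: fall back to the core temperature (assumed limit)
    return sample["core_c"] is not None and sample["core_c"] >= core_limit_c
```

Calling read_sample() in a loop every few seconds and pausing requests while should_back_off(...) returns True gives a crude substitute for pulse throttling.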
For maximum throughput, optimize the server launch parameters. Use -t 8 to allocate 8 CPU threads for the forward pass, --mlock to prevent the model from being swapped out, and a smaller context window like -c 32000 to free up VRAM headroom.
Troubleshooting: if you encounter OOM errors, close other GPU-accelerated applications like browsers and Discord, reduce the context window to 32K, or switch to Q4_K_M. If inference is slow, verify CUDA is enabled by checking the version output, confirm GPU utilization with nvidia-smi, and make sure VRAM Shield is running with thermal management active.
Summary
You now have a complete local AI development environment running Gemma-4 26B on budget hardware. You've built llama.cpp with CUDA support, configured the -cmoe memory split that fits a 26B model on 8GB VRAM, set up VRAM Shield thermal management that prevents the 15-minute cliff, and connected an OpenAI-compatible API server to your web applications.
The local-first AI revolution is here. You don't need expensive cloud API keys or $5,000 workstations. A standard developer laptop with an RTX 4060 is enough to run cutting-edge 26B parameter models. The key insight is that running large models on budget hardware requires intelligent memory management and thermal control. The -cmoe flag solves the memory problem. VRAM Shield solves the thermal problem. Together, they make local AI development stable and accessible.
Get started
Star the VRAM Shield repository: github.com/53-software/vram-shield
Download VRAM Shield: vramshield.com or GitHub Releases
Join the community: Share your thermal benchmarks and configuration tips
The tools are open-source. The models are open-weight. The future of AI development is local. Build it on your own hardware.

