onInit logo
onInit.io
/Projects

Local AI Server

Self-contained AI infrastructure: Run capable local AI models without external API dependencies.

Local AIInfrastructure

The Challenge

Pricing for cloud-based AI APIs continues to rise, rate limits tighten, and major providers increasingly acknowledge their inability to sustain services at current pricing levels. Organizations relying on AI-powered workflows and agents face a growing dependency on unpredictable, ever-increasing cost structures.

At the same time, many local alternatives fail at the hardware level: pure CPU inference is too slow, while enterprise GPU solutions are prohibitively expensive. The gap lies in a practical, cost-efficient architecture that can truly run high-performance models in production on available prosumer equipment.

Our Solution

We built a system based on a Threadripper Pro 3975WX processor paired with four AMD Radeon RX 7900 XTX GPUs, optimally combining raw computational power with commercially available hardware. Through AMD ROCm, GPU acceleration is effectively harnessed for AI inference on consumer-grade silicon.

The server runs two inference backends side by side: llama.cpp and LM Studio for users who value simple configuration and low power consumption, vLLM for scenarios where maximum throughput and enterprise-grade features are paramount. The appropriate backend is selected based on the requirements of each use case.

The result is a fully autonomous server: no data leaves the network, no ongoing API fees, no rate limits. After the initial investment, the only cost is electricity - and the system performance allows productive use of the largest and most complex models available.

Outcomes

€0 APIongoing costs after initial investment
120Bmodels up to 120 billion parameters
no rate limits

System Architecture

01

Prosumer hardware foundation: Threadripper Pro 3975WX + 4× RX 7900 XTX

02

AMD ROCm: GPU-accelerated AI inference on consumer hardware

03

llama.cpp: Quantization and resource-efficient inference with minimal overhead

04

LM Studio: Model management, API layer, and easy-to-use interface

05

vLLM: High throughput, PagedAttention, and enterprise-grade serving features

06

Dynamic backend routing: selection of the right backend per requirement

Technology Stack

Hardware

Threadripper Pro 3975WX4× Radeon RX 7900 XTX

GPU-Beschleunigung

AMD ROCm
AMD ROCm

Inference-Engine

llama.cpp
llama.cpp
vLLM
vLLM
Other projects
01 / 01
Caeliq Logo

01

Caeliq

Prototype: AI agent proves fully automated travel proposal creation - request to PDF.

A working prototype demonstrates the full workflow: an AI agent receives a travel request, selects suitable flights, assembles the proposal, and outputs it as a PDF. Next step: pilot partnership with live GDS connection.

Ready to take the first step?

30 minutes. No pitch. You'll leave with clarity - not another proposal.

Book a Free Discovery Call →

Free discovery call · No commitment