How to Migrate from Cloud AI to On-Premise with Xinity
A practical guide for engineering teams ready to take control of their AI infrastructure.
You already know why. Here's how.
If you're reading this, you've probably already felt it: unpredictable API bills, data leaving your infrastructure with every call, and a growing sense that your AI stack is built on someone else's terms.
This guide walks you through the process — from auditing your current cloud AI usage to running your first on-premise inference call with Xinity.
Step 1: Audit Your Current AI Usage
Before you touch any infrastructure, understand what you're actually using.
Map every API call. Go through your codebase and identify every point where you call OpenAI, Anthropic, Azure OpenAI, or any other cloud AI provider. Document the model, the endpoint, and the approximate volume.
Categorize by criticality. Your customer-facing chatbot running at 10,000 requests per day is a different migration than your internal document summarizer running once a week. Rank them: production-critical, internal tooling, experimental.
Calculate your actual spend. Pull your invoices for the last 6 months. Plot the trend. You can use our ROI Calculator to see what those numbers look like on your own hardware.
Identify data sensitivity. Which workloads process customer data? Employee data? Anything under GDPR Article 9 — health, biometric, political data — is an immediate candidate for migration.
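One way to start the mapping step is a small script that walks the repository and flags lines that reference a cloud AI provider. The sketch below is only a starting point: the regular expressions and the provider labels are assumptions chosen for illustration, and they will miss calls made through wrappers or other languages unless you extend them.

```python
import re
from collections import Counter
from pathlib import Path

# Hypothetical patterns: extend them with the SDKs and hosts your code actually uses.
PROVIDER_PATTERNS = {
    "openai": re.compile(r"\bimport openai\b|\bfrom openai\b|api\.openai\.com"),
    "anthropic": re.compile(r"\bimport anthropic\b|\bfrom anthropic\b|api\.anthropic\.com"),
    "azure_openai": re.compile(r"AzureOpenAI|openai\.azure\.com"),
}


def scan_for_providers(root, suffixes=(".py",)):
    """Return (path, line_number, provider) for every line that matches a pattern."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for line_number, line in enumerate(text.splitlines(), start=1):
            for provider, pattern in PROVIDER_PATTERNS.items():
                if pattern.search(line):
                    hits.append((str(path), line_number, provider))
    return hits


def summarize(hits):
    """Count matching lines per provider."""
    return Counter(provider for _, _, provider in hits)
```

Running `summarize(scan_for_providers("."))` from the repository root gives a rough per-provider count. Each hit still needs a manual look to record the model, endpoint, and volume.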
Step 2: Choose Your Model Equivalents
Cloud providers lock you into their proprietary models. On-premise, you have choices. Here's what we run and recommend:
Ministral 3B Instruct — Mistral's compact 3B parameter model, optimized for edge deployment. Handles chat, instruction-following, classification, extraction, and routing tasks at high speed with minimal resources. Fits in 8GB VRAM. Ideal for high-throughput workloads where you'd currently use GPT-3.5 or GPT-4o-mini — the tasks that make up the bulk of most API bills.
Qwen 3.5 35B — Alibaba's 35B parameter model for complex reasoning, analysis, and generation tasks. This is your GPT-4 equivalent — multi-step reasoning, nuanced language understanding, long-form content generation. Needs more VRAM but delivers the quality your production workloads demand.
Qwen 3.6 35B FP8 — The latest generation Qwen at 35B parameters in FP8 quantization. Same class of capabilities as the 3.5, with improved performance and efficiency from the FP8 format — meaning faster inference at the same quality level.
The key insight: you don't need one model to replace everything. Route fast, simple tasks to Ministral 3B. Route complex reasoning to Qwen 35B. Xinity handles this routing natively through the OpenAI-compatible API — your application just specifies the model name in the request, exactly like you do with OpenAI today.
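On the application side, this routing can be as small as a lookup that picks the model name before the request is built. The sketch below assumes two deployments with the hypothetical public names `ministral-3b-instruct` and `qwen-35b`, and an illustrative set of task labels. None of these names come from Xinity itself.

```python
# Hypothetical public model names: use the specifiers you gave your own deployments.
FAST_MODEL = "ministral-3b-instruct"
REASONING_MODEL = "qwen-35b"

# Illustrative task labels for simple, high-volume work.
SIMPLE_TASKS = {"chat", "classification", "extraction", "routing"}


def pick_model(task_type):
    """Send simple tasks to the small model and everything else to the large one."""
    return FAST_MODEL if task_type in SIMPLE_TASKS else REASONING_MODEL


def build_request(task_type, prompt):
    """Build an OpenAI-style chat completion body; only the model name varies."""
    return {
        "model": pick_model(task_type),
        "messages": [{"role": "user", "content": prompt}],
    }
```

Keeping the choice in one function means a later change, such as moving a task class to the larger model, touches a single place.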
Step 3: Get Your Hardware
We recommend the ASUS Ascent GX10.
It's a desktop AI supercomputer powered by the NVIDIA GB10 Grace Blackwell Superchip. What makes it practical for this migration:
128GB unified memory — run models up to 200 billion parameters on your desk
Up to 1 petaflop of AI performance — production-grade inference and fine-tuning
150 x 150 x 51mm — fits on a desk, no server room required
240W power adapter — standard outlet, no special infrastructure
Scalable — connect two units via NVIDIA ConnectX-7 to double to 2 petaflops with 256GB memory
Ubuntu Linux pre-installed — ready for AI workloads with PyTorch, TensorFlow, and Ollama included
10G Ethernet + ConnectX-7 SmartNIC — fast data movement for production workloads
No rack space. No cooling rooms. Enterprise-grade AI compute that plugs in and runs.
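A quick way to sanity-check whether a model fits in a given amount of memory is to multiply the parameter count by the bytes per parameter. The sketch below covers model weights only. The 80% headroom figure is an assumption to leave room for the KV cache and runtime overhead, and the precisions in the usage note are illustrative rather than stated specifications.

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Approximate memory for model weights alone, in decimal gigabytes."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


def fits(params_billion, bits_per_param, memory_gb, headroom=0.8):
    """True if the weights fit within a fraction of memory (headroom is an assumed margin)."""
    return weight_memory_gb(params_billion, bits_per_param) <= memory_gb * headroom
```

Under these assumptions, a 35B model at 8 bits per parameter needs about 35 GB for weights, and a 200B model needs about 100 GB at 4 bits but about 400 GB at 16 bits. That is why the largest models only fit in 128GB when heavily quantized.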
Step 4: Install Xinity
Install the CLI and bring up the full stack:
# Install the Xinity CLI
curl -fsSL https://get.xinity.ai/install.sh | bash
# Set up everything (Postgres, inference engine, dashboard)
If you're deploying to a remote server instead of the local machine:
xinity up all --target-host
Create your admin account from the terminal:
Check that everything is healthy:
Step 5: Deploy Your First Model
Deploy a model directly from the CLI. Here's a quick start with Phi-3 Mini:
xinity act deployment.create '{
  "name": "Phi-3 Mini",
  "publicSpecifier": "phi-3-mini",
  "modelSpecifier": "phi3:mini",
  "enabled": true
}'
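If you script deployments, building the JSON argument programmatically avoids shell-quoting mistakes. The sketch below only assembles the command shown above from its four fields. The helper name is hypothetical, and it assumes the CLI accepts the payload exactly as in that example.

```python
import json
import shlex


def deployment_create_command(name, public_specifier, model_specifier, enabled=True):
    """Return the argv list for the `xinity act deployment.create` call."""
    payload = {
        "name": name,
        "publicSpecifier": public_specifier,
        "modelSpecifier": model_specifier,
        "enabled": enabled,
    }
    return ["xinity", "act", "deployment.create", json.dumps(payload)]


argv = deployment_create_command("Phi-3 Mini", "phi-3-mini", "phi3:mini")
print(shlex.join(argv))  # shell-quoted form of the same command
```

The list can be passed to `subprocess.run(argv, check=True)` on a machine where the CLI is installed.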
Check the deployment status:
xinity act deployment.list '{"withStatus": true}'
Once it shows "ready", you have a running inference endpoint. You can also deploy and manage models from the Xinity dashboard at localhost:3100. The Model Hub gives you a visual interface to deploy, test, edit, and monitor all your models.
Step 6: Make Your First Call
Hit your local OpenAI-compatible API:
curl http://localhost:3000/v1/chat/completions \
  -H "Authorization: Bearer sk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-3-mini",
    "messages": [{"role": "user", "content": "Hello from on-prem."}]
  }'
Same endpoint format, same request body, same response structure as OpenAI.
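For a smoke test without any SDK, the same call can be made from Python's standard library. This sketch mirrors the curl request and reads the reply from `choices[0].message.content`, which is where the OpenAI chat completion format places it. The function names are illustrative, and the base URL and key are the placeholders from the example above.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, prompt):
    """Build the same POST request as the curl example, without sending it."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return response_body["choices"][0]["message"]["content"]


def chat(base_url, api_key, model, prompt, timeout=60):
    """Send the request and return the assistant's reply."""
    request = build_chat_request(base_url, api_key, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))
```

Calling `chat("http://localhost:3000/v1", "sk_...", "phi-3-mini", "Hello from on-prem.")` against a running instance should return the model's text reply.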
Step 7: Switch Your Application Code
In your application, the change is two lines:
Before (OpenAI):
from openai import OpenAI
client = OpenAI(api_key="sk-...")
After (Xinity):
from openai import OpenAI
client = OpenAI(
    base_url="http://your-xinity-instance:3000/v1",
    api_key="sk_your_xinity_key"
)
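Same SDK. Same method calls. client.chat.completions.create() works the same way, as long as the model argument names one of your Xinity deployments (for example "phi-3-mini" instead of an OpenAI model name). Your application logic doesn't change, only the configuration.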
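To keep the switch purely in configuration, the client settings can come from environment variables, so the same code can point at either backend. In this sketch `AI_API_KEY` and `AI_BASE_URL` are hypothetical variable names. When the base URL is unset, the SDK falls back to its default endpoint.

```python
import os


def client_kwargs(env=os.environ):
    """Build keyword arguments for OpenAI(...) from environment variables.

    AI_API_KEY and AI_BASE_URL are hypothetical names chosen for this sketch.
    """
    kwargs = {"api_key": env["AI_API_KEY"]}
    base_url = env.get("AI_BASE_URL")
    if base_url:
        # When set, requests go to this endpoint instead of the SDK default.
        kwargs["base_url"] = base_url
    return kwargs


# Usage (requires the openai package):
#   from openai import OpenAI
#   client = OpenAI(**client_kwargs())
```

Setting `AI_BASE_URL=http://your-xinity-instance:3000/v1` in one environment and leaving it unset in another lets you move workloads without a code deploy.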
Step 8: Migrate Gradually
Don't cut over everything at once.
Start with your lowest-risk workload. Internal tools, development environments, non-customer-facing features. Point them at Xinity.
Compare outputs. For your first few hundred requests, log both responses side by side. Most teams find the quality is equivalent for their specific use case.
Shift traffic progressively. Move workloads one by one, starting with internal and progressing to production.
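The comparison and traffic-shifting steps can be sketched with two small helpers. The first assigns each request to a stable bucket so that a given request ID always lands on the same backend. The second calls both backends and logs the pair for review. The function names and the percentage-based rollout are assumptions for illustration, not Xinity features.

```python
import hashlib


def use_onprem(request_id, rollout_percent):
    """Deterministically assign a request to on-premise based on a hash of its ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < rollout_percent


def shadow_compare(prompt, call_cloud, call_onprem, log):
    """Call both backends, record the pair for review, and return the cloud answer."""
    cloud_reply = call_cloud(prompt)
    onprem_reply = call_onprem(prompt)
    log.append({"prompt": prompt, "cloud": cloud_reply, "onprem": onprem_reply})
    return cloud_reply
```

During the comparison phase, users still see the cloud answer while the on-premise answer is only logged. Once the logged pairs look acceptable, raise `rollout_percent` step by step for that workload.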
The Economics
Cloud AI pricing assumes bursty, unpredictable workloads — roughly 15-20% GPU utilization. Production AI agents run at 80-90% utilization. At that utilization, dedicated on-premise hardware delivers roughly 80% cost savings compared to equivalent cloud capacity.
We call this the Utilization Inversion: the moment your AI workloads become predictable enough that owning beats renting.
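The arithmetic behind this can be sketched in a few lines. The two functions below are a simplified model, not the ROI Calculator: one gives the break-even point for a hardware purchase, the other shows how the cost of a useful hour on fixed-cost hardware falls as utilization rises. All inputs are placeholders for your own figures.

```python
def break_even_months(hardware_cost, monthly_cloud_spend, monthly_onprem_cost):
    """Months until the monthly savings have paid for the hardware."""
    monthly_saving = monthly_cloud_spend - monthly_onprem_cost
    if monthly_saving <= 0:
        return None  # owning never pays off at these numbers
    return hardware_cost / monthly_saving


def cost_per_useful_hour(hourly_cost, utilization):
    """Cost of one hour of actual work when capacity is only partly used."""
    return hourly_cost / utilization
```

With purely hypothetical numbers, hardware costing 4,000 that replaces 1,000 of monthly cloud spend and costs 200 a month to run breaks even after 5 months. Likewise, a fixed hourly cost of 1.00 works out to 5.00 per useful hour at 20% utilization and 1.25 at 80%.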
Run the numbers for your setup: xinity.ai/roi-calculator
What You Get After Migration
No per-token charges. Fixed, predictable infrastructure costs.
No data egress. Every request stays on your hardware, in your jurisdiction.
No vendor lock-in. Xinity is open source under Apache 2.0.
Simpler compliance. Data residency for GDPR and the EU AI Act is handled by architecture, not policy promises, though your own processing obligations still apply.
Same developer experience. Your engineers keep using the OpenAI SDK.
Ready to Start?
The full migration whitepaper with hardware sizing and model benchmarks is available at xinity.ai/whitepaper.
Or skip straight to doing — Xinity is open source:
curl -fsSL https://get.xinity.ai/install.sh | bash
Own your AI. Own your data. Control your costs.
YOUR AI. YOUR SERVERS.
Ready to Run any AI on Your Own Terms?
No commitment. 30 minutes. We'll show you exactly what deployment looks like for your company.
Use Link
Company
Am Gestade 5/2
1010 Vienna, Austria
© 2026 Xinity
