Genie Inference
One OpenAI-compatible endpoint over a multi-engine local stack. Send a persona; Genie picks the model, the quant, the GPU, and the engine that wins that workload.
What it is
The same wire shape as api.openai.com — point your existing OpenAI SDK at https://api.genie.tech/v1 and go. Behind it: vLLM, SGLang, and ktransformers running open-weights models on the fleet. Every model in the catalog is open-weights; customer prompts never touch a closed-weight API.
The multi-engine router
You don't choose an engine — you choose a persona, and the router picks the engine that's fastest for that workload:
- Prefix-shared work (e.g. review) routes to SGLang.
- Unique-prompt work routes to vLLM.
- Long-context work routes to the engine that pages KV cache best.
No vendor lock-in, no model-name strings in your code — the persona is the contract, and the routing can change underneath without touching your integration.
Local-first
Inference is governed by the local-first axiom: the best model runs on local hardware, every optimization technique is on the table to raise the local ceiling, and cloud fan-out is a last-resort overflow valve — never a capability replacement. See Genie Fleet.
API reference
- Chat completions —
POST /v1/chat/completions, sync · stream · async. - Image generation —
POST /v1/images/generations. - Video generation —
POST /v1/videos/generations. - Models —
GET /v1/models, the catalog the fleet can serve. - Authentication and budgets & caps.