Genie Inference

One OpenAI-compatible endpoint over a multi-engine local stack. Send a persona; Genie picks the model, the quant, the GPU, and the engine that wins that workload.

What it is

The same wire shape as api.openai.com — point your existing OpenAI SDK at https://api.genie.tech/v1 and go. Behind it: vLLM, SGLang, and ktransformers running open-weights models on the fleet. Every model in the catalog is open-weights; customer prompts never touch a closed-weight API.

The multi-engine router

You don't choose an engine — you choose a persona, and the router picks the engine that's fastest for that workload:

Prefix-shared work (e.g. review) routes to SGLang.
Unique-prompt work routes to vLLM.
Long-context work routes to the engine that pages KV cache best.

No vendor lock-in, no model-name strings in your code — the persona is the contract, and the routing can change underneath without touching your integration.

Local-first

Inference is governed by the local-first axiom: the best model runs on local hardware, every optimization technique is on the table to raise the local ceiling, and cloud fan-out is a last-resort overflow valve — never a capability replacement. See Genie Fleet.

API reference

Chat completions — POST /v1/chat/completions, sync · stream · async.
Image generation — POST /v1/images/generations.
Video generation — POST /v1/videos/generations.
Models — GET /v1/models, the catalog the fleet can serve.
Authentication and budgets & caps.