Optimize open models for production - RunInfra

Source: https://runinfra.ai/

RunInfraby RightNow

- Use cases

- Pricing

- Research

- Resources

- Contact

Dashboard Sign in Get started

Backed by!Image 1: Y CombinatorCombinator

Optimize open model production Optimize any open model for production any for

Paste a model. RunInfra benchmarks the options and picks the winner. Deploy it, or own the stack.

Describe the inference workload you want to deploy...

Auto engine Auto GPU

Example workloads

Compare engines Find the best serving engine for !Image 2Qwen 2.5 7B Tune latency Optimize !Image 3Qwen 2.5 7B for low latency Ship speech Deploy !Image 4Whisper Large V3 Turbo with p95 and cost checks Scale retrieval Build !Image 5BGE-M3 embeddings with batch throughput metrics

!Image 6

Every optimization ends with a result you can inspect and run.

You get a benchmark receipt and a runnable deployment kit. Nothing hidden.

Serving engine

Compared, not assumed

GPU target

Sized to the model

p95 latency

Benchmarked

Throughput

Measured per GPU

VRAM

Checked for fit

Cost

Tracked per run

GPU kernels

Tuned where supported

Deployment kit

Run it or export it

RUNINFRA

From prompt to a production stack you own

RunInfra compares, tunes, and benchmarks the stack. Deploy or export it.

Describe a Llama 3.1 70B Qwen 2.5 7B DeepSeek V3 Mistral 7B Phi-4 Gemma 2 9B Mixtral 8x7B!Image 7Whisper Large V3 Llama 3.1 70B workload in plain English.

RunInfra compares!Image 8vLLM!Image 9SGLang!Image 10TensorRT-LLM!Image 11vLLM-Omni!Image 12SGLang and every other engine your model can run on.

It tunes speculative decoding kernel generation server tuning quantization KV cache reuse FlashAttention v2 continuous batching server tuning where it helps, with no config to hand-write.

Deploy on NVIDIA H100 NVIDIA H200 NVIDIA B200 NVIDIA A100 NVIDIA L40S NVIDIA L4 NVIDIA A100 and pay per million tokens, or export the stack and self-host.

!Image 13

New Session

Optimize !Image 14Llama-3.1-8B-Instruct on !Image 15vLLM for cheapest GPU with latency and VRAM checks

Thought for 1s

Capturing cost-first intent for Llama 3.1 8B on vLLM.

Intake updated Model, engine, goal

Model Llama 3.1 8B

Engine vLLM

Goal Low cost, latency checked

Requirements collected 575ms

Model, engine, GPU target, latency goal

Plan drafted 3.1s

10 execution phases prepared

Plan ready review / 10 phases / ~23m

Optimize Llama 3.1 8B on vLLM for cheapest GPU with latency checks

Recommended path: vLLM on L4. Review the plan before execution.

Open Plan Accept and run

Working...

Runbook generated from the workload

Llama 3.1 8B on vLLM, lowest viable cost

draft

Latency target

p95 under 60ms

VRAM budget

24 GB

Est. runtime

~23 min

Execution plan 10 phases, 3 validated

AWQ int4 quantization ready

Weight-only int4, calibrated offline

FlashAttention v2 ready

Fused attention kernels

Continuous batching queued

In-flight request scheduling

Paged KV cache queued

fp8 cache in paged blocks

CUDA graph capture queued

Replay the decode-step graph

Speculative decoding queued

Draft model proposes tokens

Prefix caching queued

Reuse shared prompt prefixes

Tensor-parallel sizing ready

Single GPU, no sharding

Warmup and autotune queued

Lock kernel shapes pre-serve

Serving-config tune queued

Batch size and concurrency

Review the plan, then run.vLLM, L4 to A100

!Image 16

Optimization run

running 5/6 phases

Benchmarking the tuned config against cheaper GPU candidates.

Candidate set built 0.6s

Serving config tuned 2.9s

AWQ int4 quantization applied 41.7s

FlashAttention v2 kernels compiled 58.3s

Candidates benchmarked 94.2s

Confirming winner live

Baseline vs optimized best candidate so far

Metric Baseline Optimized Delta

P95 latency 184ms 38ms-79%

Time to first token 120ms 22ms-82%

Throughput 45 tok/s 142 tok/s+216%

VRAM 28.4 GB 12.1 GB-57%

Cost / 1M tokens$0.42$0.12-71%

GPU candidates winner marked

GPU candidate Cost / 1M p95 Pick

NVIDIA L4 misses latency

$0.08 84ms-

NVIDIA L40S

$0.12 38ms

NVIDIA A100 overspec

$0.21 31ms-

Deploy or export the stack

Pick a target for the optimized L40S build.

ready

Target Supported GPUs

!Image 17

Managed by RunInfra

selected

Fully managed endpoint, billed per million tokens

H100 L40S A100 L4

!Image 18

Your RunPod

Deploy to your own RunPod account

H100 A100 RTX 4090 L40S

!Image 19

Modal

Serverless deploy on RunInfra Modal

H200 H100 A100 L40S

!Image 20

Your Modal

Deploy to your own Modal workspace

H100 A100 L40S T4

Generated stack Dockerfile serve.sh runinfra.yaml

1 FROM runinfra/vllm:0.6.3-l40s

2 ENV MODEL=Llama-3.1-8B-Instruct

3 ENV QUANTIZATION=awq_marlin

4 COPY ./weights /models

5# optimized serving config

6 CMD vllm serve $MODEL \

7--quantization awq_marlin \

8--kv-cache-dtype fp8 \

9--enable-prefix-caching \

10--max-num-seqs 256 \

11--gpu-memory-utilization 0.92

$0.12 / 1M 38ms p95 L40S

Export kit Deploy

!Image 21

Deploy on the hardware you choose

Compare real GPU prices and deploy wherever fits. No lock-in, no rewrites.

Own the winning stack before you ship

Not a black box. You get the measured stack to run, deploy, or export.

Benchmark Verified

p99 latency 64 ms 64%

throughput 3.4k tok/s 2.8x

cost / 1M$0.15 64%

Llama 3.1 8B, vLLM, L4 24GB

Benchmark receipt

Before and after, in one record you can reproduce.

p99 latency

throughput

VRAM

cost

reproduction

runinfra.yaml yaml

1 2 3 4 5

engine: vLLM quantization: awq-int4 kv_cache: fp8 max_num_seqs: 256 speculative: eagle

Optimized runtime config

Every serving setting RunInfra tuned. Read or change it.

engine flags

batch settings

quantization

kernel paths

kv cache

deployment-kit Runnable

Dockerfile

compose.yaml

k8s/

deployment.yaml

serve.py

benchmark.md

Exportable stack

A runnable repo you take with you.

Dockerfile

compose

launch scripts

reports

Live p50 38 ms

POST/v1/chat/completions 200

L4 2 replicas autoscale

Managed endpoint

The same measured stack, hosted by us.

RunInfra Cloud

resource config

inspectable

portable

Why owning your AI stack matters

Data privacy and control

Keep sensitive workloads on infrastructure you choose.

Customization

Tune the model, runtime, and GPU to your workload.

Performance ownership

Real tuning, measured, not assumed.

Portability

Run it on our cloud, or export it to yours.

Supported across the stack

Open models, serving engines, GPUs, and the clouds you deploy to. RunInfra supports every layer of the inference stack.

Models

Models: Llama 3.3, Whisper, Qwen-Image, NV-Embed, Parler-TTS, Qwen2.5, Cosmos, Pixtral, EmbeddingGemma, RoBERTa, DeepSeek-V3, Sana, Parakeet, Mistral, Wan 2.1, GTE, Qwen2-VL, MMS-TTS, Qwen3 Reranker, Gemma 2, MusicGen, DeepSeek-VL2, Nemotron, FastPitch, PaliGemma, NV-RerankQA, Qwen2-Audio, Hermes 3, Canary, BERT, Llama 3.2 Vision, Qwen3 Embedding

!Image 22LLM Llama 3.3!Image 23ASR Whisper!Image 24Image Qwen-Image!Image 25Embed NV-Embed!Image 26TTS Parler-TTS!Image 27LLM Qwen2.5!Image 28Video Cosmos!Image 29Vision Pixtral!Image 30Embed EmbeddingGemma!Image 31Classify RoBERTa!Image 32LLM DeepSeek-V3!Image 33Image Sana!Image 34ASR Parakeet!Image 35LLM Mistral!Image 36Video Wan 2.1!Image 37Embed GTE!Image 38Vision Qwen2-VL!Image 39TTS MMS-TTS!Image 40Rerank Qwen3 Reranker!Image 41LLM Gemma 2!Image 42Audio MusicGen!Image 43Vision DeepSeek-VL2!Image 44LLM Nemotron!Image 45TTS FastPitch!Image 46Vision PaliGemma!Image 47Rerank NV-RerankQA!Image 48Audio Qwen2-Audio LLM Hermes 3!Image 49ASR Canary!Image 50Classify BERT!Image 51Vision Llama 3.2 Vision!Image 52Embed Qwen3 Embedding!Image 53LLM Llama 3.3!Image 54ASR Whisper!Image 55Image Qwen-Image!Image 56Embed NV-Embed!Image 57TTS Parler-TTS!Image 58LLM Qwen2.5!Image 59Video Cosmos!Image 60Vision Pixtral!Image 61Embed EmbeddingGemma!Image 62Classify RoBERTa!Image 63LLM DeepSeek-V3!Image 64Image Sana!Image 65ASR Parakeet!Image 66LLM Mistral!Image 67Video Wan 2.1!Image 68Embed GTE!Image 69Vision Qwen2-VL!Image 70TTS MMS-TTS!Image 71Rerank Qwen3 Reranker!Image 72LLM Gemma 2!Image 73Audio MusicGen!Image 74Vision DeepSeek-VL2!Image 75LLM Nemotron!Image 76TTS FastPitch!Image 77Vision PaliGemma!Image 78Rerank NV-RerankQA!Image 79Audio Qwen2-Audio LLM Hermes 3!Image 80ASR Canary!Image 81Classify BERT!Image 82Vision Llama 3.2 Vision!Image 83Embed Qwen3 Embedding

Engines

Engines: vLLM, SGLang, TensorRT-LLM, vLLM Omni, TEI, Transformers

!Image 84Engine vLLM!Image 85Engine SGLang!Image 86Engine TensorRT-LLM!Image 87Engine vLLM Omni!Image 88Engine TEI!Image 89Engine Transformers!Image 90Engine vLLM!Image 91Engine SGLang!Image 92Engine TensorRT-LLM!Image 93Engine vLLM Omni!Image 94Engine TEI!Image 95Engine Transformers!Image 96Engine vLLM!Image 97Engine SGLang!Image 98Engine TensorRT-LLM!Image 99Engine vLLM Omni!Image 100Engine TEI!Image 101Engine Transformers!Image 102Engine vLLM!Image 103Engine SGLang!Image 104Engine TensorRT-LLM!Image 105Engine vLLM Omni!Image 106Engine TEI!Image 107Engine Transformers!Image 108Engine vLLM!Image 109Engine SGLang!Image 110Engine TensorRT-LLM!Image 111Engine vLLM Omni!Image 112Engine TEI!Image 113Engine Transformers!Image 114Engine vLLM!Image 115Engine SGLang!Image 116Engine TensorRT-LLM!Image 117Engine vLLM Omni!Image 118Engine TEI!Image 119Engine Transformers

GPUs

GPUs: L4, A10, L40S, RTX 4090, A100, H100, H200, B200

!Image 12024 GB L4!Image 12124 GB A10!Image 12248 GB L40S!Image 12324 GB RTX 4090!Image 12480 GB A100!Image 12580 GB H100!Image 126141 GB H200!Image 127192 GB B200!Image 12824 GB L4!Image 12924 GB A10!Image 13048 GB L40S!Image 13124 GB RTX 4090!Image 13280 GB A100!Image 13380 GB H100!Image 134141 GB H200!Image 135192 GB B200!Image 13624 GB L4!Image 13724 GB A10!Image 13848 GB L40S!Image 13924 GB RTX 4090!Image 14080 GB A100!Image 14180 GB H100!Image 142141 GB H200!Image 143192 GB B200!Image 14424 GB L4!Image 14524 GB A10!Image 14648 GB L40S!Image 14724 GB RTX 4090!Image 14880 GB A100!Image 14980 GB H100!Image 150141 GB H200!Image 151192 GB B200

Clouds

Clouds: RunInfra Cloud, Modal, RunPod, Vast.ai

B200 RunInfra Cloud!Image 152H100 Modal!Image 153A100 RunPod!Image 154RTX 4090 Vast.ai B200 RunInfra Cloud!Image 155H100 Modal!Image 156A100 RunPod!Image 157RTX 4090 Vast.ai B200 RunInfra Cloud!Image 158H100 Modal!Image 159A100 RunPod!Image 160RTX 4090 Vast.ai B200 RunInfra Cloud!Image 161H100 Modal!Image 162A100 RunPod!Image 163RTX 4090 Vast.ai B200 RunInfra Cloud!Image 164H100 Modal!Image 165A100 RunPod!Image 166RTX 4090 Vast.ai B200 RunInfra Cloud!Image 167H100 Modal!Image 168A100 RunPod!Image 169RTX 4090 Vast.ai B200 RunInfra Cloud!Image 170H100 Modal!Image 171A100 RunPod!Image 172RTX 4090 Vast.ai B200 RunInfra Cloud!Image 173H100 Modal!Image 174A100 RunPod!Image 175RTX 4090 Vast.ai

Common questions

Can't find what you're looking for?Get in touch

What is RunInfra?How do I build my first pipeline?Which AI models are supported?How does GPU kernel optimization work?Can I deploy pipelines as APIs?How is this different from using closed-source APIs?Is my data secure?

What is RunInfra?

Describe what you want to run.RunInfra picks compatible open models,benchmarks GPUs,tunes the runtime,and gives you a deploy-ready stack.

!Image 176: RunInfra

Deploy your first optimized model,measured before you ship

Describe the goal. RunInfra builds and optimizes the stack.

Start Building View Pricing

End-to-end encryption

Isolated GPU infrastructure

Zero data retention

SOC 2 Type II

![Image 177RunInfra](https://runinfra.ai/)by RightNow

![Image 178](https://github.com/RightNow-AI)![Image 179](https://x.com/runinfrai)![Image 180](https://www.linkedin.com/showcase/runinfra/)

All systems operational

Pipeline Builder Pricing Docs Research News Contact

Backed by

!Image 181: YCombinator

AICPA Type II ![Image 182SOC 2](https://trust.rightnowai.co/)

!Image 183: NVIDIA Inception Program!Image 184: NVIDIA Inception Program

Ask AI about RunInfra

![Image 185](https://chatgpt.com/?q=What%20is%20RunInfra%20(runinfra.ai)%2C%20the%20chat-native%20AI%20model%20optimization%20and%20infrastructure%20platform%2C%20and%20what%20do%20I%20get%20as%20a%20customer%3F)![Image 186](https://claude.ai/new?q=What%20is%20RunInfra%20(runinfra.ai)%2C%20the%20chat-native%20AI%20model%20optimization%20and%20infrastructure%20platform%2C%20and%20what%20do%20I%20get%20as%20a%20customer%3F)![Image 187](https://www.google.com/search?udm=50&source=searchlabs&q=What%20is%20RunInfra%20(runinfra.ai)%2C%20the%20chat-native%20AI%20model%20optimization%20and%20infrastructure%20platform%2C%20and%20what%20do%20I%20get%20as%20a%20customer%3F)![Image 188](https://grok.com/?q=What%20is%20RunInfra%20(runinfra.ai)%2C%20the%20chat-native%20AI%20model%20optimization%20and%20infrastructure%20platform%2C%20and%20what%20do%20I%20get%20as%20a%20customer%3F)![Image 189](https://www.perplexity.ai/search?q=What%20is%20RunInfra%20(runinfra.ai)%2C%20the%20chat-native%20AI%20model%20optimization%20and%20infrastructure%20platform%2C%20and%20what%20do%20I%20get%20as%20a%20customer%3F)

![Image 190!Image 191Part of RightNow](https://www.rightnowai.co/)Security DPA AUP Cookies Terms Privacy

!Image 192!Image 193!Image 194