Optimize open models for production - RunInfra

Source: https://runinfra.ai/

RunInfraby RightNow

- Use cases

- Pricing

- Research

- Resources

- Contact

Dashboard Sign in Get started

Backed by!Image 6: Y CombinatorCombinator

Optimize open model production Optimize any open model for production any for

Paste a model. RunInfra benchmarks the options and picks the winner. Deploy it, or own the stack.

Describe the inference workload you want to deploy...

Models!Image 7Auto engine Auto GPU

Example workloads

Compare engines Find the best serving engine for !Image 8Qwen 2.5 7B Tune latency Optimize !Image 9Qwen 2.5 7B for low latency Ship speech Deploy !Image 10Whisper Large V3 Turbo with p95 and cost checks Scale retrieval Build !Image 11BGE-M3 embeddings with batch throughput metrics

!Image 12

Every optimization ends with a result you can inspect and run.

You get a benchmark receipt and a runnable deployment kit. Nothing hidden.

Serving engine

Compared, not assumed

GPU target

Sized to the model

p95 latency

Benchmarked

Throughput

Measured per GPU

VRAM

Checked for fit

Cost

Tracked per run

GPU kernels

Tuned where supported

Deployment kit

Run it or export it

RUNINFRA

From prompt to a production stack you own

RunInfra compares, tunes, and benchmarks the stack. Deploy or export it.

Describe a Llama 3.1 70B Qwen 2.5 7B DeepSeek V3 Mistral 7B Phi-4 Gemma 2 9B Mixtral 8x7B!Image 13Whisper Large V3 Llama 3.1 70B workload in plain English.

RunInfra compares!Image 14vLLM!Image 15SGLang!Image 16TensorRT-LLM!Image 17vLLM-Omni!Image 18SGLang and every other engine your model can run on.

It tunes speculative decoding kernel generation server tuning quantization KV cache reuse FlashAttention v2 continuous batching server tuning where it helps, with no config to hand-write.

Deploy on NVIDIA H100 NVIDIA H200 NVIDIA B200 NVIDIA A100 NVIDIA L40S NVIDIA L4 NVIDIA A100 and pay per million tokens, or export the stack and self-host.

!Image 19

New Session

Optimize !Image 20Llama-3.1-8B-Instruct on !Image 21vLLM for cheapest GPU with latency and VRAM checks

Thought for 1s

Capturing cost-first intent for Llama 3.1 8B on vLLM.

Intake updated Model, engine, goal

Model Llama 3.1 8B

Engine vLLM

Goal Low cost, latency checked

Requirements collected 575ms

Model, engine, GPU target, latency goal

Plan drafted 3.1s

10 execution phases prepared

Plan ready review / 10 phases / ~23m

Optimize Llama 3.1 8B on vLLM for cheapest GPU with latency checks

Recommended path: vLLM on L4. Review the plan before execution.

Open Plan Accept and run

Working...

Runbook generated from the workload

Llama 3.1 8B on vLLM, lowest viable cost

draft

Latency target

p95 under 60ms

VRAM budget

24 GB

Est. runtime

~23 min

Execution plan 10 phases, 3 validated

AWQ int4 quantization ready

Weight-only int4, calibrated offline

FlashAttention v2 ready

Fused attention kernels

Continuous batching queued

In-flight request scheduling

Paged KV cache queued

fp8 cache in paged blocks

CUDA graph capture queued

Replay the decode-step graph

Speculative decoding queued

Draft model proposes tokens

Prefix caching queued

Reuse shared prompt prefixes

Tensor-parallel sizing ready

Single GPU, no sharding

Warmup and autotune queued

Lock kernel shapes pre-serve

Serving-config tune queued

Batch size and concurrency

Review the plan, then run.vLLM, L4 to A100

!Image 22

Optimization run

running 5/6 phases

Benchmarking the tuned config against cheaper GPU candidates.

Candidate set built 0.6s

Serving config tuned 2.9s

AWQ int4 quantization applied 41.7s

FlashAttention v2 kernels compiled 58.3s

Candidates benchmarked 94.2s

Confirming winner live

Baseline vs optimized best candidate so far

Metric Baseline Optimized Delta

P95 latency 184ms 38ms-79%

Time to first token 120ms 22ms-82%

Throughput 45 tok/s 142 tok/s+216%

VRAM 28.4 GB 12.1 GB-57%

Cost / 1M tokens$0.42$0.12-71%

GPU candidates winner marked

GPU candidate Cost / 1M p95 Pick

NVIDIA L4 misses latency

$0.08 84ms-

NVIDIA L40S

$0.12 38ms

NVIDIA A100 overspec

$0.21 31ms-

Deploy or export the stack

Pick a target for the optimized L40S build.

ready

Target Supported GPUs

!Image 23

Managed by RunInfra

selected

Fully managed endpoint, billed per million tokens

H100 L40S A100 L4

!Image 24

Your RunPod

Deploy to your own RunPod account

H100 A100 RTX 4090 L40S

!Image 25

Modal

Serverless deploy on RunInfra Modal

H200 H100 A100 L40S

!Image 26

Your Modal

Deploy to your own Modal workspace

H100 A100 L40S T4

Generated stack Dockerfile serve.sh runinfra.yaml

1 FROM runinfra/vllm:0.6.3-l40s

2 ENV MODEL=Llama-3.1-8B-Instruct

3 ENV QUANTIZATION=awq_marlin

4 COPY ./weights /models

5# optimized serving config

6 CMD vllm serve $MODEL \

7--quantization awq_marlin \

8--kv-cache-dtype fp8 \

9--enable-prefix-caching \

10--max-num-seqs 256 \

11--gpu-memory-utilization 0.92

$0.12 / 1M 38ms p95 L40S

Export kit Deploy

!Image 27

Deploy on the hardware you choose

Compare real GPU prices and deploy wherever fits. No lock-in, no rewrites.

Own the winning stack before you ship

Not a black box. You get the measured stack to run, deploy, or export.

Benchmark Verified

p99 latency 64 ms 64%

throughput 3.4k tok/s 2.8x

cost / 1M$0.15 64%

Llama 3.1 8B, vLLM, L4 24GB

Benchmark receipt

Before and after, in one record you can reproduce.

p99 latency

throughput

VRAM

cost

reproduction

runinfra.yaml yaml

1 2 3 4 5

engine: vLLM quantization: awq-int4 kv_cache: fp8 max_num_seqs: 256 speculative: eagle

Optimized runtime config

Every serving setting RunInfra tuned. Read or change it.

engine flags

batch settings

quantization

kernel paths

kv cache

deployment-kit Runnable

Dockerfile

compose.yaml

k8s/

deployment.yaml

serve.py

benchmark.md

Exportable stack

A runnable repo you take with you.

Dockerfile

compose

launch scripts

reports

Live p50 38 ms

POST/v1/chat/completions 200

L4 2 replicas autoscale

Managed endpoint

The same measured stack, hosted by us.

RunInfra Cloud

resource config

inspectable

portable

Why owning your AI stack matters

Data privacy and control

Keep sensitive workloads on infrastructure you choose.

Customization

Tune the model, runtime, and GPU to your workload.

Performance ownership

Real tuning, measured, not assumed.

Portability

Run it on our cloud, or export it to yours.

Supported across the stack

Open models, serving engines, GPUs, and the clouds you deploy to. RunInfra supports every layer of the inference stack.

Models

Models: Llama 3.3, Whisper, Qwen-Image, NV-Embed, Parler-TTS, Qwen2.5, Cosmos, Pixtral, EmbeddingGemma, RoBERTa, DeepSeek-V3, Sana, Parakeet, Mistral, Wan 2.1, GTE, Qwen2-VL, MMS-TTS, Qwen3 Reranker, Gemma 2, MusicGen, DeepSeek-VL2, Nemotron, FastPitch, PaliGemma, NV-RerankQA, Qwen2-Audio, Hermes 3, Canary, BERT, Llama 3.2 Vision, Qwen3 Embedding

!Image 28LLM Llama 3.3!Image 29ASR Whisper!Image 30Image Qwen-Image!Image 31Embed NV-Embed!Image 32TTS Parler-TTS!Image 33LLM Qwen2.5!Image 34Video Cosmos!Image 35Vision Pixtral!Image 36Embed EmbeddingGemma!Image 37Classify RoBERTa!Image 38LLM DeepSeek-V3!Image 39Image Sana!Image 40ASR Parakeet!Image 41LLM Mistral!Image 42Video Wan 2.1!Image 43Embed GTE!Image 44Vision Qwen2-VL!Image 45TTS MMS-TTS!Image 46Rerank Qwen3 Reranker!Image 47LLM Gemma 2!Image 48Audio MusicGen!Image 49Vision DeepSeek-VL2!Image 50LLM Nemotron!Image 51TTS FastPitch!Image 52Vision PaliGemma!Image 53Rerank NV-RerankQA!Image 54Audio Qwen2-Audio LLM Hermes 3!Image 55ASR Canary!Image 56Classify BERT!Image 57Vision Llama 3.2 Vision!Image 58Embed Qwen3 Embedding!Image 59LLM Llama 3.3!Image 60ASR Whisper!Image 61Image Qwen-Image!Image 62Embed NV-Embed!Image 63TTS Parler-TTS!Image 64LLM Qwen2.5!Image 65Video Cosmos!Image 66Vision Pixtral!Image 67Embed EmbeddingGemma!Image 68Classify RoBERTa!Image 69LLM DeepSeek-V3!Image 70Image Sana!Image 71ASR Parakeet!Image 72LLM Mistral!Image 73Video Wan 2.1!Image 74Embed GTE!Image 75Vision Qwen2-VL!Image 76TTS MMS-TTS!Image 77Rerank Qwen3 Reranker!Image 78LLM Gemma 2!Image 79Audio MusicGen!Image 80Vision DeepSeek-VL2!Image 81LLM Nemotron!Image 82TTS FastPitch!Image 83Vision PaliGemma!Image 84Rerank NV-RerankQA!Image 85Audio Qwen2-Audio LLM Hermes 3!Image 86ASR Canary!Image 87Classify BERT!Image 88Vision Llama 3.2 Vision!Image 89Embed Qwen3 Embedding

Engines

Engines: vLLM, SGLang, TensorRT-LLM, vLLM Omni, TEI, Transformers

!Image 90Engine vLLM!Image 91Engine SGLang!Image 92Engine TensorRT-LLM!Image 93Engine vLLM Omni!Image 94Engine TEI!Image 95Engine Transformers!Image 96Engine vLLM!Image 97Engine SGLang!Image 98Engine TensorRT-LLM!Image 99Engine vLLM Omni!Image 100Engine TEI!Image 101Engine Transformers!Image 102Engine vLLM!Image 103Engine SGLang!Image 104Engine TensorRT-LLM!Image 105Engine vLLM Omni!Image 106Engine TEI!Image 107Engine Transformers!Image 108Engine vLLM!Image 109Engine SGLang!Image 110Engine TensorRT-LLM!Image 111Engine vLLM Omni!Image 112Engine TEI!Image 113Engine Transformers!Image 114Engine vLLM!Image 115Engine SGLang!Image 116Engine TensorRT-LLM!Image 117Engine vLLM Omni!Image 118Engine TEI!Image 119Engine Transformers!Image 120Engine vLLM!Image 121Engine SGLang!Image 122Engine TensorRT-LLM!Image 123Engine vLLM Omni!Image 124Engine TEI!Image 125Engine Transformers

GPUs

GPUs: L4, A10, L40S, RTX 4090, A100, H100, H200, B200

!Image 12624 GB L4!Image 12724 GB A10!Image 12848 GB L40S!Image 12924 GB RTX 4090!Image 13080 GB A100!Image 13180 GB H100!Image 132141 GB H200!Image 133192 GB B200!Image 13424 GB L4!Image 13524 GB A10!Image 13648 GB L40S!Image 13724 GB RTX 4090!Image 13880 GB A100!Image 13980 GB H100!Image 140141 GB H200!Image 141192 GB B200!Image 14224 GB L4!Image 14324 GB A10!Image 14448 GB L40S!Image 14524 GB RTX 4090!Image 14680 GB A100!Image 14780 GB H100!Image 148141 GB H200!Image 149192 GB B200!Image 15024 GB L4!Image 15124 GB A10!Image 15248 GB L40S!Image 15324 GB RTX 4090!Image 15480 GB A100!Image 15580 GB H100!Image 156141 GB H200!Image 157192 GB B200

Clouds

Clouds: RunInfra Cloud, Modal, RunPod, Vast.ai

B200 RunInfra Cloud!Image 158H100 Modal!Image 159A100 RunPod!Image 160RTX 4090 Vast.ai B200 RunInfra Cloud!Image 161H100 Modal!Image 162A100 RunPod!Image 163RTX 4090 Vast.ai B200 RunInfra Cloud!Image 164H100 Modal!Image 165A100 RunPod!Image 166RTX 4090 Vast.ai B200 RunInfra Cloud!Image 167H100 Modal!Image 168A100 RunPod!Image 169RTX 4090 Vast.ai B200 RunInfra Cloud!Image 170H100 Modal!Image 171A100 RunPod!Image 172RTX 4090 Vast.ai B200 RunInfra Cloud!Image 173H100 Modal!Image 174A100 RunPod!Image 175RTX 4090 Vast.ai B200 RunInfra Cloud!Image 176H100 Modal!Image 177A100 RunPod!Image 178RTX 4090 Vast.ai B200 RunInfra Cloud!Image 179H100 Modal!Image 180A100 RunPod!Image 181RTX 4090 Vast.ai

Common questions

Can't find what you're looking for?Get in touch

What is RunInfra?How do I build my first pipeline?Which AI models are supported?How does GPU kernel optimization work?Can I deploy pipelines as APIs?How is this different from using closed-source APIs?Is my data secure?

What is RunInfra?

Describe what you want to run.RunInfra picks compatible open models,benchmarks GPUs,tunes the runtime,and gives you a deploy-ready stack.

!Image 182: RunInfra

Deploy your first optimized model,measured before you ship

Describe the goal. RunInfra builds and optimizes the stack.

Start Building View Pricing

End-to-end encryption

Isolated GPU infrastructure

Zero data retention

SOC 2 Type II

![Image 183RunInfra](https://runinfra.ai/)by RightNow

![Image 184](https://github.com/RightNow-AI)![Image 185](https://x.com/runinfrai)![Image 186](https://www.linkedin.com/showcase/runinfra/)

All systems operational

Pipeline Builder Pricing Docs Research News Contact

Backed by

!Image 187: YCombinator

AICPA Type II ![Image 188SOC 2](https://trust.rightnowai.co/)

!Image 189: NVIDIA Inception Program!Image 190: NVIDIA Inception Program

Ask AI about RunInfra

![Image 191](https://chatgpt.com/?q=What%20is%20RunInfra%20(runinfra.ai)%2C%20the%20chat-native%20AI%20model%20optimization%20and%20infrastructure%20platform%2C%20and%20what%20do%20I%20get%20as%20a%20customer%3F)![Image 192](https://claude.ai/new?q=What%20is%20RunInfra%20(runinfra.ai)%2C%20the%20chat-native%20AI%20model%20optimization%20and%20infrastructure%20platform%2C%20and%20what%20do%20I%20get%20as%20a%20customer%3F)![Image 193](https://www.google.com/search?udm=50&source=searchlabs&q=What%20is%20RunInfra%20(runinfra.ai)%2C%20the%20chat-native%20AI%20model%20optimization%20and%20infrastructure%20platform%2C%20and%20what%20do%20I%20get%20as%20a%20customer%3F)![Image 194](https://grok.com/?q=What%20is%20RunInfra%20(runinfra.ai)%2C%20the%20chat-native%20AI%20model%20optimization%20and%20infrastructure%20platform%2C%20and%20what%20do%20I%20get%20as%20a%20customer%3F)![Image 195](https://www.perplexity.ai/search?q=What%20is%20RunInfra%20(runinfra.ai)%2C%20the%20chat-native%20AI%20model%20optimization%20and%20infrastructure%20platform%2C%20and%20what%20do%20I%20get%20as%20a%20customer%3F)

![Image 196!Image 197Part of RightNow](https://www.rightnowai.co/)Security DPA AUP Cookies Terms Privacy

!Image 198!Image 199!Image 200