Optimize open models for production - RunInfra
RunInfraby RightNow
- Use cases
- Pricing
- Research
- Resources
- Contact
Backed by!Image 6: Y CombinatorCombinator
Optimize open model production Optimize any open model for production any for
Paste a model. RunInfra benchmarks the options and picks the winner. Deploy it, or own the stack.
Describe the inference workload you want to deploy...
Models!Image 7Auto engine Auto GPU
Example workloads
Compare engines Find the best serving engine for !Image 8Qwen 2.5 7B Tune latency Optimize !Image 9Qwen 2.5 7B for low latency Ship speech Deploy !Image 10Whisper Large V3 Turbo with p95 and cost checks Scale retrieval Build !Image 11BGE-M3 embeddings with batch throughput metrics
Every optimization ends with a result you can inspect and run.
You get a benchmark receipt and a runnable deployment kit. Nothing hidden.
Serving engine
Compared, not assumed
GPU target
Sized to the model
p95 latency
Benchmarked
Throughput
Measured per GPU
VRAM
Checked for fit
Cost
Tracked per run
GPU kernels
Tuned where supported
Deployment kit
Run it or export it
RUNINFRA
From prompt to a production stack you own
RunInfra compares, tunes, and benchmarks the stack. Deploy or export it.
Describe a Llama 3.1 70B Qwen 2.5 7B DeepSeek V3 Mistral 7B Phi-4 Gemma 2 9B Mixtral 8x7B!Image 13Whisper Large V3 Llama 3.1 70B workload in plain English.
RunInfra compares!Image 14vLLM!Image 15SGLang!Image 16TensorRT-LLM!Image 17vLLM-Omni!Image 18SGLang and every other engine your model can run on.
It tunes speculative decoding kernel generation server tuning quantization KV cache reuse FlashAttention v2 continuous batching server tuning where it helps, with no config to hand-write.
Deploy on NVIDIA H100 NVIDIA H200 NVIDIA B200 NVIDIA A100 NVIDIA L40S NVIDIA L4 NVIDIA A100 and pay per million tokens, or export the stack and self-host.
New Session
Optimize !Image 20Llama-3.1-8B-Instruct on !Image 21vLLM for cheapest GPU with latency and VRAM checks
Thought for 1s
Capturing cost-first intent for Llama 3.1 8B on vLLM.
Intake updated Model, engine, goal
Model Llama 3.1 8B
Engine vLLM
Goal Low cost, latency checked
Requirements collected 575ms
Model, engine, GPU target, latency goal
Plan drafted 3.1s
10 execution phases prepared
Plan ready review / 10 phases / ~23m
Optimize Llama 3.1 8B on vLLM for cheapest GPU with latency checks
Recommended path: vLLM on L4. Review the plan before execution.
Open Plan Accept and run
Working...
Runbook generated from the workload
Llama 3.1 8B on vLLM, lowest viable cost
draft
Latency target
p95 under 60ms
VRAM budget
24 GB
Est. runtime
~23 min
Execution plan 10 phases, 3 validated
01
AWQ int4 quantization ready
Weight-only int4, calibrated offline
02
FlashAttention v2 ready
Fused attention kernels
03
Continuous batching queued
In-flight request scheduling
04
Paged KV cache queued
fp8 cache in paged blocks
05
CUDA graph capture queued
Replay the decode-step graph
06
Speculative decoding queued
Draft model proposes tokens
07
Prefix caching queued
Reuse shared prompt prefixes
08
Tensor-parallel sizing ready
Single GPU, no sharding
09
Warmup and autotune queued
Lock kernel shapes pre-serve
10
Serving-config tune queued
Batch size and concurrency
Review the plan, then run.vLLM, L4 to A100
Optimization run
running 5/6 phases
Benchmarking the tuned config against cheaper GPU candidates.
Candidate set built 0.6s
Serving config tuned 2.9s
AWQ int4 quantization applied 41.7s
FlashAttention v2 kernels compiled 58.3s
Candidates benchmarked 94.2s
Confirming winner live
Baseline vs optimized best candidate so far
Metric Baseline Optimized Delta
P95 latency 184ms 38ms-79%
Time to first token 120ms 22ms-82%
Throughput 45 tok/s 142 tok/s+216%
VRAM 28.4 GB 12.1 GB-57%
Cost / 1M tokens$0.42$0.12-71%
GPU candidates winner marked
GPU candidate Cost / 1M p95 Pick
NVIDIA L4 misses latency
$0.08 84ms-
NVIDIA L40S
$0.12 38ms
NVIDIA A100 overspec
$0.21 31ms-
Deploy or export the stack
Pick a target for the optimized L40S build.
ready
Target Supported GPUs
Managed by RunInfra
selected
Fully managed endpoint, billed per million tokens
H100 L40S A100 L4
Your RunPod
Deploy to your own RunPod account
H100 A100 RTX 4090 L40S
Modal
Serverless deploy on RunInfra Modal
H200 H100 A100 L40S
Your Modal
Deploy to your own Modal workspace
H100 A100 L40S T4
Generated stack Dockerfile serve.sh runinfra.yaml
1 FROM runinfra/vllm:0.6.3-l40s
2 ENV MODEL=Llama-3.1-8B-Instruct
3 ENV QUANTIZATION=awq_marlin
4 COPY ./weights /models
5# optimized serving config
6 CMD vllm serve $MODEL \
7--quantization awq_marlin \
8--kv-cache-dtype fp8 \
9--enable-prefix-caching \
10--max-num-seqs 256 \
11--gpu-memory-utilization 0.92
$0.12 / 1M 38ms p95 L40S
Export kit Deploy
Deploy on the hardware you choose
Compare real GPU prices and deploy wherever fits. No lock-in, no rewrites.
Own the winning stack before you ship
Not a black box. You get the measured stack to run, deploy, or export.
Benchmark Verified
p99 latency 64 ms 64%
throughput 3.4k tok/s 2.8x
cost / 1M$0.15 64%
Llama 3.1 8B, vLLM, L4 24GB
Benchmark receipt
Before and after, in one record you can reproduce.
p99 latency
throughput
VRAM
cost
reproduction
runinfra.yaml yaml
1 2 3 4 5
engine: vLLM quantization: awq-int4 kv_cache: fp8 max_num_seqs: 256 speculative: eagle
Optimized runtime config
Every serving setting RunInfra tuned. Read or change it.
engine flags
batch settings
quantization
kernel paths
kv cache
deployment-kit Runnable
Dockerfile
compose.yaml
k8s/
deployment.yaml
serve.py
benchmark.md
Exportable stack
A runnable repo you take with you.
Dockerfile
compose
launch scripts
reports
Live p50 38 ms
POST/v1/chat/completions 200
L4 2 replicas autoscale
Managed endpoint
The same measured stack, hosted by us.
RunInfra Cloud
resource config
inspectable
portable
Why owning your AI stack matters
Data privacy and control
Keep sensitive workloads on infrastructure you choose.
Customization
Tune the model, runtime, and GPU to your workload.
Performance ownership
Real tuning, measured, not assumed.
Portability
Run it on our cloud, or export it to yours.
Supported across the stack
Open models, serving engines, GPUs, and the clouds you deploy to. RunInfra supports every layer of the inference stack.
Models
Models: Llama 3.3, Whisper, Qwen-Image, NV-Embed, Parler-TTS, Qwen2.5, Cosmos, Pixtral, EmbeddingGemma, RoBERTa, DeepSeek-V3, Sana, Parakeet, Mistral, Wan 2.1, GTE, Qwen2-VL, MMS-TTS, Qwen3 Reranker, Gemma 2, MusicGen, DeepSeek-VL2, Nemotron, FastPitch, PaliGemma, NV-RerankQA, Qwen2-Audio, Hermes 3, Canary, BERT, Llama 3.2 Vision, Qwen3 Embedding
!Image 28LLM Llama 3.3!Image 29ASR Whisper!Image 30Image Qwen-Image!Image 31Embed NV-Embed!Image 32TTS Parler-TTS!Image 33LLM Qwen2.5!Image 34Video Cosmos!Image 35Vision Pixtral!Image 36Embed EmbeddingGemma!Image 37Classify RoBERTa!Image 38LLM DeepSeek-V3!Image 39Image Sana!Image 40ASR Parakeet!Image 41LLM Mistral!Image 42Video Wan 2.1!Image 43Embed GTE!Image 44Vision Qwen2-VL!Image 45TTS MMS-TTS!Image 46Rerank Qwen3 Reranker!Image 47LLM Gemma 2!Image 48Audio MusicGen!Image 49Vision DeepSeek-VL2!Image 50LLM Nemotron!Image 51TTS FastPitch!Image 52Vision PaliGemma!Image 53Rerank NV-RerankQA!Image 54Audio Qwen2-Audio LLM Hermes 3!Image 55ASR Canary!Image 56Classify BERT!Image 57Vision Llama 3.2 Vision!Image 58Embed Qwen3 Embedding!Image 59LLM Llama 3.3!Image 60ASR Whisper!Image 61Image Qwen-Image!Image 62Embed NV-Embed!Image 63TTS Parler-TTS!Image 64LLM Qwen2.5!Image 65Video Cosmos!Image 66Vision Pixtral!Image 67Embed EmbeddingGemma!Image 68Classify RoBERTa!Image 69LLM DeepSeek-V3!Image 70Image Sana!Image 71ASR Parakeet!Image 72LLM Mistral!Image 73Video Wan 2.1!Image 74Embed GTE!Image 75Vision Qwen2-VL!Image 76TTS MMS-TTS!Image 77Rerank Qwen3 Reranker!Image 78LLM Gemma 2!Image 79Audio MusicGen!Image 80Vision DeepSeek-VL2!Image 81LLM Nemotron!Image 82TTS FastPitch!Image 83Vision PaliGemma!Image 84Rerank NV-RerankQA!Image 85Audio Qwen2-Audio LLM Hermes 3!Image 86ASR Canary!Image 87Classify BERT!Image 88Vision Llama 3.2 Vision!Image 89Embed Qwen3 Embedding
Engines
Engines: vLLM, SGLang, TensorRT-LLM, vLLM Omni, TEI, Transformers
!Image 90Engine vLLM!Image 91Engine SGLang!Image 92Engine TensorRT-LLM!Image 93Engine vLLM Omni!Image 94Engine TEI!Image 95Engine Transformers!Image 96Engine vLLM!Image 97Engine SGLang!Image 98Engine TensorRT-LLM!Image 99Engine vLLM Omni!Image 100Engine TEI!Image 101Engine Transformers!Image 102Engine vLLM!Image 103Engine SGLang!Image 104Engine TensorRT-LLM!Image 105Engine vLLM Omni!Image 106Engine TEI!Image 107Engine Transformers!Image 108Engine vLLM!Image 109Engine SGLang!Image 110Engine TensorRT-LLM!Image 111Engine vLLM Omni!Image 112Engine TEI!Image 113Engine Transformers!Image 114Engine vLLM!Image 115Engine SGLang!Image 116Engine TensorRT-LLM!Image 117Engine vLLM Omni!Image 118Engine TEI!Image 119Engine Transformers!Image 120Engine vLLM!Image 121Engine SGLang!Image 122Engine TensorRT-LLM!Image 123Engine vLLM Omni!Image 124Engine TEI!Image 125Engine Transformers
GPUs
GPUs: L4, A10, L40S, RTX 4090, A100, H100, H200, B200
!Image 12624 GB L4!Image 12724 GB A10!Image 12848 GB L40S!Image 12924 GB RTX 4090!Image 13080 GB A100!Image 13180 GB H100!Image 132141 GB H200!Image 133192 GB B200!Image 13424 GB L4!Image 13524 GB A10!Image 13648 GB L40S!Image 13724 GB RTX 4090!Image 13880 GB A100!Image 13980 GB H100!Image 140141 GB H200!Image 141192 GB B200!Image 14224 GB L4!Image 14324 GB A10!Image 14448 GB L40S!Image 14524 GB RTX 4090!Image 14680 GB A100!Image 14780 GB H100!Image 148141 GB H200!Image 149192 GB B200!Image 15024 GB L4!Image 15124 GB A10!Image 15248 GB L40S!Image 15324 GB RTX 4090!Image 15480 GB A100!Image 15580 GB H100!Image 156141 GB H200!Image 157192 GB B200
Clouds
Clouds: RunInfra Cloud, Modal, RunPod, Vast.ai
B200 RunInfra Cloud!Image 158H100 Modal!Image 159A100 RunPod!Image 160RTX 4090 Vast.ai B200 RunInfra Cloud!Image 161H100 Modal!Image 162A100 RunPod!Image 163RTX 4090 Vast.ai B200 RunInfra Cloud!Image 164H100 Modal!Image 165A100 RunPod!Image 166RTX 4090 Vast.ai B200 RunInfra Cloud!Image 167H100 Modal!Image 168A100 RunPod!Image 169RTX 4090 Vast.ai B200 RunInfra Cloud!Image 170H100 Modal!Image 171A100 RunPod!Image 172RTX 4090 Vast.ai B200 RunInfra Cloud!Image 173H100 Modal!Image 174A100 RunPod!Image 175RTX 4090 Vast.ai B200 RunInfra Cloud!Image 176H100 Modal!Image 177A100 RunPod!Image 178RTX 4090 Vast.ai B200 RunInfra Cloud!Image 179H100 Modal!Image 180A100 RunPod!Image 181RTX 4090 Vast.ai
Common questions
Can't find what you're looking for?Get in touch
What is RunInfra?How do I build my first pipeline?Which AI models are supported?How does GPU kernel optimization work?Can I deploy pipelines as APIs?How is this different from using closed-source APIs?Is my data secure?
What is RunInfra?
Describe what you want to run.RunInfra picks compatible open models,benchmarks GPUs,tunes the runtime,and gives you a deploy-ready stack.
Deploy your first optimized model,measured before you ship
Describe the goal. RunInfra builds and optimizes the stack.
End-to-end encryption
Isolated GPU infrastructure
Zero data retention
SOC 2 Type II
by RightNow
© 2026 RunInfra. All rights reserved.

Pipeline BuilderPricingDocsResearchNewsContact
Backed by
!Image 187: YCombinator
AICPA Type II 
!Image 189: NVIDIA Inception Program!Image 190: NVIDIA Inception Program
Ask AI about RunInfra
%2C%20the%20chat-native%20AI%20model%20optimization%20and%20infrastructure%20platform%2C%20and%20what%20do%20I%20get%20as%20a%20customer%3F)%2C%20the%20chat-native%20AI%20model%20optimization%20and%20infrastructure%20platform%2C%20and%20what%20do%20I%20get%20as%20a%20customer%3F)%2C%20the%20chat-native%20AI%20model%20optimization%20and%20infrastructure%20platform%2C%20and%20what%20do%20I%20get%20as%20a%20customer%3F)%2C%20the%20chat-native%20AI%20model%20optimization%20and%20infrastructure%20platform%2C%20and%20what%20do%20I%20get%20as%20a%20customer%3F)%2C%20the%20chat-native%20AI%20model%20optimization%20and%20infrastructure%20platform%2C%20and%20what%20do%20I%20get%20as%20a%20customer%3F)
SecurityDPAAUPCookiesTermsPrivacy