Optimize open models for production - RunInfra
RunInfraby RightNow
- Use cases
- Pricing
- Research
- Resources
- Contact
Backed by!Image 1: Y CombinatorCombinator
Optimize open model production Optimize any open model for production any for
Paste a model. RunInfra benchmarks the options and picks the winner. Deploy it, or own the stack.
Describe the inference workload you want to deploy...
Auto engine Auto GPU
Example workloads
Compare engines Find the best serving engine for !Image 2Qwen 2.5 7B Tune latency Optimize !Image 3Qwen 2.5 7B for low latency Ship speech Deploy !Image 4Whisper Large V3 Turbo with p95 and cost checks Scale retrieval Build !Image 5BGE-M3 embeddings with batch throughput metrics
Every optimization ends with a result you can inspect and run.
You get a benchmark receipt and a runnable deployment kit. Nothing hidden.
Serving engine
Compared, not assumed
GPU target
Sized to the model
p95 latency
Benchmarked
Throughput
Measured per GPU
VRAM
Checked for fit
Cost
Tracked per run
GPU kernels
Tuned where supported
Deployment kit
Run it or export it
RUNINFRA
From prompt to a production stack you own
RunInfra compares, tunes, and benchmarks the stack. Deploy or export it.
Describe a Llama 3.1 70B Qwen 2.5 7B DeepSeek V3 Mistral 7B Phi-4 Gemma 2 9B Mixtral 8x7B!Image 7Whisper Large V3 Llama 3.1 70B workload in plain English.
RunInfra compares!Image 8vLLM!Image 9SGLang!Image 10TensorRT-LLM!Image 11vLLM-Omni!Image 12SGLang and every other engine your model can run on.
It tunes speculative decoding kernel generation server tuning quantization KV cache reuse FlashAttention v2 continuous batching server tuning where it helps, with no config to hand-write.
Deploy on NVIDIA H100 NVIDIA H200 NVIDIA B200 NVIDIA A100 NVIDIA L40S NVIDIA L4 NVIDIA A100 and pay per million tokens, or export the stack and self-host.
New Session
Optimize !Image 14Llama-3.1-8B-Instruct on !Image 15vLLM for cheapest GPU with latency and VRAM checks
Thought for 1s
Capturing cost-first intent for Llama 3.1 8B on vLLM.
Intake updated Model, engine, goal
Model Llama 3.1 8B
Engine vLLM
Goal Low cost, latency checked
Requirements collected 575ms
Model, engine, GPU target, latency goal
Plan drafted 3.1s
10 execution phases prepared
Plan ready review / 10 phases / ~23m
Optimize Llama 3.1 8B on vLLM for cheapest GPU with latency checks
Recommended path: vLLM on L4. Review the plan before execution.
Open Plan Accept and run
Working...
Runbook generated from the workload
Llama 3.1 8B on vLLM, lowest viable cost
draft
Latency target
p95 under 60ms
VRAM budget
24 GB
Est. runtime
~23 min
Execution plan 10 phases, 3 validated
01
AWQ int4 quantization ready
Weight-only int4, calibrated offline
02
FlashAttention v2 ready
Fused attention kernels
03
Continuous batching queued
In-flight request scheduling
04
Paged KV cache queued
fp8 cache in paged blocks
05
CUDA graph capture queued
Replay the decode-step graph
06
Speculative decoding queued
Draft model proposes tokens
07
Prefix caching queued
Reuse shared prompt prefixes
08
Tensor-parallel sizing ready
Single GPU, no sharding
09
Warmup and autotune queued
Lock kernel shapes pre-serve
10
Serving-config tune queued
Batch size and concurrency
Review the plan, then run.vLLM, L4 to A100
Optimization run
running 5/6 phases
Benchmarking the tuned config against cheaper GPU candidates.
Candidate set built 0.6s
Serving config tuned 2.9s
AWQ int4 quantization applied 41.7s
FlashAttention v2 kernels compiled 58.3s
Candidates benchmarked 94.2s
Confirming winner live
Baseline vs optimized best candidate so far
Metric Baseline Optimized Delta
P95 latency 184ms 38ms-79%
Time to first token 120ms 22ms-82%
Throughput 45 tok/s 142 tok/s+216%
VRAM 28.4 GB 12.1 GB-57%
Cost / 1M tokens$0.42$0.12-71%
GPU candidates winner marked
GPU candidate Cost / 1M p95 Pick
NVIDIA L4 misses latency
$0.08 84ms-
NVIDIA L40S
$0.12 38ms
NVIDIA A100 overspec
$0.21 31ms-
Deploy or export the stack
Pick a target for the optimized L40S build.
ready
Target Supported GPUs
Managed by RunInfra
selected
Fully managed endpoint, billed per million tokens
H100 L40S A100 L4
Your RunPod
Deploy to your own RunPod account
H100 A100 RTX 4090 L40S
Modal
Serverless deploy on RunInfra Modal
H200 H100 A100 L40S
Your Modal
Deploy to your own Modal workspace
H100 A100 L40S T4
Generated stack Dockerfile serve.sh runinfra.yaml
1 FROM runinfra/vllm:0.6.3-l40s
2 ENV MODEL=Llama-3.1-8B-Instruct
3 ENV QUANTIZATION=awq_marlin
4 COPY ./weights /models
5# optimized serving config
6 CMD vllm serve $MODEL \
7--quantization awq_marlin \
8--kv-cache-dtype fp8 \
9--enable-prefix-caching \
10--max-num-seqs 256 \
11--gpu-memory-utilization 0.92
$0.12 / 1M 38ms p95 L40S
Export kit Deploy
Deploy on the hardware you choose
Compare real GPU prices and deploy wherever fits. No lock-in, no rewrites.
Own the winning stack before you ship
Not a black box. You get the measured stack to run, deploy, or export.
Benchmark Verified
p99 latency 64 ms 64%
throughput 3.4k tok/s 2.8x
cost / 1M$0.15 64%
Llama 3.1 8B, vLLM, L4 24GB
Benchmark receipt
Before and after, in one record you can reproduce.
p99 latency
throughput
VRAM
cost
reproduction
runinfra.yaml yaml
1 2 3 4 5
engine: vLLM quantization: awq-int4 kv_cache: fp8 max_num_seqs: 256 speculative: eagle
Optimized runtime config
Every serving setting RunInfra tuned. Read or change it.
engine flags
batch settings
quantization
kernel paths
kv cache
deployment-kit Runnable
Dockerfile
compose.yaml
k8s/
deployment.yaml
serve.py
benchmark.md
Exportable stack
A runnable repo you take with you.
Dockerfile
compose
launch scripts
reports
Live p50 38 ms
POST/v1/chat/completions 200
L4 2 replicas autoscale
Managed endpoint
The same measured stack, hosted by us.
RunInfra Cloud
resource config
inspectable
portable
Why owning your AI stack matters
Data privacy and control
Keep sensitive workloads on infrastructure you choose.
Customization
Tune the model, runtime, and GPU to your workload.
Performance ownership
Real tuning, measured, not assumed.
Portability
Run it on our cloud, or export it to yours.
Supported across the stack
Open models, serving engines, GPUs, and the clouds you deploy to. RunInfra supports every layer of the inference stack.
Models
Models: Llama 3.3, Whisper, Qwen-Image, NV-Embed, Parler-TTS, Qwen2.5, Cosmos, Pixtral, EmbeddingGemma, RoBERTa, DeepSeek-V3, Sana, Parakeet, Mistral, Wan 2.1, GTE, Qwen2-VL, MMS-TTS, Qwen3 Reranker, Gemma 2, MusicGen, DeepSeek-VL2, Nemotron, FastPitch, PaliGemma, NV-RerankQA, Qwen2-Audio, Hermes 3, Canary, BERT, Llama 3.2 Vision, Qwen3 Embedding
!Image 22LLM Llama 3.3!Image 23ASR Whisper!Image 24Image Qwen-Image!Image 25Embed NV-Embed!Image 26TTS Parler-TTS!Image 27LLM Qwen2.5!Image 28Video Cosmos!Image 29Vision Pixtral!Image 30Embed EmbeddingGemma!Image 31Classify RoBERTa!Image 32LLM DeepSeek-V3!Image 33Image Sana!Image 34ASR Parakeet!Image 35LLM Mistral!Image 36Video Wan 2.1!Image 37Embed GTE!Image 38Vision Qwen2-VL!Image 39TTS MMS-TTS!Image 40Rerank Qwen3 Reranker!Image 41LLM Gemma 2!Image 42Audio MusicGen!Image 43Vision DeepSeek-VL2!Image 44LLM Nemotron!Image 45TTS FastPitch!Image 46Vision PaliGemma!Image 47Rerank NV-RerankQA!Image 48Audio Qwen2-Audio LLM Hermes 3!Image 49ASR Canary!Image 50Classify BERT!Image 51Vision Llama 3.2 Vision!Image 52Embed Qwen3 Embedding!Image 53LLM Llama 3.3!Image 54ASR Whisper!Image 55Image Qwen-Image!Image 56Embed NV-Embed!Image 57TTS Parler-TTS!Image 58LLM Qwen2.5!Image 59Video Cosmos!Image 60Vision Pixtral!Image 61Embed EmbeddingGemma!Image 62Classify RoBERTa!Image 63LLM DeepSeek-V3!Image 64Image Sana!Image 65ASR Parakeet!Image 66LLM Mistral!Image 67Video Wan 2.1!Image 68Embed GTE!Image 69Vision Qwen2-VL!Image 70TTS MMS-TTS!Image 71Rerank Qwen3 Reranker!Image 72LLM Gemma 2!Image 73Audio MusicGen!Image 74Vision DeepSeek-VL2!Image 75LLM Nemotron!Image 76TTS FastPitch!Image 77Vision PaliGemma!Image 78Rerank NV-RerankQA!Image 79Audio Qwen2-Audio LLM Hermes 3!Image 80ASR Canary!Image 81Classify BERT!Image 82Vision Llama 3.2 Vision!Image 83Embed Qwen3 Embedding
Engines
Engines: vLLM, SGLang, TensorRT-LLM, vLLM Omni, TEI, Transformers
!Image 84Engine vLLM!Image 85Engine SGLang!Image 86Engine TensorRT-LLM!Image 87Engine vLLM Omni!Image 88Engine TEI!Image 89Engine Transformers!Image 90Engine vLLM!Image 91Engine SGLang!Image 92Engine TensorRT-LLM!Image 93Engine vLLM Omni!Image 94Engine TEI!Image 95Engine Transformers!Image 96Engine vLLM!Image 97Engine SGLang!Image 98Engine TensorRT-LLM!Image 99Engine vLLM Omni!Image 100Engine TEI!Image 101Engine Transformers!Image 102Engine vLLM!Image 103Engine SGLang!Image 104Engine TensorRT-LLM!Image 105Engine vLLM Omni!Image 106Engine TEI!Image 107Engine Transformers!Image 108Engine vLLM!Image 109Engine SGLang!Image 110Engine TensorRT-LLM!Image 111Engine vLLM Omni!Image 112Engine TEI!Image 113Engine Transformers!Image 114Engine vLLM!Image 115Engine SGLang!Image 116Engine TensorRT-LLM!Image 117Engine vLLM Omni!Image 118Engine TEI!Image 119Engine Transformers
GPUs
GPUs: L4, A10, L40S, RTX 4090, A100, H100, H200, B200
!Image 12024 GB L4!Image 12124 GB A10!Image 12248 GB L40S!Image 12324 GB RTX 4090!Image 12480 GB A100!Image 12580 GB H100!Image 126141 GB H200!Image 127192 GB B200!Image 12824 GB L4!Image 12924 GB A10!Image 13048 GB L40S!Image 13124 GB RTX 4090!Image 13280 GB A100!Image 13380 GB H100!Image 134141 GB H200!Image 135192 GB B200!Image 13624 GB L4!Image 13724 GB A10!Image 13848 GB L40S!Image 13924 GB RTX 4090!Image 14080 GB A100!Image 14180 GB H100!Image 142141 GB H200!Image 143192 GB B200!Image 14424 GB L4!Image 14524 GB A10!Image 14648 GB L40S!Image 14724 GB RTX 4090!Image 14880 GB A100!Image 14980 GB H100!Image 150141 GB H200!Image 151192 GB B200
Clouds
Clouds: RunInfra Cloud, Modal, RunPod, Vast.ai
B200 RunInfra Cloud!Image 152H100 Modal!Image 153A100 RunPod!Image 154RTX 4090 Vast.ai B200 RunInfra Cloud!Image 155H100 Modal!Image 156A100 RunPod!Image 157RTX 4090 Vast.ai B200 RunInfra Cloud!Image 158H100 Modal!Image 159A100 RunPod!Image 160RTX 4090 Vast.ai B200 RunInfra Cloud!Image 161H100 Modal!Image 162A100 RunPod!Image 163RTX 4090 Vast.ai B200 RunInfra Cloud!Image 164H100 Modal!Image 165A100 RunPod!Image 166RTX 4090 Vast.ai B200 RunInfra Cloud!Image 167H100 Modal!Image 168A100 RunPod!Image 169RTX 4090 Vast.ai B200 RunInfra Cloud!Image 170H100 Modal!Image 171A100 RunPod!Image 172RTX 4090 Vast.ai B200 RunInfra Cloud!Image 173H100 Modal!Image 174A100 RunPod!Image 175RTX 4090 Vast.ai
Common questions
Can't find what you're looking for?Get in touch
What is RunInfra?How do I build my first pipeline?Which AI models are supported?How does GPU kernel optimization work?Can I deploy pipelines as APIs?How is this different from using closed-source APIs?Is my data secure?
What is RunInfra?
Describe what you want to run.RunInfra picks compatible open models,benchmarks GPUs,tunes the runtime,and gives you a deploy-ready stack.
Deploy your first optimized model,measured before you ship
Describe the goal. RunInfra builds and optimizes the stack.
End-to-end encryption
Isolated GPU infrastructure
Zero data retention
SOC 2 Type II
by RightNow
© 2026 RunInfra. All rights reserved.

Pipeline BuilderPricingDocsResearchNewsContact
Backed by
!Image 181: YCombinator
AICPA Type II 
!Image 183: NVIDIA Inception Program!Image 184: NVIDIA Inception Program
Ask AI about RunInfra
%2C%20the%20chat-native%20AI%20model%20optimization%20and%20infrastructure%20platform%2C%20and%20what%20do%20I%20get%20as%20a%20customer%3F)%2C%20the%20chat-native%20AI%20model%20optimization%20and%20infrastructure%20platform%2C%20and%20what%20do%20I%20get%20as%20a%20customer%3F)%2C%20the%20chat-native%20AI%20model%20optimization%20and%20infrastructure%20platform%2C%20and%20what%20do%20I%20get%20as%20a%20customer%3F)%2C%20the%20chat-native%20AI%20model%20optimization%20and%20infrastructure%20platform%2C%20and%20what%20do%20I%20get%20as%20a%20customer%3F)%2C%20the%20chat-native%20AI%20model%20optimization%20and%20infrastructure%20platform%2C%20and%20what%20do%20I%20get%20as%20a%20customer%3F)
SecurityDPAAUPCookiesTermsPrivacy