Inferact

Inferact powers AI inference at scale with the open-source vLLM engine.
Seed · $150M total · Founded 2025 · San Francisco, California · 15 employees
Inferact commercializes vLLM, the world's most popular open-source LLM inference engine, delivering 2-4× throughput improvements over competing systems. Built by the original vLLM creators from UC Berkeley, Inferact serves enterprises and AI companies seeking to deploy frontier models at scale without proprietary lock-in. The company owns the roadmap of a widely adopted community project with 2,000+ contributors while maintaining a lean team of roughly 15 people.
Problem solved
Enterprise teams lack efficient, scalable inference engines that work across diverse hardware platforms and model architectures without massive infrastructure overhead or vendor lock-in.
Target customer
Enterprise AI teams, cloud platforms, and AI product companies deploying large language models in production environments requiring cost-efficient, hardware-agnostic inference infrastructure.
Founders
Simon Mo
CEO & Co-Founder
Former Anyscale engineer; UC Berkeley-affiliated and an early organizer of the vLLM community.
Woosuk Kwon
CTO & Co-Founder
Ph.D. in Computer Science from UC Berkeley under Ion Stoica; co-creator and co-lead of the vLLM open-source project.
Kaichao You
Co-Founder
UC Berkeley-affiliated engineer and vLLM core contributor.
Roger Wang
Co-Founder
UC Berkeley-affiliated engineer and vLLM core contributor.
Joseph Gonzalez
Co-Founder
UC Berkeley computer science faculty and researcher.
Funding history
Seed · $150M · January 26, 2026 · Led by Andreessen Horowitz and Lightspeed Venture Partners, with participation from Sequoia Capital, Altimeter Capital, Redpoint Ventures, ZhenFund, Databricks Ventures, and the UC Berkeley Chancellor's Fund
Total raised: $150M
Pricing
Not publicly available. Inferact is expected to offer a serverless product on Kubernetes with observability, troubleshooting, and disaster-recovery features.
Notable customers
Amazon (Rufus, 250M customers), LinkedIn, Roblox (4B tokens/week), Meta, Mistral AI, IBM, Stripe (73% inference cost reduction), Spotify
Integrations
Model vendors (day-zero support for new architectures), hardware vendors (GPU/TPU/accelerator integration), and adoption by major inference services.
Tech stack
HSTS (Security) · Apple iCloud Mail (Webmail) · Google Workspace (Email) · Cloudflare (CDN) · Vercel (PaaS)
Competitors
NVIDIA TensorRT-LLM
Proprietary inference framework from a hardware vendor; Inferact owns the roadmap of vLLM, the community standard, which supports a broader range of hardware.
Hugging Face Text Generation Inference (TGI)
Hugging Face's inference engine; Inferact's vLLM has 2-4× throughput advantage and broader model architecture support.
RadixArk (SGLang commercialization)
Commercializes a different open-source inference project; Inferact has achieved a significantly larger valuation ($800M vs. $400M) and a larger seed round.
Why this matters: Inferact achieved one of Silicon Valley's largest seed rounds ($150M at $800M valuation) by commercializing the world's most popular open-source LLM inference engine, signaling a decisive investor shift from model training to inference infrastructure. The team owns the roadmap of a project with 2,000+ contributors while competing against NVIDIA and Hugging Face.
Best for: Enterprise and platform teams deploying LLMs at scale who need hardware-agnostic, cost-efficient inference without vendor lock-in or massive infrastructure teams.
Use cases
Large-scale LLM deployment across heterogeneous hardware
Organizations like Amazon and LinkedIn deploy vLLM to serve LLMs on diverse hardware platforms (NVIDIA, AMD, Google TPUs, Intel Gaudi, AWS Neuron). PagedAttention technology optimizes memory management, enabling 2-4× throughput improvements on the same infrastructure.
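To make the mechanism concrete, here is a minimal, illustrative Python sketch of block-based KV-cache paging in the spirit of PagedAttention; the class and variable names are invented for this example and are not vLLM internals.

```python
# Conceptual sketch of block-based KV-cache paging in the spirit of
# PagedAttention. Names are invented for illustration, not vLLM internals.

BLOCK_SIZE = 16  # tokens stored per KV-cache block


class BlockAllocator:
    """Hands out fixed-size cache blocks from a shared free pool, so a
    sequence grows page-by-page instead of reserving a contiguous
    max-length buffer up front."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache exhausted; a sequence must be preempted")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class Sequence:
    """Maps a sequence's logical token positions to physical cache blocks."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is needed only when the current one fills up.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1


allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):        # generate 40 tokens
    seq.append_token()
print(seq.block_table)     # 3 blocks suffice for 40 tokens (16 + 16 + 8)
```

Because blocks are fixed-size and drawn from one shared pool, fragmentation stays low and many sequences can be batched on the same device, which is where the throughput gains described above come from.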
Cost optimization for inference workloads
Stripe reduced inference costs by 73% using vLLM. Roblox runs 4B tokens weekly with optimized throughput. Inferact enables teams to serve more requests per dollar spent compared to alternative inference frameworks.
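As a rough illustration of the economics, the back-of-envelope calculation below shows how a throughput multiplier translates into cost per token; the GPU price and baseline throughput are placeholder assumptions, not published Inferact or customer figures.

```python
# Back-of-envelope cost-per-token math. Every number here is an
# illustrative placeholder, not a published Inferact or customer figure.

GPU_COST_PER_HOUR = 4.00     # assumed $/hour for one accelerator
BASELINE_TPUT = 1_000        # assumed tokens/second on a baseline engine
VLLM_SPEEDUP = 3.0           # midpoint of the claimed 2-4x range


def cost_per_million_tokens(tokens_per_sec: float) -> float:
    """Dollars spent per million generated tokens at a given throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return GPU_COST_PER_HOUR / tokens_per_hour * 1_000_000


baseline = cost_per_million_tokens(BASELINE_TPUT)
optimized = cost_per_million_tokens(BASELINE_TPUT * VLLM_SPEEDUP)
print(f"baseline:  ${baseline:.2f} per million tokens")    # $1.11
print(f"optimized: ${optimized:.2f} per million tokens")   # $0.37
print(f"savings:   {1 - optimized / baseline:.0%}")        # 67%
```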
Model-agnostic inference platform
Support for 500+ model architectures and 200+ accelerator types allows AI teams to standardize on a single inference layer regardless of which frontier models they deploy or what hardware they own.
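One common way teams standardize on such a layer is vLLM's OpenAI-compatible HTTP server: the application speaks one API while the model and hardware behind the endpoint can change. A brief sketch (the model name is an arbitrary example):

```python
# Talking to a vLLM deployment through its OpenAI-compatible endpoint.
# Assumes a server was started separately, e.g.:
#   vllm serve mistralai/Mistral-7B-Instruct-v0.3
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local vLLM endpoint
    api_key="EMPTY",                      # vLLM ignores the key by default
)

resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",  # whichever model the server loaded
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```

Because the endpoint speaks the OpenAI wire format, swapping the served model, or the accelerator underneath it, requires no application-code changes.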
Alternatives
NVIDIA TensorRT-LLM: Choose TensorRT-LLM if you're already invested in the NVIDIA ecosystem and want GPU-vendor-optimized inference; choose Inferact for hardware flexibility and a community-driven roadmap.
Hugging Face Text Generation Inference: Choose TGI if you prioritize Hugging Face Hub integration; choose Inferact for higher throughput and broader model/hardware support.
vLLM (open-source): Choose open-source vLLM if you have a large infrastructure team for deployment, maintenance, and scaling; choose Inferact for managed commercial support and enterprise features.
FAQ
What does Inferact do?
Inferact commercializes vLLM, an open-source LLM inference engine that optimizes how AI models manage memory and process requests during inference. It enables enterprises to deploy frontier models on any hardware platform with 2-4× better throughput than competing systems, without vendor lock-in.
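For reference, the underlying open-source library exposes a small offline-batch API; here is a minimal sketch using vLLM's documented quickstart interface (the model name is just a small example for local testing):

```python
# Minimal offline-batch example using the open-source vLLM library that
# Inferact commercializes. The model name is a small example for local testing.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")                    # load any supported model
params = SamplingParams(temperature=0.8, max_tokens=32)

outputs = llm.generate(["The key idea behind PagedAttention is"], params)
for out in outputs:
    print(out.outputs[0].text)
```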
How much does Inferact cost?
Pricing is not publicly disclosed. Inferact plans to offer both open-source vLLM and a commercial serverless product with enterprise features like observability, troubleshooting, and disaster recovery. Contact sales for pricing details.
What are alternatives to Inferact?
NVIDIA TensorRT-LLM (GPU-vendor proprietary framework), Hugging Face Text Generation Inference (integrated with Hugging Face Hub), and the open-source vLLM project itself for teams with dedicated infrastructure resources.
Who uses Inferact?
Enterprise and platform AI teams including Amazon (Rufus), LinkedIn, Roblox, Meta, Mistral AI, IBM, Stripe, and Spotify. Target customers are organizations deploying LLMs in production at scale across diverse hardware environments.
How does Inferact compare to NVIDIA TensorRT-LLM?
Inferact's vLLM achieves 2-4× throughput improvements over TensorRT-LLM with comparable latency. Unlike TensorRT-LLM, vLLM supports 500+ model architectures, runs on 200+ accelerator types (not just NVIDIA), and is community-driven rather than vendor-controlled.
Tags
LLM inference · vLLM · AI infrastructure · hardware-agnostic · model serving · GPU optimization · PagedAttention · open-source commercialization