Inferact

Inferact powers AI inference at scale with the open-source vLLM engine.
Seed · $150M total · Founded 2025 · San Francisco, California · 15 employees
Inferact commercializes vLLM, the world's most popular open-source LLM inference engine, delivering 2-4× throughput improvements over competing systems. Built by the original vLLM creators from UC Berkeley, Inferact serves enterprises and AI companies seeking to deploy frontier models at scale without proprietary lock-in. The company owns the roadmap of a widely adopted community project with 2,000+ contributors while maintaining a lean team of roughly 15 people.
Problem solved
Enterprise teams lack efficient, scalable inference engines that work across diverse hardware platforms and model architectures without massive infrastructure overhead or vendor lock-in.
Target customer
Enterprise AI teams, cloud platforms, and AI product companies deploying large language models in production environments requiring cost-efficient, hardware-agnostic inference infrastructure.
Founders
Simon Mo
CEO & Co-Founder
Former Anyscale engineer; UC Berkeley-affiliated and an early organizer of the vLLM community.
Woosuk Kwon
CTO & Co-Founder
Ph.D. in Computer Science from UC Berkeley under Ion Stoica; co-creator and co-lead of the vLLM open-source project.
Kaichao You
Co-Founder
UC Berkeley-affiliated engineer and vLLM core contributor.
Roger Wang
Co-Founder
UC Berkeley-affiliated engineer and vLLM core contributor.
Joseph Gonzalez
Co-Founder
UC Berkeley computer science faculty and researcher.
Funding history
Seed · $150M · January 26, 2026 · Led by Andreessen Horowitz and Lightspeed Venture Partners, with participation from Sequoia Capital, Altimeter Capital, Redpoint Ventures, ZhenFund, Databricks Ventures, and the UC Berkeley Chancellor's Fund
Total raised: $150M
Pricing
Not publicly available. Inferact is expected to offer a serverless product on Kubernetes with observability, troubleshooting, and disaster-recovery features.
Notable customers
Amazon (Rufus, 250M customers), LinkedIn, Roblox (4B tokens/week), Meta, Mistral AI, IBM, Stripe (73% inference cost reduction), Spotify
Integrations
Model vendors (day-zero support for new architectures), hardware vendors (GPU/TPU/accelerator integration), and adoption by major inference services.
Tech stack
HSTS (Security) · Apple iCloud Mail (Webmail) · Google Workspace (Email) · Cloudflare (CDN) · Vercel (PaaS)
Competitors
NVIDIA TensorRT-LLM
Proprietary inference framework from a hardware vendor; Inferact owns the roadmap of vLLM, the community standard, which supports a broader range of hardware.
Hugging Face Text Generation Inference (TGI)
Hugging Face's inference engine; Inferact's vLLM has 2-4× throughput advantage and broader model architecture support.
RadixArk (SGLang commercialization)
Commercializes a different open-source inference project; Inferact has achieved a significantly larger valuation ($800M vs. $400M) and a larger seed round.
Why this matters: Inferact achieved one of Silicon Valley's largest seed rounds ($150M at $800M valuation) by commercializing the world's most popular open-source LLM inference engine, signaling a decisive investor shift from model training to inference infrastructure. The team owns the roadmap of a project with 2,000+ contributors while competing against NVIDIA and Hugging Face.
Best for: Enterprise and platform teams deploying LLMs at scale who need hardware-agnostic, cost-efficient inference without vendor lock-in or massive infrastructure teams.
Use cases
Large-scale LLM deployment across heterogeneous hardware
Organizations like Amazon and LinkedIn deploy vLLM to serve LLMs on diverse hardware platforms (NVIDIA, AMD, Google TPUs, Intel Gaudi, AWS Neuron). PagedAttention technology optimizes memory management, enabling 2-4× throughput improvements on the same infrastructure.
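To make the mechanism concrete, here is a minimal, illustrative Python sketch of block-based KV-cache paging in the spirit of PagedAttention; the class and variable names are invented for this example and are not vLLM internals.

```python
# Conceptual sketch of block-based KV-cache paging in the spirit of
# PagedAttention. Names are invented for illustration, not vLLM internals.

BLOCK_SIZE = 16  # tokens stored per KV-cache block


class BlockAllocator:
    """Hands out fixed-size cache blocks from a shared free pool, so a
    sequence grows page-by-page instead of reserving a contiguous
    max-length buffer up front."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache exhausted; a sequence must be preempted")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class Sequence:
    """Maps a sequence's logical token positions to physical cache blocks."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is needed only when the current one fills up.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1


allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):        # generate 40 tokens
    seq.append_token()
print(seq.block_table)     # 3 blocks suffice for 40 tokens (16 + 16 + 8)
```

Because blocks are fixed-size and drawn from one shared pool, fragmentation stays low and many sequences can be batched on the same device, which is where the throughput gains described above come from.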
Cost optimization for inference workloads
Stripe reduced inference costs by 73% using vLLM. Roblox runs 4B tokens weekly with optimized throughput. Inferact enables teams to serve more requests per dollar spent compared to alternative inference frameworks.
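As a rough illustration of the economics, the back-of-envelope calculation below shows how a throughput multiplier translates into cost per token; the GPU price and baseline throughput are placeholder assumptions, not published Inferact or customer figures.

```python
# Back-of-envelope cost-per-token math. Every number here is an
# illustrative placeholder, not a published Inferact or customer figure.

GPU_COST_PER_HOUR = 4.00     # assumed $/hour for one accelerator
BASELINE_TPUT = 1_000        # assumed tokens/second on a baseline engine
VLLM_SPEEDUP = 3.0           # midpoint of the claimed 2-4x range


def cost_per_million_tokens(tokens_per_sec: float) -> float:
    """Dollars spent per million generated tokens at a given throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return GPU_COST_PER_HOUR / tokens_per_hour * 1_000_000


baseline = cost_per_million_tokens(BASELINE_TPUT)
optimized = cost_per_million_tokens(BASELINE_TPUT * VLLM_SPEEDUP)
print(f"baseline:  ${baseline:.2f} per million tokens")    # $1.11
print(f"optimized: ${optimized:.2f} per million tokens")   # $0.37
print(f"savings:   {1 - optimized / baseline:.0%}")        # 67%
```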
Model-agnostic inference platform
Support for 500+ model architectures and 200+ accelerator types allows AI teams to standardize on a single inference layer regardless of which frontier models they deploy or what hardware they own.
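One common way teams standardize on such a layer is vLLM's OpenAI-compatible HTTP server: the application speaks one API while the model and hardware behind the endpoint can change. A brief sketch (the model name is an arbitrary example):

```python
# Talking to a vLLM deployment through its OpenAI-compatible endpoint.
# Assumes a server was started separately, e.g.:
#   vllm serve mistralai/Mistral-7B-Instruct-v0.3
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local vLLM endpoint
    api_key="EMPTY",                      # vLLM ignores the key by default
)

resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",  # whichever model the server loaded
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```

Because the endpoint speaks the OpenAI wire format, swapping the served model, or the accelerator underneath it, requires no application-code changes.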
Alternatives
NVIDIA TensorRT-LLM: Choose TensorRT-LLM if you're already invested in the NVIDIA ecosystem and want GPU-vendor-optimized inference; choose Inferact for hardware flexibility and a community-driven roadmap.
Hugging Face Text Generation Inference: Choose TGI if you prioritize Hugging Face Hub integration; choose Inferact for higher throughput and broader model/hardware support.
vLLM (open-source): Choose open-source vLLM if you have a large infrastructure team for deployment, maintenance, and scaling; choose Inferact for managed commercial support and enterprise features.
FAQ
What does Inferact do?
Inferact commercializes vLLM, an open-source LLM inference engine that optimizes how AI models manage memory and process requests during inference. It enables enterprises to deploy frontier models on any hardware platform with 2-4× better throughput than competing systems, without vendor lock-in.
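For reference, the underlying open-source library exposes a small offline-batch API; here is a minimal sketch using vLLM's documented quickstart interface (the model name is just a small example for local testing):

```python
# Minimal offline-batch example using the open-source vLLM library that
# Inferact commercializes. The model name is a small example for local testing.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")                    # load any supported model
params = SamplingParams(temperature=0.8, max_tokens=32)

outputs = llm.generate(["The key idea behind PagedAttention is"], params)
for out in outputs:
    print(out.outputs[0].text)
```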
How much does Inferact cost?
Pricing is not publicly disclosed. Inferact plans to offer both open-source vLLM and a commercial serverless product with enterprise features like observability, troubleshooting, and disaster recovery. Contact sales for pricing details.
What are alternatives to Inferact?
NVIDIA TensorRT-LLM (GPU-vendor proprietary framework), Hugging Face Text Generation Inference (integrated with Hugging Face Hub), and the open-source vLLM project itself for teams with dedicated infrastructure resources.
Who uses Inferact?
Enterprise and platform AI teams including Amazon (Rufus), LinkedIn, Roblox, Meta, Mistral AI, IBM, Stripe, and Spotify. Target customers are organizations deploying LLMs in production at scale across diverse hardware environments.
How does Inferact compare to NVIDIA TensorRT-LLM?
Inferact's vLLM achieves 2-4× throughput improvements over TensorRT-LLM with comparable latency. Unlike TensorRT-LLM, vLLM supports 500+ model architectures, runs on 200+ accelerator types (not just NVIDIA), and is community-driven rather than vendor-controlled.
Tags
LLM inference · vLLM · AI infrastructure · hardware-agnostic · model serving · GPU optimization · PagedAttention · open-source commercialization