
Inference infrastructure: Building efficient token factories
Deep Infra’s Blueprint for AI Inference: Powering the Token Economy with Open Source and Specialized Infrastructure
(This article was generated with AI and is based on an AI-generated transcription of a real talk on stage. While we strive for accuracy, we encourage readers to verify important information.)
Mr. Nikola Borisov, CEO and Co-Founder of Deep Infra, spoke at Web Summit Vancouver 2026. He highlighted the critical role of inference infrastructure in building efficient “token factories,” explaining how generating tokens for AI inference underpins modern AI applications.
Deep Infra operates a purpose-built inference cloud: vertically integrated infrastructure for highly efficient token generation. The company owns and operates GPUs in US data centers, primarily the latest Nvidia Blackwell B300 chips, in clusters optimized for inference workloads.
The platform hosts over 200 leading open-source AI models, accessible via a simple API, and processes approximately 5 trillion tokens weekly. Inference demands immense compute: a 100-billion-parameter model requires roughly 100 billion multiplications per token, which Mr. Borisov estimated at 100 million times the cost of a traditional database fetch.
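To put those figures in perspective, here is a back-of-the-envelope sketch, assuming (as in the talk) one multiplication per parameter per generated token:

```python
# Rough compute estimate from the figures quoted above.
# Assumption: ~1 multiplication per parameter per generated token.

params = 100e9            # 100-billion-parameter model
tokens_per_week = 5e12    # ~5 trillion tokens served weekly

mults_per_token = params                             # 1e11 per token
mults_per_week = mults_per_token * tokens_per_week   # 5e23 per week

print(f"{mults_per_token:.0e} multiplications per token")
print(f"{mults_per_week:.0e} multiplications per week")
```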
Deep Infra views inference as a core infrastructure challenge that calls for a vertically integrated solution. Much as CDNs specialize in content delivery, this cloud focuses exclusively on inference, with purpose-built chips, optimized racks, and clusters distinct from general-purpose training setups.
Effective inference relies on advanced software techniques like quantization, custom CUDA kernels, and intelligent fleet structuring. Open-source models offer substantial benefits: cheaper tokens (10-100x), flexibility to modify weights, greater stability, enhanced privacy, and freedom from vendor lock-in.
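To make one of those techniques concrete, below is a minimal sketch of symmetric int8 weight quantization. It is illustrative only; production engines typically use per-channel or block-wise schemes (and formats such as FP8 or INT4) with calibrated scales:

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float weights to int8 plus one dequantization scale."""
    scale = np.abs(weights).max() / 127.0          # largest magnitude maps to 127
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
# int8 weights take 4x less memory than float32, cutting memory bandwidth,
# which is often the bottleneck in token generation, and thus cost per token.
```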
The inference infrastructure stack begins with data centers and affordable power; since network latency is less critical for inference, data centers can be sited where power is cheap. On top of that sit purpose-built chips, servers, Nvidia drivers and firmware, and a Kubernetes-based scheduling layer that manages a distributed cloud across multiple data centers.
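As an illustration of the kind of decision that scheduling layer makes, here is a deliberately simplified, hypothetical best-fit placement of model replicas onto GPU nodes; a real scheduler also weighs utilization, demand forecasts, and locality:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    free_gb: float                     # unallocated GPU memory on this node
    models: list[str] = field(default_factory=list)

def place(model: str, needs_gb: float, nodes: list[Node]) -> Node | None:
    """Best-fit placement: the tightest node that still fits the replica."""
    candidates = [n for n in nodes if n.free_gb >= needs_gb]
    if not candidates:
        return None                    # no capacity: trigger a scale-up instead
    best = min(candidates, key=lambda n: n.free_gb)
    best.free_gb -= needs_gb
    best.models.append(model)
    return best

# Hypothetical fleet: node names, memory sizes, and model names are made up.
nodes = [Node("gpu-node-0", 288.0), Node("gpu-node-1", 144.0)]
print(place("example-70b-model", 80.0, nodes))
```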
Efficient inference engines like TensorRT-LLM, vLLM, and SGLang are crucial, each optimized for different modalities. Key-Value (KV) caching is vital for reducing computation costs, avoiding redundant GPU work when agents repeatedly send overlapping requests. Sophisticated orchestration manages hundreds of models scaling dynamically, ensuring full cluster utilization.
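Here is a minimal numpy sketch of why KV caching helps: keys and values for past tokens are computed once and appended to a cache, so each new token attends over the cache instead of reprocessing the entire prefix. Dimensions and weights below are illustrative:

```python
import numpy as np

d = 64                                         # head dimension (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

K_cache = np.empty((0, d))                     # one row per past token
V_cache = np.empty((0, d))

def decode_step(x: np.ndarray) -> np.ndarray:
    """Attend from new-token embedding x (shape (d,)) over all cached tokens."""
    global K_cache, V_cache
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    K_cache = np.vstack([K_cache, k])          # append instead of recompute
    V_cache = np.vstack([V_cache, v])
    scores = K_cache @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())    # softmax over past tokens
    weights /= weights.sum()
    return weights @ V_cache

for _ in range(5):                             # generate five tokens
    out = decode_step(rng.standard_normal(d))
print(out.shape)                               # (64,)
```

Without the cache, every step would recompute K and V for the whole prefix, so step t would cost O(t) projections instead of O(1).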
API billing and function calling are essential for a robust service; a sketch of per-token metering appears below. Mr. Borisov concluded by illustrating Deep Infra’s exponential growth in token generation over three years, emphasizing the vast and growing demand for inference.
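To make the billing side concrete, here is a sketch of per-token metering; the prices are invented for illustration and are not Deep Infra’s actual rates:

```python
# Hypothetical per-million-token prices; real rates vary by model.
PRICE_PER_MTOK = {"input": 0.08, "output": 0.30}   # USD (made up)

def bill(prompt_tokens: int, completion_tokens: int) -> float:
    """Charge for one request, metering input and output tokens separately."""
    return (prompt_tokens * PRICE_PER_MTOK["input"]
            + completion_tokens * PRICE_PER_MTOK["output"]) / 1e6

print(f"${bill(12_000, 800):.6f}")             # one agent request: $0.001200
```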

