vLLM Optimization Techniques: 5 Practical Methods to Improve Performance

Running large language models efficiently can be challenging: you want good performance without overloading your servers or exceeding your budget. That's where vLLM comes in - but even this powerful inference engine can be tuned to run faster and use resources more wisely.
In this post, we'll explore five cutting-edge optimization techniques that can dramatically improve your vLLM performance:
- Prefix Caching - Stop recomputing what you've already computed
- FP8 KV-Cache - Shrink the cache footprint so more requests fit in GPU memory
- CPU Offloading - Let CPU memory hold what doesn't fit on the GPU
- Disaggregated P/D - Split prefill and decode across separate instances for better scaling
- Zero Reload Sleep Mode - Free GPU memory when a model is idle and wake it without a full reload
Each technique addresses a different bottleneck, and together they can significantly improve your inference pipeline performance. Let's explore how these optimizations work.
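Before diving in, here is a minimal sketch of how three of these knobs (prefix caching, the FP8 KV-cache, and CPU offloading) can be switched on through vLLM's Python API. Treat it as an illustration rather than a recommended configuration: the flag names (`enable_prefix_caching`, `kv_cache_dtype`, `cpu_offload_gb`), the model id, and the offload size are assumptions that depend on your vLLM version and hardware. Disaggregated P/D and sleep mode need more setup and are covered in their own sections.

```python
# Minimal sketch: enabling three of the optimizations via vLLM's Python API.
# Flag availability and behavior vary by vLLM version and GPU; adjust for your setup.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model id; use whatever you normally serve
    enable_prefix_caching=True,   # reuse KV blocks for prompts that share a prefix
    kv_cache_dtype="fp8",         # store the KV cache in 8-bit floats (needs hardware/kernel support)
    cpu_offload_gb=4,             # spill up to ~4 GB of weights to CPU RAM when GPU memory is tight
)

outputs = llm.generate(
    ["Summarize the main idea of prefix caching in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

With this in place, repeated prompts that share a system prompt or few-shot prefix skip redundant prefill work, and the FP8 cache leaves room for more concurrent sequences on the same card.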
