When does it make sense to run your own LLM inference infrastructure instead of paying per-token to third-party APIs like OpenAI or Anthropic? This session gives you a practical framework for that decision and the technical foundations to execute it. We’ll ground everything in three concrete use cases – interactive chat, batch document processing, and high-volume stream classification – showing how different requirements around latency, throughput, and data sovereignty lead to different answers. You’ll leave with a clear understanding of the tools that make self-hosting tractable (vLLM, open-weight models), the critical inference-server performance metrics, and an honest view of the cost trade-offs involved.
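As a taste of the cost discussion, here is a minimal back-of-envelope sketch in Python comparing per-token API pricing against renting a GPU and serving an open-weight model yourself. Every number, and the helper names api_cost and self_hosted_cost, are placeholder assumptions for illustration, not figures from the session; plug in your provider's actual pricing and the throughput you measure on your own hardware.

```python
# Break-even sketch: per-token API pricing vs. a self-hosted GPU server.
# All constants below are assumed placeholder values, not real quotes.

API_PRICE_PER_1M_TOKENS = 3.00      # USD, assumed blended input/output price
GPU_HOUR_COST = 2.50                # USD, assumed on-demand cost for one GPU
SELF_HOSTED_TOKENS_PER_SEC = 2_000  # assumed sustained throughput with batching

def api_cost(tokens: float) -> float:
    """Cost of pushing `tokens` tokens through a per-token API."""
    return tokens / 1_000_000 * API_PRICE_PER_1M_TOKENS

def self_hosted_cost(tokens: float) -> float:
    """Cost of processing `tokens` tokens on one rented GPU,
    assuming you only pay for the hours the server is busy."""
    hours = tokens / SELF_HOSTED_TOKENS_PER_SEC / 3600
    return hours * GPU_HOUR_COST

if __name__ == "__main__":
    for tokens in (1e6, 1e8, 1e10):
        print(f"{tokens:>14,.0f} tokens: "
              f"API ${api_cost(tokens):>10,.2f}  vs.  "
              f"self-hosted ${self_hosted_cost(tokens):>10,.2f}")
```

Under these made-up assumptions the self-hosted option only pulls ahead at sustained high volume, which is exactly the kind of trade-off the three use cases in the session are meant to expose.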

Technical Level of Session: Introductory level/students (some technical knowledge needed)