Optimising performance is a central concern in machine learning (ML) serving. This presentation examines how distributed caches can be used to accelerate ML serving: by caching frequently accessed model predictions and intermediate computations, organisations can reduce latency and improve throughput in ML inference pipelines. Attendees will gain a practical understanding of the benefits and challenges of incorporating distributed caches into ML serving architectures, from cache design considerations to implementation best practices, and will leave with the knowledge and tools needed to apply distributed caching to their own serving workloads.
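As a taste of the pattern discussed in the session, the sketch below shows a minimal read-through prediction cache. Redis is used here purely as an example of a distributed cache; the host/port, TTL, and the `model.predict` interface are illustrative assumptions, not details taken from the session.

```python
import hashlib
import json

import redis

# Example distributed cache client (assumed local Redis instance).
cache = redis.Redis(host="localhost", port=6379)

CACHE_TTL_SECONDS = 300  # bound how stale a cached prediction may become


def cache_key(features: dict) -> str:
    """Derive a stable cache key from JSON-serialisable input features."""
    payload = json.dumps(features, sort_keys=True).encode()
    return "pred:" + hashlib.sha256(payload).hexdigest()


def predict_with_cache(features: dict, model) -> dict:
    """Return a cached prediction if present; otherwise run inference and cache it."""
    key = cache_key(features)
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)  # cache hit: skip model inference entirely

    prediction = model.predict(features)  # placeholder model interface
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(prediction))
    return prediction
```

The same read-through idea extends to intermediate computations (for example, cached feature transformations or embeddings), with the key derived from the inputs to that stage rather than the full request.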
Technical Level: Technical practitioner
Session Length: 40 minutes