Spark Optimisation: Building an Efficient Lakehouse by Oleksandra Bovkun
Apache Spark is a unified analytics engine for large-scale data processing. A lakehouse is a new, open architecture that combines the best elements of data lakes and data warehouses. Lakehouses are enabled by a new open and standardized system design: implementing data structures and data management features similar to those in a data warehouse, directly on the kind of low-cost storage used for data lakes. In this talk we’ll cover how to use Spark in the most efficient way: how writing optimised Spark jobs can reduce run time and costs, building a strong and future-proof foundation for the lakehouse.
We’ll discuss topics such as data partitioning, choosing the optimal Spark configuration, and the main pitfalls to avoid.