Apache Spark is the most popular open source analytics engine for large-scale data processing. Spark is not only a mature system, but thanks to its support of multiple resource managers like Hadoop, Mesos, and Kubernetes it has become a popular choice for both batch and streaming workloads in the industry.

Apache Beam has included a Spark runner since its inception to allow users to execute Beam pipelines on Spark, but until recently the Spark runner could only execute pipelines written in Java. In this talk we will introduce the portability framework and how we adapted it into the existing Spark runner translation to make the Spark runner portable.

We will show you how to execute Beam pipelines written in Python and Golang in Spark with Beam and invite you to use the new Spark Portable Runner. We will mention the use case of Tensorflow Extended, the end-to-end platform for data validation and transformation and ML model analysis. Finally we will discuss ongoing work and some future plans for the portable runner.


Frannz Salon