Apache Beam is an open-source, unified model for defining both batch and
streaming data-parallel processing pipelines. Using one of the open-source Beam
SDKs, users can build a program that defines the pipeline. The Beam Pipeline Runners translate the data processing pipeline that user-defined with his Beam program into the API compatible with the distributed
processing back-end of his choice.
A Beam
Runner runs a Beam pipeline on a specific (often distributed) data processing system. Available runners are listed below:
·
DirectRunner: Runs locally on your machine
·
Apex Runner: Runs on Apache Apex.
·
FlinkRunner: Runs on Apache Flink.
·
SparkRunner: Runs on Apache Spark.
·
DataflowRunner: Runs on Google Cloud Dataflow
·
GearpumpRunner: Runs on Apache Gear pump (incubating).
·
SamzaRunner: Runs on Apache Samza.
·
NemoRunner: Runs on Apache Nemo.
·
JetRunner: Runs on Hazelcast Jet.
The vision of Beam is
to support: End Users who want to write pipelines in the language of their
choice, SDK Writers who wish to unleash the power of beam through various new
languages and finally Runner Writers who has a distributed processing
environment and looking forward to supporting the Beam pipelines.
The Hazelcast Jet Runner is one such runner
that can be used to execute Beam pipelines using Hazelcast Jet. It allows the user to write a modern Java code that focuses purely on data transformation
while it does all the heavy lifting of getting the data flowing and computation
running across a cluster of nodes. It supports working with both bounded (batch)
and unbounded (streaming) data.
As a part of my
course work, I decided to specialize in the track of distributed systems. After
completing Distributed
Systems , I have enrolled for Advanced
Distributed Systems course as well which gave me an interesting opportunity
to develop a streaming query that analyses data from the Linear
Road Benchmark and I deployed that query in a Flink cluster.
This was the initial
spark that triggered my interest in Data Streaming, and I continued to
explore Apache Flink, Apache Spark, Apache Samza, and their runner
support to Apache Beam. While diving deep into Beam Pipeline Runners, the conference talks about Apache Flink runner for Beam
and samza portable runner
for Beam gave me an architectural insight about Beam portable runners. Recently I worked on
these Distributed
Computing Projects, and I gained some hands-on experience with basic data
streaming modules. I have also developed a Blackboard to implement Strict, Loose, and eventual consistency models as a part of my distributed systems course work.
I was trying to explore
more into the field of Distributed Systems. Finally, I found this
interesting DAG-based distributed
computing
Java library, Jet Runner for building fault-tolerant
and elastic data processing pipeline that can distribute DAG tasks across cores and nodes to run in parallel.
One other interesting feature of JET is the use of application-level
cooperative threads that enable efficient parallelism without any overhead
of context switching in OS-level threads. Thus, high-end performance is guaranteed
by Jet with no external planning requirement.
No comments:
Post a Comment