Tuesday, 2 June 2020

Exploring Hazlecast Jet Runner


Apache Beam is an open-source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open-source Beam SDKs, users can build a program that defines the pipelineThe Beam Pipeline Runners translate the data processing pipeline that user-defined with his Beam program into the API compatible with the distributed processing back-end of his choice.
A Beam Runner runs a Beam pipeline on a specific (often distributed) data processing system. Available runners are listed below:
·        DirectRunner: Runs locally on your machine
·        Apex Runner: Runs on Apache Apex.
·        FlinkRunner: Runs on Apache Flink.
·        SparkRunner: Runs on Apache Spark.
·        DataflowRunner: Runs on Google Cloud Dataflow
·        GearpumpRunner: Runs on Apache Gear pump (incubating).
·        SamzaRunner: Runs on Apache Samza.
·        NemoRunner: Runs on Apache Nemo.
·        JetRunner: Runs on Hazelcast Jet.

The vision of Beam is to support: End Users who want to write pipelines in the language of their choice, SDK Writers who wish to unleash the power of beam through various new languages and finally Runner Writers who has a distributed processing environment and looking forward to supporting the Beam pipelines.

The Hazelcast Jet Runner is one such runner that can be used to execute Beam pipelines using Hazelcast Jet. It allows the user to write a modern Java code that focuses purely on data transformation while it does all the heavy lifting of getting the data flowing and computation running across a cluster of nodes. It supports working with both bounded (batch) and unbounded (streaming) data.

As a part of my course work, I decided to specialize in the track of distributed systems. After completing Distributed Systems , I have enrolled for Advanced Distributed Systems course as well which gave me an interesting opportunity to develop a streaming query that analyses data from the Linear Road Benchmark and I deployed that query in a Flink cluster.

This was the initial spark that triggered my interest in Data Streaming, and I continued to explore Apache Flink, Apache Spark, Apache Samza, and their runner support to Apache Beam. While diving deep into Beam Pipeline Runners, the conference talks about Apache Flink runner for Beam and samza portable runner for Beam gave me an architectural insight about Beam portable runners. Recently I worked on these Distributed Computing Projects, and I gained some hands-on experience with basic data streaming modules. I have also developed a Blackboard to implement Strict, Loose, and eventual consistency models as a part of my distributed systems course work.

I was trying to explore more into the field of Distributed Systems. Finally, I found this interesting DAG-based distributed computing Java library, Jet Runner for building fault-tolerant and elastic data processing pipeline that can distribute DAG tasks across cores and nodes to run in parallel. One other interesting feature of JET is the use of application-level cooperative threads that enable efficient parallelism without any overhead of context switching in OS-level threads. Thus, high-end performance is guaranteed by Jet with no external planning requirement.

No comments:

Post a Comment