Getting started with Spark Structured Streaming and Kafka

I recently took the Cloud Computing Specialization on Coursera, whose capstone project involves processing a set of datasets first in batch and then with streaming. For the streaming part I wanted to use Spark, and that is when I came across a new streaming engine called Spark Structured Streaming. I ended up using it for the project, along with other technologies such as Kafka and Cassandra.

Spark Structured Streaming is a new streaming engine built on top of the Spark SQL engine and Datasets. It lets you create streaming applications with a higher-level API, without having to deal with some of the details that the previous RDD-based Spark Streaming required, such as writing intermediate results.

The Spark Structured Streaming Programming Guide already does a great job of covering this new engine, so this article focuses on writing a simple application. Kafka is currently pretty much a no-brainer choice for most streaming applications, so we’ll look at a use case that integrates Spark Structured Streaming with Kafka.
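To give a feel for what the integration looks like, here is a minimal PySpark sketch that reads a Kafka topic as a stream and prints each message to the console. It is only an illustration under a few assumptions: the broker address `localhost:9092`, the topic name `events`, and the helper names `kafka_source_options`, `start_console_query` and `main` are all made up for this example, and the job needs to be submitted with the `spark-sql-kafka-0-10` connector package on the classpath.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Options for Spark's built-in "kafka" streaming source.

    The broker address and topic passed in are placeholders for this sketch.
    """
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def start_console_query(spark, options):
    """Read a Kafka topic as a stream and print each message value to the console."""
    messages = (
        spark.readStream
        .format("kafka")
        .options(**options)
        .load()
        # Kafka delivers key and value as binary columns, so cast the value to text.
        .selectExpr("CAST(value AS STRING) AS value")
    )
    return (
        messages.writeStream
        .format("console")
        .outputMode("append")
        .start()
    )


def main():
    # Imported here so the helpers above can be read and tested without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-console-sketch").getOrCreate()
    options = kafka_source_options("localhost:9092", "events")  # assumed broker and topic
    query = start_console_query(spark, options)
    query.awaitTermination()
```

Calling `main()` from a script run with `spark-submit` would start the query. Notice that nothing here deals with RDDs or intermediate results: the stream is treated as a table that keeps growing, and the query is written much like a batch one.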


Kafka findings and scenarios

When starting with Apache Kafka, you’re overwhelmed by a lot of new concepts: topics, partitions, groups, replicas, etc. Although the Kafka documentation does a great job of explaining all of these concepts, sometimes it’s good to see them in practical scenarios. That’s what this post tries to achieve, by pointing out a few findings and scenarios, with solutions, explanations, or links for each of them.
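As a small warm-up for two of those concepts, the toy model below shows how keyed messages land on partitions and how the partitions of a topic are shared among the consumers of a group. It is a plain-Python simulation, not the Kafka client: `partition_for` and `assign_partitions` are hypothetical helpers, the hash is `crc32` as a stand-in for the one Kafka's default partitioner actually uses, and the assignment only imitates a range-style split.

```python
import zlib


def partition_for(key, num_partitions):
    """Pick a partition for a keyed message.

    Stand-in hash (crc32), not Kafka's real one. The property that matters is
    the same: an identical key always maps to the same partition while the
    partition count stays unchanged.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions, consumers):
    """Split a topic's partitions into contiguous ranges, one per group member."""
    members = sorted(consumers)
    base, extra = divmod(num_partitions, len(members))
    assignment = {}
    start = 0
    for index, member in enumerate(members):
        # The first `extra` members each take one additional partition.
        count = base + (1 if index < extra else 0)
        assignment[member] = list(range(start, start + count))
        start += count
    return assignment
```

With three partitions and two consumers, `assign_partitions(3, ["c1", "c2"])` gives `c1` two partitions and `c2` one. With two partitions and three consumers, the third consumer gets an empty list, which mirrors the fact that a partition is read by at most one consumer in a group, so extra consumers sit idle.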
