Getting started with Spark Structured Streaming and Kafka

July 26, 2017 ~ Luciano Molinari ~ Leave a comment

I was recently doing the Cloud Computing Specialization on Coursera and its capstone project is about processing a set of datasets with batch and then with streaming. During the streaming part, I wanted to use Spark and then I saw this new streaming project called Spark Structured Streaming, which I ended up using for this project, along with other technologies, like Kafka and Cassandra.

Spark Structured Streaming is a new streaming engine built on top the Spark SQL engine and Datasets. With it you can create streaming applications using a higher level API, without really having to care about some of the nuances required with the previous Spark Streaming based on RDDs, like writing intermediate results.

The Spark Structured Streaming Programming Guide already does a great job in covering a lot about this new Engine, so the idea of this article is more towards writing a simple application. Currently, Kafka is pretty much a no-brainer choice for most streaming applications, so we’ll be seeing a use case integrating both Spark Structured Streaming and Kafka.

Continue reading “Getting started with Spark Structured Streaming and Kafka” →

Containers being killed on Docker for Mac

June 11, 2017 ~ Luciano Molinari ~ Leave a comment

I recently had to write some Spark jobs and deploy them on AWS/EMR. The jobs also had external dependencies, like Cassandra and Kafka. As the datasets were not that big, I decided to run this full stack locally during development using Docker and Docker-Compose.

Continue reading “Containers being killed on Docker for Mac” →

Searching data in a Cassandra table by different fields

April 28, 2017April 28, 2017 ~ Luciano Molinari ~ Leave a comment

Cassandra is a fantastic database system that provides a lot of cool and important features for systems that need to handle large amounts of data, like horizontal scalability, elasticity, high availability, distributability, flexible consistency model among others. However, as you can probably guess, everything is a trade off and there’s no free lunch when it comes to distributed systems, so Cassandra does impose some restrictions in order to provide all of those nice features.

One of these restrictions, and the one we’ll be talking about today, is about its query/data model. One common requirement in databases is to be able to query records of a given table by different fields. When it comes to this type of operation, I consider it can be split into 2 different scenarios:

Queries that you run to explore data, do some batch processing, generate some reports, etc. In this type of scenario, you don’t need the queries to be executed in real time and, thus, you can afford some extra latency. Spark might be a good fit to handle this scenario, and that’ll probably be subject of a different post.
Queries that are part of your application and that you rely on to be able to process requests. For this type of scenario, the ideal is that you evolve your data model to support these queries, as you need to process them in an efficient way. This is the scenario that will be covered in this post.

Continue reading “Searching data in a Cassandra table by different fields” →

Kafka findings and scenarios

June 22, 2016March 6, 2017 ~ Luciano Molinari ~ Leave a comment

When starting with Apache Kafka, you’re overwhelmed with a lot of new concepts: topics, partitions, groups, replicas, etc. Although Kafka documentation does a great job in explaining all of these concepts, sometimes it’s good to see them in practical scenarios. That’s what this post will try to achieve, by pointing a few findings/scenarios and solutions/explanations/links for each of them.

Continue reading “Kafka findings and scenarios” →