Getting started with Spark Structured Streaming and Kafka

I recently took the Cloud Computing Specialization on Coursera, whose capstone project is about processing a set of datasets first in batch and then in streaming. For the streaming part I wanted to use Spark, and that is when I came across a new streaming project called Spark Structured Streaming. I ended up using it for the project, along with other technologies such as Kafka and Cassandra.

Spark Structured Streaming is a new streaming engine built on top of the Spark SQL engine and Datasets. With it you can create streaming applications using a higher-level API, without really having to care about some of the nuances of the previous RDD-based Spark Streaming, like writing intermediate results.

The Spark Structured Streaming Programming Guide already does a great job of covering this new engine, so this article focuses instead on writing a simple application. Kafka is currently pretty much a no-brainer choice for most streaming applications, so we’ll look at a use case that integrates Spark Structured Streaming with Kafka.


Searching data in a Cassandra table by different fields

Cassandra is a fantastic database system that provides a lot of cool and important features for systems that need to handle large amounts of data, such as horizontal scalability, elasticity, high availability, distributability and a flexible consistency model, among others. However, as you can probably guess, everything is a trade-off and there’s no free lunch when it comes to distributed systems, so Cassandra imposes some restrictions in order to provide all of those nice features.

One of these restrictions, and the one we’ll be talking about today, concerns its query/data model. A common requirement in databases is being able to query the records of a given table by different fields. I think this type of operation can be split into 2 different scenarios:

  • Queries that you run to explore data, do some batch processing, generate some reports, etc. In this type of scenario, you don’t need the queries to be executed in real time and, thus, you can afford some extra latency. Spark might be a good fit for this scenario, and that’ll probably be the subject of a different post.
  • Queries that are part of your application and that you rely on to be able to process requests. For this type of scenario, the ideal is that you evolve your data model to support these queries, as you need to process them in an efficient way. This is the scenario that will be covered in this post.


Testing Future Objects with ScalaTest

Scala provides nice and clean ways of dealing with Future objects, like functional composition and for comprehensions. Future objects are a great way of writing concurrent and parallel programs, as they allow you to execute asynchronous code and to extract its result at some point in the future.
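
As a rough illustration of the two styles mentioned above, here is a minimal sketch that uses only the Scala standard library. The names fetchPrice, fetchQuantity and FutureCompositionSketch are hypothetical and exist only for this example; real code would call a database or a remote service instead of returning hard-coded values.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureCompositionSketch {
  // Hypothetical asynchronous lookups (assumptions made for this example).
  def fetchPrice(item: String): Future[Int] = Future(item.length * 10)
  def fetchQuantity(item: String): Future[Int] = Future(2)

  // Functional composition: combine the two results with flatMap and map.
  def totalWithCombinators(item: String): Future[Int] = {
    // Both Futures are created before composing them, so they can run concurrently.
    val price = fetchPrice(item)
    val quantity = fetchQuantity(item)
    price.flatMap(p => quantity.map(q => p * q))
  }

  // For comprehension: the same logic, desugared by the compiler into flatMap/map.
  def totalWithFor(item: String): Future[Int] = {
    val price = fetchPrice(item)
    val quantity = fetchQuantity(item)
    for {
      p <- price
      q <- quantity
    } yield p * q
  }

  def main(args: Array[String]): Unit = {
    // Blocking with Await keeps the demo short; application code would
    // normally keep composing the Future instead of blocking on it.
    val total = Await.result(totalWithFor("book"), 2.seconds)
    println(total) // prints 80
  }
}
```

Both methods return a Future[Int] immediately, and the multiplication only happens once the two lookups have completed.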

However, this type of software requires a different mindset when reasoning about it, be it while writing the main code or while thinking about how to test it. This article will focus primarily on testing this type of code and, for this, we’ll be looking into ScalaTest, one of the best and most popular testing frameworks in the Scala ecosystem.


Kafka findings and scenarios

When starting with Apache Kafka, you’re overwhelmed with a lot of new concepts: topics, partitions, groups, replicas, etc. Although the Kafka documentation does a great job of explaining all of these concepts, sometimes it’s good to see them in practical scenarios. That’s what this post tries to achieve, by pointing out a few findings/scenarios along with solutions/explanations/links for each of them.


Building a service using Akka HTTP and Redis – Part 2 of 2

In this second and last part, we’ll see how to use Akka HTTP to create RESTful Web Services. The idea is to expose the operations defined in the first part of this article, using HTTP and JSON.

Akka HTTP implements a full server- and client-side HTTP stack on top of Akka Actors. It describes itself as “a more general toolkit for providing and consuming HTTP-based services instead of a web-framework”. One interesting aspect is that, in most scenarios, it provides different abstraction levels for doing the same thing, i.e., it offers both high- and low-level APIs. Akka HTTP is pretty much the continuation of the Spray project, and Spray’s creators are part of the Akka HTTP team. In our example, we’ll be using the high-level Routing DSL to expose services, which allows routes to be defined and composed from directives. We’ll also use the spray-json project to provide the infrastructure needed to use JSON with Akka HTTP.


Building a service using Akka HTTP and Redis – Part 1 of 2

This article is split into 2 parts and aims to show how to create a small application (or a microservice, if you prefer) using Akka HTTP (Scala) and the Redis database. The application is very focused and provides the means for a customer to be added, removed and retrieved.

In terms of technologies/tools/libs, the following ones will be used:

  • Scala
  • SBT
  • Akka HTTP/Akka HTTP Json Support/Akka HTTP Test kit
  • Rediscala
  • ScalaTest
  • Redis database

In this first part, we’ll focus on the layer responsible for communicating with Redis, while the second part will focus on Akka HTTP.
