Cassandra is a fantastic database system that provides a lot of cool and important features, such as horizontal scalability, elasticity, high availability, distributed operation, and a flexible consistency model, for systems that need to handle large amounts of data. However, as you can probably guess, everything is a trade-off and there’s no free lunch when it comes to distributed systems, so Cassandra does impose some restrictions in order to provide all of those nice features.
One of these restrictions, and the one we’ll be talking about today, concerns its query/data model. A common requirement in databases is being able to query the records of a given table by different fields. I think this type of operation can be split into two scenarios:
- Queries that you run to explore data, do some batch processing, generate some reports, etc. In this type of scenario, you don’t need the queries to be executed in real time, so you can afford some extra latency. Spark might be a good fit for this scenario, and that’ll probably be the subject of a different post.
- Queries that are part of your application and that you rely on to process requests. For this type of scenario, ideally you evolve your data model to support these queries, since you need to process them efficiently. This is the scenario that will be covered in this post.