Cassandra is a fantastic database system that provides many important features for systems that need to handle large amounts of data: horizontal scalability, elasticity, high availability, and a tunable consistency model, among others. However, as you can probably guess, everything is a trade-off and there's no free lunch when it comes to distributed systems, so Cassandra imposes some restrictions in order to provide all of those nice features.
One of these restrictions, and the one we'll be talking about today, concerns its query/data model. A common requirement in databases is to be able to query records of a given table by different fields. This type of operation can be split into two scenarios:
- Queries that you run to explore data, do some batch processing, generate some reports, etc. In this type of scenario, you don’t need the queries to be executed in real time and, thus, you can afford some extra latency. Spark might be a good fit to handle this scenario, and that’ll probably be subject of a different post.
- Queries that are part of your application and that you rely on to be able to process requests. For this type of scenario, the ideal is that you evolve your data model to support these queries, as you need to process them in an efficient way. This is the scenario that will be covered in this post.
This post is not an in-depth introduction to Cassandra, so it's assumed you have some knowledge about it. However, it's worth reinforcing three basic concepts that are quite useful when modelling a table:
- Partition Key: the part of a record (one or more columns) that is hashed, the hash then being used to select the node(s) that will store the record.
- Clustering column(s): one or more columns that determine how records are physically ordered within a given partition. This enables features like range queries and multiple filters (for a given partition key).
- Primary Key: uniquely identifies a record and is composed of the partition key and the clustering columns.
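To make the partition-key idea concrete, here is a minimal Python sketch of hash-based record placement. It is only an illustration: the node names and the use of MD5 are made up, and real Cassandra uses a Murmur3 token ring rather than a simple modulo.

```python
import hashlib

# Hypothetical 4-node cluster; real Cassandra uses a Murmur3 token ring.
NODES = ["node-a", "node-b", "node-c", "node-d"]

def node_for(partition_key: str) -> str:
    """Hash the partition key and map the hash onto a node."""
    token = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return NODES[token % len(NODES)]

# The mapping is deterministic: the same partition key always lands on
# the same node, which is why lookups by partition key are cheap.
assert node_for("alice") == node_for("alice")
```

This determinism is also why querying by a column that is *not* the partition key is hard: the hash tells you nothing about where those rows live.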
With this basic background, we can move forward.
Problem to be solved
In order to get a better feeling for how a data model can be evolved to support queries by different fields, let's use a somewhat contrived example. Suppose we have a users table defined as below:
create table users (
    login text,
    email text,
    name text,
    primary key (login)
);
And we want to search records by login and sometimes by email.
Cassandra offers a few options to tackle this problem, some might not be applicable to all scenarios and some are not really recommended, but we’ll go through them so you know they exist.
Allow filtering
The allow filtering option allows us to filter rows by any column that is not part of the partition key. If we try to search our users table by email without it, this is what happens:
cqlsh> select * from users where email = '...';
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
As you can see, Cassandra refuses to search based on the email column by default. We can force the query through with the allow filtering option:
cqlsh> select * from users where email = '...' ALLOW FILTERING ;
This will run your query, but it might be expensive. Think of it as a full table scan in a relational DB: Cassandra has to go through all the records and filter them based on the given condition.
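To see why allow filtering is expensive, consider this toy Python model of the difference between the two access patterns (the data and the dict-based "storage" are made up for illustration): a lookup by partition key is a single hash access, while filtering by a non-key column has to walk every record.

```python
# Toy in-memory stand-in for the users table, keyed by partition key.
users = {
    "alice": {"login": "alice", "email": "alice@example.com", "name": "Alice"},
    "bob":   {"login": "bob",   "email": "bob@example.com",   "name": "Bob"},
}

# Query by partition key: a single O(1) hash lookup.
by_login = users["alice"]

# "ALLOW FILTERING"-style query: an O(n) scan over every record.
by_email = [u for u in users.values() if u["email"] == "bob@example.com"]

print(by_login["name"])      # Alice
print(by_email[0]["login"])  # bob
```

With two records the scan is harmless; with billions of rows spread over a cluster, it is not.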
More details about allow filtering:
Secondary index
Another approach, similar to the one used in relational DBs, is to create a secondary index on the column we want to search by. In our example, we could do something like:
cqlsh> CREATE INDEX users_email ON users (email);
With that, you can now query by email without allow filtering. However, indexes in Cassandra should be used carefully: they are maintained locally on each node, so a query on an indexed column may have to be sent to most or all of the nodes in the cluster.
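A rough Python sketch of why secondary index queries fan out (the node layout and data are made up): each node keeps an index only over its own shard of the data, so a query by the indexed column has to consult every node.

```python
# Each "node" holds a shard of the data plus a local email -> login index.
nodes = [
    {"data": {"alice": "alice@example.com"},
     "index": {"alice@example.com": "alice"}},
    {"data": {"bob": "bob@example.com"},
     "index": {"bob@example.com": "bob"}},
]

def query_by_email(email: str) -> list:
    """Scatter-gather: every node must be asked, because any of them
    may hold rows matching the indexed value."""
    hits = []
    for node in nodes:
        if email in node["index"]:
            hits.append(node["index"][email])
    return hits

print(query_by_email("bob@example.com"))  # ['bob']
```

Contrast this with a partition-key lookup, where the hash pinpoints the owning node(s) and the coordinator can skip the rest of the cluster.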
More details can be found here:
Multiple tables
An approach that is rather common in the Cassandra world, and is considered cleaner and better, is to store the same data in different tables with different partition keys. For example, we could have two tables instead of one:
create table users_by_login (
    login text,
    email text,
    name text,
    primary key (login)
);

create table users_by_email (
    login text,
    email text,
    name text,
    primary key (email)
);
With this model, you query users_by_login when searching by login and users_by_email when searching by email. In this approach, however, your application is responsible for inserting the same data into both tables. As Cassandra is very fast at handling writes, writing the same data into both tables is not a big issue.
You do need to be careful, though, to avoid inconsistency between the tables.
For example, if your application writes to Cassandra as a result of processing an event from a messaging system (like Kafka or RabbitMQ), it's important that you only acknowledge the event once you've written to both tables. Another approach is to have different consumers responsible for writing to different tables.
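A minimal Python sketch of the application-managed dual write, with in-memory dicts standing in for the two Cassandra tables (a real application would issue two inserts through a driver): the event is only acknowledged after both writes have succeeded.

```python
# In-memory stand-ins for the two Cassandra tables.
users_by_login = {}
users_by_email = {}

def handle_user_event(event: dict) -> None:
    """Write the same record to both 'tables', then ack the event.
    If either write raised, we would not reach the ack, so the event
    would be redelivered and the writes retried."""
    record = {"login": event["login"],
              "email": event["email"],
              "name": event["name"]}
    users_by_login[record["login"]] = record
    users_by_email[record["email"]] = record
    event["ack"] = True  # stand-in for committing the offset / acking the message

evt = {"login": "alice", "email": "alice@example.com", "name": "Alice"}
handle_user_event(evt)
# Both "tables" now resolve to the same record, via different keys.
assert users_by_login["alice"] is users_by_email["alice@example.com"]
```

Note that this gives at-least-once semantics, not atomicity: a crash between the two writes still leaves a window of inconsistency until the event is retried, which is exactly the risk the paragraph above warns about.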
Materialized views
The last approach we'll talk about is materialized views, introduced in Cassandra 3.0. It is very similar to the multiple-tables approach above, with the difference that Cassandra itself takes on the responsibility of writing the same data to both tables, taking it off the application logic.
cqlsh> create materialized view users_by_email as
       select * from users_by_login
       where email is not null and login is not null
       primary key (email, login);
As you probably noticed, we added login as a clustering column. This is currently required by materialized views: the primary key columns of the base table must be part of the primary key of the materialized view. Keep in mind, though, that in the example above only email is the partition key, while login is a clustering column, so you can run queries based on email alone.
More about materialized views:
As you can see, Cassandra offers a variety of options when you need to query your data by different columns. A key part of designing your data model is thinking ahead about the queries you'll be running on it.
The main goal of this post was just to scratch the surface of the possibilities for reading data from Cassandra using different filters. I strongly encourage you to go through the links in each section to find more in-depth details about each approach.