Introduction to Amazon DynamoDB for Cassandra Developers
This blog post introduces Amazon DynamoDB to Cassandra developers and helps you get started with DynamoDB by showing some basic operations in Cassandra and then using the AWS CLI to perform the same operations in DynamoDB.
Amazon DynamoDB is a fully managed, multiregion, multimaster NoSQL database that provides consistent single-digit millisecond latency at any scale. It offers built-in security, backup and restore, and in-memory caching. It also lets you offload the administrative burden of operating and scaling a distributed database.
These features make DynamoDB compelling to migrate to from other NoSQL databases such as Apache Cassandra. You can use DynamoDB clients and SDKs to build a variety of applications such as IoT and gaming. For more information, see Using the API.
The following table summarizes important Cassandra components and how they map to DynamoDB concepts. In DynamoDB, the top-level component you deal with is the table; because DynamoDB is a fully managed service, there is no infrastructure to manage below it.
| Cassandra | DynamoDB | Description |
|---|---|---|
| Node | N/A | Where the data is stored. |
| Datacenter | N/A | Used for replication strategy; similar to an Availability Zone in AWS. |
| Cluster | N/A | Can have a single node, a single datacenter, or a collection of datacenters. |
| Keyspace | N/A | Similar to a schema in a relational database. |
| Table | Table | |
| Row | Item | |
| Column | Attribute | |
| Primary key | Primary key | |
| Partition key | Partition key | |
| Clustering column | Sort key | |
The core components of Cassandra
Cassandra requires that you install and manage it on an infrastructure service such as Amazon EC2. This carries the administrative burden of managing nodes, clusters, and datacenters, which are the foundational infrastructure components of Cassandra. A node stores data. A Cassandra datacenter is a collection of related nodes, and can be either physical or virtual; the term datacenter here is Cassandra-specific and is not to be confused with the general meaning of data center. A cluster is a collection of datacenters. Different workloads should use separate datacenters to prevent their transactions from impacting each other.
Cassandra uses replication for availability and durability: a replication strategy determines the nodes on which to place replicas, and a replication factor determines how many replicas to create across a cluster. In addition to setting up these infrastructure components, you also have to consider factors such as optimization, capacity planning, configuration, updates, security, operating system patches, and backups.
The data structure components of Cassandra are keyspaces, tables, rows, and columns. A keyspace is the outermost grouping of data, similar to a schema in a relational database, and all tables belong to a keyspace. You configure replication at the keyspace level, which means that all tables in a keyspace follow the same replication strategy and replication factor. A table stores data based on a primary key, which consists of a partition key and optional clustering columns; the clustering columns define the sort order of rows within each partition.
The core components of DynamoDB
In DynamoDB, tables, items, and attributes are the data structure components. Table names must be unique within an AWS Region for a single account. Items and attributes are analogous to rows and columns (respectively) in Cassandra. DynamoDB does not require a predefined schema and allows adding new attributes on the fly at the application level, but it does require you to define the attribute names and data types for the attributes used in the table's primary key and any local secondary indexes. Similar to Cassandra, the primary key includes a partition key, and sort keys are analogous to clustering columns. You can add global secondary indexes to your table at any time to use a variety of different attributes as query criteria.
Fully managed features of DynamoDB
The fully managed features of DynamoDB are what represent the core benefits of using DynamoDB. The serverless nature of DynamoDB removes the administrative burden of infrastructure maintenance so you can focus your resources on application functionality. Data is replicated automatically across multiple Availability Zones in an AWS Region, providing built-in high availability and data durability. For more information, see the Amazon DynamoDB Service Level Agreement.
With global tables, you can deploy a multiregion, multimaster database by specifying a set of AWS Regions without having to build and maintain your replication solution. For more information, see DynamoDB Global Tables.
DynamoDB takes care of propagating ongoing data changes to the specified AWS Regions. This is similar to a multiple datacenter deployment of a Cassandra database.
DynamoDB also provides multiple solutions for backup and recovery. On-demand backup allows you to create full backups of your tables for long-term retention and archival for regulatory compliance needs. Point-in-time recovery helps protect your DynamoDB tables from accidental write or delete operations by maintaining incremental backups of tables; you can restore a table to any point in time during the last 35 days. AWS Backup is a fully managed backup service that automates the backup of data across AWS services and integrates with DynamoDB. Backup and restore actions run with zero impact on table performance or availability.
All user data stored in Amazon DynamoDB is fully encrypted at rest, which reduces the operational burden and complexity involved in protecting sensitive data. With encryption at rest, you can build security-sensitive applications that meet strict encryption compliance and regulatory requirements.
Finally, you don't have to worry about planning the removal of tombstones through compaction as you would with Cassandra.
Data modeling
Cassandra provides a SQL-like language called Cassandra Query Language (CQL) to access data. The way you use CQL can be different from how you use SQL. RDBMSs traditionally follow the approach of normalization when designing a database to reduce redundancy and improve data integrity. They support JOINs and subqueries for flexible querying of data. The flip side is that the queries are relatively expensive and don't scale well in high-traffic situations.
In contrast, Cassandra does not support JOINs and subqueries. It performs well when the data is denormalized and stored in as few tables as possible, and when the tables, materialized views, and indexes are designed around the most common and important queries performed. You can query data in a limited number of ways, outside of which queries can be expensive and slow. That's why it's essential to understand the key differences and design approaches between the two languages when modeling your data for Cassandra. The data modeling for DynamoDB is similar in many ways.
The data modeling principles for NoSQL databases rely on writes being relatively cheap and disk space being generally the cheapest resource. To get the most efficient reads, you might need to duplicate data when designing your database. In Cassandra, you can create additional tables to address different query patterns. Cassandra 3.0 introduced materialized views, which address different query patterns efficiently without requiring additional tables.
DynamoDB provides global secondary indexes, which allow you to address different query patterns from a single table. With a global secondary index, you specify an alternate partition key and an optional sort key, so data can be partitioned separately to support different access patterns. The base table's primary key attributes are always part of the global secondary index, and you can choose which other attributes from the table to project into the index. By projecting attributes into a global secondary index, you can read only from the index without referencing the base table, thereby minimizing reads on the database. For more information, see Best Practices for DynamoDB.
Using the AWS CLI with DynamoDB
This post demonstrates some basic table operations to help you get started with DynamoDB. DynamoDB commands start with `aws dynamodb`, followed by an operation name and the parameters for that operation. For more information about the supported operations, see dynamodb in the AWS CLI Command Reference. This post presents examples in both Cassandra and DynamoDB to compare the two databases for these operations.
The following table summarizes Cassandra statement types and their equivalent DynamoDB operations.
| Cassandra Statement | DynamoDB Operation | Description |
|---|---|---|
| CREATE KEYSPACE | N/A | |
| CREATE TABLE | create-table | |
| INSERT | put-item | |
| SELECT | get-item/scan/query | |
| UPDATE | update-item | |
| DELETE FROM TABLE | delete-item | |
| DELETE COLUMN FROM TABLE | update-item | --update-expression "REMOVE column" |
Creating a table
Let's start by creating a table in both Cassandra and DynamoDB and use the table to perform some basic DML operations.
Cassandra
In Cassandra, before creating a table, you have to create a keyspace and specify a replication factor and replication strategy.
The `CREATE KEYSPACE MusicKeySpace` statement creates a top-level namespace named `MusicKeySpace`. The `WITH replication` clause of this statement defines a map of properties and values that represents the replication strategy and the replication factor for this keyspace.

The `USE MusicKeySpace` statement switches to this namespace, so all subsequent operations on objects run in the context of the `MusicKeySpace` keyspace.

The `CREATE TABLE` statement creates the table `MusicCollection` under the `MusicKeySpace` keyspace. The `PRIMARY KEY` clause in this statement designates `Artist` as the partition key and `SongTitle` as the clustering column.

The `DESCRIBE tables` command displays a list of the tables under the `MusicKeySpace` keyspace; in this post, it is `MusicCollection`.
See the following code example of these statements:
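A sketch of these statements follows (the replication class and factor are illustrative choices; adjust them for your cluster):

```sql
CREATE KEYSPACE MusicKeySpace
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

USE MusicKeySpace;

CREATE TABLE MusicCollection (
  Artist    text,
  SongTitle text,
  PRIMARY KEY (Artist, SongTitle)
);

DESCRIBE tables;
```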
DynamoDB
In DynamoDB, you start by creating a table. This post creates a table called `MusicCollection`, with the attributes `Artist` and `SongTitle` as the partition key and sort key, respectively. To create a table, use the `create-table` operation and specify the required parameters.

The `--table-name` parameter represents the name of the table, which for this post is `MusicCollection`. `--key-schema` takes a list as its value; the list's elements represent the attribute names and key types of the attributes in the primary key.

In the following example, `AttributeName=Artist,KeyType=HASH` indicates that the `Artist` attribute is the partition key, and `AttributeName=SongTitle,KeyType=RANGE` indicates that `SongTitle` is the sort key. This is also an example of the shorthand syntax that the AWS CLI supports for parameter values. The AWS CLI also supports JSON for parameter values; you can represent the value for `--key-schema` as the following code:
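For example, the JSON equivalent of the shorthand key schema might look like the following:

```shell
--key-schema '[
  {"AttributeName": "Artist", "KeyType": "HASH"},
  {"AttributeName": "SongTitle", "KeyType": "RANGE"}
]'
```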
You must also define the attributes in the `KeySchema` (represented by `--key-schema`) in the `AttributeDefinitions` array (represented by `--attribute-definitions`). `--attribute-definitions` represents an array of attributes that describe the key schema for the table; here, the attributes `Artist` and `SongTitle` describe the key schema.

`--provisioned-throughput` represents the read and write capacity per second allocated to the table. A detailed explanation of read/write capacity and provisioned throughput is outside the scope of this post; suffice it to say that DynamoDB allocates the resources necessary to meet the read and write activity your application requires, based on the specified provisioned throughput. For more information, see Read/Write Capacity Mode.

You can manually increase or decrease the throughput depending on the traffic to the table. When you create a table from the DynamoDB console, the provisioned throughput settings default to auto scaling. For more information, see Amazon DynamoDB auto scaling: Performance and cost optimization at any scale. You can also configure auto scaling using the AWS CLI; however, even with auto scaling you must specify minimum and maximum levels of read and write capacity. If you would rather not estimate how much read and write throughput your application performs and prefer to pay per request for reads and writes, you can use on-demand billing by setting the `--billing-mode` parameter to `PAY_PER_REQUEST`.
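Putting these parameters together, a representative create-table command looks like the following (the capacity units shown are illustrative):

```shell
aws dynamodb create-table \
    --table-name MusicCollection \
    --attribute-definitions \
        AttributeName=Artist,AttributeType=S \
        AttributeName=SongTitle,AttributeType=S \
    --key-schema \
        AttributeName=Artist,KeyType=HASH \
        AttributeName=SongTitle,KeyType=RANGE \
    --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5
```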
The output of this command is a description of the table.

`TableArn` is an Amazon Resource Name (ARN) that uniquely identifies the table as an AWS resource.

`TableStatus` refers to the table's current status, which can be one of the following:
- CREATING – The table is being created.
- UPDATING – The table is being updated.
- DELETING – The table is being deleted.
- ACTIVE – The table is ready for use.

While a table is being created, `TableStatus` is initially `CREATING` and later changes to `ACTIVE`. You can perform read and write operations only on an `ACTIVE` table.

Additionally, the description of the table includes information such as the key schema, provisioned throughput, attribute definitions, table size in bytes, table name, item count, and the creation date and time, which is represented in UNIX epoch time format.
Before performing read and write operations, you can check whether the table is in the `ACTIVE` state by using the `describe-table` command, which now shows `TableStatus` as `ACTIVE`. See the following code example:
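For example:

```shell
aws dynamodb describe-table --table-name MusicCollection
```

You can also narrow the output to just the status with the AWS CLI's `--query 'Table.TableStatus'` option.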
Inserting data
DynamoDB and Cassandra both require that you specify the full primary key value when inserting an item into a table. By default, if a row already exists with the same primary key, the new INSERT replaces the old item with the new one. You can override this behavior so that the insert succeeds only if a row with the same primary key does not already exist.
Cassandra
To add new columns to a table after creating it, you must add the column definition with the `ALTER TABLE` command before inserting the data. The following code alters the table to add a new column, `AlbumTitle`; inserts a new row with values for the `Artist`, `SongTitle`, and `AlbumTitle` columns; and displays the new row:
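For example (the album title value is illustrative):

```sql
ALTER TABLE MusicCollection ADD AlbumTitle text;

INSERT INTO MusicCollection (Artist, SongTitle, AlbumTitle)
VALUES ('No One You Know', 'Call Me Today', 'Somewhat Famous');

SELECT * FROM MusicCollection;
```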
DynamoDB
In DynamoDB, you can add attributes on the fly while inserting or updating data, and an attribute can have different types across items. The following command uses the `put-item` operation to insert one item into the `MusicCollection` table, with values for the `Artist`, `SongTitle`, and `AlbumTitle` attributes.

The `--item` parameter takes a JSON map as its value. The map's elements represent attribute name-value pairs. In the following example, `"Artist": {"S": "No One You Know"}` means that the value of the `Artist` attribute in the item is of type String, which is represented by `"S"`, and its value is `"No One You Know"`.

You must provide all the attributes of the primary key. In the following code example, you must provide `Artist` and `SongTitle`:
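A representative put-item command (the album title value is illustrative):

```shell
aws dynamodb put-item \
    --table-name MusicCollection \
    --item '{
        "Artist": {"S": "No One You Know"},
        "SongTitle": {"S": "Call Me Today"},
        "AlbumTitle": {"S": "Somewhat Famous"}
    }'
```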
You can query the table by using the `get-item` API. The value of the `--key` parameter is a map of attribute names to attribute values representing the primary key of the item to retrieve. For example, `"No One You Know"` is the attribute value for `Artist`, represented in the command as `{"Artist": {"S": "No One You Know"}}`. The output of the command is the retrieved item. See the following code example:
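For example:

```shell
aws dynamodb get-item \
    --table-name MusicCollection \
    --key '{"Artist": {"S": "No One You Know"}, "SongTitle": {"S": "Call Me Today"}}'
```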
Using TTL to remove stale data
You can automatically remove items from your table after a period of time by using Time To Live (TTL). Cassandra specifies TTL as the number of seconds from the time of creating or updating a row, after which the row expires. In DynamoDB, TTL is a timestamp value representing the date and time at which the item expires.
Cassandra
This example inserts a new row into the `MusicCollection` table and specifies a TTL of 86,400 seconds for the row with the `USING TTL` clause. The example also demonstrates that the `INSERT` statement requires a value for each component of the primary key, but not for any other columns: it provides values for `Artist` and `SongTitle`, but not for `AlbumTitle`. See the following code example:
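For example (the artist and song values are illustrative):

```sql
INSERT INTO MusicCollection (Artist, SongTitle)
VALUES ('The Acme Band', 'Look Out, World')
USING TTL 86400;
```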
DynamoDB
In DynamoDB, you must explicitly enable TTL on a table by identifying a TTL attribute. This attribute should contain the timestamp of when the item should expire in epoch time format, and you must store it as a number. For more information, see Time to Live: How It Works. You also can archive deleted items automatically to a low-cost storage service such as Amazon S3, a data warehouse such as Amazon Redshift, or Amazon OpenSearch Service. For more information, see Automatically Archive Items to S3 Using DynamoDB Time to Live (TTL) with AWS Lambda and Amazon Kinesis Firehose.
The code in this example enables TTL for the `MusicCollection` table and then inserts an item that carries a TTL value. Use the `update-time-to-live` operation to enable TTL for the table. The `--time-to-live-specification` parameter represents the settings used to enable or disable TTL for the table: it names `ttl` as the attribute that holds the TTL value (a timestamp in epoch time format) and sets `Enabled` to `True`. See the following code example:
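For example:

```shell
aws dynamodb update-time-to-live \
    --table-name MusicCollection \
    --time-to-live-specification "Enabled=true, AttributeName=ttl"
```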
Use the `date` command to retrieve a timestamp, in epoch time format, for the date and time 86,400 seconds from now. See the following code example:
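For example, on a Linux client with GNU date:

```shell
# Compute the epoch timestamp for 86,400 seconds (one day) from now
EXP=`date -d '+86400 secs' +%s`
echo $EXP
```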
The subsequent `put-item` operation inserts a new item into the table and assigns the value represented by `EXP` to the attribute `ttl`. For clients using macOS, use EXP=`date -v +1d '+%s'` instead of EXP=`date -d '+86400 secs' +%s`. See the following code example:
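A representative command (the artist and song values are illustrative; the shell splices `$EXP` into the single-quoted JSON):

```shell
aws dynamodb put-item \
    --table-name MusicCollection \
    --item '{
        "Artist": {"S": "The Acme Band"},
        "SongTitle": {"S": "Look Out, World"},
        "ttl": {"N": "'"$EXP"'"}
    }'
```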
Updating data
For items you need to update, the update operation requires that you specify a value for each column of the primary key. The operation modifies only the columns or attributes that you provide in the update.
Cassandra
This example updates the value of the `AlbumTitle` column for the row whose `Artist` and `SongTitle` values are `'No One You Know'` and `'Call Me Today'`, respectively. The new value of the column is `'New Album'`. See the following code:
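For example:

```sql
UPDATE MusicCollection
SET AlbumTitle = 'New Album'
WHERE Artist = 'No One You Know' AND SongTitle = 'Call Me Today';
```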
DynamoDB
This example uses the DynamoDB `update-item` operation to perform the update, and then uses the `get-item` operation to retrieve the updated row to demonstrate the change.

You use the `--update-expression` parameter to specify new values for the attributes you are updating. `:newval` is a placeholder for the value of `AlbumTitle`; you supply its value with the `--expression-attribute-values` parameter. Here, `:newval` represents the value `"New Album"`, which is of type String. See the following code:
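For example:

```shell
aws dynamodb update-item \
    --table-name MusicCollection \
    --key '{"Artist": {"S": "No One You Know"}, "SongTitle": {"S": "Call Me Today"}}' \
    --update-expression "SET AlbumTitle = :newval" \
    --expression-attribute-values '{":newval": {"S": "New Album"}}'
```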
The following code uses `get-item` to retrieve the item that you updated:
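```shell
aws dynamodb get-item \
    --table-name MusicCollection \
    --key '{"Artist": {"S": "No One You Know"}, "SongTitle": {"S": "Call Me Today"}}'
```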
Updating behavior when a row does not exist
If an item with the specified partition key and sort key does not exist, the update adds a new item, making update a powerful data modification operation.
Cassandra
This example attempts to update a row by specifying values for the primary key columns `Artist` and `SongTitle` that do not exist. A new row is created with these values as its primary key. The `SELECT` statement that follows displays the newly added row. See the following code:
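For example (the new key values are illustrative):

```sql
UPDATE MusicCollection
SET AlbumTitle = 'Another Album'
WHERE Artist = 'The New Band' AND SongTitle = 'Call Me Tomorrow';

SELECT * FROM MusicCollection
WHERE Artist = 'The New Band' AND SongTitle = 'Call Me Tomorrow';
```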
DynamoDB
Similarly, in the following code, DynamoDB adds a new item because an item with the specified key attributes does not exist:
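For example (using the same illustrative key values as the Cassandra example):

```shell
aws dynamodb update-item \
    --table-name MusicCollection \
    --key '{"Artist": {"S": "The New Band"}, "SongTitle": {"S": "Call Me Tomorrow"}}' \
    --update-expression "SET AlbumTitle = :newval" \
    --expression-attribute-values '{":newval": {"S": "Another Album"}}'
```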
You can use the `get-item` operation with the same key attribute values that you used in the `update-item` operation to confirm that a new item was inserted. See the following code:
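```shell
aws dynamodb get-item \
    --table-name MusicCollection \
    --key '{"Artist": {"S": "The New Band"}, "SongTitle": {"S": "Call Me Tomorrow"}}'
```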
Updating data only if meeting specified conditions
You can perform a conditional update such that a new item is not added if an item with the specified key does not exist.
Cassandra
A row with `Artist = 'This should fail'` and `SongTitle = 'Throw Error'` does not exist. Because the `UPDATE` statement includes the `IF EXISTS` clause, no new row is added to the table and the operation fails. See the following code:
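For example (cqlsh reports the outcome in an `[applied]` column, which is False here):

```sql
UPDATE MusicCollection
SET AlbumTitle = 'New Album'
WHERE Artist = 'This should fail' AND SongTitle = 'Throw Error'
IF EXISTS;
```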
DynamoDB
In this example, `--key` identifies the item to update. Because this parameter includes `Artist`, any matching item has the `Artist` attribute. The `--condition-expression` parameter specifies the condition that the update operation must satisfy to succeed. The parameter value `attribute_exists(Artist)` ensures that the update succeeds only if the item already exists. See the following code example:
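For example (because no such item exists, this call fails with a ConditionalCheckFailedException):

```shell
aws dynamodb update-item \
    --table-name MusicCollection \
    --key '{"Artist": {"S": "This should fail"}, "SongTitle": {"S": "Throw Error"}}' \
    --update-expression "SET AlbumTitle = :newval" \
    --expression-attribute-values '{":newval": {"S": "New Album"}}' \
    --condition-expression "attribute_exists(Artist)"
```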
Deleting data
You can use the `DELETE` command in Cassandra to delete an entire row or to delete data from one or more selected columns. In DynamoDB, use the `update-item` operation to delete data from individual attributes, and use `delete-item` to delete entire items.
Cassandra
You can specify a column, or a comma-separated list of columns, after the `DELETE` keyword to delete data from those columns. This example specifies the column `AlbumTitle` after `DELETE`. The result set of the `SELECT` statement that follows displays `null` for the `AlbumTitle` column of the row with `Artist = 'No One You Know'` and `SongTitle = 'Call Me Today'`; the value was `'New Album'` before the `DELETE` operation. See the following code:
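For example:

```sql
DELETE AlbumTitle FROM MusicCollection
WHERE Artist = 'No One You Know' AND SongTitle = 'Call Me Today';

SELECT * FROM MusicCollection
WHERE Artist = 'No One You Know' AND SongTitle = 'Call Me Today';
```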
To delete an entire row in Cassandra, you can use the following code, which deletes the row with `Artist` value `'No One You Know'` and `SongTitle` value `'Call Me Today'`:
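```sql
DELETE FROM MusicCollection
WHERE Artist = 'No One You Know' AND SongTitle = 'Call Me Today';
```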
DynamoDB
You can use the `update-item` operation to delete an attribute from an item. Its `--update-expression` option defines one or more attributes to update, the action to perform on them, and their new values. The following code performs the `REMOVE` action on `AlbumTitle`:
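```shell
aws dynamodb update-item \
    --table-name MusicCollection \
    --key '{"Artist": {"S": "No One You Know"}, "SongTitle": {"S": "Call Me Today"}}' \
    --update-expression "REMOVE AlbumTitle"
```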
An item returned by a `get-item` operation before the `update-item` operation contains the `AlbumTitle` attribute; calling `get-item` again after the `update-item` operation shows that `AlbumTitle` no longer exists. See the following code:
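Running the same retrieval before and after the update demonstrates the change:

```shell
aws dynamodb get-item \
    --table-name MusicCollection \
    --key '{"Artist": {"S": "No One You Know"}, "SongTitle": {"S": "Call Me Today"}}'
```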
You can use the `delete-item` operation to delete an entire item. The following code deletes the item and then performs a `get-item` operation to demonstrate that the item no longer exists:
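```shell
aws dynamodb delete-item \
    --table-name MusicCollection \
    --key '{"Artist": {"S": "No One You Know"}, "SongTitle": {"S": "Call Me Today"}}'

aws dynamodb get-item \
    --table-name MusicCollection \
    --key '{"Artist": {"S": "No One You Know"}, "SongTitle": {"S": "Call Me Today"}}'
```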
Summary
In this post, we looked at commonly used Cassandra operations and their DynamoDB equivalents, and walked through commands that help a Cassandra developer get started with DynamoDB. To learn more about DynamoDB, including advanced features such as auto scaling, global tables, TTL, and transactions, see What Is Amazon DynamoDB?.
About the Author
Sravan Kumar is a Consultant with Amazon Web Services. He works with the AWS DMS development team and helps them with the software infrastructure development of their extension framework.
Source: https://aws.amazon.com/blogs/database/introduction-to-amazon-dynamodb-for-cassandra-developers/