DynamoDB
Fundamentals of NoSQL
- Partition Overloading, Secondary Indexes
Partition Key
- Uniquely identifies an item in a DynamoDB table
Sort Key
- A partition key doesn't define a single item but a container of many items; the Sort Key identifies a unique item inside that container.
- Model one-to-many and more complex relationships, and use range operators on the sort key to filter the result set. You only have to select certain items from that partition based on conditions the sort key meets, e.g. all items for a key matching ==, <, >, >=, <=, "begins with", "between", or "in"; sorted results; counts; top/bottom N values. A minimal sketch of such a condition follows below.
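For instance, a minimal sketch of a sort key condition using the AWS SDK for JavaScript v3 (the Orders table and its attribute names are hypothetical):

import { DynamoDBClient, QueryCommand } from '@aws-sdk/client-dynamodb'
const client = new DynamoDBClient({})
// Fetch only this customer's 2024 orders from its item collection.
const resp = await client.send(new QueryCommand({
  TableName: 'Orders', // hypothetical table
  KeyConditionExpression: 'CustomerId = :c AND begins_with(OrderDate, :y)',
  ExpressionAttributeValues: {
    ':c': { S: 'CUST#123' },
    ':y': { S: '2024' }
  }
}))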
When you define an index on a DynamoDB table, you're creating a derivative table. When you insert items into the main table, they are copied into and regrouped within those indexes according to the index keys specified on those items.
Design Patterns and Best Practices
- Global Tables and Summary Analytics
- Write Sharding for Selective reads
DynamoDB Book Notes
Attributes
There are ten different data types in DynamoDB. It’s helpful to split them into three categories:
- Scalars: Scalars represent a single, simple value, such as a username (string) or an age (integer). There are five scalar types: string, number, binary, boolean, and null.
- Complex: Complex types are the most flexible kind of attribute, as they represent groupings with arbitrary nested attributes. There are two complex types: lists and maps. You can use complex attribute types to hold related elements. In the Big Time Deals example in Chapter 20, we use lists to hold an array of Featured Deals for the front page of our application. In the GitHub example in Chapter 21, we use a map on our Organization items to hold all information about the Payment Plan for the Organization.
- Sets: Sets are a powerful compound type that represents multiple, unique values. They are similar to sets in your favorite programming language. Each element in a set must be the same type, and there are three set types: string sets, number sets, and binary sets. Sets are useful for tracking uniqueness in a particular domain. In the GitHub example in Chapter 21, we use a set to track the different reactions (e.g. heart, thumbs up, smiley face) that a User has attached to a particular issue or pull request.
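As an illustration, a single item mixing these types might look like this in DynamoDB's attribute-value format (the attribute names here are hypothetical, echoing the book's examples):

const item = {
  Username: { S: 'alexdebrie' },  // string scalar
  Age: { N: '35' },               // numbers travel as strings
  Verified: { BOOL: true },       // boolean scalar
  FeaturedDeals: { L: [{ S: 'DEAL#1' }, { S: 'DEAL#2' }] },       // list
  PaymentPlan: { M: { Tier: { S: 'Pro' }, Seats: { N: '10' } } }, // map
  Reactions: { SS: ['heart', 'thumbsup', 'smiley'] }              // string set
}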
Primary keys
The primary key can be simple, consisting of a single value, or composite, consisting of two values.
In DynamoDB, there are two kinds of primary keys:
- Simple primary keys, which consist of a single element called a partition key
- Composite primary keys, which consist of two elements, called a partition key and a sort key. Combination of partition and sort key will uniquely identify an item.
A partition key may sometimes be called a "hash" key, and a sort key a "range" key.
Secondary indexes
Secondary indexes allow you to reshape your data into another format for querying, so you can add additional access patterns to your data.
When you create a secondary index on your table, you specify the primary keys for your secondary index, just like when you’re creating a table. AWS will copy all items from your main table into the secondary index in the reshaped form. You can then make queries against the secondary index.
There are two kinds of secondary indexes in DynamoDB:
- Local secondary indexes - A local secondary index uses the same partition key as your table’s primary key but a different sort key. This can be a nice fit when you are often filtering your data by the same top-level property but have access patterns to filter your dataset further. The partition key can act as the top-level property, and the different sort key arrangements will act as your more granular filters.
- Must be created when table is created
- The read and write throughput for the index is the same as the core table’s throughput.
- Local secondary indexes allow you to opt for strongly-consistent reads if you want it. Strongly-consistent reads on local secondary indexes consume more read throughput than eventually-consistent reads, but they can be beneficial if you have strict requirements around consistency.
- Global secondary indexes - In contrast, with a global secondary index, you can choose any attributes you want for your partition key and your sort key. Global secondary indexes are used much more frequently with DynamoDB due to their flexibility.
- For global secondary indexes, you need to provision additional throughput for the secondary index.
- The read and write throughput for the index is separate from the core table’s throughput.
- With global secondary indexes, your only choice is eventual consistency. Data is replicated from the core table to global secondary indexes in an asynchronous manner. This means it’s possible that the data returned in your global secondary index does not reflect the latest writes in your main table. The delay in replication from the main table to the global secondary indexes isn’t large, but it may be something you need to account for in your application.
| | Key schema | Creation time | Consistency |
|---|---|---|---|
| Local secondary index | Must use same partition key as the base table | Must be created when table is created | Eventual consistency by default; can choose to receive strongly-consistent reads at a cost of higher throughput usage |
| Global secondary index | May use any attribute from table as partition and sort keys | Can be created after the table exists | Eventual consistency only |
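A sketch of defining a global secondary index at table creation, assuming on-demand billing so no index throughput needs to be provisioned (the MoviesIndex name is hypothetical):

import { DynamoDBClient, CreateTableCommand } from '@aws-sdk/client-dynamodb'
const client = new DynamoDBClient({})
await client.send(new CreateTableCommand({
  TableName: 'MoviesAndActors',
  BillingMode: 'PAY_PER_REQUEST',
  AttributeDefinitions: [
    { AttributeName: 'Actor', AttributeType: 'S' },
    { AttributeName: 'Movie', AttributeType: 'S' }
  ],
  KeySchema: [
    { AttributeName: 'Actor', KeyType: 'HASH' },  // partition key
    { AttributeName: 'Movie', KeyType: 'RANGE' }  // sort key
  ],
  GlobalSecondaryIndexes: [{
    IndexName: 'MoviesIndex',
    // Flip the keys so the data can also be queried by movie.
    KeySchema: [
      { AttributeName: 'Movie', KeyType: 'HASH' },
      { AttributeName: 'Actor', KeyType: 'RANGE' }
    ],
    Projection: { ProjectionType: 'ALL' }
  }]
}))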
Consistency model: In distributed systems, a consistency model describes how data is presented as it is replicated across multiple nodes. In a nutshell, "strong consistency" means you will get the same answer from different nodes when querying them. In contrast, "eventual consistency" means you could get slightly different answers from different nodes as data is replicated.
Importance of item collections
An item collection refers to a group of items that share the same partition key in either the base table or a secondary index.
DynamoDB partitions your data across a number of nodes in a way that allows for consistent performance as you scale. Item collections matter for two reasons. First, all items with the same partition key are kept on the same storage node, so they can be read together efficiently.
Second, item collections are useful for data modeling. The Query API action can retrieve multiple items within a single item collection. It is an efficient yet flexible operation. A lot of data modeling tips will be focused on creating the proper item collections to handle your exact needs.
3.1 DynamoDB Streams
Streams are an immutable sequence of records that can be processed by multiple, independent consumers. The combination of immutability plus multiple consumers has propelled the use of streams as a way to asynchronously share data across multiple systems.
With DynamoDB streams, you can create a stream of data that includes a record of each change to an item in your table. Whenever an item is written, updated, or deleted, a record containing the details of that change will be written to your DynamoDB stream. You can then process this stream with AWS Lambda or other compute infrastructure. DynamoDB streams enable a variety of use cases, from using DynamoDB as a work queue to broadcasting event updates across microservices.
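A minimal sketch of a Lambda handler consuming such a stream, assuming the stream is configured to include new images:

// Each invocation receives a batch of change records from the stream.
export const handler = async (event) => {
  for (const record of event.Records) {
    switch (record.eventName) {
      case 'INSERT':
      case 'MODIFY':
        // NewImage holds the item's post-change state in DynamoDB JSON.
        console.log('Item written:', record.dynamodb.NewImage)
        break
      case 'REMOVE':
        console.log('Item deleted:', record.dynamodb.Keys)
        break
    }
  }
}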
3.2 Time-to-live (TTL)
TTLs allow you to have DynamoDB automatically delete items on a per-item basis. This is a great option for storing short-term data in DynamoDB as you can use TTL to clean up your database rather than handling it manually via a scheduled job.
e.g., imagine an access keys table where user-generated tokens are active until intentionally deactivated, while machine-generated tokens expire after ten minutes.
A final note on TTLs: your application should be careful about how it handles items with TTLs. Items are generally deleted in a timely manner, but AWS only states that items will usually be deleted within 48 hours after the time indicated by the attribute. This delay could be unacceptable for the access patterns in your application. Rather than relying on the TTL for data accuracy in your application, you should confirm an item is not expired when you retrieve it from DynamoDB.
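A sketch of that guard at read time; the AccessKeys table and ExpiresAt attribute are assumptions (TTL attributes hold an epoch timestamp in seconds):

import { DynamoDBClient, GetItemCommand } from '@aws-sdk/client-dynamodb'
const client = new DynamoDBClient({})
const resp = await client.send(new GetItemCommand({
  TableName: 'AccessKeys', // hypothetical table
  Key: { Token: { S: 'token-abc123' } }
}))
// Treat an expired-but-not-yet-deleted item as missing.
const nowSeconds = Math.floor(Date.now() / 1000)
const item = resp.Item
if (!item || (item.ExpiresAt && Number(item.ExpiresAt.N) <= nowSeconds)) {
  throw new Error('Token not found or expired')
}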
3.3 Partitions
Partitions are the core storage units underlying your DynamoDB table. When a request comes into DynamoDB, the request router looks at the partition key in the request and applies a hash function to it. The result of that hash function indicates the server where that data will be stored, and the request is forwarded to that server to read or write the data as requested. The beauty of this design is in how it scales—DynamoDB can add additional storage nodes infinitely as your data scales up.
The DynamoDB team has added a concept called adaptive capacity. With adaptive capacity, throughput is automatically spread around your table to the items that need it. There’s no more uneven throughput distribution and no more throughput dilution.
While you don’t need to think much about partitions in terms of capacity, partitions are the reason you need to understand item collections. Each item collection will be in a particular partition, and this will enable fast queries on multiple items.
There are two limits around partitions to keep in mind. One is around the maximum throughput for any given partition, and the other is around the item collection size limit when using local secondary indexes.
3.4 Consistency
At a general level, consistency refers to whether a particular read operation receives all write operations that have occurred prior to the read. As we just read in the previous section, DynamoDB splits up, or "shards", its data by splitting it across multiple partitions. This allows DynamoDB to horizontally scale by adding more storage nodes. This horizontal scaling is how it can provide fast, consistent performance no matter your data size.
To handle this, there are vast numbers of storage partitions spread out across a giant fleet of virtual machines.
- When you write data to DynamoDB, there is a request router that is the frontend for all requests. It will authenticate your request to ensure you have access to write to the table. If so, it will hash the partition key of your item and send that key to the proper primary node for that item.
- The primary node for a partition holds the canonical, correct data for the items in that partition. When a write request comes in, the primary node will commit the write and also commit it to one of the two secondary nodes for the partition. This ensures the write is saved in the event of a loss of a single node.
- After the primary node responds to the client to indicate that the write was successful, it then asynchronously replicates the write to a third storage node.
Thus, there are three nodes—one primary and two secondary—for each partition. These secondary nodes serve a few purposes. First, they provide fault-tolerance in case the primary node goes down. Because that data is stored on two other nodes, DynamoDB can handle a failure of one node without data loss.
Secondly, these secondary nodes can serve read requests to alleviate pressure on the primary node. Rather than requiring that all reads and writes go through the primary, we can have all writes go through the primary and then have the reads shared across the three nodes.
However, notice that there is a potential issue here. Because writes are asynchronously replicated from the primary to secondary nodes, the secondary might be a little behind the primary node. And because you can read from the secondary nodes, it’s possible you could read a value from a secondary node that does not reflect the latest value written to the primary.
With that in mind, let’s look at the two consistency options available with DynamoDB:
- Strong consistency
- Eventual consistency
With strong consistency, any item you read from DynamoDB will reflect all writes that occurred prior to the read being executed. In contrast, with eventual consistency, it’s possible the item(s) you read will not reflect all prior writes.
Finally, there are two times you need to think about consistency with DynamoDB. First, whenever you are reading data from your base table, you can choose your consistency level. By default, DynamoDB will make an eventually-consistent read, meaning that your read may go to a secondary node and may show slightly stale data. However, you can opt into a strongly-consistent read by passing ConsistentRead=True in your API call. An eventually-consistent read consumes half the read capacity of a strongly-consistent read and is a good choice for many applications.
Second, you should think about consistency when choosing your secondary index type. A local secondary index will allow you to make strongly-consistent reads against it, just like the underlying table. However, a global secondary index will only allow you to make eventually-consistent reads. If you do choose a local secondary index, the mechanics are the same as with your base table—you can opt in to strongly consistent reads by setting ConsistentRead=True.
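A sketch of opting in on a read, using the MoviesAndActors table that appears later in these notes:

import { DynamoDBClient, GetItemCommand } from '@aws-sdk/client-dynamodb'
const client = new DynamoDBClient({})
const resp = await client.send(new GetItemCommand({
  TableName: 'MoviesAndActors',
  Key: {
    Actor: { S: 'Tom Hanks' },
    Movie: { S: 'Cast Away' }
  },
  // The default is an eventually-consistent read; this forces the read to
  // reflect all prior successful writes, at double the read capacity cost.
  ConsistentRead: true
}))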
3.5 DynamoDB limits
- A single DynamoDB item is limited to 400KB of data.
- Query and Scan will read a maximum of 1MB of data from your table. Further, this 1MB limit is applied before any filter expressions are considered.
- This 1MB limit is crucial to keeping DynamoDB’s promise of consistent single-digit response times. If you have a request that will address more than 1MB of data, you will need to paginate through the results by making follow-up requests to DynamoDB.
- A single partition can have a maximum of 3000 Read Capacity Units or 1000 Write Capacity Units.
- If you have a local secondary index, a single item collection cannot be larger than 10GB. An item collection refers to all items with a given partition key, both in your main table and any local secondary indexes. The partition size limit is not a problem for global secondary indexes. If the items in a global secondary index for a partition key exceed 10GB in total storage, they will be split across different partitions under the hood. This will happen transparently to you, one of the significant benefits of a fully-managed database.
3.6 Overloading keys and indexes
This last concept is a data modeling technique you'll use over and over in your application: give your key attributes generic names (such as PK and SK) and store multiple entity types in the same table and indexes, using prefixes in the key values to distinguish them.
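A sketch of overloaded keys, with hypothetical User and Order entities sharing generic PK/SK attributes:

// Two entity types in the same table, distinguished by key prefixes.
const user = {
  PK: { S: 'USER#alexdebrie' },
  SK: { S: 'USER#alexdebrie' }, // parent item points at itself
  Name: { S: 'Alex DeBrie' }
}
const order = {
  PK: { S: 'USER#alexdebrie' },      // same partition as the user...
  SK: { S: 'ORDER#2024-01-15#001' }, // ...so one Query fetches user + orders
  Amount: { N: '149.99' }
}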
4 The Three API Action Types
Split the API actions into three categories:
- Item-based actions - Operating on specific items
- Queries - Operating on an item collection
- Scans - Operating on the whole table
4.1 Item-based actions
- GetItem: used for reading a single item from a table.
- PutItem: used for writing an item to a table. This can completely overwrite an existing item with the same key, if any.
- UpdateItem: used for updating an item in a table. This can create a new item if it doesn’t previously exist, or it can add, remove, or alter properties on an existing item.
- DeleteItem: used for deleting an item from a table.
There are three rules around item-based actions.
- First, the full primary key must be specified in your request.
- Second, all actions to alter data—writes, updates, or deletes—must use an item-based action.
- Finally, all item-based actions must be performed on your main table, not a secondary index.
In addition to the core single-item actions above, there are two sub-categories of single-item API actions: batch actions and transaction actions.
- There is a subtle difference between the batch API actions and the transactional API actions. In a batch API request, your reads or writes can succeed or fail independently. The failure of one write won’t affect the other writes in the batch.
- With the transactional API actions, on the other hand, all of your reads or writes will succeed or fail together. The failure of a single write in your transaction will cause the other writes to be rolled back.
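A sketch of the transactional variant, where both writes succeed or fail together (the table names, items, and counter attribute are hypothetical):

import { DynamoDBClient, TransactWriteItemsCommand } from '@aws-sdk/client-dynamodb'
const client = new DynamoDBClient({})
await client.send(new TransactWriteItemsCommand({
  TransactItems: [
    {
      // If this conditional put fails, the update below rolls back too.
      Put: {
        TableName: 'Orders',
        Item: { PK: { S: 'ORDER#001' }, Status: { S: 'PLACED' } },
        ConditionExpression: 'attribute_not_exists(PK)'
      }
    },
    {
      Update: {
        TableName: 'Customers',
        Key: { PK: { S: 'USER#alexdebrie' } },
        UpdateExpression: 'SET OrderCount = if_not_exists(OrderCount, :zero) + :one',
        ExpressionAttributeValues: { ':zero': { N: '0' }, ':one': { N: '1' } }
      }
    }
  ]
}))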
4.2 Query
The Query API action lets you retrieve multiple items with the same partition key. This is a powerful operation, particularly when modeling and retrieving data that includes relations. You can use the Query API to easily fetch all related objects in a one-to-many relationship or a many-to-many relationship.
Remember that all items with the same partition key are in the same item collection. Thus, the Query operation is how you efficiently read items in an item collection. This is why you carefully structure your item collections to handle your access patterns.
While the partition key is required, you may also choose to specify conditions on the sort key in a Query operation.
You can use the Query API on either the main table or a secondary index.
Note, you can flip your partition key and sort key to create a global secondary index to allow for additional flexible access patterns. (You just provide the name of the index in IndexName.)
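For instance, a sketch of querying such a flipped index (the MoviesIndex name matches the hypothetical creation sketch earlier):

import { DynamoDBClient, QueryCommand } from '@aws-sdk/client-dynamodb'
const client = new DynamoDBClient({})
// Find all actors in a given movie by querying the flipped index.
const resp = await client.send(new QueryCommand({
  TableName: 'MoviesAndActors',
  IndexName: 'MoviesIndex', // hypothetical GSI with Movie as partition key
  KeyConditionExpression: '#movie = :movie',
  ExpressionAttributeNames: { '#movie': 'Movie' },
  ExpressionAttributeValues: { ':movie': { S: 'Cast Away' } }
}))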
4.3 Scan
A Scan will grab everything in a table. If you have a large table, this will be infeasible in a single request, so it will paginate. Your first request in a Scan call will read a bunch of data and send it back to you, along with a pagination key. You’ll need to make another call, using the pagination key to indicate to DynamoDB where you left off.
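A sketch of that pagination loop, e.g. for an export job:

import { DynamoDBClient, ScanCommand } from '@aws-sdk/client-dynamodb'
const client = new DynamoDBClient({})
let lastKey // undefined on the first request
do {
  const page = await client.send(new ScanCommand({
    TableName: 'MoviesAndActors',
    ExclusiveStartKey: lastKey // tells DynamoDB where we left off
  }))
  for (const item of page.Items) {
    // ... export or process each item ...
  }
  lastKey = page.LastEvaluatedKey // undefined once the table is exhausted
} while (lastKey)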
The times you may consider using the Scan operation are:
- When you have a very small table;
- When you’re exporting all data from your table to a different system;
- In exceptional situations, where you have specifically modeled a sparse secondary index in a way that expects a scan.
In sum, don’t use Scans.
4.4 How DynamoDB enforces efficiency
The DynamoDB API may seem limited, but it’s very intentional. The key point to understand about DynamoDB is that it won’t let you write a bad query. And by 'bad query', I mean a query that will degrade in performance as it scales.
DynamoDB uses partitions, or small storage nodes of about 10GB, to shard your data across multiple machines. The sharding is done on the basis of the partition key. Thus, if the DynamoDB request router is given the partition key for an item, it can do an O(1) lookup in a hash table to find the exact node or set of nodes where that item resides.
This is why all the single-item actions and the Query action require a partition key. No matter how large your table becomes, including the partition key makes it a constant time operation to find the item or item collection that you want.
All the single-item actions also require the sort key (if using a composite primary key) so that the single-item actions are constant time for the entire operation. But the Query action is different. The Query action fetches multiple items. So how does the Query action stay efficient?
Note that the Query action only allows you to fetch a contiguous block of items within a particular item collection. You can do operations like >=, <=, begins_with(), or between, but you can’t do contains() or ends_with(). This is because an item collection is ordered and stored as a B-tree. Remember that a B-tree is like a phone book or a dictionary. If you go to a dictionary, it’s trivial to find all words between "hippopotamus" and "igloo". It’s much harder to find all words that end in "-ing".
The time complexity of a B-tree search is O(log n). This isn’t as fast as our constant-time O(1) lookup for finding the item collection, but it’s still pretty quick to grab a batch of items from a collection. Further, the size of n is limited. We’re not doing an O(log n) search over our entire 10TB dataset. We’re doing it on a single item collection, which is likely a few GB at most.
Finally, just to really put a cap on how slow an operation can be, DynamoDB limits all Query and Scan operations to 1MB of data in total. Thus, even if you have an item collection with thousands of items and you’re trying to fetch the entire thing, you’ll still be bounded in how slow an individual request can be. If you want to fetch all those items, you’ll need to make multiple, serial requests to DynamoDB to page through the data. Because this is explicit—you’ll need to write code that uses the LastEvaluatedKey parameter—it is much more apparent to you when you’re writing an access pattern that won’t scale.
5
5.1 How expression names and values work
const items = client.query({
  TableName: 'MoviesAndActors',
  KeyConditionExpression: '#actor = :actor AND #movie BETWEEN :a AND :m',
  ExpressionAttributeNames: {
    '#actor': 'Actor',
    '#movie': 'Movie'
  },
  ExpressionAttributeValues: {
    ':actor': { S: 'Tom Hanks' },
    ':a': { S: 'A' },
    ':m': { S: 'M' }
  }
})
Let's look at:
KeyConditionExpression: '#actor = :actor AND #movie BETWEEN :a AND :m'
The first thing to note is that there are two types of placeholders. Some placeholders start with a #, like #actor and #movie, and other placeholders start with a ":", like :actor, :a, and :m. The ones that start with colons are your expression attribute values. They are used to represent the value of the attribute you are evaluating in your request. Look at the ExpressionAttributeValues property in our request. You’ll see that there are matching keys in the object for the three colon-prefixed placeholders in our request.
ExpressionAttributeValues = {
':actor': {S: 'Tom Hanks'},
':a': {S: 'A'},
':m': {S: 'M'}
}
The values in the ExpressionAttributeValues property are substituted into our KeyConditionExpression by the DynamoDB server when it receives our request.
Now let’s look at the placeholders that start with a #. These are your expression attribute names. These are used to specify the name of the attribute you are evaluating in your request. Like ExpressionAttributeValues, you’ll see that we have corresponding properties in our ExpressionAttributeNames parameter in the call.
Unlike ExpressionAttributeValues, you are not required to use ExpressionAttributeNames in your expressions. You can just write the name of the attribute directly in your expression, as follows:
KeyConditionExpression: 'Actor = :actor AND Movie BETWEEN :a AND :m'
That said, there are a few times when the use of ExpressionAttributeNames is required. The most common reason is if your attribute name is a reserved word in DynamoDB. There are 573 reserved words in DynamoDB, and many of them will conflict with normal attribute names.
5.3 Understand the optional properties on individual requests
The final advice for working with the DynamoDB API is to understand the optional properties you can add on to individual requests. The properties are:
- ConsistentRead
- ScanIndexForward: this property is available only on the Query operation, and it controls the direction in which DynamoDB reads results from the sort key. If you set ScanIndexForward=False, DynamoDB will read your sort key in descending order. (See the sketch after this list.)
- ReturnValues
- ReturnConsumedCapacity: returns data about the capacity units consumed by your request. You may use these metrics if you are in the early stages of designing your table.
- ReturnItemCollectionMetrics
The first three properties could affect the items you receive back in your DynamoDB request. The last two can return additional metric information about your table usage.
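A sketch combining a few of these options on a Query against the base table:

import { DynamoDBClient, QueryCommand } from '@aws-sdk/client-dynamodb'
const client = new DynamoDBClient({})
const resp = await client.send(new QueryCommand({
  TableName: 'MoviesAndActors',
  KeyConditionExpression: '#actor = :actor',
  ExpressionAttributeNames: { '#actor': 'Actor' },
  ExpressionAttributeValues: { ':actor': { S: 'Tom Hanks' } },
  ScanIndexForward: false,        // read the sort key in descending order
  ConsistentRead: true,           // strongly-consistent read on the base table
  ReturnConsumedCapacity: 'TOTAL' // report capacity consumed by this request
}))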
Single Table (ST) Design
- You should maintain as few tables as possible in a DynamoDB application
- Don't allow operations that don't scale
- Sources: AWS DynamoDB guidance and Alex DeBrie
Data modeling steps
- Discovery phase
- Find entities and relations between them
- List all the data access patterns
- Use NoSQL Workbench for data modeling
Discovery phase
A broker or shipper has a shipment that needs to be delivered and is suggested various carriers from within their network. Carriers will be notified of shipments that need to be delivered based on some suggestion/matching algorithm.
Booked carriers will be provided the tasks that are required for each shipment, and they can add dispatchers/drivers as key participants who will be responsible for doing the tasks.
Entities
- We need to group items into relevant collections.
- Use the sort key to model 1:N relationships (see the sketch below).
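A sketch of how the shipment-to-task 1:N relationship might land in a single item collection (entity names and key formats are hypothetical):

// One shipment and its tasks share a partition key, so a single Query
// on PK = 'SHIPMENT#1001' returns the parent and all of its children.
const shipment = {
  PK: { S: 'SHIPMENT#1001' },
  SK: { S: 'SHIPMENT#1001' },
  Broker: { S: 'BROKER#acme' }
}
const pickupTask = {
  PK: { S: 'SHIPMENT#1001' },
  SK: { S: 'TASK#01#PICKUP' },
  Status: { S: 'PENDING' }
}
const deliverTask = {
  PK: { S: 'SHIPMENT#1001' },
  SK: { S: 'TASK#02#DELIVER' },
  Status: { S: 'PENDING' }
}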