Amazon SimpleDB is a new service, part of the Amazon Web Services family, which offers a scalable, cloud-based database solution. In this post we will discuss how we are using SimpleDB instead of a traditional relational database to power Glue, our popular browser add-on.
Connecting People and Things Around The Web
Before discussing specifics of how we are using SimpleDB, it is important to understand what Glue does and what technical challenges it faces. Glue is a browser add-on that recognizes books, music, movies and other everyday topics around hundreds of popular web sites and automatically connects people around these topics.
For example, Glue recognizes that pages about Kite Runner on Amazon, Barnes and Noble, Google Books and many other popular book sites around the web are all about the same thing - the book Kite Runner. In addition to recognizing topics, Glue automatically connects people around the topics they visit. For example, if 20 people looked at Kite Runner on Amazon, 15 looked at Kite Runner on Barnes and Noble and 7 on Google Books, all 42 are connected through this book in the Glue network.
Issues With Using a Relational Database
Glue surfaces a new social layer, a new network of People and Things on top of the existing web. One can start by modeling Glue using a classic Relational Database approach. We have two types of entities, People and Things, and the relationship between them. The problem is that this head-on approach cannot scale. As I’ve written previously on ReadWriteWeb, the difficulty comes from the fact that a Relational Database cannot scale beyond a certain point.
The scalability issues that exist with Relational Databases are not accidental. The problem is not only one of scale, but of how we think about the data. Relational Databases are designed to think about data in terms of entities and relationships, but for our problem we need a database that scales by focusing on how to partition the data into independent chunks. This is exactly what new distributed storage systems are designed to do. Unlike classic Relational Databases, these solutions automatically partition and replicate data across a ring of identical services, guaranteeing fast access and reliability.
SimpleDB in a Nutshell
SimpleDB consists of multiple domains, and each domain stores a set of records. Each record has a unique key and a set of attributes, which may or may not be present in every record. Within each domain, the records are automatically indexed by every attribute. The main operations are to read or write a record by key, or to search for a set of records by one or more attributes.
SimpleDB can be used as both data storage and an indexing service. For example, one can use SimpleDB as a flat-file store, where each record maps onto a line. Another use case is to use each record in SimpleDB as metadata for a media file that is stored on Amazon S3. This solution enables quick search of media files via SimpleDB and then retrieval via Amazon S3.
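To make this data model concrete, here is a minimal in-memory sketch in Python. It is a toy stand-in, not the SimpleDB API: the `Domain` class, its method names and the sample keys are all assumptions made for illustration. It only shows the shape described above: records addressed by key, optional attributes, and an index on every attribute.

```python
from collections import defaultdict


class Domain:
    """Toy in-memory stand-in for one SimpleDB domain (hypothetical, not the real API)."""

    def __init__(self, name):
        self.name = name
        self.records = {}  # key -> {attribute: value}
        # attribute -> value -> set of keys; mimics "indexed by every attribute"
        self.index = defaultdict(lambda: defaultdict(set))

    def put(self, key, attributes):
        # Drop index entries of any record being overwritten.
        for attr, value in self.records.get(key, {}).items():
            self.index[attr][value].discard(key)
        self.records[key] = dict(attributes)
        for attr, value in attributes.items():
            self.index[attr][value].add(key)

    def get(self, key):
        return self.records.get(key)

    def search(self, **criteria):
        """Return the sorted keys of records matching every attribute=value pair."""
        matches = None
        for attr, value in criteria.items():
            keys = self.index[attr][value]
            matches = set(keys) if matches is None else matches & keys
        return sorted(matches or [])


things = Domain("things")
# Records in the same domain do not need the same attributes.
things.put("book:kite-runner", {"type": "book", "author": "Khaled Hosseini"})
things.put("movie:kite-runner", {"type": "movie"})
book_keys = things.search(type="book")
```

Because attributes are optional per record, a book and a movie can sit side by side in one domain without a shared schema.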
Replacing Joins with Duplicate Data
SimpleDB is powerful, but it lacks one key feature of Relational Databases - joins. That is, there is no built-in way to easily correlate the data stored in different records. Since Glue is about People, Things and relationships between them, the lack of joins presents a major challenge.
The solution that Glue uses relies on data duplication. Each Person and each Thing in our system has a unique key. In the case of a Person, the key is the username. In the case of a Thing, the key is a combination of the type, its name and an attribute, like author for a book or director for a movie, which provides a way to disambiguate among the objects that have the same type and the same name.
We partition People and Things into distinct domains. There are 30 domains that hold information about People and 30 domains that hold information about Things. To map a Person or a Thing to a domain we use the djb2 hash algorithm: we hash the key and reduce the result by the total number of domains to get a domain index. The hashing is necessary to ensure a uniform distribution of People and Things among the available domains.
Next, we have a concept of an Interaction - each Person interacts with a Thing in Glue. The Interaction is equivalent to a typical row in a relationship table. The key of the Interaction record is the combination of the Person key and the Thing key. What is critical is that each Interaction record is stored twice - once in the Person domain and once in the Thing domain. This redundancy is what allows us to mimic relational joins with SimpleDB.
For example, if we need to know what Things a Person has recently interacted with, we can query the corresponding Person domain. If we need to know who interacted with a given Thing, we can query the corresponding Thing domain. SimpleDB handles these queries efficiently because it was designed to optimize searches within domains by leveraging smart, automatic indices.
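The key-to-domain mapping can be sketched in a few lines of Python. The djb2 recurrence itself (start at 5381, multiply by 33, add each byte) is the standard one; everything else here is an assumption made for illustration: the 32-bit truncation, the `|` separator and lowercasing in `thing_key`, and the `domain_index` helper name.

```python
NUM_DOMAINS = 30  # the post uses 30 domains each for People and Things


def djb2(key: str) -> int:
    """Classic djb2 string hash, truncated to 32 bits (truncation is an assumption)."""
    h = 5381
    for byte in key.encode("utf-8"):
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


def domain_index(key: str, num_domains: int = NUM_DOMAINS) -> int:
    """Map a Person or Thing key to a domain index in [0, num_domains)."""
    return djb2(key) % num_domains


def thing_key(kind: str, name: str, qualifier: str) -> str:
    """Hypothetical key format: type, name and a disambiguating attribute."""
    return "|".join(part.strip().lower() for part in (kind, name, qualifier))


book = thing_key("Book", "Kite Runner", "Khaled Hosseini")
book_domain = domain_index(book)     # always the same domain for this key
person_domain = domain_index("alice")  # a Person key is just the username
```

The same key always lands in the same domain, so a reader can locate a record without any lookup table. Note that changing the number of domains changes most indices, so a scheme like this assumes the domain count is fixed up front.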
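The double write and the two lookups can be sketched as follows. This is an in-memory illustration under stated assumptions, not Glue's actual code and not the SimpleDB API: plain dicts stand in for domains, and the `InteractionStore` class, the `#` key separator and the `timestamp` attribute are hypothetical.

```python
def djb2(key: str) -> int:
    """Classic djb2 string hash, truncated to 32 bits (truncation is an assumption)."""
    h = 5381
    for byte in key.encode("utf-8"):
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


class InteractionStore:
    """Stores every Interaction twice: once by Person, once by Thing."""

    def __init__(self, num_domains=30):
        self.num_domains = num_domains
        # Each dict stands in for one domain: interaction key -> record.
        self.person_domains = [{} for _ in range(num_domains)]
        self.thing_domains = [{} for _ in range(num_domains)]

    def _slot(self, key):
        return djb2(key) % self.num_domains

    def record(self, person, thing, timestamp):
        key = f"{person}#{thing}"  # combination of Person key and Thing key
        rec = {"person": person, "thing": thing, "timestamp": timestamp}
        # The duplicate write that replaces a relational join.
        self.person_domains[self._slot(person)][key] = dict(rec)
        self.thing_domains[self._slot(thing)][key] = dict(rec)

    def things_for(self, person):
        """Things a Person interacted with, most recent first (one domain touched)."""
        domain = self.person_domains[self._slot(person)]
        hits = [r for r in domain.values() if r["person"] == person]
        hits.sort(key=lambda r: r["timestamp"], reverse=True)
        return [r["thing"] for r in hits]

    def people_for(self, thing):
        """People who interacted with a Thing (one domain touched)."""
        domain = self.thing_domains[self._slot(thing)]
        return sorted(r["person"] for r in domain.values() if r["thing"] == thing)


store = InteractionStore()
store.record("alice", "book|kite runner|khaled hosseini", timestamp=1)
store.record("bob", "book|kite runner|khaled hosseini", timestamp=2)
readers = store.people_for("book|kite runner|khaled hosseini")
```

Each question is answered from a single domain, at the cost of writing every Interaction twice. The two copies are written independently, which is where the consistency trade-off discussed below comes from.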
Discussion
To solve our problem using SimpleDB, we duplicated the data. In the traditional world of Relational Databases this is not a recommended approach. Typically, Relational Databases focus on data normalization and avoid data redundancy. This is because by design, Relational Databases are meant to support transactional systems where instant data integrity is very important. For example, having duplicate financial data that is out of sync can be quite costly.
However, consumer internet services like Glue are in a different situation. For Glue, instant and continuous data integrity is not a hard requirement. Given the trade-off between potential inconsistencies and scalability, social services have to choose scalability. In addition, the overhead of managing duplicate records in the case of Glue is minimal, and we have yet to see a situation where the records got out of sync. Finally, the cost of having a duplicate record is very low because of the SimpleDB price structure.
In principle, it is possible to implement a similar solution with a relational database. For each SimpleDB domain we could have created a table in a relational database and then stored corresponding Person and Thing interaction records there. However, our dataset is not uniform. Things like books have different attributes compared to movies or music albums and this makes it difficult to fit these records into the same tables.
Our choice to go with SimpleDB over a Relational Database was driven by the fact that SimpleDB is designed from the ground up to address the use case of scaling massive amounts of data using a cloud approach. This, in addition to the fact that our entire offering runs on the Amazon Web Services stack, made it an easy decision for us to choose SimpleDB over a Relational Database.