Relational databases were developed long before the internet, mobile devices, and the concept of big data. They are still the standard data model, but developers needed different solutions to address the ever-growing mass of data they were working with. Relational databases put developers in a corner; they had to know their data needs and architecture from the beginning, a difficult goal in the big data era. To address the inflexibility of relational databases, developers and data engineers began looking into less rigid models to suit their needs, giving rise to NoSQL databases.
To clear the air, NoSQL does not mean “No SQL”. The name is generally read as “Not Only SQL”, and these databases give developers much more latitude in addressing their emerging needs while still having the security of a database. NoSQL databases are hybrids and don’t share a common architecture the way relational databases do. They are best described by the qualities they tend to have in common:
- Not using a relational model
- Designed to run on clusters
- Designed for web architecture
- No rigid schema
Why the Change
With the growth of applications and the increasing integration of web-connected devices, developers needed more flexibility. With relational databases, the data must fit the model, and the database dictates the what and how of the query. With the new generation of “apps”, the interrogation happens in the application and in memory as it is needed, more “on the fly” than with relational databases, which lets developers use in-memory data structures. This need has given rise to aggregate data models, which allow related data to be handled as a single unit.
This aggregation makes distribution across clusters simpler and more robust, since the data can reside on multiple computers rather than on a single one. When the data are called, the aggregate model allows all associated data to be retrieved as a unit, removing the need to query other related data. This has given rise to map-reduce algorithms to retrieve and process cluster-hosted data.
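As a rough illustration of how aggregates and map-reduce fit together, here is a minimal single-process sketch in Python. The node names, the order structure, and the function names are all made up for the example; a real cluster would run the map step on each node in parallel.

```python
from collections import defaultdict

# Hypothetical order aggregates: each one carries its own line items,
# so no other record has to be queried to process it.
orders_by_node = {
    "node-a": [
        {"order_id": 1, "items": [{"sku": "pen", "qty": 2}, {"sku": "ink", "qty": 1}]},
    ],
    "node-b": [
        {"order_id": 2, "items": [{"sku": "pen", "qty": 3}]},
        {"order_id": 3, "items": [{"sku": "pad", "qty": 4}]},
    ],
}


def map_order(order):
    """Map step: emit (sku, qty) pairs from a single aggregate."""
    return [(item["sku"], item["qty"]) for item in order["items"]]


def reduce_counts(pairs):
    """Reduce step: sum the quantities emitted for each sku."""
    totals = defaultdict(int)
    for sku, qty in pairs:
        totals[sku] += qty
    return dict(totals)


def total_quantities(cluster):
    """Map every order on every node, then reduce the combined output."""
    emitted = []
    for orders in cluster.values():
        for order in orders:
            emitted.extend(map_order(order))
    return reduce_counts(emitted)


totals = total_quantities(orders_by_node)
```

Because each order is a self-contained aggregate, the map step never has to reach across nodes; only the small emitted pairs travel to the reduce step.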
The data distribution methods that make cluster hosting possible are:
- Sharding – distributing different data across multiple servers, with each server acting as the sole source for a subset of the data.
- Replication – copying data across several servers, allowing the same data to be found in multiple locations.
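The two methods can be sketched together in a few lines of Python. This is only an illustration under simple assumptions: the server names are hypothetical, keys are placed by hashing modulo the server count, and replicas go to the next servers in the list. Real databases use more elaborate placement schemes.

```python
import hashlib

SERVERS = ["server-0", "server-1", "server-2"]  # hypothetical node names


def shard_for(key, servers=SERVERS):
    """Sharding: a stable hash picks the index of the server that owns the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % len(servers)


def replicas_for(key, copies=2, servers=SERVERS):
    """Replication: the owner plus the following servers each hold a copy."""
    first = shard_for(key, servers)
    return [servers[(first + i) % len(servers)] for i in range(copies)]


owners = replicas_for("user:42")
```

Here `owners` is a list of two distinct servers, and the same key always maps to the same list, so any client can work out where to read or write without a central lookup.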
CAP
CAP is an acronym for Consistency, Availability, and Partition tolerance. According to Eric Brewer, any distributed system needs to manage these three properties but can only guarantee two at a time, leaving the third vulnerable. To have availability, for example, a developer may have to trade off consistency. Developers can tune these parameters to fit the database to the needs of their application, but this can cause problems if they are not balanced properly.
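One common way this tuning shows up in replicated stores is through read and write quorums. The sketch below is an assumption-laden illustration, not the behavior of any particular database: with N replicas, a write waits for W acknowledgements and a read consults R replicas, and reads are guaranteed to overlap the latest write only when R + W > N. The function names are invented for the example.

```python
def reads_see_latest_write(n_replicas, write_quorum, read_quorum):
    """True when every read set must overlap every write set (R + W > N)."""
    return read_quorum + write_quorum > n_replicas


def tolerated_failures(n_replicas, quorum):
    """How many replicas can be down while the operation can still reach its quorum."""
    return n_replicas - quorum


# Favoring consistency: overlapping quorums, but each operation needs 2 of 3 nodes.
strict = reads_see_latest_write(3, write_quorum=2, read_quorum=2)

# Favoring availability: any single node can answer, but reads may be stale.
relaxed = reads_see_latest_write(3, write_quorum=1, read_quorum=1)
```

Lowering the quorums lets operations succeed with more nodes unreachable, at the cost of possibly reading old data, which is the trade-off described above.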
Types of NoSQL Database
There are essentially four categories of NoSQL databases: key-value databases, document databases, column family stores, and graph databases.
Key-Value
These are the most basic NoSQL databases to use from the perspective of an API. Because all access goes through a primary key, they generally demonstrate good performance and are very easily scaled. Just like with Python dictionaries, users can call a key and get a value, add a value for a key, or delete a key from a store. The data are essentially blobs in the data store with no real organization; all organization is maintained or enforced by the calling application. Some of the more popular key-value databases are Couchbase and Redis.
While key-value databases are similar, they are not the same. Some support persistent data while others do not. If data is not persistent, all of it can be lost if a node is lost, requiring all data to be refreshed. In a persistent database, however, updating old data can be a concern. For these and other reasons, it is important to ensure a key-value database will suit your needs.
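To make the dictionary comparison concrete, here is a minimal in-memory sketch. `KeyValueStore` is a hypothetical class written for this post, not the API of Redis or Couchbase; it only shows the three operations and the fact that the store treats values as opaque bytes.

```python
import json


class KeyValueStore:
    """In-memory stand-in for a key-value database (hypothetical class)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Add or overwrite the blob stored under a key."""
        self._data[key] = value

    def get(self, key, default=None):
        """Return the blob for a key, or the default if the key is absent."""
        return self._data.get(key, default)

    def delete(self, key):
        """Remove a key; deleting a missing key is a no-op."""
        self._data.pop(key, None)


store = KeyValueStore()

# The store only sees bytes; the application decides they are JSON.
store.put("user:42", json.dumps({"name": "Ada", "plan": "pro"}).encode("utf-8"))
profile = json.loads(store.get("user:42").decode("utf-8"))
```

Note that the store never inspects the value. Encoding and decoding happen in the calling application, which is what "all organization is maintained by the calling application" means in practice.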
Document
The storage structure in document databases is, well, documents. The types of documents stored are numerous, but common formats are BSON, JSON, and XML. These are basically hierarchical structures that can contain maps, collections, and scalar values. They are stored in a similar manner to key-value pairs, but the value can be examined by the database. The most popular databases in this category are MongoDB, CouchDB, and RavenDB.
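The difference from a key-value store is that a query can look inside the value. The sketch below imitates that with plain Python dictionaries; the sample documents and the `get_path` and `find` helpers are invented for illustration and are not the query API of MongoDB or any other product.

```python
# Hypothetical documents; note that they do not all share the same fields.
documents = [
    {"_id": 1, "name": "Ada", "address": {"city": "London"}, "tags": ["admin", "dev"]},
    {"_id": 2, "name": "Lin", "address": {"city": "Oslo"}},
    {"_id": 3, "name": "Sam", "tags": ["dev"]},
]


def get_path(doc, path):
    """Follow a dotted path such as 'address.city'; return None if it is absent."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(docs, path, value):
    """Return the documents whose field at the given path equals the value."""
    return [doc for doc in docs if get_path(doc, path) == value]


in_oslo = find(documents, "address.city", "Oslo")
```

The third document has no address at all, and the query simply skips it; nothing about the collection had to be declared up front.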
Column Family
Column family databases store data in rows made up of many columns, with each row associated with a row key. Column families are groups of related data that are accessed together. Each column family is comparable to a container of rows in a relational database table. However, these rows do not have to have the same columns, and a column can be added to any row without having to add it to other rows. These databases are easily scalable and can spread read-write operations across a cluster, with reads and writes handled by any node in the cluster. Popular databases in this category are Cassandra, HBase, and Hypertable.
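A toy model of a single column family can help here. The `ColumnFamily` class below is a hypothetical in-memory sketch, not how Cassandra or HBase store data; it only shows that rows are looked up by row key and that each row carries its own set of columns.

```python
from collections import defaultdict


class ColumnFamily:
    """Toy column family: row key -> {column name: value} (hypothetical class)."""

    def __init__(self, name):
        self.name = name
        self.rows = defaultdict(dict)

    def insert(self, row_key, column, value):
        """Set one column on one row; other rows are left untouched."""
        self.rows[row_key][column] = value

    def get_row(self, row_key):
        """Return a copy of all columns stored for a row key."""
        return dict(self.rows.get(row_key, {}))


profiles = ColumnFamily("profiles")
profiles.insert("user:1", "name", "Ada")
profiles.insert("user:1", "email", "ada@example.com")
profiles.insert("user:2", "name", "Lin")
profiles.insert("user:2", "last_login", "2024-01-05")  # column only this row has
```

Adding `last_login` to `user:2` required no change to `user:1`, which is the flexibility described above.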
Graph
Graph databases support not just the storage of entities, but also the relationships between entities. Entities are also known as nodes, and relationships are known as edges. Both can have properties. Edges are directional, and nodes are organized by their relationships, which makes it possible to examine patterns between nodes. This structure lets the data, or graph, be stored once and then examined in different ways based on the relationships. This is not easily done with relational databases without significant schema changes and data transfers.
Graph databases can be extremely fast at traversing relationships, since the relationship between nodes is persisted rather than calculated with each query. There can be numerous types of relationships between nodes, allowing secondary relationships for things such as categories, paths, or linked lists. There is no limit to the number and kind of relationships nodes can have, and all can exist in a single graph database. It is in these relationships that most of the value, and power, exists. Because so much depends on these relationships, a lot of work must be put into modeling them. The most popular database in this category is Neo4j.
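The node and edge model can be sketched with an adjacency list in Python. The `Graph` class, the node names, and the relationship labels below are all hypothetical and chosen for illustration; this is not Neo4j's API or query language.

```python
from collections import deque


class Graph:
    """Toy property graph with directed, typed edges (hypothetical class)."""

    def __init__(self):
        self.nodes = {}  # node id -> properties
        self.edges = {}  # node id -> list of (relationship, target id, properties)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props
        self.edges.setdefault(node_id, [])

    def add_edge(self, source, relationship, target, **props):
        """Store a directed edge; it is followed only from source to target."""
        self.edges[source].append((relationship, target, props))

    def neighbors(self, node_id, relationship):
        """Targets reachable from a node in one hop over the given relationship."""
        return [t for rel, t, _ in self.edges.get(node_id, []) if rel == relationship]

    def reachable(self, start, relationship):
        """Breadth-first traversal over stored edges of one relationship type."""
        seen, queue = set(), deque([start])
        while queue:
            for nxt in self.neighbors(queue.popleft(), relationship):
                if nxt != start and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen


g = Graph()
for name in ("ann", "bob", "cat", "acme"):
    g.add_node(name)
g.add_edge("ann", "FRIEND", "bob", since=2019)
g.add_edge("bob", "FRIEND", "cat")
g.add_edge("ann", "WORKS_AT", "acme")

friends_network = g.reachable("ann", "FRIEND")
```

The traversal only follows edges that are already stored, so nothing resembling a join is computed at query time, and the same graph can be walked by `FRIEND` or by `WORKS_AT` without any change to its structure.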
Final Thoughts
There are several types of NoSQL databases to choose from, and special consideration needs to be given to the most important needs of your application when choosing one. If programmer productivity and increased access performance for large amounts of data are your concerns, NoSQL databases are worth considering.
I hope this post has provided you with a starting point to evaluate NoSQL databases. In a future post I will compare these to SQL databases and discuss how to choose between the two. If you have enjoyed this post, and found it helpful, please comment below or find me on Twitter.