All of the data that is generated by us every day needs to be stored and maintained in a way that it can be recalled and used to answer questions and generate knowledge. Databases are the containers used for this storage and maintenance and, given the value of insights from data, are the most valuable asset of an organization. We may not think about it often, but databases are everywhere, and we interact with them quite regularly. Behind every website, bank transaction, phone call, video game, weather report is a database.
Many people and small organizations use Excel to manage their data. Excel is good for some things but not so good at others and managing data sets is at the top of the list. One reason is that Excel lacks data integrity. Each cell is independent of every other cell and the “data” in each cell is not necessarily data; a what looks like a number may not be a number, but rather a bit of text. This can create inconsistencies across the spreadsheet. It is also not practical for working with multiple data sets simultaneously nor is it good for answering detailed queries. It also does not scale with the growth of your data because of it high memory requirements and is sized limits.
There are two basic types of databases; relational databases, known as SQL databases, and non-relational databases, known as NoSQL databases. Both types are managed by software known as a database management system, or DBMS. For relational databases, this system is often referred to as a relational database management system (RDBMS). Popular versions you may have heard of are MySQL, Mircrosoft SQL Server, SQLite, PostgreSQL, and Oracle. Some of the more popular NoSQL DBMSs are MongoDB, Cassandra, CouchBase, and Redis among others.
Relational Database Basics
Relational databases are based on the relational model developed by E.F. Codd in 1970 at IBM. In relational databases, data are organized into tables, defined by a schema, which are “related” to one another by a unique identifier known as a primary key. A common example is a customer database. Customer details, such as name, address, phone number, etc. are stored in one table and assigned a unique customer ID (primary key). Any transaction, such as a sell or return, is stored in a separate table; related to the customer table by the customer ID. This prevents redundant data and improves efficiency when interrogating the database for information.
Relational databases are accessed and manipulated using SQL or Structured Query Language. SQL a is declarative programming language based on relational algebra and tuple related calculus (https://en.wikipedia.org/wiki/SQL) and can be divided into three sublanguages:
DQL – data query language
DDL – data definition language
DCL – data control language
DML – data manipulation language
SQL is a standard of the ANSI, the American National Standards Institute but, despite there being a standard, there are slight variances in the language across the various DBMSs
Non-Relational Databases Basics
With the amount of data increasing as rapidly as it has a need evolved for less structured data storage where data did not have to fit into a schema. To answer this need, NoSQL databases were developed. NoSQL is really a collection of various technologies that aren’t necessarily related, but tackle data management without the SQL language or table structure. This lack of, or poorly enforced structure, allows for structure to be applied at the software application as it is needed.
NoSQL databases employ a number of architectures to store data. Some common architectures are key:value stores, document stores, column-oriented databases, and graph databases. The Redis database uses key:value pairs where data values are stored and accessed using a key, much like Python dictionaries. MongoDB uses document store architecture similar to JSON (Java Script Object Notation) where there are individual records to store data. These look similar to Python dictionaries but are quite different in function; JSON is a serialization format representing structured data as a text string and a Python dictionary is an in-memory data structure. Column oriented databases, like Cassandra, are structured in a way that data that would normally be in a table column has been transposed into rows; greatly accelerating lookups. Graph databases, like Neo4j use edges to define relationships and are useful in pattern recognition.
As you can see, there are several options to consider when choosing a database. Your choice will depend on your specific needs. Please comment below if you found this helpful in understanding databases.