Data Integrity


Data integrity refers to the accuracy and consistency of data throughout the data life cycle and is the opposite of data loss. Businesses are essentially houses of cards and are only as robust as the data that drive their information, decisions, and value. For these reasons, it is important that data is recorded exactly as it was collected and that, when recalled at a later date, it is the same as it was when it was originally collected and recorded. The goal is that nothing changes the data or introduces errors between data entry and the time of use.

The FDA uses the acronym ALCOA to describe data integrity standards. By this acronym, data should be:

  • Attributable – it should be clearly evident who observed and recorded the data, when it was observed and recorded, and what the subject of the data is.
  • Legible – data should be easy to understand, permanently recorded, with original records and entries preserved for reference.
  • Contemporaneous – data should be recorded as it was observed at the time it was executed.
  • Original – Source data should be in its original form and accessible.
  • Accurate – data should be free from errors and comply with an organization’s standard operating procedures.

As data have evolved from paper records to digital information, it has become important to develop new methods to assure data integrity.  The following will help minimize data risk and ensure data integrity for your organization:

Ensure all computer systems are reliable

While it is an FDA regulation, 21 CFR Part 11 is a good guide to ensure that your electronic records are trustworthy, reliable, and equivalent to paper records.

Follow a software lifecycle

This usually applies to software development but even if you aren’t in the software development business, you should ensure your business software is current with best business practices in your industry.

Validate systems

Work with vendors to ensure your software produces a consistent product. Have them install, test your systems, and provide documentation that they meet best practices for your industry.

Implement audit trails

Automate and maintain records of data transactions with time and date stamps, identity of user, history of changes and deletions. This will ensure the trustworthiness of your data and will demonstrate the records have not been tampered with.
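As a rough illustration of the idea, here is a minimal sketch of an append-only audit trail in Python. The function name, field names, and sample values are all hypothetical, and a real system would write to durable, access-controlled storage rather than an in-memory list.

```python
from datetime import datetime, timezone

# Hypothetical in-memory trail; entries are only ever appended.
audit_log = []


def record_change(user, action, record_id, old_value, new_value):
    """Append one audit entry with a timestamp, the user, and the change."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,  # for example "update" or "delete"
        "record_id": record_id,
        "old_value": old_value,
        "new_value": new_value,
    }
    audit_log.append(entry)
    return entry


record_change("jsmith", "update", "sample-42", 1.25, 1.52)
record_change("jsmith", "delete", "sample-43", 0.80, None)

for entry in audit_log:
    print(entry["timestamp"], entry["user"], entry["action"], entry["record_id"])
```

Keeping the old and new values side by side is what lets an auditor reconstruct the history of changes and deletions for a record.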

Limit system access

Require documented logins to computer and network systems. Ensure appropriate database access is controlled, managed, and documented.

Implement Quality Management Systems

Establish best practices and workflows with standard operating procedures to ensure repeatable methods.

Training

Train all users on proper data management practices and maintain records of their training.

Conduct periodic internal audits

This will ensure, and will provide evidence, that all procedures have been followed and the data are accurate, therefore demonstrating data integrity.

Final Thoughts

There are several opportunities for mistakes to happen in the data acquisition and storage workflow. This could be from a manual data entry mistake or the storage of an improper data type. Take the steps to ensure data integrity in your organization to get the most value from your data and to provide the best service to your stakeholders.

I hope this post has provided you some insight into the importance of data integrity and how you can achieve it. If it has, please comment below. If you want to learn more, follow me on Twitter.


Data Governance

Data is the lifeblood of organizations and businesses and is likely their most valuable asset after people. Because of its importance and role in daily operations and overall success, it needs to be managed just as any other business asset is managed. This management is prescribed and carried out by data governance programs. Data governance is the management of all of the data an organization has to ensure that high quality data exists throughout the complete data life cycle. A robust data governance program exists to ensure data are available, usable, consistent, have integrity, and are secure.

The Four Pillars of Data Governance

People

People are important in the management of data. The data steward is an individual tasked with ensuring that the governance processes are followed and enforced, and that necessary governance improvements are made so that data delivers the most value possible for as long as it can in the data life cycle. This person may be part of a team of data stewards, which may also include database administrators, business analysts, and business subject experts.

Processes

These are the key to the success of a data governance program. Any work with data (acquiring, collecting, munging, storing, modeling, analyzing) needs to be in line with the key metrics and goals of your organization. The accuracy, accessibility, consistency, and completeness of the data are important, but these must be within the context of your business and should be refined and honed with an emphasis on the goals of your business.

Data quality is key to the data processes to ensure accuracy and completeness. Version control, data scrubbing, established workflows, and project management systems are all integral to data quality.

Technology

Data storage design and architecture are of utmost importance, but no matter how large you start, your storage containers will eventually begin to fill up. Metadata management and master data management are crucial in gaining insights into data flow and will help anticipate needs and constraints. Sound metadata management practices will also improve transparency and security across your organization’s systems.

It is also important that the tools are in place to analyze and derive information and value from your data. Actionable insights are the reason the data was collected to begin with and its purpose is to add value to your organization.

Big data has driven the development of cloud architecture, and ushered in unstructured, or NoSQL, data that is different from traditional relational databases that were originally the focus of master data management. To address the complex relationships from these new data structures, graph data stores are being used more frequently in master data management.

Best practices

Best practices for data governance are important in order to establish a quality, robust program. However, the practices change and evolve rapidly as organizations flatten and big data continues to gain prominence. No matter how things change and evolve, it will always be critical to identify key stakeholders, employ meaningful metrics, communicate frequently across teams and the organization, and to strive to make data governance a practice and not a one-off project.

Final Thoughts

Data governance is a cornerstone of the success of your business or organization. The practice has gained prominence with the introduction of regulation such as Sarbanes-Oxley and is necessary to ensure information assets are properly and efficiently managed so that your organization and your customers receive the best value and results.

I hope this post was helpful in providing some insight to the basic concepts of data governance. If you found it helpful, or would like to learn more, please comment below or find me on Twitter.


The Data Life Cycle

Like you and me, data has a life cycle beginning with its acquisition and ending with its purging. The middle of this life cycle is when the data is most useful, similar to our lives once we enter adolescence and our parents begin to imagine some relief from the financial and emotional stress we have placed on them. Eventually, the data are no longer useful, but they can still tell a story, so they are put in an archive to be called upon when there is some interest. Ultimately, there is a point when the data are no longer useful, have only historical context, and are purged.

The Beginning – Data Capture

To start its life cycle, data must enter our infrastructure, be it our consciousness or our enterprise infrastructure. The primary methods for this are:

  1. Data acquisition – this is the gathering and taking in of data from outside the processing or storage center. This data can be purchased from other organizations or can be physically collected by technicians.
  2. Data entry – the physical entry of data, gathered from outside or generated from within, into the processing or storage center.
  3. Signal capture – gathering data from sensors and devices by data loggers and IoT devices.

All of these methods have particular data governance challenges. There may be contractual or legal agreements from data acquired from outside an organization. Data entered from within may have reliability or integrity issues; think in terms of bias. Data from signal capture may have quality issues.

Infancy – Data Maintenance

Once collected, data must be nurtured and processed into a state that will make it useful. No value is derived in the data maintenance stage, so it is a net expense. During this stage the data is enriched by being munged, cleaned, transported, related, and extracted, transformed, and loaded (ETL) for later use. Data governance in this period is concerned with how the data are handled, documented, and manipulated.

Early Childhood – Data Synthesis (development of information)

Now in a useful state, data is analyzed for value. It is subjected to logic, algorithms, equations, and computations to see if value can be extracted or if it will support the original reason it was collected. This area is handled by subject and content specialists for the topic at hand.

Adolescence – Data Usage

At this stage data is useful and begins to generate value. It is applied to tasks of interest, such as making predictions, supporting decisions, or evaluating risks and begins to align with the purpose of the organization that collected it. Governance at this stage focuses on proper and permitted use of the data.

Adulthood – Data Broadcasting

Once realizations are made, the data and the findings are generally shared with similar organizations or clients. Data that has already been shared cannot simply be corrected in place, so data governance rules will have to dictate how corrections and retractions are made as well as how to handle those affected by any errors or omissions.

Mature Adulthood – Data Archival

At some point, after many discussions and uses, an organization has realized all the value it can from a set of data and moves on; the data has lived its useful life. At this point the data is stored in a manner that it can be called back into use or interrogated by auditors should the need arise. It is not being maintained nor cared for, it is just at rest.

Death – Data Purging

This is the end of life for data. It is removed from the organization entirely with its archive deleted. The challenge of this phase is to ensure that all of the data has been deleted and there are no fragments or extra copies in circulation or hiding.

Final Thoughts

Data may not experience each of these phases and in reality, much data is not destroyed. More often than not, data is retained in the event that it may be useful again. I find this quite often in my line of work where so much effort was put into collecting, transforming, modeling, analyzing and storing the data that it is just too difficult to let go. As a data administrator and engineer in the mining industry I have worked with some truly great statisticians and scientists that hold on to their data for dear life, and I understand why. In our industry, they changed history and made the future, paving the way for the digital processes that we are using now. But at some point, the volumes of data have to be dealt with, and the organization must move on. Finding the proper balance, to ensure all value has been extracted and that just enough is kept, is one of the challenges of a well-managed data governance program.

If you found this post interesting, or you have experience with managing data life cycles, please let me know and leave a comment below or find me on Twitter.


NoSQL Databases

Relational databases were developed long before the internet, mobile devices, and the concept of big data. They are still the standard data model, but developers needed different solutions to address the ever-growing mass of data they were working with. Relational databases put developers in a corner; they had to know their data needs and architecture from the beginning, a difficult goal in the big data era. To address the inflexibility of relational databases, developers and data engineers began looking into less rigid models to suit their needs, giving rise to NoSQL databases.

To clear the air, NoSQL does not mean “No SQL”. It stands for “Not Only SQL” and provides developers with much more latitude in addressing their emerging needs while still having the security of a database. NoSQL databases are hybrids and don’t have a common architecture like relational databases do. They are best described as having similar qualities such as:

  • Not using a relational model
  • Designed to run on clusters
  • Designed for web architecture
  • No rigid schema

Why the Change

With the growth of and movement toward applications and the increased integration of web connected devices, developers needed more flexibility. With relational databases the data must fit the model and the database dictates the what and how of the query. With the new generation of “apps”, the interrogation happens in the application and in memory as it is needed, more “on the fly” than it is with relational databases and allows developers to utilize in-memory data structures. This need has given rise to aggregate data models, allowing data to be interacted with as a unit.

This aggregation makes distribution across clusters simpler, and more robust since the data can reside on multiple computers rather than on one single computer. When the data are called, the aggregation model allows all associated data to be retrieved as a unit, alleviating the need to query any other related data. This has given rise to map-reduce algorithms to retrieve cluster hosted data.

The data distribution methods that make this possible are:

  • Sharding – distributing different data across multiple servers with each server acting as a sole source for a data subset.
  • Replication – copying data across several servers allowing the data to be found in multiple locations.
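The two methods above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the server names are made up, keys are placed by hashing, and replicas go to the next servers in the list. Real databases use more elaborate placement schemes.

```python
import zlib

SERVERS = ["server-a", "server-b", "server-c"]  # hypothetical node names


def shard_for(key):
    """Sharding: hash the key so it maps to exactly one primary server."""
    return zlib.crc32(key.encode("utf-8")) % len(SERVERS)


def replicas_for(key, copies=2):
    """Replication: the primary plus the following servers each hold a copy."""
    first = shard_for(key)
    return [SERVERS[(first + i) % len(SERVERS)] for i in range(copies)]


for key in ["DDH2018_001", "DDH2018_002", "DDH2018_003"]:
    print(key, "->", replicas_for(key))
```

The first server in each list is the shard that owns the key, and the remaining entries are the additional locations where a copy could be found.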

CAP

CAP is an acronym for Consistency, Availability, and Partition tolerance. According to Eric Brewer, any distributed system needs to manage these variables but can only choose two, leaving the third factor vulnerable. To have availability, a developer may have to trade off consistency. Developers have the ability to tune these parameters to optimize the database to the needs of their application, but this could cause problems if not balanced properly.

Types of NoSQL Database

There are essentially four categories of NoSQL databases: key-value databases, document databases, column family stores, and graph databases.

Key-Value

These are the most basic NoSQL databases to use from the perspective of an API. Because of the persistent use of primary keys, they generally demonstrate good performance and are very easily scaled. Just like with Python dictionaries, users can call a key and get a value, add a value for a key, or delete a key from a store. The data are essentially blobs in the data store with no real organization; all organization is maintained or enforced by the calling application. Some of the more popular key-value databases are Couchbase and Redis.

While key-value databases are similar, they are not the same. Some support persistent data while others do not. If data is not persistent, all can be lost if a node is lost, requiring all data to be refreshed. However, in a persistent database, updating old data can be a concern. For these and other reasons, it is important to ensure a key-value database will suit your needs.
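The three basic operations (get, put, delete) can be illustrated with a plain Python dictionary standing in for the store. This is only an analogy, not the client API of any particular database, and the key and value are invented for the example.

```python
# A plain dict standing in for a key-value store (analogy only).
store = {}

# Put: the value is an opaque blob; the store does not look inside it.
store["session:1001"] = b'{"user": "jsmith", "cart": [17, 42]}'

# Get: retrieve the blob by its key; a missing key yields None here.
value = store.get("session:1001")
missing = store.get("session:9999")

# Delete: remove the key and its value.
del store["session:1001"]

print(value)
print(missing)
print("session:1001" in store)
```

Any structure inside the blob, such as the JSON text above, has to be interpreted by the calling application.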

Document

The storage structure in document databases is, well, documents. The types of documents stored are numerous but common formats are BSON, JSON, and XML. These are basically hierarchical structures that can contain maps, scalar values, collections of lists, etc. These are stored in a similar manner to key-value pairs but the value can be examined. The most popular databases in this category are MongoDB, CouchDB, and RavenDB.
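To show what "the value can be examined" means, here is a small sketch using nested Python structures as stand-ins for stored documents. The document IDs and fields are hypothetical, and the filter is ordinary Python rather than any database's query language.

```python
import json

# Hypothetical documents keyed by ID; each value is a hierarchical structure.
documents = {
    "hole-1": {
        "drillhole": "DDH2018_001",
        "depth": 1465,
        "surveys": [{"surveyor": "IDS", "surveyed": True}],
    },
    "hole-2": {"drillhole": "DDH2018_002", "depth": 1503, "surveys": []},
}

# Unlike an opaque blob, fields inside each document can be inspected.
deep_holes = [doc["drillhole"] for doc in documents.values() if doc["depth"] > 1500]
print(deep_holes)

# The same document serialized as JSON text.
print(json.dumps(documents["hole-1"], indent=2))
```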

Column Family

Column family databases store data in rows comprised of many columns, associated with a row key. Column families are comprised of related data that are accessed together. Each of these column families is comparable to a group, or container, of rows in a relational database management system. However, these rows do not have to have the same columns, and columns can be added to any row without having to add them to other rows. These databases are easily scalable and can spread read-write operations across a cluster, with reads and writes being handled by any node in the cluster. Popular databases in this category are Cassandra, HBase, and Hypertable.
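One way to picture this layout is as nested Python dictionaries: row key, then column family, then columns. The row keys, family names, and values below are invented for illustration, and real column family stores add timestamps, ordering, and distribution that this sketch ignores.

```python
# row key -> column family -> columns (all names are hypothetical)
rows = {
    "customer:1": {
        "profile": {"name": "A. Miner", "city": "Elko"},
        "orders": {"order:17": "shipped"},
    },
    "customer:2": {
        "profile": {"name": "B. Driller"},  # this row has no "city" column
    },
}

# A column can be added to one row without touching any other row.
rows["customer:2"]["profile"]["phone"] = "555-0100"

# Related columns in a family are read together by row key.
profile = rows["customer:1"]["profile"]
print(profile)
```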

Graph

Graph databases support not just the storage of entities, but also the relationships between entities. Entities are also known as nodes and relations are known as edges. Both of these have properties with edges having directional importance and nodes are organized by relationships that permit the examination of patterns between nodes. This structure lets the data, or graph, be stored and then examined in different ways based on the relationships. This is not easily done with relational databases without significant schema changes and data transfers.

Graph databases can be extremely fast when traversing joins since the relationship between nodes persists and is not calculated with each query. There can be numerous types of relationships between nodes, allowing secondary relationships between other things such as categories, paths, or linked lists. There is no limit to the number and kind of relationships nodes can have, and all can exist in a single graph database. It is in these relationships where most of the value, and power, exist. Because of this, a lot of work must be put into modeling the relationships. The most popular database in this category is Neo4j.
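A toy version of nodes, directed edges, and edge properties can be written with plain Python data structures. The node names and relationship types are made up, and this linear scan is nothing like how a graph database stores or traverses relationships; it only shows the shape of the data.

```python
# Nodes with properties (hypothetical data).
nodes = {
    "anna": {"type": "person"},
    "ben": {"type": "person"},
    "acme": {"type": "company"},
}

# Directed edges: (source, relationship, destination, edge properties).
edges = [
    ("anna", "WORKS_FOR", "acme", {"since": 2015}),
    ("ben", "WORKS_FOR", "acme", {"since": 2017}),
    ("anna", "FRIEND_OF", "ben", {}),
]


def neighbors(node, relation):
    """Follow outgoing edges of one relationship type from a node."""
    return [dst for src, rel, dst, _props in edges if src == node and rel == relation]


def colleagues(person):
    """Find other people who share an employer with the given person."""
    employers = neighbors(person, "WORKS_FOR")
    return sorted(
        src
        for src, rel, dst, _props in edges
        if rel == "WORKS_FOR" and dst in employers and src != person
    )


print(neighbors("anna", "FRIEND_OF"))
print(colleagues("anna"))
```

The same stored graph answers both the friendship question and the colleague question without any change to the data.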

Final Thoughts

There are several types of NoSQL databases to choose from, and special consideration needs to be given to the most important needs of your application when choosing one. If programmer productivity and increased access performance for large amounts of data are your concerns, NoSQL databases are worth considering.

I hope this post has provided you with a starting point to evaluate NoSQL databases. In a future post I will compare these to SQL databases and discuss how to choose between the two. If you have enjoyed this post, and found it helpful, please comment below or find me on Twitter.


Relations and Schema in Relational Databases

Relational databases are the most common databases in use today. Since relational databases are made of tables, people often get the impression that they are like a less visual form of an Excel spreadsheet. While they may look similar, the similarities pretty much stop at appearance. When it comes to databases, the relational database is the heavyweight champ and Excel is an unskilled novice flyweight.

A spreadsheet, if being used as a database, would be considered a flat database; a database consisting of a single table of information. A row, or record, contains all the information about a single entry. Each row intersects a column, dividing the row into fields for each type, or piece, of information held in the row.  Since I work in mining, we will consider some hypothetical assay information for a drillhole. The project ID, the drillhole number, from, to, assay type, element, and element value each occupy a field. Such a table would look something like this:

[Image: flat table of drillhole assay data]

It is pretty easy to look up information in this table. We can quickly tell what elements were assayed, the drillhole ID, the From-To interval, etc. However, this flat database has a problem common to all flat databases: redundancy. Everything is the same with exception of the Element and Element_Value fields. This is inefficient and wastes storage space. A better way to store this data would be to make a relational database from a series of tables.

Project Table

[Image: Project Table]

Drillhole Table

[Image: Drillhole Table]

Sample Table

[Image: Sample Table]

By creating these three tables, none of the data are repeated. We have established relationships between the Project Table and the Drillhole Table using the ProjectCode and between the Drillhole Table and the Sample Table through the DrillHoleID. The ProjectCode, the DrillHoleID, and the SampleID are primary keys, which are unique values that identify each of these records. Primary keys ensure there are no duplicate records, and the keys that reference them from other tables (foreign keys) enforce referential integrity within the database. These relationships are how relational databases get their name.
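The three-table layout can be sketched with Python's built-in sqlite3 module. The key names ProjectCode, DrillHoleID, SampleID, Element, and Element_Value come from the discussion above; the table names, the ProjectName column, and every inserted value are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE project (
    ProjectCode TEXT PRIMARY KEY,
    ProjectName TEXT
);
CREATE TABLE drillhole (
    DrillHoleID TEXT PRIMARY KEY,
    ProjectCode TEXT NOT NULL REFERENCES project(ProjectCode)
);
CREATE TABLE sample (
    SampleID INTEGER PRIMARY KEY,
    DrillHoleID TEXT NOT NULL REFERENCES drillhole(DrillHoleID),
    Element TEXT,
    Element_Value REAL
);
""")

# Hypothetical values: project and drillhole details are stored once.
conn.execute("INSERT INTO project VALUES ('PRJ1', 'Example Project')")
conn.execute("INSERT INTO drillhole VALUES ('DDH2018_001', 'PRJ1')")
conn.executemany(
    "INSERT INTO sample (DrillHoleID, Element, Element_Value) VALUES (?, ?, ?)",
    [("DDH2018_001", "Au", 0.35), ("DDH2018_001", "Ag", 4.1)],
)

# Joining on the keys rebuilds the flat view when it is needed.
rows = conn.execute("""
    SELECT p.ProjectCode, d.DrillHoleID, s.Element, s.Element_Value
    FROM sample AS s
    JOIN drillhole AS d ON s.DrillHoleID = d.DrillHoleID
    JOIN project AS p ON d.ProjectCode = p.ProjectCode
    ORDER BY s.SampleID
""").fetchall()
print(rows)

# A drillhole pointing at a project that does not exist is rejected.
try:
    conn.execute("INSERT INTO drillhole VALUES ('DDH2018_999', 'NOPE')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)
```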

This relational scheme has two advantages: data storage efficiency and error risk reduction. The data storage savings are negligible in this small example, but you can imagine the amount of storage space saved if this were a mining company and there were multiple projects and thousands of drillholes at each one, with each drillhole having 100 – 200 assays for 31 elements. That can really add up. The potential for errors is reduced since information is stored only once, and definitions can be set such that certain fields cascade, or auto populate, when entered in one table.

Relationship Models and Schema

The power of relational databases lies in their strict table definitions and for this reason it is important to think about the data you want to store and the relationships between the data. These relationships are defined by a schema, or model, that will provide the structure of your database.

There are three basic types of relationships in relational databases:

  • one-to-one
  • one-to-many
  • many-to-many

It is important to consider your data in these relationships in order to design the most effective database possible. A helpful way to think about these relationships is by considering a product sales order database example. In this scenario, each type of relationship would look like the following:

  • one-to-one – a customer has a single, unique customer ID
  • one-to-many – a customer may have many orders or transactions
  • many-to-many – a customer’s order may contain more than one product and a product can be in many orders.

The differences in these relationships require different structures to ensure efficiency.

One-to-one

The customer ID should be with other descriptive information about the customer, such as address, phone number, and credit card information. This information should be kept together with the exception of certain cases when there are large chunks of infrequently used data, differing security requirements for something like the credit cards, or if there is customer information that might lend to better efficiency if it is separated from other customer data.

One-to-many

This is more easily thought of as many-to-one. In our example think about many orders belonging to one customer or one customer having many orders, whichever works best for your style of thinking. Customer orders would be tied to purchase numbers by assigning a customer ID to a transaction number. This information would then be connected to the customer name by the customer ID.

Many-to-many

This kind of relationship is best described as multiple many-to-one relationships. Think about this as a collection of customers, a collection of orders, and a collection of products. In this line of thinking, we would create an entity that connects a customer with a transaction and a transaction with products. This associates one customer with one order through multiple instances of many-to-one relationship pairs.
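A common way to build this connecting entity is a linking table that holds one row per order and product pair. The sketch below uses Python's sqlite3 module; the table names, column names, and data are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id)  -- many orders to one customer
);
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT);
-- Linking table: each row is a many-to-one link to an order and to a product.
CREATE TABLE order_item (
    order_id INTEGER REFERENCES orders(order_id),
    product_id INTEGER REFERENCES product(product_id),
    PRIMARY KEY (order_id, product_id)
);
INSERT INTO customer VALUES (1, 'Ann');
INSERT INTO orders VALUES (10, 1), (11, 1);
INSERT INTO product VALUES (100, 'hammer'), (101, 'chisel');
INSERT INTO order_item VALUES (10, 100), (10, 101), (11, 100);
""")

# One order can contain many products...
products_in_order_10 = [
    row[0]
    for row in conn.execute(
        "SELECT p.name FROM order_item AS oi "
        "JOIN product AS p ON p.product_id = oi.product_id "
        "WHERE oi.order_id = 10 ORDER BY p.name"
    )
]

# ...and one product can appear in many orders.
orders_with_hammer = [
    row[0]
    for row in conn.execute(
        "SELECT order_id FROM order_item WHERE product_id = 100 ORDER BY order_id"
    )
]

print(products_in_order_10)
print(orders_with_hammer)
```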

Considerations

Relational databases are best used when you have a good understanding of the types of data and of the relationships discussed above. They are rigid structures and that is one of their strengths. However, sometimes you don’t know what you don’t know and then a NoSQL, or non-relational, database may be a better choice.

If this brief overview of relational databases and schema has been helpful to you, please comment below or find me on Twitter.


Database Basics

All of the data that is generated by us every day needs to be stored and maintained in a way that it can be recalled and used to answer questions and generate knowledge. Databases are the containers used for this storage and maintenance and, given the value of insights from data, are among the most valuable assets of an organization. We may not think about it often, but databases are everywhere, and we interact with them quite regularly. Behind every website, bank transaction, phone call, video game, and weather report is a database.

Many people and small organizations use Excel to manage their data. Excel is good for some things but not so good at others, and managing data sets is at the top of the list of things it does poorly. One reason is that Excel lacks data integrity. Each cell is independent of every other cell and the “data” in each cell is not necessarily data; what looks like a number may not be a number, but rather a bit of text. This can create inconsistencies across the spreadsheet. It is also not practical for working with multiple data sets simultaneously, nor is it good for answering detailed queries. It also does not scale with the growth of your data because of its high memory requirements and size limits.

There are two basic types of databases: relational databases, known as SQL databases, and non-relational databases, known as NoSQL databases. Both types are managed by software known as a database management system, or DBMS. For relational databases, this system is often referred to as a relational database management system (RDBMS). Popular versions you may have heard of are MySQL, Microsoft SQL Server, SQLite, PostgreSQL, and Oracle. Some of the more popular NoSQL DBMSs are MongoDB, Cassandra, Couchbase, and Redis, among others.

Relational Database Basics

Relational databases are based on the relational model developed by E.F. Codd in 1970 at IBM.  In relational databases, data are organized into tables, defined by a schema, which are “related” to one another by a unique identifier known as a primary key. A common example is a customer database. Customer details, such as name, address, phone number, etc. are stored in one table and assigned a unique customer ID (primary key). Any transaction, such as a sell or return, is stored in a separate table; related to the customer table by the customer ID. This prevents redundant data and improves efficiency when interrogating the database for information.

Relational databases are accessed and manipulated using SQL, or Structured Query Language. SQL is a declarative programming language based on relational algebra and tuple relational calculus (https://en.wikipedia.org/wiki/SQL) and can be divided into four sublanguages:

DQL – data query language

DDL – data definition language

DCL – data control language

DML – data manipulation language

SQL is a standard of ANSI, the American National Standards Institute, but despite there being a standard, there are slight variances in the language across the various DBMSs.
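The sublanguages listed above can be seen in a short session with Python's sqlite3 module. The table and values are hypothetical. SQLite is used only because it ships with Python; it has no GRANT or REVOKE, so the DCL statement appears as a comment rather than being run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL (data definition): create or alter the structure of the database.
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)")

# DML (data manipulation): insert, update, or delete rows.
conn.execute("INSERT INTO customer (name) VALUES ('Ann')")
conn.execute("UPDATE customer SET name = 'Anne' WHERE customer_id = 1")

# DQL (data query): read data back out.
names = [row[0] for row in conn.execute("SELECT name FROM customer")]
print(names)

# DCL (data control): manage permissions. In a server DBMS this might look like
#   GRANT SELECT ON customer TO analyst;
# It is not executed here because SQLite does not implement GRANT/REVOKE.
```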

Non-Relational Databases Basics

With the amount of data increasing as rapidly as it has, a need evolved for less structured data storage where data did not have to fit into a schema. To answer this need, NoSQL databases were developed. NoSQL is really a collection of various technologies that aren’t necessarily related, but tackle data management without the SQL language or table structure. This lack of structure, or loosely enforced structure, allows structure to be applied by the software application as it is needed.

NoSQL databases employ a number of architectures to store data. Some common architectures are key:value stores, document stores, column-oriented databases, and graph databases. The Redis database uses key:value pairs where data values are stored and accessed using a key, much like Python dictionaries. MongoDB uses a document store architecture similar to JSON (JavaScript Object Notation) where there are individual records to store data. These look similar to Python dictionaries but are quite different in function; JSON is a serialization format representing structured data as a text string and a Python dictionary is an in-memory data structure. Column-oriented databases, like Cassandra, are structured in a way that data that would normally be in a table column has been transposed into rows, greatly accelerating lookups. Graph databases, like Neo4j, use edges to define relationships and are useful in pattern recognition.
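The difference between JSON text and a Python dictionary is easy to see with the standard json module. The record below is a made-up example.

```python
import json

# An in-memory Python dictionary.
record = {"drillhole": "DDH2018_001", "depth": 1465, "surveyed": True}

# Serializing produces a text string in JSON format.
text = json.dumps(record)
print(type(text).__name__, text)

# Parsing the text gives back an equivalent in-memory dictionary.
restored = json.loads(text)
print(type(restored).__name__, restored == record)
```

Note that the Python value True is written as true in the JSON text, a reminder that the two are related but not the same thing.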

As you can see, there are several options to consider when choosing a database. Your choice will depend on your specific needs. Please comment below if you found this helpful in understanding databases.


Python Dictionaries

Dictionaries, also called dicts, are Python’s native mapping data type. Dictionaries allow rapid data lookup and enumerations; both very useful in using Python for programming and data analysis.

 

Dictionaries are defined by key:value pairs in curly brackets { } and typically hold data that are related, such as information contained in a user profile. Before Python 3.7 dictionaries were unordered; since 3.7 they preserve insertion order. In our example, since I work with geologic data, I will use some hypothetical exploration drillhole information.

A dictionary looks like this:

In [62]:
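A definition along the following lines matches the keys and values used throughout this post. Binding the dictionary to the name DDH2018_001 is an assumption based on the drillhole ID.

```python
# Drillhole record as a dictionary of key:value pairs.
DDH2018_001 = {
    'northing': 1506890,
    'easting': 4569873,
    'elevation': 6456,
    'depth': 1465,
    'surveyed': True,
    'surveyor': 'IDS',
}
```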

 Keys & Values

For your reference, DDH2018_001 is a drillhole ID and the keys are northing, easting, and elevation (geographical coordinates), depth, surveyed, and surveyor.

The values in this example are 1506890, 4569873, 6456, 1465, True, and 'IDS'. Values can be any datatype. In this example there are integer, Boolean, and string data.

Printing

Printing a dictionary is done the same way as any printing call in Python 3:
In [63]:
{'northing': 1506890, 'easting': 4569873, 'elevation': 6456, 'depth': 1465, 'surveyed': True, 'surveyor': 'IDS'}
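A cell along these lines would produce that output, assuming the dictionary is bound to the name DDH2018_001:

```python
DDH2018_001 = {'northing': 1506890, 'easting': 4569873, 'elevation': 6456,
               'depth': 1465, 'surveyed': True, 'surveyor': 'IDS'}

# print() shows the dictionary's standard text representation.
print(DDH2018_001)
```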

Accessing Values with Keys

Values in dictionaries are easy to access by calling the key of the data you need. To see just the depth of DDH2018_001 we do the following:

In [64]:
1465

Dictionaries are similar to databases in that it is possible to retrieve data by calling a key, which acts like a record number.

In [65]:
True
IDS
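The lookups behind the two outputs above might look like this, assuming the same DDH2018_001 dictionary:

```python
DDH2018_001 = {'northing': 1506890, 'easting': 4569873, 'elevation': 6456,
               'depth': 1465, 'surveyed': True, 'surveyor': 'IDS'}

# Square brackets with a key return the matching value.
print(DDH2018_001['depth'])

print(DDH2018_001['surveyed'])
print(DDH2018_001['surveyor'])
```

Asking for a key that does not exist with square brackets raises a KeyError; DDH2018_001.get('azimuth') would return None instead.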

Accessing Dictionary Elements with Methods

It is also possible to use built-in dictionary methods to obtain information about a dictionary:

dict.keys() – returns all keys
dict.values() – returns all values
dict.items() – returns all items as (key, value) tuple pairs

In [66]:
dict_keys(['northing', 'easting', 'elevation', 'depth', 'surveyed', 'surveyor'])

This returned an iterable view of the dict keys.

In [67]:
dict_values([1506890, 4569873, 6456, 1465, True, 'IDS'])

This returned an iterable view of the dict values.

In [68]:
dict_items([('northing', 1506890), ('easting', 4569873), ('elevation', 6456), ('depth', 1465), ('surveyed', True), ('surveyor', 'IDS')])

This returned an iterable view of (key, value) pairs.
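The three calls behind these outputs can be sketched as follows, again assuming the name DDH2018_001:

```python
DDH2018_001 = {'northing': 1506890, 'easting': 4569873, 'elevation': 6456,
               'depth': 1465, 'surveyed': True, 'surveyor': 'IDS'}

keys = DDH2018_001.keys()      # view of the keys
values = DDH2018_001.values()  # view of the values
items = DDH2018_001.items()    # view of (key, value) tuples

print(keys)
print(values)
print(items)

# Views can be converted to lists when a real list is needed.
print(list(keys))
```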

These methods return iterable view objects (dict_keys, dict_values, and dict_items). This makes it possible to query across dicts to find common values and differences. We will compare DDH2018_001 and DDH2018_002:

 

In [69]:
True True
6456 6502
IDS MineDept
1506890 150000
1465 1503
4569873 4569450

From comparing these two drillholes we can see they have the same keys, or fields, and that all fields contain data. This can be extremely useful when comparing large dictionaries.
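The comparison cell itself was lost. One way such a comparison could be written is sketched below, assuming both dictionaries hold the values shown in the outputs of this post; the original cell may have used a different approach. Keys views support set operations such as &, and the keys are sorted here only to make the print order predictable.

```python
DDH2018_001 = {'northing': 1506890, 'easting': 4569873, 'elevation': 6456,
               'depth': 1465, 'surveyed': True, 'surveyor': 'IDS'}
DDH2018_002 = {'northing': 150000, 'easting': 4569450, 'elevation': 6502,
               'depth': 1503, 'surveyed': True, 'surveyor': 'MineDept'}

# Keys that appear in both drillholes (the & operator returns a set).
common_keys = DDH2018_001.keys() & DDH2018_002.keys()

# Print the two values side by side for every shared key.
for key in sorted(common_keys):
    print(key, DDH2018_001[key], DDH2018_002[key])

# Keys whose values are identical in both drillholes.
same_values = {key for key in common_keys
               if DDH2018_001[key] == DDH2018_002[key]}
print(same_values)
```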

Another trick we can do is iterate over the view of dict.items() from above and incorporate each pair into a sentence using a for loop. This can be useful when extracting data into a readable format for human consumption.

 

In [70]:
northing is the key for the value 1506890
easting is the key for the value 4569873
elevation is the key for the value 6456
depth is the key for the value 1465
surveyed is the key for the value True
surveyor is the key for the value IDS
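A for loop along these lines would produce the sentences above. This is a sketch rather than the original cell, which was lost; the list named sentences is an addition so the results can be reused.

```python
DDH2018_001 = {'northing': 1506890, 'easting': 4569873, 'elevation': 6456,
               'depth': 1465, 'surveyed': True, 'surveyor': 'IDS'}

sentences = []
# .items() yields (key, value) tuples, unpacked here into two names.
for key, value in DDH2018_001.items():
    sentence = '{} is the key for the value {}'.format(key, value)
    sentences.append(sentence)
    print(sentence)
```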

Modifying Dictionaries

Since dictionaries are mutable, it is possible to make changes to them by adding, deleting, and changing elements.

To add a value we execute dict[key]=value.

 

In [71]:
{'P1': 'DDH2018_001', 'P2': 'DDH2018_002', 'P3': 'DDH2018_003'}

The dictionary of proposed drillhole IDs has been updated to include 'P3': 'DDH2018_003'. Using this same syntax it is possible to change any value in a dictionary. Let’s assume that the survey for DDH2018_002 needs to be removed from the record because the readings were flawed. To show that the survey has been removed, we will change the dictionary to show 'surveyed': False.

 

In [72]:
{'northing': 150000, 'easting': 4569450, 'elevation': 6502, 'depth': 1503, 'surveyed': False, 'surveyor': 'MineDept'}
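Both changes use the same dict[key] = value syntax. The sketch below reproduces the two outputs above; the variable name proposed is an assumption, since the original cells were lost.

```python
# 'proposed' is a hypothetical name for the dictionary of proposed drillhole IDs.
proposed = {'P1': 'DDH2018_001', 'P2': 'DDH2018_002'}

# Assigning to a key that does not exist yet adds a new pair.
proposed['P3'] = 'DDH2018_003'
print(proposed)

DDH2018_002 = {'northing': 150000, 'easting': 4569450, 'elevation': 6502,
               'depth': 1503, 'surveyed': True, 'surveyor': 'MineDept'}

# Assigning to an existing key replaces its value.
DDH2018_002['surveyed'] = False
print(DDH2018_002)
```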

It is also possible to modify dictionaries using the dict.update() method. Since the surveyed value for DDH2018_002 has been changed to False, we need to update surveyor to 'None'.

 

In [73]:
{'northing': 150000, 'easting': 4569450, 'elevation': 6502, 'depth': 1503, 'surveyed': False, 'surveyor': 'None'}

We can see that the 'surveyor' record has been successfully updated to 'surveyor': 'None'.
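A sketch of the update() call that would give this output. Note that the printed value is the string 'None' in quotes; Python also has a built-in None object (no quotes) that is commonly used for missing values, but the string is kept here to match the output above.

```python
DDH2018_002 = {'northing': 150000, 'easting': 4569450, 'elevation': 6502,
               'depth': 1503, 'surveyed': False, 'surveyor': 'MineDept'}

# update() takes another dictionary and overwrites or adds its pairs.
DDH2018_002.update({'surveyor': 'None'})
print(DDH2018_002)
```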

Deleting Elements from a Dictionary

Deleting elements from a dictionary is just as easy as adding and updating them. To delete a dictionary element we use del dict[key]. Let’s imagine we discovered that the elevation data for DDH2018_002 is wrong and we want to delete it until the correct value can be provided. To delete the elevation we perform the following:

 

In [74]:
{'northing': 150000, 'easting': 4569450, 'elevation': 6502, 'depth': 1503, 'surveyed': False, 'surveyor': 'None'}
{'northing': 150000, 'easting': 4569450, 'depth': 1503, 'surveyed': False, 'surveyor': 'None'}

The command del DDH2018_002['elevation'] has removed the elevation data, as demonstrated by the two printed dictionaries above. We have now learned that all of the information for DDH2018_002 is incorrect, so we will clear all values by executing dict.clear().

 

In [75]:
{'northing': 150000, 'easting': 4569450, 'depth': 1503, 'surveyed': False, 'surveyor': 'None'}
{}

As we can see from the empty brackets that were returned when printing DDH2018_002, the information was successfully cleared. Now, since there is no need for DDH2018_002, it can be deleted completely like so:

 

In [76]:
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-76-832d004d1908> in <module>()
      1 del DDH2018_002
----> 2 print(DDH2018_002)

NameError: name 'DDH2018_002' is not defined

Since DDH2018_002 was deleted, we receive an error when we try to print it.
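The three deletion steps can be sketched in one place. The cells for the first two steps were lost, so this is a reconstruction from the outputs; the try/except at the end is an addition that catches the NameError instead of stopping the notebook.

```python
DDH2018_002 = {'northing': 150000, 'easting': 4569450, 'elevation': 6502,
               'depth': 1503, 'surveyed': False, 'surveyor': 'None'}

# 1. Remove a single key:value pair.
del DDH2018_002['elevation']
keys_after_delete = list(DDH2018_002)
print(DDH2018_002)

# 2. Remove every pair but keep the (now empty) dictionary.
DDH2018_002.clear()
length_after_clear = len(DDH2018_002)
print(DDH2018_002)

# 3. Remove the name itself; using it afterwards raises NameError.
del DDH2018_002
try:
    print(DDH2018_002)
    still_defined = True
except NameError:
    still_defined = False
    print('DDH2018_002 no longer exists')
```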

Final Thoughts

In this overview we covered some of the most commonly used dictionary functions and methods. There are many more you will pick up as you learn more Python, and I will show more advanced dictionary methods in a future post. If this has been useful to you, or if you have some good dictionary tips, please leave a comment below or contact me on Twitter.

Python Lists

Python lists are ordered, mutable data structures used for storing related values known as items. The square brackets [ ] define a list literal and signify to Python that a particular set of operations and methods will be available to the items in the list. Since lists are mutable, we can make additions and deletions to them as necessary. Another feature is the ability to perform a single operation on all of the items in a list at once.

I like dogs, so let’s make a list of dogs to start exploring the power of lists.

In [77]:
dogs = ['Pudlepointer', 'Jack Russell', 'French Bulldog', 'Labradoodle']

Like other data types in Python, we can print a list. We should get output that looks just like the list we created.

In [78]:
print(dogs)
['Pudlepointer', 'Jack Russell', 'French Bulldog', 'Labradoodle']

Indexing Lists

Since lists are a compound data container, or data type, we can access specific elements of the list using indexing. This makes it easy to select, edit, or remove specific items from a list. This indexed, ordered structure also makes it possible to perform operations on specific list items and to make new, smaller lists from specific items. This level of flexibility is very useful in Python.

Lists in Python are ‘0’ based meaning the first item is in the ‘0’ location and the second item is in the ‘1’ location. To illustrate, I have used lists, list comprehension, and some functions from the IPython display library to make a table of our list of dogs and their indexed position. This list method is a little advanced and will be revisited later but it is a good example of one of the things you can do with lists.

In [79]:
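from IPython.display import HTML, display

# Row 0 holds the index positions; row 1 holds the matching list items.
data = [['0', '1', '2', '3'],
        ['Pudlepointer', 'Jack Russell', 'French Bulldog', 'Labradoodle'],
        ]

# Build an HTML table: one <tr> per row and one <td> per item.
display(HTML(
    '<table><tr>{}</tr></table>'.format(
        '</tr><tr>'.join(
            '<td>{}</td>'.format('</td><td>'.join(str(_) for _ in row)) for row in data)
        )
))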
0 1 2 3
Pudlepointer Jack Russell French Bulldog Labradoodle

As you can see in the index example above, Pudlepointer is in position ‘0’ and Labradoodle is in position ‘3’. I have four dogs, or items, in my list, but the index only goes to 3. This is very important to remember when accessing list items.

To demonstrate indexing, let’s print Pudlepointer by calling its index from our list:

In [80]:
print(dogs[0])
Pudlepointer

To print French Bulldog we call position 2:

In [81]:
print(dogs[2])
French Bulldog

But what happens if we print position 4?

In [82]:
print(dogs[4])
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
 in ()
----> 1 print(dogs[4])

IndexError: list index out of range

Oops; there is no position 4.
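A small sketch of why position 4 fails: the last valid index is always one less than the length of the list. The name last_index is just for illustration.

```python
dogs = ['Pudlepointer', 'Jack Russell', 'French Bulldog', 'Labradoodle']

# len() counts items starting from 1; indexes start from 0.
last_index = len(dogs) - 1
print(len(dogs))
print(last_index)
print(dogs[last_index])
```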

It is also possible to access the list in reverse using negative numbers.

In [83]:
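# Row 0 holds the negative index positions; row 1 holds the matching list items.
data = [['-4', '-3', '-2', '-1'],
        ['Pudlepointer', 'Jack Russell', 'French Bulldog', 'Labradoodle'],
        ]

# Build an HTML table: one <tr> per row and one <td> per item.
display(HTML(
    '<table><tr>{}</tr></table>'.format(
        '</tr><tr>'.join(
            '<td>{}</td>'.format('</td><td>'.join(str(_) for _ in row)) for row in data)
        )
))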
-4 -3 -2 -1
Pudlepointer Jack Russell French Bulldog Labradoodle

If we want to use the negative indexing to print a value from our list of dogs we do the following:

In [84]:
print(dogs[-3])
Jack Russell

We can also perform string concatenation with items from a list.

In [85]:
print('My dog Millie is a ' + dogs[-4]+'.')
My dog Millie is a Pudlepointer.

Modifying List Items

It is also possible to change list items by using the index numbers. If we want to change ‘Labradoodle’ to ‘Griffon’ we can perform the following, using the index of [-1]:

In [86]:
dogs[-1] = 'Griffon'
print(dogs)
['Pudlepointer', 'Jack Russell', 'French Bulldog', 'Griffon']

List Slicing

It is possible to select a range of values from a list using the slicing method. A slice is created by separating a starting index and an ending index with a colon like so [x:y]. The first index value is inclusive whereas the last value, or stopping position, is exclusive. If we slice ‘dogs’ with [2:4] the result will be ['French Bulldog', 'Griffon'].

In [87]:
print(dogs[2:4])
['French Bulldog', 'Griffon']

We can include either end of the list by omitting one of the numbers in the list[x:y] syntax. To print the first three items of dogs, positions 0,1,2, we would do the following:

In [88]:
print(dogs[:3])
['Pudlepointer', 'Jack Russell', 'French Bulldog']

If we want to print from the middle of the list to the end we would do so like this:

In [89]:
print(dogs[2:])
['French Bulldog', 'Griffon']

Negative indexing can also be used with slices:

In [90]:
print(dogs[:-2])
print(dogs[-1:])
print(dogs[-4:-2])
['Pudlepointer', 'Jack Russell']
['Griffon']
['Pudlepointer', 'Jack Russell']

Another trick with lists is slicing using stride. Stride will determine how many units we move up or down in a list; like counting by 2s or 5s. We have not been including the stride parameter in our slices so Python has been using the default of 1. The notation for slicing with a stride parameter is list[x:y:z]. This would produce a selection of items x:y by every zth item.

In [91]:
alphabet = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v',
           'w','x','y','z']

To see how long our list, ‘alphabet’, is we use the len() function.

In [92]:
len(alphabet)
Out[92]:
26

To select every third letter, beginning at the ‘0’ position, we can do any of the following:

In [93]:
print(alphabet)
print(alphabet[::3])
print(alphabet[0::3])
print(alphabet[:26:3])
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
['a', 'd', 'g', 'j', 'm', 'p', 's', 'v', 'y']
['a', 'd', 'g', 'j', 'm', 'p', 's', 'v', 'y']
['a', 'd', 'g', 'j', 'm', 'p', 's', 'v', 'y']

This feature provides a lot of control and flexibility for accessing data in lists.

List Modification Using Operators

Mathematical operators, such as * and +, can also be used to modify lists. Additionally, the common compound forms of these operators, *= and +=, can be used.

It is common to use + to concatenate two lists and to add items to the end of a list.

In [94]:
print(dogs + alphabet)
['Pudlepointer', 'Jack Russell', 'French Bulldog', 'Griffon', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
In [95]:
dogs = dogs + ['Beagle']
print(dogs)
['Pudlepointer', 'Jack Russell', 'French Bulldog', 'Griffon', 'Beagle']

Using the * operator it is possible to multiply a list by a factor and replicate the values.

In [96]:
print(dogs*2)
['Pudlepointer', 'Jack Russell', 'French Bulldog', 'Griffon', 'Beagle', 'Pudlepointer', 'Jack Russell', 'French Bulldog', 'Griffon', 'Beagle']

Using the compound operators *= and += it is easy to automate the population of a list. We will use a for loop to add ‘Border Collie’ to our list of dogs four times, printing the growing list after each addition.

In [97]:
for x in range(1,5):
    dogs += ['Border Collie']
    print(dogs)
['Pudlepointer', 'Jack Russell', 'French Bulldog', 'Griffon', 'Beagle', 'Border Collie']
['Pudlepointer', 'Jack Russell', 'French Bulldog', 'Griffon', 'Beagle', 'Border Collie', 'Border Collie']
['Pudlepointer', 'Jack Russell', 'French Bulldog', 'Griffon', 'Beagle', 'Border Collie', 'Border Collie', 'Border Collie']
['Pudlepointer', 'Jack Russell', 'French Bulldog', 'Griffon', 'Beagle', 'Border Collie', 'Border Collie', 'Border Collie', 'Border Collie']

*= behaves in a similar way.

In [98]:
doggies = ['Bulldog']

for x in range(1,5):
    doggies *= 3
    print(doggies)
['Bulldog', 'Bulldog', 'Bulldog']
['Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog']
['Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog']
['Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog', 'Bulldog']

Removing Items from a List

There are a few ways to remove items from a list. The del statement deletes items based on their index, and the .pop() method removes the last item from a list and returns it.

In [107]:
dogs = ['Pudlepointer', 'Jack Russell', 'French Bulldog','Griffon','Beagle','Labradoodle']


del dogs[2]
print(dogs)
['Pudlepointer', 'Jack Russell', 'Griffon', 'Beagle', 'Labradoodle']

Comparing the original list 'dogs' to the output list printed after the del you will see that 'French Bulldog' was deleted.

It is also possible to delete over a range of index positions. If we want to delete 'Jack Russell','Griffon','Beagle' we would type the following:

In [108]:
dogs = ['Pudlepointer', 'Jack Russell', 'Griffon','Beagle','Labradoodle']

del dogs[1:4]
print(dogs)
['Pudlepointer', 'Labradoodle']

.pop() removes the last item from a list and returns it like so:

In [110]:
dogs = ['Pudlepointer', 'Jack Russell', 'Griffon','Beagle','Labradoodle']
print(dogs.pop())
print(dogs)
Labradoodle
['Pudlepointer', 'Jack Russell', 'Griffon', 'Beagle']
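Two more built-in options are worth a quick sketch: .pop() also accepts an index, and .remove() deletes the first item that matches a value. The choice of which dogs to remove here is arbitrary.

```python
dogs = ['Pudlepointer', 'Jack Russell', 'Griffon', 'Beagle', 'Labradoodle']

# pop(index) removes the item at that position and returns it.
removed = dogs.pop(1)
print(removed)

# remove(value) deletes the first matching item; it raises ValueError if absent.
dogs.remove('Beagle')
print(dogs)
```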

Lists of Lists

It is possible to make a list that consists of other lists as its items. A comma-delimited (CSV) file read into Python is often represented exactly this way: a list of lists, or more appropriately, nested lists.

In [111]:
dog_park = [['Pudlepointer', 'Jack Russell', 'Griffon','Beagle','Labradoodle'],['Millie', 'Piper', 'Oakley', 'Jake','Mumford']]

Nested lists are accessed using indices, similarly to what was done previously. With the notation list[x][y], [x] is the index of the inner list within the outer list and [y] is the index of the item you want to select from that inner list.

In [113]:
print(dog_park[1][0])
print(dog_park[0][0])
Millie
Pudlepointer
In [120]:
print("The first nested list [0].")
print(dog_park[0][0])
print(dog_park[0][1])
print(dog_park[0][2])
print(dog_park[0][3])
print(dog_park[0][4])

print()

print("The second nested list [1].")
print(dog_park[1][0])
print(dog_park[1][1])
print(dog_park[1][2])
print(dog_park[1][3])
print(dog_park[1][4])
The first nested list [0].
Pudlepointer
Jack Russell
Griffon
Beagle
Labradoodle

The second nested list [1].
Millie
Piper
Oakley
Jake
Mumford
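To connect nested lists back to CSV files, here is a small sketch using Python's built-in csv module. The CSV text, including its 'breed,name' header row, is made up for this example and is read from a string instead of a file so the block runs on its own.

```python
import csv
import io

# Hypothetical CSV content; in practice this would come from a file on disk.
csv_text = 'breed,name\nPudlepointer,Millie\nJack Russell,Piper\n'

# csv.reader yields each line as a list of strings, giving a list of lists.
rows = list(csv.reader(io.StringIO(csv_text)))
print(rows)

# The same [x][y] notation selects row x, then item y within that row.
print(rows[1][1])
```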

Final Thoughts

Lists are very flexible and accessible data containers for holding ordered data. The examples above are just some of the basic things that can be done with lists. If you want to learn more, I recommend you read about list methods and list comprehensions to further your programming skills.

Thank you for reading this post. If you found it useful or you have some more information about Python lists, please share with me in the comments below or on Twitter.

Python image courtesy of www.python.org 

Variables and Data Types in Python

Python, like other computer languages, uses variables to store data values in memory. Each stored value also has a data type so that the computer knows how to treat it. This is important because integers have different properties than floats (decimals) and dates. The beauty of Python is that data types for variables do not have to be explicitly declared as they do in statically typed languages such as Java or C. Instead, the Python interpreter determines the data type dynamically from the value, allowing any kind of data to be assigned to any variable without declaring a type or allocating memory by hand. For example, in C we would type:

int value = 0;
for (int i = 0; i < 1000; i++) { value += i; }

The same operation in Python would look like this:

value = 0
for i in range(1000):
    value += i

Python will recognize and treat value as an integer whereas in the C example we had to declare value as an integer. Now, if we want to change things up in Python, we can type:

x = 100
x = "one-hundred"

But in C that cannot be done:

int x = 100;
x = "one-hundred";  // ERROR

We told the computer x would be an integer and then we changed it to a string.
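In Python, the built-in type() function shows how the type follows the value rather than the variable name. A minimal sketch (the names first_type and second_type are only for illustration):

```python
x = 100
first_type = type(x)
print(first_type)

# The same name can later hold a value of a different type.
x = "one-hundred"
second_type = type(x)
print(second_type)
```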

Back to basics: if you consider the types of data you encounter on a daily basis, you will likely find that you deal with decimal numbers (floats) and counts (integers), words (strings), and dates (datetime), plus various others. We have general rules as to how these data types can interact with one another.


The computer will not combine two incompatible data types, such as a number and a string, by addition. It is important that we understand the different data types and how they can be used and manipulated.
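The original screenshots are not available, so here is a sketch of the kind of behavior they likely showed. Adding an integer to a string raises a TypeError, while converting one of the values first makes the operation valid. All variable names are made up for this example.

```python
count = 5      # an integer
label = '5'    # a string that happens to contain a digit

# Mixing the two types with + fails.
try:
    total = count + label
except TypeError as error:
    total = None
    print('TypeError:', error)

# Converting explicitly tells Python which kind of addition we mean.
combined_number = count + int(label)   # numeric addition
combined_text = str(count) + label     # string concatenation
print(combined_number)
print(combined_text)
```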

Python Standard Data Types

This post covers six of Python’s standard data types:

  • Number
  • String
  • Boolean
  • List
  • Tuple
  • Dictionary

Numbers

Python will interpret any number you enter as a number. If the number has no decimals, Python will see it as an integer and if it does have decimals Python will see it as a float. For example, 155 is an integer and 155.0 is a float.

Integers

Integers, also known as int, are whole numbers that can be positive, negative, or 0. An example of integers would be ..., -2, -1, 0, 1, 2, ... In Python, we can print an integer like so:

print(1500)

1500

Notice that commas are not used as thousands separators in numbers greater than 999, as they would be in ordinary writing.

It is also possible to assign a value to a variable. An important concept is that a variable holds a value and does not equal that value.
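A short sketch of that idea: the name my_integer (a made-up name) first holds one value and is then pointed at another, which would make no sense if the = sign meant mathematical equality.

```python
# The variable holds the value 1500.
my_integer = 1500
first_value = my_integer
print(my_integer)

# The same variable can later hold a different value.
my_integer = 42
print(my_integer)
```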

We can also assign expressions to variables:

my_expression = 1500 + 500

print(my_expression)

2000

Floating Point Numbers

Floating point numbers, or floats, represent real numbers with a decimal point. These are pretty common; think about numbers with decimals, such as money and π. As with integers, floats can be printed with Python’s print function, assigned to variables, and used in expressions that are assigned to a variable.

One thing to keep in mind is that while we may see 5.0 and 5 as being the same, the computer does not: 5.0 is a float and 5 is an integer. Performing a mathematical operation with a float and an integer will convert the integer to a float.
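Since the screenshots for this section are missing, here is a sketch of float printing, assignment, and mixed arithmetic. The variable names and numbers are chosen for illustration only.

```python
# Printing a float directly.
print(25.5)

# Assigning a float, and an expression with floats, to variables.
my_float = 25.5
my_expression = 1.5 + 2.25
print(my_float)
print(my_expression)

# Mixing an integer and a float produces a float.
mixed = 5 + 5.0
print(mixed)
print(type(mixed))

# The values compare as equal even though the types differ.
print(5 == 5.0)
print(type(5) == type(5.0))
```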


Strings

The string data type comprises letters, numbers, and symbols contained within either single quotes ' or double quotes ". The choice between single and double quotes is yours to make, but it is important to stay consistent.

Some examples of strings are:

'5'

'Hey diddle, diddle. The cat and the fiddle.'

'My cat likes single quotes.'

"My dog likes double quotes."

The classic first program is the “Hello World” statement.

print("Hello World!")

There are numerous string methods and functions we can employ within Python to manipulate string data. These are worth exploring to make your data munging tasks easier as well as exchange information back and forth with the computer.
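A few of the built-in string methods, sketched on the greeting from above. These are only a sample; the variable names are made up.

```python
greeting = 'Hello World!'

# Methods return new strings; the original is left unchanged.
shout = greeting.upper()
swapped = greeting.replace('World', 'Python')
print(shout)
print(swapped)

# split() breaks a string into a list of words at whitespace.
words = greeting.split()
print(words)

# len() counts every character, including the space and punctuation.
print(len(greeting))
```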

Boolean

Boolean data types will be one of two values: True or False. In Python, these also behave like the integers 1 and 0, respectively. Some examples of Boolean logic are:

10 > 9     True

8 < 4      False

550 < 600  True

Notice that a double equals sign == is used to test for equality. A single equals sign = is only used to assign a value to a variable.

As with numbers and strings, Boolean values can be assigned to variables and printed with the print() function.
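The screenshots for this section are missing, so the following is a sketch of comparisons, the == test, and Boolean assignment. The variable names are hypothetical.

```python
# Comparisons evaluate to Boolean values.
print(10 > 9)
print(8 < 4)

# == tests for equality; = assigns.
equal = (5 == 5)
not_equal = (5 == 6)
print(equal)
print(not_equal)

# A Boolean stored in a variable.
is_surveyed = True
print(is_surveyed)

# True and False behave like 1 and 0 in arithmetic.
true_count = True + True
print(true_count)
```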


Lists

A list in Python is a mutable, ordered sequence of items. Anything can be included in a list, and lists can also be assigned to variables like the other data types. Lists are always contained within square brackets [ ]. Some examples are:

Integers           [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Floats              [2.2, 3.2, 4.2, 5.8, 9.2, 9.8, 10.0]

Strings             ["Beagle", "Jack Russell", "Pudle Pointer"]

Lists are often built or transformed with list comprehensions, where an operation is performed on each member of a list. Lists are very flexible because they can be changed and appended to.
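A minimal sketch of both ideas, using the integer list from the examples above. The squaring operation is just an arbitrary choice for the comprehension.

```python
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# A list comprehension applies an operation to each member and builds a new list.
squares = [n * n for n in numbers]
print(squares)

# Lists are mutable, so items can be appended after creation.
numbers.append(11)
print(numbers)
```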

Tuples

Tuples are immutable, ordered objects used to group data. Tuples are contained within parentheses ( ). An example of a tuple is:

('red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet')

Tuples can also be assigned to a variable.
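Here is a sketch of tuple assignment that also shows what immutable means in practice. The variable name rainbow is an assumption, since the original screenshot is missing.

```python
rainbow = ('red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet')

# Tuples are indexed just like lists.
print(rainbow[0])

# Unlike lists, their items cannot be reassigned.
try:
    rainbow[0] = 'pink'
    changed = True
except TypeError as error:
    changed = False
    print('TypeError:', error)
```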


Dictionaries

Dictionaries in Python are used to map key:value pairs to hold data and are constructed with curly braces { }.  An example of a dictionary, or dict, is:

{'name': 'Millie', 'animal': 'dog', 'breed': 'Pudle Pointer', 'color': 'brown', 'home': 'Reno'}

In this example, the keys are the words to the left of each colon: 'name', 'animal', 'breed', 'color', 'home'. Keys can be created using any immutable data type. The associated values are 'Millie', 'dog', 'Pudle Pointer', 'brown', 'Reno' and can consist of any data type.

As with all Python data types, dicts can be assigned to a variable and printed.


To select specific information we can call the variable and one of its keys. If we want to know my dog’s name, we can call my_dog['name']. We can also print this value.
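The two steps can be sketched as follows, using the example dictionary from above and the my_dog name mentioned in the text.

```python
my_dog = {'name': 'Millie', 'animal': 'dog', 'breed': 'Pudle Pointer',
          'color': 'brown', 'home': 'Reno'}

# Print the whole dictionary.
print(my_dog)

# Look up a single value by its key.
dog_name = my_dog['name']
print(dog_name)
```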


Final Thoughts

This brief overview of Python’s primary data types and data structures should help you get started with applying them to your own Python programming. As your skills grow, you will learn to use list comprehensions to iterate operations, to use keys to add and remove dictionary entries, and to use list methods like append to add values to lists once you have performed operations. If this has been informative and helpful, please comment below or contact me on Twitter.

Introduction to Jupyter Notebooks

Now that Anaconda is installed on your computer, we will spend some time learning about the Jupyter notebook environment before experimenting with Python. The name Jupyter is derived from the names of three popular scientific computing languages that are supported by the notebook environment: Julia, Python, and R.

The Jupyter notebook has become one of the most popular means of performing, and sharing, scientific research and interactive computing. Some of the reasons to use Jupyter Notebooks are:

  • A single document for everything: Jupyter notebooks support the interactive development and execution of code, markdown documentation, graphs and figures, mathematical equations, maps, and much, much more, all in a web-based environment served locally from your computer.
  • Reproducible work: The combination of Markdown with the ease of use of the Jupyter notebook makes documenting your work simple.
  • Easy to share and convert: Notebooks are JSON documents and can be easily converted to HTML and PDF with nbconvert. They can also be viewed by others without the Jupyter ecosystem in a web browser using nbviewer.
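Because a notebook is a JSON document, it can be inspected with Python's built-in json module. The sketch below uses a tiny hand-written notebook in the version 4 format; real .ipynb files carry more metadata, and the cell contents here are made up.

```python
import json

# A simplified, hand-written notebook with one Markdown cell and one code cell.
notebook_text = '''
{
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["# Drillhole notes"]},
    {"cell_type": "code", "execution_count": null, "metadata": {},
     "outputs": [], "source": ["print(1500)"]}
  ],
  "metadata": {},
  "nbformat": 4,
  "nbformat_minor": 2
}
'''

# json.loads turns the text into ordinary Python dictionaries and lists.
notebook = json.loads(notebook_text)
cell_types = [cell['cell_type'] for cell in notebook['cells']]
print(cell_types)
print(notebook['nbformat'])
```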

There are more reasons to use Jupyter notebooks, but these are the primary ones for those switching over from Excel or another spreadsheet program. These qualities will help you win your organization over to using notebooks to collaborate, share, and peer review projects.

Starting Jupyter Notebooks with IPython 3

The easiest way to start Jupyter Notebooks and the IPython environment is to open up a Command Prompt or Terminal window on your computer and navigate to the directory where you want to work. Once there, type:

jupyter notebook

The Jupyter application will open in a web browser at an address like http://localhost:8888, meaning you are serving an instance locally on your computer. You will see a webpage that contains the directory and file structure where you opened the notebook. You can choose a directory here, or click on the “New” dropdown and select Python 3 to start a Python instance in a new tab.


A new tab will appear with a single, empty cell. This is where we will type our markdown text or type and execute our code.


Cells

Cells are where you will perform your work. There are two types of cells:

  • Code cells – contain code to be executed, with output printed below the cell.
  • Markdown cells – contain text formatted with Markdown for comments and any writing you choose to do.

To see how a code cell works, type:

print('Welcome to Jupyter Notebooks!')

and press either Shift+Enter or Control+Enter. Alternatively, to run the cell you can click on the Run button at the top of the screen. I recommend taking the time to learn the keyboard shortcuts. They help you stay in your workflow, making your typing and production quicker and more efficient. To help get you started, here is a list of common keyboard shortcuts.

Jupyter Keyboard Shortcuts (Command Mode)

  • Run a cell: Ctrl + Enter or Shift + Enter
  • Toggle between edit and command mode: Esc or Enter
  • Scroll up and down: Up or Down keys
  • New cell above: A
  • New cell below: B
  • Activate Markdown cell: M
  • Activate Code cell: Y
  • Delete active cell: D 2x
  • Undo deleted cell: Z
  • Select multiple cells: Shift + Up or Down
  • Merge cells: Shift + M
  • Split cell: Ctrl + Shift + -

To see more features and their keystrokes type Ctrl + Shift + P while in command mode.

To quit, just click the Logout button in the top right corner of the screen.

This is enough to get you going. In future posts we will learn some basic Python commands and will take a look at Markdown.

Was this helpful for you? Please comment below or find me on Twitter to let me know.