Relational Databases for Data Science/Analysis
Acquiring the data you need to conduct data analysis can often be a challenging and rewarding part of the data science life cycle. One of the most important questions to address especially when acquiring large amounts of data for data science and analysis is, where will all the data be stored? Data storage is such an integral part of data acquisition because the way that you store your data will influence how you interact with it throughout its application. In this lesson, we will cover relational databases as a way to store data for data science and analytics processes.
We should make sure to understand the type of data we are working with when acquiring data. Data can be structured, semi-structured, or unstructured. Unstructured data is data that has no defined organization or data model. It is the photos, posts, videos, PDF files, and Word documents that can contain a wide range of dimensions and types. It is estimated that by 2017, nearly 80% of the 125 exabytes of global business data was unstructured. Semi-structured data has more organization than unstructured data, an example being XML data which is stored in only plain text format. Here is an example of the sitemap for codecademy.com which is an XML file that stores all of the links found on the Codecademy home page.
XML code snippet
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml"> <sitemap> <loc>https://www.codecademy.com/sitemap_static.xml</loc> </sitemap> <sitemap> <loc>https://www.codecademy.com/sitemap_catalog.xml</loc> </sitemap> <sitemap> <loc>https://www.codecademy.com/sitemap_article.xml</loc> </sitemap> <sitemap> <loc>https://www.codecademy.com/sitemap_module.xml</loc> </sitemap> <sitemap> <loc>https://www.codecademy.com/sitemap_skills_v2.xml</loc> </sitemap> <sitemap> <loc>https://www.codecademy.com/sitemap_paths.xml</loc> </sitemap> <sitemap> <loc>https://www.codecademy.com/sitemap_references.xml</loc> </sitemap> </sitemapindex>
In this lesson, we will focus on structured data, as it is the data found within relational databases.
Structured data is the most organized type of data. It follows a data model, which is a kind of blueprint that defines and describes the structure of the data. Structured data is typically in the form of tables that are well defined by their rows and columns, a great example being a DataFrame from the popular Python library pandas. Some advantages of structured data are that its explicit structure aids in organized storage and accessibility. This also allows it to be easily indexed for efficient reference. With structured data, we are able to set permissions for security purposes, so that only those that meet predetermined guidelines and clearances can access the data.
Examples of structured data are:
- Online forms
- Excel Spreadsheets
- SQL Databases
In many applications, the amount of data collected is vast and can not be stored in a single machine. This is why databases are so vital when working with certain data. Databases are collections of data that are organized for efficient accessibility and management. Data Analysts use databases for marketing, business, and sales. Data Scientists, Software Engineers, and Web Developers use databases to store the large amounts of data that are often used in production scale applications. Within the data science pipeline, databases are considered the storage units that house the data so it can be cleaned for analysis and modeling.
Insert image of databases in data science pipeline
- Storage- where the data is housed throughout the entire data science life cycle.
- Data Cleaning- assigning proper data types and categories, removing duplicate and null values within the data.
- Anomaly removal- removal of any data points that would worsen the modeling performance.
- Transformation- applying changes to every data point to improve modeling performance.
- Augmentation- changing the size of data used for training a model, such as cropping, rotating, or flipping an image.
There are multiple database models that specify how a database is structured. For instance, the flat model is the most simple and is essentially a table. The relational model can be viewed as a database model that has multiple tables that each describe a particular entity of the database. We refer to the databases that follow the relational model as relational databases.
Relational databases are the primary means of storage for structured data. These databases are relational because they organize data into tables that each contain data related to one another. Remember that tables are composed of rows and columns, with rows representing each individual observation called a record, and the columns are called fields that represent the characteristics or dimensions of the record.
Insert image of table description
In this table we can see the 6 fields:
address_id. Each of these fields is an attribute of a record. For this particular view, we see 4 records in the table, with each representing a single customer.
Relational databases are so useful for data science and analytics because we often want to understand the relationships among our data, and relational databases organize data to make those relationships clear and simple to explore. The tabular structure of relational databases also makes them easy to index, query, and scale. When we consider relational databases as a collection of tables, what we call a schema, we can visualize them with entity-relationship diagrams, which give us a chance to view the data within each table, and how each table relates to the others.
Insert entity relationship diagram image
Entity Relationship Diagram
In the entity-relationship diagram above we can see the seven tables that make up the relational database of a record store. There are tables that contain information on customers, orders, albums, etc. If we look closely, we can see that some of the tables are related to each other by certain fields. These shared fields are called keys and are the links that connect the various tables.