Introduction to NumPy and Pandas
NumPy is an open-source Python library that facilitates efficient numerical operations on large quantities of data. The main data structure in this library is the powerful NumPy array, ndarray, which can have any number of dimensions. The NumPy library contains many useful features for performing mathematical and logical operations on these special arrays. NumPy is a part of a set of Python libraries that are used for scientific computing due to its efficient data analysis capabilities.
Pandas is a library with data manipulation tools that are built on top of and add to those of the established NumPy library. It relies on the NumPy array structure for implementation of its objects and therefore shares many features with NumPy and is frequently used alongside it. Pandas is also a part of the set of libraries used for scientific computation.
If you have Anaconda installed, NumPy and pandas may have been auto-installed as well! If they haven’t been, or if you want to update to the latest versions, you can open a terminal window and run the following commands:
conda install numpy
conda install pandas
If you don’t have Anaconda installed, you can alternatively install the libraries using pip by running the following commands from your terminal:
pip install numpy
pip install pandas
Once you’ve installed these libraries, you’re ready to open any Python coding environment (we recommend Jupyter Notebook). Before you can use these libraries, you’ll need to import them using the following lines of code. We’ll use the abbreviations np and pd, respectively, to simplify our function calls in the future.
import numpy as np
import pandas as pd
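As a quick sanity check that both imports work, you could print the installed versions. This is only a sketch; the version numbers you see depend on your own installation.

```python
import numpy as np
import pandas as pd

# __version__ is a string attribute that both libraries expose
print("NumPy version:", np.__version__)
print("pandas version:", pd.__version__)
```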
NumPy arrays differ from normal Python lists in ways that make them better suited to numerical work. They are called ndarrays since they can have any number (n) of dimensions (d). They hold a collection of items of a single data type and can be a vector (one-dimensional), a matrix (two-dimensional), or an array with more dimensions. NumPy arrays allow for fast element access and efficient data manipulation.
The code below initializes a Python list named list1:
list1 = [1,2,3,4]
To convert this to a one-dimensional ndarray with four elements, we can use the np.array() function:
array1 = np.array(list1)
print(array1)
[1 2 3 4]
To get a two-dimensional ndarray from a list, we must start with a Python list of lists:
list2 = [[1,2,3],[4,5,6]]
array2 = np.array(list2)
print(array2)
[[1 2 3]
 [4 5 6]]
In the above output, you may notice that the NumPy array print-out is displayed in a way that clearly demonstrates its multi-dimensional structure: two rows and three columns.
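To inspect this structure programmatically, an ndarray exposes a few attributes. The sketch below reuses list2 from above and assumes nothing beyond the standard ndim, shape, and dtype attributes.

```python
import numpy as np

list2 = [[1, 2, 3], [4, 5, 6]]
array2 = np.array(list2)

print(array2.ndim)   # number of dimensions: 2
print(array2.shape)  # (rows, columns): (2, 3)
print(array2.dtype)  # the one data type shared by every element (an integer type)
```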
Many operations can be performed on NumPy arrays, which makes them very helpful for manipulating data:
Selecting array elements
Numerical operations (min, max, mean, etc)
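The two kinds of operations listed above might look like this in practice. The array name scores and its values are made up for illustration.

```python
import numpy as np

scores = np.array([[1, 2, 3], [4, 5, 6]])

# Selecting array elements
print(scores[0, 1])   # row 0, column 1 -> 2
print(scores[1])      # the whole second row
print(scores[:, 0])   # the first column

# Numerical operations
print(scores.min())   # 1
print(scores.max())   # 6
print(scores.mean())  # 3.5
```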
Mathematical operations can be performed on all values in an ndarray at one time rather than having to loop through values, as is necessary with a Python list. This is very helpful in many scenarios. Say you own a toy store and decide to decrease the price of all toys by €2 for a weekend sale. With the toy prices stored in an ndarray, you can apply the discount in a single expression.
toyPrices = np.array([5,8,3,6])
print(toyPrices - 2)
[3 6 1 4]
If, however, you had stored your toy prices in a Python list, you would have to manually loop through the entire list to decrease each toy price.
toyPrices = [5,8,3,6]
# print(toyPrices - 2) -- Not possible. Causes an error
for i in range(len(toyPrices)):
    toyPrices[i] -= 2
print(toyPrices)
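Other arithmetic works the same way. As a sketch with the same toy prices, a hypothetical half-price sale needs one expression with an ndarray, while a list needs a loop or a list comprehension. The variable names here are made up for the example.

```python
import numpy as np

toy_prices = [5, 8, 3, 6]

# ndarray: the multiplication is applied to every element at once
half_price_array = np.array(toy_prices) * 0.5
print(half_price_array)

# list: each element has to be visited explicitly
half_price_list = [price * 0.5 for price in toy_prices]
print(half_price_list)  # [2.5, 4.0, 1.5, 3.0]
```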
Pandas Series and DataFrames
Just as the ndarray is the foundation of the NumPy library, the Series is the core object of the pandas library. A pandas Series is very similar to a one-dimensional NumPy array, but it has additional functionality that allows values in the Series to be indexed using labels. A NumPy array does not have the flexibility to do this. This labeling is useful when you are storing pieces of data that have other data associated with them. Say you want to store the ages of students in an online course to eventually figure out the average student age. If stored in a NumPy array, you could only access these ages with the internal ndarray indices 0, 1, 2, and so on. With a Series object, the indices of values are set to 0, 1, 2, and so on by default, but you can customize the indices to be other values such as student names so an age can be accessed using a name. Customized indices of a Series are established by sending values into the Series constructor, as you will see below.
A Series holds items of any one data type and can be created by sending in a scalar value, Python list, dictionary, or ndarray as a parameter to the pandas Series constructor. If a dictionary is sent in, the keys may be used as the indices.
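A sketch of the other constructor inputs mentioned here (list, dictionary, scalar). The variable names and values are illustrative only.

```python
import pandas as pd

# From a Python list: default indices 0, 1, 2
from_list = pd.Series([13, 25, 19])

# From a dictionary: the keys become the indices
from_dict = pd.Series({'Emma': 13, 'Swetha': 25, 'Serajh': 19})

# From a scalar: the value is repeated for every index supplied
from_scalar = pd.Series(18, index=['a', 'b', 'c'])

print(from_list.index.tolist())   # [0, 1, 2]
print(from_dict.index.tolist())   # ['Emma', 'Swetha', 'Serajh']
print(from_scalar.tolist())       # [18, 18, 18]
```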
# Create a Series using a NumPy array of ages with the default numerical indices
ages = np.array([13,25,19])
series1 = pd.Series(ages)
print(series1)
0 | 13
1 | 25
2 | 19
dtype: int64
When printing a Series, the data type of its elements is also printed. To customize the indices of a Series object, use the index argument of the Series constructor.
# Create a Series using a NumPy array of ages but customize the indices to be the names that correspond to each age
ages = np.array([13,25,19])
series1 = pd.Series(ages,index=['Emma', 'Swetha', 'Serajh'])
print(series1)
Emma | 13
Swetha | 25
Serajh | 19
dtype: int64
Series objects provide more information than NumPy arrays do. Printing a NumPy array of ages does not print the indices or allow us to customize them.
ages = np.array([13,25,19])
print(ages)
[13 25 19]
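Returning to the student ages example, the custom labels make it possible to look up an age by name, and the average can be computed directly on the Series. A minimal sketch:

```python
import numpy as np
import pandas as pd

ages = np.array([13, 25, 19])
series1 = pd.Series(ages, index=['Emma', 'Swetha', 'Serajh'])

print(series1['Swetha'])  # label-based access -> 25
print(series1.iloc[0])    # position-based access still works -> 13
print(series1.mean())     # average student age -> 19.0
```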
Another important type of object in the pandas library is the DataFrame. This object is similar in form to a matrix as it consists of rows and columns. Both rows and columns can be indexed with integers or string names. One DataFrame can contain many different data types, but within a column, everything has to be the same data type. A column of a DataFrame is essentially a Series. All columns must have the same number of elements (rows).
There are different ways to fill a DataFrame such as with a CSV file, a SQL query, a Python list, or a dictionary. Here we have created a DataFrame using a Python list of lists. Each nested list represents the data in one row of the DataFrame. We use the keyword columns to pass in the list of our custom column names.
dataf = pd.DataFrame([
    ['John Smith','123 Main St',34],
    ['Jane Doe', '456 Maple Ave',28],
    ['Joe Schmo', '789 Broadway',51]
    ],
    columns=['name','address','age'])
This is how the DataFrame is displayed:
  | name | address | age
0 | John Smith | 123 Main St | 34
1 | Jane Doe | 456 Maple Ave | 28
2 | Joe Schmo | 789 Broadway | 51
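The same table could also be built from a dictionary, one of the alternatives mentioned above. In this sketch each key becomes a column name and each list holds that column's values; the variable name dataf_from_dict is made up here.

```python
import pandas as pd

dataf_from_dict = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
    'address': ['123 Main St', '456 Maple Ave', '789 Broadway'],
    'age': [34, 28, 51],
})

# Selecting a single column gives back a Series
age_column = dataf_from_dict['age']
print(isinstance(age_column, pd.Series))  # True

# Each column has its own data type
print(dataf_from_dict.dtypes)
```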
The default row indices are 0, 1, 2, and so on, but these can be changed. For example, they can be set to be the elements in one of the columns of the DataFrame. To use the name column as indices instead of the default numerical values, we can run the following command on our DataFrame:
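One way to do this, assuming the pandas set_index method is the intended command, is sketched below. The DataFrame is rebuilt first so the example runs on its own.

```python
import pandas as pd

dataf = pd.DataFrame([
    ['John Smith', '123 Main St', 34],
    ['Jane Doe', '456 Maple Ave', 28],
    ['Joe Schmo', '789 Broadway', 51],
], columns=['name', 'address', 'age'])

# set_index returns a new DataFrame whose row labels come from the 'name' column
dataf = dataf.set_index('name')
print(dataf)
```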
name | address | age
John Smith | 123 Main St | 34
Jane Doe | 456 Maple Ave | 28
Joe Schmo | 789 Broadway | 51
DataFrames are useful because they make it much easier to select, manipulate, and summarize data. Their tabular format (a table with rows and columns) also makes data easier to label, simpler to read, and easier to move to and from a spreadsheet. Understanding the power of these new data structures is the key to unlocking many new avenues for data manipulation, exploration, and analysis!
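As a closing sketch of selecting and summarizing, here is the same example table with names as row labels. The age threshold of 30 is arbitrary, and set_index is assumed as the way to apply the labels.

```python
import pandas as pd

dataf = pd.DataFrame([
    ['John Smith', '123 Main St', 34],
    ['Jane Doe', '456 Maple Ave', 28],
    ['Joe Schmo', '789 Broadway', 51],
], columns=['name', 'address', 'age']).set_index('name')

# Select: one cell by row label and column name
print(dataf.loc['Jane Doe', 'age'])  # 28

# Filter: keep only the rows where age is over 30
over_30 = dataf[dataf['age'] > 30]
print(over_30.index.tolist())        # ['John Smith', 'Joe Schmo']

# Summarize: average age across all rows
average_age = dataf['age'].mean()
print(average_age)
```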