# Statistics in NumPy

Learn how to analyze different statistical distributions using NumPy.

## Key Concepts

Review core concepts you need to learn to master this subject

Conditions in Numpy.mean()

NumPy’s Mean and Axis

NumPy’s Sort Function

Definition of Percentile

NumPy Percentile Function

NumPy’s Percentile and Quartiles

Histogram Visualization

Datasets and their Histograms

Conditions in Numpy.mean()

```
import numpy as np
a = np.array([1,2,3,4])
np.mean(a)
# Output = 2.5
np.mean(a>2)
# The array now becomes array([False, False, True, True])
# True = 1.0,False = 0.0
# Output = 0.5
# 50% of array elements are greater than 2
```

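In Python, the function `numpy.mean()` can be used to calculate the fraction of array elements that satisfy a certain condition. A comparison such as `a > 2` produces a Boolean array, and because `True` counts as 1 and `False` as 0, the mean of that array is the fraction of elements for which the condition holds.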

NumPy’s Mean and Axis

In a two-dimensional array, you may want the mean of just the rows or just the columns. In Python, the NumPy `np.mean()` function can find these values. To find the mean of each row, set the `axis` parameter to 1. To find the mean of each column, set the `axis` parameter to 0.
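As a minimal sketch of the `axis` parameter, the example below uses a small made-up array (the `scores` values are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical data: each row is one player, each column is one round.
scores = np.array([[1, 2, 3],
                   [4, 5, 6]])

overall_mean = np.mean(scores)          # mean of all six values
row_means = np.mean(scores, axis=1)     # one mean per row
column_means = np.mean(scores, axis=0)  # one mean per column

print(overall_mean)   # 3.5
print(row_means)      # [2. 5.]
print(column_means)   # [2.5 3.5 4.5]
```

Leaving out `axis` averages every element, which is often a useful sanity check next to the per-row and per-column results.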

NumPy’s Sort Function

In Python, the NumPy `np.sort()` function takes a NumPy array and returns a new NumPy array containing the same numbers in ascending order. The original array is left unchanged.
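A short sketch of sorting, using hypothetical measurements with one unusually large value:

```python
import numpy as np

# Hypothetical measurements; 250 is deliberately out of line with the rest.
heights = np.array([170, 165, 250, 158, 172])

sorted_heights = np.sort(heights)  # returns a sorted copy

print(sorted_heights)  # [158 165 170 172 250]
print(heights)         # [170 165 250 158 172] (original order is kept)
```

Once the data is sorted, a value that lies far from the others is easy to spot at either end of the array.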

Definition of Percentile

In statistics, a data set’s Nth percentile is the cutoff point demarcating the lower N% of samples. For example, the 40th percentile is the value below which 40% of the samples fall.

NumPy Percentile Function

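In Python, the NumPy `np.percentile()` function accepts a NumPy array and a percentile value between 0 and 100. The function returns the value at the specified percentile. When that percentile falls between two elements of the array, the default behavior is to interpolate linearly between them.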
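The sketch below illustrates this with a small hypothetical array. The 50th percentile lands exactly on an element, while the 40th falls between two elements and is interpolated:

```python
import numpy as np

# Hypothetical, already-sorted data for readability (sorting is not required).
d = np.array([1, 3, 5, 7, 9])

p50 = np.percentile(d, 50)  # lands exactly on the middle element
p40 = np.percentile(d, 40)  # falls between 3 and 5, so it is interpolated

print(p50)  # 5.0
print(p40)  # about 4.2
```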

NumPy’s Percentile and Quartiles

In Python, the NumPy `np.percentile()` function can calculate the first, second, and third quartiles of an array. These three quartiles are simply the values at the 25th, 50th, and 75th percentiles, so those numbers are passed as the percentile argument, just as with any other percentile.
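As a sketch, the three quartiles of a hypothetical array can be computed in a single call by passing a list of percentiles:

```python
import numpy as np

# Hypothetical data set of nine values.
d = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

first_quartile, median, third_quartile = np.percentile(d, [25, 50, 75])

# The interquartile range is the distance between the first and third quartiles.
interquartile_range = third_quartile - first_quartile

print(first_quartile, median, third_quartile)  # 3.0 5.0 7.0
print(interquartile_range)                     # 4.0
```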

Histogram Visualization

A histogram is a plot that visualizes the distribution of samples in a dataset. The vertical axis shows frequency, and the horizontal axis shows the measured values. The horizontal axis is usually divided into bins, each with a minimum and a maximum value. Each bin also has a frequency: the number of samples that fall within it, which can be zero or any positive count.
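To make bins and frequencies concrete, here is a minimal sketch that computes them without plotting. It assumes NumPy's `np.histogram()` function, which is not named in the text above, and a made-up data set:

```python
import numpy as np

# Hypothetical data with values from 1 to 5.
data = np.array([1, 1, 2, 3, 3, 3, 4, 5, 5])

# Four equal-width bins covering the range 1 to 5.
counts, bin_edges = np.histogram(data, bins=4, range=(1, 5))

print(bin_edges)  # [1. 2. 3. 4. 5.]
print(counts)     # [2 1 3 3]
```

Each bin includes its minimum and excludes its maximum, except the last bin, which includes both. That is why the value 4 and both 5s are counted together in the final bin.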

Datasets and their Histograms

When a dataset is plotted as a *histogram*, the shape of the plot shows how the data is distributed and indicates the type of distribution.

The number of peaks in the histogram determines the *modality* of the dataset. It can be *unimodal* (one peak), *bimodal* (two peaks), *multimodal* (more than two peaks) or *uniform* (no peaks).

Unimodal datasets can also be *symmetric*, *skew-left* or *skew-right* depending on where the peak is relative to the rest of the data.
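One rough way to see skew in numbers rather than in a plot is to compare the mean with the median. The sketch below uses hypothetical skew-right data (a single peak with a long tail of large values), where the tail typically pulls the mean above the median:

```python
import numpy as np

# Hypothetical skew-right data: the peak is at 3, with a tail reaching 20.
data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 12, 20])

mean = np.mean(data)
median = np.median(data)

print(median)         # 3.0
print(mean > median)  # True
```

This comparison is only a heuristic; a histogram remains the more reliable way to judge the shape of a dataset.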

Normal Distribution using Python Numpy module

A normally distributed dataset can be generated in NumPy with the following function:

`np.random.normal(loc, scale, size)`

Here `loc` is the mean of the normal distribution, `scale` is its standard deviation, and `size` is the number of observations to generate.
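A minimal sketch of generating such a dataset follows. The mean of 50, standard deviation of 10, and the fixed seed are arbitrary choices made for illustration:

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is repeatable (an illustrative choice)

# 10,000 observations from a normal distribution with mean 50 and standard deviation 10.
samples = np.random.normal(loc=50, scale=10, size=10000)

sample_mean = np.mean(samples)
sample_std = np.std(samples)

print(samples.shape)  # (10000,)
print(sample_mean)    # close to 50
print(sample_std)     # close to 10
```

The sample statistics will not match `loc` and `scale` exactly, but they tend to get closer as `size` grows.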

Standard deviation

The *standard deviation* of a *normal distribution* determines how spread out the data is from the mean.

About 68% of samples fall within 1 standard deviation of the mean.

About 95% of samples fall within 2 standard deviations of the mean.

About 99.7% of samples fall within 3 standard deviations of the mean.
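These percentages can be checked empirically. The sketch below generates a large normal sample (the mean of 100, standard deviation of 15, and seed are assumptions chosen for illustration) and uses `np.mean()` on a condition to measure the fraction of samples within each band:

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is repeatable (an illustrative choice)

mu, sigma = 100, 15  # hypothetical mean and standard deviation
samples = np.random.normal(loc=mu, scale=sigma, size=100000)

# Distance of every sample from the mean.
distance = np.abs(samples - mu)

within_1 = np.mean(distance <= 1 * sigma)
within_2 = np.mean(distance <= 2 * sigma)
within_3 = np.mean(distance <= 3 * sigma)

print(within_1)  # roughly 0.68
print(within_2)  # roughly 0.95
print(within_3)  # roughly 0.997
```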

1. You’re a citizen scientist who has started collecting data about rising water in the river next to where you live. For months, you painstakingly measure the water levels and enter your findings int…
2. The first statistical concept we’ll explore is *mean*, also commonly referred to as an average. The mean is a useful measurement to get the center of a dataset. NumPy has a built-in function to cal…
3. We can also use np.mean to calculate the percent of array elements that have a certain property. As we know, a logical operator will evaluate each item in an array to see if it matches the specif…
4. If we have a two-dimensional array, np.mean can calculate the means of the larger array as well as the interior values. Let’s imagine a game of ring toss at a carnival. In this game, you have thr…
5. As we can see, the mean is a helpful way to quickly understand different parts of our data. However, the mean is highly influenced by the specific values in our data set. What happens when one of t…
6. One way to quickly identify outliers is by sorting our data. Once our data is sorted, we can quickly glance at the beginning or end of an array to see if some values lie far beyond the expected ran…
7. Another key metric that we can use in data analysis is the *median*. The median is the middle value of a dataset that’s been ordered in terms of magnitude (from lowest to highest). Let’s look at …
8. In a dataset, the median value can provide an important comparison to the mean. Unlike a mean, the median is not affected by outliers. This becomes important in *skewed* datasets, datasets whose va…
9. As we know, the median is the middle of a dataset: it is the number for which 50% of the samples are below, and 50% of the samples are above. But what if we wanted to find a point at which 40% of t…
10. Some percentiles have specific names:
    - The **25th percentile** is called the *first quartile*
    - The **50th percentile** is called the *median*
    - The **75th percentile** is called the *third quarti…
11. While the mean and median can tell us about the center of our data, they do not reflect the range of the data. That’s where *standard deviation* comes in. Similar to the interquartile range, the …
12. As we saw in the last exercise, knowing the standard deviation of a dataset can help us understand how spread out our dataset is. We can find the standard deviation of a dataset using the Numpy f…
13. Let’s review! In this lesson, you learned how to use NumPy to analyze single-variable datasets. Here’s what we covered:
    - Using the np.sort method to locate outliers.
    - Calculating central positio…

1. A university wants to keep track of the popularity of different programs over time, to ensure that programs are allocated enough space and resources. You work in the admissions office and are asked…
2. When we first look at a dataset, we want to be able to quickly understand certain things about it:
    - Do some values occur more often than others?
    - What is the range of the dataset (i.e., the min …
3. Suppose we had a larger dataset with values ranging from 0 to 50. We might not want to know exactly how many 0’s, 1’s, 2’s, etc. we have. Instead, we might want to know how many values fall betwee…
4. We can graph histograms using a Python module known as *Matplotlib*. We’re not going to go into detail about Matplotlib’s plotting functions, but if you’re interested in learning more, take our co…
5. Histograms and their datasets can be classified based on the shape of the graphed values. In the next two exercises, we’ll look at two different ways of describing histograms. One way to classify…
6. Most of the datasets that we’ll be dealing with will be unimodal (one peak). We can further classify unimodal distributions by describing where most of the numbers are relative to the peak. A *sym…
7. The most common distribution in statistics is known as the *normal distribution*, which is a symmetric, unimodal distribution. Lots of things follow a normal distribution:
    - The heights of a large…
8. We can generate our own normally distributed datasets using NumPy. Using these datasets can help us better understand the properties and behavior of different distributions. We can also use them to…
9. In a normal distribution, we know that the mean and the standard deviation determine certain characteristics of the shape of our data, but how exactly? Let’s do some exploration to find out!
10. We know that the standard deviation affects the “shape” of our normal distribution. The last exercise helps to give us a more quantitative understanding of this. Suppose that we have a normal dis…
11. It’s known that a certain basketball player makes 30% of his free throws. On Friday night’s game, he had the chance to shoot 10 free throws. How many free throws might you expect him to make? We …
12. There are some complicated formulas for determining these types of probabilities. Luckily for us, we can use NumPy - specifically, its ability to generate random numbers. We can use these random nu…
13. Let’s return to our original question: Our basketball player has a 30% chance of making any individual basket. He took 10 shots and made 4 of them, even though we only expected him to make 3. Wh…
14. Let’s review! In this lesson, you learned how to use NumPy to analyze different distributions and generate random numbers to produce datasets. Here’s what we covered:
    - What is a histogram and how…
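The free-throw question in the exercises above can be explored by simulation. The sketch below is one possible approach, not necessarily the one the lesson uses: it assumes NumPy's `np.random.binomial()` function, a fixed seed, and 10,000 simulated games, all chosen for illustration.

```python
import numpy as np

np.random.seed(42)  # fixed seed so the run is repeatable (an illustrative choice)

# Simulate 10,000 games of 10 free throws each, with a 30% chance per shot.
made_shots = np.random.binomial(10, 0.30, size=10000)

average_made = np.mean(made_shots)         # expected to be near 3
prob_exactly_4 = np.mean(made_shots == 4)  # fraction of games with exactly 4 makes

print(average_made)
print(prob_exactly_4)
```

Taking the mean of the condition `made_shots == 4` estimates the probability of that outcome, using the same technique as calculating the fraction of array elements that satisfy a condition.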
