Pandas - Data Science with Python¶

Numpy and numpy arrays are our tool of choice for numeric data that resembles vectors, matrices (and higher dimensional tensors).

Where data is gathered from experiments, and in particular where we want to extract meaning from the combination of different data sources, and where data is often incomplete, the pandas library offers a number of useful tools (and has become a standard tool for data scientists).

In this section, we introduce the basics of Pandas.

In particular, we introduce the two key data types in Pandas: the Series and the DataFrame objects.

By convention, the pandas library is imported under the name pd (the same way that numpy is imported under the name np:

import pandas as pd

Motivational example (Series)¶

Imagine we are working on software for a greengrocer or supermarket, and need to track the number of apples (10), oranges(3) and bananas (22) that are available in the supermarket.

We could use a python list (or a numpy array) to track these numbers:

stock = [10, 3, 22]

However, we would need to remember separately that the entries are in the order of apples, oranges, and bananas. This could be achieved through a second list:

stocklabels = ['apple', 'orange', 'banana']

assert len(stocklabels) == len(stock)  # check labels and 
                                       # stock are consistent
for label, count in zip(stocklabels, stock):
    print(f'{label:10s} : {count:4d}')

apple      :   10
orange     :    3
banana     :   22

The above 2-list solution is a little awkward in two ways: firstly, we have use two lists to describe one set of data (and thus need to be carefuly to update them simulatenously, for example), and secondly, the access to the data given a label is inconvenient: We need to find the index of the label with one list, then use this as the index to the other list, for example

index = stocklabels.index('banana')
bananas = stock[index]
print(f"There are {bananas} bananas [index={index}].")

There are 22 bananas [index=2].

We have come across similar examples in the section on dictionaries, and indeed a dictionary is a more convenient solution:

stock_dic = {'apple': 10, 
             'orange': 3,
             'banana': 22}

In a way, the keys of the dictionary contain the stock labels and the values contain the actual values:

stock_dic.keys()

dict_keys(['apple', 'orange', 'banana'])

stock_dic.values()

dict_values([10, 3, 22])

To retrieve (or change) the value for apple, we use apple as the key and retrieve the value through the dictionary's indexing notation:

stock_dic['apple']

10

And we can summarise the stock as follows:

for label in stock_dic:
    print(f'{label:10s} : {stock_dic[label]:4d}')

apple      :   10
orange     :    3
banana     :   22

This is a vast improvement over the 2-lists solution: (i) we only maintain one structure, which contains a value for every key - so we don't need to check that the lists have the same length. (ii) we can access individual elements through the label (using it as a key for the dictionary).

The Pandas Series object address the requriments above. It is similar to a dictionary, but with improvements for the given problem:

the order of the items is maintained
the values have to have the same type (higher execution performance)
a (large) number of convenience functionality, for example to deal with missing data, time series, sorting, plotting, and more

Pandas `Series`¶

Stock example - `Series`¶

We can create a Series object - for example - from a dictionary:

stock = pd.Series({'apple': 10, 
                   'orange': 3,
                   'banana': 22})

The default presentation shows the entries one per row, with the label on the left, and the value on the right.

type(stock)

pandas.core.series.Series

stock

apple     10
orange     3
banana    22
dtype: int64

The items on the left are referred to as the index of the Series, and are available as the index attribute of the series object:

stock.index

Index(['apple', 'orange', 'banana'], dtype='object')

type(stock.index)

pandas.core.indexes.base.Index

We can also access the list of values for each item, using the values attribute:

stock.values

array([10,  3, 22])

Regarding data access, the Series object behaves like a dictionary:

stock['apple']

10

stock['potato'] = 101    # adding more values
stock['cucumber'] = 1

print(stock)

apple        10
orange        3
banana       22
potato      101
cucumber      1
dtype: int64

stock

apple        10
orange        3
banana       22
potato      101
cucumber      1
dtype: int64

We can plot the data as a bar chart:

%matplotlib inline
stock.plot(kind='bar')

<matplotlib.axes._subplots.AxesSubplot at 0x11458b7b8>

We can sort the data according to the values in the Series (and then plot to visualise):

stock.sort_values().plot(kind='bar')

<matplotlib.axes._subplots.AxesSubplot at 0x116655400>

Or sort the index to get alphabetical order of our fruit and vegetables:

stock.sort_index().plot(kind='bar')

<matplotlib.axes._subplots.AxesSubplot at 0x11674cf60>

The Series object has a number of numerical methods available, including mean and sum:

stock.sum()

137

stock.mean()

27.4

It also behaves like a sequence in that the len function returns the number of data points in the Series object:

len(stock)

5

memory usage¶

For larger data sets, it might be important to know how many bytes storing the Series costs. The bytes required to store the actual series data are available as

stock.nbytes

40

or from the underlying numpy array directly:

stock.values.nbytes

40

It is 40 bytes, because we have 5 elements stored as int64 (each needing 8 bytes):

stock.dtype

dtype('int64')

The Series object needs additional memory. This can be queried using:

stock.memory_usage()

240

Statistics¶

A number of statistical descriptors of the data in the stock Series object is available using describe():

stock.describe()

count      5.000000
mean      27.400000
std       41.955929
min        1.000000
25%        3.000000
50%       10.000000
75%       22.000000
max      101.000000
dtype: float64

As usual, the documentation strings provide documentation (help(stock.describe)), and the pandas home page (https://pandas.pydata.org) provides links to the Pandas documentation.

Create Series from list¶

In the example above, we showed how to create a Series from a dictionary where the keys of the dictionary entries served as the index for the Series object.

We can also create a Series from a list, an provide an additional index:

stock = pd.Series([10, 3, 22], index=['apple', 'orange', 'banana'])

stock

apple     10
orange     3
banana    22
dtype: int64

If we omit the index argument, the Series will assume an integer index:

stock = pd.Series([10, 3, 22])

stock

0    10
1     3
2    22
dtype: int64

However, an index can be added subsequently:

stock.index = ['apple', 'orange', 'banana']

stock

apple     10
orange     3
banana    22
dtype: int64

Plotting data¶

Commonly used plots are easily accessible via the plot() method of the Series object. We have seen a bar plot above already. The Series.plot() method accepts an argument kind such as kind="bar", but there is an equivalent method Series.plot.bar() avaialble.

Further examples:

stock.plot.pie()

<matplotlib.axes._subplots.AxesSubplot at 0x11683beb8>

To tailor the plot, we can either get the axis object and modify it subsequently:

ax = stock.plot.pie()
ax.set_aspect(1)
ax.set_ylabel(None);
ax.set_title(None);

import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1, figsize=(9, 3))
stock.plot.bar(ax=ax)
ax.set_title("Current stock");

We can also fetch the data from the series and drive the plotting "manually" ourselves:

import matplotlib.pyplot as plt

names = list(stock.index)
values = list(stock.values)

fig, ax = plt.subplots(1, 1, figsize=(9, 3))
ax.bar(names, values)
ax.set_title('Stock');

Missing values¶

"Real" data sets tend to be incomplete. Dealing with missing values is an important topic in data science. The agreement in Pandas is that the special floating point value "NaN" (standing for Not a Number) represents missing data points. For example, if we have a table for the stock, but we don't know the value for apple, we would replace it with NaN.

The special Nan value in Python can be created using float('nan') or using numpy.nan if the numpy module is imported.

stock['apple'] = float('nan')

stock

apple      NaN
orange     3.0
banana    22.0
dtype: float64

Note that the dtype of the stock Series object has changed from int64 to float64 when we assigned NaN to apple: the whole series has been converted to float, because NaN is only defined for floating point numbers.

(There is a proposal to create a NaN object as part of pandas - this would overcome the above limitation.)

Assume we need to calculate how many items of stock we have in total using the sum function:

stock.values

array([nan,  3., 22.])

A common situation is that we have an incomplete Series or DataFrame (which are multiple Series with the same index) and we want to process with our analysis, but treat the missing values in a special way.

stock.sum()

25.0

The above example sum shows that NaN values are simply ignored, which can be convenient.

We can also 'tidy up' the Series object, by removing all entries that have a NaN value:

stock.dropna()

orange     3.0
banana    22.0
dtype: float64

Series data access: explicit and implicit (`loc` and `iloc`)¶

stock = pd.Series({'apple': 10, 
                   'orange': 3,
                   'banana': 22,
                   'cucumber' : 1,
                   'potato' : 110})

stock

apple        10
orange        3
banana       22
cucumber      1
potato      110
dtype: int64

Indexing¶

We can access single values through their index as if the stock Series object would be a dictionary:

stock['banana']

22

There is an equivalent and recommended way of using this retrieval using the loc (for LOCation?) attribute:

stock.loc['banana']

22

For convenience, pandas also (!) allows us to use integer indexing into the Series object. This is called implicit indexing as the series Object doesn't use integers as the index, but the name of the fruits.

For example, we can also retrieve the value for banana through its implicit index 2, because it is in row 3 of the Series object (which would need index 2 as we start counting from 0):

stock[2]

22

In this example, this works fine and seems convenient, but can become very confusing if the actual index of the object consists of integers. For that reason, the explicit (and recommended way) of using the indirect indexing is through the iloc (ImplicitLOCation) attribute:

stock.iloc[2]

22

Slicing¶

stock

apple        10
orange        3
banana       22
cucumber      1
potato      110
dtype: int64

We can also slice the Series:

stock['orange':'potato']

orange        3
banana       22
cucumber      1
potato      110
dtype: int64

Or skip every second entry:

stock['orange':'potato':2]

orange      3
cucumber    1
dtype: int64

Data manipulation¶

Numerical operations on the series object can be carried for all data values at the same time inthe same way that numpy arrays are processed:

stock - stock.mean()

apple      -19.2
orange     -26.2
banana      -7.2
cucumber   -28.2
potato      80.8
dtype: float64

import numpy as np

np.sqrt(stock)

apple        3.162278
orange       1.732051
banana       4.690416
cucumber     1.000000
potato      10.488088
dtype: float64

Where preferred, we can extract the numpy array and work with that:

data = stock.values

type(data)

numpy.ndarray

data - data.mean()

array([-19.2, -26.2,  -7.2, -28.2,  80.8])

Import and Export¶

Pandas (and its objects Series and DataFrame) support export to and import from a number of useful formats.

For example, we can write a Series object into a comma separated value file:

stock.to_csv('stock.csv', header=False)

#NBVAL_IGNORE_OUTPUT
!cat stock.csv

apple,10
orange,3
banana,22
cucumber,1
potato,110

We can also create a $\LaTeX$ representation of the table:

stock.to_latex()

'\\begin{tabular}{lr}\n\\toprule\n{} &    0 \\\\\n\\midrule\napple    &   10 \\\\\norange   &    3 \\\\\nbanana   &   22 \\\\\ncucumber &    1 \\\\\npotato   &  110 \\\\\n\\bottomrule\n\\end{tabular}\n'

We'll come back to reading from files in the DataFrame section.

Data Frame¶

Stock Example - `DataFrame`¶

After having introduced the Series object above, we will focus on the second important type in pandas: the DataFrame.

As a first description, we could say that the DataFrame is similar to a (2d) spreadsheet: it contains rows and columns.

The series object we have studied above is a special case of the DataFrame, where the DataFrame has only one column.

We'll continue with our stock example:

stock

apple        10
orange        3
banana       22
cucumber      1
potato      110
dtype: int64

In addition to tracking how many objects of each type we have stocked, we have a second Series object that provides the price per item at which the item is sold:

price = pd.Series({'apple': 0.55, 'banana': 0.50, 'cucumber' : 0.99, 'potato' : 0.17, 'orange': 1.76})
price

apple       0.55
banana      0.50
cucumber    0.99
potato      0.17
orange      1.76
dtype: float64

The DataFrame object allows us to treat the two series together. In fact, a convenient way to create the DataFrame object is to combine a number of series as follows:

shop = pd.DataFrame({'stock' : stock, 'price' : price})
shop

Because both Series objects had the same index elements, our data is nicely aligned in the DataFrame with name shop, even though the data was stored in different order in the price and stock.

If one Series is missing a data point, pandas will insert a NaN entry into that field:

price2 = price.copy()

price2['grapefruit'] = 1.99
price2

apple         0.55
banana        0.50
cucumber      0.99
potato        0.17
orange        1.76
grapefruit    1.99
dtype: float64

pd.DataFrame({'stock' : stock, 'price' : price2})

Accessing data in a DataFramea¶

shop

The data frame has an index which is the same for all columns, and shown in bold in the left most column. We can also ask for it:

shop.index

Index(['apple', 'banana', 'cucumber', 'orange', 'potato'], dtype='object')

Each column has name (here stock and price):

shop.columns

Index(['stock', 'price'], dtype='object')

Extracting columns of data¶

Using the column names, we can extract one column into a Series object using the index operator ([]):

shop['stock']

apple        10
banana       22
cucumber      1
orange        3
potato      110
Name: stock, dtype: int64

shop['price']

apple       0.55
banana      0.50
cucumber    0.99
orange      1.76
potato      0.17
Name: price, dtype: float64

Extracting rows of data¶

We have two options of extracting a row of data.

First, explicit indexing using the label of the index in that row:

shop.loc['apple']             # single row is returned as series

stock    10.00
price     0.55
Name: apple, dtype: float64

shop.loc['banana':'cucumber']  # multiple rows are returned as DataFrame

Second, we can use the implicit indexing (as for Series objects):

shop.iloc[0]

stock    10.00
price     0.55
Name: apple, dtype: float64

shop.iloc[1:3]

Warning¶

Note that there are some inconsistencies here: the explicit slicing with index labels (such as .loc['banana':'cucumber']) is inclusive of cucumber, whereas in the implicit slicing (such as .iloc[1:3]) the row with index 3 is not included.

The behaviour of .loc is convenient and a good design choice if labels such as strings in our stock example are used. The behaviour of .iloc is reflecting the normal Python behaviour.

It is thus understandable how we have arrived at the situation.

Data manipulation with `shop`¶

The real strength of the DataFrames is that we can continue to process the data conveniently.

For example, we could work out the financial value of the items we have in stock, and add this as an extra column:

shop['value'] = shop['price'] * shop['stock']
shop

Of course we can compute the sum, for example, to estimate the value of the total stock:

shop['value'].sum()

41.47

If, for whatever reason, we want to swap columns with rows, we can transpose the data frame like a numpy array:

shop.transpose()

Example: European population 2017¶

Here is a second example to demonstrate some use cases of pandas DataFrames.

First, we get the data. It is originally from EUROSTAT (reference "demo_gind")

#NBVAL_IGNORE_OUTPUT
!wget https://fangohr.github.io/data/eurostat/population2017/eu-pop-2017.csv

--2019-06-19 20:36:43--  https://fangohr.github.io/data/eurostat/population2017/eu-pop-2017.csv
Resolving fangohr.github.io... 185.199.108.153, 185.199.109.153, 185.199.111.153, ...
Connecting to fangohr.github.io|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1087 (1.1K) [text/csv]
Saving to: ‘eu-pop-2017.csv’

eu-pop-2017.csv     100%[===================>]   1.06K  --.-KB/s    in 0s      

2019-06-19 20:36:43 (61.0 MB/s) - ‘eu-pop-2017.csv’ saved [1087/1087]

The data source is a comma-separated-value file (CSV), which looks like this:

#NBVAL_IGNORE_OUTPUT
!head eu-pop-2017.csv

geo,pop17,pop18,births,deaths
Belgium ,11351727,11413058,119690,109666
Bulgaria,7101859,7050034,63955,109791
Czechia,10578820,10610055,114405,111443
Denmark,5748769,5781190,61397,53261
Germany,82521653,82850000,785000,933000
Estonia ,1315634,1319133,13784,15543
Ireland,4784383,4838259,62084,30324
Greece,10768193,10738868,88523,124530
Spain,46527039,46659302,390024,421269

Pandas has very strong support of reading files from different formats, including MS Excel, CSV, HDF5 and others. Each reading routine has a number of options to tailor the process.

Many data science projects leave the data in their original files, and use a few lines of Python code to import it.

df = pd.read_csv('eu-pop-2017.csv')

df

We look at the dataframe as it is, and use the 'head()' command which will only show the first 5 lines of data:

df.head()

The meaning of the colums, we have to get from metada information. In this case, we have the following description of the data:

geo: the country in question
pop17: the population count of that country as of 1 January 2017
pop18: the population count of that country as of 1 January 2018
births: the number of (live) births in the country during the year 2017
deaths: the number of deaths in that country during the year 2017

The data is provided for all of the 28 European Union members (as of 2017).

We want to use the country as the country name as the index. We can achieve this either with

df2 = df.set_index('geo')

df2.head()

Note that we cannot change the index in a given DataFrame, so the set_index() method returns a new DataFrame. (This happens for many operations.)

An as alternative, we can also modify the import statement to already indicate which column we want to use as the index:

df = pd.read_csv('eu-pop-2017.csv', index_col="geo")

df.head()

We explore the data by plotting some of it:

df.plot(kind='bar', y='pop17')

<matplotlib.axes._subplots.AxesSubplot at 0x11c2c40b8>

The above shows the population as of 1 Jan 2017.

We'll try to improve this in two ways:

we want to count population in millions. We can do this by dividing all the data by $10^6$.
it would be interesting to sort the countries in order of size for this plot.

df_millions = df / 1e6

df_millions['pop17'].sort_values(ascending=False).plot(kind='bar')

<matplotlib.axes._subplots.AxesSubplot at 0x11c39e748>

The example above selects one column from the data frame (['pop17') and that returns a Series object. Then we sort this Series object using sort_values() according to the values (that's the number of poeple in each country), then we plot this.

Alternatively, we could also create a plot for the whole data frame, but say that the pop17 is the column for sorting, and that we want to plot only the column with pop17:

df_millions.sort_values(by='pop17').plot(kind='bar', y='pop17')

<matplotlib.axes._subplots.AxesSubplot at 0x11c4fa908>

We can also plot more than one column at the same time:

ax = df_millions.sort_values(by='pop17').plot(kind='bar', y=['pop17', 'pop18'])

We can also fine tune the plot with the usual matplotlib commands:

ax = df_millions.sort_values(by='pop17').plot(kind='bar', y='pop17', figsize=(10, 4))
ax.set_ylabel("population 2017 [in millions]")
ax.grid()
ax.set_xlabel(None);  # get rid of default label for x-axis ('geo')

Based on the number of births and deaths, we can compute change in population for each country for 2017. This is sometimes called the "natural-change":

df['natural-change'] = df['births'] - df['deaths']

df['natural-change'].sort_values()

geo
Italy            -190910
Germany          -148000
Romania           -71125
Bulgaria          -45836
Hungary           -37231
Greece            -36007
Spain             -31245
Portugal          -23432
Croatia           -16921
Lithuania         -11446
Latvia             -7929
Finland            -3401
Estonia            -1759
Poland              -870
Slovenia            -268
Malta                748
Luxembourg          1911
Czechia             2962
Cyprus              3232
Slovakia            4055
Austria             4363
Denmark             8136
Belgium            10024
Netherlands        19173
Sweden             23444
Ireland            31760
United Kingdom    147871
France            164550
Name: natural-change, dtype: int64

From this, we can see that the population change due to births and deaths in Italy and Germany is decreasing most in absolute terms.

To relate this to the overall size of the population, one often uses rates per year and per 1000 people in the country, such as the birth rate per 1000 inhabitants [1] (and death rate accordingly):

[1] https://en.wikipedia.org/wiki/Birth_rate

df['birth-rate'] = df['births'] / df['pop17'] * 1000
df['death-rate'] = df['deaths'] / df['pop17'] * 1000
df['natural-change-rate'] = df['natural-change'] / df['pop17'] * 1000

df.head()

We can now look at the natural rate of change of population for each country, which is normalised by the population in that country.

ax = df.sort_values(by='natural-change-rate').plot(kind='bar', y='natural-change-rate', figsize=(10, 4))
ax.set_title("Natural change due to births and deaths per 1000 in 2017");

We can show the data together with the underlying birth and death rate data:

tmp = df.sort_values(by='natural-change-rate')

fig, axes = plt.subplots(2, 1, figsize=(12, 6))

tmp.plot(kind='bar', y=['natural-change-rate'], sharex=True, ax=axes[0])
axes[0].set_title("Population change per 1000 in 2017")
tmp.plot(kind='bar', y=['death-rate', 'birth-rate'], sharex=True, ax=axes[1])

<matplotlib.axes._subplots.AxesSubplot at 0x11cb4f5f8>

We haven't used the information we have about the population on 1 January 2018 yet.

Let's first look at the absolute changes in the population based on the (census?) data from 1 Jan 2017 and 1 Jan 2018:

df['change'] = df['pop18'] - df['pop17']

ax = df.sort_values(by='change').plot(y='change', kind='bar')
ax.set_title("Total change in population per country in 2017");

With that information, we can estimate migration. (It is important to note that this estimated number will also absorb all inaccuracies or changes of the data gathering method, in the original data described as "statistical adjustment".)

df['migration'] = df['change'] - df['natural-change']

df.head()

Let's plot the total change of the population per country in the top subfigure, and the contribution from natural changes and migration in the lower subfigure:

tmp = df.sort_values(by='change')
fig, axes = plt.subplots(2, 1, figsize=(12, 6))

tmp.plot(kind='bar', y=['change'], sharex=True, ax=axes[0])
axes[0].set_title("Population changes in 2017")
axes[0].legend(['total change of population (migration + natural change due to deaths and births'])
tmp.plot(kind='bar', y=['migration', 'natural-change'], sharex=True, ax=axes[1])
axes[1].legend(['Migration', "natural change due to deaths and births"])
axes[1].set_xlabel(None);

	stock	price
apple	10	0.55
banana	22	0.50
cucumber	1	0.99
orange	3	1.76
potato	110	0.17

	stock	price
apple	10.0	0.55
banana	22.0	0.50
cucumber	1.0	0.99
grapefruit	NaN	1.99
orange	3.0	1.76
potato	110.0	0.17

	stock	price	value
apple	10	0.55	5.50
banana	22	0.50	11.00
cucumber	1	0.99	0.99
orange	3	1.76	5.28
potato	110	0.17	18.70

	apple	banana	cucumber	orange	potato
stock	10.00	22.0	1.00	3.00	110.00
price	0.55	0.5	0.99	1.76	0.17
value	5.50	11.0	0.99	5.28	18.70

	geo	pop17	pop18	births	deaths
0	Belgium	11351727	11413058	119690	109666
1	Bulgaria	7101859	7050034	63955	109791
2	Czechia	10578820	10610055	114405	111443
3	Denmark	5748769	5781190	61397	53261
4	Germany	82521653	82850000	785000	933000
5	Estonia	1315634	1319133	13784	15543
6	Ireland	4784383	4838259	62084	30324
7	Greece	10768193	10738868	88523	124530
8	Spain	46527039	46659302	390024	421269
9	France	66989083	67221943	767691	603141
10	Croatia	4154212	4105493	36556	53477
11	Italy	60589445	60483973	458151	649061
12	Cyprus	854802	864236	9229	5997
13	Latvia	1950116	1934379	20828	28757
14	Lithuania	2847904	2808901	28696	40142
15	Luxembourg	590667	602005	6174	4263
16	Hungary	9797561	9778371	94646	131877
17	Malta	460297	475701	4319	3571
18	Netherlands	17081507	17181084	169200	150027
19	Austria	8772865	8822267	87633	83270
20	Poland	37972964	37976687	401982	402852
21	Portugal	10309573	10291027	86154	109586
22	Romania	19644350	19523621	189474	260599
23	Slovenia	2065895	2066880	20241	20509
24	Slovakia	5435343	5443120	57969	53914
25	Finland	5503297	5513130	50321	53722
26	Sweden	9995153	10120242	115416	91972
27	United Kingdom	65808573	66238007	755043	607172