5 Pandas#

When working with large, complex datasets it is often useful to be able to manipulate the data in a variety of ways. Numpy is great for numerical calculations on arrays, but it is less well suited to manipulating labelled, heterogeneous data.

5.1 Series and Dataframes#

Pandas is a python package which builds on top of Numpy. It is designed to allow fast and efficient manipulation of data structures. It introduces two basic datatypes:

  1. A Series - a 1d labelled array

  2. A DataFrame - a 2d labelled array (a table)

These datatypes are built on top of Numpy arrays, but the key difference is that the data is indexed not just by sequential numbers but by labels. This will become clearer as we study them.

5.1.1 Creating Series and DataFrames#

We can create both Series and DataFrames from other datatypes like lists, numpy arrays or dictionaries. See the example below, or the docs for full details (Series and DataFrames)

Be aware that when Pandas moved to v2, certain functions and methods were removed (they had been deprecated in earlier versions and are no longer available). There are lots of examples on the internet still using old methods which won't work. Check the docs if you are unsure.
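For example, DataFrame.append was removed in pandas 2.0 but still appears in many older tutorials. A minimal sketch of the replacement, using pd.concat:

import pandas as pd

df_old = pd.DataFrame({'a': [1, 2, 3]})
new_row = pd.DataFrame({'a': [4]})

# df_old.append(new_row) raises an AttributeError in pandas 2.x; concatenate instead
df_new = pd.concat([df_old, new_row], ignore_index=True)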

import pandas as pd

#Create a series
my_data = [3,5,7,9,11]
data_series = pd.Series(my_data)

#A dataframe
other_data = {'age':[20, 24, 24, 50],'name':['Jane','Tom','Lucy','Bob'], 'gender':['Female', 'Male','Female', 'Male']}
df = pd.DataFrame(other_data)   # You'll see that people often write the dataframe variable as df. 

print(data_series)
df
0     3
1     5
2     7
3     9
4    11
dtype: int64
   age  name  gender
0   20  Jane  Female
1   24   Tom    Male
2   24  Lucy  Female
3   50   Bob    Male

As well as the values we put in, there is an additional column on the left called the index. We didn't specify index labels, so Pandas has created sequential numbers for us. However, part of the power of Pandas is that we can specify a meaningful index when we create the DataFrame, or set the index to one of the columns later.

df=pd.DataFrame(other_data, index=['a','c','e','g'])
df
   age  name  gender
a   20  Jane  Female
c   24   Tom    Male
e   24  Lucy  Female
g   50   Bob    Male
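The other route, setting the index to an existing column after creation, looks like this (a small sketch; note that set_index returns a new DataFrame by default rather than modifying df):

df_by_name = df.set_index('name')   # the 'name' column becomes the index
df_by_name.loc['Lucy']              # rows can now be looked up by name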

5.1.2 Selecting Data in a DataFrame#

This labelled index gives us two ways of referring to a row. We can index by location (df.iloc[row, col]) in a manner similar to numpy, ie the first row is row 0, the second is row 1; the columns are likewise numbered from 0. We can also index by label (df.loc[row_label, col_label]). Note that more than one row may share the same label, in which case .loc will select all of them.

N.B. In Pandas, wherever you want to select multiple rows or columns by passing a list, you need an extra pair of brackets.

print('Indexing by location')
display(df.iloc[0])  #   Get the first row
display(df.iloc[:,0]) #   Get the first column --> returns a Series
display(df.iloc[[1,2],[0,1]]) #   Get the second and third rows' values in the first and second columns. N.B. double brackets
display(df.iloc[0,1]) #   Get the item at the first row and second column

print('\nIndexing by label')
display(df.loc['a'])  #   Get row labelled a
display(df.loc[:,'name']) #   Get column labelled name  --> returns a Series
display(df.loc[:,['name','age']])   #   Get columns labelled name and age --> returns a Dataframe. 
display(df.loc['a','name'])   #   value with row labelled a and column labelled name
Indexing by location
age           20
name        Jane
gender    Female
Name: a, dtype: object
a    20
c    24
e    24
g    50
Name: age, dtype: int64
   age  name
c   24   Tom
e   24  Lucy
'Jane'
Indexing by label
age           20
name        Jane
gender    Female
Name: a, dtype: object
a    Jane
c     Tom
e    Lucy
g     Bob
Name: name, dtype: object
   name  age
a  Jane   20
c   Tom   24
e  Lucy   24
g   Bob   50
'Jane'

5.1.3 Modifying DataFrames#

We can add new parts to a DataFrame or Series and modify the existing ones.

age_series = df['age']  #  Select a single column as a series 
new_df = df[['age','name']] #   Again notice the extra brackets

df['subject'] = ['maths','biology','computer science','physics']    #   Add a new column. You can use a list or np.array, but its length must match the number of rows in the df
df
   age  name  gender           subject
a   20  Jane  Female             maths
c   24   Tom    Male           biology
e   24  Lucy  Female  computer science
g   50   Bob    Male           physics
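We can also modify existing values in place with .loc. A small sketch using the df above (the value 'history' is just an illustration):

df.loc['a', 'age'] = 21                        # change a single value
df.loc[df['age'] > 40, 'subject'] = 'history'  # change all values selected by a condition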

We can also apply conditional filtering to the rows, like this:

names_with_o = df[df['name'].str.contains('o')]     #   A Series has a .str property which allows you to operate on all the strings in a column
display(names_with_o)

young_women = df[(df['age'] < 25) & (df['gender'] == 'Female')]    #   We can use any logic to filter. Find Women under 25.
display(young_women)
   age  name gender  subject
c   24   Tom   Male  biology
g   50   Bob   Male  physics

   age  name  gender           subject
a   20  Jane  Female             maths
e   24  Lucy  Female  computer science

Since each column is effectively a Numpy array, we can also do mathematical and statistical operations on them.

display(df['age']*3 + df['age'])
df['age'].mean()
a     80
c     96
e     96
g    200
Name: age, dtype: int64
29.5
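If you want several summary statistics at once, describe collects the most common ones in a single call:

df['age'].describe()    # count, mean, std, min, quartiles and max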

5.2 Data input and output#

5.2.1 Reading Data from Files#

Whilst we can generate Series and Dataframes like this, it is more common to read a large dataset from a file or database. Pandas accepts many formats. For now we will work with csv files, but you could also use Excel, hdf5 (a hierarchical data format), sql etc, for which there are similar methods. As an example we are going to work with the public Covid-19 dataset compiled by Our World in Data, which draws on the figures collected by Johns Hopkins University (dataset). The file is included in resources/textfiles/owid-covid-data.csv

df_covid = pd.read_csv('resources/textfiles/owid-covid-data.csv') 
df_covid
[Output: a preview of df_covid showing its first and last five rows (Afghanistan in January 2020 through Zimbabwe in June 2023); pandas truncates the display in the middle. The full DataFrame has 320272 rows × 67 columns, including iso_code, continent, location, date, total_cases, new_cases, total_deaths, ..., life_expectancy and population.]

Pandas displays just the beginning and the end of the dataframe, but all the data has been loaded into memory for us to work with.

There are a whole host of options when reading a file, so it is always worth looking at the docs. Not all the columns above are of interest, and to pick out the data we want we might set the index to a sensible column. The method df.head(n=number_of_rows) also allows us to look at a small section of the data.

df_covid = pd.read_csv('resources/textfiles/owid-covid-data.csv', index_col='date',usecols=['location','continent', 'date', 'total_cases','new_cases','total_deaths','total_cases_per_million','hospital_beds_per_thousand','life_expectancy','population'])
df_covid_1stMay = df_covid.loc['2020-05-01']    #The dataset contains an entry for every day, so let's look at one specific date.
df_covid_1stMay.head()
            continent        location  total_cases  new_cases  total_deaths  total_cases_per_million  hospital_beds_per_thousand  life_expectancy    population
date
2020-05-01       Asia     Afghanistan       1937.0      110.0          60.0                   47.096                        0.50            64.83  4.112877e+07
2020-05-01        NaN          Africa      40131.0     2490.0        2614.0                   28.128                         NaN              NaN  1.426737e+09
2020-05-01     Europe         Albania        766.0       16.0          30.0                  269.498                        2.89            78.57  2.842318e+06
2020-05-01     Africa         Algeria       4006.0      158.0         447.0                   89.214                        1.90            76.88  4.490323e+07
2020-05-01    Oceania  American Samoa          NaN        0.0           NaN                      NaN                         NaN            73.74  4.429500e+04

5.2.2 Writing Data to files#

We can also write dataframes to file very easily with df.to_csv (docs). Similar methods exist to write to other formats.

import numpy as np

time = np.linspace(0, 4*np.pi, 100)
df_example = pd.DataFrame({'time': time, 'amplitude': np.sin(0.3*time)})
df_example.to_csv('resources/textfiles/write_example.csv', index=False)   # index=False stops the index being written as an extra column
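As a quick check that the round trip worked, we can read the file straight back in:

pd.read_csv('resources/textfiles/write_example.csv').head()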

5.3.1 Returning vs inplace#

Many methods have a keyword argument called inplace. By default this is usually False, meaning that the method returns a new DataFrame and leaves the original unchanged, ie:

    new_df = df.pandas_method()

However, if you set inplace=True the method modifies the original DataFrame, so there is no return value. This can be really good for memory and speed, but it comes with the trade-off that you've changed the original data, which you might want for something else.

    df.pandas_method(inplace=True)

If odd things happen in your code, where you seem to have lost your DataFrame and python is complaining about NoneTypes, it may be that you've written something like:

    new_df = df.pandas_method(inplace=True)
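Here is the trap made concrete, using the real dropna method:

df_tmp = pd.DataFrame({'a': [1.0, None, 3.0]})
new_df = df_tmp.dropna(inplace=True)   # dropna modified df_tmp and returned None
print(new_df)                          # None - calling new_df.head() later would raise an AttributeError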

5.3.2 Data Cleaning#

In any real-world dataset the data is usually not perfect, and may not be in the format we want. There may be columns we are not interested in; for this example we'll drop new_cases. We can also see that some of the values are not present and have been filled with NaN. This is the same as np.nan we saw before. We have to think carefully about the most appropriate way to handle them. There are several options, but the decision is yours and depends on your data and task:

  1. Leave them:

NaNs can be helpful because they are not included in calculations of the mean or median of a column, and hence don't skew the results the way including 0 or some other substitute would (see the sketch after this list).

  2. Replace them with another value:

You can replace the NaN values with a substitute value, perhaps 0 or something else appropriate.

  3. Remove the row or column where the NaNs appear.
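Here is the sketch promised under option 1, showing that NaNs are skipped rather than treated as zero:

import numpy as np

s = pd.Series([1.0, np.nan, 3.0])
print(s.mean())             # 2.0 - the NaN is simply ignored
print(s.fillna(0).mean())   # 1.33... - substituting 0 drags the mean down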

#NB if I try to run this cell more than once it will error as the data has already been changed. To rerun it, I need to rerun the cell above first.

#Drop the column "new_cases"
df_covid_1stMay.drop(columns='new_cases', inplace=True)     #   Note: no return value - the original df is modified because inplace=True

#Replace the NaNs in hospital_beds_per_thousand with 1
df_covid_1stMay.loc[:,'hospital_beds_per_thousand'].fillna(value=1, inplace=True)

#Drop the rows where either total_cases or life_expectancy contains a NaN
df_covid_1stMay.dropna(inplace=True, subset=['total_cases','life_expectancy'], how='any')

df_covid_1stMay.head(n=5)

5.3.3 Copy vs Views#

Pandas did what we asked but there were some warnings. What are those about?

When we looked at the python datatype list, we saw that if one list was assigned to another name, then changes made through one name affected the other. In Pandas there is a slightly complicated subtlety. Sometimes Pandas returns a copy (ie an independent set of data) and sometimes a view. A view is similar to the effect we saw with lists, in that it accesses the same underlying data, and hence changes to the view will also affect the original. See the example below. It's not always obvious when Pandas does which, but it's something to be aware of if odd things start happening. See this discussion

other_data = {'age':[20, 24, 24, 50],'name':['Jane','Tom','Lucy','Bob'], 'gender':['Female', 'Male','Female', 'Male']}
df = pd.DataFrame(other_data, index=['a','c','e','g'])

view = df['name']
view['g'] = 'Martha' #   Change a value on the Series view but it alters the original DataFrame.
df
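If you want changes to stay independent of the original, ask for a copy explicitly (a minimal sketch; the name 'Sue' is just an illustration):

independent = df['name'].copy()   # an independent copy of the column
independent['g'] = 'Sue'          # modifies only the copy
df                                # the original DataFrame is unchanged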

5.4 Aggregation and grouping#

Often we want to group our data into categories and work out properties of each group. For example, returning to the Covid dataset, let's group the countries by continent. We will then work out the max, min and mean life expectancy for each continent.

pandas groupby

#Create a groupby object
continent_groupby = df_covid.groupby('continent')
print(continent_groupby.groups.keys())  #   Display the categories

#For each group of countries calculate the max, min and mean life expectancy
display(continent_groupby['life_expectancy'].max())
display(continent_groupby['life_expectancy'].min())
display(continent_groupby['life_expectancy'].mean())
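All three statistics can also be computed in one call with agg, which returns one row per continent and one column per statistic:

continent_groupby['life_expectancy'].agg(['max', 'min', 'mean'])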

5.5 TimeSeries Data#

Another common class of tasks is operations on data that forms a sequence, often data that varies over time. For example, we might want to calculate a running average of some noisy data as a way of smoothing it. Our original covid dataset contains the number of new cases every day, for each country. We might expect the day-to-day fluctuations to make this data less useful than a weekly figure, so let's calculate the number of new cases in the UK summed over each week. To do that we will take a 7-day rolling sum.

pandas rolling

import matplotlib.pyplot as plt
import numpy as np

# Extract data for the UK
df_uk = df_covid[df_covid['location'] == 'United Kingdom']
df_weekly=df_uk['new_cases'].rolling(window=7, center=True).sum()
display(df_weekly.loc['2023-05-01':'2023-05-10'])    #   Note: we can slice on the row labels because the index is sequential
#We can then see the effect of the smoothing
fig, ax = plt.subplots()

ax.plot(7*df_uk['new_cases'],'r-')  # Original daily data, scaled by 7 for comparison. Note you can pass a Series directly to matplotlib; it uses the index as the x values.
ax.plot(df_weekly,'b-') # The 7-day rolling sum

xticks = df_weekly.index[::300] #   A bit of slicing to generate a smaller number of labels for the x axis
ax.set_xticks(xticks)
ax.set_xlabel('Date')
ax.set_ylabel('Total New Cases')
ax.set_title('New Covid Cases UK')
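An alternative approach, sketched below, is to resample onto true calendar weeks; this assumes the date index can be parsed by pd.to_datetime:

df_uk2 = df_uk.copy()
df_uk2.index = pd.to_datetime(df_uk2.index)              # convert the string index to real timestamps
weekly_totals = df_uk2['new_cases'].resample('W').sum()  # one value per calendar week
weekly_totals.head()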

5.6 Joining DataFrames#

There are various ways to join dataframes together. The Pandas docs here cover this topic in a lot of detail, and some of the diagrams used are taken from there. There are two main approaches:

5.6.1 Concatenation#

Concatenation joins DataFrames “vertically”. The diagram below shows how we might join three dataframes df1, df2 and df3 of the same width and with the same columns.

[Diagram: df1, df2 and df3 stacked vertically to form a single concatenated DataFrame]
# A simplified example of the above.
df1 = pd.DataFrame({'A':[1,2,3],'B':[2,3,4],'C':[3,4,5]}, index=[1,2,3])
df2 = pd.DataFrame({'A':[4,5,6],'B':[5,6,7],'C':[6,7,8]}, index=[2,3,4])

display(pd.concat([df1,df2]))       #   The index values from df1 and df2 can clash. Sometimes you want to keep them - the index info is important.
display(pd.concat([df1,df2], ignore_index=True))        # Sometimes you don't - ignore_index gives each row a fresh sequential index.

5.6.2 Merging#

There are many different options for joining DataFrames. A merge is often used to join DataFrames horizontally. This can be really helpful if you have two data sources which contain related information; by merging them we can easily work out how the data is related.

If DataFrames have a common index or column, you can use it to join them. They don't have to have the same number of rows: Pandas matches the rows it can and produces NaNs in positions where values don't exist. You can also join on one or more columns. The how keyword controls which rows are retained: those in the left DataFrame ('left'), those in the right DataFrame ('right'), only the common values ('inner'), or all values ('outer').

It is perhaps easiest to see this with an example:

# A simplified example of the above.
df1 = pd.DataFrame({'Name':['Bob','Mike','Jane'],'Subject':['Physics','Maths','Chemistry']}, index=[1,2,3])
df2 = pd.DataFrame({'Name':['Mike','Jane','Luke'],'Year':[1,2,3]}, index=[2,3,4])
display(df1)
display(df2)

print('merge results')

display(pd.merge(df1,df2,on='Name', how='left'))    # Only keep the rows where the Name exists in df1
display(pd.merge(df1,df2,on='Name', how='right'))   # Only keep the rows where the Name exists in df2
display(pd.merge(df1,df2,on='Name', how='inner'))   # Only keep the rows where the Name exists in both df1 and df2
display(pd.merge(df1,df2,on='Name',how='outer'))    # keep all the rows
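You can also join on the index rather than a column. A sketch using the same df1 and df2, whose indices overlap at 2 and 3:

display(pd.merge(df1, df2, left_index=True, right_index=True))  # inner join on the index; the clashing Name columns get _x/_y suffixes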

5.7 Speeding things up in Pandas#

There are several simple things that you can do to make your Pandas code run faster:

  1. Just as with Numpy, you should avoid writing code that operates on a DataFrame by looping over each row sequentially in Python. The in-built, vectorised Pandas operations will be much faster (see the sketch after this list).

  2. Where possible, use inplace=True as a keyword argument. This modifies the original data but generally runs faster and uses less memory.

  3. Use the index to access rows rather than boolean selection (see the example below).

  4. The order of the index can make a huge difference. Sorting it and then using the index to access data can result in huge speed-ups. The example below shows that, once sorted, a lookup drops from about 25ms to 200μs!
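Here is the sketch promised in point 1; df_demo is just a made-up frame for illustration. The vectorised version does the arithmetic in one call into optimised code, while the loop pays Python overhead on every row:

df_demo = pd.DataFrame({'x': range(100_000)})

df_demo['y'] = df_demo['x'] * 2                                  # vectorised: fast
df_demo['y_slow'] = [row.x * 2 for row in df_demo.itertuples()]  # row-by-row Python loop: much slower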

# Working with the df_uk frame
df_uk = df_covid[df_covid['location'] == 'United Kingdom']
df_uk.head()
%%timeit
# This example uses .loc, which is (a bit) quicker than the next cell, where we access the same row using boolean indexing
df_uk.loc['2022-06-01']
%%timeit
df_uk[df_uk.index=='2022-06-01']
%%timeit
# There is a big difference if we sort the index. This cell does the same lookup on the full df_covid dataset; its rows are ordered by location and then by date, so the date index is not sorted and the lookup is pretty slow.
df_covid['location'].loc['2021-05-03']
%%timeit
# Sorting the dataframe takes a while, but if we want to run operations on it many times it is better to sort once and reuse the result.
df_covid2 = df_covid.copy()
df_covid2.sort_index(inplace=True)
# N.B. a %%timeit cell runs in its own scope, so the df_covid2 created above is not kept. Create the sorted copy for real before timing lookups on it.
df_covid2 = df_covid.sort_index()
%%timeit
# Now the index is sorted, the .loc lookup is really fast.
df_covid2['location'].loc['2021-05-03']