Pandas: Basics#

Overview#

Why Series and Data Frames?

  1. Idea:

    • Give 1-D and 2-D data more structure, information, and methods than plain vectors and matrices have.

    • Access rows and columns by general indices/labels/names.

    • Store non-numerical data and data of different types (strings, numbers, …) in one object.

  2. Advantages of Series and data frames compared to spreadsheets: Data frames and Series get really powerful when

    • you have to handle so-called “big data”, where the overview is easily lost.

    • you have to do automated data processing, repeating similar operations on new data many times.

    • you have to debug your system.

    • you need transparency of the processing steps. In a spreadsheet the processing steps (changes and generation of new cells) are not saved and are thus not transparent to an outsider.

We will use pandas, a software library for the Python programming language for data manipulation and analysis.

In computer science there are many other data structures and data management systems, e.g., relational databases and SQL, graphs, n-dimensional arrays, … The pandas DataFrame is comparable to R’s data frame concept.

Contents:

In this introduction we will only touch on some fundamental topics (mostly using small, synthetic data), such as:

  • data classes: Series, Data Frames

  • indexing and slicing: slice and dice

  • handling of missing values

  • methods: describe, correlation, diff/pct_change, shifting, general transformations, sorting

  • file input/output

  • date and time data

  • visualization

  • random data generation

  • real examples

References:

For more detailed documentation see the pandas documentation or the book Python for Data Analysis by Wes McKinney.

import pandas as pd
import numpy as np

Series#

A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

s = pd.Series(data, index)

Typically, a Series object is created by reading some data from a file. Here, we create a simple synthetic Series whose values are of data type float.

s = pd.Series(data=[2, 3, 1, 2, 5], 
              index=['Max','Emil', 'Sarah','David','Ilvy'],
              dtype=float)
s
Max      2.0
Emil     3.0
Sarah    1.0
David    2.0
Ilvy     5.0
dtype: float64

Change the data type of the values to int.

print(s.dtype)
s = s.astype(int)
print(s.dtype)
float64
int64

Give a name to the series and the indices:

s.name = 'grades'
s.index.name= 'students'
s
students
Max      2
Emil     3
Sarah    1
David    2
Ilvy     5
Name: grades, dtype: int64

Indexing and Slicing:

s.index
Index(['Max', 'Emil', 'Sarah', 'David', 'Ilvy'], dtype='object', name='students')

Get the value with index = ‘Sarah’:

s['Sarah']
1

Get the sub-series corresponding to a list of index labels:

s[['Ilvy','Emil']]
students
Ilvy    5
Emil    3
Name: grades, dtype: int64
Slice by integer position:

s.iloc[2:4]
students
Sarah    1
David    2
Name: grades, dtype: int64
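Label-based and position-based access can also be made explicit with .loc and .iloc; a small sketch using the series above:

s.loc['Sarah']   # by label, equivalent to s['Sarah']
s.iloc[2]        # by integer position: the third entry, which is also Sarah's grade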

Boolean Indexing:

s < 5
students
Max       True
Emil      True
Sarah     True
David     True
Ilvy     False
Name: grades, dtype: bool
s[s < 5]
students
Max      2
Emil     3
Sarah    1
David    2
Name: grades, dtype: int64

Iterated Indexing:

s[s < 5][s > 2]
students
Emil    3
Name: grades, dtype: int64

Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses.

s[(s < 5) & (s > 2) | (s == 1)]
students
Emil     3
Sarah    1
Name: grades, dtype: int64
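Membership filtering can be written compactly with isin; a small sketch that reproduces the selection above:

s[s.isin([1, 3])]   # keep entries whose value is 1 or 3, i.e., Emil and Sarah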

Queries:

2 in s.values
True
'Maxx' in s.index
False
len(s[s == 2])
2
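A related query is how often each value occurs; a small sketch using value_counts:

s.value_counts()   # frequency of each value; here the grade 2 occurs twice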

Operations on the values of a series:

s*2 + s
students
Max       6
Emil      9
Sarah     3
David     6
Ilvy     15
Name: grades, dtype: int64

Missing values: the NaN (Not a Number) value represents a missing value.

Add a missing value:

s['Isa'] = np.nan
s
students
Max      2.0
Emil     3.0
Sarah    1.0
David    2.0
Ilvy     5.0
Isa      NaN
Name: grades, dtype: float64

Delete an entry

del s['Isa']
s
students
Max      2.0
Emil     3.0
Sarah    1.0
David    2.0
Ilvy     5.0
Name: grades, dtype: float64

Alternatively, you can add entries by concatenating another Series using pd.concat (the former append method is deprecated):

s2 = pd.Series({'Isa': np.nan})
s = pd.concat([s, s2])
s
Max      2.0
Emil     3.0
Sarah    1.0
David    2.0
Ilvy     5.0
Isa      NaN
dtype: float64

Query Null (i.e. NaN) values

s.isnull()
Max      False
Emil     False
Sarah    False
David    False
Ilvy     False
Isa       True
dtype: bool
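Instead of dropping a missing value, it can also be replaced; a minimal sketch:

s.fillna(0)   # returns a copy in which NaN is replaced by 0; s itself is unchanged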

Drop NaN values

s = s.dropna()
s
Max      2.0
Emil     3.0
Sarah    1.0
David    2.0
Ilvy     5.0
dtype: float64

Data Frames#

A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments.

df = pd.DataFrame(data, index, columns)

Typically, a DataFrame object is created by reading some data from a file. Here, we create a simple synthetic DataFrame from a dictionary.

data = {'state': ['Ohio','Ohio','Ohio','Nevada','Nevada'],
        'year': [2000,2001,2002,2001,2002], 
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data)
df
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9

Change the order of the columns (reindex columns):

df = df.reindex(columns=['year','state','pop'])
df
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
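reindex also works on rows; a small sketch that reverses the current integer index:

df.reindex([4, 3, 2, 1, 0])   # rows are returned in the given order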

Let’s add a new column (‘debt’) with no data values and give the rows index labels:

df = pd.DataFrame(data, 
                  columns= ['pop','state','year','debt'], 
                  index=['one','two','three','four','five'])
# alternative way: df['debt'] = np.nan
df
pop state year debt
one 1.5 Ohio 2000 NaN
two 1.7 Ohio 2001 NaN
three 3.6 Ohio 2002 NaN
four 2.4 Nevada 2001 NaN
five 2.9 Nevada 2002 NaN

Let’s extend the index, thereby adding a new row (‘six’) whose data values are missing:

df = pd.DataFrame(df, index = ['one','two','three','four','five', 'six'])
df
pop state year debt
one 1.5 Ohio 2000.0 NaN
two 1.7 Ohio 2001.0 NaN
three 3.6 Ohio 2002.0 NaN
four 2.4 Nevada 2001.0 NaN
five 2.9 Nevada 2002.0 NaN
six NaN NaN NaN NaN

Append a row with known values (pd.concat replaces the deprecated DataFrame.append method):

row = pd.Series({'pop': 3, 'debt': np.nan, 'state': 'Texas', 'year': 2000}, name='seven')
df = pd.concat([df, row.to_frame().T])
df
pop state year debt
one 1.5 Ohio 2000.0 NaN
two 1.7 Ohio 2001.0 NaN
three 3.6 Ohio 2002.0 NaN
four 2.4 Nevada 2001.0 NaN
five 2.9 Nevada 2002.0 NaN
six NaN NaN NaN NaN
seven 3.0 Texas 2000.0 NaN

Access the columns and index of the data frame:

df.columns
Index(['pop', 'state', 'year', 'debt'], dtype='object')
df.index
Index(['one', 'two', 'three', 'four', 'five', 'six', 'seven'], dtype='object')

Indexing#

For more cf. http://pandas.pydata.org/pandas-docs/stable/indexing.html

Indexing a column by its name/label: Note that for data frames the []-operator selects columns (and not index labels as with series)!

df['state']
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six         NaN
seven     Texas
Name: state, dtype: object

Note that the returned object is a Series:

type(df['state'])
pandas.core.series.Series

An alternative way to get one column (if its name is a valid Python identifier):

df.state
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six         NaN
seven     Texas
Name: state, dtype: object

Index more than one column to get a sliced DataFrame:

df[['year','pop']]
year pop
one 2000.0 1.5
two 2001.0 1.7
three 2002.0 3.6
four 2001.0 2.4
five 2002.0 2.9
six NaN NaN
seven 2000.0 3.0

Index a row by an integer indicating the row number:

df.iloc[0]   # df[0] results in an error!
pop         1.5
state      Ohio
year     2000.0
debt        NaN
Name: one, dtype: object

Slice some rows with integers:

df.iloc[1:3]
pop state year debt
two 1.7 Ohio 2001.0 NaN
three 3.6 Ohio 2002.0 NaN

Index one or more rows by name/label:

df.loc['one']
pop         1.5
state      Ohio
year     2000.0
debt        NaN
Name: one, dtype: object
df.loc[['one','three']]
pop state year debt
one 1.5 Ohio 2000.0 NaN
three 3.6 Ohio 2002.0 NaN

Select both rows and columns: The first argument refers to row selection, the second to column selection.

df.loc[['two','three'], ['pop','state']]
pop state
two 1.7 Ohio
three 3.6 Ohio
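The same combined selection works positionally with iloc; a small sketch equivalent to the cell above:

df.iloc[1:3, 0:2]   # rows 'two' and 'three', columns 'pop' and 'state'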

Boolean Indexing:

df['year'] != 2001
one       True
two      False
three     True
four     False
five      True
six       True
seven     True
Name: year, dtype: bool
df[df['year'] != 2001]
pop state year debt
one 1.5 Ohio 2000.0 NaN
three 3.6 Ohio 2002.0 NaN
five 2.9 Nevada 2002.0 NaN
six NaN NaN NaN NaN
seven 3.0 Texas 2000.0 NaN
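The same filter can also be expressed as a string with the query method; a small sketch:

df.query('year != 2001')   # equivalent to df[df['year'] != 2001]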

Dropping rows and columns:

df.drop('six')   # The default axis is 0 (=rows).
pop state year debt
one 1.5 Ohio 2000.0 NaN
two 1.7 Ohio 2001.0 NaN
three 3.6 Ohio 2002.0 NaN
four 2.4 Nevada 2001.0 NaN
five 2.9 Nevada 2002.0 NaN
seven 3.0 Texas 2000.0 NaN
df.drop('debt', axis=1)
pop state year
one 1.5 Ohio 2000.0
two 1.7 Ohio 2001.0
three 3.6 Ohio 2002.0
four 2.4 Nevada 2001.0
five 2.9 Nevada 2002.0
six NaN NaN NaN
seven 3.0 Texas 2000.0
Add a new column:

df['new'] = [1, 2, 3, 4, 5, 6, 7]
df
pop state year debt new
one 1.5 Ohio 2000.0 NaN 1
two 1.7 Ohio 2001.0 NaN 2
three 3.6 Ohio 2002.0 NaN 3
four 2.4 Nevada 2001.0 NaN 4
five 2.9 Nevada 2002.0 NaN 5
six NaN NaN NaN NaN 6
seven 3.0 Texas 2000.0 NaN 7
Add a new row by label (and drop it again below):

df.loc['eight'] = [1, 'as', 2000, 3, 2]
df
pop state year debt new
one 1.5 Ohio 2000.0 NaN 1
two 1.7 Ohio 2001.0 NaN 2
three 3.6 Ohio 2002.0 NaN 3
four 2.4 Nevada 2001.0 NaN 4
five 2.9 Nevada 2002.0 NaN 5
six NaN NaN NaN NaN 6
seven 3.0 Texas 2000.0 NaN 7
eight 1.0 as 2000.0 3 2
df = df.drop('eight')
df
pop state year debt new
one 1.5 Ohio 2000.0 NaN 1
two 1.7 Ohio 2001.0 NaN 2
three 3.6 Ohio 2002.0 NaN 3
four 2.4 Nevada 2001.0 NaN 4
five 2.9 Nevada 2002.0 NaN 5
six NaN NaN NaN NaN 6
seven 3.0 Texas 2000.0 NaN 7

Missing Values#

For more cf. http://pandas.pydata.org/pandas-docs/stable/missing_data.html

Dropping missing values: by default, dropna() drops every row that contains at least one NaN. Since every row has a NaN in the debt column, nothing remains:

df.dropna()
pop state year debt new

With how='all', only those columns (axis=1) are dropped in which all values are NaN, i.e., missing:

df.dropna(how='all', axis=1)
pop state year new
one 1.5 Ohio 2000.0 1
two 1.7 Ohio 2001.0 2
three 3.6 Ohio 2002.0 3
four 2.4 Nevada 2001.0 4
five 2.9 Nevada 2002.0 5
six NaN NaN NaN 6
seven 3.0 Texas 2000.0 7
df.dropna(how='all', axis=0)  # nothing is dropped: even row 'six' has a non-NaN value in 'new'
pop state year debt new
one 1.5 Ohio 2000.0 NaN 1
two 1.7 Ohio 2001.0 NaN 2
three 3.6 Ohio 2002.0 NaN 3
four 2.4 Nevada 2001.0 NaN 4
five 2.9 Nevada 2002.0 NaN 5
six NaN NaN NaN NaN 6
seven 3.0 Texas 2000.0 NaN 7

Filling missing values:

df.fillna('unknown')
pop state year debt new
one 1.5 Ohio 2000.0 unknown 1
two 1.7 Ohio 2001.0 unknown 2
three 3.6 Ohio 2002.0 unknown 3
four 2.4 Nevada 2001.0 unknown 4
five 2.9 Nevada 2002.0 unknown 5
six unknown unknown unknown unknown 6
seven 3.0 Texas 2000.0 unknown 7
df.fillna({'state': 'unknown', 'debt': 0})
pop state year debt new
one 1.5 Ohio 2000.0 0 1
two 1.7 Ohio 2001.0 0 2
three 3.6 Ohio 2002.0 0 3
four 2.4 Nevada 2001.0 0 4
five 2.9 Nevada 2002.0 0 5
six NaN unknown NaN 0 6
seven 3.0 Texas 2000.0 0 7
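Missing values can also be filled from neighboring rows; a small sketch using forward filling:

df.ffill()   # propagate the last valid value in each column downwards

Here row ‘six’ would be filled from row ‘five’, while debt stays NaN because that column contains no valid value at all.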

Concatenation#

For more cf. http://pandas.pydata.org/pandas-docs/stable/merging.html

df1 = pd.DataFrame({'A': [1,2,3], 'B':[4,3,1]}, 
                   index = ['Max','Eric','Maria'])
df1
A B
Max 1 4
Eric 2 3
Maria 3 1
df2 = pd.DataFrame({'B': [3,1,0], 'C':[4,3,1]}, 
                   index = ['Eric','Maria','Anna'])
df2
B C
Eric 3 4
Maria 1 3
Anna 0 1
pd.concat([df1, df2])  # default: axis=0
A B C
Max 1.0 4 NaN
Eric 2.0 3 NaN
Maria 3.0 1 NaN
Eric NaN 3 4.0
Maria NaN 1 3.0
Anna NaN 0 1.0
pd.concat([df1, df2], axis=1)
A B B C
Max 1.0 4.0 NaN NaN
Eric 2.0 3.0 3.0 4.0
Maria 3.0 1.0 1.0 3.0
Anna NaN NaN 0.0 1.0
df3 = pd.concat([df1, df2], axis=1, join='inner')
df3
A B B C
Eric 2 3 3 4
Maria 3 1 1 3

Dropping duplicates: drop_duplicates() removes duplicate rows. To remove the duplicated column B here, we transpose the data frame, drop the duplicate rows, and transpose back.

df3.T.drop_duplicates().T 
A B C
Eric 2 3 4
Maria 3 1 3
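The merging documentation linked above also covers index-based joins; a small sketch combining df1 and df2 (suffixes are needed because both have a column B):

df1.join(df2, lsuffix='_df1', rsuffix='_df2')   # left join on the index of df1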

Methods#

Sorting#

df.sort_index()   # default: sort rows/index in alphabetical order
pop state year debt new
five 2.9 Nevada 2002.0 NaN 5
four 2.4 Nevada 2001.0 NaN 4
one 1.5 Ohio 2000.0 NaN 1
seven 3.0 Texas 2000.0 NaN 7
six NaN NaN NaN NaN 6
three 3.6 Ohio 2002.0 NaN 3
two 1.7 Ohio 2001.0 NaN 2
df.sort_index(axis=1)  # sort columns in alphabetical order
debt new pop state year
one NaN 1 1.5 Ohio 2000.0
two NaN 2 1.7 Ohio 2001.0
three NaN 3 3.6 Ohio 2002.0
four NaN 4 2.4 Nevada 2001.0
five NaN 5 2.9 Nevada 2002.0
six NaN 6 NaN NaN NaN
seven NaN 7 3.0 Texas 2000.0
df['pop'].sort_values()  # order Series by its values
one      1.5
two      1.7
four     2.4
five     2.9
seven    3.0
three    3.6
six      NaN
Name: pop, dtype: float64
df.sort_values(by='pop') # order the whole data frame by the values of a column
pop state year debt new
one 1.5 Ohio 2000.0 NaN 1
two 1.7 Ohio 2001.0 NaN 2
four 2.4 Nevada 2001.0 NaN 4
five 2.9 Nevada 2002.0 NaN 5
seven 3.0 Texas 2000.0 NaN 7
three 3.6 Ohio 2002.0 NaN 3
six NaN NaN NaN NaN 6
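sort_values has further options; a small sketch sorting in descending order with NaN placed first:

df.sort_values(by='pop', ascending=False, na_position='first')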

Ranking:

s
Max      2.0
Emil     3.0
Sarah    1.0
David    2.0
Ilvy     5.0
dtype: float64
s.rank()
Max      2.5
Emil     4.0
Sarah    1.0
David    2.5
Ilvy     5.0
dtype: float64
s.rank().sort_values()
Sarah    1.0
Max      2.5
David    2.5
Emil     4.0
Ilvy     5.0
dtype: float64
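How ties are resolved is controlled by the method parameter; a small sketch:

s.rank(method='min')   # Max and David both get rank 2.0 instead of the average 2.5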

Elementwise function application#

df['pop'].map(np.log)
one      0.405465
two      0.530628
three    1.280934
four     0.875469
five     1.064711
six           NaN
seven    1.098612
Name: pop, dtype: float64
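apply does the same elementwise job on a Series; a small sketch:

df['pop'].apply(lambda x: x ** 2)   # elementwise, like map; NaN stays NaN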

Summarizing and Descriptive Statistics#

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index = ['a','b','c','d'],
                  columns = ['one','two'])
df
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
df.sum()
one    9.25
two   -5.80
dtype: float64
df.sum(axis=1)
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
df.mean(axis=1, skipna=True)
a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64
df.idxmax()
one    b
two    d
dtype: object
df.cumsum()
one two
a 1.40 NaN
b 8.50 -4.5
c NaN NaN
d 9.25 -5.8
df.describe()
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000

Correlation and Covariance#

from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

import pandas_datareader as pdr

# For ticker symbols see: https://www.google.com/finance
BP   = pdr.get_data_yahoo("BP" ,  start='2010-01-02', end='2016-11-11')
XOM  = pdr.get_data_yahoo("XOM",  start='2010-01-02', end='2016-11-11')
PXD  = pdr.get_data_yahoo("PXD" , start='2010-01-02', end='2016-11-11')
type(BP)
pandas.core.frame.DataFrame
BP.head()
High Low Open Close Volume Adj Close
Date
2010-01-04 59.450001 59.080002 59.299999 59.150002 3956100.0 30.011429
2010-01-05 59.900002 59.310001 59.650002 59.570000 4109600.0 30.224531
2010-01-06 59.919998 59.340000 59.520000 59.880001 6227900.0 30.381807
2010-01-07 60.000000 59.689999 59.919998 59.860001 4431300.0 30.371668
2010-01-08 60.060001 59.669998 59.790001 60.000000 3786100.0 30.442690

Create a DataFrame whose columns are the closing stock prices:

price = pd.DataFrame({'BP': BP.Close, 'XOM': XOM.Close, 'PXD':PXD.Close})
price.head(3)
BP XOM PXD
Date
2010-01-04 59.150002 69.150002 50.980000
2010-01-05 59.570000 69.419998 51.000000
2010-01-06 59.880001 70.019997 51.889999

Create a DataFrame whose columns are the trading volumes of the stocks:

volume = pd.DataFrame( {'BP': BP.Volume, 'XOM': XOM.Volume, 'PXD':PXD.Volume})
volume.describe()
BP XOM PXD
count 1.729000e+03 1.729000e+03 1.729000e+03
mean 9.389188e+06 1.643562e+07 1.868674e+06
std 1.510144e+07 8.312840e+06 9.655614e+05
min 1.724500e+06 4.156600e+06 2.533000e+05
25% 4.778700e+06 1.086030e+07 1.253700e+06
50% 6.401400e+06 1.424370e+07 1.651600e+06
75% 9.112600e+06 1.959910e+07 2.245800e+06
max 2.408085e+08 1.180235e+08 1.405610e+07

Compute the percentage changes (1% is given as 0.01) from one trading day to the next and print the tail of the resulting data frame:

returns = price.pct_change()
returns.tail()
BP XOM PXD
Date
2016-11-07 0.014311 0.022496 0.013617
2016-11-08 -0.015873 -0.001638 0.009225
2016-11-09 0.010753 0.011019 0.015254
2016-11-10 0.002955 0.009275 0.009679
2016-11-11 -0.022392 -0.015853 -0.040907
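pct_change is closely related to shifting (listed in the contents above): it is the elementwise ratio to the previous row minus one. A small sketch reproducing it with shift:

(price / price.shift(1) - 1).tail()   # equals returns.tail() up to floating-point error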

Correlation and Covariance:

returns.BP.corr(returns.XOM)
0.63458431357487
returns.corr()
BP XOM PXD
BP 1.000000 0.634584 0.528361
XOM 0.634584 1.000000 0.619057
PXD 0.528361 0.619057 1.000000
returns.cov()
BP XOM PXD
BP 0.000323 0.000137 0.000226
XOM 0.000137 0.000145 0.000177
PXD 0.000226 0.000177 0.000566
returns.corrwith(volume)
BP    -0.073794
XOM   -0.040337
PXD   -0.041891
dtype: float64
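Correlations between stocks are not constant over time; rolling windows make this visible. A small sketch with a hypothetical window of 60 trading days:

returns['BP'].rolling(60).corr(returns['XOM'])   # rolling 60-day correlation between BP and XOM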