Pandas: Basics#

Overview#

Why Series and Data Frames?

  1. Idea:

    • Give 1-D and 2-D data more structure, information, and methods than plain vectors and matrices have.

    • Access rows and columns by general indices/labels/names.

    • Store non-numerical data and data of different types (strings, numbers, …) in one object.

  2. Advantages of Series and data frames compared to spreadsheets: Data frames and Series get really powerful when

    • you have to handle so-called “big data”, where the overview is easily lost.

    • you have to do automated data processing, repeating similar operations on new data many times.

    • you have to debug your system.

    • you need transparency of the processing steps. In a spreadsheet the processing steps (changes and generation of new cells) are not saved and are thus not transparent to an outsider.

We will use pandas, a software library for the Python programming language for data manipulation and analysis.

In computer science there are many other data structures and data management systems, e.g., relational databases and SQL, graphs, n-dimensional arrays, … The pandas DataFrame is comparable to R’s data frame concept.

Contents:

In this introduction we will only touch on some fundamental topics (mostly using small, synthetic data), such as:

  • data classes: Series, Data Frames

  • indexing and slicing: slice and dice

  • handling of missing values

  • methods: describe, correlation, diff/pct_change, shifting, general transformations, sorting

  • file input/output

  • date and time data

  • visualization

  • random data generation

  • real examples

References:

For more detailed documentation see the pandas documentation or the book Python for Data Analysis by Wes McKinney.

import pandas as pd
import numpy as np

Series#

A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

s = pd.Series(data, index)

Typically, a Series object is created by reading some data from a file. Here, we create a simple synthetic Series whose values are of data type float.

s = pd.Series(data=[2, 3, 1, 2, 5], 
              index=['Max','Emil', 'Sarah','David','Ilvy'],
              dtype=float)
s
Max      2.0
Emil     3.0
Sarah    1.0
David    2.0
Ilvy     5.0
dtype: float64

Change the data type of the values to int.

print(s.dtype)
s = s.astype(int)
print(s.dtype)
float64
int64

Give a name to the series and the indices:

s.name = 'grades'
s.index.name= 'students'
s
students
Max      2
Emil     3
Sarah    1
David    2
Ilvy     5
Name: grades, dtype: int64

Indexing and Slicing:

s.index
Index(['Max', 'Emil', 'Sarah', 'David', 'Ilvy'], dtype='object', name='students')

Get the value with index = ‘Sarah’:

s['Sarah']
1

Get the sub-series corresponding to a list of index labels:

s[['Ilvy','Emil']]
students
Ilvy    5
Emil    3
Name: grades, dtype: int64
Slice by integer position:

s.iloc[2:4]
students
Sarah    1
David    2
Name: grades, dtype: int64
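Label-based and position-based access can also be made explicit with .loc and .iloc; a small sketch using the series above:

s.loc['Sarah']   # by label, equivalent to s['Sarah']
s.iloc[2]        # by integer position: the third entry, which is also Sarah's grade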

Boolean Indexing:

s < 5
students
Max       True
Emil      True
Sarah     True
David     True
Ilvy     False
Name: grades, dtype: bool
s[s < 5]
students
Max      2
Emil     3
Sarah    1
David    2
Name: grades, dtype: int64

Iterated Indexing:

s[s < 5][s > 2]
students
Emil    3
Name: grades, dtype: int64

Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses.

s[(s < 5) & (s > 2) | (s == 1)]
students
Emil     3
Sarah    1
Name: grades, dtype: int64
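Membership filtering can be written compactly with isin; a small sketch that reproduces the selection above:

s[s.isin([1, 3])]   # keep entries whose value is 1 or 3, i.e., Emil and Sarah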

Queries:

2 in s.values
True
'Maxx' in s.index
False
len(s[s == 2])
2
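A related query is how often each value occurs; a small sketch using value_counts:

s.value_counts()   # frequency of each value; here the grade 2 occurs twice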

Operations on the values of a series:

s*2 + s
students
Max       6
Emil      9
Sarah     3
David     6
Ilvy     15
Name: grades, dtype: int64

Missing values: the NaN (Not a Number) value represents a missing value.

Add a missing value:

s['Isa'] = np.nan
s
students
Max      2.0
Emil     3.0
Sarah    1.0
David    2.0
Ilvy     5.0
Isa      NaN
Name: grades, dtype: float64

Delete an entry

del s['Isa']
s
students
Max      2.0
Emil     3.0
Sarah    1.0
David    2.0
Ilvy     5.0
Name: grades, dtype: float64

Alternatively, you can add entries by concatenating another Series using pd.concat (the former append method is deprecated):

s2 = pd.Series({'Isa': np.nan})
s = pd.concat([s, s2])
s
Max      2.0
Emil     3.0
Sarah    1.0
David    2.0
Ilvy     5.0
Isa      NaN
dtype: float64

Query Null (i.e. NaN) values

s.isnull()
Max      False
Emil     False
Sarah    False
David    False
Ilvy     False
Isa       True
dtype: bool
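Instead of dropping a missing value, it can also be replaced; a minimal sketch:

s.fillna(0)   # returns a copy in which NaN is replaced by 0; s itself is unchanged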

Drop NaN values

s = s.dropna()
s
Max      2.0
Emil     3.0
Sarah    1.0
David    2.0
Ilvy     5.0
dtype: float64

Data Frames#

A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments.

df = pd.DataFrame(data, index, columns)

Typically, a DataFrame object is created by reading some data from a file. Here, we create a simple synthetic DataFrame from a dictionary.

data = {'state': ['Ohio','Ohio','Ohio','Nevada','Nevada'],
        'year': [2000,2001,2002,2001,2002], 
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data)
df
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9

Change the order of the columns (reindex columns):

df = df.reindex(columns=['year','state','pop'])
df
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
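reindex also works on rows; a small sketch that reverses the current integer index:

df.reindex([4, 3, 2, 1, 0])   # rows are returned in the given order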

Let’s add a new column (‘debt’) with no data values and give the rows index labels:

df = pd.DataFrame(data, 
                  columns= ['pop','state','year','debt'], 
                  index=['one','two','three','four','five'])
# alternative way: df['debt'] = np.nan
df
pop state year debt
one 1.5 Ohio 2000 NaN
two 1.7 Ohio 2001 NaN
three 3.6 Ohio 2002 NaN
four 2.4 Nevada 2001 NaN
five 2.9 Nevada 2002 NaN

Let’s extend the index, thereby adding a new row (‘six’) whose data values are missing:

df = pd.DataFrame(df, index = ['one','two','three','four','five', 'six'])
df
pop state year debt
one 1.5 Ohio 2000.0 NaN
two 1.7 Ohio 2001.0 NaN
three 3.6 Ohio 2002.0 NaN
four 2.4 Nevada 2001.0 NaN
five 2.9 Nevada 2002.0 NaN
six NaN NaN NaN NaN

Append a row with known values (pd.concat replaces the deprecated DataFrame.append method):

row = pd.Series({'pop': 3, 'debt': np.nan, 'state': 'Texas', 'year': 2000}, name='seven')
df = pd.concat([df, row.to_frame().T])
df
pop state year debt
one 1.5 Ohio 2000.0 NaN
two 1.7 Ohio 2001.0 NaN
three 3.6 Ohio 2002.0 NaN
four 2.4 Nevada 2001.0 NaN
five 2.9 Nevada 2002.0 NaN
six NaN NaN NaN NaN
seven 3.0 Texas 2000.0 NaN

Access the columns and index of the data frame:

df.columns
Index(['pop', 'state', 'year', 'debt'], dtype='object')
df.index
Index(['one', 'two', 'three', 'four', 'five', 'six', 'seven'], dtype='object')

Indexing#

For more cf. http://pandas.pydata.org/pandas-docs/stable/indexing.html

Indexing a column by its name/label: Note that for data frames the []-operator selects columns (and not index labels as with series)!

df['state']
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six         NaN
seven     Texas
Name: state, dtype: object

Note that the returned object is a Series:

type(df['state'])
pandas.core.series.Series

An alternative way to get one column (if its name is a valid Python identifier):

df.state
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six         NaN
seven     Texas
Name: state, dtype: object

Index more than one column to get a sliced DataFrame:

df[['year','pop']]
year pop
one 2000.0 1.5
two 2001.0 1.7
three 2002.0 3.6
four 2001.0 2.4
five 2002.0 2.9
six NaN NaN
seven 2000.0 3.0

Index a row by an integer indicating the row number:

df.iloc[0]   # df[0] results in an error!
pop         1.5
state      Ohio
year     2000.0
debt        NaN
Name: one, dtype: object

Slice some rows with integers:

df.iloc[1:3]
pop state year debt
two 1.7 Ohio 2001.0 NaN
three 3.6 Ohio 2002.0 NaN

Index one or more rows by name/label:

df.loc['one']
pop         1.5
state      Ohio
year     2000.0
debt        NaN
Name: one, dtype: object
df.loc[['one','three']]
pop state year debt
one 1.5 Ohio 2000.0 NaN
three 3.6 Ohio 2002.0 NaN

Select both rows and columns: The first argument refers to row selection, the second to column selection.

df.loc[['two','three'], ['pop','state']]
pop state
two 1.7 Ohio
three 3.6 Ohio
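The same combined selection works positionally with iloc; a small sketch equivalent to the cell above:

df.iloc[1:3, 0:2]   # rows 'two' and 'three', columns 'pop' and 'state'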

Boolean Indexing:

df['year'] != 2001
one       True
two      False
three     True
four     False
five      True
six       True
seven     True
Name: year, dtype: bool
df[df['year'] != 2001]
pop state year debt
one 1.5 Ohio 2000.0 NaN
three 3.6 Ohio 2002.0 NaN
five 2.9 Nevada 2002.0 NaN
six NaN NaN NaN NaN
seven 3.0 Texas 2000.0 NaN
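The same filter can also be expressed as a string with the query method; a small sketch:

df.query('year != 2001')   # equivalent to df[df['year'] != 2001]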

Dropping rows and columns:

df.drop('six')   # The default axis is 0 (=rows).
pop state year debt
one 1.5 Ohio 2000.0 NaN
two 1.7 Ohio 2001.0 NaN
three 3.6 Ohio 2002.0 NaN
four 2.4 Nevada 2001.0 NaN
five 2.9 Nevada 2002.0 NaN
seven 3.0 Texas 2000.0 NaN
df.drop('debt', axis=1)
pop state year
one 1.5 Ohio 2000.0
two 1.7 Ohio 2001.0
three 3.6 Ohio 2002.0
four 2.4 Nevada 2001.0
five 2.9 Nevada 2002.0
six NaN NaN NaN
seven 3.0 Texas 2000.0
Add a new column:

df['new'] = [1, 2, 3, 4, 5, 6, 7]
df
pop state year debt new
one 1.5 Ohio 2000.0 NaN 1
two 1.7 Ohio 2001.0 NaN 2
three 3.6 Ohio 2002.0 NaN 3
four 2.4 Nevada 2001.0 NaN 4
five 2.9 Nevada 2002.0 NaN 5
six NaN NaN NaN NaN 6
seven 3.0 Texas 2000.0 NaN 7
Add a new row by label (and drop it again below):

df.loc['eight'] = [1, 'as', 2000, 3, 2]
df
pop state year debt new
one 1.5 Ohio 2000.0 NaN 1
two 1.7 Ohio 2001.0 NaN 2
three 3.6 Ohio 2002.0 NaN 3
four 2.4 Nevada 2001.0 NaN 4
five 2.9 Nevada 2002.0 NaN 5
six NaN NaN NaN NaN 6
seven 3.0 Texas 2000.0 NaN 7
eight 1.0 as 2000.0 3 2
df = df.drop('eight')
df
pop state year debt new
one 1.5 Ohio 2000.0 NaN 1
two 1.7 Ohio 2001.0 NaN 2
three 3.6 Ohio 2002.0 NaN 3
four 2.4 Nevada 2001.0 NaN 4
five 2.9 Nevada 2002.0 NaN 5
six NaN NaN NaN NaN 6
seven 3.0 Texas 2000.0 NaN 7

Missing Values#

For more cf. http://pandas.pydata.org/pandas-docs/stable/missing_data.html

Dropping missing values: by default, dropna() drops every row that contains at least one NaN. Since every row has a NaN in the debt column, nothing remains:

df.dropna()
pop state year debt new

With how='all', only those columns (axis=1) are dropped in which all values are NaN, i.e., missing:

df.dropna(how='all', axis=1)
pop state year new
one 1.5 Ohio 2000.0 1
two 1.7 Ohio 2001.0 2
three 3.6 Ohio 2002.0 3
four 2.4 Nevada 2001.0 4
five 2.9 Nevada 2002.0 5
six NaN NaN NaN 6
seven 3.0 Texas 2000.0 7
df.dropna(how='all', axis=0)  # nothing is dropped: even row 'six' has a non-NaN value in 'new'
pop state year debt new
one 1.5 Ohio 2000.0 NaN 1
two 1.7 Ohio 2001.0 NaN 2
three 3.6 Ohio 2002.0 NaN 3
four 2.4 Nevada 2001.0 NaN 4
five 2.9 Nevada 2002.0 NaN 5
six NaN NaN NaN NaN 6
seven 3.0 Texas 2000.0 NaN 7

Filling missing values:

df.fillna('unknown')
pop state year debt new
one 1.5 Ohio 2000.0 unknown 1
two 1.7 Ohio 2001.0 unknown 2
three 3.6 Ohio 2002.0 unknown 3
four 2.4 Nevada 2001.0 unknown 4
five 2.9 Nevada 2002.0 unknown 5
six unknown unknown unknown unknown 6
seven 3.0 Texas 2000.0 unknown 7
df.fillna({'state': 'unknown', 'debt': 0})
pop state year debt new
one 1.5 Ohio 2000.0 0 1
two 1.7 Ohio 2001.0 0 2
three 3.6 Ohio 2002.0 0 3
four 2.4 Nevada 2001.0 0 4
five 2.9 Nevada 2002.0 0 5
six NaN unknown NaN 0 6
seven 3.0 Texas 2000.0 0 7
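Missing values can also be filled from neighboring rows; a small sketch using forward filling:

df.ffill()   # propagate the last valid value in each column downwards

Here row ‘six’ would be filled from row ‘five’, while debt stays NaN because that column contains no valid value at all.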

Concatenation#

For more cf. http://pandas.pydata.org/pandas-docs/stable/merging.html

df1 = pd.DataFrame({'A': [1,2,3], 'B':[4,3,1]}, 
                   index = ['Max','Eric','Maria'])
df1
A B
Max 1 4
Eric 2 3
Maria 3 1
df2 = pd.DataFrame({'B': [3,1,0], 'C':[4,3,1]}, 
                   index = ['Eric','Maria','Anna'])
df2
B C
Eric 3 4
Maria 1 3
Anna 0 1
pd.concat([df1, df2])  # default: axis=0
A B C
Max 1.0 4 NaN
Eric 2.0 3 NaN
Maria 3.0 1 NaN
Eric NaN 3 4.0
Maria NaN 1 3.0
Anna NaN 0 1.0
pd.concat([df1, df2], axis=1)
A B B C
Max 1.0 4.0 NaN NaN
Eric 2.0 3.0 3.0 4.0
Maria 3.0 1.0 1.0 3.0
Anna NaN NaN 0.0 1.0
df3 = pd.concat([df1, df2], axis=1, join='inner')
df3
A B B C
Eric 2 3 3 4
Maria 3 1 1 3

Dropping duplicates: drop_duplicates() removes duplicate rows. To remove the duplicated column B here, we transpose the data frame, drop the duplicate rows, and transpose back.

df3.T.drop_duplicates().T 
A B C
Eric 2 3 4
Maria 3 1 3
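The merging documentation linked above also covers index-based joins; a small sketch combining df1 and df2 (suffixes are needed because both have a column B):

df1.join(df2, lsuffix='_df1', rsuffix='_df2')   # left join on the index of df1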

Methods#

Sorting#

df.sort_index()   # default: sort rows/index in alphabetical order
pop state year debt new
five 2.9 Nevada 2002.0 NaN 5
four 2.4 Nevada 2001.0 NaN 4
one 1.5 Ohio 2000.0 NaN 1
seven 3.0 Texas 2000.0 NaN 7
six NaN NaN NaN NaN 6
three 3.6 Ohio 2002.0 NaN 3
two 1.7 Ohio 2001.0 NaN 2
df.sort_index(axis=1)  # sort columns in alphabetical order
debt new pop state year
one NaN 1 1.5 Ohio 2000.0
two NaN 2 1.7 Ohio 2001.0
three NaN 3 3.6 Ohio 2002.0
four NaN 4 2.4 Nevada 2001.0
five NaN 5 2.9 Nevada 2002.0
six NaN 6 NaN NaN NaN
seven NaN 7 3.0 Texas 2000.0
df['pop'].sort_values()  # order Series by its values
one      1.5
two      1.7
four     2.4
five     2.9
seven    3.0
three    3.6
six      NaN
Name: pop, dtype: float64
df.sort_values(by='pop') # order the whole data frame by the values of a column
pop state year debt new
one 1.5 Ohio 2000.0 NaN 1
two 1.7 Ohio 2001.0 NaN 2
four 2.4 Nevada 2001.0 NaN 4
five 2.9 Nevada 2002.0 NaN 5
seven 3.0 Texas 2000.0 NaN 7
three 3.6 Ohio 2002.0 NaN 3
six NaN NaN NaN NaN 6
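sort_values has further options; a small sketch sorting in descending order with NaN placed first:

df.sort_values(by='pop', ascending=False, na_position='first')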

Ranking:

s
Max      2.0
Emil     3.0
Sarah    1.0
David    2.0
Ilvy     5.0
dtype: float64
s.rank()
Max      2.5
Emil     4.0
Sarah    1.0
David    2.5
Ilvy     5.0
dtype: float64
s.rank().sort_values()
Sarah    1.0
Max      2.5
David    2.5
Emil     4.0
Ilvy     5.0
dtype: float64
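How ties are resolved is controlled by the method parameter; a small sketch:

s.rank(method='min')   # Max and David both get rank 2.0 instead of the average 2.5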

Elementwise function application#

df['pop'].map(np.log)
one      0.405465
two      0.530628
three    1.280934
four     0.875469
five     1.064711
six           NaN
seven    1.098612
Name: pop, dtype: float64
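apply does the same elementwise job on a Series; a small sketch:

df['pop'].apply(lambda x: x ** 2)   # elementwise, like map; NaN stays NaN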

Summarizing and Descriptive Statistics#

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index = ['a','b','c','d'],
                  columns = ['one','two'])
df
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
df.sum()
one    9.25
two   -5.80
dtype: float64
df.sum(axis=1)
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
df.mean(axis=1, skipna=True)
a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64
df.idxmax()
one    b
two    d
dtype: object
df.cumsum()
one two
a 1.40 NaN
b 8.50 -4.5
c NaN NaN
d 9.25 -5.8
df.describe()
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000

Correlation and Covariance#

from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

import pandas_datareader as pdr

# For ticker symbols see: https://www.google.com/finance
BP   = pdr.get_data_yahoo("BP" ,  start='2010-01-02', end='2016-11-11')
XOM  = pdr.get_data_yahoo("XOM",  start='2010-01-02', end='2016-11-11')
PXD  = pdr.get_data_yahoo("PXD" , start='2010-01-02', end='2016-11-11')
type(BP)
pandas.core.frame.DataFrame
BP.head()
High Low Open Close Volume Adj Close
Date
2010-01-04 59.450001 59.080002 59.299999 59.150002 3956100.0 30.011429
2010-01-05 59.900002 59.310001 59.650002 59.570000 4109600.0 30.224531
2010-01-06 59.919998 59.340000 59.520000 59.880001 6227900.0 30.381807
2010-01-07 60.000000 59.689999 59.919998 59.860001 4431300.0 30.371668
2010-01-08 60.060001 59.669998 59.790001 60.000000 3786100.0 30.442690

Create a DataFrame whose columns are the closing stock prices:

price = pd.DataFrame({'BP': BP.Close, 'XOM': XOM.Close, 'PXD':PXD.Close})
price.head(3)
BP XOM PXD
Date
2010-01-04 59.150002 69.150002 50.980000
2010-01-05 59.570000 69.419998 51.000000
2010-01-06 59.880001 70.019997 51.889999

Create a DataFrame whose columns are the trading volumes of the stocks:

volume = pd.DataFrame( {'BP': BP.Volume, 'XOM': XOM.Volume, 'PXD':PXD.Volume})
volume.describe()
BP XOM PXD
count 1.729000e+03 1.729000e+03 1.729000e+03
mean 9.389188e+06 1.643562e+07 1.868674e+06
std 1.510144e+07 8.312840e+06 9.655614e+05
min 1.724500e+06 4.156600e+06 2.533000e+05
25% 4.778700e+06 1.086030e+07 1.253700e+06
50% 6.401400e+06 1.424370e+07 1.651600e+06
75% 9.112600e+06 1.959910e+07 2.245800e+06
max 2.408085e+08 1.180235e+08 1.405610e+07

Compute the percentage changes (1% is given as 0.01) from one trading day to the next and print the tail of the resulting data frame:

returns = price.pct_change()
returns.tail()
BP XOM PXD
Date
2016-11-07 0.014311 0.022496 0.013617
2016-11-08 -0.015873 -0.001638 0.009225
2016-11-09 0.010753 0.011019 0.015254
2016-11-10 0.002955 0.009275 0.009679
2016-11-11 -0.022392 -0.015853 -0.040907
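pct_change is closely related to shifting (listed in the contents above): it is the elementwise ratio to the previous row minus one. A small sketch reproducing it with shift:

(price / price.shift(1) - 1).tail()   # equals returns.tail() up to floating-point error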

Correlation and Covariance:

returns.BP.corr(returns.XOM)
0.63458431357487
returns.corr()
BP XOM PXD
BP 1.000000 0.634584 0.528361
XOM 0.634584 1.000000 0.619057
PXD 0.528361 0.619057 1.000000
returns.cov()
BP XOM PXD
BP 0.000323 0.000137 0.000226
XOM 0.000137 0.000145 0.000177
PXD 0.000226 0.000177 0.000566
returns.corrwith(volume)
BP    -0.073794
XOM   -0.040337
PXD   -0.041891
dtype: float64
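Correlations between stocks are not constant over time; rolling windows make this visible. A small sketch with a hypothetical window of 60 trading days:

returns['BP'].rolling(60).corr(returns['XOM'])   # rolling 60-day correlation between BP and XOM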