Pandas: Basics
Contents
Pandas: Basics#
Overview#
Why Series and Data Frames?
Idea:
Give 1-D and 2D data more structure, information and methods than vectors and matrices have.
Access rows and columns by general indices/labels/names.
Store non-numerical data and data of different types (strings, numbers, …) in one object.
Advantages of series and data frames compared to spreadheets: Data frames and Series get really powerfull when
you have to handle so-called “big data”, where overview is lost easily.
you have to do automated data processing repeating similar operations on new data many times.
you have to debug your system.
you need transparency of the processing steps. In a spreadsheet the processing steps (changes and generation of new cells) are not saved and thus not clear for an outsider.
We will use Pandas, which is a software library written for the Python programming language for data manipulation and analysis.
In computer science there are many different and additional data structures and management software systems, e.g., relational databases and SQL, graphs, n-dim arrays, … The pandas data frame is comparable to R’s data frame concept.
Contents:
In this introduction we will only touch (mostly using synthetic and small data) some fundamental topics like:
data classes: Series, Data Frames
indexing and slicing: slice and dice
handling of missing values
methods: describe, correlation, diff/prct_change, shifting, general transformations, sorting
file input/output
date and time data
visualization
random data generation
real examples
References:
For detailed and more documentation see the Pandas Documentation or the book Python for Data Analysis by Wes McKinney.
import pandas as pd
import numpy as np
Series#
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.
s = pd.Series(data, index)
Typically, a Series object is created by reading some data from a file. Here, we create a simple synthetic Series whose values are of data type float
.
s = pd.Series(data=[2, 3, 1, 2, 5],
index=['Max','Emil', 'Sarah','David','Ilvy'],
dtype=float)
s
Max 2.0
Emil 3.0
Sarah 1.0
David 2.0
Ilvy 5.0
dtype: float64
Change the data type of the values to int
.
print(s.dtype)
s = s.astype(int)
print(s.dtype)
float64
int64
Give a name to the series and the indices:
s.name = 'grades'
s.index.name= 'students'
s
students
Max 2
Emil 3
Sarah 1
David 2
Ilvy 5
Name: grades, dtype: int64
Indexing and Slicing:
s.index
Index(['Max', 'Emil', 'Sarah', 'David', 'Ilvy'], dtype='object', name='students')
Get the value with index = ‘Sarah’:
s['Sarah']
1
Get the series corresponding to a list of indices.
s[['Ilvy','Emil']]
students
Ilvy 5
Emil 3
Name: grades, dtype: int64
s.iloc[2:4]
students
Sarah 1
David 2
Name: grades, dtype: int64
Boolean Indexing:
s < 5
students
Max True
Emil True
Sarah True
David True
Ilvy False
Name: grades, dtype: bool
s[s < 5]
students
Max 2
Emil 3
Sarah 1
David 2
Name: grades, dtype: int64
Interated Indexing
s[s < 5][s > 2]
students
Emil 3
Name: grades, dtype: int64
Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses.
s[ (s < 5) & (s > 2) | (s == 1)]
students
Emil 3
Sarah 1
Name: grades, dtype: int64
Queries:
2 in s.values
True
'Maxx' in s.index
False
len(s[s == 2])
2
Operations on the values of a series:
s*2 + s
students
Max 6
Emil 9
Sarah 3
David 6
Ilvy 15
Name: grades, dtype: int64
Missing values: NaN (Not a Number) value represents a missing value:
Add a missing (NaN .. not a number) value
s['Isa'] = np.NaN
s
students
Max 2.0
Emil 3.0
Sarah 1.0
David 2.0
Ilvy 5.0
Isa NaN
Name: grades, dtype: float64
Delete an entry
del s['Isa']
s
students
Max 2.0
Emil 3.0
Sarah 1.0
David 2.0
Ilvy 5.0
Name: grades, dtype: float64
Alternatively, you can add a Series using the append
method.
s2 = pd.Series({'Isa': np.NaN})
s = pd.concat([s, s2])
s
Max 2.0
Emil 3.0
Sarah 1.0
David 2.0
Ilvy 5.0
Isa NaN
dtype: float64
Query Null (i.e. NaN) values
s.isnull()
Max False
Emil False
Sarah False
David False
Ilvy False
Isa True
dtype: bool
Drop NaN values
s = s.dropna()
s
Max 2.0
Emil 3.0
Sarah 1.0
David 2.0
Ilvy 5.0
dtype: float64
Data Frames#
A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments.
df = pd.DataFrame(data, index, columns)
Typically, a DataFrame object is created by reading some data from a file. Here, we create a simple synthetic DataFrame from a dictionary.
data = {'state': ['Ohio','Ohio','Ohio','Nevada','Nevada'],
'year': [2000,2001,2002,2001,2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data)
df
state | year | pop | |
---|---|---|---|
0 | Ohio | 2000 | 1.5 |
1 | Ohio | 2001 | 1.7 |
2 | Ohio | 2002 | 3.6 |
3 | Nevada | 2001 | 2.4 |
4 | Nevada | 2002 | 2.9 |
Change order of columns = reindex columns
df = df.reindex(columns=['year','state','pop'])
df
year | state | pop | |
---|---|---|---|
0 | 2000 | Ohio | 1.5 |
1 | 2001 | Ohio | 1.7 |
2 | 2002 | Ohio | 3.6 |
3 | 2001 | Nevada | 2.4 |
4 | 2002 | Nevada | 2.9 |
Let’s add a new column name with no data values and let’s give the indices names:
df = pd.DataFrame(data,
columns= ['pop','state','year','debt'],
index=['one','two','three','four','five'])
## alternative way: df['debt'] = NaN
df
pop | state | year | debt | |
---|---|---|---|---|
one | 1.5 | Ohio | 2000 | NaN |
two | 1.7 | Ohio | 2001 | NaN |
three | 3.6 | Ohio | 2002 | NaN |
four | 2.4 | Nevada | 2001 | NaN |
five | 2.9 | Nevada | 2002 | NaN |
Let’s change the index, thereby adding a new row with known data values:
df = pd.DataFrame(df, index = ['one','two','three','four','five', 'six'])
df
pop | state | year | debt | |
---|---|---|---|---|
one | 1.5 | Ohio | 2000.0 | NaN |
two | 1.7 | Ohio | 2001.0 | NaN |
three | 3.6 | Ohio | 2002.0 | NaN |
four | 2.4 | Nevada | 2001.0 | NaN |
five | 2.9 | Nevada | 2002.0 | NaN |
six | NaN | NaN | NaN | NaN |
Apend a row with known values:
row = pd.Series({'pop':3,'debt':np.NaN, 'state':'Texas', 'year':2000}, name='seven')
df = df.append(row)
df
/tmp/ipykernel_22036/311088242.py:2: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
df = df.append(row)
pop | state | year | debt | |
---|---|---|---|---|
one | 1.5 | Ohio | 2000.0 | NaN |
two | 1.7 | Ohio | 2001.0 | NaN |
three | 3.6 | Ohio | 2002.0 | NaN |
four | 2.4 | Nevada | 2001.0 | NaN |
five | 2.9 | Nevada | 2002.0 | NaN |
six | NaN | NaN | NaN | NaN |
seven | 3.0 | Texas | 2000.0 | NaN |
Access the columns and index of the data frame:
df.columns
Index(['pop', 'state', 'year', 'debt'], dtype='object')
df.index
Index(['one', 'two', 'three', 'four', 'five', 'six', 'seven'], dtype='object')
Indexing#
For more cf. http://pandas.pydata.org/pandas-docs/stable/indexing.html
Indexing a column by it’s name/label: Note that for data frames the []-operator selects columns (and not indices as with series)!
df['state']
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
six NaN
seven Texas
Name: state, dtype: object
Note that the returned object is a Series:
type(df['state'])
pandas.core.series.Series
An alternative way to get one column (if its name is one word):
df.state
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
six NaN
seven Texas
Name: state, dtype: object
Index more than one column to get a sliced DataFrame:
df[['year','pop']]
year | pop | |
---|---|---|
one | 2000.0 | 1.5 |
two | 2001.0 | 1.7 |
three | 2002.0 | 3.6 |
four | 2001.0 | 2.4 |
five | 2002.0 | 2.9 |
six | NaN | NaN |
seven | 2000.0 | 3.0 |
Index a row by an integer indicating the row number:
df.iloc[0] # df[0] results in an error!
pop 1.5
state Ohio
year 2000.0
debt NaN
Name: one, dtype: object
Slice some rows with integers:
df.iloc[1:3]
pop | state | year | debt | |
---|---|---|---|---|
two | 1.7 | Ohio | 2001.0 | NaN |
three | 3.6 | Ohio | 2002.0 | NaN |
Index one or more rows by name/label:
df.loc['one']
pop 1.5
state Ohio
year 2000.0
debt NaN
Name: one, dtype: object
df.loc[['one','three']]
pop | state | year | debt | |
---|---|---|---|---|
one | 1.5 | Ohio | 2000.0 | NaN |
three | 3.6 | Ohio | 2002.0 | NaN |
Select both rows and columns: The first argument refers to row selection, the second to column selection.
df.loc[['two','three'], ['pop','state']]
pop | state | |
---|---|---|
two | 1.7 | Ohio |
three | 3.6 | Ohio |
Boolean Indexing:
df['year'] != 2001
one True
two False
three True
four False
five True
six True
seven True
Name: year, dtype: bool
df[df['year'] != 2001]
pop | state | year | debt | |
---|---|---|---|---|
one | 1.5 | Ohio | 2000.0 | NaN |
three | 3.6 | Ohio | 2002.0 | NaN |
five | 2.9 | Nevada | 2002.0 | NaN |
six | NaN | NaN | NaN | NaN |
seven | 3.0 | Texas | 2000.0 | NaN |
Dropping rows and columns:
df.drop('six') # The default axis is 0 (=rows).
pop | state | year | debt | |
---|---|---|---|---|
one | 1.5 | Ohio | 2000.0 | NaN |
two | 1.7 | Ohio | 2001.0 | NaN |
three | 3.6 | Ohio | 2002.0 | NaN |
four | 2.4 | Nevada | 2001.0 | NaN |
five | 2.9 | Nevada | 2002.0 | NaN |
seven | 3.0 | Texas | 2000.0 | NaN |
df.drop('debt', axis=1)
pop | state | year | |
---|---|---|---|
one | 1.5 | Ohio | 2000.0 |
two | 1.7 | Ohio | 2001.0 |
three | 3.6 | Ohio | 2002.0 |
four | 2.4 | Nevada | 2001.0 |
five | 2.9 | Nevada | 2002.0 |
six | NaN | NaN | NaN |
seven | 3.0 | Texas | 2000.0 |
df['new'] = [1,2,3,4,5,6, 7]
df
pop | state | year | debt | new | |
---|---|---|---|---|---|
one | 1.5 | Ohio | 2000.0 | NaN | 1 |
two | 1.7 | Ohio | 2001.0 | NaN | 2 |
three | 3.6 | Ohio | 2002.0 | NaN | 3 |
four | 2.4 | Nevada | 2001.0 | NaN | 4 |
five | 2.9 | Nevada | 2002.0 | NaN | 5 |
six | NaN | NaN | NaN | NaN | 6 |
seven | 3.0 | Texas | 2000.0 | NaN | 7 |
df.loc['eight'] = [1, 'as', 2000, 3,2]
df
pop | state | year | debt | new | |
---|---|---|---|---|---|
one | 1.5 | Ohio | 2000.0 | NaN | 1 |
two | 1.7 | Ohio | 2001.0 | NaN | 2 |
three | 3.6 | Ohio | 2002.0 | NaN | 3 |
four | 2.4 | Nevada | 2001.0 | NaN | 4 |
five | 2.9 | Nevada | 2002.0 | NaN | 5 |
six | NaN | NaN | NaN | NaN | 6 |
seven | 3.0 | Texas | 2000.0 | NaN | 7 |
eight | 1.0 | as | 2000.0 | 3 | 2 |
df = df.drop('eight')
df
pop | state | year | debt | new | |
---|---|---|---|---|---|
one | 1.5 | Ohio | 2000.0 | NaN | 1 |
two | 1.7 | Ohio | 2001.0 | NaN | 2 |
three | 3.6 | Ohio | 2002.0 | NaN | 3 |
four | 2.4 | Nevada | 2001.0 | NaN | 4 |
five | 2.9 | Nevada | 2002.0 | NaN | 5 |
six | NaN | NaN | NaN | NaN | 6 |
seven | 3.0 | Texas | 2000.0 | NaN | 7 |
Missing Values#
For more cf. http://pandas.pydata.org/pandas-docs/stable/missing_data.html
Dropping missing values:
df.dropna()
pop | state | year | debt | new |
---|
Only those columns (axis =1) are dropped where all values are NaN, i.e., missing.
df.dropna(how='all', axis = 1)
pop | state | year | new | |
---|---|---|---|---|
one | 1.5 | Ohio | 2000.0 | 1 |
two | 1.7 | Ohio | 2001.0 | 2 |
three | 3.6 | Ohio | 2002.0 | 3 |
four | 2.4 | Nevada | 2001.0 | 4 |
five | 2.9 | Nevada | 2002.0 | 5 |
six | NaN | NaN | NaN | 6 |
seven | 3.0 | Texas | 2000.0 | 7 |
df.dropna(how='all', axis = 0)
pop | state | year | debt | new | |
---|---|---|---|---|---|
one | 1.5 | Ohio | 2000.0 | NaN | 1 |
two | 1.7 | Ohio | 2001.0 | NaN | 2 |
three | 3.6 | Ohio | 2002.0 | NaN | 3 |
four | 2.4 | Nevada | 2001.0 | NaN | 4 |
five | 2.9 | Nevada | 2002.0 | NaN | 5 |
six | NaN | NaN | NaN | NaN | 6 |
seven | 3.0 | Texas | 2000.0 | NaN | 7 |
Filling missing values:
df.fillna('unknown')
pop | state | year | debt | new | |
---|---|---|---|---|---|
one | 1.5 | Ohio | 2000.0 | unknown | 1 |
two | 1.7 | Ohio | 2001.0 | unknown | 2 |
three | 3.6 | Ohio | 2002.0 | unknown | 3 |
four | 2.4 | Nevada | 2001.0 | unknown | 4 |
five | 2.9 | Nevada | 2002.0 | unknown | 5 |
six | unknown | unknown | unknown | unknown | 6 |
seven | 3.0 | Texas | 2000.0 | unknown | 7 |
df.fillna({'state': 'unknown', 'debt': 0})
pop | state | year | debt | new | |
---|---|---|---|---|---|
one | 1.5 | Ohio | 2000.0 | 0 | 1 |
two | 1.7 | Ohio | 2001.0 | 0 | 2 |
three | 3.6 | Ohio | 2002.0 | 0 | 3 |
four | 2.4 | Nevada | 2001.0 | 0 | 4 |
five | 2.9 | Nevada | 2002.0 | 0 | 5 |
six | NaN | unknown | NaN | 0 | 6 |
seven | 3.0 | Texas | 2000.0 | 0 | 7 |
Concatenation#
For more cf. http://pandas.pydata.org/pandas-docs/stable/merging.html
df1 = pd.DataFrame({'A': [1,2,3], 'B':[4,3,1]},
index = ['Max','Eric','Maria'])
df1
A | B | |
---|---|---|
Max | 1 | 4 |
Eric | 2 | 3 |
Maria | 3 | 1 |
df2 = pd.DataFrame({'B': [3,1,0], 'C':[4,3,1]},
index = ['Eric','Maria','Anna'])
df2
B | C | |
---|---|---|
Eric | 3 | 4 |
Maria | 1 | 3 |
Anna | 0 | 1 |
pd.concat([df1, df2]) # default axis =0
A | B | C | |
---|---|---|---|
Max | 1.0 | 4 | NaN |
Eric | 2.0 | 3 | NaN |
Maria | 3.0 | 1 | NaN |
Eric | NaN | 3 | 4.0 |
Maria | NaN | 1 | 3.0 |
Anna | NaN | 0 | 1.0 |
pd.concat([df1, df2], axis=1)
A | B | B | C | |
---|---|---|---|---|
Max | 1.0 | 4.0 | NaN | NaN |
Eric | 2.0 | 3.0 | 3.0 | 4.0 |
Maria | 3.0 | 1.0 | 1.0 | 3.0 |
Anna | NaN | NaN | 0.0 | 1.0 |
df3 = pd.concat([df1, df2], axis=1, join='inner')
df3
A | B | B | C | |
---|---|---|---|---|
Eric | 2 | 3 | 3 | 4 |
Maria | 3 | 1 | 1 | 3 |
Dropping duplicates: drop_duplicates() removes duplicate inidces. Therefore, transposing the data frame is necessary.
df3.T.drop_duplicates().T
A | B | C | |
---|---|---|---|
Eric | 2 | 3 | 4 |
Maria | 3 | 1 | 3 |
Methods#
Sorting#
df.sort_index() # default: sort rows/index in alphabetical order
pop | state | year | debt | new | |
---|---|---|---|---|---|
five | 2.9 | Nevada | 2002.0 | NaN | 5 |
four | 2.4 | Nevada | 2001.0 | NaN | 4 |
one | 1.5 | Ohio | 2000.0 | NaN | 1 |
seven | 3.0 | Texas | 2000.0 | NaN | 7 |
six | NaN | NaN | NaN | NaN | 6 |
three | 3.6 | Ohio | 2002.0 | NaN | 3 |
two | 1.7 | Ohio | 2001.0 | NaN | 2 |
df.sort_index(axis=1) # sort columns in alphabetical order
debt | new | pop | state | year | |
---|---|---|---|---|---|
one | NaN | 1 | 1.5 | Ohio | 2000.0 |
two | NaN | 2 | 1.7 | Ohio | 2001.0 |
three | NaN | 3 | 3.6 | Ohio | 2002.0 |
four | NaN | 4 | 2.4 | Nevada | 2001.0 |
five | NaN | 5 | 2.9 | Nevada | 2002.0 |
six | NaN | 6 | NaN | NaN | NaN |
seven | NaN | 7 | 3.0 | Texas | 2000.0 |
df['pop'].sort_values() # order Series by its values
one 1.5
two 1.7
four 2.4
five 2.9
seven 3.0
three 3.6
six NaN
Name: pop, dtype: float64
df.sort_values(by='pop') # order the whole data frame by the values of a column
pop | state | year | debt | new | |
---|---|---|---|---|---|
one | 1.5 | Ohio | 2000.0 | NaN | 1 |
two | 1.7 | Ohio | 2001.0 | NaN | 2 |
four | 2.4 | Nevada | 2001.0 | NaN | 4 |
five | 2.9 | Nevada | 2002.0 | NaN | 5 |
seven | 3.0 | Texas | 2000.0 | NaN | 7 |
three | 3.6 | Ohio | 2002.0 | NaN | 3 |
six | NaN | NaN | NaN | NaN | 6 |
Ranking:
s
Max 2.0
Emil 3.0
Sarah 1.0
David 2.0
Ilvy 5.0
dtype: float64
s.rank()
Max 2.5
Emil 4.0
Sarah 1.0
David 2.5
Ilvy 5.0
dtype: float64
s.rank().sort_values()
Sarah 1.0
Max 2.5
David 2.5
Emil 4.0
Ilvy 5.0
dtype: float64
Elementwise function application#
df['pop'].map(np.log)
one 0.405465
two 0.530628
three 1.280934
four 0.875469
five 1.064711
six NaN
seven 1.098612
Name: pop, dtype: float64
Summarizing and Descriptive Statistics#
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
index = ['a','b','c','d'],
columns = ['one','two'])
df
one | two | |
---|---|---|
a | 1.40 | NaN |
b | 7.10 | -4.5 |
c | NaN | NaN |
d | 0.75 | -1.3 |
df.sum()
one 9.25
two -5.80
dtype: float64
df.sum(axis=1)
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64
df.mean(axis=1, skipna=True)
a 1.400
b 1.300
c NaN
d -0.275
dtype: float64
df.idxmax()
one b
two d
dtype: object
df.cumsum()
one | two | |
---|---|---|
a | 1.40 | NaN |
b | 8.50 | -4.5 |
c | NaN | NaN |
d | 9.25 | -5.8 |
df.describe()
one | two | |
---|---|---|
count | 3.000000 | 2.000000 |
mean | 3.083333 | -2.900000 |
std | 3.493685 | 2.262742 |
min | 0.750000 | -4.500000 |
25% | 1.075000 | -3.700000 |
50% | 1.400000 | -2.900000 |
75% | 4.250000 | -2.100000 |
max | 7.100000 | -1.300000 |
Correlation and Covariance#
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
import pandas_datareader as pdr
# Tickersymbole siehe: https://www.google.com/finance
BP = pdr.get_data_yahoo("BP" , start='2010-01-02', end='2016-11-11')
XOM = pdr.get_data_yahoo("XOM", start='2010-01-02', end='2016-11-11')
PXD = pdr.get_data_yahoo("PXD" , start='2010-01-02', end='2016-11-11')
type(BP)
pandas.core.frame.DataFrame
BP.head()
High | Low | Open | Close | Volume | Adj Close | |
---|---|---|---|---|---|---|
Date | ||||||
2010-01-04 | 59.450001 | 59.080002 | 59.299999 | 59.150002 | 3956100.0 | 30.011429 |
2010-01-05 | 59.900002 | 59.310001 | 59.650002 | 59.570000 | 4109600.0 | 30.224531 |
2010-01-06 | 59.919998 | 59.340000 | 59.520000 | 59.880001 | 6227900.0 | 30.381807 |
2010-01-07 | 60.000000 | 59.689999 | 59.919998 | 59.860001 | 4431300.0 | 30.371668 |
2010-01-08 | 60.060001 | 59.669998 | 59.790001 | 60.000000 | 3786100.0 | 30.442690 |
Create a dataframe the columns of which are the closing stock prices:
price = pd.DataFrame({'BP': BP.Close, 'XOM': XOM.Close, 'PXD':PXD.Close})
price.head(3)
BP | XOM | PXD | |
---|---|---|---|
Date | |||
2010-01-04 | 59.150002 | 69.150002 | 50.980000 |
2010-01-05 | 59.570000 | 69.419998 | 51.000000 |
2010-01-06 | 59.880001 | 70.019997 | 51.889999 |
Create a dataframe the columns of which are the volumnes of the stocks:
volume = pd.DataFrame( {'BP': BP.Volume, 'XOM': XOM.Volume, 'PXD':PXD.Volume})
volume.describe()
BP | XOM | PXD | |
---|---|---|---|
count | 1.729000e+03 | 1.729000e+03 | 1.729000e+03 |
mean | 9.389188e+06 | 1.643562e+07 | 1.868674e+06 |
std | 1.510144e+07 | 8.312840e+06 | 9.655614e+05 |
min | 1.724500e+06 | 4.156600e+06 | 2.533000e+05 |
25% | 4.778700e+06 | 1.086030e+07 | 1.253700e+06 |
50% | 6.401400e+06 | 1.424370e+07 | 1.651600e+06 |
75% | 9.112600e+06 | 1.959910e+07 | 2.245800e+06 |
max | 2.408085e+08 | 1.180235e+08 | 1.405610e+07 |
Compute the percentage changes (1% is given as 0.01) from trading day to trading day and print the tail of the resulting data frame:
returns = price.pct_change()
returns.tail()
BP | XOM | PXD | |
---|---|---|---|
Date | |||
2016-11-07 | 0.014311 | 0.022496 | 0.013617 |
2016-11-08 | -0.015873 | -0.001638 | 0.009225 |
2016-11-09 | 0.010753 | 0.011019 | 0.015254 |
2016-11-10 | 0.002955 | 0.009275 | 0.009679 |
2016-11-11 | -0.022392 | -0.015853 | -0.040907 |
Correlation and Covariance:
returns.BP.corr(returns.XOM)
0.63458431357487
returns.corr()
BP | XOM | PXD | |
---|---|---|---|
BP | 1.000000 | 0.634584 | 0.528361 |
XOM | 0.634584 | 1.000000 | 0.619057 |
PXD | 0.528361 | 0.619057 | 1.000000 |
returns.cov()
BP | XOM | PXD | |
---|---|---|---|
BP | 0.000323 | 0.000137 | 0.000226 |
XOM | 0.000137 | 0.000145 | 0.000177 |
PXD | 0.000226 | 0.000177 | 0.000566 |
returns.corrwith(volume)
BP -0.073794
XOM -0.040337
PXD -0.041891
dtype: float64