OSGeoLive-Notebooks/Pandas/01-pandas_introduction.ipynb

2428 lines
49 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--<img width=700px; src=\"../img/logoUPSayPlusCDS_990.png\"> -->\n",
"\n",
"<p style=\"margin-top: 3em; margin-bottom: 2em;\"><b><big><big><big><big>Introduction to Pandas</big></big></big></big></b></p>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"pd.options.display.max_rows = 8"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. Let's start with a showcase\n",
"\n",
"#### Case 1: titanic survival data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df = pd.read_csv(\"data/titanic.csv\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Starting from reading this dataset, to answering questions about this data in a few lines of code:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**What is the age distribution of the passengers?**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['Age'].hist()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**How does the survival rate of the passengers differ between sexes?**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.groupby('Sex')[['Survived']].aggregate(lambda x: x.sum() / len(x))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Or how does it differ between the different classes?**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.groupby('Pclass')['Survived'].aggregate(lambda x: x.sum() / len(x)).plot(kind='bar')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All the needed functionality for the above examples will be explained throughout this tutorial."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Case 2: air quality measurement timeseries"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"AirBase (The European Air quality dataBase): hourly measurements of all air quality monitoring stations from Europe\n",
"\n",
"Starting from these hourly data for different stations:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"data = pd.read_csv('data/20000101_20161231-NO2.csv', sep=';', skiprows=[1], na_values=['n/d'], index_col=0, parse_dates=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"to answering questions about this data in a few lines of code:\n",
"\n",
"**Does the air pollution show a decreasing trend over the years?**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data['1999':].resample('M').mean().plot(ylim=[0,120])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data['1999':].resample('A').mean().plot(ylim=[0,100])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"**What is the difference in diurnal profile between weekdays and weekend?**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data['weekday'] = data.index.weekday\n",
"data['weekend'] = data['weekday'].isin([5, 6])\n",
"data_weekend = data.groupby(['weekend', data.index.hour])['BASCH'].mean().unstack(level=0)\n",
"data_weekend.plot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will come back to these example, and build them up step by step."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. Pandas: data analysis in python\n",
"\n",
"For data-intensive work in Python the [Pandas](http://pandas.pydata.org) library has become essential.\n",
"\n",
"What is `pandas`?\n",
"\n",
"* Pandas can be thought of as *NumPy arrays with labels* for rows and columns, and better support for heterogeneous data types, but it's also much, much more than that.\n",
"* Pandas can also be thought of as `R`'s `data.frame` in Python.\n",
"* Powerful for working with missing data, working with time series data, for reading and writing your data, for reshaping, grouping, merging your data, ...\n",
"\n",
"It's documentation: http://pandas.pydata.org/pandas-docs/stable/\n",
"\n",
"\n",
"** When do you need pandas? **\n",
"\n",
"When working with **tabular or structured data** (like R dataframe, SQL table, Excel spreadsheet, ...):\n",
"\n",
"- Import data\n",
"- Clean up messy data\n",
"- Explore data, gain insight into data\n",
"- Process and prepare your data for analysis\n",
"- Analyse your data (together with scikit-learn, statsmodels, ...)\n",
"\n",
"<div class=\"alert alert-warning\">\n",
"<b>ATTENTION!</b>: <br><br>\n",
"\n",
"Pandas is great for working with heterogeneous and tabular 1D/2D data, but not all types of data fit in such structures!\n",
"<ul>\n",
"<li>When working with array data (e.g. images, numerical algorithms): just stick with numpy</li>\n",
"<li>When working with multidimensional labeled data (e.g. climate data): have a look at [xarray](http://xarray.pydata.org/en/stable/)</li>\n",
"</ul>\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. The pandas data structures: `DataFrame` and `Series`\n",
"\n",
"A `DataFrame` is a **tablular data structure** (multi-dimensional object to hold labeled data) comprised of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can think of it as multiple Series object which share the same index.\n",
"\n",
"\n",
"<img align=\"left\" width=50% src=\"img/schema-dataframe.svg\">"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Attributes of the DataFrame\n",
"\n",
"A DataFrame has besides a `index` attribute, also a `columns` attribute:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.index"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.columns"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"To check the data types of the different columns:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"An overview of that information can be given with the `info()` method:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Also a DataFrame has a `values` attribute, but attention: when you have heterogeneous data, all values will be upcasted:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"df.values"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Apart from importing your data from an external source (text file, excel, database, ..), one of the most common ways of creating a dataframe is from a dictionary of arrays or lists.\n",
"\n",
"Note that in the IPython notebook, the dataframe will display in a rich HTML view:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],\n",
" 'population': [11.3, 64.3, 81.3, 16.9, 64.9],\n",
" 'area': [30510, 671308, 357050, 41526, 244820],\n",
" 'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}\n",
"df_countries = pd.DataFrame(data)\n",
"df_countries"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### One-dimensional data: `Series` (a column of a DataFrame)\n",
"\n",
"A Series is a basic holder for **one-dimensional labeled data**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['Age']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"age = df['Age']"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Attributes of a Series: `index` and `values`\n",
"\n",
"The Series has also an `index` and `values` attribute, but no `columns`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"age.index"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can access the underlying numpy array representation with the `.values` attribute:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"age.values[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"We can access series values via the index, just like for NumPy arrays:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"age[0]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Unlike the NumPy array, though, this index can be something other than integers:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = df.set_index('Name')\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"age = df['Age']\n",
"age"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"age['Dooley, Mr. Patrick']"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"but with the power of numpy arrays. Many things you can do with numpy arrays, can also be applied on DataFrames / Series.\n",
"\n",
"Eg element-wise operations:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"age * 1000"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A range of methods:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"age.mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fancy indexing, like indexing with a list or boolean indexing:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"age[age > 70]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But also a lot of pandas specific methods, e.g."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['Embarked'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
"\n",
"<b>EXERCISE</b>:\n",
"\n",
" <ul>\n",
" <li>What is the maximum Fare that was paid? And the median?</li>\n",
"</ul>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction31.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction32.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
"\n",
"<b>EXERCISE</b>:\n",
"\n",
" <ul>\n",
" <li>Calculate the average survival ratio for all passengers (note: the 'Survived' column indicates whether someone survived (1) or not (0)).</li>\n",
"</ul>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction33.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. Data import and export"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"A wide range of input/output formats are natively supported by pandas:\n",
"\n",
"* CSV, text\n",
"* SQL database\n",
"* Excel\n",
"* HDF5\n",
"* json\n",
"* html\n",
"* pickle\n",
"* sas, stata\n",
"* (parquet)\n",
"* ..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#pd.read"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#df.to"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Very powerful csv reader:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pd.read_csv?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Luckily, if we have a well formed csv file, we don't need many of those arguments:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df = pd.read_csv(\"data/titanic.csv\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
"\n",
"<b>EXERCISE</b>: Read the `data/20000101_20161231-NO2.csv` file into a DataFrame `no2`\n",
"<br><br>\n",
"Some aspects about the file:\n",
" <ul>\n",
" <li>Which separator is used in the file?</li>\n",
" <li>The second row includes unit information and should be skipped (check `skiprows` keyword)</li>\n",
" <li>For missing values, it uses the `'n/d'` notation (check `na_values` keyword)</li>\n",
" <li>We want to parse the 'timestamp' column as datetimes (check the `parse_dates` keyword)</li>\n",
"</ul>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": true
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction39.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": false
},
"outputs": [],
"source": [
"no2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 4. Exploration"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Some useful methods:\n",
"\n",
"`head` and `tail`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"no2.head(3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"no2.tail()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"`info()`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"no2.info()"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Getting some basic summary statistics about the data with `describe`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"no2.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Quickly visualizing the data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"no2.plot(kind='box', ylim=[0,250])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"no2['BASCH'].plot(kind='hist', bins=50)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
"\n",
"<b>EXERCISE</b>: \n",
"\n",
" <ul>\n",
" <li>Plot the age distribution of the titanic passengers</li>\n",
"</ul>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction47.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The default plot (when not specifying `kind`) is a line plot of all columns:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"no2.plot(figsize=(12,6))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This does not say too much .."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"We can select part of the data (eg the latest 500 data points):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"no2[-500:].plot(figsize=(12,6))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or we can use some more advanced time series features -> see further in this notebook!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 5. Selecting and filtering data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-warning\">\n",
"<b>ATTENTION!</b>: <br><br>\n",
"\n",
"One of pandas' basic features is the labeling of rows and columns, but this makes indexing also a bit more complex compared to numpy. <br><br> We now have to distuinguish between:\n",
"\n",
" <ul>\n",
" <li>selection by **label**</li>\n",
" <li>selection by **position**</li>\n",
"</ul>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df = pd.read_csv(\"data/titanic.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### `df[]` provides some convenience shortcuts "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"For a DataFrame, basic indexing selects the columns.\n",
"\n",
"Selecting a single column:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['Age']"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"or multiple columns:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df[['Age', 'Fare']]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"But, slicing accesses the rows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df[10:15]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Systematic indexing with `loc` and `iloc`\n",
"\n",
"When using `[]` like above, you can only select from one axis at once (rows or columns, not both). For more advanced indexing, you have some extra attributes:\n",
" \n",
"* `loc`: selection by label\n",
"* `iloc`: selection by position"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df = df.set_index('Name')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.loc['Bonnell, Miss. Elizabeth', 'Fare']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"df.loc['Bonnell, Miss. Elizabeth':'Andersson, Mr. Anders Johan', :]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Selecting by position with `iloc` works similar as indexing numpy arrays:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.iloc[0:2,1:3]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The different indexing methods can also be used to assign data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df.loc['Braund, Mr. Owen Harris', 'Survived'] = 100"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Boolean indexing (filtering)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Often, you want to select rows based on a certain condition. This can be done with 'boolean indexing' (like a where clause in SQL) and comparable to numpy. \n",
"\n",
"The indexer (or boolean mask) should be 1-dimensional and the same length as the thing being indexed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"df['Fare'] > 50"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
},
"scrolled": false
},
"outputs": [],
"source": [
"df[df['Fare'] > 50]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
"\n",
"<b>EXERCISE</b>:\n",
"\n",
" <ul>\n",
" <li>Based on the titanic data set, select all rows for male passengers and calculate the mean age of those passengers. Do the same for the female passengers</li>\n",
"</ul>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df = pd.read_csv(\"data/titanic.csv\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction63.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction64.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction65.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
"\n",
"<b>EXERCISE</b>:\n",
"\n",
" <ul>\n",
" <li>Based on the titanic data set, how many passengers older than 70 were on the Titanic?</li>\n",
"</ul>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction66.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction67.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 6. The group-by operation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Some 'theory': the groupby operation (split-apply-combine)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],\n",
" 'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Recap: aggregating functions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When analyzing data, you often calculate summary statistics (aggregations like the mean, max, ...). As we have seen before, we can easily calculate such a statistic for a Series or column using one of the many available methods. For example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"df['data'].sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, in many cases your data has certain groups in it, and in that case, you may want to calculate this statistic for each of the groups.\n",
"\n",
"For example, in the above dataframe `df`, there is a column 'key' which has three possible values: 'A', 'B' and 'C'. When we want to calculate the sum for each of those groups, we could do the following:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"for key in ['A', 'B', 'C']:\n",
" print(key, df[df['key'] == key]['data'].sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This becomes very verbose when having multiple groups. You could make the above a bit easier by looping over the different values, but still, it is not very convenient to work with.\n",
"\n",
"What we did above, applying a function on different groups, is a \"groupby operation\", and pandas provides some convenient functionality for this."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Groupby: applying functions per group"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The \"group by\" concept: we want to **apply the same function on subsets of your dataframe, based on some key to split the dataframe in subsets**\n",
"\n",
"This operation is also referred to as the \"split-apply-combine\" operation, involving the following steps:\n",
"\n",
"* **Splitting** the data into groups based on some criteria\n",
"* **Applying** a function to each group independently\n",
"* **Combining** the results into a data structure\n",
"\n",
"<img src=\"img/splitApplyCombine.png\">\n",
"\n",
"Similar to SQL `GROUP BY`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instead of doing the manual filtering as above\n",
"\n",
"\n",
" df[df['key'] == \"A\"].sum()\n",
" df[df['key'] == \"B\"].sum()\n",
" ...\n",
"\n",
"pandas provides the `groupby` method to do exactly this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"df.groupby('key').sum()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"df.groupby('key').aggregate(np.sum) # 'sum'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And many more methods are available. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"df.groupby('key')['data'].sum()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Application of the groupby concept on the titanic data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We go back to the titanic passengers survival data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"df = pd.read_csv(\"data/titanic.csv\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
"\n",
"<b>EXERCISE</b>:\n",
"\n",
" <ul>\n",
" <li>Calculate the average age for each sex again, but now using groupby.</li>\n",
"</ul>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction76.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
"\n",
"<b>EXERCISE</b>:\n",
"\n",
" <ul>\n",
" <li>Calculate the average survival ratio for all passengers.</li>\n",
"</ul>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction77.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
"\n",
"<b>EXERCISE</b>:\n",
"\n",
" <ul>\n",
" <li>Calculate this survival ratio for all passengers younger that 25 (remember: filtering/boolean indexing).</li>\n",
"</ul>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"run_control": {
"frozen": false,
"read_only": false
},
"scrolled": true
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction78.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
"\n",
"<b>EXERCISE</b>:\n",
"\n",
" <ul>\n",
" <li>What is the difference in the survival ratio between the sexes?</li>\n",
"</ul>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction79.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
"\n",
"<b>EXERCISE</b>:\n",
"\n",
" <ul>\n",
" <li>Or how does it differ between the different classes? Make a bar plot visualizing the survival ratio for the 3 classes.</li>\n",
"</ul>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction80.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
"\n",
"<b>EXERCISE</b>:\n",
"\n",
" <ul>\n",
" <li>Make a bar plot to visualize the average Fare payed by people depending on their age. The age column is devided is separate classes using the `pd.cut` function as provided below.</li>\n",
"</ul>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": false,
"collapsed": true,
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"df['AgeClass'] = pd.cut(df['Age'], bins=np.arange(0,90,10))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"run_control": {
"frozen": false,
"read_only": false
},
"scrolled": true
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction82.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 7. Working with time series data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"no2 = pd.read_csv('data/20000101_20161231-NO2.csv', sep=';', skiprows=[1], na_values=['n/d'], index_col=0, parse_dates=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"When we ensure the DataFrame has a `DatetimeIndex`, time-series related functionality becomes available:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"no2.index"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Indexing a time series works with strings:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"no2[\"2010-01-01 09:00\": \"2010-01-01 12:00\"]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"A nice feature is \"partial string\" indexing, so you don't need to provide the full datetime string."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"source": [
"E.g. all data of January up to March 2012:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"no2['2012-01':'2012-03']"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Time and date components can be accessed from the index:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"no2.index.hour"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"no2.index.year"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Converting your time series with `resample`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A very powerfull method is **`resample`: converting the frequency of the time series** (e.g. from hourly to daily data).\n",
"\n",
"Remember the air quality data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"no2.plot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The time series has a frequency of 1 hour. I want to change this to daily:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"no2.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"no2.resample('D').mean().head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Above I take the mean, but as with `groupby` I can also specify other methods:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"no2.resample('D').max().head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"The string to specify the new time frequency: http://pandas.pydata.org/pandas-docs/dev/timeseries.html#offset-aliases \n",
"These strings can also be combined with numbers, eg `'10D'`."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Further exploring the data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"no2.resample('M').mean().plot() # 'A'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# no2['2012'].resample('D').plot()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction95.py"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"<div class=\"alert alert-success\">\n",
"\n",
"<b>EXERCISE</b>: The evolution of the yearly averages with, and the overall mean of all stations\n",
"\n",
" <ul>\n",
" <li>Use `resample` and `plot` to plot the yearly averages for the different stations.</li>\n",
" <li>The overall mean of all stations can be calculated by taking the mean of the different columns (`.mean(axis=1)`).</li>\n",
"</ul>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"scrolled": false
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction96.py"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"<div class=\"alert alert-success\">\n",
"\n",
"<b>EXERCISE</b>: how does the *typical monthly profile* look like for the different stations?\n",
"\n",
" <ul>\n",
" <li>Add a 'month' column to the dataframe.</li>\n",
" <li>Group by the month to obtain the typical monthly averages over the different years.</li>\n",
"</ul>\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we add a column to the dataframe that indicates the month (integer value of 1 to 12):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": true
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction97.py"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Now, we can calculate the mean of each month over the different years:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction98.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction99.py"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"<div class=\"alert alert-success\">\n",
"\n",
"<b>EXERCISE</b>: The typical diurnal profile for the different stations\n",
"\n",
" <ul>\n",
" <li>Similar as for the month, you can now group by the hour of the day.</li>\n",
"</ul>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction100.py"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"<div class=\"alert alert-success\">\n",
"\n",
"<b>EXERCISE</b>: What is the difference in the typical diurnal profile between week and weekend days for the 'BASCH' station.\n",
"\n",
" <ul>\n",
" <li>Add a column 'weekday' defining the different days in the week.</li>\n",
" <li>Add a column 'weekend' defining if a days is in the weekend (i.e. days 5 and 6) or not (True/False).</li>\n",
" <li>You can groupby on multiple items at the same time. In this case you would need to group by both weekend/weekday and hour of the day.</li>\n",
"</ul>\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Add a column indicating the weekday:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"no2.index.weekday?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": true
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction102.py"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Add a column indicating week/weekend"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": true
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction103.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can groupby the hour of the day and the weekend (or use `pivot_table`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction104.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction105.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction106.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction107.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
"\n",
"<b>EXERCISE</b>: What are the number of exceedances of hourly values above the European limit 200 µg/m3 ?\n",
"\n",
"Count the number of exceedances of hourly values above the European limit 200 µg/m3 for each year and station after 2005. Make a barplot of the counts. Add an horizontal line indicating the maximum number of exceedances (which is 18) allowed per year?\n",
"<br><br>\n",
"\n",
"Hints:\n",
"\n",
" <ul>\n",
" <li>Create a new DataFrame, called `exceedances`, (with boolean values) indicating if the threshold is exceeded or not</li>\n",
" <li>Remember that the sum of True values can be used to count elements. Do this using groupby for each year.</li>\n",
" <li>Adding a horizontal line can be done with the matplotlib function `ax.axhline`.</li>\n",
"</ul>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# re-reading the data to have a clean version\n",
"no2 = pd.read_csv('data/20000101_20161231-NO2.csv', sep=';', skiprows=[1], na_values=['n/d'], index_col=0, parse_dates=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": true
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction109.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": true
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction110.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/01-pandas_introduction111.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 9. What I didn't talk about"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Concatenating data: `pd.concat`\n",
"- Merging and joining data: `pd.merge`\n",
"- Reshaping data: `pivot_table`, `melt`, `stack`, `unstack`\n",
"- Working with missing data: `isnull`, `dropna`, `interpolate`, ...\n",
"- ..."
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"\n",
"## Further reading\n",
"\n",
"* Pandas documentation: http://pandas.pydata.org/pandas-docs/stable/\n",
"\n",
"* Books\n",
"\n",
" * \"Python for Data Analysis\" by Wes McKinney\n",
" * \"Python Data Science Handbook\" by Jake VanderPlas\n",
"\n",
"* Tutorials (many good online tutorials!)\n",
"\n",
" * https://github.com/jorisvandenbossche/pandas-tutorial\n",
" * https://github.com/brandon-rhodes/pycon-pandas-tutorial\n",
"\n",
"* Tom Augspurger's blog\n",
"\n",
" * https://tomaugspurger.github.io/modern-1.html"
]
}
],
"metadata": {
"celltoolbar": "Nbtutor - export exercises",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
},
"nav_menu": {},
"toc": {
"navigate_menu": true,
"number_sections": false,
"sideBar": true,
"threshold": 6,
"toc_cell": false,
"toc_section_display": "block",
"toc_window_display": true
}
},
"nbformat": 4,
"nbformat_minor": 2
}