2428 lines
49 KiB
Plaintext
2428 lines
49 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<!--<img width=700px; src=\"../img/logoUPSayPlusCDS_990.png\"> -->\n",
|
|
"\n",
|
|
"<p style=\"margin-top: 3em; margin-bottom: 2em;\"><b><big><big><big><big>Introduction to Pandas</big></big></big></big></b></p>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"%matplotlib inline\n",
|
|
"import numpy as np\n",
|
|
"import pandas as pd\n",
|
|
"import matplotlib.pyplot as plt\n",
|
|
"\n",
|
|
"pd.options.display.max_rows = 8"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# 1. Let's start with a showcase\n",
|
|
"\n",
|
|
"#### Case 1: titanic survival data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"df = pd.read_csv(\"data/titanic.csv\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Starting from reading this dataset, to answering questions about this data in a few lines of code:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"**What is the age distribution of the passengers?**"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df['Age'].hist()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"**How does the survival rate of the passengers differ between sexes?**"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.groupby('Sex')[['Survived']].aggregate(lambda x: x.sum() / len(x))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"**Or how does it differ between the different classes?**"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.groupby('Pclass')['Survived'].aggregate(lambda x: x.sum() / len(x)).plot(kind='bar')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"All the needed functionality for the above examples will be explained throughout this tutorial."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Case 2: air quality measurement timeseries"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"collapsed": true,
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"AirBase (The European Air quality dataBase): hourly measurements of all air quality monitoring stations from Europe\n",
|
|
"\n",
|
|
"Starting from these hourly data for different stations:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"data = pd.read_csv('data/20000101_20161231-NO2.csv', sep=';', skiprows=[1], na_values=['n/d'], index_col=0, parse_dates=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"data.head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"to answering questions about this data in a few lines of code:\n",
|
|
"\n",
|
|
"**Does the air pollution show a decreasing trend over the years?**"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"data['1999':].resample('M').mean().plot(ylim=[0,120])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"data['1999':].resample('A').mean().plot(ylim=[0,100])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"**What is the difference in diurnal profile between weekdays and weekend?**"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"data['weekday'] = data.index.weekday\n",
|
|
"data['weekend'] = data['weekday'].isin([5, 6])\n",
|
|
"data_weekend = data.groupby(['weekend', data.index.hour])['BASCH'].mean().unstack(level=0)\n",
|
|
"data_weekend.plot()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We will come back to these example, and build them up step by step."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# 2. Pandas: data analysis in python\n",
|
|
"\n",
|
|
"For data-intensive work in Python the [Pandas](http://pandas.pydata.org) library has become essential.\n",
|
|
"\n",
|
|
"What is `pandas`?\n",
|
|
"\n",
|
|
"* Pandas can be thought of as *NumPy arrays with labels* for rows and columns, and better support for heterogeneous data types, but it's also much, much more than that.\n",
|
|
"* Pandas can also be thought of as `R`'s `data.frame` in Python.\n",
|
|
"* Powerful for working with missing data, working with time series data, for reading and writing your data, for reshaping, grouping, merging your data, ...\n",
|
|
"\n",
|
|
"It's documentation: http://pandas.pydata.org/pandas-docs/stable/\n",
|
|
"\n",
|
|
"\n",
|
|
"** When do you need pandas? **\n",
|
|
"\n",
|
|
"When working with **tabular or structured data** (like R dataframe, SQL table, Excel spreadsheet, ...):\n",
|
|
"\n",
|
|
"- Import data\n",
|
|
"- Clean up messy data\n",
|
|
"- Explore data, gain insight into data\n",
|
|
"- Process and prepare your data for analysis\n",
|
|
"- Analyse your data (together with scikit-learn, statsmodels, ...)\n",
|
|
"\n",
|
|
"<div class=\"alert alert-warning\">\n",
|
|
"<b>ATTENTION!</b>: <br><br>\n",
|
|
"\n",
|
|
"Pandas is great for working with heterogeneous and tabular 1D/2D data, but not all types of data fit in such structures!\n",
|
|
"<ul>\n",
|
|
"<li>When working with array data (e.g. images, numerical algorithms): just stick with numpy</li>\n",
|
|
"<li>When working with multidimensional labeled data (e.g. climate data): have a look at [xarray](http://xarray.pydata.org/en/stable/)</li>\n",
|
|
"</ul>\n",
|
|
"</div>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# 2. The pandas data structures: `DataFrame` and `Series`\n",
|
|
"\n",
|
|
"A `DataFrame` is a **tablular data structure** (multi-dimensional object to hold labeled data) comprised of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can think of it as multiple Series object which share the same index.\n",
|
|
"\n",
|
|
"\n",
|
|
"<img align=\"left\" width=50% src=\"img/schema-dataframe.svg\">"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"scrolled": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"df"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### Attributes of the DataFrame\n",
|
|
"\n",
|
|
"A DataFrame has besides a `index` attribute, also a `columns` attribute:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.index"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.columns"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"To check the data types of the different columns:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.dtypes"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"An overview of that information can be given with the `info()` method:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.info()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Also a DataFrame has a `values` attribute, but attention: when you have heterogeneous data, all values will be upcasted:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"scrolled": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.values"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Apart from importing your data from an external source (text file, excel, database, ..), one of the most common ways of creating a dataframe is from a dictionary of arrays or lists.\n",
|
|
"\n",
|
|
"Note that in the IPython notebook, the dataframe will display in a rich HTML view:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],\n",
|
|
" 'population': [11.3, 64.3, 81.3, 16.9, 64.9],\n",
|
|
" 'area': [30510, 671308, 357050, 41526, 244820],\n",
|
|
" 'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}\n",
|
|
"df_countries = pd.DataFrame(data)\n",
|
|
"df_countries"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### One-dimensional data: `Series` (a column of a DataFrame)\n",
|
|
"\n",
|
|
"A Series is a basic holder for **one-dimensional labeled data**."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df['Age']"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"age = df['Age']"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### Attributes of a Series: `index` and `values`\n",
|
|
"\n",
|
|
"The Series has also an `index` and `values` attribute, but no `columns`"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"age.index"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"You can access the underlying numpy array representation with the `.values` attribute:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"scrolled": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"age.values[:10]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"We can access series values via the index, just like for NumPy arrays:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"age[0]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Unlike the NumPy array, though, this index can be something other than integers:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df = df.set_index('Name')\n",
|
|
"df"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"age = df['Age']\n",
|
|
"age"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"age['Dooley, Mr. Patrick']"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "fragment"
|
|
}
|
|
},
|
|
"source": [
|
|
"but with the power of numpy arrays. Many things you can do with numpy arrays, can also be applied on DataFrames / Series.\n",
|
|
"\n",
|
|
"Eg element-wise operations:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"age * 1000"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"A range of methods:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"age.mean()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Fancy indexing, like indexing with a list or boolean indexing:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"age[age > 70]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"But also a lot of pandas specific methods, e.g."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df['Embarked'].value_counts()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div class=\"alert alert-success\">\n",
|
|
"\n",
|
|
"<b>EXERCISE</b>:\n",
|
|
"\n",
|
|
" <ul>\n",
|
|
" <li>What is the maximum Fare that was paid? And the median?</li>\n",
|
|
"</ul>\n",
|
|
"</div>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction31.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction32.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div class=\"alert alert-success\">\n",
|
|
"\n",
|
|
"<b>EXERCISE</b>:\n",
|
|
"\n",
|
|
" <ul>\n",
|
|
" <li>Calculate the average survival ratio for all passengers (note: the 'Survived' column indicates whether someone survived (1) or not (0)).</li>\n",
|
|
"</ul>\n",
|
|
"</div>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true,
|
|
"run_control": {
|
|
"frozen": false,
|
|
"read_only": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction33.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# 3. Data import and export"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"A wide range of input/output formats are natively supported by pandas:\n",
|
|
"\n",
|
|
"* CSV, text\n",
|
|
"* SQL database\n",
|
|
"* Excel\n",
|
|
"* HDF5\n",
|
|
"* json\n",
|
|
"* html\n",
|
|
"* pickle\n",
|
|
"* sas, stata\n",
|
|
"* (parquet)\n",
|
|
"* ..."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"#pd.read"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"#df.to"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Very powerful csv reader:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"pd.read_csv?"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Luckily, if we have a well formed csv file, we don't need many of those arguments:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"df = pd.read_csv(\"data/titanic.csv\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div class=\"alert alert-success\">\n",
|
|
"\n",
|
|
"<b>EXERCISE</b>: Read the `data/20000101_20161231-NO2.csv` file into a DataFrame `no2`\n",
|
|
"<br><br>\n",
|
|
"Some aspects about the file:\n",
|
|
" <ul>\n",
|
|
" <li>Which separator is used in the file?</li>\n",
|
|
" <li>The second row includes unit information and should be skipped (check `skiprows` keyword)</li>\n",
|
|
" <li>For missing values, it uses the `'n/d'` notation (check `na_values` keyword)</li>\n",
|
|
" <li>We want to parse the 'timestamp' column as datetimes (check the `parse_dates` keyword)</li>\n",
|
|
"</ul>\n",
|
|
"</div>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true,
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction39.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"no2"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# 4. Exploration"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Some useful methods:\n",
|
|
"\n",
|
|
"`head` and `tail`"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "-"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"no2.head(3)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"no2.tail()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"`info()`"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"no2.info()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"collapsed": true,
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Getting some basic summary statistics about the data with `describe`:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"no2.describe()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Quickly visualizing the data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "-"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"no2.plot(kind='box', ylim=[0,250])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"no2['BASCH'].plot(kind='hist', bins=50)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div class=\"alert alert-success\">\n",
|
|
"\n",
|
|
"<b>EXERCISE</b>: \n",
|
|
"\n",
|
|
" <ul>\n",
|
|
" <li>Plot the age distribution of the titanic passengers</li>\n",
|
|
"</ul>\n",
|
|
"</div>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction47.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The default plot (when not specifying `kind`) is a line plot of all columns:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"no2.plot(figsize=(12,6))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"This does not say too much .."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"We can select part of the data (eg the latest 500 data points):"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"no2[-500:].plot(figsize=(12,6))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Or we can use some more advanced time series features -> see further in this notebook!"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# 5. Selecting and filtering data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div class=\"alert alert-warning\">\n",
|
|
"<b>ATTENTION!</b>: <br><br>\n",
|
|
"\n",
|
|
"One of pandas' basic features is the labeling of rows and columns, but this makes indexing also a bit more complex compared to numpy. <br><br> We now have to distuinguish between:\n",
|
|
"\n",
|
|
" <ul>\n",
|
|
" <li>selection by **label**</li>\n",
|
|
" <li>selection by **position**</li>\n",
|
|
"</ul>\n",
|
|
"</div>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"df = pd.read_csv(\"data/titanic.csv\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### `df[]` provides some convenience shortcuts "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"For a DataFrame, basic indexing selects the columns.\n",
|
|
"\n",
|
|
"Selecting a single column:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df['Age']"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"or multiple columns:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df[['Age', 'Fare']]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"But, slicing accesses the rows:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df[10:15]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### Systematic indexing with `loc` and `iloc`\n",
|
|
"\n",
|
|
"When using `[]` like above, you can only select from one axis at once (rows or columns, not both). For more advanced indexing, you have some extra attributes:\n",
|
|
" \n",
|
|
"* `loc`: selection by label\n",
|
|
"* `iloc`: selection by position"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"df = df.set_index('Name')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.loc['Bonnell, Miss. Elizabeth', 'Fare']"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"scrolled": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.loc['Bonnell, Miss. Elizabeth':'Andersson, Mr. Anders Johan', :]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Selecting by position with `iloc` works similar as indexing numpy arrays:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.iloc[0:2,1:3]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"The different indexing methods can also be used to assign data:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.loc['Braund, Mr. Owen Harris', 'Survived'] = 100"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Boolean indexing (filtering)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Often, you want to select rows based on a certain condition. This can be done with 'boolean indexing' (like a where clause in SQL) and comparable to numpy. \n",
|
|
"\n",
|
|
"The indexer (or boolean mask) should be 1-dimensional and the same length as the thing being indexed."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"run_control": {
|
|
"frozen": false,
|
|
"read_only": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"df['Fare'] > 50"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"run_control": {
|
|
"frozen": false,
|
|
"read_only": false
|
|
},
|
|
"scrolled": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"df[df['Fare'] > 50]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div class=\"alert alert-success\">\n",
|
|
"\n",
|
|
"<b>EXERCISE</b>:\n",
|
|
"\n",
|
|
" <ul>\n",
|
|
" <li>Based on the titanic data set, select all rows for male passengers and calculate the mean age of those passengers. Do the same for the female passengers</li>\n",
|
|
"</ul>\n",
|
|
"</div>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"df = pd.read_csv(\"data/titanic.csv\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction63.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction64.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction65.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div class=\"alert alert-success\">\n",
|
|
"\n",
|
|
"<b>EXERCISE</b>:\n",
|
|
"\n",
|
|
" <ul>\n",
|
|
" <li>Based on the titanic data set, how many passengers older than 70 were on the Titanic?</li>\n",
|
|
"</ul>\n",
|
|
"</div>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction66.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction67.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# 6. The group-by operation"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Some 'theory': the groupby operation (split-apply-combine)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"run_control": {
|
|
"frozen": false,
|
|
"read_only": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],\n",
|
|
" 'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})\n",
|
|
"df"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Recap: aggregating functions"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"When analyzing data, you often calculate summary statistics (aggregations like the mean, max, ...). As we have seen before, we can easily calculate such a statistic for a Series or column using one of the many available methods. For example:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"run_control": {
|
|
"frozen": false,
|
|
"read_only": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"df['data'].sum()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"However, in many cases your data has certain groups in it, and in that case, you may want to calculate this statistic for each of the groups.\n",
|
|
"\n",
|
|
"For example, in the above dataframe `df`, there is a column 'key' which has three possible values: 'A', 'B' and 'C'. When we want to calculate the sum for each of those groups, we could do the following:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"run_control": {
|
|
"frozen": false,
|
|
"read_only": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"for key in ['A', 'B', 'C']:\n",
|
|
" print(key, df[df['key'] == key]['data'].sum())"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"This becomes very verbose when having multiple groups. You could make the above a bit easier by looping over the different values, but still, it is not very convenient to work with.\n",
|
|
"\n",
|
|
"What we did above, applying a function on different groups, is a \"groupby operation\", and pandas provides some convenient functionality for this."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Groupby: applying functions per group"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"The \"group by\" concept: we want to **apply the same function on subsets of your dataframe, based on some key to split the dataframe in subsets**\n",
|
|
"\n",
|
|
"This operation is also referred to as the \"split-apply-combine\" operation, involving the following steps:\n",
|
|
"\n",
|
|
"* **Splitting** the data into groups based on some criteria\n",
|
|
"* **Applying** a function to each group independently\n",
|
|
"* **Combining** the results into a data structure\n",
|
|
"\n",
|
|
"<img src=\"img/splitApplyCombine.png\">\n",
|
|
"\n",
|
|
"Similar to SQL `GROUP BY`"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Instead of doing the manual filtering as above\n",
|
|
"\n",
|
|
"\n",
|
|
" df[df['key'] == \"A\"].sum()\n",
|
|
" df[df['key'] == \"B\"].sum()\n",
|
|
" ...\n",
|
|
"\n",
|
|
"pandas provides the `groupby` method to do exactly this:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"run_control": {
|
|
"frozen": false,
|
|
"read_only": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.groupby('key').sum()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"run_control": {
|
|
"frozen": false,
|
|
"read_only": false
|
|
},
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.groupby('key').aggregate(np.sum) # 'sum'"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"And many more methods are available. "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"run_control": {
|
|
"frozen": false,
|
|
"read_only": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.groupby('key')['data'].sum()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### Application of the groupby concept on the titanic data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We go back to the titanic passengers survival data:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": true,
|
|
"run_control": {
|
|
"frozen": false,
|
|
"read_only": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"df = pd.read_csv(\"data/titanic.csv\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"run_control": {
|
|
"frozen": false,
|
|
"read_only": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div class=\"alert alert-success\">\n",
|
|
"\n",
|
|
"<b>EXERCISE</b>:\n",
|
|
"\n",
|
|
" <ul>\n",
|
|
" <li>Calculate the average age for each sex again, but now using groupby.</li>\n",
|
|
"</ul>\n",
|
|
"</div>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true,
|
|
"run_control": {
|
|
"frozen": false,
|
|
"read_only": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction76.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div class=\"alert alert-success\">\n",
|
|
"\n",
|
|
"<b>EXERCISE</b>:\n",
|
|
"\n",
|
|
" <ul>\n",
|
|
" <li>Calculate the average survival ratio for all passengers.</li>\n",
|
|
"</ul>\n",
|
|
"</div>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true,
|
|
"run_control": {
|
|
"frozen": false,
|
|
"read_only": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction77.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div class=\"alert alert-success\">\n",
|
|
"\n",
|
|
"<b>EXERCISE</b>:\n",
|
|
"\n",
|
|
" <ul>\n",
|
|
" <li>Calculate this survival ratio for all passengers younger that 25 (remember: filtering/boolean indexing).</li>\n",
|
|
"</ul>\n",
|
|
"</div>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true,
|
|
"run_control": {
|
|
"frozen": false,
|
|
"read_only": false
|
|
},
|
|
"scrolled": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction78.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div class=\"alert alert-success\">\n",
|
|
"\n",
|
|
"<b>EXERCISE</b>:\n",
|
|
"\n",
|
|
" <ul>\n",
|
|
" <li>What is the difference in the survival ratio between the sexes?</li>\n",
|
|
"</ul>\n",
|
|
"</div>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true,
|
|
"run_control": {
|
|
"frozen": false,
|
|
"read_only": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction79.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div class=\"alert alert-success\">\n",
|
|
"\n",
|
|
"<b>EXERCISE</b>:\n",
|
|
"\n",
|
|
" <ul>\n",
|
|
" <li>Or how does it differ between the different classes? Make a bar plot visualizing the survival ratio for the 3 classes.</li>\n",
|
|
"</ul>\n",
|
|
"</div>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction80.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div class=\"alert alert-success\">\n",
|
|
"\n",
|
|
"<b>EXERCISE</b>:\n",
|
|
"\n",
|
|
" <ul>\n",
|
|
" <li>Make a bar plot to visualize the average Fare payed by people depending on their age. The age column is devided is separate classes using the `pd.cut` function as provided below.</li>\n",
|
|
"</ul>\n",
|
|
"</div>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": false,
|
|
"collapsed": true,
|
|
"run_control": {
|
|
"frozen": false,
|
|
"read_only": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"df['AgeClass'] = pd.cut(df['Age'], bins=np.arange(0,90,10))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true,
|
|
"run_control": {
|
|
"frozen": false,
|
|
"read_only": false
|
|
},
|
|
"scrolled": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction82.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# 7. Working with time series data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"no2 = pd.read_csv('data/20000101_20161231-NO2.csv', sep=';', skiprows=[1], na_values=['n/d'], index_col=0, parse_dates=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "fragment"
|
|
}
|
|
},
|
|
"source": [
|
|
"When we ensure the DataFrame has a `DatetimeIndex`, time-series related functionality becomes available:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"no2.index"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Indexing a time series works with strings:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"no2[\"2010-01-01 09:00\": \"2010-01-01 12:00\"]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"A nice feature is \"partial string\" indexing, so you don't need to provide the full datetime string."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "-"
|
|
}
|
|
},
|
|
"source": [
|
|
"E.g. all data of January up to March 2012:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"no2['2012-01':'2012-03']"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Time and date components can be accessed from the index:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"no2.index.hour"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"no2.index.year"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"## Converting your time series with `resample`"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"A very powerfull method is **`resample`: converting the frequency of the time series** (e.g. from hourly to daily data).\n",
|
|
"\n",
|
|
"Remember the air quality data:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"no2.plot()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The time series has a frequency of 1 hour. I want to change this to daily:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"no2.head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"no2.resample('D').mean().head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Above I take the mean, but as with `groupby` I can also specify other methods:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"no2.resample('D').max().head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "skip"
|
|
}
|
|
},
|
|
"source": [
|
|
"The string to specify the new time frequency: http://pandas.pydata.org/pandas-docs/dev/timeseries.html#offset-aliases \n",
|
|
"These strings can also be combined with numbers, eg `'10D'`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Further exploring the data:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"no2.resample('M').mean().plot() # 'A'"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# no2['2012'].resample('D').plot()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true,
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction95.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"<div class=\"alert alert-success\">\n",
|
|
"\n",
|
|
"<b>EXERCISE</b>: The evolution of the yearly averages with, and the overall mean of all stations\n",
|
|
"\n",
|
|
" <ul>\n",
|
|
" <li>Use `resample` and `plot` to plot the yearly averages for the different stations.</li>\n",
|
|
" <li>The overall mean of all stations can be calculated by taking the mean of the different columns (`.mean(axis=1)`).</li>\n",
|
|
"</ul>\n",
|
|
"</div>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true,
|
|
"scrolled": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction96.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"<div class=\"alert alert-success\">\n",
|
|
"\n",
|
|
"<b>EXERCISE</b>: how does the *typical monthly profile* look like for the different stations?\n",
|
|
"\n",
|
|
" <ul>\n",
|
|
" <li>Add a 'month' column to the dataframe.</li>\n",
|
|
" <li>Group by the month to obtain the typical monthly averages over the different years.</li>\n",
|
|
"</ul>\n",
|
|
"</div>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"First, we add a column to the dataframe that indicates the month (integer value of 1 to 12):"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true,
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction97.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Now, we can calculate the mean of each month over the different years:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction98.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true,
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction99.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"<div class=\"alert alert-success\">\n",
|
|
"\n",
|
|
"<b>EXERCISE</b>: The typical diurnal profile for the different stations\n",
|
|
"\n",
|
|
" <ul>\n",
|
|
" <li>Similar as for the month, you can now group by the hour of the day.</li>\n",
|
|
"</ul>\n",
|
|
"</div>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true,
|
|
"slideshow": {
|
|
"slide_type": "fragment"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction100.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"<div class=\"alert alert-success\">\n",
|
|
"\n",
|
|
"<b>EXERCISE</b>: What is the difference in the typical diurnal profile between week and weekend days for the 'BASCH' station.\n",
|
|
"\n",
|
|
" <ul>\n",
|
|
" <li>Add a column 'weekday' defining the different days in the week.</li>\n",
|
|
" <li>Add a column 'weekend' defining if a days is in the weekend (i.e. days 5 and 6) or not (True/False).</li>\n",
|
|
" <li>You can groupby on multiple items at the same time. In this case you would need to group by both weekend/weekday and hour of the day.</li>\n",
|
|
"</ul>\n",
|
|
"</div>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Add a column indicating the weekday:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"no2.index.weekday?"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true,
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction102.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Add a column indicating week/weekend"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true,
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction103.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Now we can groupby the hour of the day and the weekend (or use `pivot_table`):"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction104.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true,
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction105.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true,
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction106.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction107.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div class=\"alert alert-success\">\n",
|
|
"\n",
|
|
"<b>EXERCISE</b>: What are the number of exceedances of hourly values above the European limit 200 µg/m3 ?\n",
|
|
"\n",
|
|
"Count the number of exceedances of hourly values above the European limit 200 µg/m3 for each year and station after 2005. Make a barplot of the counts. Add an horizontal line indicating the maximum number of exceedances (which is 18) allowed per year?\n",
|
|
"<br><br>\n",
|
|
"\n",
|
|
"Hints:\n",
|
|
"\n",
|
|
" <ul>\n",
|
|
" <li>Create a new DataFrame, called `exceedances`, (with boolean values) indicating if the threshold is exceeded or not</li>\n",
|
|
" <li>Remember that the sum of True values can be used to count elements. Do this using groupby for each year.</li>\n",
|
|
" <li>Adding a horizontal line can be done with the matplotlib function `ax.axhline`.</li>\n",
|
|
"</ul>\n",
|
|
"</div>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# re-reading the data to have a clean version\n",
|
|
"no2 = pd.read_csv('data/20000101_20161231-NO2.csv', sep=';', skiprows=[1], na_values=['n/d'], index_col=0, parse_dates=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true,
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction109.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true,
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction110.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# %load snippets/01-pandas_introduction111.py"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# 9. What I didn't talk about"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"- Concatenating data: `pd.concat`\n",
|
|
"- Merging and joining data: `pd.merge`\n",
|
|
"- Reshaping data: `pivot_table`, `melt`, `stack`, `unstack`\n",
|
|
"- Working with missing data: `isnull`, `dropna`, `interpolate`, ...\n",
|
|
"- ..."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"source": [
|
|
"\n",
|
|
"## Further reading\n",
|
|
"\n",
|
|
"* Pandas documentation: http://pandas.pydata.org/pandas-docs/stable/\n",
|
|
"\n",
|
|
"* Books\n",
|
|
"\n",
|
|
" * \"Python for Data Analysis\" by Wes McKinney\n",
|
|
" * \"Python Data Science Handbook\" by Jake VanderPlas\n",
|
|
"\n",
|
|
"* Tutorials (many good online tutorials!)\n",
|
|
"\n",
|
|
" * https://github.com/jorisvandenbossche/pandas-tutorial\n",
|
|
" * https://github.com/brandon-rhodes/pycon-pandas-tutorial\n",
|
|
"\n",
|
|
"* Tom Augspurger's blog\n",
|
|
"\n",
|
|
" * https://tomaugspurger.github.io/modern-1.html"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"celltoolbar": "Nbtutor - export exercises",
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.5.2"
|
|
},
|
|
"nav_menu": {},
|
|
"toc": {
|
|
"navigate_menu": true,
|
|
"number_sections": false,
|
|
"sideBar": true,
|
|
"threshold": 6,
|
|
"toc_cell": false,
|
|
"toc_section_display": "block",
|
|
"toc_window_display": true
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
}
|