{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p><font size=\"6\"><b> Case study: air quality data of European monitoring stations (AirBase)</b></font></p><br>\n",
"**AirBase (The European Air quality dataBase): hourly measurements of all air quality monitoring stations from Europe. **\n",
"\n",
"> *© 2016, Joris Van den Bossche and Stijn Van Hoey (<mailto:jorisvandenbossche@gmail.com>, <mailto:stijnvanhoey@gmail.com>). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)*\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"AirBase is the European air quality database maintained by the European Environment Agency (EEA). It contains air quality monitoring data and information submitted by participating countries throughout Europe. The air quality database consists of a multi-annual time series of air quality measurement data and statistics for a number of air pollutants."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"from IPython.display import HTML\n",
"HTML('<iframe src=http://www.eea.europa.eu/data-and-maps/data/airbase-the-european-air-quality-database-8#tab-data-by-country width=900 height=350></iframe>')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some of the data files that are available from AirBase were included in the data folder: the hourly **concentrations of nitrogen dioxide (NO2)** for 4 different measurement stations:\n",
"\n",
"- FR04037 (PARIS 13eme): urban background site at Square de Choisy\n",
"- FR04012 (Paris, Place Victor Basch): urban traffic site at Rue d'Alesia\n",
"- BETR802: urban traffic site in Antwerp, Belgium\n",
"- BETN029: rural background site in Houtem, Belgium\n",
"\n",
"See http://www.eea.europa.eu/themes/air/interactive/no2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"run_control": {
"frozen": false,
"read_only": false
},
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"pd.options.display.max_rows = 8"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Processing a single file\n",
"\n",
"We will start with processing one of the downloaded files (`BETR8010000800100hour.1-1-1990.31-12-2012`). Looking at the data, you will see it does not look like a nice csv file:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"with open(\"data/BETR8010000800100hour.1-1-1990.31-12-2012\") as f:\n",
" print(f.readline())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So we will need to do some manual processing."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Just reading the tab-delimited data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"data = pd.read_csv(\"data/BETR8010000800100hour.1-1-1990.31-12-2012\", sep='\\t')#, header=None)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above data is clearly not ready to be used! Each row contains the 24 measurements for each hour of the day, and also contains a flag (0/1) indicating the quality of the data. Furthermore, there is no header row with column names."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"<div class=\"alert alert-success\">\n",
"\n",
"<b>EXERCISE</b>: <br><br> Clean up this dataframe by using more options of `read_csv` (see its [docstring](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html))\n",
"\n",
" <ul>\n",
" <li>specify the correct delimiter</li>\n",
" <li>specify that the values of -999 and -9999 should be regarded as NaN</li>\n",
" <li>specify are own column names (for how the column names are made up, see See http://stackoverflow.com/questions/6356041/python-intertwining-two-lists)\n",
"</ul>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"# Column names: list consisting of 'date' and then intertwined the hour of the day and 'flag'\n",
"hours = [\"{:02d}\".format(i) for i in range(24)]\n",
"column_names = ['date'] + [item for pair in zip(hours, ['flag']*24) for item in pair]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": true,
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data7.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"run_control": {
"frozen": false,
"read_only": false
},
"scrolled": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data8.py"
]
},
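{
"cell_type": "markdown",
"metadata": {},
"source": [
"In case the solution snippets are not available, a possible `read_csv` call could look like the sketch below. Note that the flag columns are numbered here ('flag00', 'flag01', ...) so that every column name stays unique; the exact names used in the snippet files may differ."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a possible solution sketch (not necessarily identical to the snippet file)\n",
"hours = [\"{:02d}\".format(i) for i in range(24)]\n",
"flags = ['flag' + h for h in hours]\n",
"column_names = ['date'] + [item for pair in zip(hours, flags) for item in pair]\n",
"\n",
"data = pd.read_csv(\"data/BETR8010000800100hour.1-1-1990.31-12-2012\",\n",
"                   sep='\\t', header=None,\n",
"                   na_values=[-999, -9999], names=column_names)\n",
"data.head()"
]
},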
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"For the sake of this tutorial, we will disregard the 'flag' columns (indicating the quality of the data). "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
"\n",
"<b>EXERCISE</b>:\n",
"<br><br>\n",
"Drop all 'flag' columns ('flag1', 'flag2', ...) "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"flag_columns = [col for col in data.columns if 'flag' in col]\n",
"# we can now use this list to drop these columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": true,
"run_control": {
"frozen": false,
"read_only": false
},
"scrolled": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data10.py"
]
},
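{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the drop itself, using the `flag_columns` list built in the previous cell:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# drop the flag columns, keeping only the hourly measurements\n",
"data = data.drop(flag_columns, axis=1)"
]
},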
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we want to reshape it: our goal is to have the different hours as row indices, merged with the date into a datetime-index. Here we have a wide and long dataframe, and want to make this a long, narrow timeseries."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"<div class=\"alert alert-info\">\n",
"\n",
"<b>REMEMBER</b>: \n",
"\n",
" <ul>\n",
" <li>Recap: reshaping your data with [`stack` and `unstack`](./pandas_07_reshaping_data.ipynb)</li>\n",
"</ul>\n",
"\n",
"\n",
"<img src=\"img/schema-stack.svg\" width=70%>\n",
"\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
"\n",
"<b>EXERCISE</b>:\n",
"\n",
"<br><br>\n",
"\n",
"Reshape the dataframe to a timeseries. \n",
"The end result should look like:<br><br>\n",
"\n",
"\n",
"<div class='center'>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>BETR801</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1990-01-02 09:00:00</th>\n",
" <td>48.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1990-01-02 12:00:00</th>\n",
" <td>48.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1990-01-02 13:00:00</th>\n",
" <td>50.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1990-01-02 14:00:00</th>\n",
" <td>55.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2012-12-31 20:00:00</th>\n",
" <td>16.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2012-12-31 21:00:00</th>\n",
" <td>14.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2012-12-31 22:00:00</th>\n",
" <td>16.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2012-12-31 23:00:00</th>\n",
" <td>15.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p style=\"text-align:center\">170794 rows × 1 columns</p>\n",
"</div>\n",
"\n",
" <ul>\n",
" <li>Reshape the dataframe so that each row consists of one observation for one date + hour combination</li>\n",
" <li>When you have the date and hour values as two columns, combine these columns into a datetime (tip: string columns can be summed to concatenate the strings) and remove the original columns</li>\n",
" <li>Set the new datetime values as the index, and remove the original columns with date and hour values</li>\n",
"\n",
"</ul>\n",
"\n",
"\n",
"**NOTE**: This is an advanced exercise. Do not spend too much time on it and don't hesitate to look at the solutions. \n",
"\n",
"</div>\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data12.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data13.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": true,
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data14.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data15.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data16.py"
]
},
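{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible reshape is sketched below (not necessarily the approach of the snippet files). It assumes the hour columns are named '00' to '23' as constructed above, and that the 'date' column holds ISO-formatted date strings such as '1990-01-02'."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a possible solution sketch\n",
"data_stacked = data.set_index('date').stack()            # MultiIndex (date, hour)\n",
"data_stacked = data_stacked.reset_index(name='BETR801')  # back to plain columns\n",
"# combine the date and hour strings into a datetime and use it as the index\n",
"data_stacked.index = pd.to_datetime(data_stacked['date'] + data_stacked['level_1'],\n",
"                                    format=\"%Y-%m-%d%H\")\n",
"data_stacked = data_stacked.drop(['date', 'level_1'], axis=1)"
]
},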
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_stacked.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our final data is now a time series. In pandas, this means that the index is a `DatetimeIndex`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"data_stacked.index"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"data_stacked.plot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Processing a collection of files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have seen the code steps to process one of the files. We have however multiple files for the different stations with the same structure. Therefore, to not have to repeat the actual code, let's make a function from the steps we have seen above."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
"\n",
"<b>EXERCISE</b>:\n",
"\n",
" <ul>\n",
" <li>Write a function `read_airbase_file(filename, station)`, using the above steps the read in and process the data, and that returns a processed timeseries.</li>\n",
"</ul>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"def read_airbase_file(filename, station):\n",
" \"\"\"\n",
" Read hourly AirBase data files.\n",
" \n",
" Parameters\n",
" ----------\n",
" filename : string\n",
" Path to the data file.\n",
" station : string\n",
" Name of the station.\n",
" \n",
" Returns\n",
" -------\n",
" DataFrame\n",
" Processed dataframe.\n",
" \"\"\"\n",
" \n",
" ...\n",
" \n",
" return ..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data21.py"
]
},
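{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the snippet file is not at hand, a possible implementation could assemble the steps from above as in the sketch below (assuming all station files share the same layout as the Antwerp file)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a possible solution sketch\n",
"def read_airbase_file(filename, station):\n",
"    \"\"\"Read one AirBase hourly file into a timeseries for `station` (sketch).\"\"\"\n",
"    hours = [\"{:02d}\".format(i) for i in range(24)]\n",
"    flags = ['flag' + h for h in hours]\n",
"    column_names = ['date'] + [item for pair in zip(hours, flags) for item in pair]\n",
"\n",
"    data = pd.read_csv(filename, sep='\\t', header=None,\n",
"                       na_values=[-999, -9999], names=column_names)\n",
"    # keep only the measurement columns\n",
"    data = data.drop(flags, axis=1)\n",
"\n",
"    # reshape to a timeseries with a datetime index\n",
"    data = data.set_index('date').stack().reset_index(name=station)\n",
"    data.index = pd.to_datetime(data['date'] + data['level_1'], format=\"%Y-%m-%d%H\")\n",
"    data = data.drop(['date', 'level_1'], axis=1)\n",
"    return data"
]
},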
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Test the function on the data file from above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"filename = \"data/BETR8010000800100hour.1-1-1990.31-12-2012\"\n",
"station = filename.split(\"/\")[-1][:7]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"station"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": false,
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"test = read_airbase_file(filename, station)\n",
"test.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now want to use this function to read in all the different data files from AirBase, and combine them into one Dataframe. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
"\n",
"<b>EXERCISE</b>:\n",
"\n",
" <ul>\n",
" <li>Use the `glob.glob` function to list all 4 AirBase data files that are included in the 'data' directory, and call the result `data_files`.</li>\n",
"</ul>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": false,
"collapsed": true,
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"import glob"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data33.py"
]
},
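{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible sketch, assuming the four hourly files are the only files in the 'data' directory whose names contain 'hour':"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a possible solution sketch\n",
"data_files = glob.glob(\"data/*hour*\")\n",
"data_files"
]
},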
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
"\n",
"<b>EXERCISE</b>:\n",
"\n",
" <ul>\n",
" <li>Loop over the data files, read and process the file using our defined function, and append the dataframe to a list.</li>\n",
" <li>Combine the the different DataFrames in the list into a single DataFrame where the different columns are the different stations. Call the result `combined_data`.</li>\n",
"\n",
"</ul>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": true,
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data34.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data35.py"
]
},
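{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible approach is sketched below: derive the station name from the file name (as done earlier), read each file with `read_airbase_file`, and concatenate the resulting timeseries along the columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a possible solution sketch\n",
"dfs = []\n",
"for f in data_files:\n",
"    station = f.split(\"/\")[-1][:7]   # station code from the file name\n",
"    dfs.append(read_airbase_file(f, station))\n",
"\n",
"combined_data = pd.concat(dfs, axis=1)"
]
},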
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"combined_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we don't want to have to repeat this each time we use the data. Therefore, let's save the processed data to a csv file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"combined_data.to_csv(\"airbase_data.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Working with time series data"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"We processed the individual data files above, and saved it to a csv file `airbase_data.csv`. Let's import the file here (if you didn't finish the above exercises, a version of the dataset is also available in `data/airbase_data.csv`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"alldata = pd.read_csv('airbase_data.csv', index_col=0, parse_dates=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We only use the data from 1999 onwards:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"data = alldata['1999':].copy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Som first exploration with the *typical* functions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"data.head() # tail()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"data.info()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"data.describe(percentiles=[0.1, 0.5, 0.9])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Quickly visualizing the data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"data.plot(kind='box', ylim=[0,250])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"data['BETR801'].plot(kind='hist', bins=50)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"data.plot(figsize=(12,6))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This does not say too much .."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"We can select part of the data (eg the latest 500 data points):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data[-500:].plot(figsize=(12,6))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Exercises"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-warning\">\n",
"\n",
"<b>REMINDER</b>: <br><br>\n",
"\n",
"Take a look at the [Timeseries notebook](05 - Time series data.ipynb) when you require more info about...\n",
"\n",
" <ul>\n",
" <li>`resample`</li>\n",
" <li>string indexing of DateTimeIndex</li>\n",
"</ul><br><br>\n",
"\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>QUESTION</b>: plot the monthly mean and median concentration of the 'FR04037' station for the years 2009-2012\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data50.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data51.py"
]
},
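{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible solution sketch using `resample`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a possible solution sketch\n",
"monthly = data['FR04037']['2009':'2012'].resample('M').agg(['mean', 'median'])\n",
"monthly.plot()"
]
},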
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>QUESTION</b>: plot the monthly mininum and maximum daily concentration of the 'BETR801' station\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data52.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data53.py"
]
},
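{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible solution sketch: first compute daily means, then take the monthly minimum and maximum of those daily values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a possible solution sketch\n",
"daily = data['BETR801'].resample('D').mean()\n",
"daily.resample('M').agg(['min', 'max']).plot()"
]
},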
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>QUESTION</b>: make a bar plot of the mean of the stations in year of 2012\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data54.py"
]
},
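{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible solution sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a possible solution sketch\n",
"data.loc['2012'].mean().plot(kind='bar')"
]
},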
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>QUESTION</b>: The evolution of the yearly averages with, and the overall mean of all stations (indicate the overall mean with a thicker black line)?\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data55.py"
]
},
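{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible solution sketch: yearly averages per station, with the mean over all stations drawn as a thicker black line."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a possible solution sketch\n",
"yearly = data.resample('A').mean()\n",
"ax = yearly.plot()\n",
"yearly.mean(axis=1).plot(ax=ax, color='k', linewidth=4, label='overall mean')\n",
"ax.legend()"
]
},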
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Combination with groupby"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`resample` can actually be seen as a specific kind of `groupby`. E.g. taking annual means with `data.resample('A', 'mean')` is equivalent to `data.groupby(data.index.year).mean()` (only the result of `resample` still has a DatetimeIndex).\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data.groupby(data.index.year).mean().plot()"
]
},
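{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, the `resample`-based equivalent (note that the result keeps a `DatetimeIndex`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data.resample('A').mean().plot()"
]
},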
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"But, `groupby` is more flexible and can also do resamples that do not result in a new continuous time series, e.g. by grouping by the hour of the day to get the diurnal cycle."
]
},
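{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, grouping by the hour of the day gives the average diurnal cycle of each station:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data.groupby(data.index.hour).mean().plot()"
]
},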
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>QUESTION</b>: how does the *typical monthly profile* look like for the different stations?\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1\\. add a column to the dataframe that indicates the month (integer value of 1 to 12):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data57.py"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"2\\. Now, we can calculate the mean of each month over the different years:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data58.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3\\. plot the typical monthly profile of the different stations:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data59.py"
]
},
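{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three steps combined could look like the following sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a possible solution sketch\n",
"data['month'] = data.index.month\n",
"data.groupby('month').mean().plot()"
]
},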
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = data.drop('month', axis=1, errors='ignore')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>QUESTION</b>: plot the weekly 95% percentiles of the concentration in 'BETR801' and 'BETN029' for 2011\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df2011 = data['2011'].dropna()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data64.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data65.py"
]
},
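{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible solution sketch, using the 2011 subset created above (this assumes a pandas version where the resampler supports `quantile`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a possible solution sketch\n",
"df2011[['BETR801', 'BETN029']].resample('W').quantile(0.95).plot()"
]
},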
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>QUESTION</b>: The typical diurnal profile for the different stations?\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data66.py"
]
},
{
"cell_type": "markdown",
"metadata": {
"clear_cell": false,
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>QUESTION</b>: What are the number of exceedances of hourly values above the European limit 200 µg/m3 for each year/station?\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data67.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data68.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data69.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data70.py"
]
},
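{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible solution sketch: comparing the hourly values with the limit gives a boolean frame, and summing the booleans per year counts the exceedances."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a possible solution sketch\n",
"exceedances = (data > 200).groupby(data.index.year).sum()\n",
"exceedances.plot(kind='bar')"
]
},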
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>QUESTION</b>: And are there exceedances of the yearly limit value of 40 µg/m3 since 200 ?\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data72.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data73.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data74.py"
]
},
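{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible solution sketch: yearly averages since 2000, compared with the limit of 40 µg/m³."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a possible solution sketch\n",
"yearly = data['2000':].resample('A').mean()\n",
"ax = yearly.plot()\n",
"ax.axhline(y=40, color='k', linestyle='--')"
]
},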
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>QUESTION</b>: What is the difference in the typical diurnal profile between week and weekend days? (and visualise it)\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": false,
"collapsed": true
},
"outputs": [],
"source": [
"data.index.weekday?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data76.py"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Add a column indicating week/weekend"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data77.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data78.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data79.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data80.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data81.py"
]
},
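{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible solution sketch for a single station (here 'FR04037' is used as an example): group by a weekend indicator and the hour of the day, and compare the two diurnal profiles."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a possible solution sketch\n",
"weekend = data.index.weekday >= 5          # True for Saturday and Sunday\n",
"diurnal = data['FR04037'].groupby([weekend, data.index.hour]).mean().unstack(level=0)\n",
"diurnal.columns = ['week', 'weekend']\n",
"diurnal.plot()"
]
},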
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>QUESTION</b>: Visualize the typical week profile for the different stations as boxplots (where the values in one boxplot are the daily means for the different weeks for a certain weekday).\n",
"</div>\n",
"\n",
"Tip: the boxplot method of a DataFrame expects the data for the different boxes in different columns). For this, you can either use `pivot_table` as a combination of `groupby` and `unstack`"
]
},
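{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the `pivot_table` route for a single station (here 'BETR801' is used as an example): daily means are labelled with the weekday and an identifier for the week, and each weekday becomes one column (one box)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a possible solution sketch\n",
"daily = data[['BETR801']].resample('D').mean()\n",
"daily['weekday'] = daily.index.weekday\n",
"daily['week'] = daily.index.to_period('W')\n",
"daily.pivot_table(index='week', columns='weekday', values='BETR801').boxplot()"
]
},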
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data82.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data83.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data84.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An alternative method using `groupby` and `unstack`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"scrolled": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data85.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>QUESTION</b>: The maximum daily 8 hour mean should be below 100 µg/m³. What are the number of exceedances of this limit for each year/station?\n",
"</div>\n",
"\n",
"Tip: have a look at the `rolling` method to perform moving window operations.\n",
"\n",
"Note: this is not an actual limit for NO2, but a nice exercise to introduce the `rolling` method. Other pollutans, such as 03 have actually such kind of limit values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data86.py"
]
},
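{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible solution sketch: take an 8-hour rolling mean, reduce it to the daily maximum, and count the days per year on which that maximum exceeds 100."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a possible solution sketch\n",
"rolling_8h = data.rolling(8).mean()            # 8-hour moving average\n",
"daily_max = rolling_8h.resample('D').max()     # maximum per day\n",
"exceedances = (daily_max > 100).groupby(daily_max.index.year).sum()\n",
"exceedances.plot(kind='bar')"
]
},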
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>QUESTION</b>: Calculate the correlation between the different stations\n",
"</div>\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data87.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"scrolled": false
},
"outputs": [],
"source": [
"# %load snippets/07 - Case study - air quality data88.py"
]
},
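{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch: the pairwise correlation between the stations can be obtained with `corr`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a possible solution sketch\n",
"data.corr()"
]
},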
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"celltoolbar": "Nbtutor - export exercises",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
},
"nav_menu": {},
"toc": {
"navigate_menu": true,
"number_sections": true,
"sideBar": true,
"threshold": 6,
"toc_cell": false,
"toc_section_display": "block",
"toc_window_display": true
}
},
"nbformat": 4,
"nbformat_minor": 1
}