OSGeoLive-Notebooks/Pandas/03 - Indexing and selecting...

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Indexing and selecting data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "try:\n",
    "    import seaborn\n",
    "except ImportError:\n",
    "    pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# redefining the example objects\n",
    "\n",
    "# series\n",
    "population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, \n",
    "                        'United Kingdom': 64.9, 'Netherlands': 16.9})\n",
    "\n",
    "# dataframe\n",
    "data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],\n",
    "        'population': [11.3, 64.3, 81.3, 16.9, 64.9],\n",
    "        'area': [30510, 671308, 357050, 41526, 244820],\n",
    "        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}\n",
    "countries = pd.DataFrame(data)\n",
    "countries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Setting the index to the country names:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "countries = countries.set_index('country')\n",
    "countries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Some notes on selecting data\n",
    "\n",
    "One of pandas' basic features is the labeling of rows and columns, but this makes indexing also a bit more complex compared to numpy. We now have to distuinguish between:\n",
    "\n",
    "- selection by label\n",
    "- selection by position."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `data[]` provides some convenience shortcuts "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For a DataFrame, basic indexing selects the columns.\n",
    "\n",
    "Selecting a single column:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "countries['area']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "or multiple columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "countries[['area', 'population']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But, slicing accesses the rows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "countries['France':'Netherlands']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-danger\">\n",
    "    <b>NOTE</b>: Unlike slicing in numpy, the end label is **included**.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So as a summary, `[]` provides the following convenience shortcuts:\n",
    "\n",
    "- Series: selecting a label: `s[label]`\n",
    "- DataFrame: selecting a single or multiple columns: `df['col']` or `df[['col1', 'col2']]`\n",
    "- DataFrame: slicing the rows: `df['row_label1':'row_label2']` or `df[mask]`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Systematic indexing with `loc` and `iloc`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When using `[]` like above, you can only select from one axis at once (rows or columns, not both). For more advanced indexing, you have some extra attributes:\n",
    "    \n",
    "* `loc`: selection by label\n",
    "* `iloc`: selection by position\n",
    "\n",
    "These methods index the different dimensions of the frame:\n",
    "\n",
    "* `df.loc[row_indexer, column_indexer]`\n",
    "* `df.iloc[row_indexer, column_indexer]`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting a single element:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "countries.loc['Germany', 'area']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But the row or column indexer can also be a list, slice, boolean array, .."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "countries.loc['France':'Germany', ['area', 'population']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "Selecting by position with `iloc` works similar as indexing numpy arrays:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "countries.iloc[0:2,1:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The different indexing methods can also be used to assign data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "countries2 = countries.copy()\n",
    "countries2.loc['Belgium':'Germany', 'population'] = 10"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "countries2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Boolean indexing (filtering)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Often, you want to select rows based on a certain condition. This can be done with 'boolean indexing' (like a where clause in SQL). \n",
    "\n",
    "The indexer (or boolean mask) should be 1-dimensional and the same length as the thing being indexed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "countries['area'] > 100000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "countries[countries['area'] > 100000]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "<div class=\"alert alert-success\">\n",
    "    <b>EXERCISE</b>: Add a column `density` with the population density (note: population column is expressed in millions)\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "    <b>EXERCISE</b>: Select the capital and the population column of those countries where the density is larger than 300\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "    <b>EXERCISE</b>: Add a column 'density_ratio' with the ratio of the density to the mean density\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "    <b>EXERCISE</b>: Change the capital of the UK to Cambridge\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "<div class=\"alert alert-success\">\n",
    "    <b>EXERCISE</b>: Select all countries whose population density is between 100 and 300 people/km²\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Some other useful methods: `isin` and string methods"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `isin` method of Series is very useful to select rows that may contain certain values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "s = countries['capital']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "s.isin?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "s.isin(['Berlin', 'London'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This can then be used to filter the dataframe with boolean indexing:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "countries[countries['capital'].isin(['Berlin', 'London'])]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's say we want to select all data for which the capital starts with a 'B'. In Python, when having a string, we could use the `startswith` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "'Berlin'.startswith('B')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In pandas, these are available on a Series through the `str` namespace:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "countries['capital'].str.startswith('B')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For an overview of all string methods, see: http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "    <b>EXERCISE</b>: Select all countries that have capital names with more than 7 characters\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "    <b>EXERCISE</b>: Select all countries that have capital names that contain the character sequence 'am'\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pitfall: chained indexing (and the 'SettingWithCopyWarning')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "countries.loc['Belgium', 'capital'] = 'Ghent' "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "countries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "countries['capital']['Belgium'] = 'Antwerp' "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "countries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "countries[countries['capital'] == 'Antwerp']['capital'] = 'Brussels' "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "countries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How to avoid this?\n",
    "\n",
    "* Use `loc` instead of chained indexing if possible!\n",
    "* Or `copy` explicitly if you don't want to change the original data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## More exercises!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the quick ones among you, here are some more exercises with some larger dataframe with film data. These exercises are based on the [PyCon tutorial of Brandon Rhodes](https://github.com/brandon-rhodes/pycon-pandas-tutorial/) (so all credit to him!) and the datasets he prepared for that. You can download these data from here: [`titles.csv`](https://drive.google.com/open?id=0B3G70MlBnCgKajNMa1pfSzN6Q3M) and [`cast.csv`](https://drive.google.com/open?id=0B3G70MlBnCgKal9UYTJSR2ZhSW8) and put them in the `/data` folder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "cast = pd.read_csv('data/cast.csv')\n",
    "cast.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "titles = pd.read_csv('data/titles.csv')\n",
    "titles.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "    <b>EXERCISE</b>: How many movies are listed in the titles dataframe?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "    <b>EXERCISE</b>: What are the earliest two films listed in the titles dataframe?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "    <b>EXERCISE</b>: How many movies have the title \"Hamlet\"?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "    <b>EXERCISE</b>: List all of the \"Treasure Island\" movies from earliest to most recent.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "    <b>EXERCISE</b>: How many movies were made from 1950 through 1959?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "    <b>EXERCISE</b>: How many roles in the movie \"Inception\" are NOT ranked by an \"n\" value?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "    <b>EXERCISE</b>: But how many roles in the movie \"Inception\" did receive an \"n\" value?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "    <b>EXERCISE</b>: Display the cast of \"North by Northwest\" in their correct \"n\"-value order, ignoring roles that did not earn a numeric \"n\" value.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "    <b>EXERCISE</b>: How many roles were credited in the silent 1921 version of Hamlet?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\">\n",
    "    <b>EXERCISE</b>: List the supporting roles (having n=2) played by Cary Grant in the 1940s, in order by year.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "clear_cell": true,
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "celltoolbar": "Nbtutor - export exercises",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}