OSGeoLive-Notebooks/Pandas/03 - Indexing and selecting...

914 lines
19 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Indexing and selecting data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"try:\n",
" import seaborn\n",
"except ImportError:\n",
" pass"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# redefining the example objects\n",
"\n",
"# series\n",
"population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, \n",
" 'United Kingdom': 64.9, 'Netherlands': 16.9})\n",
"\n",
"# dataframe\n",
"data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],\n",
" 'population': [11.3, 64.3, 81.3, 16.9, 64.9],\n",
" 'area': [30510, 671308, 357050, 41526, 244820],\n",
" 'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}\n",
"countries = pd.DataFrame(data)\n",
"countries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Setting the index to the country names:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries = countries.set_index('country')\n",
"countries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Some notes on selecting data\n",
"\n",
"One of pandas' basic features is the labeling of rows and columns, but this makes indexing also a bit more complex compared to numpy. We now have to distuinguish between:\n",
"\n",
"- selection by label\n",
"- selection by position."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### `data[]` provides some convenience shortcuts "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a DataFrame, basic indexing selects the columns.\n",
"\n",
"Selecting a single column:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries['area']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"or multiple columns:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries[['area', 'population']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But, slicing accesses the rows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries['France':'Netherlands']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-danger\">\n",
" <b>NOTE</b>: Unlike slicing in numpy, the end label is **included**.\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So as a summary, `[]` provides the following convenience shortcuts:\n",
"\n",
"- Series: selecting a label: `s[label]`\n",
"- DataFrame: selecting a single or multiple columns: `df['col']` or `df[['col1', 'col2']]`\n",
"- DataFrame: slicing the rows: `df['row_label1':'row_label2']` or `df[mask]`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Systematic indexing with `loc` and `iloc`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When using `[]` like above, you can only select from one axis at once (rows or columns, not both). For more advanced indexing, you have some extra attributes:\n",
" \n",
"* `loc`: selection by label\n",
"* `iloc`: selection by position\n",
"\n",
"These methods index the different dimensions of the frame:\n",
"\n",
"* `df.loc[row_indexer, column_indexer]`\n",
"* `df.iloc[row_indexer, column_indexer]`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Selecting a single element:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries.loc['Germany', 'area']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But the row or column indexer can also be a list, slice, boolean array, .."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries.loc['France':'Germany', ['area', 'population']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"Selecting by position with `iloc` works similar as indexing numpy arrays:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries.iloc[0:2,1:3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The different indexing methods can also be used to assign data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"countries2 = countries.copy()\n",
"countries2.loc['Belgium':'Germany', 'population'] = 10"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Boolean indexing (filtering)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Often, you want to select rows based on a certain condition. This can be done with 'boolean indexing' (like a where clause in SQL). \n",
"\n",
"The indexer (or boolean mask) should be 1-dimensional and the same length as the thing being indexed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries['area'] > 100000"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries[countries['area'] > 100000]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: Add a column `density` with the population density (note: population column is expressed in millions)\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: Select the capital and the population column of those countries where the density is larger than 300\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: Add a column 'density_ratio' with the ratio of the density to the mean density\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: Change the capital of the UK to Cambridge\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: Select all countries whose population density is between 100 and 300 people/km²\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## Some other useful methods: `isin` and string methods"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `isin` method of Series is very useful to select rows that may contain certain values:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"s = countries['capital']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"s.isin?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s.isin(['Berlin', 'London'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This can then be used to filter the dataframe with boolean indexing:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries[countries['capital'].isin(['Berlin', 'London'])]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's say we want to select all data for which the capital starts with a 'B'. In Python, when having a string, we could use the `startswith` method:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"'Berlin'.startswith('B')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In pandas, these are available on a Series through the `str` namespace:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries['capital'].str.startswith('B')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For an overview of all string methods, see: http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: Select all countries that have capital names with more than 7 characters\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: Select all countries that have capital names that contain the character sequence 'am'\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pitfall: chained indexing (and the 'SettingWithCopyWarning')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"countries.loc['Belgium', 'capital'] = 'Ghent' "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries['capital']['Belgium'] = 'Antwerp' "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries[countries['capital'] == 'Antwerp']['capital'] = 'Brussels' "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How to avoid this?\n",
"\n",
"* Use `loc` instead of chained indexing if possible!\n",
"* Or `copy` explicitly if you don't want to change the original data."
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## More exercises!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the quick ones among you, here are some more exercises with some larger dataframe with film data. These exercises are based on the [PyCon tutorial of Brandon Rhodes](https://github.com/brandon-rhodes/pycon-pandas-tutorial/) (so all credit to him!) and the datasets he prepared for that. You can download these data from here: [`titles.csv`](https://drive.google.com/open?id=0B3G70MlBnCgKajNMa1pfSzN6Q3M) and [`cast.csv`](https://drive.google.com/open?id=0B3G70MlBnCgKal9UYTJSR2ZhSW8) and put them in the `/data` folder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"cast = pd.read_csv('data/cast.csv')\n",
"cast.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"titles = pd.read_csv('data/titles.csv')\n",
"titles.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: How many movies are listed in the titles dataframe?\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: What are the earliest two films listed in the titles dataframe?\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: How many movies have the title \"Hamlet\"?\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: List all of the \"Treasure Island\" movies from earliest to most recent.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: How many movies were made from 1950 through 1959?\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: How many roles in the movie \"Inception\" are NOT ranked by an \"n\" value?\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: But how many roles in the movie \"Inception\" did receive an \"n\" value?\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: Display the cast of \"North by Northwest\" in their correct \"n\"-value order, ignoring roles that did not earn a numeric \"n\" value.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: How many roles were credited in the silent 1921 version of Hamlet?\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: List the supporting roles (having n=2) played by Cary Grant in the 1940s, in order by year.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"celltoolbar": "Nbtutor - export exercises",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}