OSGeoLive-Notebooks/Pandas/04b - Advanced groupby oper...

836 lines
18 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Groupby operations"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Some imports:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"try:\n",
" import seaborn\n",
"except ImportError:\n",
" pass\n",
"\n",
"pd.options.display.max_rows = 10"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Recap: the groupby operation (split-apply-combine)\n",
"\n",
"The \"group by\" concept: we want to **apply the same function on subsets of your dataframe, based on some key to split the dataframe in subsets**\n",
"\n",
"This operation is also referred to as the \"split-apply-combine\" operation, involving the following steps:\n",
"\n",
"* **Splitting** the data into groups based on some criteria\n",
"* **Applying** a function to each group independently\n",
"* **Combining** the results into a data structure\n",
"\n",
"<img src=\"img/splitApplyCombine.png\">\n",
"\n",
"Similar to SQL `GROUP BY`"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The example of the image in pandas syntax:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],\n",
" 'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the filtering and reductions operations we have seen in the previous notebooks, we could do something like:\n",
"\n",
"\n",
" df[df['key'] == \"A\"].sum()\n",
" df[df['key'] == \"B\"].sum()\n",
" ...\n",
"\n",
"But pandas provides the `groupby` method to do this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"df.groupby('key').aggregate('sum') # np.sum"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.groupby('key').sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas does not only let you group by a column name. In `df.groupby(grouper)` can be many things:\n",
"\n",
"- Series (or string indicating a column in df)\n",
"- function (to be applied on the index)\n",
"- dict : groups by values\n",
"- levels=[], names of levels in a MultiIndex\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.groupby(lambda x: x % 2).mean()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## And now applying this on some real data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These exercises are based on the [PyCon tutorial of Brandon Rhodes](https://github.com/brandon-rhodes/pycon-pandas-tutorial/) (so all credit to him!) and the datasets he prepared for that. You can download these data from here: [`titles.csv`](https://drive.google.com/open?id=0B3G70MlBnCgKajNMa1pfSzN6Q3M) and [`cast.csv`](https://drive.google.com/open?id=0B3G70MlBnCgKal9UYTJSR2ZhSW8) and put them in the `/data` folder."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`cast` dataset: different roles played by actors/actresses in films\n",
"\n",
"- title: title of the film\n",
"- name: name of the actor/actress\n",
"- type: actor/actress\n",
"- n: the order of the role (n=1: leading role)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cast = pd.read_csv('data/cast.csv')\n",
"cast.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"titles = pd.read_csv('data/titles.csv')\n",
"titles.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: Using groupby(), plot the number of films that have been released each decade in the history of cinema.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"titles.groupby('year').sum()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/04b - Advanced groupby operations8.py\n",
"titles.groupby(titles.year // 10 * 10).size().plot(kind='bar')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: Use groupby() to plot the number of \"Hamlet\" films made each decade.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/04b - Advanced groupby operations9.py\n",
"hamlet = titles[titles['title'] == 'Hamlet']\n",
"hamlet.groupby(hamlet.year // 10 * 10).size().plot(kind='bar')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: How many leading (n=1) roles were available to actors, and how many to actresses, in each year of the 1950s?\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/04b - Advanced groupby operations10.py\n",
"cast1950 = cast[cast.year // 10 == 195]\n",
"cast1950 = cast1950[cast1950.n == 1]\n",
"cast1950.groupby(['year', 'type']).size()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: List the 10 actors/actresses that have the most leading roles (n=1) since the 1990's.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/04b - Advanced groupby operations11.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: Use groupby() to determine how many roles are listed for each of The Pink Panther movies.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/04b - Advanced groupby operations12.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: List, in order by year, each of the films in which Frank Oz has played more than 1 role.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/04b - Advanced groupby operations13.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: List each of the characters that Frank Oz has portrayed at least twice.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/04b - Advanced groupby operations15.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Transforms"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sometimes you don't want to aggregate the groups, but transform the values in each group. This can be achieved with `transform`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.groupby('key').transform('mean')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def normalize(group):\n",
" return (group - group.mean()) / group.std()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.groupby('key').transform(normalize)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.groupby('key').transform('sum')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: Add a column to the `cast` dataframe that indicates the number of roles for the film.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/04b - Advanced groupby operations21.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: Calculate the ratio of leading actor and actress roles to the total number of leading roles per decade. \n",
"</div>\n",
"\n",
"Tip: you can to do a groupby twice in two steps, once calculating the numbers, and then the ratios."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/04b - Advanced groupby operations22.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/04b - Advanced groupby operations23.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"scrolled": false
},
"outputs": [],
"source": [
"# %load snippets/04b - Advanced groupby operations24.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Intermezzo: string manipulations"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Python strings have a lot of useful methods available to manipulate or check the content of the string:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s = 'Bradwurst'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s.startswith('B')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In pandas, those methods (together with some additional methods) are also available for string Series through the `.str` accessor:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s = pd.Series(['Bradwurst', 'Kartoffelsalat', 'Sauerkraut'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s.str.startswith('B')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For an overview of all string methods, see: http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: We already plotted the number of 'Hamlet' films released each decade, but not all titles are exactly called 'Hamlet'. Give an overview of the titles that contain 'Hamlet', and that start with 'Hamlet':\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/04b - Advanced groupby operations29.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/04b - Advanced groupby operations30.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: List the 10 movie titles with the longest name.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/04b - Advanced groupby operations31.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/04b - Advanced groupby operations32.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Value counts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A useful shortcut to calculate the number of occurences of certain values is `value_counts` (this is somewhat equivalent to `df.groupby(key).size())`)\n",
"\n",
"For example, what are the most occuring movie titles?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"titles.title.value_counts().head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: Which years saw the most films released?\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/04b - Advanced groupby operations34.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: Plot the number of released films over time\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/04b - Advanced groupby operations35.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: Plot the number of \"Hamlet\" films made each decade.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/04b - Advanced groupby operations36.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: What are the 11 most common character names in movie history?\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/04b - Advanced groupby operations37.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: Which actors or actresses appeared in the most movies in the year 2010?\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/04b - Advanced groupby operations38.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: Plot how many roles Brad Pitt has played in each year of his career.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/04b - Advanced groupby operations39.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: What are the 10 most film titles roles that start with the word \"The Life\"?\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/04b - Advanced groupby operations40.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: How many leading (n=1) roles were available to actors, and how many to actresses, in the 1950s? And in 2000s?\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/04b - Advanced groupby operations41.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": [
"# %load snippets/04b - Advanced groupby operations42.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"celltoolbar": "Nbtutor - export exercises",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 1
}