OSGeoLive-Notebooks/Pandas/04 - Groupby operations.ipynb

289 lines
5.9 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Groupby operations"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Some imports:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"try:\n",
" import seaborn\n",
"except ImportError:\n",
" pass\n",
"\n",
"pd.options.display.max_rows = 10"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Some 'theory': the groupby operation (split-apply-combine)\n",
"\n",
"The \"group by\" concept: we want to **apply the same function on subsets of your dataframe, based on some key to split the dataframe in subsets**\n",
"\n",
"This operation is also referred to as the \"split-apply-combine\" operation, involving the following steps:\n",
"\n",
"* **Splitting** the data into groups based on some criteria\n",
"* **Applying** a function to each group independently\n",
"* **Combining** the results into a data structure\n",
"\n",
"<img src=\"img/splitApplyCombine.png\">\n",
"\n",
"Similar to SQL `GROUP BY`"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The example of the image in pandas syntax:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],\n",
" 'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the filtering and reductions operations we have seen in the previous notebooks, we could do something like:\n",
"\n",
"\n",
" df[df['key'] == \"A\"].sum()\n",
" df[df['key'] == \"B\"].sum()\n",
" ...\n",
"\n",
"But pandas provides the `groupby` method to do this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"df.groupby('key').aggregate(np.sum) # 'sum'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.groupby('key').sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And many more methods are available. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## And now applying this on some real data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We go back to the titanic survival data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\"data/titanic.csv\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: Using groupby(), calculate the average age for each sex.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: Calculate the average survival ratio for all passengers.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: Calculate this survival ratio for all passengers younger that 25 (remember: filtering/boolean indexing).\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: Is there a difference in this survival ratio between the sexes? (tip: write the above calculation of the survival ratio as a function)\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: Make a bar plot of the survival ratio for the different classes ('Pclass' column).\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are ready, more groupby exercises can be found in the \"Advanded groupby operations\" notebook."
]
}
],
"metadata": {
"celltoolbar": "Nbtutor - export exercises",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 1
}