289 lines
5.9 KiB
Plaintext
289 lines
5.9 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Groupby operations"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"Some imports:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "-"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"%matplotlib inline\n",
|
|
"import pandas as pd\n",
|
|
"import numpy as np\n",
|
|
"import matplotlib.pyplot as plt\n",
|
|
"try:\n",
|
|
" import seaborn\n",
|
|
"except ImportError:\n",
|
|
" pass\n",
|
|
"\n",
|
|
"pd.options.display.max_rows = 10"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"## Some 'theory': the groupby operation (split-apply-combine)\n",
|
|
"\n",
|
|
"The \"group by\" concept: we want to **apply the same function on subsets of your dataframe, based on some key to split the dataframe in subsets**\n",
|
|
"\n",
|
|
"This operation is also referred to as the \"split-apply-combine\" operation, involving the following steps:\n",
|
|
"\n",
|
|
"* **Splitting** the data into groups based on some criteria\n",
|
|
"* **Applying** a function to each group independently\n",
|
|
"* **Combining** the results into a data structure\n",
|
|
"\n",
|
|
"<img src=\"img/splitApplyCombine.png\">\n",
|
|
"\n",
|
|
"Similar to SQL `GROUP BY`"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"The example of the image in pandas syntax:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],\n",
|
|
" 'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})\n",
|
|
"df"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Using the filtering and reductions operations we have seen in the previous notebooks, we could do something like:\n",
|
|
"\n",
|
|
"\n",
|
|
" df[df['key'] == \"A\"].sum()\n",
|
|
" df[df['key'] == \"B\"].sum()\n",
|
|
" ...\n",
|
|
"\n",
|
|
"But pandas provides the `groupby` method to do this:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.groupby('key').aggregate(np.sum) # 'sum'"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.groupby('key').sum()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"And many more methods are available. "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"## And now applying this on some real data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We go back to the titanic survival data:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df = pd.read_csv(\"data/titanic.csv\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div class=\"alert alert-success\">\n",
|
|
" <b>EXERCISE</b>: Using groupby(), calculate the average age for each sex.\n",
|
|
"</div>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true
|
|
},
|
|
"outputs": [],
|
|
"source": []
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div class=\"alert alert-success\">\n",
|
|
" <b>EXERCISE</b>: Calculate the average survival ratio for all passengers.\n",
|
|
"</div>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true
|
|
},
|
|
"outputs": [],
|
|
"source": []
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"\n",
|
|
"<div class=\"alert alert-success\">\n",
|
|
" <b>EXERCISE</b>: Calculate this survival ratio for all passengers younger that 25 (remember: filtering/boolean indexing).\n",
|
|
"</div>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true
|
|
},
|
|
"outputs": [],
|
|
"source": []
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div class=\"alert alert-success\">\n",
|
|
" <b>EXERCISE</b>: Is there a difference in this survival ratio between the sexes? (tip: write the above calculation of the survival ratio as a function)\n",
|
|
"</div>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true
|
|
},
|
|
"outputs": [],
|
|
"source": []
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<div class=\"alert alert-success\">\n",
|
|
" <b>EXERCISE</b>: Make a bar plot of the survival ratio for the different classes ('Pclass' column).\n",
|
|
"</div>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"clear_cell": true
|
|
},
|
|
"outputs": [],
|
|
"source": []
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"If you are ready, more groupby exercises can be found in the \"Advanded groupby operations\" notebook."
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"celltoolbar": "Nbtutor - export exercises",
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.8"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 1
|
|
}
|