OSGeoLive-Notebooks/Pandas/02 - Data structures.ipynb

922 lines
18 KiB
Plaintext

{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"try:\n",
" import seaborn\n",
"except ImportError:\n",
" pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tabular data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df = pd.read_csv(\"data/titanic.csv\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Starting from reading this dataset, to answering questions about this data in a few lines of code:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**What is the age distribution of the passengers?**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df['Age'].hist()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**How does the survival rate of the passengers differ between sexes?**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.groupby('Sex')[['Survived']].aggregate(lambda x: x.sum() / len(x))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Or how does it differ between the different classes?**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.groupby('Pclass')['Survived'].aggregate(lambda x: x.sum() / len(x)).plot(kind='bar')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Are young people more likely to survive?**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df['Survived'].sum() / df['Survived'].count()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df25 = df[df['Age'] <= 25]\n",
"df25['Survived'].sum() / len(df25['Survived'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All the needed functionality for the above examples will be explained throughout this tutorial."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data structures\n",
"\n",
"Pandas provides two fundamental data objects, for 1D (``Series``) and 2D data (``DataFrame``)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Series\n",
"\n",
"A Series is a basic holder for **one-dimensional labeled data**. It can be created much as a NumPy array is created:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s = pd.Series([0.1, 0.2, 0.3, 0.4])\n",
"s"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Attributes of a Series: `index` and `values`\n",
"\n",
"The series has a built-in concept of an **index**, which by default is the numbers *0* through *N - 1*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s.index"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can access the underlying numpy array representation with the `.values` attribute:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s.values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can access series values via the index, just like for NumPy arrays:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unlike the NumPy array, though, this index can be something other than integers:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])\n",
"s2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s2['c']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this way, a ``Series`` object can be thought of as similar to an ordered dictionary mapping one typed value to another typed value.\n",
"\n",
"In fact, it's possible to construct a series directly from a Python dictionary:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pop_dict = {'Germany': 81.3, \n",
" 'Belgium': 11.3, \n",
" 'France': 64.3, \n",
" 'United Kingdom': 64.9, \n",
" 'Netherlands': 16.9}\n",
"population = pd.Series(pop_dict)\n",
"population"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can index the populations like a dict as expected:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"population['France']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"but with the power of numpy arrays:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"population * 1000"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## DataFrames: Multi-dimensional Data\n",
"\n",
"A DataFrame is a **tablular data structure** (multi-dimensional object to hold labeled data) comprised of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can think of it as multiple Series object which share the same index.\n",
"\n",
"<img src=\"img/dataframe.png\" width=110%>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the most common ways of creating a dataframe is from a dictionary of arrays or lists.\n",
"\n",
"Note that in the IPython notebook, the dataframe will display in a rich HTML view:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],\n",
" 'population': [11.3, 64.3, 81.3, 16.9, 64.9],\n",
" 'area': [30510, 671308, 357050, 41526, 244820],\n",
" 'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}\n",
"countries = pd.DataFrame(data)\n",
"countries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Attributes of the DataFrame\n",
"\n",
"A DataFrame has besides a `index` attribute, also a `columns` attribute:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries.index"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To check the data types of the different columns:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An overview of that information can be given with the `info()` method:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Also a DataFrame has a `values` attribute, but attention: when you have heterogeneous data, all values will be upcasted:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries.values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we don't like what the index looks like, we can reset it and set one of our columns:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries = countries.set_index('country')\n",
"countries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To access a Series representing a column in the data, use typical indexing syntax:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries['area']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Basic operations on Series/Dataframes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you play around with DataFrames, you'll notice that many operations which work on NumPy arrays will also work on dataframes.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# redefining the example objects\n",
"\n",
"population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, \n",
" 'United Kingdom': 64.9, 'Netherlands': 16.9})\n",
"\n",
"countries = pd.DataFrame({'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],\n",
" 'population': [11.3, 64.3, 81.3, 16.9, 64.9],\n",
" 'area': [30510, 671308, 357050, 41526, 244820],\n",
" 'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Elementwise-operations (like numpy)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just like with numpy arrays, many operations are element-wise:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"population / 100"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries['population'] / countries['area']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Alignment! (unlike numpy)\n",
"\n",
"Only, pay attention to **alignment**: operations between series will align on the index: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s1 = population[['Belgium', 'France']]\n",
"s2 = population[['France', 'Germany']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"s2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"s1 + s2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Reductions (like numpy)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The average population number:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"population.mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The minimum area:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries['area'].min()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For dataframes, often only the numeric columns are included in the result:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries.median()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: Calculate the population numbers relative to Belgium\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success\">\n",
" <b>EXERCISE</b>: Calculate the population density for each country and add this as a new column to the dataframe.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"clear_cell": true,
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Some other useful methods"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sorting the rows of the DataFrame according to the values in a column:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries.sort_values('density', ascending=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One useful method to use is the ``describe`` method, which computes summary statistics for each column:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `plot` method can be used to quickly visualize the data in different ways:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries.plot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, for this dataset, it does not say that much:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"countries['population'].plot(kind='bar')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can play with the `kind` keyword: 'line', 'bar', 'hist', 'density', 'area', 'pie', 'scatter', 'hexbin'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Importing and exporting data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A wide range of input/output formats are natively supported by pandas:\n",
"\n",
"* CSV, text\n",
"* SQL database\n",
"* Excel\n",
"* HDF5\n",
"* json\n",
"* html\n",
"* pickle\n",
"* ..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pd.read"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"states.to"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Other features\n",
"\n",
"* Working with missing data (`.dropna()`, `pd.isnull()`)\n",
"* Merging and joining (`concat`, `join`)\n",
"* Grouping: `groupby` functionality\n",
"* Reshaping (`stack`, `pivot`)\n",
"* Time series manipulation (resampling, timezones, ..)\n",
"* Easy plotting"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are many, many more interesting operations that can be done on Series and DataFrame objects, but rather than continue using this toy data, we'll instead move to a real-world example, and illustrate some of the advanced concepts along the way.\n",
"\n",
"See the next notebooks!"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## Acknowledgement\n",
"\n",
"> *© 2015, Stijn Van Hoey and Joris Van den Bossche (<mailto:stijnvanhoey@gmail.com>, <mailto:jorisvandenbossche@gmail.com>). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)*\n",
"\n",
"> This notebook is partly based on material of Jake Vanderplas (https://github.com/jakevdp/OsloWorkshop2014).\n",
"\n",
"---"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}