{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# AKON Metadata - Data Overview"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Get a first impression of the postcard metadata*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the [Pandas Python Data Analysis Library](https://pandas.pydata.org/).\n",
"\n",
"For an intro to pandas feel free to take a look at this [Workshop for CBioVikings](https://github.com/dblyon/PandasIntro) by David Lyon."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`df` stands for *Data Frame*"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('akon_postcards_public_domain.csv.bz2')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## View Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Rough Overview"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How much datasets are in there?"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"28882"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What does a dataset look like?\n",
"Show me the first one!"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Unnamed: 0 | \n",
" akon_id | \n",
" id | \n",
" altitude | \n",
" building | \n",
" city | \n",
" color | \n",
" comment | \n",
" mountain | \n",
" other | \n",
" ... | \n",
" date | \n",
" feature_class | \n",
" feature_code | \n",
" geoname_id | \n",
" latitude | \n",
" longitude | \n",
" name | \n",
" country_id | \n",
" admin_name_1 | \n",
" admin_code_1 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" AK111_024 | \n",
" 74685 | \n",
" NaN | \n",
" NaN | \n",
" Kierling | \n",
" True | \n",
" 1908 | \n",
" NaN | \n",
" NaN | \n",
" ... | \n",
" gelaufen 1908 | \n",
" P | \n",
" PPL | \n",
" 2774449.0 | \n",
" 48.30997 | \n",
" 16.27616 | \n",
" Kierling | \n",
" AT | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
1 rows × 29 columns
\n",
"
"
],
"text/plain": [
" Unnamed: 0 akon_id id altitude building city color comment \\\n",
"0 0 AK111_024 74685 NaN NaN Kierling True 1908 \n",
"\n",
" mountain other ... date feature_class feature_code \\\n",
"0 NaN NaN ... gelaufen 1908 P PPL \n",
"\n",
" geoname_id latitude longitude name country_id admin_name_1 \\\n",
"0 2774449.0 48.30997 16.27616 Kierling AT NaN \n",
"\n",
" admin_code_1 \n",
"0 NaN \n",
"\n",
"[1 rows x 29 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There seem to be a few columns missing from the output. Let's fix that by setting pandas output options:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"pd.set_option('display.max_columns', 100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try again:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Unnamed: 0 | \n",
" akon_id | \n",
" id | \n",
" altitude | \n",
" building | \n",
" city | \n",
" color | \n",
" comment | \n",
" mountain | \n",
" other | \n",
" photographer | \n",
" publisher | \n",
" publisher_place | \n",
" region | \n",
" water_body | \n",
" year | \n",
" inventory_number | \n",
" signature | \n",
" revision_date | \n",
" date | \n",
" feature_class | \n",
" feature_code | \n",
" geoname_id | \n",
" latitude | \n",
" longitude | \n",
" name | \n",
" country_id | \n",
" admin_name_1 | \n",
" admin_code_1 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" AK111_024 | \n",
" 74685 | \n",
" NaN | \n",
" NaN | \n",
" Kierling | \n",
" True | \n",
" 1908 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" Geogr. Topogr. Bilder-Samml. 1944, 6380 | \n",
" 2014-09-05 10:13:12.536 | \n",
" gelaufen 1908 | \n",
" P | \n",
" PPL | \n",
" 2774449.0 | \n",
" 48.30997 | \n",
" 16.27616 | \n",
" Kierling | \n",
" AT | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Unnamed: 0 akon_id id altitude building city color comment \\\n",
"0 0 AK111_024 74685 NaN NaN Kierling True 1908 \n",
"\n",
" mountain other photographer publisher publisher_place region water_body \\\n",
"0 NaN NaN NaN NaN NaN NaN NaN \n",
"\n",
" year inventory_number signature \\\n",
"0 NaN NaN Geogr. Topogr. Bilder-Samml. 1944, 6380 \n",
"\n",
" revision_date date feature_class feature_code \\\n",
"0 2014-09-05 10:13:12.536 gelaufen 1908 P PPL \n",
"\n",
" geoname_id latitude longitude name country_id admin_name_1 \\\n",
"0 2774449.0 48.30997 16.27616 Kierling AT NaN \n",
"\n",
" admin_code_1 \n",
"0 NaN "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we see all columns."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What are all the columns called again?"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Unnamed: 0', 'akon_id', 'id', 'altitude', 'building', 'city', 'color',\n",
" 'comment', 'mountain', 'other', 'photographer', 'publisher',\n",
" 'publisher_place', 'region', 'water_body', 'year', 'inventory_number',\n",
" 'signature', 'revision_date', 'date', 'feature_class', 'feature_code',\n",
" 'geoname_id', 'latitude', 'longitude', 'name', 'country_id',\n",
" 'admin_name_1', 'admin_code_1'],\n",
" dtype='object')"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Show Random Entries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show me 3 random entries:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Unnamed: 0 | \n",
" akon_id | \n",
" id | \n",
" altitude | \n",
" building | \n",
" city | \n",
" color | \n",
" comment | \n",
" mountain | \n",
" other | \n",
" photographer | \n",
" publisher | \n",
" publisher_place | \n",
" region | \n",
" water_body | \n",
" year | \n",
" inventory_number | \n",
" signature | \n",
" revision_date | \n",
" date | \n",
" feature_class | \n",
" feature_code | \n",
" geoname_id | \n",
" latitude | \n",
" longitude | \n",
" name | \n",
" country_id | \n",
" admin_name_1 | \n",
" admin_code_1 | \n",
"
\n",
" \n",
" \n",
" \n",
" 458 | \n",
" 458 | \n",
" AK114_189 | \n",
" 76614 | \n",
" NaN | \n",
" NaN | \n",
" Airolo | \n",
" True | \n",
" 1909 gel | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" Geogr. Topogr. Bilder-Samml. 1944, 27 | \n",
" 2014-09-09 08:50:37.905 | \n",
" gelaufen 1909 | \n",
" P | \n",
" PPL | \n",
" 2661830.0 | \n",
" 46.52847 | \n",
" 8.60881 | \n",
" Airolo | \n",
" CH | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 24510 | \n",
" 24510 | \n",
" AK075_467 | \n",
" 46922 | \n",
" NaN | \n",
" NaN | \n",
" Klausen | \n",
" False | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 1913.0 | \n",
" NaN | \n",
" NaN | \n",
" 2014-08-21 10:05:32.779 | \n",
" 1913 | \n",
" P | \n",
" PPLA3 | \n",
" 3178764.0 | \n",
" 46.64001 | \n",
" 11.56573 | \n",
" Klausen | \n",
" IT | \n",
" Südtirol | \n",
" 17 | \n",
"
\n",
" \n",
" 18564 | \n",
" 18564 | \n",
" AK082_546 | \n",
" 51980 | \n",
" NaN | \n",
" NaN | \n",
" Velsen | \n",
" False | \n",
" v 1905 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 2014-08-25 17:46:23.417 | \n",
" vor 1905 | \n",
" P | \n",
" PPL | \n",
" 2745673.0 | \n",
" 52.46000 | \n",
" 4.65000 | \n",
" Velsen | \n",
" NL | \n",
" Nord-Holland | \n",
" 07 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Unnamed: 0 akon_id id altitude building city color \\\n",
"458 458 AK114_189 76614 NaN NaN Airolo True \n",
"24510 24510 AK075_467 46922 NaN NaN Klausen False \n",
"18564 18564 AK082_546 51980 NaN NaN Velsen False \n",
"\n",
" comment mountain other photographer publisher publisher_place region \\\n",
"458 1909 gel NaN NaN NaN NaN NaN NaN \n",
"24510 NaN NaN NaN NaN NaN NaN NaN \n",
"18564 v 1905 NaN NaN NaN NaN NaN NaN \n",
"\n",
" water_body year inventory_number \\\n",
"458 NaN NaN NaN \n",
"24510 NaN 1913.0 NaN \n",
"18564 NaN NaN NaN \n",
"\n",
" signature revision_date \\\n",
"458 Geogr. Topogr. Bilder-Samml. 1944, 27 2014-09-09 08:50:37.905 \n",
"24510 NaN 2014-08-21 10:05:32.779 \n",
"18564 NaN 2014-08-25 17:46:23.417 \n",
"\n",
" date feature_class feature_code geoname_id latitude \\\n",
"458 gelaufen 1909 P PPL 2661830.0 46.52847 \n",
"24510 1913 P PPLA3 3178764.0 46.64001 \n",
"18564 vor 1905 P PPL 2745673.0 52.46000 \n",
"\n",
" longitude name country_id admin_name_1 admin_code_1 \n",
"458 8.60881 Airolo CH NaN NaN \n",
"24510 11.56573 Klausen IT Südtirol 17 \n",
"18564 4.65000 Velsen NL Nord-Holland 07 "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sample(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calling `sample` again yields different entries:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Unnamed: 0 | \n",
" akon_id | \n",
" id | \n",
" altitude | \n",
" building | \n",
" city | \n",
" color | \n",
" comment | \n",
" mountain | \n",
" other | \n",
" photographer | \n",
" publisher | \n",
" publisher_place | \n",
" region | \n",
" water_body | \n",
" year | \n",
" inventory_number | \n",
" signature | \n",
" revision_date | \n",
" date | \n",
" feature_class | \n",
" feature_code | \n",
" geoname_id | \n",
" latitude | \n",
" longitude | \n",
" name | \n",
" country_id | \n",
" admin_name_1 | \n",
" admin_code_1 | \n",
"
\n",
" \n",
" \n",
" \n",
" 27658 | \n",
" 27658 | \n",
" AK097_351 | \n",
" 62277 | \n",
" NaN | \n",
" Station Brenner | \n",
" Steinach am Brenner | \n",
" False | \n",
" 1905 gel | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 2014-09-02 11:15:45.988 | \n",
" vor 1905 | \n",
" P | \n",
" PPLA3 | \n",
" 2764557.0 | \n",
" 47.08333 | \n",
" 11.46667 | \n",
" Steinach am Brenner | \n",
" AT | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 14993 | \n",
" 14993 | \n",
" AK021_515 | \n",
" 12604 | \n",
" NaN | \n",
" NaN | \n",
" Gross-Pöchlarn, Klein-Pöchlarn | \n",
" False | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" Ledermann | \n",
" Wien | \n",
" NaN | \n",
" NaN | \n",
" 1917.0 | \n",
" NaN | \n",
" NaN | \n",
" 2014-08-04 07:59:10.136 | \n",
" 1917 | \n",
" P | \n",
" PPLA3 | \n",
" 2768627.0 | \n",
" 48.20000 | \n",
" 15.20000 | \n",
" Pöchlarn | \n",
" AT | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 15934 | \n",
" 15934 | \n",
" AK025_596 | \n",
" 15102 | \n",
" NaN | \n",
" Pfarrkirche | \n",
" Mondsee | \n",
" False | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 1914.0 | \n",
" NaN | \n",
" NaN | \n",
" 2014-08-04 07:59:10.187 | \n",
" 1914 | \n",
" P | \n",
" PPLA3 | \n",
" 2771277.0 | \n",
" 47.85648 | \n",
" 13.34908 | \n",
" Mondsee | \n",
" AT | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Unnamed: 0 akon_id id altitude building \\\n",
"27658 27658 AK097_351 62277 NaN Station Brenner \n",
"14993 14993 AK021_515 12604 NaN NaN \n",
"15934 15934 AK025_596 15102 NaN Pfarrkirche \n",
"\n",
" city color comment mountain other \\\n",
"27658 Steinach am Brenner False 1905 gel NaN NaN \n",
"14993 Gross-Pöchlarn, Klein-Pöchlarn False NaN NaN NaN \n",
"15934 Mondsee False NaN NaN NaN \n",
"\n",
" photographer publisher publisher_place region water_body year \\\n",
"27658 NaN NaN NaN NaN NaN NaN \n",
"14993 NaN Ledermann Wien NaN NaN 1917.0 \n",
"15934 NaN NaN NaN NaN NaN 1914.0 \n",
"\n",
" inventory_number signature revision_date date \\\n",
"27658 NaN NaN 2014-09-02 11:15:45.988 vor 1905 \n",
"14993 NaN NaN 2014-08-04 07:59:10.136 1917 \n",
"15934 NaN NaN 2014-08-04 07:59:10.187 1914 \n",
"\n",
" feature_class feature_code geoname_id latitude longitude \\\n",
"27658 P PPLA3 2764557.0 47.08333 11.46667 \n",
"14993 P PPLA3 2768627.0 48.20000 15.20000 \n",
"15934 P PPLA3 2771277.0 47.85648 13.34908 \n",
"\n",
" name country_id admin_name_1 admin_code_1 \n",
"27658 Steinach am Brenner AT NaN NaN \n",
"14993 Pöchlarn AT NaN NaN \n",
"15934 Mondsee AT NaN NaN "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sample(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Count Things"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How many entries show things in Italy?\n",
"\n",
"Let's use the `country_id` for this question:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"df_in_italy = df[df['country_id'] == 'IT']"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2983"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df_in_italy)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How many postcards are in color?"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"df_in_color = df[df['color'] == True]"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"7075"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df_in_color)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Can I do this in one line?"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"7075"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df[df['color'] == True])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How many different publisher places are in the data set?"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1324"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df['publisher_place'].unique())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show me some!"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"ename": "AttributeError",
"evalue": "'numpy.ndarray' object has no attribute 'sample'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'publisher_place'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0munique\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msample\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m10\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mAttributeError\u001b[0m: 'numpy.ndarray' object has no attribute 'sample'"
]
}
],
"source": [
"df['publisher_place'].unique().sample(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Oh, that doesn't work. Let's wrap it in a pandas DataFrame, step by step:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"publisher_places = df['publisher_place'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([nan, 'Kierling', 'Kindberg', ..., 'Straßburg i./E.', 'Detmold',\n",
" 'Furth i. W.'], dtype=object)"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"publisher_places"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"pp = pd.DataFrame(publisher_places)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" Kierling | \n",
"
\n",
" \n",
" 2 | \n",
" Kindberg | \n",
"
\n",
" \n",
" 3 | \n",
" Kirchau | \n",
"
\n",
" \n",
" 4 | \n",
" Wien | \n",
"
\n",
" \n",
" 5 | \n",
" Kirchhain | \n",
"
\n",
" \n",
" 6 | \n",
" München | \n",
"
\n",
" \n",
" 7 | \n",
" Kitzbühel | \n",
"
\n",
" \n",
" 8 | \n",
" Klagenfurt | \n",
"
\n",
" \n",
" 9 | \n",
" Grein a/D. | \n",
"
\n",
" \n",
" 10 | \n",
" Bozen | \n",
"
\n",
" \n",
" 11 | \n",
" Znaim | \n",
"
\n",
" \n",
" 12 | \n",
" Graz | \n",
"
\n",
" \n",
" 13 | \n",
" Heidelberg | \n",
"
\n",
" \n",
" 14 | \n",
" Komotau | \n",
"
\n",
" \n",
" 15 | \n",
" Köln | \n",
"
\n",
" \n",
" 16 | \n",
" Bodenbach a. d. Elbe | \n",
"
\n",
" \n",
" 17 | \n",
" Meissen | \n",
"
\n",
" \n",
" 18 | \n",
" Leipzig | \n",
"
\n",
" \n",
" 19 | \n",
" Konstanz | \n",
"
\n",
" \n",
" 20 | \n",
" Korneuburg | \n",
"
\n",
" \n",
" 21 | \n",
" Brașov | \n",
"
\n",
" \n",
" 22 | \n",
" Mürzzuschlag | \n",
"
\n",
" \n",
" 23 | \n",
" Salzburg | \n",
"
\n",
" \n",
" 24 | \n",
" Frankfurt a. M. | \n",
"
\n",
" \n",
" 25 | \n",
" Arys | \n",
"
\n",
" \n",
" 26 | \n",
" Tegernsee | \n",
"
\n",
" \n",
" 27 | \n",
" Garmisch | \n",
"
\n",
" \n",
" 28 | \n",
" Partenkirchen-Garmisch | \n",
"
\n",
" \n",
" 29 | \n",
" Zürich | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 1294 | \n",
" Neugersdorf | \n",
"
\n",
" \n",
" 1295 | \n",
" Friedau | \n",
"
\n",
" \n",
" 1296 | \n",
" Freystadt | \n",
"
\n",
" \n",
" 1297 | \n",
" Gießhübl | \n",
"
\n",
" \n",
" 1298 | \n",
" Gams | \n",
"
\n",
" \n",
" 1299 | \n",
" Schlossberg | \n",
"
\n",
" \n",
" 1300 | \n",
" Frakfurt a. Oder | \n",
"
\n",
" \n",
" 1301 | \n",
" Casale Monferrato | \n",
"
\n",
" \n",
" 1302 | \n",
" Sternberg | \n",
"
\n",
" \n",
" 1303 | \n",
" Stössing | \n",
"
\n",
" \n",
" 1304 | \n",
" Mauterndorf | \n",
"
\n",
" \n",
" 1305 | \n",
" Thörl | \n",
"
\n",
" \n",
" 1306 | \n",
" Traismauer | \n",
"
\n",
" \n",
" 1307 | \n",
" Trebnitz | \n",
"
\n",
" \n",
" 1308 | \n",
" Voitsberg | \n",
"
\n",
" \n",
" 1309 | \n",
" Daun | \n",
"
\n",
" \n",
" 1310 | \n",
" Kilchberg-Züich | \n",
"
\n",
" \n",
" 1311 | \n",
" Strassburg i. Els. | \n",
"
\n",
" \n",
" 1312 | \n",
" Mühlhausen | \n",
"
\n",
" \n",
" 1313 | \n",
" Eschwege | \n",
"
\n",
" \n",
" 1314 | \n",
" Tabarz | \n",
"
\n",
" \n",
" 1315 | \n",
" Weimar | \n",
"
\n",
" \n",
" 1316 | \n",
" Coburg | \n",
"
\n",
" \n",
" 1317 | \n",
" Friedrichsroda i. Th. | \n",
"
\n",
" \n",
" 1318 | \n",
" Leipa i. B. | \n",
"
\n",
" \n",
" 1319 | \n",
" Schumburg a. D. | \n",
"
\n",
" \n",
" 1320 | \n",
" Pisa | \n",
"
\n",
" \n",
" 1321 | \n",
" Straßburg i./E. | \n",
"
\n",
" \n",
" 1322 | \n",
" Detmold | \n",
"
\n",
" \n",
" 1323 | \n",
" Furth i. W. | \n",
"
\n",
" \n",
"
\n",
"
1324 rows × 1 columns
\n",
"
"
],
"text/plain": [
" 0\n",
"0 NaN\n",
"1 Kierling\n",
"2 Kindberg\n",
"3 Kirchau\n",
"4 Wien\n",
"5 Kirchhain\n",
"6 München\n",
"7 Kitzbühel\n",
"8 Klagenfurt\n",
"9 Grein a/D.\n",
"10 Bozen\n",
"11 Znaim\n",
"12 Graz\n",
"13 Heidelberg\n",
"14 Komotau\n",
"15 Köln\n",
"16 Bodenbach a. d. Elbe\n",
"17 Meissen\n",
"18 Leipzig\n",
"19 Konstanz\n",
"20 Korneuburg\n",
"21 Brașov\n",
"22 Mürzzuschlag\n",
"23 Salzburg\n",
"24 Frankfurt a. M.\n",
"25 Arys\n",
"26 Tegernsee\n",
"27 Garmisch\n",
"28 Partenkirchen-Garmisch\n",
"29 Zürich\n",
"... ...\n",
"1294 Neugersdorf\n",
"1295 Friedau\n",
"1296 Freystadt\n",
"1297 Gießhübl\n",
"1298 Gams\n",
"1299 Schlossberg\n",
"1300 Frakfurt a. Oder\n",
"1301 Casale Monferrato\n",
"1302 Sternberg\n",
"1303 Stössing\n",
"1304 Mauterndorf\n",
"1305 Thörl\n",
"1306 Traismauer\n",
"1307 Trebnitz\n",
"1308 Voitsberg\n",
"1309 Daun\n",
"1310 Kilchberg-Züich\n",
"1311 Strassburg i. Els.\n",
"1312 Mühlhausen\n",
"1313 Eschwege\n",
"1314 Tabarz\n",
"1315 Weimar\n",
"1316 Coburg\n",
"1317 Friedrichsroda i. Th.\n",
"1318 Leipa i. B.\n",
"1319 Schumburg a. D.\n",
"1320 Pisa\n",
"1321 Straßburg i./E.\n",
"1322 Detmold\n",
"1323 Furth i. W.\n",
"\n",
"[1324 rows x 1 columns]"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pp"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Better. Now show me some randomly:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
"
\n",
" \n",
" \n",
" \n",
" 830 | \n",
" Friedberg | \n",
"
\n",
" \n",
" 993 | \n",
" Kreisbach | \n",
"
\n",
" \n",
" 743 | \n",
" Neustadt a. d. D. | \n",
"
\n",
" \n",
" 70 | \n",
" Halberstadt | \n",
"
\n",
" \n",
" 344 | \n",
" Kastelruth | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0\n",
"830 Friedberg\n",
"993 Kreisbach\n",
"743 Neustadt a. d. D.\n",
"70 Halberstadt\n",
"344 Kastelruth"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pp.sample(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sort Things"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just sort the sample, please:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
"
\n",
" \n",
" \n",
" \n",
" 953 | \n",
" Arnsdorf | \n",
"
\n",
" \n",
" 1293 | \n",
" Bad Neuhaus | \n",
"
\n",
" \n",
" 309 | \n",
" Bad Pyrmont | \n",
"
\n",
" \n",
" 545 | \n",
" Bellegarde | \n",
"
\n",
" \n",
" 1300 | \n",
" Frakfurt a. Oder | \n",
"
\n",
" \n",
" 301 | \n",
" Ischl | \n",
"
\n",
" \n",
" 3 | \n",
" Kirchau | \n",
"
\n",
" \n",
" 984 | \n",
" Mönichkirchen | \n",
"
\n",
" \n",
" 744 | \n",
" Ramsau | \n",
"
\n",
" \n",
" 391 | \n",
" Roustchouk | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0\n",
"953 Arnsdorf\n",
"1293 Bad Neuhaus\n",
"309 Bad Pyrmont\n",
"545 Bellegarde\n",
"1300 Frakfurt a. Oder\n",
"301 Ischl\n",
"3 Kirchau\n",
"984 Mönichkirchen\n",
"744 Ramsau\n",
"391 Roustchouk"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pp.sample(10).sort_values(0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Why the '0' in `sort_values(0)`? That's the name of the column to sort by."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sort the whole thing:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
"
\n",
" \n",
" \n",
" \n",
" 1053 | \n",
" Békéscsaba | \n",
"
\n",
" \n",
" 1102 | \n",
" Łuck | \n",
"
\n",
" \n",
" 359 | \n",
" # | \n",
"
\n",
" \n",
" 1268 | \n",
" A B. | \n",
"
\n",
" \n",
" 1044 | \n",
" A. | \n",
"
\n",
" \n",
" 1194 | \n",
" Aachen | \n",
"
\n",
" \n",
" 434 | \n",
" Abbazia | \n",
"
\n",
" \n",
" 1083 | \n",
" Abbazia-Lovrana | \n",
"
\n",
" \n",
" 291 | \n",
" Absam b. Innsbruck | \n",
"
\n",
" \n",
" 997 | \n",
" Abtenau | \n",
"
\n",
" \n",
" 637 | \n",
" Achensee | \n",
"
\n",
" \n",
" 1257 | \n",
" Adelsberg | \n",
"
\n",
" \n",
" 314 | \n",
" Admont | \n",
"
\n",
" \n",
" 326 | \n",
" Aeuckens | \n",
"
\n",
" \n",
" 287 | \n",
" Aflenz | \n",
"
\n",
" \n",
" 840 | \n",
" Aggsbach | \n",
"
\n",
" \n",
" 824 | \n",
" Aigen | \n",
"
\n",
" \n",
" 413 | \n",
" Albendorf | \n",
"
\n",
" \n",
" 50 | \n",
" Alexandria | \n",
"
\n",
" \n",
" 685 | \n",
" Alland II | \n",
"
\n",
" \n",
" 902 | \n",
" Allentsteig | \n",
"
\n",
" \n",
" 366 | \n",
" Alsfeld | \n",
"
\n",
" \n",
" 646 | \n",
" Alt Lengbach | \n",
"
\n",
" \n",
" 1151 | \n",
" Altaussee | \n",
"
\n",
" \n",
" 820 | \n",
" Altenberg | \n",
"
\n",
" \n",
" 417 | \n",
" Altenburg | \n",
"
\n",
" \n",
" 861 | \n",
" Altenmarkt | \n",
"
\n",
" \n",
" 1034 | \n",
" Altenmarkt a. d. Tr. | \n",
"
\n",
" \n",
" 869 | \n",
" Altenmarkt a. d. Ysper | \n",
"
\n",
" \n",
" 645 | \n",
" Altenmarkt a.d. Trst. | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 839 | \n",
" Wr. Neustadt | \n",
"
\n",
" \n",
" 885 | \n",
" Wr.-Neustadt | \n",
"
\n",
" \n",
" 46 | \n",
" Wunsiedel | \n",
"
\n",
" \n",
" 384 | \n",
" Wurtsboro, N. Y. | \n",
"
\n",
" \n",
" 1079 | \n",
" Wöllan | \n",
"
\n",
" \n",
" 1274 | \n",
" Wörgl | \n",
"
\n",
" \n",
" 370 | \n",
" Wörishofen | \n",
"
\n",
" \n",
" 860 | \n",
" Wörschach | \n",
"
\n",
" \n",
" 597 | \n",
" Ybbs | \n",
"
\n",
" \n",
" 581 | \n",
" Ypres | \n",
"
\n",
" \n",
" 697 | \n",
" Ypser | \n",
"
\n",
" \n",
" 1266 | \n",
" Ysper | \n",
"
\n",
" \n",
" 977 | \n",
" Zakopane | \n",
"
\n",
" \n",
" 1054 | \n",
" Zantan | \n",
"
\n",
" \n",
" 582 | \n",
" Zara | \n",
"
\n",
" \n",
" 756 | \n",
" Zbiroh | \n",
"
\n",
" \n",
" 880 | \n",
" Zell a. See | \n",
"
\n",
" \n",
" 40 | \n",
" Zell am See | \n",
"
\n",
" \n",
" 918 | \n",
" Zistersdorf | \n",
"
\n",
" \n",
" 33 | \n",
" Zittau | \n",
"
\n",
" \n",
" 11 | \n",
" Znaim | \n",
"
\n",
" \n",
" 1169 | \n",
" Zuckmantel | \n",
"
\n",
" \n",
" 496 | \n",
" Zurigo | \n",
"
\n",
" \n",
" 1170 | \n",
" Zwettl | \n",
"
\n",
" \n",
" 1184 | \n",
" Zwiesel | \n",
"
\n",
" \n",
" 29 | \n",
" Zürich | \n",
"
\n",
" \n",
" 673 | \n",
" spitz | \n",
"
\n",
" \n",
" 1103 | \n",
" Łuck | \n",
"
\n",
" \n",
" 766 | \n",
" Šibenik | \n",
"
\n",
" \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
1324 rows × 1 columns
\n",
"
"
],
"text/plain": [
" 0\n",
"1053 Békéscsaba \n",
"1102 Łuck\n",
"359 #\n",
"1268 A B.\n",
"1044 A.\n",
"1194 Aachen\n",
"434 Abbazia\n",
"1083 Abbazia-Lovrana\n",
"291 Absam b. Innsbruck\n",
"997 Abtenau\n",
"637 Achensee\n",
"1257 Adelsberg\n",
"314 Admont\n",
"326 Aeuckens\n",
"287 Aflenz\n",
"840 Aggsbach\n",
"824 Aigen\n",
"413 Albendorf\n",
"50 Alexandria\n",
"685 Alland II\n",
"902 Allentsteig\n",
"366 Alsfeld\n",
"646 Alt Lengbach\n",
"1151 Altaussee\n",
"820 Altenberg\n",
"417 Altenburg\n",
"861 Altenmarkt\n",
"1034 Altenmarkt a. d. Tr.\n",
"869 Altenmarkt a. d. Ysper\n",
"645 Altenmarkt a.d. Trst.\n",
"... ...\n",
"839 Wr. Neustadt\n",
"885 Wr.-Neustadt\n",
"46 Wunsiedel\n",
"384 Wurtsboro, N. Y.\n",
"1079 Wöllan\n",
"1274 Wörgl\n",
"370 Wörishofen\n",
"860 Wörschach\n",
"597 Ybbs\n",
"581 Ypres\n",
"697 Ypser\n",
"1266 Ysper\n",
"977 Zakopane\n",
"1054 Zantan\n",
"582 Zara\n",
"756 Zbiroh\n",
"880 Zell a. See\n",
"40 Zell am See\n",
"918 Zistersdorf\n",
"33 Zittau\n",
"11 Znaim\n",
"1169 Zuckmantel\n",
"496 Zurigo\n",
"1170 Zwettl\n",
"1184 Zwiesel\n",
"29 Zürich\n",
"673 spitz\n",
"1103 Łuck\n",
"766 Šibenik\n",
"0 NaN\n",
"\n",
"[1324 rows x 1 columns]"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pp.sort_values(0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It seems like there's something weird going on with 'Békéscsaba', it doesn't sort right. What is wrong?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's extract the datum:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Békéscsaba \n",
"Name: 1053, dtype: object"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pp.iloc[1053]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"More specifically the column '0':"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"' Békéscsaba '"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pp.iloc[1053][0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Seems there's a space in front of the 'B'. That's why it sorts wrong."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 User Default",
"language": "python",
"name": "python_3_user_default"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}