{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# AKON Metadata - Data Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Get a first impression of the postcard metadata*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the [Pandas Python Data Analysis Library](https://pandas.pydata.org/).\n", "\n", "For an intro to pandas feel free to take a look at this [Workshop for CBioVikings](https://github.com/dblyon/PandasIntro) by David Lyon." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`df` stands for *Data Frame*" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('akon_postcards_public_domain.csv.bz2')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## View Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Rough Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How much datasets are in there?" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "28882" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What does a dataset look like?\n", "Show me the first one!" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0akon_ididaltitudebuildingcitycolorcommentmountainother...datefeature_classfeature_codegeoname_idlatitudelongitudenamecountry_idadmin_name_1admin_code_1
00AK111_02474685NaNNaNKierlingTrue1908NaNNaN...gelaufen 1908PPPL2774449.048.3099716.27616KierlingATNaNNaN
\n", "

1 rows × 29 columns

\n", "
" ], "text/plain": [ " Unnamed: 0 akon_id id altitude building city color comment \\\n", "0 0 AK111_024 74685 NaN NaN Kierling True 1908 \n", "\n", " mountain other ... date feature_class feature_code \\\n", "0 NaN NaN ... gelaufen 1908 P PPL \n", "\n", " geoname_id latitude longitude name country_id admin_name_1 \\\n", "0 2774449.0 48.30997 16.27616 Kierling AT NaN \n", "\n", " admin_code_1 \n", "0 NaN \n", "\n", "[1 rows x 29 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There seem to be a few columns missing from the output. Let's fix that by setting pandas output options:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "pd.set_option('display.max_columns', 100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try again:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0akon_ididaltitudebuildingcitycolorcommentmountainotherphotographerpublisherpublisher_placeregionwater_bodyyearinventory_numbersignaturerevision_datedatefeature_classfeature_codegeoname_idlatitudelongitudenamecountry_idadmin_name_1admin_code_1
00AK111_02474685NaNNaNKierlingTrue1908NaNNaNNaNNaNNaNNaNNaNNaNNaNGeogr. Topogr. Bilder-Samml. 1944, 63802014-09-05 10:13:12.536gelaufen 1908PPPL2774449.048.3099716.27616KierlingATNaNNaN
\n", "
" ], "text/plain": [ " Unnamed: 0 akon_id id altitude building city color comment \\\n", "0 0 AK111_024 74685 NaN NaN Kierling True 1908 \n", "\n", " mountain other photographer publisher publisher_place region water_body \\\n", "0 NaN NaN NaN NaN NaN NaN NaN \n", "\n", " year inventory_number signature \\\n", "0 NaN NaN Geogr. Topogr. Bilder-Samml. 1944, 6380 \n", "\n", " revision_date date feature_class feature_code \\\n", "0 2014-09-05 10:13:12.536 gelaufen 1908 P PPL \n", "\n", " geoname_id latitude longitude name country_id admin_name_1 \\\n", "0 2774449.0 48.30997 16.27616 Kierling AT NaN \n", "\n", " admin_code_1 \n", "0 NaN " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we see all columns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What are all the columns called again?" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Unnamed: 0', 'akon_id', 'id', 'altitude', 'building', 'city', 'color',\n", " 'comment', 'mountain', 'other', 'photographer', 'publisher',\n", " 'publisher_place', 'region', 'water_body', 'year', 'inventory_number',\n", " 'signature', 'revision_date', 'date', 'feature_class', 'feature_code',\n", " 'geoname_id', 'latitude', 'longitude', 'name', 'country_id',\n", " 'admin_name_1', 'admin_code_1'],\n", " dtype='object')" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Show Random Entries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Show me 3 random entries:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0akon_ididaltitudebuildingcitycolorcommentmountainotherphotographerpublisherpublisher_placeregionwater_bodyyearinventory_numbersignaturerevision_datedatefeature_classfeature_codegeoname_idlatitudelongitudenamecountry_idadmin_name_1admin_code_1
458458AK114_18976614NaNNaNAiroloTrue1909 gelNaNNaNNaNNaNNaNNaNNaNNaNNaNGeogr. Topogr. Bilder-Samml. 1944, 272014-09-09 08:50:37.905gelaufen 1909PPPL2661830.046.528478.60881AiroloCHNaNNaN
2451024510AK075_46746922NaNNaNKlausenFalseNaNNaNNaNNaNNaNNaNNaNNaN1913.0NaNNaN2014-08-21 10:05:32.7791913PPPLA33178764.046.6400111.56573KlausenITSüdtirol17
1856418564AK082_54651980NaNNaNVelsenFalsev 1905NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN2014-08-25 17:46:23.417vor 1905PPPL2745673.052.460004.65000VelsenNLNord-Holland07
\n", "
" ], "text/plain": [ " Unnamed: 0 akon_id id altitude building city color \\\n", "458 458 AK114_189 76614 NaN NaN Airolo True \n", "24510 24510 AK075_467 46922 NaN NaN Klausen False \n", "18564 18564 AK082_546 51980 NaN NaN Velsen False \n", "\n", " comment mountain other photographer publisher publisher_place region \\\n", "458 1909 gel NaN NaN NaN NaN NaN NaN \n", "24510 NaN NaN NaN NaN NaN NaN NaN \n", "18564 v 1905 NaN NaN NaN NaN NaN NaN \n", "\n", " water_body year inventory_number \\\n", "458 NaN NaN NaN \n", "24510 NaN 1913.0 NaN \n", "18564 NaN NaN NaN \n", "\n", " signature revision_date \\\n", "458 Geogr. Topogr. Bilder-Samml. 1944, 27 2014-09-09 08:50:37.905 \n", "24510 NaN 2014-08-21 10:05:32.779 \n", "18564 NaN 2014-08-25 17:46:23.417 \n", "\n", " date feature_class feature_code geoname_id latitude \\\n", "458 gelaufen 1909 P PPL 2661830.0 46.52847 \n", "24510 1913 P PPLA3 3178764.0 46.64001 \n", "18564 vor 1905 P PPL 2745673.0 52.46000 \n", "\n", " longitude name country_id admin_name_1 admin_code_1 \n", "458 8.60881 Airolo CH NaN NaN \n", "24510 11.56573 Klausen IT Südtirol 17 \n", "18564 4.65000 Velsen NL Nord-Holland 07 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sample(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calling `sample` again yields different entries:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0akon_ididaltitudebuildingcitycolorcommentmountainotherphotographerpublisherpublisher_placeregionwater_bodyyearinventory_numbersignaturerevision_datedatefeature_classfeature_codegeoname_idlatitudelongitudenamecountry_idadmin_name_1admin_code_1
2765827658AK097_35162277NaNStation BrennerSteinach am BrennerFalse1905 gelNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN2014-09-02 11:15:45.988vor 1905PPPLA32764557.047.0833311.46667Steinach am BrennerATNaNNaN
1499314993AK021_51512604NaNNaNGross-Pöchlarn, Klein-PöchlarnFalseNaNNaNNaNNaNLedermannWienNaNNaN1917.0NaNNaN2014-08-04 07:59:10.1361917PPPLA32768627.048.2000015.20000PöchlarnATNaNNaN
1593415934AK025_59615102NaNPfarrkircheMondseeFalseNaNNaNNaNNaNNaNNaNNaNNaN1914.0NaNNaN2014-08-04 07:59:10.1871914PPPLA32771277.047.8564813.34908MondseeATNaNNaN
\n", "
" ], "text/plain": [ " Unnamed: 0 akon_id id altitude building \\\n", "27658 27658 AK097_351 62277 NaN Station Brenner \n", "14993 14993 AK021_515 12604 NaN NaN \n", "15934 15934 AK025_596 15102 NaN Pfarrkirche \n", "\n", " city color comment mountain other \\\n", "27658 Steinach am Brenner False 1905 gel NaN NaN \n", "14993 Gross-Pöchlarn, Klein-Pöchlarn False NaN NaN NaN \n", "15934 Mondsee False NaN NaN NaN \n", "\n", " photographer publisher publisher_place region water_body year \\\n", "27658 NaN NaN NaN NaN NaN NaN \n", "14993 NaN Ledermann Wien NaN NaN 1917.0 \n", "15934 NaN NaN NaN NaN NaN 1914.0 \n", "\n", " inventory_number signature revision_date date \\\n", "27658 NaN NaN 2014-09-02 11:15:45.988 vor 1905 \n", "14993 NaN NaN 2014-08-04 07:59:10.136 1917 \n", "15934 NaN NaN 2014-08-04 07:59:10.187 1914 \n", "\n", " feature_class feature_code geoname_id latitude longitude \\\n", "27658 P PPLA3 2764557.0 47.08333 11.46667 \n", "14993 P PPLA3 2768627.0 48.20000 15.20000 \n", "15934 P PPLA3 2771277.0 47.85648 13.34908 \n", "\n", " name country_id admin_name_1 admin_code_1 \n", "27658 Steinach am Brenner AT NaN NaN \n", "14993 Pöchlarn AT NaN NaN \n", "15934 Mondsee AT NaN NaN " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sample(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Count Things" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many entries show things in Italy?\n", "\n", "Let's use the `country_id` for this question:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "df_in_italy = df[df['country_id'] == 'IT']" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2983" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df_in_italy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many postcards are in color?" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "df_in_color = df[df['color'] == True]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7075" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df_in_color)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can I do this in one line?" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7075" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df[df['color'] == True])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many different publisher places are in the data set?" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1324" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df['publisher_place'].unique())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Show me some!" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "ename": "AttributeError", "evalue": "'numpy.ndarray' object has no attribute 'sample'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'publisher_place'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0munique\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msample\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m10\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mAttributeError\u001b[0m: 'numpy.ndarray' object has no attribute 'sample'" ] } ], "source": [ "df['publisher_place'].unique().sample(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oh, that doesn't work. Let's wrap it in a pandas DataFrame, step by step:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "publisher_places = df['publisher_place'].unique()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([nan, 'Kierling', 'Kindberg', ..., 'Straßburg i./E.', 'Detmold',\n", " 'Furth i. W.'], dtype=object)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "publisher_places" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "pp = pd.DataFrame(publisher_places)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
0NaN
1Kierling
2Kindberg
3Kirchau
4Wien
5Kirchhain
6München
7Kitzbühel
8Klagenfurt
9Grein a/D.
10Bozen
11Znaim
12Graz
13Heidelberg
14Komotau
15Köln
16Bodenbach a. d. Elbe
17Meissen
18Leipzig
19Konstanz
20Korneuburg
21Brașov
22Mürzzuschlag
23Salzburg
24Frankfurt a. M.
25Arys
26Tegernsee
27Garmisch
28Partenkirchen-Garmisch
29Zürich
......
1294Neugersdorf
1295Friedau
1296Freystadt
1297Gießhübl
1298Gams
1299Schlossberg
1300Frakfurt a. Oder
1301Casale Monferrato
1302Sternberg
1303Stössing
1304Mauterndorf
1305Thörl
1306Traismauer
1307Trebnitz
1308Voitsberg
1309Daun
1310Kilchberg-Züich
1311Strassburg i. Els.
1312Mühlhausen
1313Eschwege
1314Tabarz
1315Weimar
1316Coburg
1317Friedrichsroda i. Th.
1318Leipa i. B.
1319Schumburg a. D.
1320Pisa
1321Straßburg i./E.
1322Detmold
1323Furth i. W.
\n", "

1324 rows × 1 columns

\n", "
" ], "text/plain": [ " 0\n", "0 NaN\n", "1 Kierling\n", "2 Kindberg\n", "3 Kirchau\n", "4 Wien\n", "5 Kirchhain\n", "6 München\n", "7 Kitzbühel\n", "8 Klagenfurt\n", "9 Grein a/D.\n", "10 Bozen\n", "11 Znaim\n", "12 Graz\n", "13 Heidelberg\n", "14 Komotau\n", "15 Köln\n", "16 Bodenbach a. d. Elbe\n", "17 Meissen\n", "18 Leipzig\n", "19 Konstanz\n", "20 Korneuburg\n", "21 Brașov\n", "22 Mürzzuschlag\n", "23 Salzburg\n", "24 Frankfurt a. M.\n", "25 Arys\n", "26 Tegernsee\n", "27 Garmisch\n", "28 Partenkirchen-Garmisch\n", "29 Zürich\n", "... ...\n", "1294 Neugersdorf\n", "1295 Friedau\n", "1296 Freystadt\n", "1297 Gießhübl\n", "1298 Gams\n", "1299 Schlossberg\n", "1300 Frakfurt a. Oder\n", "1301 Casale Monferrato\n", "1302 Sternberg\n", "1303 Stössing\n", "1304 Mauterndorf\n", "1305 Thörl\n", "1306 Traismauer\n", "1307 Trebnitz\n", "1308 Voitsberg\n", "1309 Daun\n", "1310 Kilchberg-Züich\n", "1311 Strassburg i. Els.\n", "1312 Mühlhausen\n", "1313 Eschwege\n", "1314 Tabarz\n", "1315 Weimar\n", "1316 Coburg\n", "1317 Friedrichsroda i. Th.\n", "1318 Leipa i. B.\n", "1319 Schumburg a. D.\n", "1320 Pisa\n", "1321 Straßburg i./E.\n", "1322 Detmold\n", "1323 Furth i. W.\n", "\n", "[1324 rows x 1 columns]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Better. Now show me some randomly:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
830Friedberg
993Kreisbach
743Neustadt a. d. D.
70Halberstadt
344Kastelruth
\n", "
" ], "text/plain": [ " 0\n", "830 Friedberg\n", "993 Kreisbach\n", "743 Neustadt a. d. D.\n", "70 Halberstadt\n", "344 Kastelruth" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pp.sample(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sort Things" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just sort the sample, please:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
953Arnsdorf
1293Bad Neuhaus
309Bad Pyrmont
545Bellegarde
1300Frakfurt a. Oder
301Ischl
3Kirchau
984Mönichkirchen
744Ramsau
391Roustchouk
\n", "
" ], "text/plain": [ " 0\n", "953 Arnsdorf\n", "1293 Bad Neuhaus\n", "309 Bad Pyrmont\n", "545 Bellegarde\n", "1300 Frakfurt a. Oder\n", "301 Ischl\n", "3 Kirchau\n", "984 Mönichkirchen\n", "744 Ramsau\n", "391 Roustchouk" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pp.sample(10).sort_values(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Why the '0' in `sort_values(0)`? That's the name of the column to sort by." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sort the whole thing:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
1053Békéscsaba
1102Łuck
359#
1268A B.
1044A.
1194Aachen
434Abbazia
1083Abbazia-Lovrana
291Absam b. Innsbruck
997Abtenau
637Achensee
1257Adelsberg
314Admont
326Aeuckens
287Aflenz
840Aggsbach
824Aigen
413Albendorf
50Alexandria
685Alland II
902Allentsteig
366Alsfeld
646Alt Lengbach
1151Altaussee
820Altenberg
417Altenburg
861Altenmarkt
1034Altenmarkt a. d. Tr.
869Altenmarkt a. d. Ysper
645Altenmarkt a.d. Trst.
......
839Wr. Neustadt
885Wr.-Neustadt
46Wunsiedel
384Wurtsboro, N. Y.
1079Wöllan
1274Wörgl
370Wörishofen
860Wörschach
597Ybbs
581Ypres
697Ypser
1266Ysper
977Zakopane
1054Zantan
582Zara
756Zbiroh
880Zell a. See
40Zell am See
918Zistersdorf
33Zittau
11Znaim
1169Zuckmantel
496Zurigo
1170Zwettl
1184Zwiesel
29Zürich
673spitz
1103Łuck
766Šibenik
0NaN
\n", "

1324 rows × 1 columns

\n", "
" ], "text/plain": [ " 0\n", "1053 Békéscsaba \n", "1102 Łuck\n", "359 #\n", "1268 A B.\n", "1044 A.\n", "1194 Aachen\n", "434 Abbazia\n", "1083 Abbazia-Lovrana\n", "291 Absam b. Innsbruck\n", "997 Abtenau\n", "637 Achensee\n", "1257 Adelsberg\n", "314 Admont\n", "326 Aeuckens\n", "287 Aflenz\n", "840 Aggsbach\n", "824 Aigen\n", "413 Albendorf\n", "50 Alexandria\n", "685 Alland II\n", "902 Allentsteig\n", "366 Alsfeld\n", "646 Alt Lengbach\n", "1151 Altaussee\n", "820 Altenberg\n", "417 Altenburg\n", "861 Altenmarkt\n", "1034 Altenmarkt a. d. Tr.\n", "869 Altenmarkt a. d. Ysper\n", "645 Altenmarkt a.d. Trst.\n", "... ...\n", "839 Wr. Neustadt\n", "885 Wr.-Neustadt\n", "46 Wunsiedel\n", "384 Wurtsboro, N. Y.\n", "1079 Wöllan\n", "1274 Wörgl\n", "370 Wörishofen\n", "860 Wörschach\n", "597 Ybbs\n", "581 Ypres\n", "697 Ypser\n", "1266 Ysper\n", "977 Zakopane\n", "1054 Zantan\n", "582 Zara\n", "756 Zbiroh\n", "880 Zell a. See\n", "40 Zell am See\n", "918 Zistersdorf\n", "33 Zittau\n", "11 Znaim\n", "1169 Zuckmantel\n", "496 Zurigo\n", "1170 Zwettl\n", "1184 Zwiesel\n", "29 Zürich\n", "673 spitz\n", "1103 Łuck\n", "766 Šibenik\n", "0 NaN\n", "\n", "[1324 rows x 1 columns]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pp.sort_values(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It seems like there's something weird going on with 'Békéscsaba', it doesn't sort right. What is wrong?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's extract the datum:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Békéscsaba \n", "Name: 1053, dtype: object" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pp.iloc[1053]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More specifically the column '0':" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "' Békéscsaba '" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pp.iloc[1053][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Seems there's a space in front of the 'B'. That's why it sorts wrong." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 User Default", "language": "python", "name": "python_3_user_default" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }