{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# AKON Metadata - Data Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Get a first impression of the postcard metadata*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook uses the [Pandas Python Data Analysis Library](https://pandas.pydata.org/).\n", "\n", "For an intro to pandas, feel free to take a look at this [Workshop for CBioVikings](https://github.com/dblyon/PandasIntro) by David Lyon." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`df` stands for *DataFrame*, the pandas table type." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/kst/tmp/dingsdi/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3049: DtypeWarning: Columns (13) have mixed types. Specify dtype option on import or set low_memory=False.\n", " interactivity=interactivity, compiler=compiler, result=result)\n" ] } ], "source": [ "df = pd.read_csv('https://labs.onb.ac.at/gitlab/labs-team/raw-metadata/raw/master/akon_postcards_public_domain.csv.bz2', compression='bz2')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## View Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Rough Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many records are in there?" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "34846" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What does a record look like?\n", "Show me the first one!" 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>Unnamed: 0</th><th>akon_id</th><th>id</th><th>altitude</th><th>building</th><th>city</th><th>color</th><th>comment</th><th>mountain</th><th>other</th><th>...</th><th>geoname_id</th><th>latitude</th><th>longitude</th><th>name</th><th>country_id</th><th>admin_name_1</th><th>admin_code_1</th><th>geo</th><th>download_link</th><th>download_link_256x256</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>0</td><td>AK111_021</td><td>74682</td><td>NaN</td><td>NaN</td><td>Kiel, Blücherplatz</td><td>False</td><td>1921 gel</td><td>NaN</td><td>NaN</td><td>...</td><td>2891122.0</td><td>54.32133</td><td>10.13489</td><td>Kiel</td><td>DE</td><td>NaN</td><td>NaN</td><td>54.32133, 10.13489</td><td>https://iiif.onb.ac.at/images/AKON/AK111_021/0...</td><td>https://iiif.onb.ac.at/images/AKON/AK111_021/0...</td></tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1 rows × 32 columns</p>\n",
"</div>
" ], "text/plain": [ " Unnamed: 0 akon_id id altitude building city color \\\n", "0 0 AK111_021 74682 NaN NaN Kiel, Blücherplatz False \n", "\n", " comment mountain other ... geoname_id latitude longitude name \\\n", "0 1921 gel NaN NaN ... 2891122.0 54.32133 10.13489 Kiel \n", "\n", " country_id admin_name_1 admin_code_1 geo \\\n", "0 DE NaN NaN 54.32133, 10.13489 \n", "\n", " download_link \\\n", "0 https://iiif.onb.ac.at/images/AKON/AK111_021/0... \n", "\n", " download_link_256x256 \n", "0 https://iiif.onb.ac.at/images/AKON/AK111_021/0... \n", "\n", "[1 rows x 32 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There seem to be a few columns missing from the output. Let's fix that by setting pandas output options:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "pd.set_option('display.max_columns', 100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try again:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>Unnamed: 0</th><th>akon_id</th><th>id</th><th>altitude</th><th>building</th><th>city</th><th>color</th><th>comment</th><th>mountain</th><th>other</th><th>photographer</th><th>publisher</th><th>publisher_place</th><th>region</th><th>water_body</th><th>year</th><th>inventory_number</th><th>signature</th><th>revision_date</th><th>date</th><th>feature_class</th><th>feature_code</th><th>geoname_id</th><th>latitude</th><th>longitude</th><th>name</th><th>country_id</th><th>admin_name_1</th><th>admin_code_1</th><th>geo</th><th>download_link</th><th>download_link_256x256</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>0</td><td>AK111_021</td><td>74682</td><td>NaN</td><td>NaN</td><td>Kiel, Blücherplatz</td><td>False</td><td>1921 gel</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>Geogr. Topogr. Bilder-Samml. 1943, 7735</td><td>2014-09-05 10:13:06.342</td><td>gelaufen 1921</td><td>P</td><td>PPLA</td><td>2891122.0</td><td>54.32133</td><td>10.13489</td><td>Kiel</td><td>DE</td><td>NaN</td><td>NaN</td><td>54.32133, 10.13489</td><td>https://iiif.onb.ac.at/images/AKON/AK111_021/0...</td><td>https://iiif.onb.ac.at/images/AKON/AK111_021/0...</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Unnamed: 0 akon_id id altitude building city color \\\n", "0 0 AK111_021 74682 NaN NaN Kiel, Blücherplatz False \n", "\n", " comment mountain other photographer publisher publisher_place region \\\n", "0 1921 gel NaN NaN NaN NaN NaN NaN \n", "\n", " water_body year inventory_number signature \\\n", "0 NaN NaN NaN Geogr. Topogr. Bilder-Samml. 1943, 7735 \n", "\n", " revision_date date feature_class feature_code \\\n", "0 2014-09-05 10:13:06.342 gelaufen 1921 P PPLA \n", "\n", " geoname_id latitude longitude name country_id admin_name_1 admin_code_1 \\\n", "0 2891122.0 54.32133 10.13489 Kiel DE NaN NaN \n", "\n", " geo download_link \\\n", "0 54.32133, 10.13489 https://iiif.onb.ac.at/images/AKON/AK111_021/0... \n", "\n", " download_link_256x256 \n", "0 https://iiif.onb.ac.at/images/AKON/AK111_021/0... " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we see all columns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What are all the columns called again?" 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Unnamed: 0', 'akon_id', 'id', 'altitude', 'building', 'city', 'color',\n", " 'comment', 'mountain', 'other', 'photographer', 'publisher',\n", " 'publisher_place', 'region', 'water_body', 'year', 'inventory_number',\n", " 'signature', 'revision_date', 'date', 'feature_class', 'feature_code',\n", " 'geoname_id', 'latitude', 'longitude', 'name', 'country_id',\n", " 'admin_name_1', 'admin_code_1', 'geo', 'download_link',\n", " 'download_link_256x256'],\n", " dtype='object')" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Show Random Entries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Show me 3 random entries:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>Unnamed: 0</th><th>akon_id</th><th>id</th><th>altitude</th><th>building</th><th>city</th><th>color</th><th>comment</th><th>mountain</th><th>other</th><th>photographer</th><th>publisher</th><th>publisher_place</th><th>region</th><th>water_body</th><th>year</th><th>inventory_number</th><th>signature</th><th>revision_date</th><th>date</th><th>feature_class</th><th>feature_code</th><th>geoname_id</th><th>latitude</th><th>longitude</th><th>name</th><th>country_id</th><th>admin_name_1</th><th>admin_code_1</th><th>geo</th><th>download_link</th><th>download_link_256x256</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>28908</th><td>28908</td><td>AK066_086</td><td>40120</td><td>NaN</td><td>NaN</td><td>Innsbruck, Maria Theresienstrasse</td><td>False</td><td>1907 gel</td><td>NaN</td><td>NaN</td><td>NaN</td><td>Gratl</td><td>Innsbruck</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>2014-08-04 07:59:10.424</td><td>vor 1907</td><td>P</td><td>PPLA</td><td>2775220.0</td><td>47.26266</td><td>11.39454</td><td>Innsbruck</td><td>AT</td><td>NaN</td><td>NaN</td><td>47.26266, 11.39454</td><td>https://iiif.onb.ac.at/images/AKON/AK066_086/0...</td><td>https://iiif.onb.ac.at/images/AKON/AK066_086/0...</td></tr>\n",
" <tr><th>21317</th><td>21317</td><td>AK034_386</td><td>20303</td><td>251.0</td><td>NaN</td><td>Gars-Thunau am Kamp</td><td>True</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>Ledermann</td><td>Wien</td><td>NaN</td><td>NaN</td><td>1909.0</td><td>NaN</td><td>NaN</td><td>2014-08-04 07:59:10.272</td><td>1909</td><td>P</td><td>PPL</td><td>2763660.0</td><td>48.58333</td><td>15.65000</td><td>Thunau am Kamp</td><td>AT</td><td>NaN</td><td>NaN</td><td>48.58333, 15.65</td><td>https://iiif.onb.ac.at/images/AKON/AK034_386/3...</td><td>https://iiif.onb.ac.at/images/AKON/AK034_386/3...</td></tr>\n",
" <tr><th>23201</th><td>23201</td><td>AK041_572</td><td>24699</td><td>251.0</td><td>NaN</td><td>Gars-Thunau am Kamp</td><td>False</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>Ledermann</td><td>Wien</td><td>NaN</td><td>NaN</td><td>1908.0</td><td>NaN</td><td>NaN</td><td>2014-08-04 07:59:10.328</td><td>1908</td><td>P</td><td>PPL</td><td>2763660.0</td><td>48.58333</td><td>15.65000</td><td>Thunau am Kamp</td><td>AT</td><td>NaN</td><td>NaN</td><td>48.58333, 15.65</td><td>https://iiif.onb.ac.at/images/AKON/AK041_572/5...</td><td>https://iiif.onb.ac.at/images/AKON/AK041_572/5...</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Unnamed: 0 akon_id id altitude building \\\n", "28908 28908 AK066_086 40120 NaN NaN \n", "21317 21317 AK034_386 20303 251.0 NaN \n", "23201 23201 AK041_572 24699 251.0 NaN \n", "\n", " city color comment mountain other \\\n", "28908 Innsbruck, Maria Theresienstrasse False 1907 gel NaN NaN \n", "21317 Gars-Thunau am Kamp True NaN NaN NaN \n", "23201 Gars-Thunau am Kamp False NaN NaN NaN \n", "\n", " photographer publisher publisher_place region water_body year \\\n", "28908 NaN Gratl Innsbruck NaN NaN NaN \n", "21317 NaN Ledermann Wien NaN NaN 1909.0 \n", "23201 NaN Ledermann Wien NaN NaN 1908.0 \n", "\n", " inventory_number signature revision_date date \\\n", "28908 NaN NaN 2014-08-04 07:59:10.424 vor 1907 \n", "21317 NaN NaN 2014-08-04 07:59:10.272 1909 \n", "23201 NaN NaN 2014-08-04 07:59:10.328 1908 \n", "\n", " feature_class feature_code geoname_id latitude longitude \\\n", "28908 P PPLA 2775220.0 47.26266 11.39454 \n", "21317 P PPL 2763660.0 48.58333 15.65000 \n", "23201 P PPL 2763660.0 48.58333 15.65000 \n", "\n", " name country_id admin_name_1 admin_code_1 \\\n", "28908 Innsbruck AT NaN NaN \n", "21317 Thunau am Kamp AT NaN NaN \n", "23201 Thunau am Kamp AT NaN NaN \n", "\n", " geo download_link \\\n", "28908 47.26266, 11.39454 https://iiif.onb.ac.at/images/AKON/AK066_086/0... \n", "21317 48.58333, 15.65 https://iiif.onb.ac.at/images/AKON/AK034_386/3... \n", "23201 48.58333, 15.65 https://iiif.onb.ac.at/images/AKON/AK041_572/5... \n", "\n", " download_link_256x256 \n", "28908 https://iiif.onb.ac.at/images/AKON/AK066_086/0... \n", "21317 https://iiif.onb.ac.at/images/AKON/AK034_386/3... \n", "23201 https://iiif.onb.ac.at/images/AKON/AK041_572/5... 
" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sample(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calling `sample` again yields different entries:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>Unnamed: 0</th><th>akon_id</th><th>id</th><th>altitude</th><th>building</th><th>city</th><th>color</th><th>comment</th><th>mountain</th><th>other</th><th>photographer</th><th>publisher</th><th>publisher_place</th><th>region</th><th>water_body</th><th>year</th><th>inventory_number</th><th>signature</th><th>revision_date</th><th>date</th><th>feature_class</th><th>feature_code</th><th>geoname_id</th><th>latitude</th><th>longitude</th><th>name</th><th>country_id</th><th>admin_name_1</th><th>admin_code_1</th><th>geo</th><th>download_link</th><th>download_link_256x256</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>18810</th><td>18810</td><td>AK025_111</td><td>14618</td><td>NaN</td><td>NaN</td><td>Bruck an der Mur</td><td>True</td><td>NaN</td><td>Mugel</td><td>NaN</td><td>NaN</td><td>Ledermann</td><td>Wien</td><td>NaN</td><td>NaN</td><td>1916.0</td><td>NaN</td><td>NaN</td><td>2014-10-15 12:03:01.028</td><td>1916</td><td>P</td><td>PPLA3</td><td>2781371.0</td><td>47.41667</td><td>15.28333</td><td>Bruck an der Mur</td><td>AT</td><td>NaN</td><td>NaN</td><td>47.41667, 15.28333</td><td>https://iiif.onb.ac.at/images/AKON/AK025_111/1...</td><td>https://iiif.onb.ac.at/images/AKON/AK025_111/1...</td></tr>\n",
" <tr><th>28146</th><td>28146</td><td>AK061_165</td><td>36541</td><td>NaN</td><td>NaN</td><td>Orosháza</td><td>True</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>Vágner</td><td>Orosháza</td><td>NaN</td><td>NaN</td><td>1917.0</td><td>NaN</td><td>Kartensammlung 79/66 G</td><td>2015-08-25 15:28:56.547</td><td>1917</td><td>P</td><td>PPL</td><td>716736.0</td><td>46.56667</td><td>20.66667</td><td>Oroshaza</td><td>HU</td><td>Bekes County</td><td>03</td><td>46.56667, 20.66667</td><td>https://iiif.onb.ac.at/images/AKON/AK061_165/1...</td><td>https://iiif.onb.ac.at/images/AKON/AK061_165/1...</td></tr>\n",
" <tr><th>8335</th><td>8335</td><td>AK088_563</td><td>56103</td><td>NaN</td><td>NaN</td><td>Bad Reichenhall</td><td>False</td><td>1907 gel</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>Geogrphisch-topographische Bildersammlung 1076/43</td><td>2014-08-28 16:20:02.029</td><td>vor 1907</td><td>P</td><td>PPLA3</td><td>2953371.0</td><td>47.72947</td><td>12.87819</td><td>Bad Reichenhall</td><td>DE</td><td>NaN</td><td>NaN</td><td>47.72947, 12.87819</td><td>https://iiif.onb.ac.at/images/AKON/AK088_563/5...</td><td>https://iiif.onb.ac.at/images/AKON/AK088_563/5...</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Unnamed: 0 akon_id id altitude building city \\\n", "18810 18810 AK025_111 14618 NaN NaN Bruck an der Mur \n", "28146 28146 AK061_165 36541 NaN NaN Orosháza \n", "8335 8335 AK088_563 56103 NaN NaN Bad Reichenhall \n", "\n", " color comment mountain other photographer publisher publisher_place \\\n", "18810 True NaN Mugel NaN NaN Ledermann Wien \n", "28146 True NaN NaN NaN NaN Vágner Orosháza \n", "8335 False 1907 gel NaN NaN NaN NaN NaN \n", "\n", " region water_body year inventory_number \\\n", "18810 NaN NaN 1916.0 NaN \n", "28146 NaN NaN 1917.0 NaN \n", "8335 NaN NaN NaN NaN \n", "\n", " signature \\\n", "18810 NaN \n", "28146 Kartensammlung 79/66 G \n", "8335 Geogrphisch-topographische Bildersammlung 1076/43 \n", "\n", " revision_date date feature_class feature_code \\\n", "18810 2014-10-15 12:03:01.028 1916 P PPLA3 \n", "28146 2015-08-25 15:28:56.547 1917 P PPL \n", "8335 2014-08-28 16:20:02.029 vor 1907 P PPLA3 \n", "\n", " geoname_id latitude longitude name country_id \\\n", "18810 2781371.0 47.41667 15.28333 Bruck an der Mur AT \n", "28146 716736.0 46.56667 20.66667 Oroshaza HU \n", "8335 2953371.0 47.72947 12.87819 Bad Reichenhall DE \n", "\n", " admin_name_1 admin_code_1 geo \\\n", "18810 NaN NaN 47.41667, 15.28333 \n", "28146 Bekes County 03 46.56667, 20.66667 \n", "8335 NaN NaN 47.72947, 12.87819 \n", "\n", " download_link \\\n", "18810 https://iiif.onb.ac.at/images/AKON/AK025_111/1... \n", "28146 https://iiif.onb.ac.at/images/AKON/AK061_165/1... \n", "8335 https://iiif.onb.ac.at/images/AKON/AK088_563/5... \n", "\n", " download_link_256x256 \n", "18810 https://iiif.onb.ac.at/images/AKON/AK025_111/1... \n", "28146 https://iiif.onb.ac.at/images/AKON/AK061_165/1... \n", "8335 https://iiif.onb.ac.at/images/AKON/AK088_563/5... 
" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sample(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Count Things" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many entries show things in Italy?\n", "\n", "Let's use the `country_id` for this question:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "df_in_italy = df[df['country_id'] == 'IT']" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3221" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df_in_italy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many postcards are in color?" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "df_in_color = df[df['color'] == True]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7667" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df_in_color)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can I do this in one line?" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7667" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df[df['color'] == True])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many different publisher places are in the data set?" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1545" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df['publisher_place'].unique())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Show me some!" 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "ename": "AttributeError", "evalue": "'numpy.ndarray' object has no attribute 'sample'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'publisher_place'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0munique\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msample\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m10\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mAttributeError\u001b[0m: 'numpy.ndarray' object has no attribute 'sample'" ] } ], "source": [ "df['publisher_place'].unique().sample(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oh, that doesn't work: `unique()` returns a NumPy array, and arrays have no `sample` method. Let's wrap it in a pandas DataFrame, step by step:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "publisher_places = df['publisher_place'].unique()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([nan, 'Wien', 'Kierling', ..., 'Königstein i. T.', 'Detmold',\n", " 'Furth i. W.'], dtype=object)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "publisher_places" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "pp = pd.DataFrame(publisher_places)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>0</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>NaN</td></tr>\n",
" <tr><th>1</th><td>Wien</td></tr>\n",
" <tr><th>2</th><td>Kierling</td></tr>\n",
" <tr><th>3</th><td>Kindberg</td></tr>\n",
" <tr><th>4</th><td>Kirchau</td></tr>\n",
" <tr><th>5</th><td>Kirchhain</td></tr>\n",
" <tr><th>6</th><td>München</td></tr>\n",
" <tr><th>7</th><td>Kitzbühel</td></tr>\n",
" <tr><th>8</th><td>Innsbruck</td></tr>\n",
" <tr><th>9</th><td>Klagenfurt</td></tr>\n",
" <tr><th>10</th><td>Grein a/D.</td></tr>\n",
" <tr><th>11</th><td>Bozen</td></tr>\n",
" <tr><th>12</th><td>Znaim</td></tr>\n",
" <tr><th>13</th><td>Graz</td></tr>\n",
" <tr><th>14</th><td>Heidelberg</td></tr>\n",
" <tr><th>15</th><td>Komotau</td></tr>\n",
" <tr><th>16</th><td>Gr. Siegharts</td></tr>\n",
" <tr><th>17</th><td>Köln</td></tr>\n",
" <tr><th>18</th><td>Bodenbach a. d. Elbe</td></tr>\n",
" <tr><th>19</th><td>Meissen</td></tr>\n",
" <tr><th>20</th><td>Leipzig</td></tr>\n",
" <tr><th>21</th><td>Konstanz</td></tr>\n",
" <tr><th>22</th><td>Korneuburg</td></tr>\n",
" <tr><th>23</th><td>Brașov</td></tr>\n",
" <tr><th>24</th><td>Mürzzuschlag</td></tr>\n",
" <tr><th>25</th><td>Salzburg</td></tr>\n",
" <tr><th>26</th><td>Frankfurt a. M.</td></tr>\n",
" <tr><th>27</th><td>Arosa</td></tr>\n",
" <tr><th>28</th><td>Kilchberg</td></tr>\n",
" <tr><th>29</th><td>Arys</td></tr>\n",
" <tr><th>...</th><td>...</td></tr>\n",
" <tr><th>1515</th><td>Friedau</td></tr>\n",
" <tr><th>1516</th><td>Wildalpe</td></tr>\n",
" <tr><th>1517</th><td>Gießhübl</td></tr>\n",
" <tr><th>1518</th><td>Schlossberg</td></tr>\n",
" <tr><th>1519</th><td>Frakfurt a. Oder</td></tr>\n",
" <tr><th>1520</th><td>Casale Monferrato</td></tr>\n",
" <tr><th>1521</th><td>gr</td></tr>\n",
" <tr><th>1522</th><td>Steinhaus a. Semmering</td></tr>\n",
" <tr><th>1523</th><td>Sternberg</td></tr>\n",
" <tr><th>1524</th><td>Stronsdorf</td></tr>\n",
" <tr><th>1525</th><td>Thörl</td></tr>\n",
" <tr><th>1526</th><td>Coburg</td></tr>\n",
" <tr><th>1527</th><td>Traismauer</td></tr>\n",
" <tr><th>1528</th><td>Trebnitz</td></tr>\n",
" <tr><th>1529</th><td>Unterlamm</td></tr>\n",
" <tr><th>1530</th><td>Daun</td></tr>\n",
" <tr><th>1531</th><td>Kilchberg-Züich</td></tr>\n",
" <tr><th>1532</th><td>Mühlhausen</td></tr>\n",
" <tr><th>1533</th><td>Eschwege</td></tr>\n",
" <tr><th>1534</th><td>Tabarz</td></tr>\n",
" <tr><th>1535</th><td>Suhl</td></tr>\n",
" <tr><th>1536</th><td>Weimar</td></tr>\n",
" <tr><th>1537</th><td>Friedrichsroda i. Th.</td></tr>\n",
" <tr><th>1538</th><td>Leipa i. B.</td></tr>\n",
" <tr><th>1539</th><td>Schumburg a. D.</td></tr>\n",
" <tr><th>1540</th><td>Pisa</td></tr>\n",
" <tr><th>1541</th><td>Straßburg i./E.</td></tr>\n",
" <tr><th>1542</th><td>Königstein i. T.</td></tr>\n",
" <tr><th>1543</th><td>Detmold</td></tr>\n",
" <tr><th>1544</th><td>Furth i. W.</td></tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1545 rows × 1 columns</p>\n",
"</div>
" ], "text/plain": [ " 0\n", "0 NaN\n", "1 Wien\n", "2 Kierling\n", "3 Kindberg\n", "4 Kirchau\n", "5 Kirchhain\n", "6 München\n", "7 Kitzbühel\n", "8 Innsbruck\n", "9 Klagenfurt\n", "10 Grein a/D.\n", "11 Bozen\n", "12 Znaim\n", "13 Graz\n", "14 Heidelberg\n", "15 Komotau\n", "16 Gr. Siegharts\n", "17 Köln\n", "18 Bodenbach a. d. Elbe\n", "19 Meissen\n", "20 Leipzig\n", "21 Konstanz\n", "22 Korneuburg\n", "23 Brașov\n", "24 Mürzzuschlag\n", "25 Salzburg\n", "26 Frankfurt a. M.\n", "27 Arosa\n", "28 Kilchberg\n", "29 Arys\n", "... ...\n", "1515 Friedau\n", "1516 Wildalpe\n", "1517 Gießhübl\n", "1518 Schlossberg\n", "1519 Frakfurt a. Oder\n", "1520 Casale Monferrato\n", "1521 gr\n", "1522 Steinhaus a. Semmering\n", "1523 Sternberg\n", "1524 Stronsdorf\n", "1525 Thörl\n", "1526 Coburg\n", "1527 Traismauer\n", "1528 Trebnitz\n", "1529 Unterlamm\n", "1530 Daun\n", "1531 Kilchberg-Züich\n", "1532 Mühlhausen\n", "1533 Eschwege\n", "1534 Tabarz\n", "1535 Suhl\n", "1536 Weimar\n", "1537 Friedrichsroda i. Th.\n", "1538 Leipa i. B.\n", "1539 Schumburg a. D.\n", "1540 Pisa\n", "1541 Straßburg i./E.\n", "1542 Königstein i. T.\n", "1543 Detmold\n", "1544 Furth i. W.\n", "\n", "[1545 rows x 1 columns]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Better. Now show me some randomly:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
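<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>0</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>1007</th><td>Wörschach</td></tr>\n",
" <tr><th>1494</th><td>Raibl</td></tr>\n",
" <tr><th>339</th><td>Imst</td></tr>\n",
" <tr><th>879</th><td>Zbiroh</td></tr>\n",
" <tr><th>457</th><td>Bad Sachsa</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>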
" ], "text/plain": [ " 0\n", "1007 Wörschach\n", "1494 Raibl\n", "339 Imst\n", "879 Zbiroh\n", "457 Bad Sachsa" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pp.sample(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sort Things" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just sort the sample, please:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>0</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>599</th><td>Aue</td></tr>\n",
" <tr><th>938</th><td>Chocěn</td></tr>\n",
" <tr><th>314</th><td>Ernstbrunn</td></tr>\n",
" <tr><th>739</th><td>Hall Tirol</td></tr>\n",
" <tr><th>788</th><td>Hardegg</td></tr>\n",
" <tr><th>3</th><td>Kindberg</td></tr>\n",
" <tr><th>19</th><td>Meissen</td></tr>\n",
" <tr><th>725</th><td>Neuchatel</td></tr>\n",
" <tr><th>1211</th><td>Sommerein</td></tr>\n",
" <tr><th>1302</th><td>Vorkloster bei Bregenz</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " 0\n", "599 Aue\n", "938 Chocěn\n", "314 Ernstbrunn\n", "739 Hall Tirol\n", "788 Hardegg\n", "3 Kindberg\n", "19 Meissen\n", "725 Neuchatel\n", "1211 Sommerein\n", "1302 Vorkloster bei Bregenz" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pp.sample(10).sort_values(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Why the `0` in `sort_values(0)`? It is the label of the column to sort by: `pp` was built from an unnamed array, so pandas labelled its only column with the integer `0`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sort the whole thing:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>0</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>1248</th><td> Békéscsaba </td></tr>\n",
" <tr><th>1303</th><td>Łuck</td></tr>\n",
" <tr><th>389</th><td>#</td></tr>\n",
" <tr><th>1489</th><td>A B.</td></tr>\n",
" <tr><th>1239</th><td>A.</td></tr>\n",
" <tr><th>1397</th><td>Aachen</td></tr>\n",
" <tr><th>487</th><td>Abbazia</td></tr>\n",
" <tr><th>1280</th><td>Abbazia-Lovrana</td></tr>\n",
" <tr><th>313</th><td>Absam b. Innsbruck</td></tr>\n",
" <tr><th>1181</th><td>Abtenau</td></tr>\n",
" <tr><th>722</th><td>Achensee</td></tr>\n",
" <tr><th>1479</th><td>Adelsberg</td></tr>\n",
" <tr><th>340</th><td>Admont</td></tr>\n",
" <tr><th>354</th><td>Aeuckens</td></tr>\n",
" <tr><th>308</th><td>Aflenz</td></tr>\n",
" <tr><th>934</th><td>Afritz</td></tr>\n",
" <tr><th>982</th><td>Aggsbach</td></tr>\n",
" <tr><th>819</th><td>Aigen</td></tr>\n",
" <tr><th>462</th><td>Albendorf</td></tr>\n",
" <tr><th>57</th><td>Alexandria</td></tr>\n",
" <tr><th>1111</th><td>Alland</td></tr>\n",
" <tr><th>786</th><td>Alland II</td></tr>\n",
" <tr><th>947</th><td>Allensteig</td></tr>\n",
" <tr><th>933</th><td>Allentsteig</td></tr>\n",
" <tr><th>765</th><td>Allerheiligen</td></tr>\n",
" <tr><th>396</th><td>Alsfeld</td></tr>\n",
" <tr><th>730</th><td>Alt Aussee</td></tr>\n",
" <tr><th>736</th><td>Alt Lengbach</td></tr>\n",
" <tr><th>1348</th><td>Altaussee</td></tr>\n",
" <tr><th>955</th><td>Altenberg</td></tr>\n",
" <tr><th>...</th><td>...</td></tr>\n",
" <tr><th>1177</th><td>Würflach</td></tr>\n",
" <tr><th>829</th><td>Würnitz</td></tr>\n",
" <tr><th>650</th><td>Würzburg</td></tr>\n",
" <tr><th>674</th><td>Ybbs</td></tr>\n",
" <tr><th>651</th><td>Ypres</td></tr>\n",
" <tr><th>802</th><td>Ypser</td></tr>\n",
" <tr><th>1487</th><td>Ysper</td></tr>\n",
" <tr><th>1144</th><td>Zakopane</td></tr>\n",
" <tr><th>1249</th><td>Zantan</td></tr>\n",
" <tr><th>652</th><td>Zara</td></tr>\n",
" <tr><th>879</th><td>Zbiroh</td></tr>\n",
" <tr><th>1028</th><td>Zell a. See</td></tr>\n",
" <tr><th>45</th><td>Zell am See</td></tr>\n",
" <tr><th>1074</th><td>Zistersdorf</td></tr>\n",
" <tr><th>37</th><td>Zittau</td></tr>\n",
" <tr><th>1368</th><td>Zlabing</td></tr>\n",
" <tr><th>12</th><td>Znaim</td></tr>\n",
" <tr><th>1369</th><td>Zuckmantel</td></tr>\n",
" <tr><th>560</th><td>Zurigo</td></tr>\n",
" <tr><th>654</th><td>Zurzach</td></tr>\n",
" <tr><th>431</th><td>Zwettl</td></tr>\n",
" <tr><th>401</th><td>Zwiesel</td></tr>\n",
" <tr><th>33</th><td>Zürich</td></tr>\n",
" <tr><th>1521</th><td>gr</td></tr>\n",
" <tr><th>769</th><td>spitz</td></tr>\n",
" <tr><th>861</th><td>w</td></tr>\n",
" <tr><th>417</th><td>Č. Krumlov</td></tr>\n",
" <tr><th>1304</th><td>Łuck</td></tr>\n",
" <tr><th>893</th><td>Šibenik</td></tr>\n",
" <tr><th>0</th><td>NaN</td></tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1545 rows × 1 columns</p>\n",
"</div>" ], "text/plain": [ " 0\n", "1248 Békéscsaba \n", "1303 Łuck\n", "389 #\n", "1489 A B.\n", "1239 A.\n", "1397 Aachen\n", "487 Abbazia\n", "1280 Abbazia-Lovrana\n", "313 Absam b. Innsbruck\n", "1181 Abtenau\n", "722 Achensee\n", "1479 Adelsberg\n", "340 Admont\n", "354 Aeuckens\n", "308 Aflenz\n", "934 Afritz\n", "982 Aggsbach\n", "819 Aigen\n", "462 Albendorf\n", "57 Alexandria\n", "1111 Alland\n", "786 Alland II\n", "947 Allensteig\n", "933 Allentsteig\n", "765 Allerheiligen\n", "396 Alsfeld\n", "730 Alt Aussee\n", "736 Alt Lengbach\n", "1348 Altaussee\n", "955 Altenberg\n", "... ...\n", "1177 Würflach\n", "829 Würnitz\n", "650 Würzburg\n", "674 Ybbs\n", "651 Ypres\n", "802 Ypser\n", "1487 Ysper\n", "1144 Zakopane\n", "1249 Zantan\n", "652 Zara\n", "879 Zbiroh\n", "1028 Zell a. See\n", "45 Zell am See\n", "1074 Zistersdorf\n", "37 Zittau\n", "1368 Zlabing\n", "12 Znaim\n", "1369 Zuckmantel\n", "560 Zurigo\n", "654 Zurzach\n", "431 Zwettl\n", "401 Zwiesel\n", "33 Zürich\n", "1521 gr\n", "769 spitz\n", "861 w\n", "417 Č. Krumlov\n", "1304 Łuck\n", "893 Šibenik\n", "0 NaN\n", "\n", "[1545 rows x 1 columns]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pp.sort_values(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Something odd is going on with 'Békéscsaba': it sorts before every other entry. What is wrong?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's extract that row:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Békéscsaba \n", "Name: 1248, dtype: object" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pp.iloc[1248]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More specifically, the value in column `0`:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "' Békéscsaba '" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pp.iloc[1248][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a space in front of the 'B' (and another after the final 'a'). A space sorts before any letter, which is why this entry ends up at the top." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }