{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# AKON Metadata - Data Overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Get a first impression of the postcard metadata*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook uses the [pandas Python Data Analysis Library](https://pandas.pydata.org/).\n",
    "\n",
    "For an introduction to pandas, take a look at this [Workshop for CBioVikings](https://github.com/dblyon/PandasIntro) by David Lyon."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`df` is short for *DataFrame*, the pandas table type that holds the loaded data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv('https://labs.onb.ac.at/gitlab/labs-team/raw-metadata/raw/master/akon_postcards_public_domain.csv.bz2', compression='bz2')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## View Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Rough Overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How many records are in there?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "34846"
     },
     "metadata": {},
     "execution_count": 5
    }
   ],
   "source": [
    "len(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What does a record look like?\n",
    "Show me the first one!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "   Unnamed: 0    akon_id     id  altitude building                city  color  \\\n0           0  AK111_021  74682       NaN      NaN  Kiel, Blücherplatz  False   \n\n    comment mountain other  ... geoname_id  latitude longitude  name  \\\n0  1921 gel      NaN   NaN  ...  2891122.0  54.32133  10.13489  Kiel   \n\n  country_id  admin_name_1 admin_code_1                 geo  \\\n0         DE           NaN          NaN  54.32133, 10.13489   \n\n                                       download_link  \\\n0  https://iiif.onb.ac.at/images/AKON/AK111_021/0...   \n\n                               download_link_256x256  \n0  https://iiif.onb.ac.at/images/AKON/AK111_021/0...  \n\n[1 rows x 32 columns]",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Unnamed: 0</th>\n      <th>akon_id</th>\n      <th>id</th>\n      <th>altitude</th>\n      <th>building</th>\n      <th>city</th>\n      <th>color</th>\n      <th>comment</th>\n      <th>mountain</th>\n      <th>other</th>\n      <th>...</th>\n      <th>geoname_id</th>\n      <th>latitude</th>\n      <th>longitude</th>\n      <th>name</th>\n      <th>country_id</th>\n      <th>admin_name_1</th>\n      <th>admin_code_1</th>\n      <th>geo</th>\n      <th>download_link</th>\n      <th>download_link_256x256</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>0</td>\n      <td>AK111_021</td>\n      <td>74682</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>Kiel, Blücherplatz</td>\n      <td>False</td>\n      <td>1921 gel</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>...</td>\n      <td>2891122.0</td>\n      <td>54.32133</td>\n      <td>10.13489</td>\n      <td>Kiel</td>\n      <td>DE</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>54.32133, 10.13489</td>\n      <td>https://iiif.onb.ac.at/images/AKON/AK111_021/0...</td>\n      <td>https://iiif.onb.ac.at/images/AKON/AK111_021/0...</td>\n    </tr>\n  </tbody>\n</table>\n<p>1 rows × 32 columns</p>\n</div>"
     },
     "metadata": {},
     "execution_count": 6
    }
   ],
   "source": [
    "df.head(1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some columns are missing from the output: pandas abbreviates wide tables and shows `...` in place of the hidden columns. Let's fix that by raising the `display.max_columns` display option:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.set_option('display.max_columns', 100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's try again:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "   Unnamed: 0    akon_id     id  altitude building                city  color  \\\n0           0  AK111_021  74682       NaN      NaN  Kiel, Blücherplatz  False   \n\n    comment mountain other photographer publisher publisher_place region  \\\n0  1921 gel      NaN   NaN          NaN       NaN             NaN    NaN   \n\n  water_body  year inventory_number                                signature  \\\n0        NaN   NaN              NaN  Geogr. Topogr. Bilder-Samml. 1943, 7735   \n\n             revision_date           date feature_class feature_code  \\\n0  2014-09-05 10:13:06.342  gelaufen 1921             P         PPLA   \n\n   geoname_id  latitude  longitude  name country_id admin_name_1 admin_code_1  \\\n0   2891122.0  54.32133   10.13489  Kiel         DE          NaN          NaN   \n\n                  geo                                      download_link  \\\n0  54.32133, 10.13489  https://iiif.onb.ac.at/images/AKON/AK111_021/0...   \n\n                               download_link_256x256  \n0  https://iiif.onb.ac.at/images/AKON/AK111_021/0...  ",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Unnamed: 0</th>\n      <th>akon_id</th>\n      <th>id</th>\n      <th>altitude</th>\n      <th>building</th>\n      <th>city</th>\n      <th>color</th>\n      <th>comment</th>\n      <th>mountain</th>\n      <th>other</th>\n      <th>photographer</th>\n      <th>publisher</th>\n      <th>publisher_place</th>\n      <th>region</th>\n      <th>water_body</th>\n      <th>year</th>\n      <th>inventory_number</th>\n      <th>signature</th>\n      <th>revision_date</th>\n      <th>date</th>\n      <th>feature_class</th>\n      <th>feature_code</th>\n      <th>geoname_id</th>\n      <th>latitude</th>\n      <th>longitude</th>\n      <th>name</th>\n      <th>country_id</th>\n      <th>admin_name_1</th>\n      <th>admin_code_1</th>\n      <th>geo</th>\n      <th>download_link</th>\n      <th>download_link_256x256</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>0</td>\n      <td>AK111_021</td>\n      <td>74682</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>Kiel, Blücherplatz</td>\n      <td>False</td>\n      <td>1921 gel</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>Geogr. Topogr. Bilder-Samml. 1943, 7735</td>\n      <td>2014-09-05 10:13:06.342</td>\n      <td>gelaufen 1921</td>\n      <td>P</td>\n      <td>PPLA</td>\n      <td>2891122.0</td>\n      <td>54.32133</td>\n      <td>10.13489</td>\n      <td>Kiel</td>\n      <td>DE</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>54.32133, 10.13489</td>\n      <td>https://iiif.onb.ac.at/images/AKON/AK111_021/0...</td>\n      <td>https://iiif.onb.ac.at/images/AKON/AK111_021/0...</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 8
    }
   ],
   "source": [
    "df.head(1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we see all columns."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What are all the columns called again?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "Index(['Unnamed: 0', 'akon_id', 'id', 'altitude', 'building', 'city', 'color',\n       'comment', 'mountain', 'other', 'photographer', 'publisher',\n       'publisher_place', 'region', 'water_body', 'year', 'inventory_number',\n       'signature', 'revision_date', 'date', 'feature_class', 'feature_code',\n       'geoname_id', 'latitude', 'longitude', 'name', 'country_id',\n       'admin_name_1', 'admin_code_1', 'geo', 'download_link',\n       'download_link_256x256'],\n      dtype='object')"
     },
     "metadata": {},
     "execution_count": 9
    }
   ],
   "source": [
    "df.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Show Random Entries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Show me 3 random entries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "       Unnamed: 0    akon_id     id  altitude                  building  \\\n7880         7880  AK086_229  54365       NaN  Montraux-Palace, Belmont   \n16014       16014  AK015_033   8502       NaN        Schloss Waldhausen   \n26648       26648  AK053_218  31570       NaN                       NaN   \n\n                                  city  color   comment            mountain  \\\n7880                          Montreux   True  1911 gel  Alpes de la Savoie   \n16014                              NaN  False       NaN                 NaN   \n26648  Kalksburg, Breitenfurterstrasse  False       NaN                 NaN   \n\n      other photographer      publisher publisher_place region water_body  \\\n7880    NaN          NaN  Photoglob Co.          Zürich    NaN        NaN   \n16014   NaN          NaN            NaN             NaN    NaN        NaN   \n26648   NaN          NaN          Janko       Kalksburg    NaN        NaN   \n\n         year inventory_number signature            revision_date  \\\n7880      NaN              NaN       NaN  2014-08-27 15:44:51.079   \n16014  1910.0              NaN       NaN  2014-08-04 07:59:10.026   \n26648  1918.0              NaN       NaN  2014-08-04 07:59:10.386   \n\n                date feature_class feature_code  geoname_id  latitude  \\\n7880   gelaufen 1911             P          PPL   2659601.0  46.43301   \n16014           1910             P          PPL   2762012.0  48.27377   \n26648           1918             A         ADM4   2774904.0  48.13754   \n\n       longitude                      name country_id admin_name_1  \\\n7880     6.91143                  Montreux         CH        Waadt   \n16014   14.94750  Waldhausen im Strudengau         AT          NaN   \n26648   16.24599                 Kalksburg         AT          NaN   \n\n      admin_code_1                 geo  \\\n7880            VD   46.43301, 6.91143   \n16014          NaN   48.27377, 14.9475   \n26648          NaN  48.13754, 16.24599   \n\n                                           download_link  \\\n7880   https://iiif.onb.ac.at/images/AKON/AK086_229/2...   \n16014  https://iiif.onb.ac.at/images/AKON/AK015_033/0...   \n26648  https://iiif.onb.ac.at/images/AKON/AK053_218/2...   \n\n                                   download_link_256x256  \n7880   https://iiif.onb.ac.at/images/AKON/AK086_229/2...  \n16014  https://iiif.onb.ac.at/images/AKON/AK015_033/0...  \n26648  https://iiif.onb.ac.at/images/AKON/AK053_218/2...  ",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Unnamed: 0</th>\n      <th>akon_id</th>\n      <th>id</th>\n      <th>altitude</th>\n      <th>building</th>\n      <th>city</th>\n      <th>color</th>\n      <th>comment</th>\n      <th>mountain</th>\n      <th>other</th>\n      <th>photographer</th>\n      <th>publisher</th>\n      <th>publisher_place</th>\n      <th>region</th>\n      <th>water_body</th>\n      <th>year</th>\n      <th>inventory_number</th>\n      <th>signature</th>\n      <th>revision_date</th>\n      <th>date</th>\n      <th>feature_class</th>\n      <th>feature_code</th>\n      <th>geoname_id</th>\n      <th>latitude</th>\n      <th>longitude</th>\n      <th>name</th>\n      <th>country_id</th>\n      <th>admin_name_1</th>\n      <th>admin_code_1</th>\n      <th>geo</th>\n      <th>download_link</th>\n      <th>download_link_256x256</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>7880</th>\n      <td>7880</td>\n      <td>AK086_229</td>\n      <td>54365</td>\n      <td>NaN</td>\n      <td>Montraux-Palace, Belmont</td>\n      <td>Montreux</td>\n      <td>True</td>\n      <td>1911 gel</td>\n      <td>Alpes de la Savoie</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>Photoglob Co.</td>\n      <td>Zürich</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>2014-08-27 15:44:51.079</td>\n      <td>gelaufen 1911</td>\n      <td>P</td>\n      <td>PPL</td>\n      <td>2659601.0</td>\n      <td>46.43301</td>\n      <td>6.91143</td>\n      <td>Montreux</td>\n      <td>CH</td>\n      <td>Waadt</td>\n      <td>VD</td>\n      <td>46.43301, 6.91143</td>\n      <td>https://iiif.onb.ac.at/images/AKON/AK086_229/2...</td>\n      <td>https://iiif.onb.ac.at/images/AKON/AK086_229/2...</td>\n    </tr>\n    <tr>\n      <th>16014</th>\n      <td>16014</td>\n      <td>AK015_033</td>\n      <td>8502</td>\n      <td>NaN</td>\n      <td>Schloss Waldhausen</td>\n      <td>NaN</td>\n      <td>False</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>1910.0</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>2014-08-04 07:59:10.026</td>\n      <td>1910</td>\n      <td>P</td>\n      <td>PPL</td>\n      <td>2762012.0</td>\n      <td>48.27377</td>\n      <td>14.94750</td>\n      <td>Waldhausen im Strudengau</td>\n      <td>AT</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>48.27377, 14.9475</td>\n      <td>https://iiif.onb.ac.at/images/AKON/AK015_033/0...</td>\n      <td>https://iiif.onb.ac.at/images/AKON/AK015_033/0...</td>\n    </tr>\n    <tr>\n      <th>26648</th>\n      <td>26648</td>\n      <td>AK053_218</td>\n      <td>31570</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>Kalksburg, Breitenfurterstrasse</td>\n      <td>False</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>Janko</td>\n      <td>Kalksburg</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>1918.0</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>2014-08-04 07:59:10.386</td>\n      <td>1918</td>\n      <td>A</td>\n      <td>ADM4</td>\n      <td>2774904.0</td>\n      <td>48.13754</td>\n      <td>16.24599</td>\n      <td>Kalksburg</td>\n      <td>AT</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>48.13754, 16.24599</td>\n      <td>https://iiif.onb.ac.at/images/AKON/AK053_218/2...</td>\n      <td>https://iiif.onb.ac.at/images/AKON/AK053_218/2...</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 10
    }
   ],
   "source": [
    "df.sample(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calling `sample` again yields different entries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "       Unnamed: 0    akon_id     id  altitude    building         city  color  \\\n32175       32175  AK068_223  41634       NaN         NaN        Baden   True   \n17757       17757  AK021_519  12608       NaN         NaN  Zell am See  False   \n2833         2833  AK121_438  81017       NaN  Festenburg   Festenburg  False   \n\n      comment mountain other photographer    publisher publisher_place region  \\\n32175     NaN      NaN   NaN          NaN        Bauer            Wien    NaN   \n17757  v 1907      NaN   NaN          NaN    Ledermann            Wien    NaN   \n2833      NaN      NaN   NaN          NaN  Pelnitschar          Aspang    NaN   \n\n       water_body    year inventory_number  \\\n32175         NaN  1913.0              NaN   \n17757  Zeller See     NaN              NaN   \n2833          NaN  1920.0              NaN   \n\n                                      signature            revision_date  \\\n32175                     Vues-Sammlung I. 7425  2014-08-13 14:19:10.145   \n17757                                       NaN  2014-08-04 07:59:10.136   \n2833   Nationalbibliothek Karten Abteilung 3062  2014-09-12 08:38:22.055   \n\n           date feature_class feature_code  geoname_id  latitude  longitude  \\\n32175      1913             P        PPLA3   2782067.0  48.00543   16.23264   \n17757  vor 1907             P        PPLA3   2760634.0  47.32556   12.79444   \n2833       1920             S         CSTL   2779616.0  47.45000   15.91667   \n\n                 name country_id admin_name_1 admin_code_1  \\\n32175  Baden bei Wien         AT          NaN          NaN   \n17757     Zell am See         AT          NaN          NaN   \n2833       Festenburg         AT          NaN          NaN   \n\n                      geo                                      download_link  \\\n32175  48.00543, 16.23264  https://iiif.onb.ac.at/images/AKON/AK068_223/2...   \n17757  47.32556, 12.79444  https://iiif.onb.ac.at/images/AKON/AK021_519/5...   \n2833      47.45, 15.91667  https://iiif.onb.ac.at/images/AKON/AK121_438/4...   \n\n                                   download_link_256x256  \n32175  https://iiif.onb.ac.at/images/AKON/AK068_223/2...  \n17757  https://iiif.onb.ac.at/images/AKON/AK021_519/5...  \n2833   https://iiif.onb.ac.at/images/AKON/AK121_438/4...  ",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Unnamed: 0</th>\n      <th>akon_id</th>\n      <th>id</th>\n      <th>altitude</th>\n      <th>building</th>\n      <th>city</th>\n      <th>color</th>\n      <th>comment</th>\n      <th>mountain</th>\n      <th>other</th>\n      <th>photographer</th>\n      <th>publisher</th>\n      <th>publisher_place</th>\n      <th>region</th>\n      <th>water_body</th>\n      <th>year</th>\n      <th>inventory_number</th>\n      <th>signature</th>\n      <th>revision_date</th>\n      <th>date</th>\n      <th>feature_class</th>\n      <th>feature_code</th>\n      <th>geoname_id</th>\n      <th>latitude</th>\n      <th>longitude</th>\n      <th>name</th>\n      <th>country_id</th>\n      <th>admin_name_1</th>\n      <th>admin_code_1</th>\n      <th>geo</th>\n      <th>download_link</th>\n      <th>download_link_256x256</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>32175</th>\n      <td>32175</td>\n      <td>AK068_223</td>\n      <td>41634</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>Baden</td>\n      <td>True</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>Bauer</td>\n      <td>Wien</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>1913.0</td>\n      <td>NaN</td>\n      <td>Vues-Sammlung I. 7425</td>\n      <td>2014-08-13 14:19:10.145</td>\n      <td>1913</td>\n      <td>P</td>\n      <td>PPLA3</td>\n      <td>2782067.0</td>\n      <td>48.00543</td>\n      <td>16.23264</td>\n      <td>Baden bei Wien</td>\n      <td>AT</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>48.00543, 16.23264</td>\n      <td>https://iiif.onb.ac.at/images/AKON/AK068_223/2...</td>\n      <td>https://iiif.onb.ac.at/images/AKON/AK068_223/2...</td>\n    </tr>\n    <tr>\n      <th>17757</th>\n      <td>17757</td>\n      <td>AK021_519</td>\n      <td>12608</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>Zell am See</td>\n      <td>False</td>\n      <td>v 1907</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>Ledermann</td>\n      <td>Wien</td>\n      <td>NaN</td>\n      <td>Zeller See</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>2014-08-04 07:59:10.136</td>\n      <td>vor 1907</td>\n      <td>P</td>\n      <td>PPLA3</td>\n      <td>2760634.0</td>\n      <td>47.32556</td>\n      <td>12.79444</td>\n      <td>Zell am See</td>\n      <td>AT</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>47.32556, 12.79444</td>\n      <td>https://iiif.onb.ac.at/images/AKON/AK021_519/5...</td>\n      <td>https://iiif.onb.ac.at/images/AKON/AK021_519/5...</td>\n    </tr>\n    <tr>\n      <th>2833</th>\n      <td>2833</td>\n      <td>AK121_438</td>\n      <td>81017</td>\n      <td>NaN</td>\n      <td>Festenburg</td>\n      <td>Festenburg</td>\n      <td>False</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>Pelnitschar</td>\n      <td>Aspang</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>1920.0</td>\n      <td>NaN</td>\n      <td>Nationalbibliothek Karten Abteilung 3062</td>\n      <td>2014-09-12 08:38:22.055</td>\n      <td>1920</td>\n      <td>S</td>\n      <td>CSTL</td>\n      <td>2779616.0</td>\n      <td>47.45000</td>\n      <td>15.91667</td>\n      <td>Festenburg</td>\n      <td>AT</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>47.45, 15.91667</td>\n      <td>https://iiif.onb.ac.at/images/AKON/AK121_438/4...</td>\n      <td>https://iiif.onb.ac.at/images/AKON/AK121_438/4...</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 11
    }
   ],
   "source": [
    "df.sample(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Count Things"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How many postcards show places in Italy?\n",
    "\n",
    "Let's use the `country_id` column for this question:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_in_italy = df[df['country_id'] == 'IT']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "3221"
     },
     "metadata": {},
     "execution_count": 13
    }
   ],
   "source": [
    "len(df_in_italy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How many postcards are in color?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_in_color = df[df['color'] == True]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "7667"
     },
     "metadata": {},
     "execution_count": 15
    }
   ],
   "source": [
    "len(df_in_color)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Can I do this in one line?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "7667"
     },
     "metadata": {},
     "execution_count": 16
    }
   ],
   "source": [
    "len(df[df['color'] == True])"
   ]
  },
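  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How many different publisher places are in the dataset? Note that `unique()` treats the missing value `NaN` as a value of its own, so it is included in the count:"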
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "1545"
     },
     "metadata": {},
     "execution_count": 17
    }
   ],
   "source": [
    "len(df['publisher_place'].unique())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Show me some! Let's wrap it in a pandas DataFrame, step by step:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "publisher_places = df['publisher_place'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "array([nan, 'Wien', 'Kierling', ..., 'Königstein i. T.', 'Detmold',\n       'Furth i. W.'], dtype=object)"
     },
     "metadata": {},
     "execution_count": 19
    }
   ],
   "source": [
    "publisher_places"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "pp = pd.DataFrame(publisher_places)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "                     0\n0                  NaN\n1                 Wien\n2             Kierling\n3             Kindberg\n4              Kirchau\n...                ...\n1540              Pisa\n1541   Straßburg i./E.\n1542  Königstein i. T.\n1543           Detmold\n1544       Furth i. W.\n\n[1545 rows x 1 columns]",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>0</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>Wien</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>Kierling</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>Kindberg</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>Kirchau</td>\n    </tr>\n    <tr>\n      <th>...</th>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>1540</th>\n      <td>Pisa</td>\n    </tr>\n    <tr>\n      <th>1541</th>\n      <td>Straßburg i./E.</td>\n    </tr>\n    <tr>\n      <th>1542</th>\n      <td>Königstein i. T.</td>\n    </tr>\n    <tr>\n      <th>1543</th>\n      <td>Detmold</td>\n    </tr>\n    <tr>\n      <th>1544</th>\n      <td>Furth i. W.</td>\n    </tr>\n  </tbody>\n</table>\n<p>1545 rows × 1 columns</p>\n</div>"
     },
     "metadata": {},
     "execution_count": 21
    }
   ],
   "source": [
    "pp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Better. Now show me a few at random:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "                        0\n624        Braunau a. Inn\n1360              Kratzau\n1319          Nový Bydžov\n592                Hyères\n1060  Kapellen a. d. Mürz",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>0</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>624</th>\n      <td>Braunau a. Inn</td>\n    </tr>\n    <tr>\n      <th>1360</th>\n      <td>Kratzau</td>\n    </tr>\n    <tr>\n      <th>1319</th>\n      <td>Nový Bydžov</td>\n    </tr>\n    <tr>\n      <th>592</th>\n      <td>Hyères</td>\n    </tr>\n    <tr>\n      <th>1060</th>\n      <td>Kapellen a. d. Mürz</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 22
    }
   ],
   "source": [
    "pp.sample(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sort Things"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Just sort the sample, please:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "                 0\n29            Arys\n576   Buenos Aires\n1372   Getzersdorf\n148         Kochel\n1119   Maria Trost\n238      Mariazell\n1096          Melk\n273         Münden\n1196         Stein\n1047         Vitis",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>0</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>29</th>\n      <td>Arys</td>\n    </tr>\n    <tr>\n      <th>576</th>\n      <td>Buenos Aires</td>\n    </tr>\n    <tr>\n      <th>1372</th>\n      <td>Getzersdorf</td>\n    </tr>\n    <tr>\n      <th>148</th>\n      <td>Kochel</td>\n    </tr>\n    <tr>\n      <th>1119</th>\n      <td>Maria Trost</td>\n    </tr>\n    <tr>\n      <th>238</th>\n      <td>Mariazell</td>\n    </tr>\n    <tr>\n      <th>1096</th>\n      <td>Melk</td>\n    </tr>\n    <tr>\n      <th>273</th>\n      <td>Münden</td>\n    </tr>\n    <tr>\n      <th>1196</th>\n      <td>Stein</td>\n    </tr>\n    <tr>\n      <th>1047</th>\n      <td>Vitis</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 23
    }
   ],
   "source": [
    "pp.sample(10).sort_values(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Why the `0` in `sort_values(0)`? It is the label of the column to sort by: `pp` has a single column, labelled with the integer `0`."
   ]
  },
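  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of how such a column label can arise, using a hypothetical three-row stand-in for `pp` (the name `places` and its values are illustrative, not taken from the dataset): a DataFrame built from a plain list gets the integer column label `0`, and that label is what `sort_values` is given."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical stand-in for pp; the real data is not used here\n",
    "places = pd.DataFrame(['Melk', 'Arys', 'Vitis'])\n",
    "\n",
    "# The single column carries the integer label 0, not the string '0'\n",
    "print(places.columns.tolist())\n",
    "\n",
    "# by=0 is the explicit spelling of sort_values(0)\n",
    "places.sort_values(by=0)"
   ]
  },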
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sort the whole DataFrame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "                 0\n1248   Békéscsaba \n1303          Łuck\n389              #\n1489          A B.\n1239            A.\n...            ...\n861              w\n417     Č. Krumlov\n1304          Łuck\n893        Šibenik\n0              NaN\n\n[1545 rows x 1 columns]",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>0</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>1248</th>\n      <td>Békéscsaba</td>\n    </tr>\n    <tr>\n      <th>1303</th>\n      <td>Łuck</td>\n    </tr>\n    <tr>\n      <th>389</th>\n      <td>#</td>\n    </tr>\n    <tr>\n      <th>1489</th>\n      <td>A B.</td>\n    </tr>\n    <tr>\n      <th>1239</th>\n      <td>A.</td>\n    </tr>\n    <tr>\n      <th>...</th>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>861</th>\n      <td>w</td>\n    </tr>\n    <tr>\n      <th>417</th>\n      <td>Č. Krumlov</td>\n    </tr>\n    <tr>\n      <th>1304</th>\n      <td>Łuck</td>\n    </tr>\n    <tr>\n      <th>893</th>\n      <td>Šibenik</td>\n    </tr>\n    <tr>\n      <th>0</th>\n      <td>NaN</td>\n    </tr>\n  </tbody>\n</table>\n<p>1545 rows × 1 columns</p>\n</div>"
     },
     "metadata": {},
     "execution_count": 24
    }
   ],
   "source": [
    "pp.sort_values(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Something odd is going on with 'Békéscsaba': it sorts to the very top, even before '#'. What is wrong?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's extract that row (index 1248):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "0     Békéscsaba \nName: 1248, dtype: object"
     },
     "metadata": {},
     "execution_count": 25
    }
   ],
   "source": [
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "More specifically, the value in column `0`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "' Békéscsaba '"
     },
     "metadata": {},
     "execution_count": 26
    }
   ],
   "source": [
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is a space in front of the 'B' (and another after the final 'a'). A space sorts before every letter and before '#', which is why this entry ends up at the top."
   ]
  }
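  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of one possible cleanup, assuming the surrounding spaces are unintended: strip them with `Series.str.strip()` before sorting. The three-row `places` frame is a hypothetical stand-in for `pp`, not the real data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical miniature of pp, including the padded value seen above\n",
    "places = pd.DataFrame([' Békéscsaba ', 'Arys', 'Melk'])\n",
    "\n",
    "# Raw sort: the leading space makes ' Békéscsaba ' come first\n",
    "print(places.sort_values(0)[0].tolist())\n",
    "\n",
    "# Strip surrounding whitespace, then sort again\n",
    "cleaned = places[0].str.strip()\n",
    "print(cleaned.sort_values().tolist())"
   ]
  }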
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.7.7 64-bit ('venv': venv)",
   "language": "python",
   "name": "python37764bitvenvvenveb3c9aa788d446a5bb7cfee674062d0a"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7-final"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}