{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# AKON Metadata - Data Overview"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Get a first impression of the postcard metadata*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook uses the [pandas Python Data Analysis Library](https://pandas.pydata.org/).\n",
"\n",
"For an introduction to pandas, take a look at this [Workshop for CBioVikings](https://github.com/dblyon/PandasIntro) by David Lyon."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`df` stands for *DataFrame*, the pandas table structure."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/kst/tmp/dingsdi/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3049: DtypeWarning: Columns (13) have mixed types. Specify dtype option on import or set low_memory=False.\n",
" interactivity=interactivity, compiler=compiler, result=result)\n"
]
}
],
"source": [
"df = pd.read_csv('https://labs.onb.ac.at/gitlab/labs-team/raw-metadata/raw/master/akon_postcards_public_domain.csv.bz2', compression='bz2')"
]
},
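{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `DtypeWarning` above means that pandas inferred different types for different chunks of one column (position 13, counting from zero, which should be `region` in the column listing further down). One way to avoid this is to declare the type up front. A minimal sketch on a made-up two-row CSV; `csv_text` and `toy` are hypothetical names, not part of the AKON data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import io\n",
"import pandas as pd\n",
"\n",
"# Made-up sample in which 'region' holds both a number and a text value\n",
"csv_text = '''id,region\n",
"1,12\n",
"2,Tirol\n",
"'''\n",
"\n",
"# Reading the column as str gives every row the same type\n",
"toy = pd.read_csv(io.StringIO(csv_text), dtype={'region': str})\n",
"toy['region'].tolist()"
]
},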
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## View Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Rough Overview"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How many records are in the data set?"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"34846"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What does a single record look like?\n",
"Show me the first one!"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Unnamed: 0 akon_id id altitude building city color \\\n",
"0 0 AK111_021 74682 NaN NaN Kiel, Blücherplatz False \n",
"\n",
" comment mountain other ... geoname_id latitude longitude name \\\n",
"0 1921 gel NaN NaN ... 2891122.0 54.32133 10.13489 Kiel \n",
"\n",
" country_id admin_name_1 admin_code_1 geo \\\n",
"0 DE NaN NaN 54.32133, 10.13489 \n",
"\n",
" download_link \\\n",
"0 https://iiif.onb.ac.at/images/AKON/AK111_021/0... \n",
"\n",
" download_link_256x256 \n",
"0 https://iiif.onb.ac.at/images/AKON/AK111_021/0... \n",
"\n",
"[1 rows x 32 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some columns are missing from the output: pandas abbreviated them to `...`. Let's fix that by raising the pandas display limit:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"pd.set_option('display.max_columns', 100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try again:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Unnamed: 0 akon_id id altitude building city color \\\n",
"0 0 AK111_021 74682 NaN NaN Kiel, Blücherplatz False \n",
"\n",
" comment mountain other photographer publisher publisher_place region \\\n",
"0 1921 gel NaN NaN NaN NaN NaN NaN \n",
"\n",
" water_body year inventory_number signature \\\n",
"0 NaN NaN NaN Geogr. Topogr. Bilder-Samml. 1943, 7735 \n",
"\n",
" revision_date date feature_class feature_code \\\n",
"0 2014-09-05 10:13:06.342 gelaufen 1921 P PPLA \n",
"\n",
" geoname_id latitude longitude name country_id admin_name_1 admin_code_1 \\\n",
"0 2891122.0 54.32133 10.13489 Kiel DE NaN NaN \n",
"\n",
" geo download_link \\\n",
"0 54.32133, 10.13489 https://iiif.onb.ac.at/images/AKON/AK111_021/0... \n",
"\n",
" download_link_256x256 \n",
"0 https://iiif.onb.ac.at/images/AKON/AK111_021/0... "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we see all columns."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What are all the columns called again?"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Unnamed: 0', 'akon_id', 'id', 'altitude', 'building', 'city', 'color',\n",
" 'comment', 'mountain', 'other', 'photographer', 'publisher',\n",
" 'publisher_place', 'region', 'water_body', 'year', 'inventory_number',\n",
" 'signature', 'revision_date', 'date', 'feature_class', 'feature_code',\n",
" 'geoname_id', 'latitude', 'longitude', 'name', 'country_id',\n",
" 'admin_name_1', 'admin_code_1', 'geo', 'download_link',\n",
" 'download_link_256x256'],\n",
" dtype='object')"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Show Random Entries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show me 3 random entries:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Unnamed: 0 akon_id id altitude building \\\n",
"28908 28908 AK066_086 40120 NaN NaN \n",
"21317 21317 AK034_386 20303 251.0 NaN \n",
"23201 23201 AK041_572 24699 251.0 NaN \n",
"\n",
" city color comment mountain other \\\n",
"28908 Innsbruck, Maria Theresienstrasse False 1907 gel NaN NaN \n",
"21317 Gars-Thunau am Kamp True NaN NaN NaN \n",
"23201 Gars-Thunau am Kamp False NaN NaN NaN \n",
"\n",
" photographer publisher publisher_place region water_body year \\\n",
"28908 NaN Gratl Innsbruck NaN NaN NaN \n",
"21317 NaN Ledermann Wien NaN NaN 1909.0 \n",
"23201 NaN Ledermann Wien NaN NaN 1908.0 \n",
"\n",
" inventory_number signature revision_date date \\\n",
"28908 NaN NaN 2014-08-04 07:59:10.424 vor 1907 \n",
"21317 NaN NaN 2014-08-04 07:59:10.272 1909 \n",
"23201 NaN NaN 2014-08-04 07:59:10.328 1908 \n",
"\n",
" feature_class feature_code geoname_id latitude longitude \\\n",
"28908 P PPLA 2775220.0 47.26266 11.39454 \n",
"21317 P PPL 2763660.0 48.58333 15.65000 \n",
"23201 P PPL 2763660.0 48.58333 15.65000 \n",
"\n",
" name country_id admin_name_1 admin_code_1 \\\n",
"28908 Innsbruck AT NaN NaN \n",
"21317 Thunau am Kamp AT NaN NaN \n",
"23201 Thunau am Kamp AT NaN NaN \n",
"\n",
" geo download_link \\\n",
"28908 47.26266, 11.39454 https://iiif.onb.ac.at/images/AKON/AK066_086/0... \n",
"21317 48.58333, 15.65 https://iiif.onb.ac.at/images/AKON/AK034_386/3... \n",
"23201 48.58333, 15.65 https://iiif.onb.ac.at/images/AKON/AK041_572/5... \n",
"\n",
" download_link_256x256 \n",
"28908 https://iiif.onb.ac.at/images/AKON/AK066_086/0... \n",
"21317 https://iiif.onb.ac.at/images/AKON/AK034_386/3... \n",
"23201 https://iiif.onb.ac.at/images/AKON/AK041_572/5... "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sample(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calling `sample` again yields different entries:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Unnamed: 0 akon_id id altitude building city \\\n",
"18810 18810 AK025_111 14618 NaN NaN Bruck an der Mur \n",
"28146 28146 AK061_165 36541 NaN NaN Orosháza \n",
"8335 8335 AK088_563 56103 NaN NaN Bad Reichenhall \n",
"\n",
" color comment mountain other photographer publisher publisher_place \\\n",
"18810 True NaN Mugel NaN NaN Ledermann Wien \n",
"28146 True NaN NaN NaN NaN Vágner Orosháza \n",
"8335 False 1907 gel NaN NaN NaN NaN NaN \n",
"\n",
" region water_body year inventory_number \\\n",
"18810 NaN NaN 1916.0 NaN \n",
"28146 NaN NaN 1917.0 NaN \n",
"8335 NaN NaN NaN NaN \n",
"\n",
" signature \\\n",
"18810 NaN \n",
"28146 Kartensammlung 79/66 G \n",
"8335 Geogrphisch-topographische Bildersammlung 1076/43 \n",
"\n",
" revision_date date feature_class feature_code \\\n",
"18810 2014-10-15 12:03:01.028 1916 P PPLA3 \n",
"28146 2015-08-25 15:28:56.547 1917 P PPL \n",
"8335 2014-08-28 16:20:02.029 vor 1907 P PPLA3 \n",
"\n",
" geoname_id latitude longitude name country_id \\\n",
"18810 2781371.0 47.41667 15.28333 Bruck an der Mur AT \n",
"28146 716736.0 46.56667 20.66667 Oroshaza HU \n",
"8335 2953371.0 47.72947 12.87819 Bad Reichenhall DE \n",
"\n",
" admin_name_1 admin_code_1 geo \\\n",
"18810 NaN NaN 47.41667, 15.28333 \n",
"28146 Bekes County 03 46.56667, 20.66667 \n",
"8335 NaN NaN 47.72947, 12.87819 \n",
"\n",
" download_link \\\n",
"18810 https://iiif.onb.ac.at/images/AKON/AK025_111/1... \n",
"28146 https://iiif.onb.ac.at/images/AKON/AK061_165/1... \n",
"8335 https://iiif.onb.ac.at/images/AKON/AK088_563/5... \n",
"\n",
" download_link_256x256 \n",
"18810 https://iiif.onb.ac.at/images/AKON/AK025_111/1... \n",
"28146 https://iiif.onb.ac.at/images/AKON/AK061_165/1... \n",
"8335 https://iiif.onb.ac.at/images/AKON/AK088_563/5... "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sample(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Count Things"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How many entries show places in Italy?\n",
"\n",
"Let's use the `country_id` column to answer this:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"df_in_italy = df[df['country_id'] == 'IT']"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3221"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df_in_italy)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How many postcards are in color?"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"df_in_color = df[df['color'] == True]"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"7667"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df_in_color)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Can I do this in one line?"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"7667"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df[df['color'] == True])"
]
},
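{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since `True` counts as 1 when summed, a boolean column can also be counted without filtering first. A small sketch on made-up data (`toy` is a hypothetical stand-in for `df`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame({'color': [True, False, True, False]})\n",
"\n",
"# Summing a boolean column counts the True values\n",
"n_color = int(toy['color'].sum())\n",
"\n",
"# value_counts() shows the tally for both values at once\n",
"tally = toy['color'].value_counts()\n",
"n_color"
]
},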
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How many different publisher places are in the data set?"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1545"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df['publisher_place'].unique())"
]
},
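{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `unique()` keeps `NaN` as a value of its own (it is the first element of the array shown below), so the count above includes the missing entries. `nunique()` leaves `NaN` out by default. A sketch on made-up data; `places` is a hypothetical stand-in for `df['publisher_place']`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"places = pd.Series(['Wien', np.nan, 'Graz', 'Wien', np.nan])\n",
"\n",
"# unique() keeps NaN as one of the distinct values\n",
"with_nan = len(places.unique())\n",
"\n",
"# nunique() ignores NaN unless dropna=False is passed\n",
"without_nan = places.nunique()\n",
"with_nan, without_nan"
]
},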
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show me some!"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"ename": "AttributeError",
"evalue": "'numpy.ndarray' object has no attribute 'sample'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'publisher_place'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0munique\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msample\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m10\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mAttributeError\u001b[0m: 'numpy.ndarray' object has no attribute 'sample'"
]
}
],
"source": [
"df['publisher_place'].unique().sample(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Oh, that doesn't work: `unique()` returns a NumPy array, which has no `sample` method. Let's wrap it in a pandas DataFrame, step by step:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"publisher_places = df['publisher_place'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([nan, 'Wien', 'Kierling', ..., 'Königstein i. T.', 'Detmold',\n",
" 'Furth i. W.'], dtype=object)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"publisher_places"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"pp = pd.DataFrame(publisher_places)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" 0\n",
"0 NaN\n",
"1 Wien\n",
"2 Kierling\n",
"3 Kindberg\n",
"4 Kirchau\n",
"5 Kirchhain\n",
"6 München\n",
"7 Kitzbühel\n",
"8 Innsbruck\n",
"9 Klagenfurt\n",
"10 Grein a/D.\n",
"11 Bozen\n",
"12 Znaim\n",
"13 Graz\n",
"14 Heidelberg\n",
"15 Komotau\n",
"16 Gr. Siegharts\n",
"17 Köln\n",
"18 Bodenbach a. d. Elbe\n",
"19 Meissen\n",
"20 Leipzig\n",
"21 Konstanz\n",
"22 Korneuburg\n",
"23 Brașov\n",
"24 Mürzzuschlag\n",
"25 Salzburg\n",
"26 Frankfurt a. M.\n",
"27 Arosa\n",
"28 Kilchberg\n",
"29 Arys\n",
"... ...\n",
"1515 Friedau\n",
"1516 Wildalpe\n",
"1517 Gießhübl\n",
"1518 Schlossberg\n",
"1519 Frakfurt a. Oder\n",
"1520 Casale Monferrato\n",
"1521 gr\n",
"1522 Steinhaus a. Semmering\n",
"1523 Sternberg\n",
"1524 Stronsdorf\n",
"1525 Thörl\n",
"1526 Coburg\n",
"1527 Traismauer\n",
"1528 Trebnitz\n",
"1529 Unterlamm\n",
"1530 Daun\n",
"1531 Kilchberg-Züich\n",
"1532 Mühlhausen\n",
"1533 Eschwege\n",
"1534 Tabarz\n",
"1535 Suhl\n",
"1536 Weimar\n",
"1537 Friedrichsroda i. Th.\n",
"1538 Leipa i. B.\n",
"1539 Schumburg a. D.\n",
"1540 Pisa\n",
"1541 Straßburg i./E.\n",
"1542 Königstein i. T.\n",
"1543 Detmold\n",
"1544 Furth i. W.\n",
"\n",
"[1545 rows x 1 columns]"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pp"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Better. Now show me some randomly:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" 0\n",
"1007 Wörschach\n",
"1494 Raibl\n",
"339 Imst\n",
"879 Zbiroh\n",
"457 Bad Sachsa"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pp.sample(5)"
]
},
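{
"cell_type": "markdown",
"metadata": {},
"source": [
"An alternative that skips the detour through NumPy: `drop_duplicates()` returns a pandas Series, which does have a `sample` method. A sketch on made-up data (`places` is a hypothetical stand-in for `df['publisher_place']`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"places = pd.Series(['Wien', 'Graz', 'Wien', 'Innsbruck', 'Graz', 'Salzburg'])\n",
"\n",
"# drop_duplicates() keeps the first occurrence of each value and stays a Series\n",
"distinct = places.drop_duplicates()\n",
"\n",
"# sample() picks rows at random; random_state makes the draw repeatable\n",
"distinct.sample(2, random_state=0)"
]
},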
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sort Things"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just sort the sample, please:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" 0\n",
"599 Aue\n",
"938 Chocěn\n",
"314 Ernstbrunn\n",
"739 Hall Tirol\n",
"788 Hardegg\n",
"3 Kindberg\n",
"19 Meissen\n",
"725 Neuchatel\n",
"1211 Sommerein\n",
"1302 Vorkloster bei Bregenz"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pp.sample(10).sort_values(0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Why the `0` in `sort_values(0)`? That's the label of the column to sort by. The DataFrame was built from an unnamed array, so pandas numbered its only column `0`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sort the whole thing:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" 0\n",
"1248 Békéscsaba \n",
"1303 Łuck\n",
"389 #\n",
"1489 A B.\n",
"1239 A.\n",
"1397 Aachen\n",
"487 Abbazia\n",
"1280 Abbazia-Lovrana\n",
"313 Absam b. Innsbruck\n",
"1181 Abtenau\n",
"722 Achensee\n",
"1479 Adelsberg\n",
"340 Admont\n",
"354 Aeuckens\n",
"308 Aflenz\n",
"934 Afritz\n",
"982 Aggsbach\n",
"819 Aigen\n",
"462 Albendorf\n",
"57 Alexandria\n",
"1111 Alland\n",
"786 Alland II\n",
"947 Allensteig\n",
"933 Allentsteig\n",
"765 Allerheiligen\n",
"396 Alsfeld\n",
"730 Alt Aussee\n",
"736 Alt Lengbach\n",
"1348 Altaussee\n",
"955 Altenberg\n",
"... ...\n",
"1177 Würflach\n",
"829 Würnitz\n",
"650 Würzburg\n",
"674 Ybbs\n",
"651 Ypres\n",
"802 Ypser\n",
"1487 Ysper\n",
"1144 Zakopane\n",
"1249 Zantan\n",
"652 Zara\n",
"879 Zbiroh\n",
"1028 Zell a. See\n",
"45 Zell am See\n",
"1074 Zistersdorf\n",
"37 Zittau\n",
"1368 Zlabing\n",
"12 Znaim\n",
"1369 Zuckmantel\n",
"560 Zurigo\n",
"654 Zurzach\n",
"431 Zwettl\n",
"401 Zwiesel\n",
"33 Zürich\n",
"1521 gr\n",
"769 spitz\n",
"861 w\n",
"417 Č. Krumlov\n",
"1304 Łuck\n",
"893 Šibenik\n",
"0 NaN\n",
"\n",
"[1545 rows x 1 columns]"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pp.sort_values(0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Something odd is going on with 'Békéscsaba': it sorts before everything else, even before '#'. What is wrong?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's extract that row:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Békéscsaba \n",
"Name: 1248, dtype: object"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pp.iloc[1248]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"More specifically, the value in column `0`:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"' Békéscsaba '"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pp.iloc[1248][0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There is a space in front of the 'B' (and another after the final 'a'). A space sorts before any letter, which is why the entry ends up at the top."
]
}
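,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible fix, sketched on made-up data: strip the surrounding whitespace with `str.strip()` before sorting. `toy` is a hypothetical stand-in for `pp`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame([' Békéscsaba ', 'Aachen', 'Zwettl'])\n",
"\n",
"# str.strip() removes leading and trailing whitespace from every string\n",
"toy[0] = toy[0].str.strip()\n",
"\n",
"# With the spaces gone, the entry sorts under 'B' instead of at the top\n",
"toy.sort_values(0)"
]
}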
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}