# AKON Metadata - Data Overview

*Get a first impression of the postcard metadata*

## Setup

Using the [Pandas Python Data Analysis Library](https://pandas.pydata.org/).

For an introduction to pandas, feel free to take a look at this [Workshop for CBioVikings](https://github.com/dblyon/PandasIntro) by David Lyon.

In [3]:
import pandas as pd

## Load Data

`df` stands for *Data Frame*

In [4]:
df = pd.read_csv('https://labs.onb.ac.at/gitlab/labs-team/raw-metadata/raw/master/akon_postcards_public_domain.csv.bz2', compression='bz2')

## View Data

### Rough Overview

How many records (rows) are in there?

In [5]:
len(df)

34846
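`len` only reports the number of rows. As a small sketch on a made-up frame (the name `toy` and its values are illustrative, not taken from the postcard data), `shape` returns rows and columns together:

```python
import pandas as pd

# Tiny made-up frame standing in for the postcard metadata.
toy = pd.DataFrame({
    "city": ["Kiel", "Montreux", "Baden"],
    "color": [False, True, True],
})

# shape is a (rows, columns) tuple; len(toy) equals the first element.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)
```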

What does a single record look like?
Show me the first one!

In [6]:
df.head(1)

Unnamed: 0.1,Unnamed: 0,akon_id,id,altitude,building,city,color,comment,mountain,other,...,geoname_id,latitude,longitude,name,country_id,admin_name_1,admin_code_1,geo,download_link,download_link_256x256
0,0,AK111_021,74682,,,"Kiel, Blücherplatz",False,1921 gel,,,...,2891122.0,54.32133,10.13489,Kiel,DE,,,"54.32133, 10.13489",https://iiif.onb.ac.at/images/AKON/AK111_021/0...,https://iiif.onb.ac.at/images/AKON/AK111_021/0...


Some columns are missing from the output: pandas abbreviates wide frames and prints `...` in place of the hidden columns. Let's fix that by raising the column display limit:

In [7]:
pd.set_option('display.max_columns', 100)

Let's try again:

In [8]:
df.head(1)

Unnamed: 0.1,Unnamed: 0,akon_id,id,altitude,building,city,color,comment,mountain,other,photographer,publisher,publisher_place,region,water_body,year,inventory_number,signature,revision_date,date,feature_class,feature_code,geoname_id,latitude,longitude,name,country_id,admin_name_1,admin_code_1,geo,download_link,download_link_256x256
0,0,AK111_021,74682,,,"Kiel, Blücherplatz",False,1921 gel,,,,,,,,,,"Geogr. Topogr. Bilder-Samml. 1943, 7735",2014-09-05 10:13:06.342,gelaufen 1921,P,PPLA,2891122.0,54.32133,10.13489,Kiel,DE,,,"54.32133, 10.13489",https://iiif.onb.ac.at/images/AKON/AK111_021/0...,https://iiif.onb.ac.at/images/AKON/AK111_021/0...


Now we see all columns.

What are all the columns called again?

In [9]:
df.columns

Index(['Unnamed: 0', 'akon_id', 'id', 'altitude', 'building', 'city', 'color',
       'comment', 'mountain', 'other', 'photographer', 'publisher',
       'publisher_place', 'region', 'water_body', 'year', 'inventory_number',
       'signature', 'revision_date', 'date', 'feature_class', 'feature_code',
       'geoname_id', 'latitude', 'longitude', 'name', 'country_id',
       'admin_name_1', 'admin_code_1', 'geo', 'download_link',
       'download_link_256x256'],
      dtype='object')
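A column called `Unnamed: 0` typically appears when a frame was saved to CSV together with its index and read back without declaring that index. Whether that is how this file was produced is an assumption. The sketch below reproduces the effect on a made-up two-row frame and shows that `index_col=0` avoids the extra column:

```python
import io
import pandas as pd

# Made-up frame; the values only mimic the look of the real data.
toy = pd.DataFrame({
    "akon_id": ["AK111_021", "AK086_229"],
    "country_id": ["DE", "CH"],
})

# to_csv() writes the index as a first column with an empty header.
csv_text = toy.to_csv()

# Read back naively: the index column shows up as 'Unnamed: 0'.
with_extra = pd.read_csv(io.StringIO(csv_text))

# Declare the first column as the index: no extra column.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```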

### Show Random Entries

Show me 3 random entries:

In [10]:
df.sample(3)

Unnamed: 0.1,Unnamed: 0,akon_id,id,altitude,building,city,color,comment,mountain,other,photographer,publisher,publisher_place,region,water_body,year,inventory_number,signature,revision_date,date,feature_class,feature_code,geoname_id,latitude,longitude,name,country_id,admin_name_1,admin_code_1,geo,download_link,download_link_256x256
7880,7880,AK086_229,54365,,"Montraux-Palace, Belmont",Montreux,True,1911 gel,Alpes de la Savoie,,,Photoglob Co.,Zürich,,,,,,2014-08-27 15:44:51.079,gelaufen 1911,P,PPL,2659601.0,46.43301,6.91143,Montreux,CH,Waadt,VD,"46.43301, 6.91143",https://iiif.onb.ac.at/images/AKON/AK086_229/2...,https://iiif.onb.ac.at/images/AKON/AK086_229/2...
16014,16014,AK015_033,8502,,Schloss Waldhausen,,False,,,,,,,,,1910.0,,,2014-08-04 07:59:10.026,1910,P,PPL,2762012.0,48.27377,14.9475,Waldhausen im Strudengau,AT,,,"48.27377, 14.9475",https://iiif.onb.ac.at/images/AKON/AK015_033/0...,https://iiif.onb.ac.at/images/AKON/AK015_033/0...
26648,26648,AK053_218,31570,,,"Kalksburg, Breitenfurterstrasse",False,,,,,Janko,Kalksburg,,,1918.0,,,2014-08-04 07:59:10.386,1918,A,ADM4,2774904.0,48.13754,16.24599,Kalksburg,AT,,,"48.13754, 16.24599",https://iiif.onb.ac.at/images/AKON/AK053_218/2...,https://iiif.onb.ac.at/images/AKON/AK053_218/2...


Calling `sample` again yields different entries:

In [11]:
df.sample(3)

Unnamed: 0.1,Unnamed: 0,akon_id,id,altitude,building,city,color,comment,mountain,other,photographer,publisher,publisher_place,region,water_body,year,inventory_number,signature,revision_date,date,feature_class,feature_code,geoname_id,latitude,longitude,name,country_id,admin_name_1,admin_code_1,geo,download_link,download_link_256x256
32175,32175,AK068_223,41634,,,Baden,True,,,,,Bauer,Wien,,,1913.0,,Vues-Sammlung I. 7425,2014-08-13 14:19:10.145,1913,P,PPLA3,2782067.0,48.00543,16.23264,Baden bei Wien,AT,,,"48.00543, 16.23264",https://iiif.onb.ac.at/images/AKON/AK068_223/2...,https://iiif.onb.ac.at/images/AKON/AK068_223/2...
17757,17757,AK021_519,12608,,,Zell am See,False,v 1907,,,,Ledermann,Wien,,Zeller See,,,,2014-08-04 07:59:10.136,vor 1907,P,PPLA3,2760634.0,47.32556,12.79444,Zell am See,AT,,,"47.32556, 12.79444",https://iiif.onb.ac.at/images/AKON/AK021_519/5...,https://iiif.onb.ac.at/images/AKON/AK021_519/5...
2833,2833,AK121_438,81017,,Festenburg,Festenburg,False,,,,,Pelnitschar,Aspang,,,1920.0,,Nationalbibliothek Karten Abteilung 3062,2014-09-12 08:38:22.055,1920,S,CSTL,2779616.0,47.45,15.91667,Festenburg,AT,,,"47.45, 15.91667",https://iiif.onb.ac.at/images/AKON/AK121_438/4...,https://iiif.onb.ac.at/images/AKON/AK121_438/4...
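If the same "random" entries are needed on every run, for example to discuss them with someone else, `sample` accepts a `random_state` seed. A minimal sketch on a made-up frame (the seed value 42 is arbitrary):

```python
import pandas as pd

# Made-up frame with a handful of place names.
toy = pd.DataFrame({
    "city": ["Kiel", "Montreux", "Baden", "Zell am See", "Festenburg"],
})

# The same seed yields the same rows on every call.
first = toy.sample(3, random_state=42)
second = toy.sample(3, random_state=42)

print(first.equals(second))
```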


### Count Things

How many entries show things in Italy?

Let's use the `country_id` for this question:

In [12]:
df_in_italy = df[df['country_id'] == 'IT']

In [13]:
len(df_in_italy)

3221

How many postcards are in color?

In [14]:
df_in_color = df[df['color'] == True]

In [15]:
len(df_in_color)

7667

Can I do this in one line?

In [16]:
len(df[df['color'] == True])

7667
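An alternative way to count, sketched on made-up data: a boolean mask can be summed directly because `True` counts as 1, and `value_counts` tallies every distinct value in one go. The frame `toy` and its values are illustrative only.

```python
import pandas as pd

# Made-up frame with the two columns used in the questions above.
toy = pd.DataFrame({
    "color": [True, False, True, False, False],
    "country_id": ["IT", "AT", "IT", "DE", "AT"],
})

# Summing a boolean column counts its True values.
n_color = int(toy["color"].sum())

# The comparison yields a boolean mask, which can be summed the same way.
n_italy = int((toy["country_id"] == "IT").sum())

# Counts for all countries at once, most frequent first.
per_country = toy["country_id"].value_counts()

print(n_color, n_italy)
print(per_country)
```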

How many different publisher places are in the data set?

In [17]:
len(df['publisher_place'].unique())

1545
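Note that `unique()` keeps the missing value `NaN` as an entry of its own, so the count above includes it. A small sketch on a made-up series shows the difference to `nunique()`, which leaves missing values out by default:

```python
import numpy as np
import pandas as pd

# Made-up series with two missing values and one repeated place.
places = pd.Series([np.nan, "Wien", "Kierling", "Wien", np.nan])

# unique() returns NaN once, alongside the two distinct names.
n_with_nan = len(places.unique())

# nunique() ignores NaN unless dropna=False is passed.
n_without_nan = places.nunique()
n_explicit = places.nunique(dropna=False)

print(n_with_nan, n_without_nan, n_explicit)
```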

Show me some! Let's wrap it in a pandas DataFrame, step by step:

In [18]:
publisher_places = df['publisher_place'].unique()

In [19]:
publisher_places

array([nan, 'Wien', 'Kierling', ..., 'Königstein i. T.', 'Detmold',
       'Furth i. W.'], dtype=object)

In [20]:
pp = pd.DataFrame(publisher_places)

In [21]:
pp

Unnamed: 0,0
0,
1,Wien
2,Kierling
3,Kindberg
4,Kirchau
...,...
1540,Pisa
1541,Straßburg i./E.
1542,Königstein i. T.
1543,Detmold


Better. Now show me some randomly:

In [22]:
pp.sample(5)

Unnamed: 0,0
624,Braunau a. Inn
1360,Kratzau
1319,Nový Bydžov
592,Hyères
1060,Kapellen a. d. Mürz


### Sort Things

Just sort the sample, please:

In [23]:
pp.sample(10).sort_values(0)

Unnamed: 0,0
29,Arys
576,Buenos Aires
1372,Getzersdorf
148,Kochel
1119,Maria Trost
238,Mariazell
1096,Melk
273,Münden
1196,Stein
1047,Vitis


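Why the `0` in `sort_values(0)`? It is the name of the column to sort by: pandas labelled the single column `0` because `pp` was built from a plain array without column names.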
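A descriptive column name can make such calls easier to read. A minimal sketch, using a made-up three-element array in place of `publisher_places` and a hypothetical column name `publisher_place`:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the array returned by unique().
places = np.array(["Wien", "Arys", "Melk"], dtype=object)

# Without column names, pandas labels the column with the integer 0.
unnamed = pd.DataFrame(places)

# With an explicit name, sort_values can refer to it by that name.
named = pd.DataFrame(places, columns=["publisher_place"])
sorted_named = named.sort_values("publisher_place")

print(list(unnamed.columns))
print(sorted_named)
```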

Sort the whole thing:

In [24]:
pp.sort_values(0)

Unnamed: 0,0
1248,Békéscsaba
1303,Łuck
389,#
1489,A B.
1239,A.
...,...
861,w
417,Č. Krumlov
1304,Łuck
893,Šibenik


Something odd is going on with 'Békéscsaba': it lands at the very top of the list instead of among the other entries starting with 'B'. What is wrong?

Let's extract the datum:

In [25]:
pp.iloc[1248]

0     Békéscsaba 
Name: 1248, dtype: object

More specifically, the value in column `0`:

In [26]:
pp.iloc[1248][0]

' Békéscsaba '

There is a space in front of the 'B' (and another one after the final 'a'). A space sorts before every letter, which is why this entry ends up at the top.
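One possible way to deal with such stray whitespace is to strip it before sorting. A small sketch on made-up values, with `place` and `clean` as hypothetical column names; `Series.str.strip()` removes leading and trailing whitespace:

```python
import pandas as pd

# Made-up values; the first one carries a leading and a trailing space.
toy = pd.DataFrame({"place": [" Békéscsaba ", "Wien", "Arys", "Melk"]})

# Sorting the raw values puts the padded entry first.
raw_sorted = toy.sort_values("place")

# Strip the whitespace into a new column and sort by that instead.
toy["clean"] = toy["place"].str.strip()
clean_sorted = toy.sort_values("clean")

print(list(raw_sorted["place"]))
print(list(clean_sorted["clean"]))
```

Sorting is still done by Unicode code point, so names starting with characters such as 'Č' or 'Š' continue to sort after the plain ASCII letters.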