# AKON Metadata - Data Overview

*Get a first impression of the postcard metadata*

## Setup

This notebook uses the [pandas Python Data Analysis Library](https://pandas.pydata.org/).

For an intro to pandas, feel free to take a look at this [Workshop for CBioVikings](https://github.com/dblyon/PandasIntro) by David Lyon.

In [1]:
import pandas as pd

## Load Data

`df` stands for *DataFrame*, the tabular data structure pandas works with.

In [2]:
df = pd.read_csv('https://labs.onb.ac.at/gitlab/labs-team/raw-metadata/raw/master/akon_postcards_public_domain.csv.bz2', compression='bz2')


## View Data

### Rough Overview

How many records are in there?

In [3]:
len(df)

34846

What does a single record look like?
Show me the first one!

In [4]:
df.head(1)

Unnamed: 0.1,Unnamed: 0,akon_id,id,altitude,building,city,color,comment,mountain,other,...,geoname_id,latitude,longitude,name,country_id,admin_name_1,admin_code_1,geo,download_link,download_link_256x256
0,0,AK111_021,74682,,,"Kiel, Blücherplatz",False,1921 gel,,,...,2891122.0,54.32133,10.13489,Kiel,DE,,,"54.32133, 10.13489",https://iiif.onb.ac.at/images/AKON/AK111_021/0...,https://iiif.onb.ac.at/images/AKON/AK111_021/0...


A few columns are missing from the output; pandas abbreviates them as `...`. Let's fix that by raising the pandas display limit:

In [5]:
pd.set_option('display.max_columns', 100)
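`pd.set_option` changes the setting for the rest of the session. If the wider display is only wanted for a single output, `pd.option_context` can apply it temporarily. A minimal sketch, using a made-up 30-column frame called `wide` in place of the postcard data:

```python
import pandas as pd

# Toy frame with one row and 30 columns (a stand-in, not the AKON data).
wide = pd.DataFrame([list(range(30))])

before = pd.get_option('display.max_columns')

# Inside the block the limit is 100; afterwards the old value is restored.
with pd.option_context('display.max_columns', 100):
    inside = pd.get_option('display.max_columns')
    print(wide)

after = pd.get_option('display.max_columns')
print(inside, after == before)
```

Which of the two fits better depends on whether every later cell should show all columns or only this one.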

Let's try again:

In [6]:
df.head(1)

Unnamed: 0.1,Unnamed: 0,akon_id,id,altitude,building,city,color,comment,mountain,other,photographer,publisher,publisher_place,region,water_body,year,inventory_number,signature,revision_date,date,feature_class,feature_code,geoname_id,latitude,longitude,name,country_id,admin_name_1,admin_code_1,geo,download_link,download_link_256x256
0,0,AK111_021,74682,,,"Kiel, Blücherplatz",False,1921 gel,,,,,,,,,,"Geogr. Topogr. Bilder-Samml. 1943, 7735",2014-09-05 10:13:06.342,gelaufen 1921,P,PPLA,2891122.0,54.32133,10.13489,Kiel,DE,,,"54.32133, 10.13489",https://iiif.onb.ac.at/images/AKON/AK111_021/0...,https://iiif.onb.ac.at/images/AKON/AK111_021/0...


Now we see all columns.

What are all the columns called again?

In [7]:
df.columns

Index(['Unnamed: 0', 'akon_id', 'id', 'altitude', 'building', 'city', 'color',
       'comment', 'mountain', 'other', 'photographer', 'publisher',
       'publisher_place', 'region', 'water_body', 'year', 'inventory_number',
       'signature', 'revision_date', 'date', 'feature_class', 'feature_code',
       'geoname_id', 'latitude', 'longitude', 'name', 'country_id',
       'admin_name_1', 'admin_code_1', 'geo', 'download_link',
       'download_link_256x256'],
      dtype='object')
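The `Unnamed: 0` column is most likely the row index that was written out when the CSV was exported; pandas assigns that name to a column whose header field is empty. A minimal sketch, assuming a tiny made-up CSV instead of the real file, of how `index_col=0` would read such a column as the index instead:

```python
import io

import pandas as pd

# Made-up miniature CSV; the empty first header field mimics an exported index.
csv_text = ",akon_id,city\n0,AK111_021,Kiel\n1,AK066_086,Innsbruck\n"

# Default: the header-less column shows up as 'Unnamed: 0'.
plain = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column becomes the row index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.columns.tolist())
print(indexed.columns.tolist())
```

Whether this applies to the real file is an assumption; dropping the column after loading with `df.drop(columns='Unnamed: 0')` would be an alternative.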

### Show Random Entries

Show me 3 random entries:

In [8]:
df.sample(3)

Unnamed: 0.1,Unnamed: 0,akon_id,id,altitude,building,city,color,comment,mountain,other,photographer,publisher,publisher_place,region,water_body,year,inventory_number,signature,revision_date,date,feature_class,feature_code,geoname_id,latitude,longitude,name,country_id,admin_name_1,admin_code_1,geo,download_link,download_link_256x256
28908,28908,AK066_086,40120,,,"Innsbruck, Maria Theresienstrasse",False,1907 gel,,,,Gratl,Innsbruck,,,,,,2014-08-04 07:59:10.424,vor 1907,P,PPLA,2775220.0,47.26266,11.39454,Innsbruck,AT,,,"47.26266, 11.39454",https://iiif.onb.ac.at/images/AKON/AK066_086/0...,https://iiif.onb.ac.at/images/AKON/AK066_086/0...
21317,21317,AK034_386,20303,251.0,,Gars-Thunau am Kamp,True,,,,,Ledermann,Wien,,,1909.0,,,2014-08-04 07:59:10.272,1909,P,PPL,2763660.0,48.58333,15.65,Thunau am Kamp,AT,,,"48.58333, 15.65",https://iiif.onb.ac.at/images/AKON/AK034_386/3...,https://iiif.onb.ac.at/images/AKON/AK034_386/3...
23201,23201,AK041_572,24699,251.0,,Gars-Thunau am Kamp,False,,,,,Ledermann,Wien,,,1908.0,,,2014-08-04 07:59:10.328,1908,P,PPL,2763660.0,48.58333,15.65,Thunau am Kamp,AT,,,"48.58333, 15.65",https://iiif.onb.ac.at/images/AKON/AK041_572/5...,https://iiif.onb.ac.at/images/AKON/AK041_572/5...


Calling `sample` again yields different entries:

In [9]:
df.sample(3)

Unnamed: 0.1,Unnamed: 0,akon_id,id,altitude,building,city,color,comment,mountain,other,photographer,publisher,publisher_place,region,water_body,year,inventory_number,signature,revision_date,date,feature_class,feature_code,geoname_id,latitude,longitude,name,country_id,admin_name_1,admin_code_1,geo,download_link,download_link_256x256
18810,18810,AK025_111,14618,,,Bruck an der Mur,True,,Mugel,,,Ledermann,Wien,,,1916.0,,,2014-10-15 12:03:01.028,1916,P,PPLA3,2781371.0,47.41667,15.28333,Bruck an der Mur,AT,,,"47.41667, 15.28333",https://iiif.onb.ac.at/images/AKON/AK025_111/1...,https://iiif.onb.ac.at/images/AKON/AK025_111/1...
28146,28146,AK061_165,36541,,,Orosháza,True,,,,,Vágner,Orosháza,,,1917.0,,Kartensammlung 79/66 G,2015-08-25 15:28:56.547,1917,P,PPL,716736.0,46.56667,20.66667,Oroshaza,HU,Bekes County,3.0,"46.56667, 20.66667",https://iiif.onb.ac.at/images/AKON/AK061_165/1...,https://iiif.onb.ac.at/images/AKON/AK061_165/1...
8335,8335,AK088_563,56103,,,Bad Reichenhall,False,1907 gel,,,,,,,,,,Geogrphisch-topographische Bildersammlung 1076/43,2014-08-28 16:20:02.029,vor 1907,P,PPLA3,2953371.0,47.72947,12.87819,Bad Reichenhall,DE,,,"47.72947, 12.87819",https://iiif.onb.ac.at/images/AKON/AK088_563/5...,https://iiif.onb.ac.at/images/AKON/AK088_563/5...
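If the same "random" rows should come back on every run, for example to discuss them with someone else, `sample` accepts a `random_state` seed. A small sketch on a made-up frame (`toy` and the seed value 42 are arbitrary choices for illustration):

```python
import pandas as pd

# Toy stand-in for the postcard data.
toy = pd.DataFrame({
    'city': ['Kiel', 'Innsbruck', 'Orosháza', 'Bad Reichenhall', 'Bruck an der Mur'],
})

# The same seed yields the same rows each time.
first = toy.sample(3, random_state=42)
second = toy.sample(3, random_state=42)

print(first.equals(second))  # True
```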


### Count Things

How many entries show places in Italy?

Let's use the `country_id` for this question:

In [10]:
df_in_italy = df[df['country_id'] == 'IT']

In [11]:
len(df_in_italy)

3221

How many postcards are in color?

In [12]:
df_in_color = df[df['color'] == True]

In [13]:
len(df_in_color)

7667

Can I do this in one line?

In [14]:
len(df[df['color'] == True])

7667
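Counting can also be done without building a filtered DataFrame first: a comparison yields a boolean Series, and summing it counts the `True` values. A minimal sketch on made-up data, assuming `color` is a proper boolean column without missing values:

```python
import pandas as pd

# Toy stand-in for the postcard data.
toy = pd.DataFrame({
    'country_id': ['IT', 'AT', 'IT', 'DE'],
    'color': [True, False, False, True],
})

# True counts as 1 and False as 0, so sum() counts the matches.
n_italy = (toy['country_id'] == 'IT').sum()
n_color = toy['color'].sum()

# value_counts() gives the count for every country at once.
per_country = toy['country_id'].value_counts()

print(n_italy, n_color)
print(per_country)
```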

How many different publisher places are in the data set?

In [15]:
len(df['publisher_place'].unique())

1545
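One caveat: `unique()` keeps the missing value `nan` as an entry of its own, so a count taken this way includes it if the column has gaps. `nunique()` leaves missing values out by default. A small sketch with a made-up Series named `places`:

```python
import numpy as np
import pandas as pd

# Toy column with two missing values and one duplicate.
places = pd.Series([np.nan, 'Wien', 'Kierling', 'Wien', np.nan])

with_nan = len(places.unique())   # counts nan as one distinct value
without_nan = places.nunique()    # ignores nan (dropna=True is the default)

print(with_nan, without_nan)  # 3 2
```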

Show me some!

In [16]:
df['publisher_place'].unique().sample(10)

AttributeError: 'numpy.ndarray' object has no attribute 'sample'

Oh, that doesn't work. Let's wrap it in a pandas DataFrame, step by step:

In [17]:
publisher_places = df['publisher_place'].unique()

In [18]:
publisher_places

array([nan, 'Wien', 'Kierling', ..., 'Königstein i. T.', 'Detmold',
       'Furth i. W.'], dtype=object)

In [19]:
pp = pd.DataFrame(publisher_places)

In [20]:
pp

Unnamed: 0,0
0,
1,Wien
2,Kierling
3,Kindberg
4,Kirchau
5,Kirchhain
6,München
7,Kitzbühel
8,Innsbruck
9,Klagenfurt


Better. Now show me some randomly:

In [21]:
pp.sample(5)

Unnamed: 0,0
1007,Wörschach
1494,Raibl
339,Imst
879,Zbiroh
457,Bad Sachsa
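A shorter route to the same result is to stay inside pandas: `drop_duplicates()` returns a Series rather than a NumPy array, so `sample` is available directly. A sketch on made-up data (the `dropna()` step is an optional addition that removes the missing value):

```python
import pandas as pd

# Toy stand-in with duplicates and one missing value.
toy = pd.DataFrame({
    'publisher_place': ['Wien', 'Imst', 'Wien', 'Raibl', None, 'Imst'],
})

# Still a pandas Series, so .sample() works without any wrapping.
distinct = toy['publisher_place'].dropna().drop_duplicates()
picked = distinct.sample(2)

print(distinct.tolist())
print(picked.tolist())
```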


### Sort Things

Just sort the sample, please:

In [22]:
pp.sample(10).sort_values(0)

Unnamed: 0,0
599,Aue
938,Chocěn
314,Ernstbrunn
739,Hall Tirol
788,Hardegg
3,Kindberg
19,Meissen
725,Neuchatel
1211,Sommerein
1302,Vorkloster bei Bregenz


Why the '0' in `sort_values(0)`? That's the name of the column to sort by: the DataFrame was built from a bare array, so pandas labeled its only column with the integer `0`.
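Giving the column a real name when the DataFrame is created makes the sort call easier to read. A minimal sketch; the list and the names `pp_named` and `publisher_place` are stand-ins chosen for illustration:

```python
import pandas as pd

# Toy stand-in for the array returned by unique().
publisher_places = ['Wien', 'Kierling', 'Kindberg']

# Name the single column instead of accepting the default label 0.
pp_named = pd.DataFrame(publisher_places, columns=['publisher_place'])
sorted_pp = pp_named.sort_values('publisher_place')

print(sorted_pp)
```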

Sort the whole thing:

In [23]:
pp.sort_values(0)

Unnamed: 0,0
1248,Békéscsaba
1303,Łuck
389,#
1489,A B.
1239,A.
1397,Aachen
487,Abbazia
1280,Abbazia-Lovrana
313,Absam b. Innsbruck
1181,Abtenau


Something odd is going on with 'Békéscsaba': it sorts before 'A'. What is wrong?

Let's extract the datum:

In [24]:
pp.iloc[1248]

0     Békéscsaba 
Name: 1248, dtype: object

More specifically the column '0':

In [25]:
pp.iloc[1248, 0]

' Békéscsaba '

There's a space in front of the 'B' (and another one after the final 'a'). The space character sorts before every letter, which is why this entry ends up at the top.
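One way to deal with such entries is to strip the surrounding whitespace before sorting, using the `.str.strip()` string method. A minimal sketch on a three-row made-up frame (`pp_toy` is a stand-in for `pp`):

```python
import pandas as pd

# Toy stand-in: one padded entry among clean ones.
pp_toy = pd.DataFrame({0: [' Békéscsaba ', 'Aachen', 'Abtenau']})

# Before cleaning, the padded entry sorts to the top.
uncleaned = pp_toy.sort_values(0)[0].tolist()

# Remove leading and trailing whitespace, then sort again.
pp_toy[0] = pp_toy[0].str.strip()
cleaned = pp_toy.sort_values(0)[0].tolist()

print(uncleaned)
print(cleaned)
```

Note that the sort is still by character code rather than by any language's alphabet, so names starting with letters such as 'Ł' would end up after 'Z' even once they are stripped.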