# AKON Metadata - Data Overview

*Get a first impression of the postcard metadata*

## Setup

This notebook uses the [Pandas Python Data Analysis Library](https://pandas.pydata.org/).

For an intro to pandas, feel free to take a look at this [Workshop for CBioVikings](https://github.com/dblyon/PandasIntro) by David Lyon.

In [1]:
import pandas as pd

## Load Data

`df` is short for *DataFrame*, the table-like data structure pandas works with.

In [2]:
df = pd.read_csv('akon_postcards_public_domain.csv.bz2')

## View Data

### Rough Overview

How many records are in the data set?

In [3]:
len(df)

28882

What does a single record look like?
Show me the first one!

In [4]:
df.head(1)

Unnamed: 0.1,Unnamed: 0,akon_id,id,altitude,building,city,color,comment,mountain,other,...,date,feature_class,feature_code,geoname_id,latitude,longitude,name,country_id,admin_name_1,admin_code_1
0,0,AK111_024,74685,,,Kierling,True,1908,,,...,gelaufen 1908,P,PPL,2774449.0,48.30997,16.27616,Kierling,AT,,


A few columns are missing from the output: the `...` in the middle marks where pandas has hidden them. Let's fix that by setting pandas output options:

In [5]:
pd.set_option('display.max_columns', 100)

Let's try again:

In [6]:
df.head(1)

Unnamed: 0.1,Unnamed: 0,akon_id,id,altitude,building,city,color,comment,mountain,other,photographer,publisher,publisher_place,region,water_body,year,inventory_number,signature,revision_date,date,feature_class,feature_code,geoname_id,latitude,longitude,name,country_id,admin_name_1,admin_code_1
0,0,AK111_024,74685,,,Kierling,True,1908,,,,,,,,,,"Geogr. Topogr. Bilder-Samml. 1944, 6380",2014-09-05 10:13:12.536,gelaufen 1908,P,PPL,2774449.0,48.30997,16.27616,Kierling,AT,,


Now we see all columns.
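`pd.set_option` changes the display setting for the rest of the session. If the wider output is only wanted for a single cell, `pd.option_context` can apply the setting temporarily. A minimal sketch (the variable names `before`, `inside` and `after` are only for illustration):

```python
import pandas as pd

before = pd.get_option('display.max_columns')

# Inside the block all columns are shown (None means "no limit").
with pd.option_context('display.max_columns', None):
    inside = pd.get_option('display.max_columns')

# After the block the previous setting is back in place.
after = pd.get_option('display.max_columns')
print(before, inside, after)
```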

What are all the columns called again?

In [7]:
df.columns

Index(['Unnamed: 0', 'akon_id', 'id', 'altitude', 'building', 'city', 'color',
       'comment', 'mountain', 'other', 'photographer', 'publisher',
       'publisher_place', 'region', 'water_body', 'year', 'inventory_number',
       'signature', 'revision_date', 'date', 'feature_class', 'feature_code',
       'geoname_id', 'latitude', 'longitude', 'name', 'country_id',
       'admin_name_1', 'admin_code_1'],
      dtype='object')
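The `Unnamed: 0` column is what pandas typically produces when a CSV was written together with its index and then read back without saying so. Whether that is what happened with this file is an assumption. If it is, passing `index_col=0` to `read_csv` would avoid the extra column. A small sketch with made-up inline CSV text instead of the real file:

```python
import io
import pandas as pd

# Tiny stand-in for a CSV saved with its index (note the empty first header field).
csv_text = ",akon_id,city\n0,AK111_024,Kierling\n1,AK114_189,Airolo\n"

# Default read: the unnamed first column shows up as 'Unnamed: 0'.
plain = pd.read_csv(io.StringIO(csv_text))
print(list(plain.columns))

# Treat the first column as the index instead of as data.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(indexed.columns))
```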

### Show Random Entries

Show me 3 random entries:

In [11]:
df.sample(3)

Unnamed: 0.1,Unnamed: 0,akon_id,id,altitude,building,city,color,comment,mountain,other,photographer,publisher,publisher_place,region,water_body,year,inventory_number,signature,revision_date,date,feature_class,feature_code,geoname_id,latitude,longitude,name,country_id,admin_name_1,admin_code_1
458,458,AK114_189,76614,,,Airolo,True,1909 gel,,,,,,,,,,"Geogr. Topogr. Bilder-Samml. 1944, 27",2014-09-09 08:50:37.905,gelaufen 1909,P,PPL,2661830.0,46.52847,8.60881,Airolo,CH,,
24510,24510,AK075_467,46922,,,Klausen,False,,,,,,,,,1913.0,,,2014-08-21 10:05:32.779,1913,P,PPLA3,3178764.0,46.64001,11.56573,Klausen,IT,Südtirol,17.0
18564,18564,AK082_546,51980,,,Velsen,False,v 1905,,,,,,,,,,,2014-08-25 17:46:23.417,vor 1905,P,PPL,2745673.0,52.46,4.65,Velsen,NL,Nord-Holland,7.0


Calling `sample` again yields different entries:

In [12]:
df.sample(3)

Unnamed: 0.1,Unnamed: 0,akon_id,id,altitude,building,city,color,comment,mountain,other,photographer,publisher,publisher_place,region,water_body,year,inventory_number,signature,revision_date,date,feature_class,feature_code,geoname_id,latitude,longitude,name,country_id,admin_name_1,admin_code_1
27658,27658,AK097_351,62277,,Station Brenner,Steinach am Brenner,False,1905 gel,,,,,,,,,,,2014-09-02 11:15:45.988,vor 1905,P,PPLA3,2764557.0,47.08333,11.46667,Steinach am Brenner,AT,,
14993,14993,AK021_515,12604,,,"Gross-Pöchlarn, Klein-Pöchlarn",False,,,,,Ledermann,Wien,,,1917.0,,,2014-08-04 07:59:10.136,1917,P,PPLA3,2768627.0,48.2,15.2,Pöchlarn,AT,,
15934,15934,AK025_596,15102,,Pfarrkirche,Mondsee,False,,,,,,,,,1914.0,,,2014-08-04 07:59:10.187,1914,P,PPLA3,2771277.0,47.85648,13.34908,Mondsee,AT,,
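If the same random rows should come back on every run, for example to discuss them with someone else, `sample` accepts a `random_state` seed. A minimal sketch on a made-up five-row frame (`demo` is not part of the AKON data):

```python
import pandas as pd

# Hypothetical miniature table, just to demonstrate the seed.
demo = pd.DataFrame({'city': ['Kierling', 'Airolo', 'Klausen', 'Velsen', 'Mondsee']})

# The same seed yields the same rows each time.
first = demo.sample(3, random_state=42)
second = demo.sample(3, random_state=42)
print(first.equals(second))
```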


### Count Things

How many entries show places in Italy?

Let's use the `country_id` column for this question:

In [9]:
df_in_italy = df[df['country_id'] == 'IT']

In [10]:
len(df_in_italy)

2983
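Filtering answers the question for one country at a time. To count the entries for every country in one go, `value_counts()` is an option. A sketch on a made-up column (the values in `demo` are invented, not taken from the data set):

```python
import pandas as pd

# Hypothetical country codes standing in for df['country_id'].
demo = pd.DataFrame({'country_id': ['AT', 'IT', 'AT', 'CH', 'IT', 'AT']})

# One count per distinct value, most frequent first.
counts = demo['country_id'].value_counts()
print(counts)
```

On the real data this would be `df['country_id'].value_counts()`.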

How many postcards are in color?

In [13]:
df_in_color = df[df['color'] == True]

In [14]:
len(df_in_color)

7075

Can I do this in one line?

In [15]:
len(df[df['color'] == True])

7075
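An even shorter variant is possible because `True` counts as 1 when summed. The sketch below assumes the column holds clean booleans without missing values; that has not been checked for the real `color` column.

```python
import pandas as pd

# Hypothetical boolean column standing in for df['color'].
demo = pd.DataFrame({'color': [True, False, True, False, False]})

# Summing booleans counts the True values.
n_by_sum = int(demo['color'].sum())

# Equivalent to filtering and taking the length.
n_by_filter = len(demo[demo['color']])
print(n_by_sum, n_by_filter)
```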

How many different publisher places are in the data set?

In [17]:
len(df['publisher_place'].unique())

1324
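One caveat: `unique()` keeps a missing value (`NaN`) as an entry of its own, so the length of its result can be one higher than the number of actual place names. `nunique()` leaves missing values out by default. A small sketch on an invented Series:

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for df['publisher_place'].
places = pd.Series(['Wien', np.nan, 'München', 'Wien', np.nan])

with_nan = len(places.unique())       # counts NaN as one value
without_nan = places.nunique()        # ignores NaN by default
print(with_nan, without_nan)
```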

Show me some!

In [21]:
df['publisher_place'].unique().sample(10)

AttributeError: 'numpy.ndarray' object has no attribute 'sample'

Oh, that doesn't work: `unique()` returns a NumPy array, and arrays have no `sample` method. Let's wrap it in a pandas DataFrame, step by step:

In [22]:
publisher_places = df['publisher_place'].unique()

In [23]:
publisher_places

array([nan, 'Kierling', 'Kindberg', ..., 'Straßburg i./E.', 'Detmold',
       'Furth i. W.'], dtype=object)

In [25]:
pp = pd.DataFrame(publisher_places)

In [26]:
pp

Unnamed: 0,0
0,
1,Kierling
2,Kindberg
3,Kirchau
4,Wien
5,Kirchhain
6,München
7,Kitzbühel
8,Klagenfurt
9,Grein a/D.


Better. Now show me some randomly:

In [33]:
pp.sample(5)

Unnamed: 0,0
830,Friedberg
993,Kreisbach
743,Neustadt a. d. D.
70,Halberstadt
344,Kastelruth


### Sort Things

Just sort the sample, please:

In [34]:
pp.sample(10).sort_values(0)

Unnamed: 0,0
953,Arnsdorf
1293,Bad Neuhaus
309,Bad Pyrmont
545,Bellegarde
1300,Frakfurt a. Oder
301,Ischl
3,Kirchau
984,Mönichkirchen
744,Ramsau
391,Roustchouk


Why the `0` in `sort_values(0)`? That's the name of the column to sort by. The DataFrame was built from a plain array, so pandas named its only column `0`.

Sort the whole thing:

In [35]:
pp.sort_values(0)

Unnamed: 0,0
1053,Békéscsaba
1102,Łuck
359,#
1268,A B.
1044,A.
1194,Aachen
434,Abbazia
1083,Abbazia-Lovrana
291,Absam b. Innsbruck
997,Abtenau


Something odd is going on with 'Békéscsaba': it sorts to the very top, ahead of the entries starting with 'A'. What is wrong?

Let's pull out that entry:

In [30]:
pp.iloc[1053]

0     Békéscsaba 
Name: 1053, dtype: object

More specifically, the value in column `0`:

In [32]:
pp.iloc[1053][0]

' Békéscsaba '

There is a space in front of the 'B' (and another one at the end). A space sorts before letters and before '#', which is why this entry lands at the top.
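One possible fix is to strip surrounding whitespace before sorting, using the `.str.strip()` string accessor. A sketch on a three-row stand-in for `pp` (the frame `pp_demo` is made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for pp, with the single column named 0.
pp_demo = pd.DataFrame({0: [' Békéscsaba ', 'Aachen', 'Abbazia']})

# With the stray spaces, the entry sorts to the top.
before = pp_demo.sort_values(0)[0].tolist()

# Remove leading and trailing whitespace, then sort again.
pp_demo[0] = pp_demo[0].str.strip()
after = pp_demo.sort_values(0)[0].tolist()
print(before)
print(after)
```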