3.2 - Images - Download pre-downsized images for machine learning

I want to download a bunch of small images, already scaled down for my CNN

https://labs.onb.ac.at/en/dataset/akon/

https://labs.onb.ac.at/gitlab/labs-team/raw-metadata/raw/master/akon_postcards_public_domain.csv.bz2

https://github.com/h2non/jsonpath-ng

Let's say you've got a bunch of old-timey scenery photographs and you want to pick out all the images containing mountains. Why not? And, because you can, you want an AI to do all the dirty work for you.

What does that have to do with this workshop?

You can use the historic postcards from the ONB Labs as training data for your AI.

Disclaimer: The AI part is beyond the scope of this notebook and would blow up the size of the venv considerably.

If you want instructions on actually performing the training, take a look at

One way to do it: Download a VGG16 network that's pre-trained on ImageNet, remove the last layer (the actual classifier), add your own output layer with 2 outputs ('mountain', 'no mountain') and train that one.

Now back to the show.

What do we have to do?

  • Download Metadata
    • List of all available postcards
    • Info about the 'mountain-ness' of postcards
  • Create Download Links
    • To fetch all images
  • Split Into Two Sets
    • Mountain and non-mountain
  • Download Images

Download Metadata

Download the metadata set from the ONB Labs

In [1]:
import pandas as pd

# Let pandas show all available columns
pd.set_option('display.max_columns', 50)
# Pandas can read data directly from web links, even compressed files
meta = pd.read_csv('https://labs.onb.ac.at/gitlab/labs-team/' \
                   'raw-metadata/raw/master/akon_postcards_public_domain.csv.bz2', compression='bz2')
/home/oida/labs/pydays19/venv/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3049: DtypeWarning: Columns (13) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [2]:
meta.sample(6)
Out[2]:
Unnamed: 0 akon_id id altitude building city color comment mountain other photographer publisher publisher_place region water_body year inventory_number signature revision_date date feature_class feature_code geoname_id latitude longitude name country_id admin_name_1 admin_code_1 geo
6683 6683 AK121_352 80931 NaN Zwinger Dresden False v. 1907 NaN NaN NaN NaN NaN NaN NaN NaN NaN Geogr. Topogr. Bilder-Samml. 1943, 7402 2014-08-25 13:52:35.479 vor 1907 P PPLA 2935022.0 51.05089 13.73832 Dresden DE NaN NaN 51.05089, 13.73832
1060 1060 AK074_287 45904 NaN NaN Solingen True 1908 gel NaN Kaiser Wilhelm-Brücke NaN NaN NaN NaN NaN NaN NaN NaN 2014-08-19 15:22:42.160 gelaufen 1908 P PPLA3 2831580.0 51.17343 7.08450 Solingen DE Nordrhein-Westfalen 07 51.17343, 7.0845
34225 34225 AK087_169 54994 NaN NaN Venezia, Piazza S. Marco False 1925 gel NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2014-08-25 09:26:12.544 gelaufen 1925 P PPLA 3164603.0 45.43713 12.33265 Venecia IT NaN NaN 45.43713, 12.33265
20250 20250 AK030_367 17883 NaN NaN Vorder Stoder False NaN Todtengebirge, Spitzmauer, Kleiner Priel, Groß... NaN NaN Ledermann Wien NaN NaN 1909.0 NaN NaN 2014-08-04 07:59:10.235 1909 P PPL 2762185.0 47.71337 14.22712 Vorderstoder AT NaN NaN 47.71337, 14.22712
19981 19981 AK029_173 17088 NaN NaN Pöggstall False 1903 gel NaN NaN NaN Hofmeister Pöggstall NaN NaN NaN NaN NaN 2014-08-04 07:59:10.223 gelaufen 1903 P PPLA3 2768616.0 48.31667 15.18333 Pöggstall AT NaN NaN 48.31667, 15.18333
30492 30492 AK088_055 55510 NaN NaN Neutitschein, Obertorstrasse False 1920 gel NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2014-08-28 13:39:02.860 gelaufen 1920 P PPL 3069305.0 49.59438 18.01028 Neutitschein CZ NaN NaN 49.59438, 18.01028

Ok, we have metadata. And look, there's a column mountain:

In [3]:
meta.sample(5)[['akon_id', 'mountain']]
Out[3]:
akon_id mountain
33148 AK076_442 Watzmann, Hochkalter
29503 AK070_327 NaN
5663 AK075_470 NaN
31604 AK107_561 NaN
8748 AK091_235 NaN

Later, we'll split the dataset in two using the data in this column.

The SACHA project provides an API for accessing digitized objects of the National Library via IIIF. The online documentation for the API is here: https://iiif.onb.ac.at/api.

We're especially interested in its ability to serve manifests: https://iiif.onb.ac.at/api#_manifestrequestprocessor:

GET /presentation/{projectName}/{id}/manifest

The projectName is AKON ('AnsichtsKarten ONline'), the id is the akon_id.

See also https://iiif.onb.ac.at/api#_digitization_projects.

In [4]:
def akon_id_to_manifest_link(akon_id):
    return f'https://iiif.onb.ac.at/presentation/AKON/{akon_id}/manifest'
In [5]:
akon_id_to_manifest_link('AK024_176')
Out[5]:
'https://iiif.onb.ac.at/presentation/AKON/AK024_176/manifest'

Let's test the link

In [6]:
import requests

r = requests.get(akon_id_to_manifest_link('AK024_176'))
r.json()
Out[6]:
{'@context': 'https://iiif.io/api/presentation/2/context.json',
 '@id': 'https://iiif.onb.ac.at/presentation/AKON/AK024_176/manifest',
 '@type': 'sc:Manifest',
 'label': 'Wien, III',
 'metadata': [{'label': [{'@value': 'Id', '@language': 'en'},
    {'@value': 'Id', '@language': 'ger'}],
   'value': 'AK024_176'},
  {'label': [{'@value': 'Title', '@language': 'en'},
    {'@value': 'Titel', '@language': 'ger'}],
   'value': 'Wien, III'},
  {'label': [{'@value': 'Place', '@language': 'en'},
    {'@value': 'Ort', '@language': 'ger'}],
   'value': "<a href='https://sws.geonames.org/2773040'>Landstraße</a>"},
  {'label': [{'@value': 'Publisher', '@language': 'en'},
    {'@value': 'Verlag', '@language': 'ger'}],
   'value': 'Ledermann'},
  {'label': [{'@value': 'Place of Publications', '@language': 'en'},
    {'@value': 'Erscheinungsort', '@language': 'ger'}],
   'value': 'Wien'},
  {'label': [{'@value': 'Year', '@language': 'en'},
    {'@value': 'Jahr', '@language': 'ger'}],
   'value': '1906'},
  {'label': [{'@value': 'Disseminator', '@language': 'en'},
    {'@value': 'Anbieter', '@language': 'ger'}],
   'value': "<a href='https://akon.onb.ac.at/'>Ansichtskarten Online</a>"},
  {'label': [{'@value': 'Physical Location', '@language': 'en'},
    {'@value': 'Standort', '@language': 'ger'}],
   'value': 'ÖNB'}],
 'description': 'Russische Kirche',
 'viewingDirection': 'left-to-right',
 'viewingHint': 'paged',
 'license': 'http://creativecommons.org/publicdomain/mark/1.0/',
 'attribution': [{'@value': 'Austrian National Library', '@language': 'en'},
  {'@value': 'Österreichische Nationalbibliothek', '@language': 'ger'}],
 'logo': 'https://iiif.onb.ac.at/logo/',
 'seeAlso': [{'@id': 'http://data.onb.ac.at/AKON/AK024_176',
   'format': 'text/html'},
  {'@id': 'http://data.onb.ac.at/AKON/AK024_176.rdf',
   'format': 'application/rdf+xml'}],
 'sequences': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
   '@id': 'https://iiif.onb.ac.at/presentation/AKON/AK024_176/sequence/normal',
   '@type': 'sc:Sequence',
   'startCanvas': 'https://iiif.onb.ac.at/presentation/AKON/AK024_176/canvas/176',
   'canvases': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
     '@id': 'https://iiif.onb.ac.at/presentation/AKON/AK024_176/canvas/176',
     '@type': 'sc:Canvas',
     'label': 'Wien, III',
     'height': 1681,
     'width': 1082,
     'images': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
       '@id': 'https://iiif.onb.ac.at/presentation/AKON/AK024_176/annotation/176',
       '@type': 'oa:Annotation',
       'motivation': 'sc:painting',
       'resource': {'@id': 'https://iiif.onb.ac.at/images/AKON/AK024_176/176/full/full/0/native.jpg',
        '@type': 'dctypes:Image',
        'height': 1681,
        'width': 1082,
        'format': 'image/jpeg',
        'service': {'@context': 'https://iiif.io/api/image/2/context.json',
         '@id': 'https://iiif.onb.ac.at/images/AKON/AK024_176/176',
         'profile': 'https://iiif.io/api/image/2/level2.json'}},
       'on': 'https://iiif.onb.ac.at/presentation/AKON/AK024_176/canvas/176'}]}]}]}

The manifest link seems to work. Let's add manifest links for all postcards to the dataframe:

In [7]:
meta['manifest_link'] = meta['akon_id'].apply(akon_id_to_manifest_link)
In [8]:
meta.sample(6)[['akon_id', 'manifest_link']]
Out[8]:
akon_id manifest_link
32242 AK049_538 https://iiif.onb.ac.at/presentation/AKON/AK049...
10827 AK001_237 https://iiif.onb.ac.at/presentation/AKON/AK001...
14148 AK009_081 https://iiif.onb.ac.at/presentation/AKON/AK009...
8074 AK087_246 https://iiif.onb.ac.at/presentation/AKON/AK087...
33232 AK082_006 https://iiif.onb.ac.at/presentation/AKON/AK082...
22877 AK040_083 https://iiif.onb.ac.at/presentation/AKON/AK040...

Let's take a look at that manifest again:

In [9]:
r = requests.get(akon_id_to_manifest_link('AK024_176'))
r.json()
Out[9]:
{'@context': 'https://iiif.io/api/presentation/2/context.json',
 '@id': 'https://iiif.onb.ac.at/presentation/AKON/AK024_176/manifest',
 '@type': 'sc:Manifest',
 'label': 'Wien, III',
 'metadata': [{'label': [{'@value': 'Id', '@language': 'en'},
    {'@value': 'Id', '@language': 'ger'}],
   'value': 'AK024_176'},
  {'label': [{'@value': 'Title', '@language': 'en'},
    {'@value': 'Titel', '@language': 'ger'}],
   'value': 'Wien, III'},
  {'label': [{'@value': 'Place', '@language': 'en'},
    {'@value': 'Ort', '@language': 'ger'}],
   'value': "<a href='https://sws.geonames.org/2773040'>Landstraße</a>"},
  {'label': [{'@value': 'Publisher', '@language': 'en'},
    {'@value': 'Verlag', '@language': 'ger'}],
   'value': 'Ledermann'},
  {'label': [{'@value': 'Place of Publications', '@language': 'en'},
    {'@value': 'Erscheinungsort', '@language': 'ger'}],
   'value': 'Wien'},
  {'label': [{'@value': 'Year', '@language': 'en'},
    {'@value': 'Jahr', '@language': 'ger'}],
   'value': '1906'},
  {'label': [{'@value': 'Disseminator', '@language': 'en'},
    {'@value': 'Anbieter', '@language': 'ger'}],
   'value': "<a href='https://akon.onb.ac.at/'>Ansichtskarten Online</a>"},
  {'label': [{'@value': 'Physical Location', '@language': 'en'},
    {'@value': 'Standort', '@language': 'ger'}],
   'value': 'ÖNB'}],
 'description': 'Russische Kirche',
 'viewingDirection': 'left-to-right',
 'viewingHint': 'paged',
 'license': 'http://creativecommons.org/publicdomain/mark/1.0/',
 'attribution': [{'@value': 'Austrian National Library', '@language': 'en'},
  {'@value': 'Österreichische Nationalbibliothek', '@language': 'ger'}],
 'logo': 'https://iiif.onb.ac.at/logo/',
 'seeAlso': [{'@id': 'http://data.onb.ac.at/AKON/AK024_176',
   'format': 'text/html'},
  {'@id': 'http://data.onb.ac.at/AKON/AK024_176.rdf',
   'format': 'application/rdf+xml'}],
 'sequences': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
   '@id': 'https://iiif.onb.ac.at/presentation/AKON/AK024_176/sequence/normal',
   '@type': 'sc:Sequence',
   'startCanvas': 'https://iiif.onb.ac.at/presentation/AKON/AK024_176/canvas/176',
   'canvases': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
     '@id': 'https://iiif.onb.ac.at/presentation/AKON/AK024_176/canvas/176',
     '@type': 'sc:Canvas',
     'label': 'Wien, III',
     'height': 1681,
     'width': 1082,
     'images': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
       '@id': 'https://iiif.onb.ac.at/presentation/AKON/AK024_176/annotation/176',
       '@type': 'oa:Annotation',
       'motivation': 'sc:painting',
       'resource': {'@id': 'https://iiif.onb.ac.at/images/AKON/AK024_176/176/full/full/0/native.jpg',
        '@type': 'dctypes:Image',
        'height': 1681,
        'width': 1082,
        'format': 'image/jpeg',
        'service': {'@context': 'https://iiif.io/api/image/2/context.json',
         '@id': 'https://iiif.onb.ac.at/images/AKON/AK024_176/176',
         'profile': 'https://iiif.io/api/image/2/level2.json'}},
       'on': 'https://iiif.onb.ac.at/presentation/AKON/AK024_176/canvas/176'}]}]}]}

We need to collect all @ids from all resources from all images from all canvases.

That's tedious by hand. We'll use jsonpath-ng:

In [10]:
from jsonpath_ng import jsonpath, parse

image_id_jp = parse('$.sequences[*].canvases[*].images[*].resource.@id')
In [11]:
[match.value for match in image_id_jp.find(r.json())]
Out[11]:
['https://iiif.onb.ac.at/images/AKON/AK024_176/176/full/full/0/native.jpg']
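
For comparison, the same traversal can be sketched in plain Python, without jsonpath-ng. The helper name image_ids_from_manifest and the trimmed-down sample_manifest are made up for illustration; the sample keeps only the keys the path touches, and missing keys are simply treated as empty.

```python
def image_ids_from_manifest(manifest):
    """Collect resource '@id' values from every image on every canvas.

    Plain-Python counterpart of the JSONPath
    $.sequences[*].canvases[*].images[*].resource.@id
    """
    return [
        image['resource']['@id']
        for sequence in manifest.get('sequences', [])
        for canvas in sequence.get('canvases', [])
        for image in canvas.get('images', [])
        if '@id' in image.get('resource', {})
    ]


# Hypothetical, trimmed-down manifest with only the keys the path touches
sample_manifest = {
    'sequences': [{
        'canvases': [{
            'images': [{
                'resource': {
                    '@id': 'https://iiif.onb.ac.at/images/AKON/AK024_176/176/full/full/0/native.jpg'
                }
            }]
        }]
    }]
}

print(image_ids_from_manifest(sample_manifest))
print(image_ids_from_manifest({}))  # an empty manifest yields an empty list
```

The nested loops get longer with every level, which is the main argument for the JSONPath expression.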

All of this in one function:

In [12]:
image_id_jp = parse('$.sequences[*].canvases[*].images[*].resource.@id')

def image_links_for_manifest_link(manifest_link):
    r = requests.get(manifest_link)
    try:
        manifest = r.json()
    except ValueError:
        # Response was not valid JSON: default to empty - makes batch processing easier in pandas
        manifest = {}
    image_links = [match.value for match in image_id_jp.find(manifest)]
    return image_links

Let's test it:

In [13]:
random_akon_id = meta.sample().iloc[0]['akon_id']
manifest_link = akon_id_to_manifest_link(random_akon_id)
image_links_for_manifest_link(manifest_link)
Out[13]:
['https://iiif.onb.ac.at/images/AKON/AK036_284/284/full/full/0/native.jpg']

Looking good.

Now let's add the image links to the dataframe...

...actually, let's not do that now, because it takes a while (upwards of 10 minutes). Let's cheat instead, skip this step and load the resulting dataframe directly.

In [14]:
# %%time
# meta['image_links'] = meta['manifest_link'].apply(image_links_for_manifest_link)
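
If you do want to run this step yourself: the manifest requests are independent of each other, so they could be issued concurrently. Below is a minimal sketch using the standard library's ThreadPoolExecutor. The names map_concurrently and fake_fetch are hypothetical, and the worker count of 8 is an arbitrary assumption, not a limit documented by the ONB, so please be gentle with the server.

```python
from concurrent.futures import ThreadPoolExecutor


def map_concurrently(func, items, max_workers=8):
    """Apply func to every item using a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs
        return list(pool.map(func, items))


# Stand-in for image_links_for_manifest_link so the sketch runs offline
def fake_fetch(manifest_link):
    return [manifest_link.replace('/manifest', '/image.jpg')]


links = ['https://example.org/a/manifest', 'https://example.org/b/manifest']
print(map_concurrently(fake_fetch, links))
```

In the notebook, the call would presumably look like map_concurrently(image_links_for_manifest_link, meta['manifest_link']), with the returned list assigned to meta['image_links'].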
In [15]:
import json

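def load_json(s):
    # The list was saved with single quotes; swap them so it parses as JSON
    try:
        return json.loads(s.replace("'", '"'))
    except (ValueError, AttributeError):
        # Empty, malformed or non-string cell
        return []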

meta = pd.read_csv('postcards_with_image_links.csv.bz2', compression='bz2', converters={
    'image_links': load_json
})
/home/oida/labs/pydays19/venv/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3049: DtypeWarning: Columns (14) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
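
The quote swap in load_json works here because the links themselves contain no quote characters. Since the column presumably holds the text form of a Python list, ast.literal_eval is another way to read it back that does not depend on that. A sketch; load_list is a hypothetical name and not part of the original notebook.

```python
import ast


def load_list(s):
    """Parse the text form of a Python list; fall back to an empty list."""
    try:
        value = ast.literal_eval(s)
    except (ValueError, SyntaxError):
        # Empty cells or malformed text end up here
        return []
    # Anything that parses but is not a list is treated as empty, too
    return value if isinstance(value, list) else []


print(load_list("['https://iiif.onb.ac.at/images/AKON/AK024_176/176/full/full/0/native.jpg']"))
print(load_list(''))
```

It could be passed to pd.read_csv through the converters argument in the same way as load_json.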
In [16]:
meta.sample(10)
Out[16]:
Unnamed: 0 Unnamed: 0.1 akon_id id altitude building city color comment mountain other photographer publisher publisher_place region water_body year inventory_number signature revision_date date feature_class feature_code geoname_id latitude longitude name country_id admin_name_1 admin_code_1 geo manifest_link image_links
243 243 243 AK111_476 75139 NaN NaN Rochlitz False v. 1907 Rochlitzer Berg NaN NaN NaN NaN NaN NaN NaN NaN Niederösterreichische Landesbibliothek 1672 2014-09-05 11:30:43.299 vor 1907 T HLL 2846260.0 51.02678 12.77079 Rochlitzer Berg DE NaN NaN 51.02678, 12.77079 https://iiif.onb.ac.at/presentation/AKON/AK111... [https://iiif.onb.ac.at/images/AKON/AK111_476/...
34809 34809 34809 AK073_578 45523 NaN Kgl. Residenz Würzburg False 1909 gel NaN NaN NaN Martin Nürnberg NaN NaN NaN NaN NaN 2014-08-19 14:22:35.340 gelaufen 1909 P PPLA2 2805615.0 49.79391 9.95121 Würzburg DE Bayern 02 49.79391, 9.95121 https://iiif.onb.ac.at/presentation/AKON/AK073... [https://iiif.onb.ac.at/images/AKON/AK073_578/...
18069 18069 18069 AK023_145 13445 NaN NaN Villach True NaN Mittagskogel NaN NaN NaN NaN NaN NaN 1912.0 NaN NaN 2014-08-04 07:59:10.156 1912 P PPLA2 2762372.0 46.61028 13.85583 Villach AT NaN NaN 46.61028, 13.85583 https://iiif.onb.ac.at/presentation/AKON/AK023... [https://iiif.onb.ac.at/images/AKON/AK023_145/...
4554 4554 4554 AK034_086 20003 693.0 Chorherrensift Vorau Vorau False NaN NaN NaN NaN Raza Vorau NaN NaN 1924.0 NaN NaN 2014-09-16 14:48:11.455 1924 S MSTY 2762297.0 47.40000 15.90000 Stift Vorau AT NaN NaN 47.4, 15.9 https://iiif.onb.ac.at/presentation/AKON/AK034... [https://iiif.onb.ac.at/images/AKON/AK034_086/...
20907 20907 20907 AK032_497 19311 NaN Schloss Purgstall NaN False NaN NaN NaN NaN NaN NaN NaN NaN 1918.0 NaN NaN 2014-08-04 07:59:10.257 1918 A ADM3 7873031.0 48.05513 15.13316 Purgstall an der Erlauf AT NaN NaN 48.05513, 15.13316 https://iiif.onb.ac.at/presentation/AKON/AK032... [https://iiif.onb.ac.at/images/AKON/AK032_497/...
5136 5136 5136 AK111_054 74715 NaN NaN Kindberg False 1901 gel NaN NaN NaN NaN NaN NaN NaN NaN NaN Niederösterreichische Landesbibliothek 1664 2014-09-05 10:17:42.132 gelaufen 1901 P PPLA3 2774437.0 47.50000 15.45000 Kindberg AT NaN NaN 47.5, 15.45 https://iiif.onb.ac.at/presentation/AKON/AK111... [https://iiif.onb.ac.at/images/AKON/AK111_054/...
3871 3871 3871 AK125_381 83488 601.0 Hans Hackl's Gasthof zum Jaidhaus Hinterstoder False NaN NaN NaN NaN NaN NaN NaN NaN 1911.0 NaN Nationalbibliothek Karten Abteilung 5862 2014-09-12 16:07:31.780 1911 P PPL 2776235.0 47.69957 14.15468 Hinterstoder AT NaN NaN 47.69957, 14.15468 https://iiif.onb.ac.at/presentation/AKON/AK125... [https://iiif.onb.ac.at/images/AKON/AK125_381/...
1174 1174 1174 AK116_235 77922 NaN Burgruine Gars Gars a. Kamp False 1913 gel NaN NaN NaN Kiennast Gars NaN NaN NaN 79/59 K NaN 2014-09-09 12:22:52.928 gelaufen 1913 P PPLA3 2778845.0 48.58333 15.65000 Gars am Kamp AT NaN NaN 48.58333, 15.65 https://iiif.onb.ac.at/presentation/AKON/AK116... [https://iiif.onb.ac.at/images/AKON/AK116_235/...
1897 1897 1897 AK118_376 65136 NaN NaN NaN False 1925 gel NaN NaN NaN NaN NaN NaN Grundlsee NaN 11/44 Kt. Geogr. Topogr. Bilder-Samml. 1944, 4144 2014-09-10 07:51:30.611 gelaufen 1925 H LK 2777424.0 47.63333 13.86667 Grundlsee AT NaN NaN 47.63333, 13.86667 https://iiif.onb.ac.at/presentation/AKON/AK118... [https://iiif.onb.ac.at/images/AKON/AK118_376/...
33243 33243 33243 AK083_217 52264 NaN NaN Höllenthal False v 1905 NaN NaN NaN Johannes Partenkirchen-Garmisch NaN NaN NaN NaN NaN 2014-08-26 12:14:56.005 vor 1905 T CRQ 2900507.0 47.43333 11.01667 Höllental Kar DE Bayern 02 47.43333, 11.01667 https://iiif.onb.ac.at/presentation/AKON/AK083... [https://iiif.onb.ac.at/images/AKON/AK083_217/...

Split Into Two Sets

We'll split the dataframe into two: One with mountains, one without.

In [17]:
nomountain = meta[ meta['mountain'].isnull() ]
mountain = meta[ ~ meta['mountain'].isnull() ]
In [18]:
len(meta), len(nomountain), len(mountain)
Out[18]:
(34846, 29271, 5575)

Yeah, that adds up.

Download

Ok, so what's left to do?

  • Download all image data into two separate directories for training
  • Resize the images for the CNN used

VGG16 and VGG19 expect 224x224 pixel RGB images.

Luckily, IIIF allows us to request images already resized to our demands. That saves on bandwidth, time and code complexity.

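According to the IIIF Image API, we can use the size parameter to scale the image to exactly the dimensions we need. Note that the w,h form forces the result to exactly that size, so a portrait-format postcard gets squashed into a square.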

The links, before and after, would be:

https://iiif.onb.ac.at/images/AKON/AK024_176/176/full/full/0/native.jpg

https://iiif.onb.ac.at/images/AKON/AK024_176/176/full/224,224/0/native.jpg
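
For reference, the tail of an IIIF Image API URL has the shape {region}/{size}/{rotation}/{quality}.{format}, and the size segment accepts several forms. Here is a small sketch of how such links are put together. The helper iiif_image_url is hypothetical; the base URL is the service '@id' from the manifest above, and 'native' is simply the quality name that appears in the ONB links.

```python
def iiif_image_url(service_id, region='full', size='full',
                   rotation='0', quality='native', fmt='jpg'):
    """Assemble an IIIF Image API URL from its path segments."""
    return f'{service_id}/{region}/{size}/{rotation}/{quality}.{fmt}'


service_id = 'https://iiif.onb.ac.at/images/AKON/AK024_176/176'

# 'w,h'  : exactly 224x224 pixels
print(iiif_image_url(service_id, size='224,224'))
# '!w,h' : best fit inside 224x224, aspect ratio preserved
print(iiif_image_url(service_id, size='!224,224'))
# 'w,'   : width 224, height scaled to match
print(iiif_image_url(service_id, size='224,'))
```

Which of these forms a given server accepts depends on its compliance level, so anything other than the 224,224 link used in this notebook should be tried out first.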

Let's try it:

In [19]:
r = requests.get('https://iiif.onb.ac.at/images/AKON/AK024_176/176/full/224,224/0/native.jpg')
In [20]:
from IPython.display import display, Image
In [21]:
display(Image(r.content))

That looks about right.

Download to file:

In [22]:
import shutil

def download_to_file(url, filename):
    with requests.get(url, stream=True) as r:
        # Fail loudly instead of saving an error page as a .jpg
        r.raise_for_status()
        with open(filename, 'wb') as fh:
            shutil.copyfileobj(r.raw, fh)

def sized_link(iiif_url, size='224,224'):
    frags = iiif_url.split('/')
    frags[-3] = size
    return '/'.join(frags)

Test that:

In [23]:
link = sized_link('https://iiif.onb.ac.at/images/AKON/AK024_176/176/full/full/0/native.jpg')
download_to_file(link, 'testimg.jpg')
In [24]:
with open('testimg.jpg', 'rb') as fh:
    display(Image(fh.read()))

Create directories:

In [25]:
import os

# exist_ok=True lets this cell be re-run without raising FileExistsError
os.makedirs('./images/mountain', exist_ok=True)
os.makedirs('./images/nomountain', exist_ok=True)

Now let's download!

In [26]:
# For this demonstration we'll just take 10 images each
for label, subset in [('mountain', mountain), ('nomountain', nomountain)]:
    for idx, row in subset.sample(10).iterrows():
        akon_id = row['akon_id']
        for n, link in enumerate(row['image_links']):
            small_image_link = sized_link(link)
            file_name = f'./images/{label}/{akon_id}_{n}.jpg'
            download_to_file(small_image_link, file_name)
            print('.', end='')
....................
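
To double-check the result, the files in each class directory can be counted. A sketch using pathlib; count_images is a hypothetical helper, and the demo at the bottom builds a throwaway directory tree with made-up file names so that it runs without the downloaded data.

```python
import tempfile
from pathlib import Path


def count_images(root):
    """Return {class directory name: number of .jpg files} below root."""
    return {
        class_dir.name: len(list(class_dir.glob('*.jpg')))
        for class_dir in sorted(Path(root).iterdir())
        if class_dir.is_dir()
    }


# Throwaway demo tree that mimics ./images/mountain and ./images/nomountain
with tempfile.TemporaryDirectory() as tmp:
    for class_name, n_files in [('mountain', 2), ('nomountain', 3)]:
        class_dir = Path(tmp) / class_name
        class_dir.mkdir()
        for i in range(n_files):
            (class_dir / f'AK000_{i:03d}_0.jpg').touch()
    demo_counts = count_images(tmp)

print(demo_counts)
```

In the notebook this would be count_images('./images').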