%% Cell type:markdown id: tags:
## Shout it Out: LOUD (by Rob Sanderson, EuropeanaTech 2018)
%% Cell type:markdown id: tags:
* Linked
* Open
* **Usable (for developers, DX)**
* Data
%% Cell type:markdown id: tags:
# Five Stars of LOUD
1. the right **A**bstraction for the audience
2. few **B**arriers to entry
3. **C**omprehensible by introspection
4. **D**ocumentation with working examples
5. few **E**xceptions, many consistent patterns
%% Cell type:markdown id: tags:
# Five Stars of LOUD
1. **A**bstraction: an abstraction suited to the target audience
2. **B**arriers: low barriers to entry
3. **C**omprehensible: data that is immediately understandable
4. **D**ocumentation: documentation with working examples
5. **E**xceptions: few exceptions, a structure that is as uniform as possible
%% Cell type:markdown id: tags:
# Free Stuff For Devs
**Use Images, Text and Catalogue Data from the Austrian National Library in Jupyter Notebooks**
**LOD-Workshop: Vertiefung - „Unlock the Libraries: Offene Daten von und für Bibliotheken“**
*Georg Petz - Austrian National Library*
%% Cell type:markdown id: tags:
# 1 - Overview
%% Cell type:markdown id: tags:
### What's gonna happen here?
* Part 1: Overview
  * What's this all about?
  * Who are these people?
  * What do I need?
  * How do we want to do this?
* Part 2: Metadata and Catalogue
* Part 3: Images and Text
%% Cell type:markdown id: tags:
### What's this all about?
* The Austrian National Library offers data, free to use
* We want to show you roughly what data you can expect
* We want to show you how to work with the data interfaces
%% Cell type:markdown id: tags:
#### Why?
* If you ever need this kind of data
* For fun
%% Cell type:markdown id: tags:
#### Data and Interfaces
* **Metadata**: catalogue data, metadata for historic postcards and historic newspapers, SPARQL
* **Images and text**: Text for historic newspapers, images and text for historic newspapers and historic postcards
%% Cell type:markdown id: tags:
### Who are these people?
%% Cell type:markdown id: tags:
* Presenter
%% Cell type:markdown id: tags:
* Participants
%% Cell type:markdown id: tags:
#### What are you interested in?
%% Cell type:markdown id: tags:
### What do I need?
* The repository at []( in its freshest form
* A working Python3 installation
* A venv with the requirements installed
* A `jupyter notebook` running inside the venv
%% Cell type:markdown id: tags:
### How do we want to do this?
%% Cell type:markdown id: tags:
## Used Libraries
* **requests** - HTTP for humans [](
* **pandas** - spreadsheets for Python on steroids [](
* **jsonpath** - xpath for json [](
* **lxml** - xml parser and xpath (version 1) implementation [](
* **sickle** - OAI-PMH for humans [](
* **pyswagger** - dynamic OpenAPI / Swagger client [](
* **sparqlwrapper** - SPARQL endpoint interface to python [](
%% Cell type:markdown id: tags:
# 3 - Images and Text
%% Cell type:markdown id: tags:
### In this block
* Overview IIIF
* Overview OCR formats
%% Cell type:markdown id: tags:
* Example: Create IIIF collection from SPARQL query result
* Example: Download pre-downsized images for machine learning
* Example: Download OCR text
%% Cell type:markdown id: tags:
## Overview IIIF
%% Cell type:markdown id: tags:
### What is IIIF?
%% Cell type:markdown id: tags:
* International Image Interoperability Framework ([]( - well written, worth a read)
* Standardised method of **describing and delivering images over the web**
* Community that develops APIs and implements them in software
%% Cell type:markdown id: tags:
<img src="./media/api_puzzle_pieces.png" style="max-height: 500px;" />
*Image courtesy of [](, CC-BY 4.0*
%% Cell type:markdown id: tags:
### Why would I use this?
%% Cell type:markdown id: tags:
#### If you want to display images
* If you want to use one of several nice image viewers (zoom, rotate, fullscreen out of the box)
* If you want to include image data hosted elsewhere
%% Cell type:markdown id: tags:
#### If you want to process images
* If you want structured access to potentially huge sets of images
* If you want included metadata
* If you want to resize images *before* downloading
%% Cell type:markdown id: tags:
### How would I use this?
%% Cell type:markdown id: tags:
* You could access an **image** directly (Image API)
* Parameters can be changed in the URL
* [](
%% Cell type:markdown id: tags:
* You could get a **manifest** JSON (Presentation API)
* Contains images and metadata
* [](
%% Cell type:markdown id: tags:
* You could get a **collection** JSON (Presentation API)
* Contains manifests and possibly other collections
* [](
%% Cell type:markdown id: tags:
### Pics or didn't happen!
%% Cell type:markdown id: tags:
* The ONB Labs viewers use IIIF: [](
%% Cell type:markdown id: tags:
* Awesome IIIF-related resources: [](
%% Cell type:markdown id: tags:
* [](
%% Cell type:markdown id: tags:
* IIIF Presentation API
* returns JSON-LD structured documents that together describe the structure and layout of a digitized object or other collection of images and related content
* [](
* [](
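
To make the structure concrete, here is a minimal sketch of reading image service URLs out of a manifest. It assumes a Presentation API 2.x manifest (`sequences` → `canvases` → `images` → `resource` → `service`); the inline `sample_manifest` and its URLs are made up for illustration, and `image_services` is a hypothetical helper, not part of any library.

``` python
def image_services(manifest):
    """Yield (canvas label, image service base URI) for each canvas image."""
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                service = image.get("resource", {}).get("service", {})
                if "@id" in service:
                    yield canvas.get("label"), service["@id"]


# Hand-written stand-in for a manifest fetched with requests.get(url).json()
sample_manifest = {
    "@type": "sc:Manifest",
    "label": "Sample object",
    "sequences": [{
        "canvases": [
            {"label": "1", "images": [{"resource": {
                "service": {"@id": "https://iiif.example.org/images/p1"}}}]},
            {"label": "2", "images": [{"resource": {
                "service": {"@id": "https://iiif.example.org/images/p2"}}}]},
        ]
    }],
}

pages = list(image_services(sample_manifest))
print(pages)
```

Each service URI returned this way can serve as the base of an Image API request for that page.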
%% Cell type:markdown id: tags:
* IIIF Image API
* requesting and delivering images on the Web
* [](
* Image Request URI Syntax
* `{scheme}://{server}{/prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`
* [](
* 90 degree rotation: [](
* "Articulus Quartus": [,0,100,33/full/0/native.jpg](,0,100,33/full/0/native.jpg)
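
The URI template above can also be assembled in code. The following is a minimal sketch, not ONB Labs code: `iiif_image_url` is a hypothetical helper, and the server, prefix, and identifier are placeholders. The `native` quality follows the example above; newer versions of the Image API call the default quality `default` instead.

``` python
def iiif_image_url(base, identifier, region="full", size="full",
                   rotation=0, quality="native", fmt="jpg"):
    """Assemble an Image API request URI from its path segments.

    `base` stands for {scheme}://{server}{/prefix}.
    """
    return "{}/{}/{}/{}/{}/{}.{}".format(
        base.rstrip("/"), identifier, region, size, rotation, quality, fmt)


# Placeholder server and identifier, for illustration only
base = "https://iiif.example.org/images"

full = iiif_image_url(base, "page-0001")
rotated = iiif_image_url(base, "page-0001", rotation=90)
# Top third of the page, scaled to 512 px wide before download
crop = iiif_image_url(base, "page-0001", region="pct:0,0,100,33", size="512,")

print(full)
print(rotated)
print(crop)
```

Because region and size are part of the URL, the server does the cropping and scaling, so only the reduced image is transferred.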
%% Cell type:markdown id: tags:
## Overview OCR formats
%% Cell type:markdown id: tags:
* ALTO (Analyzed Layout and Text Object)
* OCR (Optical Character Recognition) data representation format
* XML Schema
* [](
* [](
%% Cell type:markdown id: tags:
* 3 ALTO main elements
* `<Description>`
* metadata and general settings (e.g. measurement units) about the ALTO file
* `<Styles>`
* text and paragraph styles
* `<Layout>`
* content information
* subdivided into `<Page>` elements
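
As a rough illustration of the `<Layout>` part, the sketch below pulls plain text out of a tiny hand-written ALTO fragment. It assumes the usual nesting `Page` → `PrintSpace` → `TextBlock` → `TextLine` → `String`, with the recognised word in the `CONTENT` attribute. The sample XML is invented, `local_name` and `alto_lines` are hypothetical helpers, and the standard-library parser is used only to keep the example free of dependencies.

``` python
import xml.etree.ElementTree as ET

sample_alto = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Description><MeasurementUnit>pixel</MeasurementUnit></Description>
  <Styles/>
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine><String CONTENT="Articulus"/><SP/><String CONTENT="Quartus"/></TextLine>
          <TextLine><String CONTENT="Wien"/></TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(element):
    """Strip the '{namespace}' prefix so the code is independent of the ALTO version."""
    return element.tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return one string per <TextLine>, joining the words of its <String> children."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element) == "TextLine":
            words = [child.get("CONTENT", "") for child in element
                     if local_name(child) == "String"]
            lines.append(" ".join(words))
    return lines


lines = alto_lines(sample_alto)
print(lines)
```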
%% Cell type:markdown id: tags:
![ALTO page element](./media/alto.png)
%% Cell type:markdown id: tags:
* hOCR
* alternative to ALTO
* based on XHTML
* not used in the ONB Labs
%% Cell type:code id: tags:
``` python
import json

import pandas as pd
import requests
from SPARQLWrapper import SPARQLWrapper, JSON
```
%% Cell type:markdown id: tags:
Set the SPARQL endpoint:
* for ANNO
* for AKON
%% Cell type:code id: tags:
``` python
anno_lod_endpoint = ""
```
%% Cell type:markdown id: tags:
Methods to query the endpoint and build the dataframe:
%% Cell type:code id: tags:
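``` python
def get_sparql_result(service, query):
    # Send the query to the endpoint and ask for a JSON response
    sparql = SPARQLWrapper(service)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query()


def get_sparql_dataframe(service, query):
    result = get_sparql_result(service, query)
    processed_results = result.convert()
    # Column names are the variables of the SELECT clause
    cols = processed_results['head']['vars']
    out = []
    for row in processed_results['results']['bindings']:
        item = []
        for c in cols:
            # Unbound variables are missing from the row and become None
            item.append(row.get(c, {}).get('value'))
        out.append(item)
    return pd.DataFrame(out, columns=cols)
```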
%% Cell type:markdown id: tags:
Select all newspapers and periodicals with the subject heading Statistik:
%% Cell type:code id: tags:
``` python
query = '''
PREFIX dc: <>
PREFIX edm: <>
PREFIX dcterms: <>
SELECT ?title ?subjectURI ?manifest
WHERE {
  ?subjectURI dc:subject <> .
  ?subjectURI dc:title ?title .
  ?subjectURI edm:isShownBy ?firstpage .
  ?subjectURI edm:rights <> .
  ?firstpage dcterms:isReferencedBy ?manifest
}
'''
```
%% Cell type:markdown id: tags:
Get the list of IIIF manifest URLs:
%% Cell type:code id: tags:
``` python
df = get_sparql_dataframe(anno_lod_endpoint, query)
manifests = list(df['manifest'])
```
%% Output
%% Cell type:markdown id: tags:
Function to create a SACHA Collection:
%% Cell type:code id: tags:
``` python
def create_collection(description, list_of_manifest_ids_or_ids):
    # Request body: a description plus the elements that make up the collection
    j = {
        "description": description,
        "elements": list_of_manifest_ids_or_ids
    }
    creation_link = ''
    # Send the body as JSON; status 201 (Created) signals success
    result = requests.post(creation_link, json=j)
    if result.status_code == 201:
        print('SUCCESS: Create collection {}'.format(result.json()['url']))
        print('View collection in Mirador:' + result.json()['url'].split('/').pop())
    elif result.status_code == 400:
        print('ERROR: Request error creating collection')
    elif result.status_code == 500:
        print('ERROR: Server error creating collection')
    else:
        print('ERROR: General error creating collection, HTTP status = {}'.format(result.status_code))
```
%% Cell type:markdown id: tags:
Create the SACHA Collection:
%% Cell type:code id: tags:
``` python
create_collection("newspaper with subject heading Statistik", manifests)
```
%% Output
SUCCESS: Create collection
View collection in Mirador:
ERROR: Request error creating collection
"status code" : 400,
"message" : "There is no query nor any list of elements in the request."