%% Cell type:markdown id: tags:
## Shout it Out: LOUD (by Rob Sanderson, EuropeanaTech 2018)
%% Cell type:markdown id: tags:
* Linked
* Open
* **Usable (for developers, DX)**
* Data
%% Cell type:markdown id: tags:
* https://youtu.be/r4afi8mGVAY
* https://www.slideshare.net/azaroth42/europeanatech-keynote-shout-it-out-loud
%% Cell type:markdown id: tags:
# Five Stars of LOUD
1. right **A**bstraction for the audience
2. few **B**arriers to enter
3. **C**omprehensible by introspection
4. **D**ocumentation with working example
5. few **E**xceptions, many consistent patterns
%% Cell type:markdown id: tags:
# Five Stars of LOUD
1. **A**bstraction: an abstraction appropriate to the target audience
2. **B**arriers: low barriers to entry
3. **C**omprehensible: data that is immediately understandable
4. **D**ocumentation: documentation with working examples
5. **E**xceptions: few exceptions, a structure that is as uniform as possible
%% Cell type:markdown id: tags:
# Free Stuff For Devs
**Use Images, Text and Catalogue Data from the Austrian National Library in Jupyter Notebooks**
**LOD Workshop: In-Depth Session - "Unlock the Libraries: Open Data from and for Libraries"**
[https://labs.onb.ac.at](https://labs.onb.ac.at)
https://labs.onb.ac.at/gitlab/labs-team/pydays19/-/tree/UnlocktheLibraries
*Georg Petz - Austrian National Library*
%% Cell type:markdown id: tags:
# 1 - Overview
[https://labs.onb.ac.at](https://labs.onb.ac.at)
https://labs.onb.ac.at/gitlab/labs-team/pydays19/-/tree/UnlocktheLibraries
%% Cell type:markdown id: tags:
### What's gonna happen here?
* Part 1: Overview
    * What's this all about?
    * Who are these people?
    * What do I need?
    * How do we want to do this?
* Part 2: Metadata and Catalogue
* Part 3: Images and Text
%% Cell type:markdown id: tags:
### What's this all about?
* The Austrian National Library offers data, free to use
* We want to show you roughly what data you can expect
* We want to show you how to work with the data interfaces
%% Cell type:markdown id: tags:
#### Why?
* If you ever need this kind of data
* For fun
%% Cell type:markdown id: tags:
#### Data and Interfaces
* **Metadata**: catalogue data, metadata for historic postcards and historic newspapers, SPARQL
* **Images and text**: images and full text for historic newspapers and historic postcards
%% Cell type:markdown id: tags:
### Who are these people?
%% Cell type:markdown id: tags:
* Presenter
%% Cell type:markdown id: tags:
* Participants
%% Cell type:markdown id: tags:
#### What are you interested in?
%% Cell type:markdown id: tags:
### What do I need?
* The repository at [https://labs.onb.ac.at/gitlab/labs-team/pydays19/-/tree/UnlocktheLibraries](https://labs.onb.ac.at/gitlab/labs-team/pydays19/-/tree/UnlocktheLibraries) in its freshest form
* A working Python3 installation
* A venv with the requirements installed
* A `jupyter notebook` running inside the venv
%% Cell type:markdown id: tags:
### How do we want to do this?
%% Cell type:markdown id: tags:
## Used Libraries
* **requests** - HTTP for humans [https://2.python-requests.org/en/master/](https://2.python-requests.org/en/master/)
* **pandas** - spreadsheets for Python on steroids [https://pandas.pydata.org/](https://pandas.pydata.org/)
* **jsonpath** - XPath for JSON [https://github.com/h2non/jsonpath-ng](https://github.com/h2non/jsonpath-ng)
* **lxml** - XML parser and XPath (version 1) implementation [https://lxml.de/](https://lxml.de/)
* **sickle** - OAI-PMH for humans [https://sickle.readthedocs.io/en/latest/](https://sickle.readthedocs.io/en/latest/)
* **pyswagger** - dynamic OpenAPI / Swagger client [https://github.com/pyopenapi/pyswagger](https://github.com/pyopenapi/pyswagger)
* **sparqlwrapper** - Python interface to SPARQL endpoints [https://rdflib.github.io/sparqlwrapper/](https://rdflib.github.io/sparqlwrapper/)
......
%% Cell type:markdown id: tags:
# 3 - Images and Text
[https://labs.onb.ac.at/en/tool/sacha/](https://labs.onb.ac.at/en/tool/sacha/)
[https://labs.onb.ac.at/en/dataset/akon/](https://labs.onb.ac.at/en/dataset/akon/)
[https://labs.onb.ac.at/en/dataset/anno/](https://labs.onb.ac.at/en/dataset/anno/)
%% Cell type:markdown id: tags:
### In this block
* Overview IIIF
* Overview OCR formats
%% Cell type:markdown id: tags:
* Example: Create IIIF collection from SPARQL query result
* Example: Download pre-downsized images for machine learning
* Example: Download OCR text
%% Cell type:markdown id: tags:
## Overview IIIF
[http://iiif.io/](http://iiif.io/)
%% Cell type:markdown id: tags:
### What is IIIF?
%% Cell type:markdown id: tags:
* International Image Interoperability Framework ([http://iiif.io/](http://iiif.io/) - well written, worth a read)
* Standardised method of **describing and delivering images over the web**
* Community that develops APIs and implements them in software
%% Cell type:markdown id: tags:
<img src="./media/api_puzzle_pieces.png" style="max-height: 500px;" />
*Image courtesy of [https://github.com/IIIF/training](https://github.com/IIIF/training), CC-BY 4.0*
%% Cell type:markdown id: tags:
### Why would I use this?
%% Cell type:markdown id: tags:
#### If you want to display images
* If you want to use one of several nice viewers for images (zoom, rotate, fullscreen out of the box)
* If you want to include image data hosted elsewhere
%% Cell type:markdown id: tags:
#### If you want to process images
* If you want structured access to potentially huge sets of images
* If you want included metadata
* If you want to resize images *before* downloading
%% Cell type:markdown id: tags:
### How would I use this?
%% Cell type:markdown id: tags:
* You could access an **image** directly (Image API)
    * Parameters can be changed in the URL (https://app.digitale-sammlungen.de/demo/iiif/image-request-url.html)
    * [https://iiif.onb.ac.at/images/AKON/AK035_199/199/full/full/0/native.jpg](https://iiif.onb.ac.at/images/AKON/AK035_199/199/full/full/0/native.jpg)
%% Cell type:markdown id: tags:
* You could get a **manifest** JSON (Presentation API)
    * Contains images and metadata
    * [https://iiif.onb.ac.at/presentation/AKON/AK035_199/manifest/](https://iiif.onb.ac.at/presentation/AKON/AK035_199/manifest/)
%% Cell type:markdown id: tags:
* You could get a **collection** JSON (Presentation API)
    * Contains manifests and possibly other collections
    * [https://iiif.onb.ac.at/presentation/collection/pydays19](https://iiif.onb.ac.at/presentation/collection/pydays19)
%% Cell type:markdown id: tags:
### Pics or didn't happen!
%% Cell type:markdown id: tags:
* The ONB Labs viewers use IIIF: [https://labs.onb.ac.at/en/dataset/akon/](https://labs.onb.ac.at/en/dataset/akon/)
%% Cell type:markdown id: tags:
* Awesome IIIF-related resources: [https://github.com/IIIF/awesome-iiif](https://github.com/IIIF/awesome-iiif)
%% Cell type:markdown id: tags:
* [https://showcase.iiif.io/](https://showcase.iiif.io/)
%% Cell type:markdown id: tags:
* IIIF Presentation API
    * returns JSON-LD structured documents that together describe the structure and layout of a digitized object or other collection of images and related content
    * [https://iiif.io/api/presentation/2.1/](https://iiif.io/api/presentation/2.1/)
    * [https://iiif.onb.ac.at/presentation/ABO/+Z196807705/manifest/](https://iiif.onb.ac.at/presentation/ABO/+Z196807705/manifest/)
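%% Cell type:markdown id: tags:
A minimal sketch of how a Presentation API 2.x manifest could be walked to find its images. It assumes the usual 2.x nesting (`sequences` → `canvases` → `images` → `resource` → `service`). The function name `image_service_ids` and the miniature manifest are made up for illustration and are not ONB data.

```python
def image_service_ids(manifest):
    """Collect the Image API base URI of every image in a 2.x manifest dict."""
    ids = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                service = image.get("resource", {}).get("service", {})
                # Only handle the simple case of a single service object.
                if isinstance(service, dict) and "@id" in service:
                    ids.append(service["@id"])
    return ids


# Hand-written miniature manifest (hypothetical, example.org), only to exercise the function.
sample_manifest = {
    "@type": "sc:Manifest",
    "label": "Example object",
    "sequences": [{
        "canvases": [
            {"label": "1", "images": [{"resource": {
                "@id": "https://example.org/images/book/0001/full/full/0/native.jpg",
                "service": {"@id": "https://example.org/images/book/0001"}}}]},
            {"label": "2", "images": [{"resource": {
                "@id": "https://example.org/images/book/0002/full/full/0/native.jpg",
                "service": {"@id": "https://example.org/images/book/0002"}}}]},
            # A canvas without an image service is skipped.
            {"label": "3", "images": [{"resource": {
                "@id": "https://example.org/static/0003.jpg"}}]},
        ],
    }],
}

service_ids = image_service_ids(sample_manifest)
print(service_ids)
```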
%% Cell type:markdown id: tags:
* IIIF Image API
    * requesting and delivering images on the Web
    * [https://iiif.io/api/image/2.1/](https://iiif.io/api/image/2.1/)
    * Image Request URI Syntax
        * `{scheme}://{server}{/prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`
        * [http://iiif.onb.ac.at/images/ABO/Z196807705/00000009/full/full/0/native.jpg](http://iiif.onb.ac.at/images/ABO/Z196807705/00000009/full/full/0/native.jpg)
        * 90 degree rotation: [http://iiif.onb.ac.at/images/ABO/Z196807705/00000009/full/full/90/native.jpg](http://iiif.onb.ac.at/images/ABO/Z196807705/00000009/full/full/90/native.jpg)
        * "Articulus Quartus": [https://iiif.onb.ac.at/images/ABO/Z196807705/00000009/pct:0,0,100,33/full/0/native.jpg](https://iiif.onb.ac.at/images/ABO/Z196807705/00000009/pct:0,0,100,33/full/0/native.jpg)
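%% Cell type:markdown id: tags:
The request URI syntax above can be assembled with plain string formatting. The helper `iiif_image_url` below is a hypothetical name, not part of any library. Its defaults (`full`, `0`, `native`, `jpg`) simply mirror the ONB example URLs. The `size="256,"` request uses the Image API "width only" form; whether a given server honours it is an assumption to check against that server.

```python
def iiif_image_url(base, identifier, region="full", size="full",
                   rotation=0, quality="native", fmt="jpg"):
    """Build {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}."""
    return f"{base.rstrip('/')}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"


BASE = "https://iiif.onb.ac.at/images"
PAGE = "ABO/Z196807705/00000009"

full = iiif_image_url(BASE, PAGE)                                 # whole page, original size
rotated = iiif_image_url(BASE, PAGE, rotation=90)                 # rotated by 90 degrees
top_third = iiif_image_url(BASE, PAGE, region="pct:0,0,100,33")   # upper third of the page
small = iiif_image_url(BASE, PAGE, size="256,")                   # scaled to 256 px width (assumed supported)

for url in (full, rotated, top_third, small):
    print(url)
```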
%% Cell type:markdown id: tags:
## Overview OCR formats
%% Cell type:markdown id: tags:
* ALTO (Analyzed Layout and Text Object)
    * OCR (Optical Character Recognition) data representation format
    * XML Schema
    * [https://github.com/altoxml](https://github.com/altoxml)
    * [https://www.loc.gov/standards/alto/](https://www.loc.gov/standards/alto/)
%% Cell type:markdown id: tags:
* 3 ALTO main elements
    * `<Description>`
        * metadata and general settings (e.g. measurement units) about the ALTO file
    * `<Styles>`
        * text and paragraph styles
    * `<Layout>`
        * content information
        * subdivided into `<Page>` elements
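%% Cell type:markdown id: tags:
A rough sketch of reading plain text out of an ALTO file. It assumes the common layout in which words are `<String CONTENT="...">` elements inside `<TextLine>` elements. It uses the standard library's `xml.etree.ElementTree` instead of lxml only to stay self-contained. The sample document and the helper names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written ALTO document (hypothetical content), showing the three main elements.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Description><MeasurementUnit>pixel</MeasurementUnit></Description>
  <Styles/>
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine>
            <String CONTENT="Articulus"/><SP/><String CONTENT="Quartus"/>
          </TextLine>
          <TextLine>
            <String CONTENT="Second"/><SP/><String CONTENT="line"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return one string per <TextLine>, joining its <String> words with spaces."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [child.get("CONTENT", "") for child in element
                     if local_name(child.tag) == "String"]
            lines.append(" ".join(words))
    return lines


text_lines = alto_lines(SAMPLE_ALTO)
print("\n".join(text_lines))
```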
%% Cell type:markdown id: tags:
![ALTO page element](./media/alto.png)
%% Cell type:markdown id: tags:
* hOCR
    * alternative to ALTO
    * based on XHTML
    * not used in the ONB Labs
%% Cell type:code id: tags:
``` python
```
......
%% Cell type:code id: tags:
``` python
import requests
import pandas as pd
from SPARQLWrapper import SPARQLWrapper, JSON
import json
```
%% Cell type:markdown id: tags:
Set the SPARQL endpoint:
* https://lod.onb.ac.at/sparql/anno for ANNO
* https://lod.onb.ac.at/sparql/akon for AKON
%% Cell type:code id: tags:
``` python
anno_lod_endpoint = "https://lod.onb.ac.at/sparql/anno"
```
%% Cell type:markdown id: tags:
Methods to query the endpoint and build the dataframe:
%% Cell type:code id: tags:
``` python
def get_sparql_result(service, query):
    # Send the query to the given SPARQL endpoint and ask for JSON results.
    sparql = SPARQLWrapper(service)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query()


def get_sparql_dataframe(service, query):
    # Run the query and turn the result bindings into a pandas DataFrame,
    # one column per SELECT variable.
    result = get_sparql_result(service, query)
    processed_results = result.convert()
    cols = processed_results['head']['vars']
    out = []
    for row in processed_results['results']['bindings']:
        item = []
        for c in cols:
            # A variable that is unbound in this row becomes None.
            item.append(row.get(c, {}).get('value'))
        out.append(item)
    return pd.DataFrame(out, columns=cols)
```
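%% Cell type:markdown id: tags:
To see what the conversion in `get_sparql_dataframe` does without contacting the endpoint, the same loop can be applied to a hand-written response in the SPARQL JSON results layout (`head.vars` plus `results.bindings`). The sample data and the helper name `bindings_to_rows` are invented for illustration. Plain lists stand in for the DataFrame so the sketch has no dependencies.

```python
# Hand-written response in the SPARQL JSON results layout (hypothetical values).
sample = {
    "head": {"vars": ["title", "manifest"]},
    "results": {"bindings": [
        {"title": {"type": "literal", "value": "Example A"},
         "manifest": {"type": "uri", "value": "http://example.org/a/manifest"}},
        # In this row ?manifest is unbound, so the key is missing entirely.
        {"title": {"type": "literal", "value": "Example B"}},
    ]},
}


def bindings_to_rows(processed_results):
    """Return (column names, rows); unbound variables become None."""
    cols = processed_results["head"]["vars"]
    rows = [[row.get(c, {}).get("value") for c in cols]
            for row in processed_results["results"]["bindings"]]
    return cols, rows


cols, rows = bindings_to_rows(sample)
print(cols)
for r in rows:
    print(r)
```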
%% Cell type:markdown id: tags:
Select all newspapers and periodicals with the subject heading Statistik:
%% Cell type:code id: tags:
``` python
query = '''
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX edm: <http://www.europeana.eu/schemas/edm/>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?title ?subjectURI ?manifest
WHERE {?subjectURI dc:subject <http://d-nb.info/gnd/4056995-0> .
?subjectURI dc:title ?title .
?subjectURI edm:isShownBy ?firstpage .
?subjectURI edm:rights <http://creativecommons.org/publicdomain/mark/1.0/> .
?firstpage dcterms:isReferencedBy ?manifest
}'''
```
%% Cell type:markdown id: tags:
Get the list of IIIF manifest URLs:
%% Cell type:code id: tags:
``` python
df = get_sparql_dataframe(anno_lod_endpoint, query)
manifests = list(df['manifest'])
manifests
```
%% Output
['http://iiif.onb.ac.at/presentation/ANNO/stm1875ag0001/manifest',
'http://iiif.onb.ac.at/presentation/ANNO/stm1876ag0001/manifest',
'http://iiif.onb.ac.at/presentation/ANNO/stm1877ag0001/manifest',
'http://iiif.onb.ac.at/presentation/ANNO/stm1878ag0001/manifest']
[]
%% Cell type:markdown id: tags:
Function to create a SACHA Collection (https://iiif.onb.ac.at/api#_collectionspostjsonprocessor):
%% Cell type:code id: tags:
``` python
def create_collection(description, list_of_manifest_ids_or_ids):
    # Request body: a description plus the manifests (or IDs) to include.
    j = {
        "description": description,
        "elements": list_of_manifest_ids_or_ids
    }
    creation_link = 'https://iiif.onb.ac.at/presentation/collection'
    result = requests.post(creation_link, json=j)
    if result.status_code == 201:
        # The response contains the URL of the new collection.
        print('SUCCESS: Create collection {}'.format(result.json()['url']))
        print('View collection in Mirador: https://iiif.onb.ac.at/view/collection/mirador/' + result.json()['url'].split('/').pop())
    elif result.status_code == 400:
        print('ERROR: Request error creating collection')
        print(result.text)
    elif result.status_code == 500:
        print('ERROR: Server error creating collection')
        print(result.text)
    else:
        print('ERROR: General error creating collection, HTTP status = {}'.format(result.status_code))
```
%% Cell type:markdown id: tags:
Create the SACHA Collection:
%% Cell type:code id: tags:
``` python
create_collection("newspaper with subject heading Statistik", manifests)
```
%% Output
SUCCESS: Create collection https://iiif.onb.ac.at/presentation/collection/R9kE0IcrIE
View collection in Mirador: https://iiif.onb.ac.at/view/collection/mirador/R9kE0IcrIE
ERROR: Request error creating collection
{
"status code" : 400,
"message" : "There is no query nor any list of elements in the request."
}
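%% Cell type:markdown id: tags:
The 400 response above says the request contained neither a query nor a list of elements. One plausible cause, stated here as an assumption, is that `manifests` was empty when the cell ran. A small guard can catch that case before anything is sent. The helper name `build_collection_payload` is hypothetical and only builds the request body; it does not call the SACHA API.

```python
def build_collection_payload(description, elements):
    """Build the JSON body for a collection request, refusing an empty element list."""
    elements = [e for e in elements if e]  # drop None and empty strings
    if not elements:
        raise ValueError("no manifest IDs to add; check the SPARQL result first")
    return {"description": description, "elements": elements}


payload = build_collection_payload(
    "newspaper with subject heading Statistik",
    ["http://iiif.onb.ac.at/presentation/ANNO/stm1875ag0001/manifest"],
)
print(payload)

try:
    build_collection_payload("empty example", [])
except ValueError as err:
    print("refused:", err)
```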
%% Cell type:code id: tags:
``` python
```
......