3 - Images and Text.ipynb +36 −7

%% Cell type:markdown id: tags:

# 3 - Images and Text

[https://labs.onb.ac.at/en/tool/sacha/](https://labs.onb.ac.at/en/tool/sacha/)

[https://labs.onb.ac.at/en/dataset/akon/](https://labs.onb.ac.at/en/dataset/akon/)

[https://labs.onb.ac.at/en/dataset/anno/](https://labs.onb.ac.at/en/dataset/anno/)

%% Cell type:markdown id: tags:

### In this block

* Overview IIIF
* Overview OCR formats

%% Cell type:markdown id: tags:

* Example: Create IIIF collection from SPARQL query result
* Example: Download pre-downsized images for machine learning
* Example: Download OCR text

%% Cell type:markdown id: tags:

## Overview IIIF

[http://iiif.io/](http://iiif.io/)

%% Cell type:markdown id: tags:

### What is IIIF?

%% Cell type:markdown id: tags:

* International Image Interoperability Framework ([http://iiif.io/](http://iiif.io/) - well written, worth a read)
* Standardised method of **describing and delivering images over the web**
* Community that develops APIs and implements them in software

%% Cell type:markdown id: tags:

<img src="./media/api_puzzle_pieces.png" style="max-height: 500px;" />

*Image courtesy of [https://github.com/IIIF/training](https://github.com/IIIF/training), CC-BY 4.0*

%% Cell type:markdown id: tags:

### Why would I use this?

%% Cell type:markdown id: tags:

#### If you want to display images

* If you want to use one of several nice viewers for images (zoom, rotate, fullscreen out of the box)
* If you want to include image data hosted elsewhere

%% Cell type:markdown id: tags:

#### If you want to process images

* If you want structured access to potentially huge sets of images
* If you want included metadata
* If you want to resize images *before* downloading

%% Cell type:markdown id: tags:

### How would I use this?

%% Cell type:markdown id: tags:

* You could access an **image** directly (Image API)
    * Parameters can be changed in the URL
    * [https://iiif.onb.ac.at/images/AKON/AK035_199/199/full/full/0/native.jpg](https://iiif.onb.ac.at/images/AKON/AK035_199/199/full/full/0/native.jpg)

%% Cell type:markdown id: tags:

* You could get a **manifest** JSON (Presentation API)
    * Contains images and metadata
    * [https://iiif.onb.ac.at/presentation/AKON/AK035_199/manifest/](https://iiif.onb.ac.at/presentation/AKON/AK035_199/manifest/)

%% Cell type:markdown id: tags:

* You could get a **collection** JSON (Presentation API)
    * Contains manifests and possibly other collections
    * [https://iiif.onb.ac.at/presentation/collection/pydays19](https://iiif.onb.ac.at/presentation/collection/pydays19)

%% Cell type:markdown id: tags:

### Pics or didn't happen!

%% Cell type:markdown id: tags:

* The ONB Labs viewers use IIIF: [https://labs.onb.ac.at/en/dataset/akon/](https://labs.onb.ac.at/en/dataset/akon/)

**TODO**: Available viewers, available data sources (Europeana, ?), applications

%% Cell type:code id: tags:

```python
```

%% Cell type:markdown id: tags:

* Awesome IIIF-related resources: [https://github.com/IIIF/awesome-iiif](https://github.com/IIIF/awesome-iiif)

%% Cell type:markdown id: tags:

* [https://showcase.iiif.io/](https://showcase.iiif.io/)

%% Cell type:markdown id: tags:

* IIIF Presentation API
    * returns JSON-LD structured documents that together describe the structure and layout of a digitized object or other collection of images and related content
    * [https://iiif.io/api/presentation/2.1/](https://iiif.io/api/presentation/2.1/)
    * [https://iiif.onb.ac.at/presentation/ABO/+Z196807705/manifest/](https://iiif.onb.ac.at/presentation/ABO/+Z196807705/manifest/)

%% Cell type:markdown id: tags:

* IIIF Image API
    * requesting and delivering images on the Web
    * [https://iiif.io/api/image/2.1/](https://iiif.io/api/image/2.1/)
    * Image Request URI Syntax
        * `{scheme}://{server}{/prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`
        * [http://iiif.onb.ac.at/images/ABO/Z196807705/00000009/full/full/0/native.jpg](http://iiif.onb.ac.at/images/ABO/Z196807705/00000009/full/full/0/native.jpg)
        * 90 degree rotation: [http://iiif.onb.ac.at/images/ABO/Z196807705/00000009/full/full/90/native.jpg](http://iiif.onb.ac.at/images/ABO/Z196807705/00000009/full/full/90/native.jpg)
        * "Articulus Quartus": [https://iiif.onb.ac.at/images/ABO/Z196807705/00000009/pct:0,0,100,33/full/0/native.jpg](https://iiif.onb.ac.at/images/ABO/Z196807705/00000009/pct:0,0,100,33/full/0/native.jpg)

%% Cell type:markdown id: tags:

## Overview OCR formats

%% Cell type:markdown id: tags:

* ALTO (Analyzed Layout and Text Object)
    * OCR (Optical Character Recognition) data representation format
    * XML Schema
    * [https://github.com/altoxml](https://github.com/altoxml)
    * [https://www.loc.gov/standards/alto/](https://www.loc.gov/standards/alto/)

%% Cell type:markdown id: tags:

* 3 main ALTO elements
    * `<Description>`
        * metadata and general settings (e.g. measurement units) about the ALTO file
    * `<Styles>`
        * text and paragraph styles
    * `<Layout>`
        * content information
        * subdivided into `<Page>` elements

%% Cell type:markdown id: tags:

[Image: media/alto.png]

%% Cell type:markdown id: tags:

* hOCR
    * alternative to ALTO
    * based on XHTML
    * not used in the ONB Labs

%% Cell type:code id: tags:

```python
```
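%% Cell type:markdown id: tags:

The Image Request URI syntax from the IIIF overview can be turned into a small helper. This is a minimal sketch, not part of the ONB Labs tooling: the function name `iiif_image_url` and the constant `BASE` are made up for illustration, and `BASE` simply takes everything up to and including the identifier from the example URLs above. The `256,` size (width 256, height scaled) follows the Image API 2.1 syntax; whether a given server honours it is an assumption to verify. Nothing is downloaded here, the code only builds URL strings.

%% Cell type:code id: tags:

```python
def iiif_image_url(base, region="full", size="full", rotation=0,
                   quality="native", fmt="jpg"):
    """Build {base}/{region}/{size}/{rotation}/{quality}.{format}.

    `base` is assumed to cover {scheme}://{server}{/prefix}/{identifier}.
    """
    return f"{base.rstrip('/')}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Taken from the example URLs in this notebook (hypothetical constant name).
BASE = "https://iiif.onb.ac.at/images/ABO/Z196807705/00000009"

full_page = iiif_image_url(BASE)                         # whole page, original size
rotated = iiif_image_url(BASE, rotation=90)              # rotated by 90 degrees
top_third = iiif_image_url(BASE, region="pct:0,0,100,33")  # upper third of the page
thumbnail = iiif_image_url(BASE, size="256,")            # assumed: server supports "w," sizes

for url in (full_page, rotated, top_third, thumbnail):
    print(url)
```

%% Cell type:markdown id: tags:

Changing only the `size` segment is what makes it possible to request downsized images instead of resizing them after download.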
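%% Cell type:markdown id: tags:

To make the ALTO structure described above more concrete, here is a minimal sketch that reads the text lines out of an ALTO document with Python's standard library. The sample XML is hand-written for illustration and is not ONB data; the helper names `local_name` and `alto_text_lines` are hypothetical. The sketch assumes the usual ALTO nesting in which `<TextLine>` elements contain `<String>` elements that carry the recognised word in a `CONTENT` attribute, and it ignores the namespace so that it does not depend on a particular ALTO version.

%% Cell type:code id: tags:

```python
import xml.etree.ElementTree as ET

# Hand-written sample with the three main elements: Description, Styles, Layout.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Description>
    <MeasurementUnit>pixel</MeasurementUnit>
  </Description>
  <Styles/>
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Hello"/>
            <SP/>
            <String CONTENT="ALTO"/>
          </TextLine>
          <TextLine>
            <String CONTENT="second"/>
            <SP/>
            <String CONTENT="line"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(element):
    """Return the tag name without its '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def alto_text_lines(xml_text):
    """Return one string per <TextLine>, joining its <String> words with spaces."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element) == "TextLine":
            words = [child.get("CONTENT", "")
                     for child in element
                     if local_name(child) == "String"]
            lines.append(" ".join(words))
    return lines


for line in alto_text_lines(SAMPLE_ALTO):
    print(line)
```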