Run the notebooks in this repo in your browser by clicking the following link: 

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Flabs.onb.ac.at%2Fgitlab%2Flabs-team%2Fdigital-methods-of-newspaper-analysis/main)

# Digital methods of newspaper analysis

This is the public repository for the presentation *ANNO – von Daten zur Forschung. Arbeiten mit dem Zeitschriftenportal der Österreichischen Nationalbibliothek*, given by staff of ONB Labs and the ONB Digitization Department at the summer school *Digitale Methoden der Zeitungsanalyse* (see https://www.zb.uzh.ch/en/events/summer-school-digitale-methoden-der-zeitungsanalyse). It contains the Jupyter notebooks that were presented, along with text data and sample images.

## Installation

Install the required packages into your local Python environment (Python 3.12) with `pip install -r requirements.txt`. Then start the Jupyter server with `jupyter lab`.

## Contents of notebooks

### [ONB_IIIF_API](ONB_IIIF_API.ipynb)

This notebook shows how to access ONB's newspaper data with Python through an API. The API follows the IIIF specification (https://iiif.io) and supplies images, metadata and text annotations.
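As a rough illustration of what an IIIF request looks like, the sketch below builds URLs following the generic IIIF Image API pattern `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. It is not taken from the notebook: the base URL `https://example.org/iiif` and the identifier are placeholders, and the helper names `iiif_image_url` and `iiif_info_url` are hypothetical. See the notebook for the actual ONB endpoints.

```python
from urllib.parse import quote


def iiif_image_url(base_url, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an image request URL following the IIIF Image API pattern.

    All arguments are illustrative. size="max" is valid for Image API 2.1
    and 3.0; servers that only implement 2.0 expect size="full" instead.
    """
    # Identifiers may contain reserved characters such as "/", which the
    # specification requires to be percent-encoded.
    encoded_id = quote(identifier, safe="")
    return (f"{base_url.rstrip('/')}/{encoded_id}/"
            f"{region}/{size}/{rotation}/{quality}.{fmt}")


def iiif_info_url(base_url, identifier):
    """Build the URL of the image information document (info.json)."""
    encoded_id = quote(identifier, safe="")
    return f"{base_url.rstrip('/')}/{encoded_id}/info.json"


# Placeholder server and identifier, for illustration only.
print(iiif_image_url("https://example.org/iiif", "page-0001"))
print(iiif_info_url("https://example.org/iiif", "page-0001"))
```

The resulting URL can be opened in a browser or fetched with any HTTP client. Changing `region` or `size` requests a crop or a scaled version of the same image.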

### [OCR_samples](OCR_samples.ipynb)

This notebook covers two methods for creating and improving your own OCR from images using [Tesseract OCR](https://github.com/tesseract-ocr/tesseract). The first improves the image quality and the orientation of the text. The second uses specifically trained language data adapted for German Fraktur script.
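A minimal sketch of the second method, calling the Tesseract command-line tool from Python with a Fraktur model, might look like the following. It is an assumption-laden example rather than the notebook's code: it assumes the `tesseract` binary is installed and that a model named `frk` (one of the Fraktur models published in the Tesseract tessdata repositories) is available. The notebook may use different language data, and the file name `page.png` and the helper names are hypothetical.

```python
import subprocess


def build_tesseract_command(image_path, lang="frk", psm=3):
    """Return the argument list for a Tesseract CLI call.

    lang: name of the installed traineddata model ("frk" is an assumption).
    psm:  page segmentation mode; 3 is Tesseract's default (automatic).
    Using "stdout" as the output base makes Tesseract print the text.
    """
    return ["tesseract", str(image_path), "stdout",
            "-l", lang, "--psm", str(psm)]


def run_ocr(image_path, lang="frk", psm=3):
    """Run Tesseract on one image and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang, psm),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Show the command that would be executed for a hypothetical sample image.
print(" ".join(build_tesseract_command("page.png")))
```

Calling `run_ocr("page.png")` requires Tesseract and the chosen model to be installed locally. Running it on the same image before and after preprocessing is a simple way to compare the effect of the first method.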