Run the notebooks in this repo in your browser by clicking the following link: 

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Flabs.onb.ac.at%2Fgitlab%2Flabs-team%2Fdigital-methods-of-newspaper-analysis/main)

# Digital methods of newspaper analysis

This is the public repository for the presentation *ANNO – von Daten zur Forschung. Arbeiten mit dem Zeitschriftenportal der Österreichischen Nationalbibliothek*, given by staff of ONB Labs and the ONB Digitization Department at the summer school *Digitale Methoden der Zeitungsanalyse* (see https://www.zb.uzh.ch/en/events/summer-school-digitale-methoden-der-zeitungsanalyse). It contains the Jupyter notebooks presented, along with text data and sample images.

## Installation

Install the required packages into your local Python environment (Python 3.12) with `pip install -r requirements.txt`, then start the Jupyter server with `jupyter lab`.

## Contents of notebooks

### [ONB_IIIF_API](ONB_IIIF_API.ipynb)

This notebook shows how to access ONB's newspaper data using Python and an API. The API follows the IIIF specifications (https://iiif.io) and supplies images, metadata and text annotations.
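
To give a rough idea of what an IIIF request looks like, here is a minimal sketch that builds URLs following the IIIF Image API pattern `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. The base URL and the identifier below are hypothetical placeholders, not ONB's actual endpoints; the notebook contains the real ones.

```python
from urllib.parse import quote


def iiif_image_url(base_url, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API URL for one image.

    The identifier is percent-encoded so that slashes inside it are not
    mistaken for path separators.
    """
    ident = quote(identifier, safe="")
    return f"{base_url.rstrip('/')}/{ident}/{region}/{size}/{rotation}/{quality}.{fmt}"


def iiif_info_url(base_url, identifier):
    """Build the URL of the image information document (info.json)."""
    return f"{base_url.rstrip('/')}/{quote(identifier, safe='')}/info.json"


# Hypothetical server and identifier, for illustration only.
BASE_URL = "https://iiif.example.org/images"
PAGE_ID = "newspaper/1900-01-01/page-1"

# Full page at the largest size the server offers.
print(iiif_image_url(BASE_URL, PAGE_ID))
# Same page scaled to fit within 600x600 pixels.
print(iiif_image_url(BASE_URL, PAGE_ID, size="!600,600"))
# Technical metadata about the image.
print(iiif_info_url(BASE_URL, PAGE_ID))
```

The resulting URLs can be fetched with any HTTP client; the functions only assemble strings and do not contact a server.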

### [OCR_samples](OCR_samples.ipynb)

This notebook covers two methods for creating and improving your own OCR from images using [Tesseract OCR](https://github.com/tesseract-ocr/tesseract). First, we improve the image quality and correct the orientation of the text. Second, we use specifically trained language data adapted for German Fraktur script.
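
The two steps can be sketched roughly as follows, assuming Pillow for the image handling and the `pytesseract` wrapper for calling Tesseract. The function names, the preprocessing choices and the model name `frk` are illustrative assumptions; the notebook may use different steps and different language data.

```python
from PIL import Image, ImageOps


def preprocess_for_ocr(image, rotate_degrees=0.0):
    """Return a grayscale, contrast-stretched, optionally rotated copy.

    rotate_degrees is counter-clockwise; areas exposed by the rotation
    are filled with white so they do not show up as dark borders.
    """
    gray = ImageOps.grayscale(image)
    stretched = ImageOps.autocontrast(gray)
    return stretched.rotate(rotate_degrees, expand=True, fillcolor=255)


def ocr_fraktur(image, lang="frk"):
    """Run Tesseract on the image with a Fraktur model.

    Assumption: the Tesseract binary and traineddata for `lang` are
    installed; "frk" is only one possible model name.
    """
    import pytesseract  # imported here so preprocessing works without it

    return pytesseract.image_to_string(image, lang=lang)


if __name__ == "__main__":
    # Stand-in for a scanned page; replace with Image.open(<your file>).
    page = Image.new("RGB", (400, 200), "white")
    cleaned = preprocess_for_ocr(page, rotate_degrees=90)
    print(cleaned.mode, cleaned.size)
    # text = ocr_fraktur(cleaned)  # requires Tesseract to be installed
```

Keeping preprocessing separate from recognition makes it easy to compare OCR output with and without each step.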