Annolyzer released

We are very happy to announce the successful integration of the NewsEye demonstrator platform into ONB Labs! It is a tool that “provid[es] enhanced access to historical newspapers for a wide range of users” and was developed during the NewsEye project. Since it operates only on the part of the NewsEye corpus that is included in ANNO, we have dubbed it Annolyzer. Click here to access the Annolyzer directly, but note that prior login on our Labs page is required.

In this post we want to give an overview of (some of) the results from the NewsEye project and our efforts to integrate them sustainably into the ONB Labs infrastructure, as well as present a plan for future developments that can benefit both projects.


  1. What is the NewsEye demonstrator platform?
  2. Why did we integrate it into ONB Labs?
  3. Core features of the platform
  4. Future direction and development

1. What is the NewsEye demonstrator platform?

Historical newspapers in the NewsEye project

The NewsEye project (full title: “NewsEye: A Digital Investigator for Historical Newspapers”) was a cooperation between multiple European universities (University of La Rochelle, University of Helsinki, University of Innsbruck, University of Rostock, University Paul-Valéry Montpellier and University of Vienna) and national libraries (Austrian National Library, National Library of Finland and National Library of France), funded by the European Union's Horizon 2020 research and innovation programme. During the project's runtime (from May 2018 to January 2022), the goal was to develop tools and methods that would “change the way European digital heritage data is (re)searched, accessed, used and analysed”.

The project focused on selected issues from historical newspapers in four languages (Finnish, French, German and Swedish) from the late 19th to the mid 20th century. The main objective was to develop a set of tools and methods for effective exploration and exploitation of the rich resource of newspapers by means of new technologies and big data approaches. The project aimed at improving the users' capability to access, analyse and use the content contained in the vast corpus of digitized historical newspapers.

The researchers' problem

The teams of researchers in the project started out with scanned images and, by means of optical character recognition (OCR), produced text files of the newspaper issues in the corpus. Performing tasks such as article segmentation and named entity recognition, they generated large amounts of data, which they then needed to access, visualize and publish for further research. Due to the heterogeneity and size of the resulting data set, this posed a formidable problem.

The solution: the NewsEye demonstrator platform

The NewsEye demonstrator platform is a web-based application that was developed to solve this problem. It links the data from the project with the original images and allows users to view and browse them interactively. Furthermore, the content is searchable and can be structured by creating and enriching your own collections.

2. Why did we integrate it into ONB Labs?

Conclusion of NewsEye project

As the NewsEye project officially ended at the beginning of 2022, we discussed with our colleagues from the project how the results (including research data as well as software created during the project) could be sustainably transferred to the individual project partners.

It became clear during these discussions that the data extracted from the Austrian newspapers (article segmentation, contained text, recognized named entities) and the demonstrator platform itself were good candidates for such a transfer, since the ONB cannot publicly release the images of the four Austrian newspapers in the project (Neue Freie Presse, Illustrierte Kronen Zeitung, Innsbrucker Nachrichten and Arbeiter Zeitung) for copyright reasons.

Thematic fit with our interests at ONB Labs

Furthermore, working on topics related to layout analysis, OCR and Natural Language Processing (NLP) techniques at ONB Labs fits well with our general strategy. As we are always looking for ways to expand our knowledge and skills in this direction, integrating part of the project was a welcome addition to our tool arsenal.

3. Core features of the platform

Searching the (digitized) text of all contained issues

The central and most prominent feature is the ability to perform search queries on the complete text corpus. The Solr index in the backend contains the text of each segmented article as well as recognized named entities, which are further linked to a Wikidata entry where possible. This means that users can perform complex queries via Solr's query parser; see the Solr Query Documentation for more information.

An example Solr query that searches the platform for the word Donauschifffahrt (shipping on the Danube) but excludes the term Wien and limits the results to the period from 1900 to 1939 would be:
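One possible form of this query is sketched below. Note that the exact field names depend on the index schema; here we assume that the article text is the default search field and that the publication date is indexed in a field called `date`:

```
Donauschifffahrt AND NOT Wien AND date:[1900-01-01T00:00:00Z TO 1939-12-31T23:59:59Z]
```

In Solr's standard query parser, `AND NOT` excludes documents matching the following term, and the square-bracket syntax expresses an inclusive range over the date field.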


The results can be ranked by relevance (default) or by publication date (ascending or descending), and a preview image of the article containing the found term is displayed next to the title of the article. The user can then browse the newspaper in which the article is located.

Article separation and manual alignment of compound articles

In the project, the scanned newspaper images were segmented into separate articles with the help of machine learning pipelines. During layout analysis, the columns were split into smaller units, and a bounding box around each segmented article, together with the contained text, was stored. Often this meant segmentation of a column at the paragraph level, which resulted in over-segmentation, since one article can consist of multiple paragraphs. Therefore, the compound article feature was implemented to allow users to manually combine and group smaller segments into a bigger segment containing the whole article.

Collection building of issues, articles and compound articles

To store and more easily access search results, users can add single articles, compound articles as well as whole issues to their collections. These user collections are initially private and can later be published for all other users of the platform. These collections, or datasets as they are called on the platform, can be assembled manually or via search query.

Semantic connection of recognized named entities in the text

Each searchable document in the Solr index is enriched with recognized named entities. These are displayed next to the search results, when browsing a dataset, and when the user views an article, compound article or issue. In a post-processing step, the named entities were linked to a Wikidata entry whenever a unique match was possible.

4. Future direction and development

Prototype stage of the Annolyzer

Note that the current release of the Annolyzer is at the prototype stage, and the current feature set may change in the future. We will continue to develop and expand the platform's features in collaboration with our project partners at the University of La Rochelle. This way, both sides will benefit from each other's future developments.

Additional features and improvements

Our current plan is to include more historical newspaper issues from the ANNO corpus and to perform article segmentation as well as named entity recognition on them. We also want to improve the feature set for collections so that users can search within them and perform further data analysis on them.

The source code for the application is publicly available on our GitLab repository and we welcome contributions from the Labs Community. Drop us an email if you would like to get involved or want to provide some feedback.