3.3 - Text - Download OCR Text

We need lots of text from old newspapers, preferably with plenty of errors caused by bad OCR.

Data source (ANNO dataset at ONB Labs): https://labs.onb.ac.at/en/dataset/anno/

ALTO conversion tool (alto-tools): https://github.com/cneud/alto-tools

To get to this text, we have to:

  • Find a newspaper issue we'd like to harvest
  • Download the IIIF manifest for this newspaper issue
  • Download the ALTO-XML files for this newspaper issue
  • Convert the ALTO-XML to TXT

Find a Newspaper Issue

Let's take a look at the metadata for the ONB Labs' historic newspapers. Each row describes one issue.

In [1]:
import pandas as pd

meta = pd.read_csv('https://labs.onb.ac.at/gitlab/labs-team/raw-metadata/raw/master/anno_labs_issues.csv.bz2', compression='bz2')
In [2]:
meta.sample(10)
Out[2]:
| | manifest_id | aid | year | day | dc_title | dc_title_additional | subjects | place_of_publications | languages | dc_type | ... | meta_type | ini_type | modification_datetime | longer_page_id | dc_date | link_pdf | link_old | has_ocr | meta_id | page_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 47898 | bor18220712 | bor | 1822 | 18220712 | Amtliches Cursblatt der Wiener Börse | NaN | Wirtschaft | Wien | de | newspaper | ... | zeitungen | anno | 2013-04-22 14:12:28 | 0 | 1822-07-12 | http://anno.onb.ac.at/cgi-content/anno_pdf.pl?... | http://anno.onb.ac.at/cgi-content/anno?aid=bor... | 0 | 81750 | 2 |
| 149247 | lvb18740301 | lvb | 1874 | 18740301 | Linzer Volksblatt | NaN | Tageszeitung | Linz | de | newspaper | ... | zeitungen | anno | 2010-11-29 09:39:04 | 0 | 1874-03-01 | http://anno.onb.ac.at/cgi-content/anno_pdf.pl?... | http://anno.onb.ac.at/cgi-content/anno?aid=lvb... | 1 | 776681 | 6 |
| 48126 | bor18230417 | bor | 1823 | 18230417 | Amtliches Cursblatt der Wiener Börse | NaN | Wirtschaft | Wien | de | newspaper | ... | zeitungen | anno | 2013-04-22 14:13:30 | 0 | 1823-04-17 | http://anno.onb.ac.at/cgi-content/anno_pdf.pl?... | http://anno.onb.ac.at/cgi-content/anno?aid=bor... | 0 | 81978 | 2 |
| 80491 | neu18670531 | neu | 1867 | 18670531 | Die Neuzeit | NaN | Tageszeitung | Wien | de | newspaper | ... | zeitungen | anno | 2012-11-19 11:38:11 | 0 | 1867-05-31 | http://anno.onb.ac.at/cgi-content/anno_pdf.pl?... | http://anno.onb.ac.at/cgi-content/anno?aid=neu... | 1 | 309977 | 12 |
| 138634 | joe18710415 | joe | 1871 | 18710415 | Jörgel Briefe | NaN | Wochenzeitung | Wien | de | newspaper | ... | zeitungen | anno | 2009-04-02 10:54:27 | 0 | 1871-04-15 | http://anno.onb.ac.at/cgi-content/anno_pdf.pl?... | http://anno.onb.ac.at/cgi-content/anno?aid=joe... | 1 | 762724 | 16 |
| 75189 | iwe18730304 | iwe | 1873 | 18730304 | Illustrirtes Wiener Extrablatt | NaN | Tageszeitung | Wien | de | newspaper | ... | zeitungen | anno | 2014-07-25 11:26:13 | 0 | 1873-03-04 | http://anno.onb.ac.at/cgi-content/anno_pdf.pl?... | http://anno.onb.ac.at/cgi-content/anno?aid=iwe... | 1 | 220835 | 8 |
| 193395 | wtz18680917 | wtz | 1868 | 18680917 | Theaterzettel (Oper und Burgtheater in Wien) | NaN | Kultur, Kunst, Theater, Musik | Wien | de | newspaper | ... | zeitungen | anno | 2014-03-21 10:28:59 | 0 | 1868-09-17 | http://anno.onb.ac.at/cgi-content/anno_pdf.pl?... | http://anno.onb.ac.at/cgi-content/anno?aid=wtz... | 0 | 973845 | 1 |
| 69095 | ode18630730 | ode | 1863 | 18630730 | Ost-Deutsche Post | NaN | Tageszeitung | Wien | de | newspaper | ... | zeitungen | anno | 2018-08-29 09:15:11 | 0 | 1863-07-30 | http://anno.onb.ac.at/cgi-content/anno_pdf.pl?... | http://anno.onb.ac.at/cgi-content/anno?aid=ode... | 1 | 183458 | 4 |
| 125850 | hum18450125 | hum | 1845 | 18450125 | Der Humorist | NaN | Humor, Satire, Geschichte | Wien | de | newspaper | ... | zeitungen | anno | 2003-11-20 12:06:01 | 0 | 1845-01-25 | http://anno.onb.ac.at/cgi-content/anno_pdf.pl?... | http://anno.onb.ac.at/cgi-content/anno?aid=hum... | 1 | 729148 | 12 |
| 152963 | mop18540720 | mop | 1854 | 18540720 | Morgen-Post | NaN | Tageszeitung | Wien | de | newspaper | ... | zeitungen | anno | 2012-12-11 13:45:49 | 0 | 1854-07-20 | http://anno.onb.ac.at/cgi-content/anno_pdf.pl?... | http://anno.onb.ac.at/cgi-content/anno?aid=mop... | 1 | 801195 | 4 |

10 rows × 21 columns

Let's go with the Ost-Deutsche Post issue of 30 July 1863. It has 4 pages, and has_ocr is 1.

In [3]:
manifest_id = 'ode18630730'
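
Picking an issue by eye works for one example. For harvesting more text, it may help to filter the metadata first. The sketch below assumes that has_ocr == 1 marks issues with OCR text available; the column name suggests this, but we have not verified it. It uses a tiny hand-made table (meta_demo, values copied from the sample above) instead of the full download.

```python
import pandas as pd

# Tiny stand-in for the real `meta` table; values copied from the sample above.
meta_demo = pd.DataFrame({
    'manifest_id': ['bor18220712', 'lvb18740301', 'ode18630730', 'wtz18680917'],
    'dc_title': ['Amtliches Cursblatt der Wiener Börse', 'Linzer Volksblatt',
                 'Ost-Deutsche Post', 'Theaterzettel (Oper und Burgtheater in Wien)'],
    'has_ocr': [0, 1, 1, 0],
    'page_count': [2, 6, 4, 1],
})

# Assumption: has_ocr == 1 means OCR text exists for the issue.
with_ocr = meta_demo[meta_demo['has_ocr'] == 1]

# Manifest ids of the remaining candidates, in table order.
candidate_ids = with_ocr['manifest_id'].tolist()
print(candidate_ids)
```

The same filter should work on the full meta table, as long as the column is numeric there as well.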

Download the IIIF Manifest

According to the SACHA API description, the link for the IIIF manifest has to look like this:

http://iiif.onb.ac.at/presentation/ANNO/ode18630730/manifest
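
Judging from the metadata sample, a manifest_id looks like the newspaper's aid followed by the issue date as YYYYMMDD (ode + 18630730). Under that assumption, the link can be built from its parts. The helper names make_manifest_id and manifest_url are made up for this sketch; they are not part of any ONB API.

```python
from datetime import date

# Link pattern as shown above.
BASE = 'http://iiif.onb.ac.at/presentation/ANNO'

def make_manifest_id(aid, issue_date):
    """Join newspaper id and date, e.g. 'ode' + 1863-07-30 -> 'ode18630730'.

    Assumption: manifest ids always follow the aid + YYYYMMDD pattern
    seen in the metadata sample.
    """
    return f'{aid}{issue_date.year:04d}{issue_date.month:02d}{issue_date.day:02d}'

def manifest_url(manifest_id):
    """Build the IIIF manifest link for one issue."""
    return f'{BASE}/{manifest_id}/manifest'

print(manifest_url(make_manifest_id('ode', date(1863, 7, 30))))
```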

In [4]:
import requests
In [5]:
r = requests.get(f'http://iiif.onb.ac.at/presentation/ANNO/{manifest_id}/manifest')
In [6]:
r.json()
Out[6]:
{'@context': 'https://iiif.io/api/presentation/2/context.json',
 '@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/manifest',
 '@type': 'sc:Manifest',
 'label': 'Ost-Deutsche Post 1863-07-30',
 'metadata': [{'label': [{'@value': 'Id', '@language': 'en'},
    {'@value': 'Id', '@language': 'ger'}],
   'value': 'ode18630730'},
  {'label': [{'@value': 'Title', '@language': 'en'},
    {'@value': 'Titel', '@language': 'ger'}],
   'value': 'Ost-Deutsche Post'},
  {'label': [{'@value': 'Type', '@language': 'en'},
    {'@value': 'Typ', '@language': 'ger'}],
   'value': 'newspaper'},
  {'label': [{'@value': 'Place of Publications', '@language': 'en'},
    {'@value': 'Erscheinungsort', '@language': 'ger'}],
   'value': "<a href='http://d-nb.info/gnd/4066009-6'>Wien</a>"},
  {'label': [{'@value': 'Date Issued', '@language': 'en'},
    {'@value': 'Erscheinungsdatum', '@language': 'ger'}],
   'value': '1863-07-30'},
  {'label': [{'@value': 'Subject Heading', '@language': 'en'},
    {'@value': 'Schlagworte', '@language': 'ger'}],
   'value': "<a href='http://d-nb.info/gnd/4067510-5'>Tageszeitung</a>"},
  {'label': [{'@value': 'Disseminator', '@language': 'en'},
    {'@value': 'Anbieter', '@language': 'ger'}],
   'value': "<a href='http://anno.onb.ac.at/'>Austrian Newspapers Online</a>"},
  {'label': [{'@value': 'Languages', '@language': 'en'},
    {'@value': 'Sprachen', '@language': 'ger'}],
   'value': 'ger'}],
 'description': 'Ost-Deutsche Post 1863-07-30',
 'viewingDirection': 'left-to-right',
 'viewingHint': 'paged',
 'license': 'http://creativecommons.org/publicdomain/mark/1.0/',
 'attribution': [{'@value': 'Austrian National Library', '@language': 'en'},
  {'@value': 'Österreichische Nationalbibliothek', '@language': 'ger'}],
 'logo': 'https://iiif.onb.ac.at/logo/',
 'seeAlso': [{'@id': 'http://anno.onb.ac.at/cgi-content/anno_pdf.pl?aid=ode&datum=18630730',
   'format': 'application/pdf'},
  {'@id': 'http://anno.onb.ac.at/cgi-content/anno?aid=ode&datum=18630730',
   'format': 'text/html'},
  {'@id': 'http://data.onb.ac.at/ANNO/ode18630730.rdf',
   'format': 'application/rdf+xml'}],
 'sequences': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
   '@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/sequence/normal',
   '@type': 'sc:Sequence',
   'startCanvas': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/canvas/00000001',
   'canvases': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
     '@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/canvas/00000001',
     '@type': 'sc:Canvas',
     'label': '00000001',
     'height': 6148,
     'width': 4456,
     'metadata': [{'label': 'Resolution', 'value': '300dpi'},
      {'label': 'Color Depth', 'value': '8bpp'}],
     'images': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
       '@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/annotation/00000001',
       '@type': 'oa:Annotation',
       'motivation': 'sc:painting',
       'resource': {'@id': 'https://iiif.onb.ac.at/images/ANNO/ode18630730/00000001/full/full/0/default.jpg',
        '@type': 'dctypes:Image',
        'height': 6148,
        'width': 4456,
        'format': 'image/jpeg',
        'service': {'@context': 'https://iiif.io/api/image/2/context.json',
         '@id': 'https://iiif.onb.ac.at/images/ANNO/ode18630730/00000001',
         'profile': 'https://iiif.io/api/image/2/level2.json'}},
       'on': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/canvas/00000001'}],
     'otherContent': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
       '@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/resource/00000001.json',
       '@type': 'sc:AnnotationList',
       'resources': [{'@type': 'oa:Annotation',
         'motivation': 'sc:painting',
         'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/resource/00000001.xml',
          '@type': 'dctypes:Text',
          'format': 'application/xml+alto'},
         'on': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/canvas/00000001'}]}]},
    {'@context': 'https://iiif.io/api/presentation/2/context.json',
     '@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/canvas/00000002',
     '@type': 'sc:Canvas',
     'label': '00000002',
     'height': 6176,
     'width': 4444,
     'metadata': [{'label': 'Resolution', 'value': '300dpi'},
      {'label': 'Color Depth', 'value': '8bpp'}],
     'images': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
       '@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/annotation/00000002',
       '@type': 'oa:Annotation',
       'motivation': 'sc:painting',
       'resource': {'@id': 'https://iiif.onb.ac.at/images/ANNO/ode18630730/00000002/full/full/0/default.jpg',
        '@type': 'dctypes:Image',
        'height': 6176,
        'width': 4444,
        'format': 'image/jpeg',
        'service': {'@context': 'https://iiif.io/api/image/2/context.json',
         '@id': 'https://iiif.onb.ac.at/images/ANNO/ode18630730/00000002',
         'profile': 'https://iiif.io/api/image/2/level2.json'}},
       'on': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/canvas/00000002'}],
     'otherContent': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
       '@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/resource/00000002.json',
       '@type': 'sc:AnnotationList',
       'resources': [{'@type': 'oa:Annotation',
         'motivation': 'sc:painting',
         'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/resource/00000002.xml',
          '@type': 'dctypes:Text',
          'format': 'application/xml+alto'},
         'on': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/canvas/00000002'}]}]},
    {'@context': 'https://iiif.io/api/presentation/2/context.json',
     '@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/canvas/00000003',
     '@type': 'sc:Canvas',
     'label': '00000003',
     'height': 6148,
     'width': 4456,
     'metadata': [{'label': 'Resolution', 'value': '300dpi'},
      {'label': 'Color Depth', 'value': '8bpp'}],
     'images': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
       '@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/annotation/00000003',
       '@type': 'oa:Annotation',
       'motivation': 'sc:painting',
       'resource': {'@id': 'https://iiif.onb.ac.at/images/ANNO/ode18630730/00000003/full/full/0/default.jpg',
        '@type': 'dctypes:Image',
        'height': 6148,
        'width': 4456,
        'format': 'image/jpeg',
        'service': {'@context': 'https://iiif.io/api/image/2/context.json',
         '@id': 'https://iiif.onb.ac.at/images/ANNO/ode18630730/00000003',
         'profile': 'https://iiif.io/api/image/2/level2.json'}},
       'on': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/canvas/00000003'}],
     'otherContent': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
       '@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/resource/00000003.json',
       '@type': 'sc:AnnotationList',
       'resources': [{'@type': 'oa:Annotation',
         'motivation': 'sc:painting',
         'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/resource/00000003.xml',
          '@type': 'dctypes:Text',
          'format': 'application/xml+alto'},
         'on': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/canvas/00000003'}]}]},
    {'@context': 'https://iiif.io/api/presentation/2/context.json',
     '@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/canvas/00000004',
     '@type': 'sc:Canvas',
     'label': '00000004',
     'height': 6176,
     'width': 4416,
     'metadata': [{'label': 'Resolution', 'value': '300dpi'},
      {'label': 'Color Depth', 'value': '8bpp'}],
     'images': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
       '@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/annotation/00000004',
       '@type': 'oa:Annotation',
       'motivation': 'sc:painting',
       'resource': {'@id': 'https://iiif.onb.ac.at/images/ANNO/ode18630730/00000004/full/full/0/default.jpg',
        '@type': 'dctypes:Image',
        'height': 6176,
        'width': 4416,
        'format': 'image/jpeg',
        'service': {'@context': 'https://iiif.io/api/image/2/context.json',
         '@id': 'https://iiif.onb.ac.at/images/ANNO/ode18630730/00000004',
         'profile': 'https://iiif.io/api/image/2/level2.json'}},
       'on': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/canvas/00000004'}],
     'otherContent': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
       '@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/resource/00000004.json',
       '@type': 'sc:AnnotationList',
       'resources': [{'@type': 'oa:Annotation',
         'motivation': 'sc:painting',
         'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/resource/00000004.xml',
          '@type': 'dctypes:Text',
          'format': 'application/xml+alto'},
         'on': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/canvas/00000004'}]}]}]}]}

There's a lot of information in there. We only need the otherContent blocks, which hold the links to the ALTO-XML resources.

Let's use jsonpath-ng to pull them out.

In [7]:
from jsonpath_ng import parse
In [8]:
def jp(http_response, parser):
    return [match.value for match in parser.find(http_response.json())]
In [9]:
resource_parser = parse('$.sequences[*].canvases[*].otherContent[*].resources')
In [10]:
jp(r, resource_parser)
Out[10]:
[[{'@type': 'oa:Annotation',
   'motivation': 'sc:painting',
   'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/resource/00000001.xml',
    '@type': 'dctypes:Text',
    'format': 'application/xml+alto'},
   'on': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/canvas/00000001'}],
 [{'@type': 'oa:Annotation',
   'motivation': 'sc:painting',
   'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/resource/00000002.xml',
    '@type': 'dctypes:Text',
    'format': 'application/xml+alto'},
   'on': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/canvas/00000002'}],
 [{'@type': 'oa:Annotation',
   'motivation': 'sc:painting',
   'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/resource/00000003.xml',
    '@type': 'dctypes:Text',
    'format': 'application/xml+alto'},
   'on': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/canvas/00000003'}],
 [{'@type': 'oa:Annotation',
   'motivation': 'sc:painting',
   'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/resource/00000004.xml',
    '@type': 'dctypes:Text',
    'format': 'application/xml+alto'},
   'on': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/canvas/00000004'}]]

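Not quite there yet: each match is still a list of annotations, one list per canvas. Extending the path with [*].resource unpacks those lists and gives us the resource entries directly.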

In [11]:
all_resources = parse('$.sequences[*].canvases[*].otherContent[*].resources[*].resource')
In [12]:
jp(r, all_resources)
Out[12]:
[{'@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/resource/00000001.xml',
  '@type': 'dctypes:Text',
  'format': 'application/xml+alto'},
 {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/resource/00000002.xml',
  '@type': 'dctypes:Text',
  'format': 'application/xml+alto'},
 {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/resource/00000003.xml',
  '@type': 'dctypes:Text',
  'format': 'application/xml+alto'},
 {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/resource/00000004.xml',
  '@type': 'dctypes:Text',
  'format': 'application/xml+alto'}]

Keep only the entries whose format is application/xml+alto, and from those take only the @id:

In [13]:
ids = [d['@id'] for d in jp(r, all_resources) if d['format'] == 'application/xml+alto']
In [14]:
ids
Out[14]:
['https://iiif.onb.ac.at/presentation/ANNO/ode18630730/resource/00000001.xml',
 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/resource/00000002.xml',
 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/resource/00000003.xml',
 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/resource/00000004.xml']
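
For comparison, the same walk through the manifest can be written as plain nested loops, without jsonpath-ng. This is only a sketch of what the path expression does. The function name alto_links is made up, and demo_manifest is a heavily trimmed, partly invented stand-in for the real manifest (the second resource with format text/plain does not occur in the real data and is there only to show the filter).

```python
def alto_links(manifest):
    """Collect ALTO-XML links from a IIIF Presentation 2 manifest dict."""
    links = []
    for sequence in manifest.get('sequences', []):
        for canvas in sequence.get('canvases', []):
            for annotation_list in canvas.get('otherContent', []):
                for annotation in annotation_list.get('resources', []):
                    resource = annotation.get('resource', {})
                    # Same filter as in the list comprehension above.
                    if resource.get('format') == 'application/xml+alto':
                        links.append(resource['@id'])
    return links

# Trimmed stand-in for r.json(); only the keys the loops touch.
demo_manifest = {
    'sequences': [{
        'canvases': [{
            'otherContent': [{
                'resources': [
                    {'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/ode18630730/resource/00000001.xml',
                                  'format': 'application/xml+alto'}},
                    {'resource': {'@id': 'https://example.org/not-alto.txt',
                                  'format': 'text/plain'}},
                ],
            }],
        }],
    }],
}

print(alto_links(demo_manifest))
```

The loops are longer than the path expression, but they make it easy to see where each [*] in the path comes from.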

Download the ALTO Files

In [15]:
alto_storage = {}

for xml_link in ids:
    r = requests.get(xml_link)
    if r.ok:
        alto_storage[xml_link] = r.text
In [16]:
alto_storage
Out[16]:
{}
In [17]:
r
Out[17]:
<Response [400]>
In [18]:
r.ok
Out[18]:
False

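Uh oh. The last request came back with HTTP 400 (Bad Request), and since alto_storage is empty, none of the four downloads succeeded.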
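
The loop above drops failed requests without a trace, so we only noticed the problem by inspecting r afterwards. A variant that also records the failures could look like the sketch below. It does not explain why the server answers with 400; it only makes the failures visible. The names download_all and FakeResponse are invented for this example, and the fake responses stand in for the network so the sketch runs offline.

```python
def download_all(urls, get):
    """Fetch every URL with `get`; return (texts, failures) as two dicts."""
    texts, failures = {}, {}
    for url in urls:
        response = get(url)
        if response.ok:
            texts[url] = response.text
        else:
            # Keep status code and the start of the body for debugging.
            failures[url] = (response.status_code, response.text[:200])
    return texts, failures

class FakeResponse:
    """Minimal stand-in for requests.Response, for an offline demo only."""
    def __init__(self, status_code, text):
        self.status_code = status_code
        self.text = text
        self.ok = status_code < 400

def fake_get(url):
    # Pretend that page 1 downloads fine and everything else fails.
    if url.endswith('00000001.xml'):
        return FakeResponse(200, '<alto/>')
    return FakeResponse(400, 'Bad Request')

texts, failures = download_all(['page/00000001.xml', 'page/00000002.xml'], fake_get)
print(failures)
```

With the real data, the call would be download_all(ids, requests.get).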

Convert the ALTO-XML to TXT

In [19]:
import alto_tools

def alto_extract_text_lines(xml, xmlns):
    text_lines = []
    nsdict = {'alto': xmlns}
    for text_line in xml.iterfind('.//alto:TextLine', nsdict):
        words = [string.attrib.get('CONTENT', '') for string in text_line.findall('alto:String', nsdict)]
        text_lines.append(' '.join(words))
    return '\n'.join(text_lines)

def alto_to_text(raw_alto_text):
    alto, xml, xmlns = alto_tools.alto_parse(raw_alto_text)
    return alto_extract_text_lines(xml, xmlns)
In [20]:
print(alto_to_text(f'http://iiif.onb.ac.at/presentation/ANNO/{manifest_id}/resource/00000002.xml'))
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-20-939f29efb91d> in <module>
----> 1 print(alto_to_text(f'http://iiif.onb.ac.at/presentation/ANNO/{manifest_id}/resource/00000002.xml'))

<ipython-input-19-bfbcde0cf185> in alto_to_text(raw_alto_text)
     10 
     11 def alto_to_text(raw_alto_text):
---> 12     alto, xml, xmlns = alto_tools.alto_parse(raw_alto_text)
     13     return alto_extract_text_lines(xml, xmlns)

~/labs/pydays19/alto_tools.py in alto_parse(alto)
     17     """ Convert ALTO xml file to element tree """
     18     try:
---> 19         xml = etree.parse(alto)
     20     except etree.ParseError as e:
     21         sys.stdout.write('\nERROR: Failed parsing "%s" - '

src/lxml/etree.pyx in lxml.etree.parse()

src/lxml/parser.pxi in lxml.etree._parseDocument()

src/lxml/parser.pxi in lxml.etree._parseDocumentFromURL()

src/lxml/parser.pxi in lxml.etree._parseDocFromFile()

src/lxml/parser.pxi in lxml.etree._BaseParser._parseDocFromFile()

src/lxml/parser.pxi in lxml.etree._ParserContext._handleParseResultDoc()

src/lxml/parser.pxi in lxml.etree._handleParseResult()

src/lxml/parser.pxi in lxml.etree._raiseParseError()

OSError: Error reading file 'http://iiif.onb.ac.at/presentation/ANNO/ode18630730/resource/00000002.xml': failed to load HTTP resource
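
The traceback shows alto_parse handing its argument straight to etree.parse, which expects a file name or file object. Here it received a URL, and lxml could not load it. Once the ALTO-XML is available as a string (for example from a successful download), it can be parsed directly. The sketch below uses the standard library's xml.etree.ElementTree instead of alto_tools, and reads the namespace from the root element. The function name alto_string_to_text and the SAMPLE_ALTO snippet are invented for illustration; the snippet is not taken from the real newspaper page.

```python
import xml.etree.ElementTree as ET

def alto_string_to_text(raw_alto):
    """Extract plain text from an ALTO-XML string, one TextLine per line."""
    root = ET.fromstring(raw_alto)
    # Namespaced tags look like '{namespace-uri}alto'; keep the '{...}' part.
    prefix = root.tag[:root.tag.index('}') + 1] if root.tag.startswith('{') else ''
    lines = []
    for text_line in root.iter(f'{prefix}TextLine'):
        # Each String element carries one word in its CONTENT attribute.
        words = [s.attrib.get('CONTENT', '') for s in text_line.findall(f'{prefix}String')]
        lines.append(' '.join(words))
    return '\n'.join(lines)

# Invented miniature ALTO document, for demonstration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Ost-Deutsche"/><SP/><String CONTENT="Post"/></TextLine>
    <TextLine><String CONTENT="Wien,"/><SP/><String CONTENT="1863"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(alto_string_to_text(SAMPLE_ALTO))
```

This only helps after the 400 response from the download step is resolved, because without the XML there is nothing to parse.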