I need loads of text from old newspapers, preferably with loads of errors due to bad OCR.
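In order to get to this text, we have to pick a newspaper issue, fetch its IIIF manifest, collect the links to the ALTO-XML files listed there, and extract the plain text from those files.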
Let's take a look at the metadata of the ONB Labs' historic newspapers:
import pandas as pd
meta = pd.read_csv('https://labs.onb.ac.at/gitlab/labs-team/raw-metadata/raw/master/anno_labs_issues.csv.bz2', compression='bz2')
meta.sample(10)
Let's go with the Ost-Deutsche Post issue from 30 July 1863:
manifest_id = 'ode18630730'
If we look at the SACHA API description, we see that the link for the IIIF manifest has to look like this:
http://iiif.onb.ac.at/presentation/ANNO/ode18630730/manifest
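The link pattern can be wrapped in a small helper so that other issues are easy to try. This is only a minimal sketch: the names `IIIF_BASE` and `manifest_url` are made up for this notebook and are not part of any ONB library.

```python
# Base of the IIIF presentation endpoint, taken from the link above
IIIF_BASE = 'http://iiif.onb.ac.at/presentation/ANNO'

def manifest_url(manifest_id):
    """Build the IIIF manifest link for one ANNO issue (hypothetical helper)."""
    return f'{IIIF_BASE}/{manifest_id}/manifest'

print(manifest_url('ode18630730'))
```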
import requests
r = requests.get(f'http://iiif.onb.ac.at/presentation/ANNO/{manifest_id}/manifest')
r.json()
There's a lot of information in there. We need the info blocks with links to ALTO-XML resources.
Let's use jsonpath-ng for that.
from jsonpath_ng import parse
def jp(http_response, parser):
    return [match.value for match in parser.find(http_response.json())]
resource_parser = parse('$.sequences[*].canvases[*].otherContent[*].resources')
jp(r, resource_parser)
Not quite there yet.
all_resources = parse('$.sequences[*].canvases[*].otherContent[*].resources[*].resource')
jp(r, all_resources)
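To make it clearer what the JSONPath expression walks through, here is roughly the same traversal in plain Python. It's a sketch on a hand-made miniature manifest: `toy_manifest`, its example.org links and the helper `iter_resources` are invented for illustration, and a real manifest contains many more keys.

```python
# Hand-made miniature of a IIIF manifest; all values are invented for illustration
toy_manifest = {
    'sequences': [{
        'canvases': [{
            'otherContent': [{
                'resources': [
                    {'resource': {'@id': 'http://example.org/page1.xml',
                                  'format': 'application/xml+alto'}},
                    {'resource': {'@id': 'http://example.org/page1.txt',
                                  'format': 'text/plain'}},
                ]
            }]
        }]
    }]
}

def iter_resources(manifest):
    """Yield every 'resource' dict, mirroring the path
    $.sequences[*].canvases[*].otherContent[*].resources[*].resource"""
    for sequence in manifest.get('sequences', []):
        for canvas in sequence.get('canvases', []):
            for content in canvas.get('otherContent', []):
                for entry in content.get('resources', []):
                    # Entries without a 'resource' key are skipped, like in JSONPath
                    if 'resource' in entry:
                        yield entry['resource']

toy_resources = list(iter_resources(toy_manifest))
print(toy_resources)
```

Each `[*]` in the path corresponds to one `for` loop, and a missing key simply yields nothing instead of raising an error.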
Filter just the ones with format application/xml+alto, and of those keep only the @id:
ids = [d['@id'] for d in jp(r, all_resources) if d['format'] == 'application/xml+alto']
ids
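One thing to watch out for: `d['format']` raises a `KeyError` if a resource has no `format` key at all. A slightly more defensive variant could look like the sketch below; the function name `alto_ids` and the sample data are assumptions for illustration only.

```python
ALTO_FORMAT = 'application/xml+alto'

def alto_ids(resources):
    """Return the @id of every resource whose format is ALTO-XML."""
    return [res['@id'] for res in resources
            if res.get('format') == ALTO_FORMAT and '@id' in res]

# Invented sample resources, not taken from a real manifest
sample = [
    {'@id': 'http://example.org/p1.xml', 'format': 'application/xml+alto'},
    {'@id': 'http://example.org/p1.txt', 'format': 'text/plain'},
    {'@id': 'http://example.org/p1.jpg'},  # no 'format' key at all
]
print(alto_ids(sample))
```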
alto_storage = {}
for xml_link in ids:
    r = requests.get(xml_link)
    if r.ok:
        alto_storage[xml_link] = r.text
alto_storage
r
r.ok
Uh oh.
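The loop silently drops every failed download, which makes it hard to tell what went wrong. A possible sketch that also records the HTTP status code of each failure is shown below. The function `fetch_all` is hypothetical; it takes the download function as a parameter (for example `requests.get`), so it only relies on the `ok`, `text` and `status_code` attributes of the response.

```python
def fetch_all(links, get):
    """Download every link with `get` (for example requests.get).

    Returns (texts, failures): texts maps link -> body of each successful
    response, failures maps link -> HTTP status code of each failed one.
    """
    texts, failures = {}, {}
    for link in links:
        response = get(link)
        if response.ok:
            texts[link] = response.text
        else:
            failures[link] = response.status_code
    return texts, failures

# Usage in this notebook (needs network access):
# alto_storage, failed = fetch_all(ids, requests.get)
# failed
```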
import alto_tools
def alto_extract_text_lines(xml, xmlns):
    text_lines = []
    nsdict = {'alto': xmlns}
    for text_line in xml.iterfind('.//alto:TextLine', nsdict):
        words = [string.attrib.get('CONTENT') for string in text_line.findall('alto:String', nsdict)]
        text_lines.append(' '.join(words))
    return '\n'.join(text_lines)
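import io
def alto_to_text(raw_alto_text):
    # Assumption: alto_parse hands its argument to ElementTree.parse, which needs
    # a path or file object, so the raw XML string is wrapped in a StringIO
    alto, xml, xmlns = alto_tools.alto_parse(io.StringIO(raw_alto_text))
    return alto_extract_text_lines(xml, xmlns)
# Download the ALTO-XML of one page first, then convert its content to plain text
r = requests.get(f'http://iiif.onb.ac.at/presentation/ANNO/{manifest_id}/resource/00000002.xml')
print(alto_to_text(r.text))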
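To check the extraction logic without network access or alto_tools, here is a small sketch that uses only the standard library on a hand-written ALTO fragment. The sample XML, its text, the ALTO v2 namespace and the function name `alto_string_to_text` are assumptions for illustration; real ANNO files are much larger and may use a different ALTO version.

```python
import xml.etree.ElementTree as ET

# Hand-written ALTO fragment; content and namespace version are assumptions
SAMPLE_ALTO = '''<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Ost-Deutsche"/><SP/><String CONTENT="Post"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Wien,"/><SP/><String CONTENT="30."/><SP/><String CONTENT="Juli"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>'''

def alto_string_to_text(raw_alto_text):
    """Join the CONTENT of all String elements, one output line per TextLine."""
    root = ET.fromstring(raw_alto_text)
    # The root tag looks like '{namespace}alto'; reuse that namespace for lookups
    xmlns = root.tag.split('}')[0].strip('{')
    nsdict = {'alto': xmlns}
    text_lines = []
    for text_line in root.iterfind('.//alto:TextLine', nsdict):
        words = [string.attrib.get('CONTENT', '')
                 for string in text_line.findall('alto:String', nsdict)]
        text_lines.append(' '.join(words))
    return '\n'.join(text_lines)

print(alto_string_to_text(SAMPLE_ALTO))
```

Reading the namespace from the root element keeps the lookup independent of the ALTO version, as long as the file declares a default namespace.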