Commit a8e5fcf9 authored by Stefan Karner
Add ALTO extraction functions to 3.3

parent 90b1fbf7
+61 −0
%% Cell type:markdown id: tags:

# 3.3 - Text - Download OCR Text

*I need loads of text from old newspapers, preferably with loads of errors due to bad OCR.*

[https://labs.onb.ac.at/en/dataset/anno/](https://labs.onb.ac.at/en/dataset/anno/)

[https://github.com/cneud/alto-tools](https://github.com/cneud/alto-tools)

%% Cell type:markdown id: tags:

To get to this text, we have to:

* Find a newspaper issue we'd like to harvest
* Download the IIIF manifest for this newspaper issue
* Download the ALTO-XML files for this newspaper issue
* Convert the ALTO-XML to TXT
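
The last step in the list, converting ALTO-XML to plain text, might be sketched roughly as follows. This is only an illustration under stated assumptions, not the method used later in this notebook or by alto-tools: the ALTO fragment is hand-written (not taken from the ONB data), and the helper names `local_name` and `alto_to_text` are made up. It relies on the fact that ALTO stores each recognised word in the `CONTENT` attribute of a `String` element inside a `TextLine`.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written ALTO fragment (an assumption, not real ONB data).
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Leitmeritzer"/><SP/><String CONTENT="Zeitung"/>
    </TextLine>
    <TextLine>
      <String CONTENT="2."/><SP/><String CONTENT="September"/><SP/><String CONTENT="1871"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tags."""
    return tag.rsplit('}', 1)[-1]


def alto_to_text(xml_string):
    """Return one line of text per ALTO TextLine, words joined by spaces."""
    root = ET.fromstring(xml_string)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == 'TextLine':
            # Direct children of a TextLine are String, SP and HYP elements;
            # only String elements carry the recognised words.
            words = [child.get('CONTENT', '') for child in element
                     if local_name(child.tag) == 'String']
            lines.append(' '.join(words))
    return '\n'.join(lines)


print(alto_to_text(ALTO_SAMPLE))
```

Matching on the local tag name rather than a fixed namespace is a design choice here, since ALTO files from different versions of the schema use different namespace URIs.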

%% Cell type:markdown id: tags:

### Find a Newspaper Issue

%% Cell type:markdown id: tags:

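Let's take a look at the [ONB Labs' historic newspapers](https://labs.onb.ac.at/en/dataset/anno/) dataset. Its issue-level metadata is published as a bz2-compressed CSV file, which pandas can read directly from the URL.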

%% Cell type:code id: tags:

``` python
import pandas as pd

meta = pd.read_csv('https://labs.onb.ac.at/gitlab/labs-team/raw-metadata/raw/master/anno_labs_issues.csv.bz2', compression='bz2')
```

%% Cell type:code id: tags:

``` python
meta.sample(10)
```

%% Output

              manifest_id  aid  year       day  \
    140468    kfz18410324  kfz  1841  18410324
    181233    wtz18281211  wtz  1828  18281211
    28465     apr18640419  apr  1864  18640419
    181260    wtz18290111  wtz  1829  18290111
    14201   bdc1868ag0123  bdc  1868  18680123
    143594    kfz18610212  kfz  1861  18610212
    47299     bor18200703  bor  1820  18200703
    218786    wrz18510418  wrz  1851  18510418
    77174     lmz18710902  lmz  1871  18710902
    153225    mop18550426  mop  1855  18550426
    
                                                dc_title dc_title_additional  \
    140468                          Klagenfurter Zeitung                 NaN
    181233  Theaterzettel (Oper und Burgtheater in Wien)                 NaN
    28465                                     Die Presse                 NaN
    181260  Theaterzettel (Oper und Burgtheater in Wien)                 NaN
    14201        Ordinariats-Blatt der Budweiser Diöcese                 NaN
    143594                          Klagenfurter Zeitung                 NaN
    47299           Amtliches Cursblatt der Wiener Börse                 NaN
    218786                                Wiener Zeitung                 NaN
    77174                           Leitmeritzer Zeitung                 NaN
    153225                                   Morgen-Post                 NaN
    
                                 subjects  \
    140468                   Tageszeitung
    181233  Kultur, Kunst, Theater, Musik
    28465                    Tageszeitung
    181260  Kultur, Kunst, Theater, Musik
    14201                        Religion
    143594                   Tageszeitung
    47299                      Wirtschaft
    218786                   Tageszeitung
    77174                    Tageszeitung
    153225                   Tageszeitung
    
                                        place_of_publications languages  \
    140468                                         Klagenfurt        de
    181233                                               Wien        de
    28465                                  Wien, Brno (Brünn)        de
    181260                                               Wien        de
    14201   Budweis (BudÄ›jovice, Budovicium, ÄŒeské BudÄ...        de
    143594                                         Klagenfurt        de
    47299                                                Wien        de
    218786                                               Wien        de
    77174                             Litoměřice (Leitmeritz)        de
    153225                                               Wien        de
    
               dc_type  ...  meta_type  ini_type modification_datetime  \
    140468   newspaper  ...  zeitungen      anno   2003-12-02 19:06:09
    181233   newspaper  ...  zeitungen      anno   2010-12-21 02:37:44
    28465    newspaper  ...  zeitungen      anno   2010-12-07 15:20:30
    181260   newspaper  ...  zeitungen      anno   2010-12-21 02:37:45
    14201   periodical  ...  periodika  annoplus   2015-06-12 07:57:34
    143594   newspaper  ...  zeitungen      anno   2013-09-20 15:24:13
    47299    newspaper  ...  zeitungen      anno   2013-04-22 14:09:40
    218786   newspaper  ...  zeitungen      anno   2010-12-28 15:27:30
    77174    newspaper  ...  zeitungen      anno   2011-01-26 09:31:40
    153225   newspaper  ...  zeitungen      anno   2012-12-11 13:46:38
    
           longer_page_id     dc_date  \
    140468              0  1841-03-24
    181233              0  1828-12-11
    28465               0  1864-04-19
    181260              0  1829-01-11
    14201               0        1868
    143594              0  1861-02-12
    47299               0  1820-07-03
    218786              0  1851-04-18
    77174               0  1871-09-02
    153225              0  1855-04-26
    
                                                     link_pdf  \
    140468  http://anno.onb.ac.at/cgi-content/anno_pdf.pl?...
    181233  http://anno.onb.ac.at/cgi-content/anno_pdf.pl?...
    28465   http://anno.onb.ac.at/cgi-content/anno_pdf.pl?...
    181260  http://anno.onb.ac.at/cgi-content/anno_pdf.pl?...
    14201   http://anno.onb.ac.at/cgi-content/anno_pdf.pl?...
    143594  http://anno.onb.ac.at/cgi-content/anno_pdf.pl?...
    47299   http://anno.onb.ac.at/cgi-content/anno_pdf.pl?...
    218786  http://anno.onb.ac.at/cgi-content/anno_pdf.pl?...
    77174   http://anno.onb.ac.at/cgi-content/anno_pdf.pl?...
    153225  http://anno.onb.ac.at/cgi-content/anno_pdf.pl?...
    
                                                     link_old has_ocr  meta_id  \
    140468  http://anno.onb.ac.at/cgi-content/anno?aid=kfz...       1   766277
    181233  http://anno.onb.ac.at/cgi-content/anno?aid=wtz...       0   961683
    28465   http://anno.onb.ac.at/cgi-content/anno?aid=apr...       1    14304
    181260  http://anno.onb.ac.at/cgi-content/anno?aid=wtz...       0   961710
    14201   http://anno.onb.ac.at/cgi-content/anno-plus?ai...       1  1048842
    143594  http://anno.onb.ac.at/cgi-content/anno?aid=kfz...       1   769403
    47299   http://anno.onb.ac.at/cgi-content/anno?aid=bor...       0    81151
    218786  http://anno.onb.ac.at/cgi-content/anno?aid=wrz...       1  1012484
    77174   http://anno.onb.ac.at/cgi-content/anno?aid=lmz...       1   269657
    153225  http://anno.onb.ac.at/cgi-content/anno?aid=mop...       1   801457
    
            page_count
    140468          20
    181233           1
    28465           12
    181260           1
    14201            8
    143594           6
    47299            2
    218786          28
    77174           12
    153225           4
    
    [10 rows x 21 columns]
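
%% Cell type:markdown id: tags:

In the sample above, `has_ocr` is 0 for the two *Theaterzettel* issues and the *Cursblatt* issue, so presumably there is no OCR text to download for them. It may therefore help to filter on that column before picking an issue. A minimal sketch, using a small hand-built frame with four rows copied from the sample instead of the full `meta` table; the names `sample` and `with_ocr` are made up for illustration:

```python
import pandas as pd

# Four rows copied from the sample output above (only a few columns).
sample = pd.DataFrame({
    'manifest_id': ['kfz18410324', 'wtz18281211', 'lmz18710902', 'bor18200703'],
    'dc_title': ['Klagenfurter Zeitung',
                 'Theaterzettel (Oper und Burgtheater in Wien)',
                 'Leitmeritzer Zeitung',
                 'Amtliches Cursblatt der Wiener Börse'],
    'has_ocr': [1, 0, 1, 0],
    'page_count': [20, 1, 12, 2],
})

# Keep only the issues that are flagged as having OCR text.
with_ocr = sample[sample['has_ocr'] == 1]
print(with_ocr['manifest_id'].tolist())
```

On the full table, the same filter would read `meta[meta['has_ocr'] == 1]`.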

%% Cell type:markdown id: tags:

Let's go with the *Leitmeritzer Zeitung* issue from 2 September 1871 (row 77174 in the sample above): its `has_ocr` flag is 1 and it has 12 pages.

%% Cell type:code id: tags:

``` python
manifest_id = 'lmz18710902'
```

%% Cell type:markdown id: tags:

### Download the IIIF Manifest

%% Cell type:markdown id: tags:

If we look at the [SACHA API description](https://iiif.onb.ac.at/api#_manifestrequestprocessor), we see that the link for the IIIF manifest has to look like this:

`http://iiif.onb.ac.at/presentation/ANNO/lmz18710902/manifest`
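
To avoid hard-coding the link for every issue, the URL could be built from the issue id. The sketch below assumes that the pattern generalises from the single example link above to other ANNO issues; `manifest_url` is a hypothetical helper, not part of the SACHA API or of any library.

```python
def manifest_url(manifest_id, base='http://iiif.onb.ac.at/presentation'):
    """Return the IIIF manifest URL for an ANNO issue id.

    The '/ANNO/<id>/manifest' pattern is an assumption generalised
    from the one example link shown in the text.
    """
    return f'{base}/ANNO/{manifest_id}/manifest'


print(manifest_url('lmz18710902'))
```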

%% Cell type:code id: tags:

``` python
import requests
```

%% Cell type:code id: tags:

``` python
# Fetch the manifest and stop early if the server returns an HTTP error
r = requests.get('http://iiif.onb.ac.at/presentation/ANNO/lmz18710902/manifest')
r.raise_for_status()
r.json()
```

%% Output

    {'@context': 'https://iiif.io/api/presentation/2/context.json',
     '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/manifest',
     '@type': 'sc:Manifest',
     'label': 'Leitmeritzer Zeitung 1871-09-02',
     'metadata': [{'label': [{'@value': 'Id', '@language': 'en'},
        {'@value': 'Id', '@language': 'ger'}],
       'value': 'lmz18710902'},
      {'label': [{'@value': 'Title', '@language': 'en'},
        {'@value': 'Titel', '@language': 'ger'}],
       'value': 'Leitmeritzer Zeitung'},
      {'label': [{'@value': 'Type', '@language': 'en'},
        {'@value': 'Typ', '@language': 'ger'}],
       'value': 'newspaper'},
      {'label': [{'@value': 'Place of Publications', '@language': 'en'},
        {'@value': 'Erscheinungsort', '@language': 'ger'}],
       'value': "<a href='http://d-nb.info/gnd/4074136-9'>Litoměřice (Leitmeritz)</a>"},
      {'label': [{'@value': 'Date Issued', '@language': 'en'},
        {'@value': 'Erscheinungsdatum', '@language': 'ger'}],
       'value': '1871-09-02'},
      {'label': [{'@value': 'Subject Heading', '@language': 'en'},
        {'@value': 'Schlagworte', '@language': 'ger'}],
       'value': "<a href='http://d-nb.info/gnd/4067510-5'>Tageszeitung</a>"},
      {'label': [{'@value': 'Disseminator', '@language': 'en'},
        {'@value': 'Anbieter', '@language': 'ger'}],
       'value': "<a href='http://anno.onb.ac.at/'>Austrian Newspapers Online</a>"},
      {'label': [{'@value': 'Languages', '@language': 'en'},
        {'@value': 'Sprachen', '@language': 'ger'}],
       'value': 'ger'}],
     'description': 'Leitmeritzer Zeitung 1871-09-02',
     'viewingDirection': 'left-to-right',
     'viewingHint': 'paged',
     'license': 'http://creativecommons.org/publicdomain/mark/1.0/',
     'attribution': [{'@value': 'Austrian National Library', '@language': 'en'},
      {'@value': 'Österreichische Nationalbibliothek', '@language': 'ger'}],
     'logo': 'https://iiif.onb.ac.at/logo/',
     'seeAlso': [{'@id': 'http://anno.onb.ac.at/cgi-content/anno_pdf.pl?aid=lmz&datum=18710902',
       'format': 'application/pdf'},
      {'@id': 'http://anno.onb.ac.at/cgi-content/anno?aid=lmz&datum=18710902',
       'format': 'text/html'},
      {'@id': 'http://data.onb.ac.at/ANNO/lmz18710902.rdf',
       'format': 'application/rdf+xml'}],
     'sequences': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
       '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/sequence/normal',
       '@type': 'sc:Sequence',
       'startCanvas': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000001',
       'canvases': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
         '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000001',
         '@type': 'sc:Canvas',
         'label': '00000001',
         'height': 3788,
         'width': 2819,
         'metadata': [{'label': 'Resolution', 'value': '0dpi'},
          {'label': 'Color Depth', 'value': '0bpp'}],
         'images': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
           '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/annotation/00000001',
           '@type': 'oa:Annotation',
           'motivation': 'sc:painting',
           'resource': {'@id': 'https://iiif.onb.ac.at/images/ANNO/lmz18710902/00000001/full/full/0/default.jpg',
            '@type': 'dctypes:Image',
            'height': 3788,
            'width': 2819,
            'format': 'image/jpeg',
            'service': {'@context': 'https://iiif.io/api/image/2/context.json',
             '@id': 'https://iiif.onb.ac.at/images/ANNO/lmz18710902/00000001',
             'profile': 'https://iiif.io/api/image/2/level2.json'}},
           'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000001'}],
         'otherContent': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
           '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000001.json',
           '@type': 'sc:AnnotationList',
           'resources': [{'@type': 'oa:Annotation',
             'motivation': 'sc:painting',
             'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000001.xml',
              '@type': 'dctypes:Text',
              'format': 'application/xml+alto'},
             'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000001'}]}]},
        {'@context': 'https://iiif.io/api/presentation/2/context.json',
         '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000002',
         '@type': 'sc:Canvas',
         'label': '00000002',
         'height': 3802,
         'width': 2822,
         'metadata': [{'label': 'Resolution', 'value': '0dpi'},
          {'label': 'Color Depth', 'value': '0bpp'}],
         'images': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
           '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/annotation/00000002',
           '@type': 'oa:Annotation',
           'motivation': 'sc:painting',
           'resource': {'@id': 'https://iiif.onb.ac.at/images/ANNO/lmz18710902/00000002/full/full/0/default.jpg',
            '@type': 'dctypes:Image',
            'height': 3802,
            'width': 2822,
            'format': 'image/jpeg',
            'service': {'@context': 'https://iiif.io/api/image/2/context.json',
             '@id': 'https://iiif.onb.ac.at/images/ANNO/lmz18710902/00000002',
             'profile': 'https://iiif.io/api/image/2/level2.json'}},
           'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000002'}],
         'otherContent': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
           '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000002.json',
           '@type': 'sc:AnnotationList',
           'resources': [{'@type': 'oa:Annotation',
             'motivation': 'sc:painting',
             'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000002.xml',
              '@type': 'dctypes:Text',
              'format': 'application/xml+alto'},
             'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000002'}]}]},
        {'@context': 'https://iiif.io/api/presentation/2/context.json',
         '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000003',
         '@type': 'sc:Canvas',
         'label': '00000003',
         'height': 3788,
         'width': 2819,
         'metadata': [{'label': 'Resolution', 'value': '0dpi'},
          {'label': 'Color Depth', 'value': '0bpp'}],
         'images': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
           '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/annotation/00000003',
           '@type': 'oa:Annotation',
           'motivation': 'sc:painting',
           'resource': {'@id': 'https://iiif.onb.ac.at/images/ANNO/lmz18710902/00000003/full/full/0/default.jpg',
            '@type': 'dctypes:Image',
            'height': 3788,
            'width': 2819,
            'format': 'image/jpeg',
            'service': {'@context': 'https://iiif.io/api/image/2/context.json',
             '@id': 'https://iiif.onb.ac.at/images/ANNO/lmz18710902/00000003',
             'profile': 'https://iiif.io/api/image/2/level2.json'}},
           'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000003'}],
         'otherContent': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
           '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000003.json',
           '@type': 'sc:AnnotationList',
           'resources': [{'@type': 'oa:Annotation',
             'motivation': 'sc:painting',
             'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000003.xml',
              '@type': 'dctypes:Text',
              'format': 'application/xml+alto'},
             'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000003'}]}]},
        {'@context': 'https://iiif.io/api/presentation/2/context.json',
         '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000004',
         '@type': 'sc:Canvas',
         'label': '00000004',
         'height': 3802,
         'width': 2822,
         'metadata': [{'label': 'Resolution', 'value': '0dpi'},
          {'label': 'Color Depth', 'value': '0bpp'}],
         'images': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
           '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/annotation/00000004',
           '@type': 'oa:Annotation',
           'motivation': 'sc:painting',
           'resource': {'@id': 'https://iiif.onb.ac.at/images/ANNO/lmz18710902/00000004/full/full/0/default.jpg',
            '@type': 'dctypes:Image',
            'height': 3802,
            'width': 2822,
            'format': 'image/jpeg',
            'service': {'@context': 'https://iiif.io/api/image/2/context.json',
             '@id': 'https://iiif.onb.ac.at/images/ANNO/lmz18710902/00000004',
             'profile': 'https://iiif.io/api/image/2/level2.json'}},
           'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000004'}],
         'otherContent': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
           '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000004.json',
           '@type': 'sc:AnnotationList',
           'resources': [{'@type': 'oa:Annotation',
             'motivation': 'sc:painting',
             'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000004.xml',
              '@type': 'dctypes:Text',
              'format': 'application/xml+alto'},
             'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000004'}]}]},
        {'@context': 'https://iiif.io/api/presentation/2/context.json',
         '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000005',
         '@type': 'sc:Canvas',
         'label': '00000005',
         'height': 3788,
         'width': 2819,
         'metadata': [{'label': 'Resolution', 'value': '0dpi'},
          {'label': 'Color Depth', 'value': '0bpp'}],
         'images': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
           '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/annotation/00000005',
           '@type': 'oa:Annotation',
           'motivation': 'sc:painting',
           'resource': {'@id': 'https://iiif.onb.ac.at/images/ANNO/lmz18710902/00000005/full/full/0/default.jpg',
            '@type': 'dctypes:Image',
            'height': 3788,
            'width': 2819,
            'format': 'image/jpeg',
            'service': {'@context': 'https://iiif.io/api/image/2/context.json',
             '@id': 'https://iiif.onb.ac.at/images/ANNO/lmz18710902/00000005',
             'profile': 'https://iiif.io/api/image/2/level2.json'}},
           'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000005'}],
         'otherContent': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
           '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000005.json',
           '@type': 'sc:AnnotationList',
           'resources': [{'@type': 'oa:Annotation',
             'motivation': 'sc:painting',
             'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000005.xml',
              '@type': 'dctypes:Text',
              'format': 'application/xml+alto'},
             'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000005'}]}]},
        {'@context': 'https://iiif.io/api/presentation/2/context.json',
         '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000006',
         '@type': 'sc:Canvas',
         'label': '00000006',
         'height': 3802,
         'width': 2822,
         'metadata': [{'label': 'Resolution', 'value': '0dpi'},
          {'label': 'Color Depth', 'value': '0bpp'}],
         'images': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
           '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/annotation/00000006',
           '@type': 'oa:Annotation',
           'motivation': 'sc:painting',
           'resource': {'@id': 'https://iiif.onb.ac.at/images/ANNO/lmz18710902/00000006/full/full/0/default.jpg',
            '@type': 'dctypes:Image',
            'height': 3802,
            'width': 2822,
            'format': 'image/jpeg',
            'service': {'@context': 'https://iiif.io/api/image/2/context.json',
             '@id': 'https://iiif.onb.ac.at/images/ANNO/lmz18710902/00000006',
             'profile': 'https://iiif.io/api/image/2/level2.json'}},
           'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000006'}],
         'otherContent': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
           '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000006.json',
           '@type': 'sc:AnnotationList',
           'resources': [{'@type': 'oa:Annotation',
             'motivation': 'sc:painting',
             'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000006.xml',
              '@type': 'dctypes:Text',
              'format': 'application/xml+alto'},
             'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000006'}]}]},
        {'@context': 'https://iiif.io/api/presentation/2/context.json',
         '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000007',
         '@type': 'sc:Canvas',
         'label': '00000007',
         'height': 3788,
         'width': 2819,
         'metadata': [{'label': 'Resolution', 'value': '0dpi'},
          {'label': 'Color Depth', 'value': '0bpp'}],
         'images': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
           '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/annotation/00000007',
           '@type': 'oa:Annotation',
           'motivation': 'sc:painting',
           'resource': {'@id': 'https://iiif.onb.ac.at/images/ANNO/lmz18710902/00000007/full/full/0/default.jpg',
            '@type': 'dctypes:Image',
            'height': 3788,
            'width': 2819,
            'format': 'image/jpeg',
            'service': {'@context': 'https://iiif.io/api/image/2/context.json',
             '@id': 'https://iiif.onb.ac.at/images/ANNO/lmz18710902/00000007',
             'profile': 'https://iiif.io/api/image/2/level2.json'}},
           'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000007'}],
         'otherContent': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
           '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000007.json',
           '@type': 'sc:AnnotationList',
           'resources': [{'@type': 'oa:Annotation',
             'motivation': 'sc:painting',
             'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000007.xml',
              '@type': 'dctypes:Text',
              'format': 'application/xml+alto'},
             'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000007'}]}]},
        {'@context': 'https://iiif.io/api/presentation/2/context.json',
         '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000008',
         '@type': 'sc:Canvas',
         'label': '00000008',
         'height': 3802,
         'width': 2822,
         'metadata': [{'label': 'Resolution', 'value': '0dpi'},
          {'label': 'Color Depth', 'value': '0bpp'}],
         'images': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
           '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/annotation/00000008',
           '@type': 'oa:Annotation',
           'motivation': 'sc:painting',
           'resource': {'@id': 'https://iiif.onb.ac.at/images/ANNO/lmz18710902/00000008/full/full/0/default.jpg',
            '@type': 'dctypes:Image',
            'height': 3802,
            'width': 2822,
            'format': 'image/jpeg',
            'service': {'@context': 'https://iiif.io/api/image/2/context.json',
             '@id': 'https://iiif.onb.ac.at/images/ANNO/lmz18710902/00000008',
             'profile': 'https://iiif.io/api/image/2/level2.json'}},
           'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000008'}],
         'otherContent': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
           '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000008.json',
           '@type': 'sc:AnnotationList',
           'resources': [{'@type': 'oa:Annotation',
             'motivation': 'sc:painting',
             'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000008.xml',
              '@type': 'dctypes:Text',
              'format': 'application/xml+alto'},
             'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000008'}]}]},
        {'@context': 'https://iiif.io/api/presentation/2/context.json',
         '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000009',
         '@type': 'sc:Canvas',
         'label': '00000009',
         'height': 3788,
         'width': 2819,
         'metadata': [{'label': 'Resolution', 'value': '0dpi'},
          {'label': 'Color Depth', 'value': '0bpp'}],
         'images': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
           '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/annotation/00000009',
           '@type': 'oa:Annotation',
           'motivation': 'sc:painting',
           'resource': {'@id': 'https://iiif.onb.ac.at/images/ANNO/lmz18710902/00000009/full/full/0/default.jpg',
            '@type': 'dctypes:Image',
            'height': 3788,
            'width': 2819,
            'format': 'image/jpeg',
            'service': {'@context': 'https://iiif.io/api/image/2/context.json',
             '@id': 'https://iiif.onb.ac.at/images/ANNO/lmz18710902/00000009',
             'profile': 'https://iiif.io/api/image/2/level2.json'}},
           'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000009'}],
         'otherContent': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
           '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000009.json',
           '@type': 'sc:AnnotationList',
           'resources': [{'@type': 'oa:Annotation',
             'motivation': 'sc:painting',
             'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000009.xml',
              '@type': 'dctypes:Text',
              'format': 'application/xml+alto'},
             'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000009'}]}]},
        {'@context': 'https://iiif.io/api/presentation/2/context.json',
         '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000010',
         '@type': 'sc:Canvas',
         'label': '00000010',
         'height': 3802,
         'width': 2822,
         'metadata': [{'label': 'Resolution', 'value': '0dpi'},
          {'label': 'Color Depth', 'value': '0bpp'}],
         'images': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
           '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/annotation/00000010',
           '@type': 'oa:Annotation',
           'motivation': 'sc:painting',
           'resource': {'@id': 'https://iiif.onb.ac.at/images/ANNO/lmz18710902/00000010/full/full/0/default.jpg',
            '@type': 'dctypes:Image',
            'height': 3802,
            'width': 2822,
            'format': 'image/jpeg',
            'service': {'@context': 'https://iiif.io/api/image/2/context.json',
             '@id': 'https://iiif.onb.ac.at/images/ANNO/lmz18710902/00000010',
             'profile': 'https://iiif.io/api/image/2/level2.json'}},
           'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000010'}],
         'otherContent': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
           '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000010.json',
           '@type': 'sc:AnnotationList',
           'resources': [{'@type': 'oa:Annotation',
             'motivation': 'sc:painting',
             'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000010.xml',
              '@type': 'dctypes:Text',
              'format': 'application/xml+alto'},
             'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000010'}]}]},
        {'@context': 'https://iiif.io/api/presentation/2/context.json',
         '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000011',
         '@type': 'sc:Canvas',
         'label': '00000011',
         'height': 3788,
         'width': 2819,
         'metadata': [{'label': 'Resolution', 'value': '0dpi'},
          {'label': 'Color Depth', 'value': '0bpp'}],
         'images': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
           '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/annotation/00000011',
           '@type': 'oa:Annotation',
           'motivation': 'sc:painting',
           'resource': {'@id': 'https://iiif.onb.ac.at/images/ANNO/lmz18710902/00000011/full/full/0/default.jpg',
            '@type': 'dctypes:Image',
            'height': 3788,
            'width': 2819,
            'format': 'image/jpeg',
            'service': {'@context': 'https://iiif.io/api/image/2/context.json',
             '@id': 'https://iiif.onb.ac.at/images/ANNO/lmz18710902/00000011',
             'profile': 'https://iiif.io/api/image/2/level2.json'}},
           'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000011'}],
         'otherContent': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
           '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000011.json',
           '@type': 'sc:AnnotationList',
           'resources': [{'@type': 'oa:Annotation',
             'motivation': 'sc:painting',
             'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000011.xml',
              '@type': 'dctypes:Text',
              'format': 'application/xml+alto'},
             'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000011'}]}]},
        {'@context': 'https://iiif.io/api/presentation/2/context.json',
         '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000012',
         '@type': 'sc:Canvas',
         'label': '00000012',
         'height': 3802,
         'width': 2822,
         'metadata': [{'label': 'Resolution', 'value': '0dpi'},
          {'label': 'Color Depth', 'value': '0bpp'}],
         'images': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
           '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/annotation/00000012',
           '@type': 'oa:Annotation',
           'motivation': 'sc:painting',
           'resource': {'@id': 'https://iiif.onb.ac.at/images/ANNO/lmz18710902/00000012/full/full/0/default.jpg',
            '@type': 'dctypes:Image',
            'height': 3802,
            'width': 2822,
            'format': 'image/jpeg',
            'service': {'@context': 'https://iiif.io/api/image/2/context.json',
             '@id': 'https://iiif.onb.ac.at/images/ANNO/lmz18710902/00000012',
             'profile': 'https://iiif.io/api/image/2/level2.json'}},
           'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000012'}],
         'otherContent': [{'@context': 'https://iiif.io/api/presentation/2/context.json',
           '@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000012.json',
           '@type': 'sc:AnnotationList',
           'resources': [{'@type': 'oa:Annotation',
             'motivation': 'sc:painting',
             'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000012.xml',
              '@type': 'dctypes:Text',
              'format': 'application/xml+alto'},
             'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000012'}]}]}]}]}

%% Cell type:markdown id: tags:

That's a lot of information. We only need the blocks that link to the ALTO-XML resources, which sit in the `otherContent` of each canvas.

Let's use jsonpath-ng to pull them out.

%% Cell type:code id: tags:

``` python
from jsonpath_ng import parse
```

%% Cell type:code id: tags:

``` python
def jp(http_response, parser):
    """Return the values of all matches of `parser` in the JSON body of `http_response`."""
    return [match.value for match in parser.find(http_response.json())]
```

%% Cell type:code id: tags:

``` python
resource_parser = parse('$.sequences[*].canvases[*].otherContent[*].resources')
```

%% Cell type:code id: tags:

``` python
jp(r, resource_parser)
```

%% Output

    [[{'@type': 'oa:Annotation',
       'motivation': 'sc:painting',
       'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000001.xml',
        '@type': 'dctypes:Text',
        'format': 'application/xml+alto'},
       'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000001'}],
     [{'@type': 'oa:Annotation',
       'motivation': 'sc:painting',
       'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000002.xml',
        '@type': 'dctypes:Text',
        'format': 'application/xml+alto'},
       'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000002'}],
     [{'@type': 'oa:Annotation',
       'motivation': 'sc:painting',
       'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000003.xml',
        '@type': 'dctypes:Text',
        'format': 'application/xml+alto'},
       'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000003'}],
     [{'@type': 'oa:Annotation',
       'motivation': 'sc:painting',
       'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000004.xml',
        '@type': 'dctypes:Text',
        'format': 'application/xml+alto'},
       'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000004'}],
     [{'@type': 'oa:Annotation',
       'motivation': 'sc:painting',
       'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000005.xml',
        '@type': 'dctypes:Text',
        'format': 'application/xml+alto'},
       'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000005'}],
     [{'@type': 'oa:Annotation',
       'motivation': 'sc:painting',
       'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000006.xml',
        '@type': 'dctypes:Text',
        'format': 'application/xml+alto'},
       'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000006'}],
     [{'@type': 'oa:Annotation',
       'motivation': 'sc:painting',
       'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000007.xml',
        '@type': 'dctypes:Text',
        'format': 'application/xml+alto'},
       'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000007'}],
     [{'@type': 'oa:Annotation',
       'motivation': 'sc:painting',
       'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000008.xml',
        '@type': 'dctypes:Text',
        'format': 'application/xml+alto'},
       'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000008'}],
     [{'@type': 'oa:Annotation',
       'motivation': 'sc:painting',
       'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000009.xml',
        '@type': 'dctypes:Text',
        'format': 'application/xml+alto'},
       'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000009'}],
     [{'@type': 'oa:Annotation',
       'motivation': 'sc:painting',
       'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000010.xml',
        '@type': 'dctypes:Text',
        'format': 'application/xml+alto'},
       'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000010'}],
     [{'@type': 'oa:Annotation',
       'motivation': 'sc:painting',
       'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000011.xml',
        '@type': 'dctypes:Text',
        'format': 'application/xml+alto'},
       'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000011'}],
     [{'@type': 'oa:Annotation',
       'motivation': 'sc:painting',
       'resource': {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000012.xml',
        '@type': 'dctypes:Text',
        'format': 'application/xml+alto'},
       'on': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/canvas/00000012'}]]

%% Cell type:markdown id: tags:

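Not quite there yet: this gives us one list of annotations per page. Let's go one level deeper, down to the `resource` of every annotation.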
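
The difference between a path ending in `.resources` and one ending in `.resources[*].resource` is perhaps easiest to see in plain Python. The following is only a sketch: `manifest` is a trimmed, made-up dict with the same nesting as the real manifest, and `other_content` is a hypothetical helper that is not part of jsonpath-ng.

```python
# Trimmed, made-up manifest with the same nesting as the real one.
manifest = {
    'sequences': [{
        'canvases': [
            {'otherContent': [{'resources': [
                {'@type': 'oa:Annotation',
                 'resource': {'@id': 'page-1.xml', 'format': 'application/xml+alto'}}]}]},
            {'otherContent': [{'resources': [
                {'@type': 'oa:Annotation',
                 'resource': {'@id': 'page-2.xml', 'format': 'application/xml+alto'}}]}]},
        ]
    }]
}

def other_content(manifest):
    """Yield every otherContent entry, like $.sequences[*].canvases[*].otherContent[*]."""
    for sequence in manifest['sequences']:
        for canvas in sequence['canvases']:
            yield from canvas['otherContent']

# ...otherContent[*].resources  -> one list of annotations per page
per_page_lists = [oc['resources'] for oc in other_content(manifest)]

# ...otherContent[*].resources[*].resource  -> one flat list of resource dicts
flat_resources = [anno['resource']
                  for oc in other_content(manifest)
                  for anno in oc['resources']]

print(per_page_lists)
print(flat_resources)
```

Unlike the JSONPath expression, this hand-written traversal raises a `KeyError` as soon as a key is missing, which is one reason to prefer the path expression for real manifests.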

%% Cell type:code id: tags:

``` python
all_resources = parse('$.sequences[*].canvases[*].otherContent[*].resources[*].resource')
```

%% Cell type:code id: tags:

``` python
jp(r, all_resources)
```

%% Output

    [{'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000001.xml',
      '@type': 'dctypes:Text',
      'format': 'application/xml+alto'},
     {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000002.xml',
      '@type': 'dctypes:Text',
      'format': 'application/xml+alto'},
     {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000003.xml',
      '@type': 'dctypes:Text',
      'format': 'application/xml+alto'},
     {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000004.xml',
      '@type': 'dctypes:Text',
      'format': 'application/xml+alto'},
     {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000005.xml',
      '@type': 'dctypes:Text',
      'format': 'application/xml+alto'},
     {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000006.xml',
      '@type': 'dctypes:Text',
      'format': 'application/xml+alto'},
     {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000007.xml',
      '@type': 'dctypes:Text',
      'format': 'application/xml+alto'},
     {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000008.xml',
      '@type': 'dctypes:Text',
      'format': 'application/xml+alto'},
     {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000009.xml',
      '@type': 'dctypes:Text',
      'format': 'application/xml+alto'},
     {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000010.xml',
      '@type': 'dctypes:Text',
      'format': 'application/xml+alto'},
     {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000011.xml',
      '@type': 'dctypes:Text',
      'format': 'application/xml+alto'},
     {'@id': 'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000012.xml',
      '@type': 'dctypes:Text',
      'format': 'application/xml+alto'}]

%% Cell type:markdown id: tags:

Keep only the resources whose format is `application/xml+alto`, and of those just the `@id`:

%% Cell type:code id: tags:

``` python
ids = [d['@id'] for d in jp(r, all_resources) if d.get('format') == 'application/xml+alto']
```

%% Cell type:code id: tags:

``` python
ids
```

%% Output

    ['https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000001.xml',
     'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000002.xml',
     'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000003.xml',
     'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000004.xml',
     'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000005.xml',
     'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000006.xml',
     'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000007.xml',
     'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000008.xml',
     'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000009.xml',
     'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000010.xml',
     'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000011.xml',
     'https://iiif.onb.ac.at/presentation/ANNO/lmz18710902/resource/00000012.xml']

%% Cell type:markdown id: tags:

### Download the ALTO Files

%% Cell type:code id: tags:

``` python
alto_storage = {}

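for xml_link in ids:
    # Caution: this reuses the name `r`, replacing the manifest response from above
    r = requests.get(xml_link)
    # Only successful responses are stored; failures are skipped silently
    if r.ok:
        alto_storage[xml_link] = r.text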
```

%% Cell type:code id: tags:

``` python
alto_storage
```

%% Output

    {}

%% Cell type:code id: tags:

``` python
r
```

%% Output

    <Response [400]>

%% Cell type:code id: tags:

``` python
r.ok
```

%% Output

    False

%% Cell type:markdown id: tags:

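Uh oh. `alto_storage` is empty, so none of the requests succeeded, and the last one came back with HTTP 400 (Bad Request).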
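
One way to make such failures visible earlier is to record the status code of every request that does not succeed, instead of dropping it. The sketch below assumes nothing about why the server answers with 400. `download_all`, `FakeResponse` and `fake_get` are hypothetical names, and the fake response only imitates the `ok`, `status_code` and `text` attributes of a `requests` response so that the helper can be tried offline.

```python
from typing import Callable, Dict, Iterable, Tuple

def download_all(urls: Iterable[str], get: Callable) -> Tuple[Dict[str, str], Dict[str, int]]:
    """Fetch every URL with `get`; return (bodies by URL, failed status codes by URL)."""
    bodies, failures = {}, {}
    for url in urls:
        # A separate name keeps the manifest response `r` intact.
        response = get(url)
        if response.ok:
            bodies[url] = response.text
        else:
            failures[url] = response.status_code
    return bodies, failures

class FakeResponse:
    """Stand-in for a requests response, used only to exercise the helper offline."""
    def __init__(self, status_code, text=''):
        self.status_code = status_code
        self.text = text
        self.ok = status_code < 400

def fake_get(url):
    # Made-up behaviour: pretend that page 2 is rejected with HTTP 400.
    return FakeResponse(400) if url.endswith('2.xml') else FakeResponse(200, '<alto/>')

bodies, failures = download_all(['page/1.xml', 'page/2.xml'], fake_get)
print(bodies)
print(failures)
```

With the real data this could be called as `download_all(ids, requests.get)`. An empty first dict and a second dict full of 400s would then show at a glance that every page was rejected.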

%% Cell type:markdown id: tags:

### Convert the ALTO-XML to TXT

%% Cell type:code id: tags:

``` python
import alto_tools

def alto_extract_text_lines(xml, xmlns):
    text_lines = []
    nsdict = {'alto': xmlns}
    for text_line in xml.iterfind('.//alto:TextLine', nsdict):
        words = [string.attrib.get('CONTENT', '') for string in text_line.findall('alto:String', nsdict)]
        text_lines.append(' '.join(words))
    return '\n'.join(text_lines)

def alto_to_text(raw_alto_text):
    alto, xml, xmlns = alto_tools.alto_parse(raw_alto_text)
    return alto_extract_text_lines(xml, xmlns)
```

%% Cell type:code id: tags:

``` python
print(alto_to_text('http://iiif.onb.ac.at/presentation/ANNO/apr18750223/resource/00000002.xml'))
```

%% Output

    ---------------------------------------------------------------------------
    OSError                                   Traceback (most recent call last)
    <ipython-input-31-163d551895d7> in <module>
    ----> 1 print(alto_to_text('http://iiif.onb.ac.at/presentation/ANNO/apr18750223/resource/00000002.xml'))

    <ipython-input-30-bfbcde0cf185> in alto_to_text(raw_alto_text)
         10
         11 def alto_to_text(raw_alto_text):
    ---> 12     alto, xml, xmlns = alto_tools.alto_parse(raw_alto_text)
         13     return alto_extract_text_lines(xml, xmlns)
    ~/labs/pydays19/alto_tools.py in alto_parse(alto)
         17     """ Convert ALTO xml file to element tree """
         18     try:
    ---> 19         xml = etree.parse(alto)
         20     except etree.ParseError as e:
         21         sys.stdout.write('\nERROR: Failed parsing "%s" - '
    src/lxml/etree.pyx in lxml.etree.parse()
    src/lxml/parser.pxi in lxml.etree._parseDocument()
    src/lxml/parser.pxi in lxml.etree._parseDocumentFromURL()
    src/lxml/parser.pxi in lxml.etree._parseDocFromFile()
    src/lxml/parser.pxi in lxml.etree._BaseParser._parseDocFromFile()
    src/lxml/parser.pxi in lxml.etree._ParserContext._handleParseResultDoc()
    src/lxml/parser.pxi in lxml.etree._handleParseResult()
    src/lxml/parser.pxi in lxml.etree._raiseParseError()
    OSError: Error reading file 'http://iiif.onb.ac.at/presentation/ANNO/apr18750223/resource/00000002.xml': failed to load HTTP resource
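
%% Cell type:markdown id: tags:

The traceback shows `etree.parse` being handed a URL and failing to load it over HTTP. One possible way around this, sketched here under assumptions, is to fetch the XML separately and parse it from a string. The sketch uses the standard library's `xml.etree.ElementTree` rather than `alto_tools`, the function name `alto_string_to_text` is made up, and `SAMPLE_ALTO` is a hand-written miniature document, not real ANNO data.

```python
import xml.etree.ElementTree as ET

def alto_string_to_text(alto_xml: str) -> str:
    """Return the text of an ALTO document given as a string, one TextLine per row."""
    root = ET.fromstring(alto_xml)
    # The root tag looks like '{namespace-uri}alto'; reuse whatever namespace the file declares.
    ns_uri = root.tag[1:].split('}')[0] if root.tag.startswith('{') else ''
    prefix = f'{{{ns_uri}}}' if ns_uri else ''
    rows = []
    for text_line in root.iter(f'{prefix}TextLine'):
        # Each word is a String element; its text lives in the CONTENT attribute.
        words = [s.get('CONTENT', '') for s in text_line.findall(f'{prefix}String')]
        rows.append(' '.join(words))
    return '\n'.join(rows)

# Hand-written miniature ALTO document, for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Klagenfurter"/><SP/><String CONTENT="Zeitung"/></TextLine>
    <TextLine><String CONTENT="Wien"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

sample_text = alto_string_to_text(SAMPLE_ALTO)
print(sample_text)
```

Once the downloads succeed, each value stored in `alto_storage` could be passed to such a function directly, without a second HTTP request from inside the parser.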

%% Cell type:code id: tags:

``` python
```