Commit de3e74dc authored by Georg Petz

JSON-LD Processing Algorithms and RDF introduction

parent b9b656d0
%% Cell type:markdown id: tags:

# 2 - Metadata and Catalogue

[https://labs.onb.ac.at/en/dataset/lod/](https://labs.onb.ac.at/en/dataset/lod/)

[https://labs.onb.ac.at/en/tool/sparql/](https://labs.onb.ac.at/en/tool/sparql/)

%% Cell type:markdown id: tags:

### In this block:

* Overview of data formats
* Overview of container formats
* Overview of protocols

%% Cell type:markdown id: tags:

* Example: SRU (2.1)
* Example: data harvesting OAI-PMH (2.2)
* Example: SPARQL (2.3)

%% Cell type:markdown id: tags:

## Overview of data formats

%% Cell type:markdown id: tags:

* Dublin Core
    * set of vocabulary terms to describe digital resources
    * 15 classic metadata terms, known as the Dublin Core Metadata Element Set (DCMES)
    * [Dublin Core Metadata Initiative](http://dublincore.org/)

%% Cell type:markdown id: tags:

* MARC
    * MARC (MAchine-Readable Cataloging) standards
        * developed in the 1960s to create records that could be read by computers and shared among libraries
    * MARC 21, MARC record format for the 21st century

%% Cell type:markdown id: tags:

* Dublin Core Metadata Element Set (DCMES) 1.1

    1. Title: The name of the object
    2. Creator: An entity primarily responsible for making the resource
    3. Subject: The topic addressed by the work
    4. Description: An account of the resource
    5. Publisher: The agent or agency responsible for making the object available
    6. Contributor: An entity responsible for making contributions to the resource
    7. Date: A date associated with an event in the lifecycle of the resource, typically its publication
    8. Type: The nature or genre of the resource
    9. Format: The file format, physical medium, or dimensions of the resource
    10. Identifier: String or number used to uniquely identify the object

%% Cell type:markdown id: tags:

* Dublin Core Metadata Element Set (DCMES) 1.1

    11. Source: Objects, either print or electronic, from which this object is derived, if applicable
    12. Language: Language of the intellectual content
    13. Relation: Relationship to other objects
    14. Coverage: The spatial locations and temporal durations characteristic of the object
    15. Rights: Information about rights held in and over the resource

%% Cell type:markdown id: tags:

## Overview of container formats

%% Cell type:markdown id: tags:

* Simple DC container XML Schema [http://www.dublincore.org/schemas/xmls/](http://www.dublincore.org/schemas/xmls/)
![simpledc xml schema](./media/simpledc.png)
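
As a minimal sketch of what such a container holds, the following builds a Simple DC record with Python's standard library. The wrapper element name `record` and the helper `build_simple_dc` are assumptions made for illustration, not part of the schema; the field values are taken from the catalogue example used later in this notebook.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15 classic Dublin Core elements
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_simple_dc(fields):
    """Return a Simple DC record as an XML string.

    `fields` is a list of (element name, value) pairs; elements may repeat.
    The wrapper element name "record" is a placeholder chosen for this sketch.
    """
    root = ET.Element("record")
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


xml_record = build_simple_dc([
    ("title", "Biblia polyglotta"),
    ("publisher", "Arn. Giull. de Brocario"),
])
print(xml_record)
```

Every DC element is optional and repeatable, which is why the sketch takes a list of pairs rather than a dictionary.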

%% Cell type:markdown id: tags:

* JSON
    * a string = { "name":"John" }
    * a number = { "age":30 }
    * an object (JSON object) = {"employee":{ "name":"John", "age":30, "city":"New York" }}
    * an array = {"employees":[ "John", "Anna", "Peter" ]}
    * a boolean = { "sale":true }
    * null = { "middlename":null }
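
The value types listed above can be tried out with Python's built-in `json` module. This short sketch merges the individual examples into one document and prints the Python type each value is parsed into.

```python
import json

text = """{
  "name": "John",
  "age": 30,
  "employee": {"name": "John", "age": 30, "city": "New York"},
  "employees": ["John", "Anna", "Peter"],
  "sale": true,
  "middlename": null
}"""

data = json.loads(text)

# Each JSON value type is mapped to a corresponding Python type
for key, value in data.items():
    print(f"{key}: {type(value).__name__}")
```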

%% Cell type:markdown id: tags:

* JSON-LD
    * JSON for Linked Data
    * keywords
        * @context to provide additional mappings from JSON to an RDF model (map terms to IRIs)
        * @id to uniquely identify things
        * @type to set the type of a node or the datatype of a typed value
        * @container to set the default container type for a term
            * "@container": "@set" defines a container as an unordered set

%% Cell type:markdown id: tags:

```javascript
{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "homepage": {
      "@id": "http://xmlns.com/foaf/0.1/workplaceHomepage",
      "@type": "@id"
    },
    "Person": "http://xmlns.com/foaf/0.1/Person"
  },
  "@id": "https://me.example.com",
  "@type": "Person",
  "name": "John Smith",
  "homepage": "https://www.example.com/"
}
```

%% Cell type:markdown id: tags:

* RDF (Resource Description Framework): W3C standard for modeling information
* RDF Triples
    * describe everything as a subject, predicate and object expression
        * subject: denotes the resource being described
        * predicate: a term that expresses a property of the subject or its relationship to the object
        * object: the value of that property; can be another resource or just a literal value
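
To make the triple structure concrete, the JSON-LD example above can be written out by hand as three triples. This is only a sketch: the tuple layout and the helper `to_ntriples` are assumptions for illustration, and the serialisation does no escaping of special characters, so it is not a complete N-Triples writer.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ME = "https://me.example.com"

# (subject, predicate, object, object_is_iri)
triples = [
    (ME, RDF_TYPE, FOAF + "Person", True),
    (ME, FOAF + "name", "John Smith", False),
    (ME, FOAF + "workplaceHomepage", "https://www.example.com/", True),
]


def to_ntriples(subject, predicate, obj, object_is_iri):
    """Serialise one triple as an N-Triples line (no escaping, sketch only)."""
    obj_term = f"<{obj}>" if object_is_iri else f'"{obj}"'
    return f"<{subject}> <{predicate}> {obj_term} ."


lines = [to_ntriples(*t) for t in triples]
print("\n".join(lines))
```

The subject and predicate are always IRIs here, while the object is either an IRI (another resource) or a quoted literal.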

%% Cell type:markdown id: tags:

[https://json-ld.org/playground/](https://json-ld.org/playground/)

%% Cell type:markdown id: tags:

* JSON-LD Processing Algorithms
    * JSON-LD Expanded
        * replaces terms with the IRIs they expand to
        * necessary for further transformations
        * removes context
    * JSON-LD Compacted
        * applies a context to shorten IRIs back to terms
        * makes it easier to read
    * JSON-LD Flattened
        * all properties of a node are collected in a __single__ JSON object
        * a labeled directed graph
    * JSON-LD Framing
        * [https://w3c.github.io/json-ld-framing/](https://w3c.github.io/json-ld-framing/)
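
The difference between the expanded and compacted forms can be illustrated with a deliberately simplified toy in plain Python. This is not the W3C algorithm: the hypothetical helpers `toy_expand` and `toy_compact` only handle a single flat node with string values, which is just enough for the FOAF example above. For real data, a JSON-LD library or the playground should be used.

```python
import json

context = {
    "name": "http://xmlns.com/foaf/0.1/name",
    "homepage": {"@id": "http://xmlns.com/foaf/0.1/workplaceHomepage", "@type": "@id"},
    "Person": "http://xmlns.com/foaf/0.1/Person",
}
doc = {
    "@id": "https://me.example.com",
    "@type": "Person",
    "name": "John Smith",
    "homepage": "https://www.example.com/",
}


def term_iri(term):
    """Look up the IRI a term maps to; unknown terms are returned unchanged."""
    definition = context.get(term, term)
    return definition["@id"] if isinstance(definition, dict) else definition


def toy_expand(node):
    """Replace terms with IRIs and wrap values (flat node, string values only)."""
    expanded = {}
    for key, value in node.items():
        if key == "@id":
            expanded["@id"] = value
        elif key == "@type":
            expanded["@type"] = [term_iri(value)]
        else:
            definition = context.get(key)
            is_iri = isinstance(definition, dict) and definition.get("@type") == "@id"
            expanded[term_iri(key)] = [{"@id": value} if is_iri else {"@value": value}]
    return [expanded]


def toy_compact(expanded):
    """Invert toy_expand: shorten IRIs back to terms and re-attach the context."""
    iri_to_term = {term_iri(term): term for term in context}
    compacted = {"@context": context}
    for key, value in expanded[0].items():
        if key == "@id":
            compacted["@id"] = value
        elif key == "@type":
            compacted["@type"] = iri_to_term.get(value[0], value[0])
        else:
            item = value[0]
            compacted[iri_to_term.get(key, key)] = item.get("@id", item.get("@value"))
    return compacted


expanded = toy_expand(doc)
print(json.dumps(expanded, indent=2))
print(json.dumps(toy_compact(expanded), indent=2))
```

The expanded form carries no context and uses full IRIs as keys, while the compacted form needs the context again to be interpreted.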

%% Cell type:markdown id: tags:

* DCMES and DCMI Metadata Terms [http://www.dublincore.org/specifications/dublin-core/dcmi-terms/](http://www.dublincore.org/specifications/dublin-core/dcmi-terms/) within JSON-LD

```javascript
{
...
    "publisher": "Arn. Giull. de Brocario",
    "place_of_publication": "Compluti",
    "language": "http://id.loc.gov/vocabulary/iso639-2/mul",
    "@id": "https://open-na.hosted.exlibrisgroup.com/alma/43ACC_ONB/bibs/990028618530603338",
    "title": "Biblia polyglotta",
    "@context": "https://open-na.hosted.exlibrisgroup.com/alma/contexts/bib"
}
```

* [https://open-na.hosted.exlibrisgroup.com/alma/contexts/bib](https://open-na.hosted.exlibrisgroup.com/alma/contexts/bib)

%% Cell type:code id: tags:

``` python
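import requests

# Fetch one bibliographic record as JSON-LD from the Alma linked data endpoint
cont = requests.get("https://open-na.hosted.exlibrisgroup.com/alma/43ACC_NETWORK/bibs/990106901740203331")
cont.raise_for_status()  # fail early on HTTP errors
cont.json()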
```

%% Output

    {'date': '9999',
     'note': 'Aus: (Sammelband von 63 Hochzeitsgedichten).',
     'identifier': [{'label': '(DE-599)OBVAC10480601'},
      {'label': '(Aleph)010690174ACC01'},
      {'label': '(AT-OBV)AC10480601'},
      {'label': 'AC10480601'}],
     '@type': 'Book',
     'place_of_publication': 's.l.',
     'language': 'http://id.loc.gov/vocabulary/iso639-2/ger',
     '@id': 'https://open-na.hosted.exlibrisgroup.com/alma/43ACC_NETWORK/bibs/990106901740203331',
     'title': 'Bey dem hochadelichen Helmrich- und Bassronischen Beylager, welches ... zu sonderbahren Ehren beyder Vermählten ...',
     '@context': 'https://open-na.hosted.exlibrisgroup.com/alma/contexts/bib'}

%% Cell type:markdown id: tags:

* MARCXML
    * MARCXML is an XML schema based on the common MARC21 standards
    * [http://www.loc.gov/standards/marcxml/](http://www.loc.gov/standards/marcxml/)
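
A minimal sketch of reading fields from a MARCXML record with the standard library's `xml.etree.ElementTree` follows. The inline sample is a heavily trimmed version of the record returned by the SRU request later in this notebook, and the helper `subfields` is a hypothetical name introduced here.

```python
import xml.etree.ElementTree as ET

# Namespace used by MARCXML ("MARC21 slim") records
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

sample = """
<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">990030217420603338</controlfield>
  <datafield tag="264" ind1=" " ind2="1">
    <subfield code="a">[Meißen]</subfield>
    <subfield code="b">[Gödsche]</subfield>
    <subfield code="c">1814</subfield>
  </datafield>
  <datafield tag="300" ind1=" " ind2=" ">
    <subfield code="a">32 S.</subfield>
  </datafield>
</record>
""".strip()


def subfields(record, tag, code):
    """Return the text of every subfield `code` in every datafield `tag`."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [el.text for el in record.findall(path, MARC_NS)]


record = ET.fromstring(sample)
mms_id = record.findtext("marc:controlfield[@tag='001']", namespaces=MARC_NS)
print(mms_id)
print(subfields(record, "264", "c"))
```

Because MARC fields and subfields are repeatable, the helper returns a list rather than a single value.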

%% Cell type:markdown id: tags:

## Overview of protocols

%% Cell type:markdown id: tags:

* SRU
    * SRU (Search/Retrieve via URL) permits targeted searches within the catalogue based on well-established internet standards.
    * [https://developers.exlibrisgroup.com/alma/integrations/SRU/](https://developers.exlibrisgroup.com/alma/integrations/SRU/)
    * [http://www.loc.gov/standards/sru/](http://www.loc.gov/standards/sru/)
    * based on CQL (Contextual Query Language) to search within the catalogue
    * to retrieve a single bibliographic record, the barcode or the Metadata Management System ID (MMS-ID) is used
    * CQL query
        * alma.mms_id=990055772160603338  ([https://obv-at-oenb.alma.exlibrisgroup.com/view/sru/43ACC_ONB?version=1.2&query=alma.mms_id=990055772160603338&startRecord=0&maximumRecords=1&operation=searchRetrieve&recordSchema=marcxml](https://obv-at-oenb.alma.exlibrisgroup.com/view/sru/43ACC_ONB?version=1.2&query=alma.mms_id=990055772160603338&startRecord=0&maximumRecords=1&operation=searchRetrieve&recordSchema=marcxml))


%% Cell type:markdown id: tags:

* CQL query
    * alma.title=transzendental ([https://obv-at-oenb.alma.exlibrisgroup.com/view/sru/43ACC_ONB?version=1.2&query=alma.title=transzendental&startRecord=0&maximumRecords=5&operation=searchRetrieve&recordSchema=dc](https://obv-at-oenb.alma.exlibrisgroup.com/view/sru/43ACC_ONB?version=1.2&query=alma.title=transzendental&startRecord=0&maximumRecords=5&operation=searchRetrieve&recordSchema=dc))
    * alma.barcode=%2BZ199052304 ([https://obv-at-oenb.alma.exlibrisgroup.com/view/sru/43ACC_ONB?version=1.2&query=alma.barcode=%2BZ199052304&startRecord=0&maximumRecords=1&operation=searchRetrieve&recordSchema=marcxml](https://obv-at-oenb.alma.exlibrisgroup.com/view/sru/43ACC_ONB?version=1.2&query=alma.barcode=%2BZ199052304&startRecord=0&maximumRecords=1&operation=searchRetrieve&recordSchema=marcxml))
    * alma.mms_id=990034300920603338%20or%20alma.mms_id=990028618530603338 ([https://obv-at-oenb.alma.exlibrisgroup.com/view/sru/43ACC_ONB?version=1.2&query=alma.mms_id=990034300920603338%20or%20alma.mms_id=990028618530603338&startRecord=1&maximumRecords=5&operation=searchRetrieve&recordSchema=dc](https://obv-at-oenb.alma.exlibrisgroup.com/view/sru/43ACC_ONB?version=1.2&query=alma.mms_id=990034300920603338%20or%20alma.mms_id=990028618530603338&startRecord=1&maximumRecords=5&operation=searchRetrieve&recordSchema=dc))
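
The request URLs above all follow the same pattern, so they can be assembled from a CQL string instead of being written by hand. The following is a sketch using only `urllib.parse`; the helper name `sru_url` and its default values are assumptions for illustration, while the base URL and parameters are the ones used in the links above.

```python
from urllib.parse import quote, urlencode

SRU_BASE = "https://obv-at-oenb.alma.exlibrisgroup.com/view/sru/43ACC_ONB"


def sru_url(cql, schema="marcxml", start=0, maximum=1):
    """Build a searchRetrieve URL for a CQL query string."""
    params = {
        "version": "1.2",
        "query": cql,
        "startRecord": start,
        "maximumRecords": maximum,
        "operation": "searchRetrieve",
        "recordSchema": schema,
    }
    # quote() percent-encodes "+" and spaces; "=" is kept readable in the query
    return SRU_BASE + "?" + urlencode(params, quote_via=quote, safe="=")


print(sru_url("alma.barcode=+Z199052304"))
print(sru_url("alma.title=transzendental", schema="dc", maximum=5))
```

Encoding matters here: the barcode starts with a literal `+`, which must be sent as `%2B` because an unencoded `+` in a query string is read as a space.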

%% Cell type:code id: tags:

``` python
import requests
from lxml import etree

# SRU searchRetrieve request: look up one record by barcode, returned as MARCXML
url = "https://obv-at-oenb.alma.exlibrisgroup.com/view/sru/43ACC_ONB?version=1.2&query=alma.barcode=%2BZ199052304&startRecord=0&maximumRecords=1&operation=searchRetrieve&recordSchema=marcxml"
cont = requests.get(url).content
# Parse the response and pretty-print the XML tree
e = etree.XML(cont)
print(etree.tostring(e, encoding='unicode', pretty_print=True))
```

%% Output

    <searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/">
      <version>1.2</version>
      <numberOfRecords>1</numberOfRecords>
      <records>
        <record>
          <recordSchema>marcxml</recordSchema>
          <recordPacking>xml</recordPacking>
          <recordData>
            <record xmlns="http://www.loc.gov/MARC21/slim">
              <leader>00000nam a2200000 c 4500</leader>
              <controlfield tag="001">990030217420603338</controlfield>
              <controlfield tag="005">20180123084300.0</controlfield>
              <controlfield tag="007">cr#|||||||||||</controlfield>
              <controlfield tag="007">tu</controlfield>
              <controlfield tag="008">000101|1814####xx############|||#|#ger#u</controlfield>
              <controlfield tag="009">AC09865194</controlfield>
              <datafield tag="035" ind1=" " ind2=" ">
                <subfield code="a">AC09865194</subfield>
              </datafield>
              <datafield tag="035" ind1=" " ind2=" ">
                <subfield code="a">(Aleph)009871525ACC01</subfield>
              </datafield>
              <datafield tag="035" ind1=" " ind2=" ">
                <subfield code="a">(DE-599)OBVAC09865194</subfield>
              </datafield>
              <datafield tag="035" ind1=" " ind2=" ">
                <subfield code="a">(AT-OBV)AC09865194</subfield>
              </datafield>
              <datafield tag="035" ind1=" " ind2=" ">
                <subfield code="a">(EXLNZ-43ACC_NETWORK)990098715250203331</subfield>
              </datafield>
              <datafield tag="040" ind1=" " ind2=" ">
                <subfield code="a">ONB</subfield>
                <subfield code="b">ger</subfield>
                <subfield code="c">ONB-AK-RETRO</subfield>
                <subfield code="d">AT-OeNB</subfield>
                <subfield code="e">pi</subfield>
              </datafield>
              <datafield tag="041" ind1=" " ind2=" ">
                <subfield code="a">ger</subfield>
              </datafield>
              <datafield tag="044" ind1=" " ind2=" ">
                <subfield code="c">XA-DXDE</subfield>
              </datafield>
              <datafield tag="245" ind1="0" ind2="0">
                <subfield code="a">&lt;&lt;Die&gt;&gt; Flucht über den Rhein odar Das unverhoffte Wiedersehen</subfield>
                <subfield code="b">Ein erlustirend historisch-rührendes Familiengemälde mit Erscheinungen und vollstimmigen Chören von Baschkiren und Cosaken, und allen Batterien der Deutschen</subfield>
              </datafield>
              <datafield tag="264" ind1=" " ind2="1">
                <subfield code="a">[Meißen]</subfield>
                <subfield code="b">[Gödsche]</subfield>
                <subfield code="c">1814</subfield>
              </datafield>
              <datafield tag="300" ind1=" " ind2=" ">
                <subfield code="a">32 S.</subfield>
              </datafield>
              <datafield tag="689" ind1="0" ind2="0">
                <subfield code="a">Deutschland</subfield>
                <subfield code="D">g</subfield>
                <subfield code="0">(DE-588)4011882-4</subfield>
              </datafield>
              <datafield tag="689" ind1="0" ind2="1">
                <subfield code="a">Krieg</subfield>
                <subfield code="D">s</subfield>
                <subfield code="0">(DE-588)4033114-3</subfield>
              </datafield>
              <datafield tag="689" ind1="0" ind2="3">
                <subfield code="a">Belletristische Darstellung</subfield>
                <subfield code="A">f</subfield>
              </datafield>
              <datafield tag="689" ind1="0" ind2=" ">
                <subfield code="5">AT-OBV</subfield>
                <subfield code="5">ONB-AK</subfield>
              </datafield>
              <datafield tag="689" ind1="1" ind2="0">
                <subfield code="a">Drama</subfield>
                <subfield code="D">s</subfield>
                <subfield code="0">(DE-588)4012899-4</subfield>
              </datafield>
              <datafield tag="689" ind1="1" ind2="1">
                <subfield code="a">Deutsch</subfield>
                <subfield code="D">s</subfield>
                <subfield code="0">(DE-588)4113292-0</subfield>
              </datafield>
              <datafield tag="689" ind1="1" ind2=" ">
                <subfield code="5">AT-OBV</subfield>
                <subfield code="5">ONB-AK</subfield>
              </datafield>
              <datafield tag="710" ind1="2" ind2=" ">
                <subfield code="a">Goedsche, Friedrich Wilhelm</subfield>
                <subfield code="4">pbl</subfield>
              </datafield>
              <datafield tag="856" ind1="4" ind2=" ">
                <subfield code="u">http://data.onb.ac.at/imgk/AZ00308934SZ00220134SZ00628562</subfield>
                <subfield code="z">Zettel</subfield>
                <subfield code="o">Katalogkarte</subfield>
              </datafield>
              <datafield tag="856" ind1="4" ind2="0">
                <subfield code="m">V:AT-OBV;B:AT-OeNB</subfield>
                <subfield code="q">application/html</subfield>
                <subfield code="u">http://data.onb.ac.at/ABO/%2BZ182067107</subfield>
                <subfield code="x">ONB-ABO</subfield>
                <subfield code="3">Volltext</subfield>
                <subfield code="o">OBV-ONB-ABO</subfield>
              </datafield>
              <datafield tag="856" ind1="4" ind2="0">
                <subfield code="m">V:AT-OBV;B:AT-OeNB</subfield>
                <subfield code="q">application/html</subfield>
                <subfield code="u">http://data.onb.ac.at/ABO/%2BZ199052304</subfield>
                <subfield code="x">ONB-ABO</subfield>
                <subfield code="3">Volltext</subfield>
                <subfield code="o">OBV-ONB-ABO</subfield>
              </datafield>
              <datafield tag="974" ind1="0" ind2="s">
                <subfield code="V">029</subfield>
                <subfield code="a">LZ01187985</subfield>
              </datafield>
              <datafield tag="974" ind1="0" ind2="s">
                <subfield code="F">030</subfield>
                <subfield code="A">u|1uf||||||37</subfield>
              </datafield>
              <datafield tag="974" ind1="0" ind2="s">
                <subfield code="F">050</subfield>
                <subfield code="A">a|a|||||g|||||</subfield>
              </datafield>
              <datafield tag="974" ind1="0" ind2="s">
                <subfield code="F">051</subfield>
                <subfield code="A">m|||||||</subfield>
              </datafield>
              <datafield tag="980" ind1="0" ind2=" ">
                <subfield code="a">0</subfield>
                <subfield code="9">LOCAL</subfield>
              </datafield>
              <datafield tag="980" ind1="0" ind2=" ">
                <subfield code="a">ONB-AK-RETRO</subfield>
                <subfield code="9">LOCAL</subfield>
              </datafield>
              <datafield tag="982" ind1=" " ind2=" ">
                <subfield code="f">Drama</subfield>
                <subfield code="9">LOCAL</subfield>
              </datafield>
              <datafield tag="982" ind1=" " ind2=" ">
                <subfield code="f">Dramen / deutsche / 19. Jh.</subfield>
                <subfield code="9">LOCAL</subfield>
              </datafield>
              <datafield tag="AVA" ind1=" " ind2=" ">
                <subfield code="0">990030217420603338</subfield>
                <subfield code="8">22288570940003338</subfield>
                <subfield code="a">43ACC_ONB</subfield>
                <subfield code="b">ZALT</subfield>
                <subfield code="c">State Hall at Josefsplatz</subfield>
                <subfield code="d">80.J.58</subfield>
                <subfield code="e">available</subfield>
                <subfield code="f">1</subfield>
                <subfield code="g">0</subfield>
                <subfield code="i">ONB</subfield>
                <subfield code="j">PRUNK</subfield>
                <subfield code="p">1</subfield>
                <subfield code="q">Department of Manuscripts and Rare Books (ALT)</subfield>
              </datafield>
              <datafield tag="AVA" ind1=" " ind2=" ">
                <subfield code="0">990030217420603338</subfield>
                <subfield code="8">22288570920003338</subfield>
                <subfield code="a">43ACC_ONB</subfield>
                <subfield code="b">ZFID</subfield>
                <subfield code="c">Bildarchiv und Grafiksammlung</subfield>
                <subfield code="d">288765-B</subfield>
                <subfield code="e">available</subfield>
                <subfield code="f">1</subfield>
                <subfield code="g">0</subfield>
                <subfield code="i">ONB</subfield>
                <subfield code="j">MAG</subfield>
                <subfield code="p">2</subfield>
                <subfield code="q">Picture Archives and Graphics Department (FID)</subfield>
              </datafield>
            </record>
          </recordData>
          <recordIdentifier>990030217420603338</recordIdentifier>
          <recordPosition>0</recordPosition>
        </record>
      </records>
      <extraResponseData xmlns:xb="http://www.exlibris.com/repository/search/xmlbeans/">
        <xb:exact>true</xb:exact>
        <xb:responseDate>2019-04-27T11:10:11+0200</xb:responseDate>
      </extraResponseData>
    </searchRetrieveResponse>
    

%% Cell type:markdown id: tags:

* OAI-PMH
    * OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) is used for metadata harvesting
    * 6 verbs
        * GetRecord – Used to retrieve an individual metadata record.
        * Identify – Used to retrieve repository information (e.g. name, version).
        * ListIdentifiers – Used to retrieve only the record headers.
        * ListMetadataFormats – Used to retrieve the available metadata formats.
        * ListRecords – Used to retrieve the actual item metadata records.
        * ListSets – Used to retrieve the set structure of a repository.
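
Each verb is sent as a plain HTTP request with a `verb` parameter plus verb-specific arguments. The sketch below builds such request URLs and checks the arguments that the OAI-PMH 2.0 specification requires for each verb. The helper `oai_url` is a hypothetical name; the base URL is the one used in the harvesting example in this notebook, `oai_dc` is the metadata prefix every OAI-PMH repository must support, and `ABODC` is one of the sets this repository lists.

```python
from urllib.parse import urlencode

OAI_BASE = "https://eu02.alma.exlibrisgroup.com/view/oai/43ACC_ONB/request"

# Required arguments per verb according to the OAI-PMH 2.0 specification
# (ignoring the resumptionToken form used to continue a list request)
REQUIRED_ARGS = {
    "GetRecord": {"identifier", "metadataPrefix"},
    "Identify": set(),
    "ListIdentifiers": {"metadataPrefix"},
    "ListMetadataFormats": set(),
    "ListRecords": {"metadataPrefix"},
    "ListSets": set(),
}


def oai_url(verb, **args):
    """Build an OAI-PMH request URL, checking the verb's required arguments."""
    if verb not in REQUIRED_ARGS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    missing = REQUIRED_ARGS[verb] - set(args)
    if missing:
        raise ValueError(f"{verb} requires: {', '.join(sorted(missing))}")
    return OAI_BASE + "?" + urlencode({"verb": verb, **args})


print(oai_url("ListSets"))
print(oai_url("ListRecords", metadataPrefix="oai_dc", set="ABODC"))
```

Which metadata prefixes other than `oai_dc` a repository offers can be checked with a `ListMetadataFormats` request.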




%% Cell type:code id: tags:

``` python
from sickle import Sickle

# Connect to the OAI-PMH endpoint of the ONB catalogue
sickle = Sickle('https://eu02.alma.exlibrisgroup.com/view/oai/43ACC_ONB/request')
# ListSets returns an iterator over all sets offered for selective harvesting
oai_sets = sickle.ListSets()
for oai_set in oai_sets:
    print('setSpec value for selective harvesting: ' + oai_set.setSpec)
    print('Name of the set (setName): ' + oai_set.setName + '\n')
```

%% Output

    setSpec value for selective harvesting: PAPYRUSDC
    Name of the set (setName): Papyri records in DC simple
    
    setSpec value for selective harvesting: FULLMARC
    Name of the set (setName): Complete set of ONB records in MARC
    
    setSpec value for selective harvesting: HANNAMARC
    Name of the set (setName): HANNA records in MARC
    
    setSpec value for selective harvesting: ESPERANTOMARC
    Name of the set (setName): Esperanto records in MARC
    
    setSpec value for selective harvesting: ESPERANTODC
    Name of the set (setName): Esperanto Records in DC simple
    
    setSpec value for selective harvesting: PAPYRUSMARC
    Name of the set (setName): Papyri records in MARC
    
    setSpec value for selective harvesting: HANNADC
    Name of the set (setName): HANNA records in DC simple
    
    setSpec value for selective harvesting: ABODC
    Name of the set (setName): Austrian Books Online in DC simple
    
    setSpec value for selective harvesting: ARIADNEDC
    Name of the set (setName): Ariadne records in DC simple
    
    setSpec value for selective harvesting: ARIADNEMARC
    Name of the set (setName): Ariadne records in MARC
    
    setSpec value for selective harvesting: MAPMARC
    Name of the set (setName): Maps and Globes records in MARC
    
    setSpec value for selective harvesting: FULLDC
    Name of the set (setName): Complete set of ONB records in DC simple
    
    setSpec value for selective harvesting: MAPDC
    Name of the set (setName): Maps and Globes records in DC simple
    
    setSpec value for selective harvesting: ABOMARC
    Name of the set (setName): Austrian Books Online in MARC
    
    setSpec value for selective harvesting: OAIBIBLIOA
    Name of the set (setName): Austrian Bibliography A
    
    setSpec value for selective harvesting: MUSHANDC
    Name of the set (setName): Musikhandschriften in DC
    
    setSpec value for selective harvesting: MUSHANMARC
    Name of the set (setName): Music Manuscripts
    
    setSpec value for selective harvesting: CERLMARC
    Name of the set (setName): Old prints and manuscripts for CERL portal