Commit cbbdcd17 authored by gabriele-h

Beautify print statement (f-string)

parent 86bca5d4
%% Cell type:markdown id: tags:
# Extract Bibliographic Data by Unique ID
## Introduction
This notebook assumes that the user has a list of unique IDs for bibliographic records stored in the library software system [Alma](https://knowledge.exlibrisgroup.com/Alma/Product_Documentation/010Alma_Online_Help_(English)/010Getting_Started/010Alma_Introduction/010Alma_Overview).
These records are then filtered by categories contained in MARC-XML. Documentation on the MARC format is available from the [Library of Congress](https://www.loc.gov/marc/bibliographic/); for Austrian cataloging standards specifically, refer to the second and third columns of the (German-only) [Konkordanz](https://wiki.obvsg.at/Katalogisierungshandbuch/KonKordanz).
In the following code the unique IDs are MMS-IDs, the unique identifiers Alma assigns to its records. You could also provide other unique IDs such as barcodes or any ID from MARC 009 (e.g. AC-numbers for the Austrian Library Network). If you need to use another unique ID, do the following:
* find and replace the function *by_mms_id()* with one of the other two functions provided by the catalogue submodule: *by_barcode()* or *by_marc_009()*
* replace the *regex_pattern*
In this example the catalogue of the Austrian National Library is the source. We use SRU to fetch the data and Python's pandas library to export the data to Excel.
%% Cell type:markdown id: tags:
## Setup
%% Cell type:markdown id: tags:
Necessary imports of standard, third party and local modules.
The local modules *almasru* and *marc_extractor* were taken from submodules in [catalogue](https://labs.onb.ac.at/gitlab/labs-team/catalogue/).
%% Cell type:code id: tags:
``` python
import datetime
import re
import sys
from collections import OrderedDict
import pandas as pd
from catalogue.sru import almasru
from catalogue.marc_tools import marc_extractor
```
%% Cell type:markdown id: tags:
Basic setup of almasru for the Austrian National Library (ONB). To fetch data for a different institution, change the following line. For example, to search within the Austrian Library Network, replace *43ACC_ONB* with *43ACC_NETWORK*.
%% Cell type:code id: tags:
``` python
alma = almasru.RecordRetriever('obv-at-oenb', '43ACC_ONB', 'marcxml')
```
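%% Cell type:markdown id: tags:
For orientation, an SRU request is an ordinary HTTP GET with a few standard parameters. The sketch below builds such a URL by hand without sending it. The host name pattern, the `alma.mms_id` index, the placeholder MMS-ID, and the helper name `build_sru_url` are assumptions made for illustration; they are not taken from the *almasru* module, which may construct its requests differently.
%% Cell type:code id: tags:
```python
from urllib.parse import urlencode


def build_sru_url(subdomain, institution_code, query, schema="marcxml"):
    """Build an SRU searchRetrieve URL (hypothetical helper, not part of almasru)."""
    # Assumed host pattern for an Alma-hosted SRU endpoint.
    base = f"https://{subdomain}.alma.exlibrisgroup.com/view/sru/{institution_code}"
    params = {
        "version": "1.2",
        "operation": "searchRetrieve",
        "recordSchema": schema,
        "query": query,
    }
    # urlencode percent-encodes the "=" inside the query value.
    return f"{base}?{urlencode(params)}"


# The MMS-ID below is a made-up placeholder.
url = build_sru_url("obv-at-oenb", "43ACC_ONB", "alma.mms_id=990000000000000000")
print(url)
```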
%% Cell type:markdown id: tags:
*ac_pattern* is needed to resolve hierarchies between records within the Austrian Library Network (OBV). Within OBV the hierarchies are linked using MARC categories 773 and 830, identifying the parent by its AC-number.
If you want to query Alma instances outside OBV, get in contact with your local consortium or institution to find out how hierarchies are linked in the MARC-record. You will need to change the function *find_parent_id_in_child_xml()* accordingly.
%% Cell type:code id: tags:
``` python
ac_pattern = re.compile(r'(AC\d{8})')
```
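%% Cell type:markdown id: tags:
To see what the pattern matches, the following sketch applies it to a few subfield values. The sample strings, including the `(AT-OBV)` prefix, are invented for illustration and only assumed to resemble real 773 $w or 830 $w content.
%% Cell type:code id: tags:
```python
import re

ac_pattern = re.compile(r'(AC\d{8})')

# Hypothetical subfield values as they might appear in 773 $w or 830 $w.
samples = ['(AT-OBV)AC01234567', 'AC07654321', 'no identifier here']

for value in samples:
    matches = ac_pattern.findall(value)
    # findall returns a list; an empty list means no AC-number was found.
    print(value, '->', matches[0] if matches else None)
```
%% Cell type:markdown id: tags:
Because `findall` returns an empty list when nothing matches, indexing the result with `[0]` raises an `IndexError` for values without an AC-number, so callers need to handle that case.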
%% Cell type:markdown id: tags:
## Load Mapping
%% Cell type:markdown id: tags:
If necessary, make changes to *mapping.csv*. Keep in mind what was said in the [introduction](#Introduction) about the MARC format.
Take note that in the current version *mapping.csv* has no influence on the information inherited from parent records. If you need to change the inherited info, take a look at the functions *inherit_from_parent()* and *add_inheritance_to_columns()* in the cells below.
To get an idea of how the mapping works, take a look at the *build_extractor* function in the local *marc_extractor* module.
%% Cell type:code id: tags:
``` python
mapping = pd.read_csv('mapping.csv')
mapping = mapping.where((pd.notnull(mapping)), None)
```
%% Cell type:markdown id: tags:
Check that the mapping looks okay, meaning it was loaded correctly into pandas. Show the first and last entries for a visual check.
%% Cell type:code id: tags:
``` python
pd.concat((mapping.iloc[:1], mapping.iloc[-1:]))
```
%% Output
                          MARC controlfield  MARC extra selector  Liste  \
0                                       009                 None   None
33  AVA _ _ $$d ; AVA _ _ $$i ; AVA _ _ $$j                 None   None

                                MARC-XML controlfield         Label  \
0                            <controlfield tag="009">  Systemnummer
33  <datafield tag="AVA" ind1=" " ind2=" "><subfie...      Signatur

                                              Comment
0                                           AC-Nummer
33  Signatur aus Subfield $$d, danach ohne Trennze...
%% Cell type:markdown id: tags:
## Create Extractors from Mapping
%% Cell type:markdown id: tags:
The first column (the pandas index) is ignored. The relevant parts of the table are used for the extractors.
%% Cell type:code id: tags:
``` python
column_extractors = OrderedDict()
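for _, row in mapping.iterrows():
    # Positional access: 0 = pattern, 1 = selector, 2 = collector character, 4 = label.
    column_extractors[row.iloc[4]] = marc_extractor.build_extractor(row.iloc[0], row.iloc[1], row.iloc[2])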
```
%% Cell type:markdown id: tags:
Print-statement to look at an example of what is done above:
%% Cell type:code id: tags:
``` python
# `row` still holds the last row of the mapping from the loop above.
print(f"Column '{row.iloc[4]}' with build_extractor parameters\n\tpattern '{row.iloc[0]}'\n\tselector '{row.iloc[1]}'\n\tcollector_character '{row.iloc[2]}'\n")
```
%% Output
Column 'Signatur' with build_extractor parameters
    pattern 'AVA _ _ $$d ; AVA _ _ $$i ; AVA _ _ $$j'
    selector 'None'
    collector_character 'None'
%% Cell type:markdown id: tags:
## Prepare Postprocessing
%% Cell type:markdown id: tags:
* Remove semicolons from the Signatur column
* Specific to 43ACC_ONB: extract the barcode from ABO links if available
%% Cell type:code id: tags:
``` python
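def post(df):
    # df_out refers to the same object as df, so the input DataFrame is modified in place.
    df_out = df
    df_out['Signatur'] = df_out['Signatur'].str.replace(';', '')
    # expand=False yields a Series rather than a one-column DataFrame for the single capture group.
    df_out['Barcode'] = '+Z' + df_out['Volltext'].str.extract(r'http://data.onb.ac.at/ABO/%2BZ(.*)', expand=False)
    return df_out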
```
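%% Cell type:markdown id: tags:
The barcode extraction can be tried on single strings without pandas. This is a minimal sketch using the same regular expression with the standard `re` module; the helper name `barcode_from_link` and both sample links are assumptions made for illustration.
%% Cell type:code id: tags:
```python
import re

abo_pattern = re.compile(r'http://data.onb.ac.at/ABO/%2BZ(.*)')


def barcode_from_link(link):
    """Return the barcode encoded in an ABO link, or None if there is none."""
    match = abo_pattern.search(link or '')
    return '+Z' + match.group(1) if match else None


# Both values are invented for illustration.
print(barcode_from_link('http://data.onb.ac.at/ABO/%2BZ123456789'))
print(barcode_from_link('http://example.org/other'))
```
%% Cell type:markdown id: tags:
`%2B` is the URL-encoded form of `+`, which is why the code prepends the literal prefix `+Z` to the captured digits.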
%% Cell type:markdown id: tags:
## Prepare All Functions Necessary for Extraction
%% Cell type:markdown id: tags:
With a given list of unique identifiers, create an Excel-file of the bibliographic data for the records.
%% Cell type:code id: tags:
``` python
def uid_list_to_excel(uid_list, excel_file_name_stem):
    # Fetch one row of bibliographic data per unique ID.
    data = [get_bibliographic_for_uid(uid) for uid in uid_list]
    df = pd.DataFrame(data)
    df_post = post(df)
    # The Output directory must already exist.
    df_post.to_excel(f'Output/{excel_file_name_stem}_{now()}.xlsx', index=False)
```
%% Cell type:markdown id: tags:
Create timestamp of current date for file creation.
%% Cell type:code id: tags:
``` python
def now():
    now = datetime.datetime.now().replace(microsecond=0)
    # Colons are removed because they are not allowed in file names on some systems.
    now_without_colons = now.isoformat().replace(':', '')
    return now_without_colons
```
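%% Cell type:markdown id: tags:
To show the resulting file name suffix, this sketch applies the same formatting steps to a fixed point in time. The function name `timestamp_for_filename` is a hypothetical stand-in that takes the datetime as a parameter so the result is reproducible.
%% Cell type:code id: tags:
```python
import datetime


def timestamp_for_filename(moment):
    """Format a datetime as ISO 8601 without microseconds and without colons."""
    return moment.replace(microsecond=0).isoformat().replace(':', '')


# A fixed datetime is used so the result is reproducible.
example = datetime.datetime(2021, 3, 4, 15, 6, 7, 123456)
print(timestamp_for_filename(example))  # 2021-03-04T150607
```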
%% Cell type:markdown id: tags:
Extract unique IDs from a given Excel file. In this example we used a search export from Alma, where the MMS-ID is listed in the column 'MMS ID'.
For any other Excel file, use the header of the relevant column instead. This means the first row of your Excel file must contain column headers rather than data, and the column holding the IDs must have a distinct name.
%% Cell type:code id: tags:
``` python
def load_uid_list(file_name):
    try:
        record_numbers = pd.read_excel(file_name)['MMS ID']
    except Exception as e:
        # The error is reported and the function implicitly returns None.
        print(f'Exception encountered while reading Excel-file: {str(e)}', file=sys.stderr)
    else:
        return record_numbers
```
%% Cell type:markdown id: tags:
### Fetching XML and Extracting Data
%% Cell type:markdown id: tags:
Fetch the MARC-XML of both the requested records (potential children) and, if present, their parent records. Parent records may contain information that is valid for all their children and should be added or appended accordingly. The function *add_inheritance_to_columns()* will add the parent's information to the row that ends up in the pandas dataframe.
%% Cell type:code id: tags:
``` python
def get_bibliographic_for_uid(uid):
    try:
        marc_xml = alma.by_mms_id(uid)
        parent_uid = find_parent_id_in_child_xml(marc_xml)
        if parent_uid:
            parent_xml = fetch_parent_xml(parent_uid)
            parent_title, parent_categories, parent_contents = inherit_from_parent(parent_xml)
    except almasru.NoRecord:
        # No record found: return an otherwise empty row that keeps the requested ID.
        print(f'No record for unique ID "{uid}" found.', file=sys.stderr)
        d = OrderedDict()
        for column, _ in column_extractors.items():
            d[column] = None
        d["Systemnummer"] = uid
        return d
    except Exception as e:
        # Any other error is reported; the function then implicitly returns None.
        print(f'Exception encountered while fetching bibliographic data: {str(e)}', file=sys.stderr)
    else:
        d = OrderedDict()
        for column, extractor in column_extractors.items():
            d[column] = extractor.parse(marc_xml)
        # parent_title only exists if a parent record was found and parsed.
        if 'parent_title' in locals():
            add_inheritance_to_columns(d, parent_title, parent_categories, parent_contents)
        return d
```
%% Cell type:markdown id: tags:
### Fetch Data of Parent Record
Parents can be referenced by unique ID either in MARC 773 Subfield w or 830 Subfield w. In our case references can only be resolved if they are AC-numbers.
%% Cell type:code id: tags:
``` python
def find_parent_id_in_child_xml(marc_xml):
    for datafield in marc_xml:
        if datafield.attrib.items() >= {"tag": "773"}.items() or \
                datafield.attrib.items() >= {"tag": "830"}.items():
            for subfield in datafield:
                if subfield.attrib.items() >= {"code": "w"}.items():
                    try:
                        # Return the first AC-number found in subfield w.
                        return ac_pattern.findall(subfield.text)[0]
                    except Exception as e:
                        print(f"ERROR: Couldn't find AC-Num in 773 or 830 of the child. {e}", file=sys.stderr)
    # No resolvable parent reference was found.
    return None
```
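%% Cell type:markdown id: tags:
The lookup can be tried offline on a hand-written record. The sketch below assumes the record behaves like an `xml.etree.ElementTree` element without namespaces; the type actually returned by *almasru* is not shown in this notebook and may differ. The sample record and the function name `find_parent_ac` are invented for illustration.
%% Cell type:code id: tags:
```python
import re
import xml.etree.ElementTree as ET

ac_pattern = re.compile(r'(AC\d{8})')

# Invented, namespace-free record for illustration only.
SAMPLE_RECORD = """
<record>
  <controlfield tag="009">AC11111111</controlfield>
  <datafield tag="245" ind1="0" ind2="0">
    <subfield code="a">Child title</subfield>
  </datafield>
  <datafield tag="773" ind1="0" ind2="8">
    <subfield code="w">(AT-OBV)AC22222222</subfield>
  </datafield>
</record>
"""


def find_parent_ac(record):
    """Return the first AC-number found in 773 $w or 830 $w, else None."""
    for field in record:
        if field.get('tag') not in ('773', '830'):
            continue
        for subfield in field:
            if subfield.get('code') != 'w':
                continue
            # subfield.text is None for empty elements, hence the fallback.
            match = ac_pattern.search(subfield.text or '')
            if match:
                return match.group(1)
    return None


record = ET.fromstring(SAMPLE_RECORD.strip())
print(find_parent_ac(record))
```
%% Cell type:markdown id: tags:
The record's own AC-number in controlfield 009 is ignored here because only datafields 773 and 830 are inspected.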
%% Cell type:markdown id: tags:
Now that we have the parent's ID we can fetch its MARC-record.
%% Cell type:code id: tags:
``` python
def fetch_parent_xml(parent_uid):
    try:
        return alma.by_marc_009(parent_uid)
    except Exception as e:
        print(f"ERROR: Fetching XML of parent {parent_uid} caused an error. {e}", file=sys.stderr)
        # Return None explicitly instead of referencing an unassigned variable.
        return None
```
%% Cell type:markdown id: tags:
Fetch the inheritance (uniform title, subject, genre) from the parent record or return empty strings/lists if not present.
%% Cell type:code id: tags:
``` python
def inherit_from_parent(parent_xml):
    parent_title = ""
    parent_categories = []
    parent_contents = []
    for p_datafield in parent_xml:
        # Uniform title: 240 or 130, subfield a or t.
        if p_datafield.attrib.items() >= {"tag": "240"}.items() or \
                p_datafield.attrib.items() >= {"tag": "130"}.items():
            for p_subfield in p_datafield:
                if p_subfield.attrib.items() >= {"code": "a"}.items() or \
                        p_subfield.attrib.items() >= {"code": "t"}.items():
                    parent_title = p_subfield.text
        # Subject headings: 689, subfield a.
        elif p_datafield.attrib.items() >= {"tag": "689"}.items():
            for p_subfield in p_datafield:
                if p_subfield.attrib.items() >= {"code": "a"}.items():
                    parent_category = p_subfield.text
                    parent_categories.append(parent_category)
        # Genre: 655 with indicators " " and "7". Indicators are attributes of the
        # datafield; subfields only carry a code, so the check belongs here.
        elif p_datafield.attrib.items() >= {"tag": "655", "ind1": " ", "ind2": "7"}.items():
            for p_subfield in p_datafield:
                if p_subfield.attrib.items() >= {"code": "a"}.items():
                    parent_content = p_subfield.text
                    parent_contents.append(parent_content)
    return parent_title, parent_categories, parent_contents
```
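%% Cell type:markdown id: tags:
The attribute checks above rely on a standard Python feature: `dict.items()` returns a set-like view, so `>=` tests whether one set of key/value pairs contains another. A short sketch with an invented attribute dictionary:
%% Cell type:code id: tags:
```python
# Attributes of a datafield as an XML element would expose them via .attrib.
attrib = {'tag': '655', 'ind1': ' ', 'ind2': '7'}

# dict.items() returns a set-like view, so >= tests for "is a superset of".
print(attrib.items() >= {'tag': '655'}.items())               # True
print(attrib.items() >= {'tag': '655', 'ind2': '7'}.items())  # True
print(attrib.items() >= {'tag': '689'}.items())               # False

# For a single attribute, a plain lookup gives the same answer.
print(attrib.get('tag') == '655')                             # True
```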
%% Cell type:markdown id: tags:
Add information to columns if available.
%% Cell type:code id: tags:
``` python
def add_inheritance_to_columns(d, parent_title, parent_categories, parent_contents):
    # Because the branches are chained with elif, at most one of them applies per record.
    if parent_title != "" and not d["Werktitel"]:
        d["Werktitel"] = parent_title
    elif parent_categories:
        # Prepend the parent's subject headings to the child's.
        child_categories_string = d["Schlagworte"]
        child_categories = child_categories_string.split(';')
        all_categories = parent_categories + child_categories
        d["Schlagworte"] = ';'.join(all_categories)
    elif parent_contents:
        # Prepend the parent's genre terms to the child's.
        child_contents_string = d["Art des Inhalts"]
        child_contents = child_contents_string.split(';')
        all_contents = parent_contents + child_contents
        d["Art des Inhalts"] = ';'.join(all_contents)
```
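%% Cell type:markdown id: tags:
The merging of parent and child values can be sketched in isolation. Unlike the function above, this hypothetical helper `merge_semicolon_lists` also tolerates a child value of `None` or an empty string; that guard is an addition for illustration, not part of the notebook's code. The subject headings are invented.
%% Cell type:code id: tags:
```python
def merge_semicolon_lists(parent_values, child_string):
    """Prepend parent values to a semicolon-separated child string."""
    # Treat None or an empty string as "the child has no values".
    child_values = child_string.split(';') if child_string else []
    return ';'.join(parent_values + child_values)


# Invented subject headings for illustration.
print(merge_semicolon_lists(['Japan', 'Reise'], 'Geschichte;Tokio'))
print(merge_semicolon_lists(['Japan'], None))
```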
%% Cell type:markdown id: tags:
## Do the Actual Extract
%% Cell type:markdown id: tags:
Insert your input and output file names here. You could provide more than one pair of input and output files.
%% Cell type:code id: tags:
``` python
# Use the loader defined above; the result is passed to uid_list_to_excel() in the next cell.
uid_list = load_uid_list('Input/TravelogueD17_Japan.xlsx')
file_name_stem = 'Test_'
```
%% Cell type:code id: tags:
``` python
uid_list_to_excel(uid_list, file_name_stem)
```
%% Cell type:code id: tags:
``` python
```