Commit 73fc8a56 authored by gabriele-h

Correct import statements for local

parent 79f6919d
%% Cell type:markdown id: tags:
# Extract Bibliographic Data by Unique ID
This notebook assumes that the user has a list of unique IDs for bibliographic records within the library software system [Alma](https://knowledge.exlibrisgroup.com/Alma/Product_Documentation/010Alma_Online_Help_(English)/010Getting_Started/010Alma_Introduction/010Alma_Overview).
In the following code the unique IDs are AC-numbers, which are a special identifier within the [Austrian Library Network](https://www.obvsg.at/). You could also provide other unique IDs such as MMS-IDs or barcodes. In that case, find and replace the function *by_marc_009()* with one of the other two functions provided by the catalogue submodule: *by_barcode()* or *by_mms_id()*.
In this example the catalogue of the Austrian National Library is the source. We use SRU to fetch the data and Python's pandas module to export the data to Excel.
%% Cell type:markdown id: tags:
## Setup
%% Cell type:markdown id: tags:
Necessary imports of standard, third party and local modules.
The local modules *almasru* and *marc_extractor* were taken from submodules in [catalogue](https://labs.onb.ac.at/gitlab/labs-team/catalogue/).
%% Cell type:code id: tags:
``` python
import datetime
import re
import sys
from collections import OrderedDict

import pandas as pd

# Local modules, imported from the catalogue package
# (previously: `import almasru` and `import marc_extractor`).
from catalogue.sru import almasru
from catalogue.marc_tools import marc_extractor
```
%% Cell type:markdown id: tags:
Basic setup of almasru for ONB. If you want to fetch the data for a different institution, change the following line.
%% Cell type:code id: tags:
``` python
alma = almasru.RecordRetriever('obv-at-oenb', '43ACC_ONB', 'marcxml')
```
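%% Cell type:markdown id: tags:
For orientation, the kind of SRU request that such a retriever presumably sends can be sketched with the standard library alone. This is a sketch under assumptions, not the *almasru* implementation: the host pattern, the index name `alma.local_control_field_009`, the helper `build_sru_url` and the sample AC-number are all hypothetical. Only the SRU 1.2 parameter names (`version`, `operation`, `recordSchema`, `query`) come from the SRU standard.
%% Cell type:code id: tags:
```python
from urllib.parse import urlencode


def build_sru_url(domain, institution, schema, ac_number):
    """Build an SRU searchRetrieve URL (hypothetical helper, not part of almasru)."""
    # Assumed host pattern for an Alma SRU endpoint.
    base = f"https://{domain}.alma.exlibrisgroup.com/view/sru/{institution}"
    params = {
        "version": "1.2",
        "operation": "searchRetrieve",
        "recordSchema": schema,
        # Assumed index name for MARC controlfield 009.
        "query": f"alma.local_control_field_009={ac_number}",
    }
    # urlencode percent-encodes the "=" inside the query value.
    return f"{base}?{urlencode(params)}"


url = build_sru_url("obv-at-oenb", "43ACC_ONB", "marcxml", "AC00000001")
print(url)
```
%% Cell type:markdown id: tags:
The three arguments passed to *RecordRetriever* above map naturally onto the host, the institution code and the record schema of such a URL, which is why changing that one line is enough to target a different institution.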
%% Cell type:markdown id: tags:
RegEx pattern for AC-numbers:
%% Cell type:code id: tags:
``` python
ac_pattern = re.compile(r'(AC\d{8})')
```
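%% Cell type:markdown id: tags:
To see what this pattern does and does not match, here is a small self-contained check. The sample strings are made up for illustration.
%% Cell type:code id: tags:
```python
import re

ac_pattern = re.compile(r'(AC\d{8})')

samples = [
    "AC12345678",          # bare AC-number
    "(AT-OBV)AC00000042",  # AC-number behind a prefix
    "no identifier here",  # nothing to find
]
# findall returns a list of all captured AC-numbers per string.
matches = [ac_pattern.findall(s) for s in samples]
print(matches)  # [['AC12345678'], ['AC00000042'], []]
```
%% Cell type:markdown id: tags:
Because `findall` returns an empty list when nothing matches, taking element `[0]` of the result raises an `IndexError` for strings without an AC-number.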
%% Cell type:markdown id: tags:
## Load Mapping
%% Cell type:markdown id: tags:
If necessary, make changes to *mapping.csv*.
Note that in the current version *mapping.csv* has no influence on the information inherited from parent records. If you need to change the inherited information, take a look at the functions *inherit_from_parent()* and *add_inheritance_to_columns()* in the cells below.
To get an idea of how the mapping works, take a look at the *build_extractor* function in the local *marc_extractor* module.
%% Cell type:code id: tags:
``` python
mapping = pd.read_csv('mapping.csv')
# Replace missing values (NaN) with None so that empty cells are passed on as None.
mapping = mapping.where(pd.notnull(mapping), None)
```
%% Cell type:markdown id: tags:
Check whether the mapping looks okay, i.e. whether pandas has read it correctly. Show the first and last entries for a visual check.
%% Cell type:code id: tags:
``` python
pd.concat((mapping.iloc[:1], mapping.iloc[-1:]))
```
%% Output
MARC controlfield MARC extra selector Liste \
0 009 None None
33 AVA _ _ $$d ; AVA _ _ $$i ; AVA _ _ $$j None None
MARC-XML controlfield Label \
0 <controlfield tag="009"> Systemnummer
33 <datafield tag="AVA" ind1=" " ind2=" "><subfie... Signatur
Comment
0 AC-Nummer
33 Signatur aus Subfield $$d, danach ohne Trennze...
%% Cell type:markdown id: tags:
## Create Extractors from Mapping
%% Cell type:markdown id: tags:
The first column (the pandas index) is ignored. The relevant parts of the table are used for the extractors.
%% Cell type:code id: tags:
``` python
column_extractors = OrderedDict()
for _, row in mapping.iterrows():
    # Positional access: 0 = pattern, 1 = selector, 2 = collector character, 4 = label.
    column_extractors[row.iloc[4]] = marc_extractor.build_extractor(row.iloc[0], row.iloc[1], row.iloc[2])
```
%% Cell type:markdown id: tags:
Print statement to look at an example of what is done above. It uses `row` as left over from the loop, i.e. the last row of the mapping:
%% Cell type:code id: tags:
``` python
print(
    f"Column '{row.iloc[4]}' with build_extractor parameters\n"
    f"\tpattern '{row.iloc[0]}'\n"
    f"\tselector '{row.iloc[1]}'\n"
    f"\tcollector_character '{row.iloc[2]}'\n"
)
```
%% Output
Column 'Signatur' with build_extractor parameters
    pattern 'AVA _ _ $$d ; AVA _ _ $$i ; AVA _ _ $$j'
    selector 'None'
    collector_character 'None'
%% Cell type:markdown id: tags:
## Prepare Postprocessing
%% Cell type:markdown id: tags:
* Remove semicolons from the Signatur column
* Extract the Barcode from ABO-links if available
%% Cell type:code id: tags:
``` python
def post(df):
    # Work on a copy so that the caller's DataFrame is left unchanged.
    df_out = df.copy()
    df_out['Signatur'] = df_out['Signatur'].str.replace(';', '', regex=False)
    # expand=False returns a Series, which can be assigned to a single column.
    df_out['Barcode'] = '+Z' + df_out['Volltext'].str.extract(r'http://data\.onb\.ac\.at/ABO/%2BZ(.*)', expand=False)
    return df_out
```
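%% Cell type:markdown id: tags:
The barcode extraction can be tried out on single strings without pandas. The following is a minimal sketch; the helper `barcode_from_link` and the sample links are hypothetical and only illustrate how the regular expression behaves.
%% Cell type:code id: tags:
```python
import re

# Same expression as in post(), with the dots escaped.
abo_pattern = re.compile(r'http://data\.onb\.ac\.at/ABO/%2BZ(.*)')


def barcode_from_link(link):
    """Return the '+Z' barcode for an ABO link, or None (hypothetical helper)."""
    match = abo_pattern.search(link or "")
    return '+Z' + match.group(1) if match else None


with_barcode = barcode_from_link('http://data.onb.ac.at/ABO/%2BZ123456789')
without_barcode = barcode_from_link('https://example.org/other')
print(with_barcode)     # +Z123456789
print(without_barcode)  # None
```
%% Cell type:markdown id: tags:
`%2B` is the URL-encoded plus sign, so the link already contains `+Z`; the code strips it as part of the match and prepends the readable form again.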
%% Cell type:markdown id: tags:
## Prepare All Functions Necessary for Extraction
%% Cell type:markdown id: tags:
Given a list of AC-numbers, create an Excel file of the bibliographic data for the records.
%% Cell type:code id: tags:
``` python
def ac_list_to_excel(ac_list, excel_file_name_stem):
    data = [get_bibliographic_for_ac(ac) for ac in ac_list]
    df = pd.DataFrame(data)
    df_post = post(df)
    # The Output/ directory must already exist.
    df_post.to_excel(f'Output/{excel_file_name_stem} {now()}.xlsx')
```
%% Cell type:markdown id: tags:
Create a timestamp of the current date and time for the output file name.
%% Cell type:code id: tags:
``` python
def now():
    # ISO timestamp without microseconds; colons are removed to keep the file name portable.
    current = datetime.datetime.now().replace(microsecond=0)
    return current.isoformat().replace(':', '')
```
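%% Cell type:markdown id: tags:
To show the resulting format without depending on the current time, the same formatting can be applied to a fixed moment. The helper `timestamp` and the chosen date are purely illustrative.
%% Cell type:code id: tags:
```python
import datetime


def timestamp(moment):
    """Format a datetime the way now() does, for a caller-supplied moment."""
    return moment.replace(microsecond=0).isoformat().replace(':', '')


fixed = datetime.datetime(2021, 3, 4, 15, 6, 7, 890000)
stamp = timestamp(fixed)
print(stamp)  # 2021-03-04T150607
# File name as built in ac_list_to_excel for the stem 'Test_'.
file_name = f'Output/Test_ {stamp}.xlsx'
print(file_name)  # Output/Test_ 2021-03-04T150607.xlsx
```
%% Cell type:markdown id: tags:
Note the space between the file name stem and the timestamp; it comes from the f-string in *ac_list_to_excel()*.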
%% Cell type:markdown id: tags:
Extract AC-numbers from a given Excel file.
%% Cell type:code id: tags:
``` python
def load_ac_list(file_name):
    # Expects a column 'Datensatznummer'; raises IndexError if a cell contains no AC-number.
    return pd.read_excel(file_name)['Datensatznummer'].apply(lambda s: ac_pattern.findall(s)[0])
```
%% Cell type:markdown id: tags:
### Fetching XML and Extracting Data
%% Cell type:markdown id: tags:
Fetch the MARC-XML of both the requested records (potential children) and, if present, of their parent records. Parent records may contain information that is valid for all their children and should be added or appended accordingly. The function *add_inheritance_to_columns()* adds the parent's information to the row that later becomes part of the pandas dataframe.
%% Cell type:code id: tags:
``` python
def get_bibliographic_for_ac(ac):
    # Start with an empty row so that a failed lookup still yields a usable record.
    d = OrderedDict((column, None) for column in column_extractors)
    d["Systemnummer"] = ac
    inherited = None
    try:
        marc_xml = alma.by_marc_009(ac)
        parent_acnum = find_parent_id_in_child_xml(marc_xml)
        if parent_acnum:
            parent_xml = fetch_parent_xml(parent_acnum)
            if parent_xml is not None:
                inherited = inherit_from_parent(parent_xml)
    except almasru.NoRecord:
        print(f'No record for AC number "{ac}" found.', file=sys.stderr)
        return d
    except Exception as e:
        print(f'Exception encountered: {str(e)}', file=sys.stderr)
        # Return the empty row instead of None, which would break pd.DataFrame(data).
        return d
    for column, extractor in column_extractors.items():
        d[column] = extractor.parse(marc_xml)
    if inherited:
        add_inheritance_to_columns(d, *inherited)
    return d
```
%% Cell type:markdown id: tags:
### Fetch Data of Parent Record
Parents can be referenced by unique ID either in MARC 773 \$\$w or 830 \$\$w. In our case references can only be resolved if they are AC-numbers.
%% Cell type:code id: tags:
``` python
def find_parent_id_in_child_xml(marc_xml):
    # None means "no resolvable parent"; if several references exist, the last one wins.
    parent_acnum = None
    for datafield in marc_xml:
        if datafield.get("tag") in ("773", "830"):
            for subfield in datafield:
                if subfield.get("code") == "w":
                    try:
                        parent_acnum = ac_pattern.findall(subfield.text)[0]
                    except (IndexError, TypeError) as e:
                        print(f"ERROR: Couldn't find AC-Num in 773 or 830 of the child. {e}", file=sys.stderr)
    return parent_acnum
```
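%% Cell type:markdown id: tags:
The lookup described above can be tried on a small record without calling Alma. This sketch assumes that records behave like `xml.etree.ElementTree` elements; the record itself is invented and heavily shortened, and the helper `parent_ids` is illustrative. Real MARC-XML usually carries a namespace, in which case the plain tag names used with `iter()` here would not match.
%% Cell type:code id: tags:
```python
import re
import xml.etree.ElementTree as ET

ac_pattern = re.compile(r'(AC\d{8})')

# Hypothetical child record with a 773 $$w reference to its parent.
child_xml = ET.fromstring("""
<record>
  <controlfield tag="009">AC00000002</controlfield>
  <datafield tag="245" ind1="0" ind2="0">
    <subfield code="a">Volume 2</subfield>
  </datafield>
  <datafield tag="773" ind1="0" ind2="8">
    <subfield code="w">(AT-OBV)AC00000001</subfield>
  </datafield>
</record>
""")


def parent_ids(record):
    """Collect AC-numbers from 773 $$w and 830 $$w (illustrative helper)."""
    found = []
    for datafield in record.iter("datafield"):
        if datafield.get("tag") in ("773", "830"):
            for subfield in datafield.iter("subfield"):
                if subfield.get("code") == "w":
                    # An empty subfield has text None, hence the fallback to "".
                    found.extend(ac_pattern.findall(subfield.text or ""))
    return found


ids = parent_ids(child_xml)
print(ids)  # ['AC00000001']
```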
%% Cell type:markdown id: tags:
Now that we have the parent's ID we can fetch its MARC record.
%% Cell type:code id: tags:
``` python
def fetch_parent_xml(parent_acnum):
    try:
        return alma.by_marc_009(parent_acnum)
    except Exception as e:
        print(f"ERROR: Fetching XML of parent {parent_acnum} caused an error. {e}", file=sys.stderr)
        # Return None explicitly; otherwise the function would fail on an unbound variable.
        return None
```
%% Cell type:markdown id: tags:
Fetch the inheritance (uniform title, subject, genre) from the parent record or return empty strings/lists if not present.
%% Cell type:code id: tags:
``` python
def inherit_from_parent(parent_xml):
    parent_title = ""
    parent_categories = []
    parent_contents = []
    for p_datafield in parent_xml:
        tag = p_datafield.get("tag")
        if tag in ("240", "130"):
            # Uniform title from subfield a or t.
            for p_subfield in p_datafield:
                if p_subfield.get("code") in ("a", "t"):
                    parent_title = p_subfield.text
        elif tag == "689":
            # Subject headings from subfield a.
            for p_subfield in p_datafield:
                if p_subfield.get("code") == "a":
                    parent_categories.append(p_subfield.text)
        elif tag == "655" and p_datafield.get("ind1") == " " and p_datafield.get("ind2") == "7":
            # Indicators are attributes of the datafield; subfields only carry `code`.
            for p_subfield in p_datafield:
                if p_subfield.get("code") == "a":
                    parent_contents.append(p_subfield.text)
    return parent_title, parent_categories, parent_contents
```
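%% Cell type:markdown id: tags:
Matching a field by its attributes can be written in more than one way. The sketch below assumes `xml.etree.ElementTree` elements and an invented 655 field. It compares a superset test on `attrib.items()` with plain `get()` calls, and shows which attributes a subfield actually has.
%% Cell type:code id: tags:
```python
import xml.etree.ElementTree as ET

# Hypothetical genre field of a parent record.
datafield = ET.fromstring(
    '<datafield tag="655" ind1=" " ind2="7">'
    '<subfield code="a">Reisebericht</subfield>'
    '</datafield>'
)
subfield = datafield[0]

# dict.items() is a set-like view, so >= tests "contains all of these pairs".
is_655 = datafield.attrib.items() >= {"tag": "655"}.items()
is_655_ind2_7 = datafield.attrib.items() >= {"tag": "655", "ind2": "7"}.items()
is_689 = datafield.attrib.items() >= {"tag": "689"}.items()

# The same test for 655 with second indicator 7, spelled out with get().
same_with_get = datafield.get("tag") == "655" and datafield.get("ind2") == "7"

print(is_655, is_655_ind2_7, is_689, same_with_get)  # True True False True
print(subfield.attrib)  # {'code': 'a'}
```
%% Cell type:markdown id: tags:
Both spellings give the same answer. The `get()` form is easier to read when only one or two attributes are involved; the superset form scales to several attribute pairs at once.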
%% Cell type:markdown id: tags:
Add information to columns if available.
%% Cell type:code id: tags:
``` python
def add_inheritance_to_columns(d, parent_title, parent_categories, parent_contents):
    # Independent checks: a record may inherit title, subjects and genre at the same time.
    if parent_title and not d["Werktitel"]:
        d["Werktitel"] = parent_title
    if parent_categories:
        # An empty or missing child value contributes no entries.
        child_categories = [c for c in (d["Schlagworte"] or "").split(';') if c]
        d["Schlagworte"] = ';'.join(parent_categories + child_categories)
    if parent_contents:
        child_contents = [c for c in (d["Art des Inhalts"] or "").split(';') if c]
        d["Art des Inhalts"] = ';'.join(parent_contents + child_contents)
```
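%% Cell type:markdown id: tags:
How parent and child values end up in one semicolon-separated cell can be shown with a plain dictionary. This is a sketch with invented values; the helper `merge_semicolon_list` is hypothetical and assumes that a child column may be empty or `None`.
%% Cell type:code id: tags:
```python
def merge_semicolon_list(parent_values, child_string):
    """Prepend parent values to a ';'-separated child string (hypothetical helper)."""
    # Drop empty entries so that a missing child value adds no stray semicolon.
    child_values = [v for v in (child_string or "").split(';') if v]
    return ';'.join(parent_values + child_values)


row = {"Schlagworte": "Japan;Reise"}
row["Schlagworte"] = merge_semicolon_list(["Asien"], row["Schlagworte"])
print(row["Schlagworte"])  # Asien;Japan;Reise

only_parent = merge_semicolon_list(["Asien"], None)
print(only_parent)  # Asien
```
%% Cell type:markdown id: tags:
Parent values come first, so the inherited entries stay recognisable at the start of the cell.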
%% Cell type:markdown id: tags:
## Do the Actual Extract
%% Cell type:markdown id: tags:
Insert your input and output file names here. You could provide more than one pair of input and output files.
%% Cell type:code id: tags:
``` python
ac_list = load_ac_list('Input/TravelogueD17_Japan.xlsx')
file_name_stem = 'Test_'
```
%% Cell type:code id: tags:
``` python
ac_list_to_excel(ac_list, file_name_stem)
```