"print(f\"Column '{str(row[4])}' with build_extractor parameters\\n\\tpattern '{str(row[0])}'\\n\\tselector '{str(row[1])}'\\n\\tcollector_character '{str(row[2])}'\\n\")"
]
},
{
...
...
%% Cell type:markdown id: tags:
# Extract Bibliographic Data by Unique ID
## Introduction
This notebook assumes that the user has a list of unique IDs for bibliographic records stored in the library software system [Alma](https://knowledge.exlibrisgroup.com/Alma/Product_Documentation/010Alma_Online_Help_(English)/010Getting_Started/010Alma_Introduction/010Alma_Overview).
These records are then filtered by categories contained in MARC-XML. Documentation on the MARC-XML format is available from the [Library of Congress](https://www.loc.gov/marc/bibliographic/); for Austrian cataloging standards specifically, refer to the second and third columns of the (German-only) [Konkordanz](https://wiki.obvsg.at/Katalogisierungshandbuch/KonKordanz).
In the following code the unique IDs are MMS-IDs, the unique identifiers of records within Alma. You could also provide other unique IDs like barcodes or any ID from MARC 009 (e.g. for the Austrian Library Network: AC-numbers). If you need to use another unique ID, do the following:
* find and replace the function *by_mms_id()* with one of the other two functions provided by the catalogue submodule: *by_barcode()* or *by_marc_009()*
* replace the *regex_pattern*
In this example the catalogue of the Austrian National Library is the source. We use SRU to fetch the data and Python's pandas module to export the data to Excel.
%% Cell type:markdown id: tags:
## Setup
%% Cell type:markdown id: tags:
Necessary imports of standard library, third-party and local modules.
The local modules *almasru* and *marc_extractor* were taken from submodules in [catalogue](https://labs.onb.ac.at/gitlab/labs-team/catalogue/).
%% Cell type:code id: tags:
``` python
import datetime
import re
import sys
from collections import OrderedDict
import pandas as pd
from catalogue.sru import almasru
from catalogue.marc_tools import marc_extractor
```
%% Cell type:markdown id: tags:
Basic setup of almasru for ONB. If you want to fetch the data for a different institution, change the following line. For example for a search within the Austrian Library Network replace *43ACC_ONB* with *43ACC_NETWORK*.
*ac_pattern* is needed for hierarchies between records within the Austrian Library Network (OBV). Within OBV the hierarchies are linked using MARC categories 773 and 830, identifying the parent by its AC-number.
If you want to query Alma instances outside OBV, get in contact with your local consortium or institution to find out how hierarchies are linked in the MARC-record. You will need to change the function *find_parent_id_in_child_xml()* accordingly.
%% Cell type:code id: tags:
``` python
ac_pattern = re.compile(r'(AC\d{8})')
```
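%% Cell type:markdown id: tags:
To illustrate where the institution code ends up, here is a minimal sketch of an Alma SRU request URL built by hand. It is not how *almasru* is implemented; the host name, the MMS-ID and the helper *build_sru_url()* are hypothetical placeholders, and only the general shape of the request (the path `/view/sru/<institution code>` and the search index `alma.mms_id`) is assumed from the Alma SRU interface.
%% Cell type:code id: tags:
```python
from urllib.parse import urlencode

# Hypothetical host name; replace it with the Alma host of your institution.
SRU_HOST = 'https://example.alma.exlibrisgroup.com'


def build_sru_url(institution_code, mms_id):
    """Return a searchRetrieve URL asking for one record as MARC-XML."""
    params = {
        'version': '1.2',
        'operation': 'searchRetrieve',
        'recordSchema': 'marcxml',
        'query': f'alma.mms_id={mms_id}',
    }
    return f'{SRU_HOST}/view/sru/{institution_code}?{urlencode(params)}'


# Placeholder MMS-ID, used only to show the difference between the two codes.
onb_url = build_sru_url('43ACC_ONB', '990000000000000000')
network_url = build_sru_url('43ACC_NETWORK', '990000000000000000')
print(onb_url)
print(network_url)
```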
%% Cell type:markdown id: tags:
## Load Mapping
%% Cell type:markdown id: tags:
If necessary, make changes to *mapping.csv*. Keep in mind what was said in the [introduction](#Introduction) about the MARC format.
Take note that in the current version *mapping.csv* has no influence on the information inherited from parent records. If you need to change the inherited info, take a look at the function *inherit_from_parent()* and *add_inheritance_to_columns()* in the cells below.
To get an idea of how the mapping works, take a look at the *build_extractor* function in the local *marc_extractor* module.
%% Cell type:code id: tags:
``` python
mapping = pd.read_csv('mapping.csv')
mapping = mapping.where(pd.notnull(mapping), None)
```
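%% Cell type:markdown id: tags:
The second line above replaces missing values (NaN) with `None`, so that empty cells of the CSV reach the extractor as plain Python `None`. A small sketch of the effect, using a made-up stand-in for *mapping.csv* with hypothetical column names; `dtype=object` is an addition here so that the demonstration behaves the same across pandas versions.
%% Cell type:code id: tags:
```python
import io

import pandas as pd

# Made-up stand-in for mapping.csv; the column names are hypothetical.
csv_text = (
    'pattern,selector,column\n'
    '009,,AC number\n'
    '245 _ _ $$a,first,Title\n'
)

# Read every column as plain Python objects (strings or missing values).
demo = pd.read_csv(io.StringIO(csv_text), dtype=object)
print(repr(demo.loc[0, 'selector']))  # the empty cell as pandas read it

# Same idiom as in the cell above: keep non-null values, use None elsewhere.
demo = demo.where(pd.notnull(demo), None)
print(repr(demo.loc[0, 'selector']))  # the empty cell after the replacement
```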
%% Cell type:markdown id: tags:
Check that the mapping looks okay, meaning it was loaded correctly into pandas. Show the first and last entry for a visual check.
%% Cell type:code id: tags:
``` python
pd.concat((mapping.iloc[:1], mapping.iloc[-1:]))
```
%% Output
MARC controlfield MARC extra selector Liste \
0 009 None None
33 AVA _ _ $$d ; AVA _ _ $$i ; AVA _ _ $$j None None
%% Cell type:markdown id: tags:
Extract unique IDs from a given Excel-file. In this example we used a search-export done in Alma, where the MMS-ID is listed in the column 'MMS ID'.
For any other Excel-file, use the header of the column that holds the IDs. This means the first row of your Excel-file must not contain data; it must instead give a distinct name for the data listed in the column below it.
%% Cell type:code id: tags:
``` python
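def load_uid_list(file_name):
    """Return the 'MMS ID' column of an Excel file, or None if reading fails."""
    try:
        record_numbers = pd.read_excel(file_name)['MMS ID']
    except Exception as e:
        # Report the problem on stderr; the function then implicitly returns None.
        print(f'Exception encountered while reading Excel-file: {str(e)}', file=sys.stderr)
    else:
        return record_numbers
```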
```
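%% Cell type:markdown id: tags:
If your Excel-file uses a different column header, the column name can be made a parameter. The following is a minimal sketch, not part of the notebook's own code: the helper *pick_uid_column()* and the sample data are hypothetical, and an in-memory dataframe stands in for the result of `pd.read_excel(file_name, dtype=str)`.
%% Cell type:code id: tags:
```python
import pandas as pd


def pick_uid_column(frame, column='MMS ID'):
    """Return the IDs of one named column as a list of strings."""
    if column not in frame.columns:
        raise KeyError(f'Column {column!r} not found; available: {list(frame.columns)}')
    # Drop empty cells and keep the IDs as strings.
    return frame[column].dropna().astype(str).tolist()


# In-memory stand-in for pd.read_excel(file_name, dtype=str); values are made up.
export = pd.DataFrame({
    'Title': ['Example A', 'Example B'],
    'Barcode': ['+Z100000001', '+Z100000002'],
})
print(pick_uid_column(export, column='Barcode'))
```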
%% Cell type:markdown id: tags:
### Fetching XML and Extracting Data
%% Cell type:markdown id: tags:
Fetch MARC-XML of both the requested records (potential children) and, if present, of their parent records. Parent records may contain information that is valid for all their children and should be added or appended accordingly. The function *add_inheritance_to_columns()* will add the parent's information to the pandas dataframe.
Parents can be referenced by unique ID either in MARC 773 Subfield w or 830 Subfield w. In our case references can only be resolved if they are AC-numbers.
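%% Cell type:markdown id: tags:
As an illustration of this lookup, here is a minimal sketch that searches 773 $w and 830 $w of a child record for an AC-number. It is not the notebook's *find_parent_id_in_child_xml()*: the helper name *find_parent_ac_number()* and the sample record (including the `(AT-OBV)` prefix in subfield w) are assumptions made for this example.
%% Cell type:code id: tags:
```python
import re
import xml.etree.ElementTree as ET

MARC_NS = {'marc': 'http://www.loc.gov/MARC21/slim'}
ac_pattern = re.compile(r'(AC\d{8})')


def find_parent_ac_number(record_xml):
    """Return the first AC-number found in 773 $w or 830 $w, else None."""
    root = ET.fromstring(record_xml)
    for tag in ('773', '830'):
        xpath = f".//marc:datafield[@tag='{tag}']/marc:subfield[@code='w']"
        for subfield in root.findall(xpath, MARC_NS):
            match = ac_pattern.search(subfield.text or '')
            if match:
                return match.group(1)
    return None


# Made-up child record that links to its parent in 773 $w.
child_xml = """
<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="0" ind2="0">
    <subfield code="a">Example volume</subfield>
  </datafield>
  <datafield tag="773" ind1="0" ind2="8">
    <subfield code="w">(AT-OBV)AC00000001</subfield>
  </datafield>
</record>
"""
print(find_parent_ac_number(child_xml))
```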