Correct import statements for local

Commit 73fc8a56 authored 5 years ago by gabriele-h
--- a/Extract_Bibliographic_Info_From_Alma.ipynb
+++ b/Extract_Bibliographic_Info_From_Alma.ipynb
@@ -42,8 +42,8 @@
    "\n",
    "import pandas as pd\n",
    "\n",
-    "import almasru\n",
+    "from catalogue.sru import almasru\n",
-    "import marc_extractor"
+    "from catalogue.marc_tools import marc_extractor"
   ]
  },
  {

 %% Cell type:markdown id: tags:
 # Extract Bibliographic Data by Unique ID
 This notebook assumes that the user has a list of unique IDs for bibliographic records within the library software system [Alma](https://knowledge.exlibrisgroup.com/Alma/Product_Documentation/010Alma_Online_Help_(English)/010Getting_Started/010Alma_Introduction/010Alma_Overview).
 In the following code the unique IDs are AC-numbers, an identifier specific to the [Austrian Library Network](https://www.obvsg.at/). You could also provide other unique IDs such as MMS-IDs or barcodes. In that case, replace calls to *by_marc_009()* with one of the other two functions provided by the catalogue submodule: *by_barcode()* or *by_mms_id()*.
 In this example the catalogue of the Austrian National Library is the source. We use SRU to fetch the data and Python's pandas library to export the data to Excel.
 %% Cell type:markdown id: tags:
 ## Setup
 %% Cell type:markdown id: tags:
 Necessary imports of standard, third-party and local modules.
 The local modules *almasru* and *marc_extractor* were taken from submodules in [catalogue](https://labs.onb.ac.at/gitlab/labs-team/catalogue/).
 %% Cell type:code id: tags:
 ``` python
 import datetime
 import re
 import sys
 from collections import OrderedDict
 import pandas as pd
-import almasru
+from catalogue.sru import almasru
-import marc_extractor
+from catalogue.marc_tools import marc_extractor
 ```
 %% Cell type:markdown id: tags:
 Basic setup of almasru for ONB. If you want to fetch the data for a different institution, change the following line.
 %% Cell type:code id: tags:
 ``` python
 alma = almasru.RecordRetriever('obv-at-oenb', '43ACC_ONB', 'marcxml')
 ```
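 For orientation, the three arguments above plausibly correspond to the parts of an SRU request: a host prefix, an institution code and a record schema. The sketch below builds such a request URL with the standard library only. The host pattern, the index name `alma.local_control_field_009`, the placeholder AC-number and the helper `build_sru_url` are assumptions for illustration; they are not taken from *almasru*, which may construct its requests differently.

```python
from urllib.parse import urlencode


def build_sru_url(subdomain, institution_code, record_schema, query):
    """Build an SRU 1.2 searchRetrieve URL (hypothetical helper).

    Assumption: the endpoint follows the pattern
    https://<subdomain>.alma.exlibrisgroup.com/view/sru/<institution_code>.
    """
    base = f"https://{subdomain}.alma.exlibrisgroup.com/view/sru/{institution_code}"
    params = {
        "version": "1.2",
        "operation": "searchRetrieve",
        "recordSchema": record_schema,
        "query": query,
    }
    # urlencode percent-encodes the query value, including its '=' sign.
    return f"{base}?{urlencode(params)}"


# Assumed index name for MARC control field 009; the AC-number is a placeholder.
url = build_sru_url("obv-at-oenb", "43ACC_ONB", "marcxml",
                    "alma.local_control_field_009=AC00000001")
print(url)
```

 Switching to another institution would then mean changing the first two arguments, which matches the advice given for the cell above.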
 %% Cell type:markdown id: tags:
 RegEx-pattern for AC-numbers:
 %% Cell type:code id: tags:
 ``` python
 ac_pattern = re.compile(r'(AC\d{8})')
 ```
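 A quick check of what the pattern accepts may help when adapting it. The sample strings below are made up. Note that the pattern is not anchored, so it also matches the first eight digits of a longer digit run.

```python
import re

ac_pattern = re.compile(r'(AC\d{8})')

# Made-up sample values, for illustration only.
samples = ["(AT-OBV)AC01234567", "AC0123456", "AC012345678", "no identifier"]
for text in samples:
    print(repr(text), "->", ac_pattern.findall(text))
```

 If stricter matching is needed, a word boundary such as `\b` after the digits would be one option.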
 %% Cell type:markdown id: tags:
 ## Load Mapping
 %% Cell type:markdown id: tags:
 If necessary, make changes to *mapping.csv*.
 Note that in the current version *mapping.csv* has no influence on the information inherited from parent records. If you need to change the inherited information, take a look at the functions *inherit_from_parent()* and *add_inheritance_to_columns()* in the cells below.
 To get an idea of how the mapping works, take a look at the *build_extractor* function in the local *marc_extractor* module.
 %% Cell type:code id: tags:
 ``` python
 mapping = pd.read_csv('mapping.csv')
 mapping = mapping.where((pd.notnull(mapping)), None)
 ```
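 The second line above replaces missing values with `None`, presumably so that empty mapping cells reach the extractor as "not set" rather than as the float `NaN`. In some pandas versions `where(..., None)` leaves `NaN` in place for non-object columns, for example a column that is empty in every row. The sketch below shows a variant that casts to `object` first; the two-row CSV and its column names are made up.

```python
import io

import pandas as pd

# Made-up two-row mapping; the column names are placeholders.
csv_text = "pattern,selector\n009,\nAVA _ _ $$d,x\n"
raw = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Casting to object first lets every column hold None instead of NaN.
cleaned = raw.astype(object).where(raw.notna(), None)

print(cleaned.loc[0, "selector"])
print(cleaned.loc[1, "selector"])
```

 Reading with `dtype=str` is an extra precaution in this sketch so that a tag like `009` is never interpreted as a number.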
 %% Cell type:markdown id: tags:
 Check that the mapping was loaded correctly into pandas. Display the first and last rows for a visual check.
 %% Cell type:code id: tags:
 ``` python
 pd.concat((mapping.iloc[:1], mapping.iloc[-1:]))
 ```
 %% Output
                              MARC controlfield MARC extra selector Liste  \
    0                                       009                None  None
    33  AVA _ _ $$d ; AVA _ _ $$i ; AVA _ _ $$j                None  None
                                    MARC-XML controlfield         Label  \
    0                            <controlfield tag="009">  Systemnummer
    33  <datafield tag="AVA" ind1=" " ind2=" "><subfie...      Signatur
                                                  Comment
    0                                           AC-Nummer
    33  Signatur aus Subfield $$d, danach ohne Trennze...
 %% Cell type:markdown id: tags:
 ## Create Extractors from Mapping
 %% Cell type:markdown id: tags:
 The first column (the pandas index) is ignored. The relevant columns of the table are passed to the extractors, and the label column becomes the name of the output column.
 %% Cell type:code id: tags:
 ``` python
 column_extractors = OrderedDict()
 for _, row in mapping.iterrows():
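    # Use .iloc for positional access; plain integer keys on a labelled Series are deprecated
    column_extractors[row.iloc[4]] = marc_extractor.build_extractor(row.iloc[0], row.iloc[1], row.iloc[2])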
 ```
 %% Cell type:markdown id: tags:
 Print the parameters of the last mapping row as an example of what the loop above does:
 %% Cell type:code id: tags:
 ``` python
 print(f"Column '{row.iloc[4]}' with build_extractor parameters\n\tpattern '{row.iloc[0]}'\n\tselector '{row.iloc[1]}'\n\tcollector_character '{row.iloc[2]}'\n")
 ```
 %% Output
    Column 'Signatur' with build_extractor parameters
    	pattern 'AVA _ _ $$d ; AVA _ _ $$i ; AVA _ _ $$j'
    	selector 'None'
    	collector_character 'None'
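 To make the pattern syntax shown above more tangible, here is a minimal sketch of how such a string could be split into field specifications. It is not the implementation of *build_extractor*; `FieldSpec` and `parse_pattern` are hypothetical names. The reading of `_` as a blank indicator is inferred from the mapping output, where `AVA _ _` appears next to `ind1=" " ind2=" "`.

```python
from typing import NamedTuple, Optional


class FieldSpec(NamedTuple):
    tag: str
    ind1: Optional[str]
    ind2: Optional[str]
    subfield: Optional[str]


def parse_pattern(pattern):
    """Split a mapping pattern into field specifications (hypothetical)."""
    specs = []
    for part in pattern.split(";"):
        tokens = part.split()
        if not tokens:
            continue
        if len(tokens) == 1:
            # Control field such as '009': no indicators, no subfield.
            specs.append(FieldSpec(tokens[0], None, None, None))
            continue
        tag, ind1, ind2, subfield = tokens
        specs.append(FieldSpec(
            tag,
            " " if ind1 == "_" else ind1,   # assumed: '_' means blank
            " " if ind2 == "_" else ind2,
            subfield.replace("$$", "", 1),  # '$$d' becomes 'd'
        ))
    return specs


print(parse_pattern("AVA _ _ $$d ; AVA _ _ $$i ; AVA _ _ $$j"))
print(parse_pattern("009"))
```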
 %% Cell type:markdown id: tags:
 ## Prepare Postprocessing
 %% Cell type:markdown id: tags:
 * Remove semicolons from the Signatur column
 * Extract the barcode from ABO links if available
 %% Cell type:code id: tags:
 ``` python
 def post(df):
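    # Work on a copy so the caller's dataframe is not modified
    df_out = df.copy()
    df_out['Signatur'] = df_out['Signatur'].str.replace(';', '', regex=False)
    # expand=False returns a Series rather than a one-column dataframe
    df_out['Barcode'] = '+Z' + df_out['Volltext'].str.extract(r'http://data\.onb\.ac\.at/ABO/%2BZ(.*)', expand=False)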
    return df_out
 ```
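 The effect of the two postprocessing steps can be tried on a small dataframe. The rows below are made up; the second one has no ABO link, so its barcode stays empty. This sketch passes `expand=False` so that `str.extract` returns a Series.

```python
import pandas as pd

# Made-up rows; the second one has no ABO link.
df = pd.DataFrame({
    "Signatur": ["12.A.3;", "45.B.6"],
    "Volltext": ["http://data.onb.ac.at/ABO/%2BZ123456789", None],
})

df["Signatur"] = df["Signatur"].str.replace(";", "", regex=False)
# The '+Z' prefix is added element-wise; rows without a match stay missing.
df["Barcode"] = "+Z" + df["Volltext"].str.extract(
    r"http://data\.onb\.ac\.at/ABO/%2BZ(.*)", expand=False
)
print(df)
```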
 %% Cell type:markdown id: tags:
 ## Prepare All Functions Necessary for Extraction
 %% Cell type:markdown id: tags:
 With a given list of AC-numbers, create an Excel-file of the bibliographic data for the records.
 %% Cell type:code id: tags:
 ``` python
 def ac_list_to_excel(ac_list, excel_file_name_stem):
    data = [get_bibliographic_for_ac(ac) for ac in ac_list]
    df = pd.DataFrame(data)
    df_post = post(df)
    df_post.to_excel(f'Output/{excel_file_name_stem} {now()}.xlsx')
 ```
 %% Cell type:markdown id: tags:
 Create a timestamp of the current date and time for use in the output file name.
 %% Cell type:code id: tags:
 ``` python
 def now():
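    # A separate name avoids shadowing the function now() itself
    timestamp = datetime.datetime.now().replace(microsecond=0)
    # Colons are removed so the timestamp is safe to use in file names
    return timestamp.isoformat().replace(':', '')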
 ```
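 The resulting format is easiest to see with a fixed moment in time. The colons are presumably removed because they are not allowed in file names on Windows. The helper name `timestamp_for_filename` is hypothetical and only used in this sketch.

```python
import datetime


def timestamp_for_filename(moment):
    """Return an ISO 8601 timestamp without microseconds and colons."""
    return moment.replace(microsecond=0).isoformat().replace(":", "")


# A fixed moment makes the result reproducible.
fixed = datetime.datetime(2020, 3, 14, 9, 26, 53, 589793)
print(timestamp_for_filename(fixed))  # 2020-03-14T092653
```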
 %% Cell type:markdown id: tags:
 Extract AC-numbers from a given Excel-file.
 %% Cell type:code id: tags:
 ``` python
 def load_ac_list(file_name):
    return pd.read_excel(file_name)['Datensatznummer'].apply(lambda s: ac_pattern.findall(s)[0])
 ```
 %% Cell type:markdown id: tags:
 ### Fetching XML and Extracting Data
 %% Cell type:markdown id: tags:
 Fetch the MARC-XML of the requested records (potential children) and, if present, of their parent records. Parent records may contain information that is valid for all their children and should be added or appended accordingly. The function *add_inheritance_to_columns()* adds the parent's information to the child's row before it becomes part of the pandas dataframe.
 %% Cell type:code id: tags:
 ``` python
 def get_bibliographic_for_ac(ac):
    # Fallback row: all columns empty except the requested AC number
    empty_row = OrderedDict((column, None) for column in column_extractors)
    empty_row["Systemnummer"] = ac
    inherited = None
    try:
        marc_xml = alma.by_marc_009(ac)
        parent_acnum = find_parent_id_in_child_xml(marc_xml)
        if parent_acnum:
            parent_xml = fetch_parent_xml(parent_acnum)
            if parent_xml is not None:
                inherited = inherit_from_parent(parent_xml)
    except almasru.NoRecord:
        print(f'No record for AC number "{ac}" found.', file=sys.stderr)
        return empty_row
    except Exception as e:
        print(f'Exception encountered: {str(e)}', file=sys.stderr)
        # Return the fallback row so that pd.DataFrame() never receives None
        return empty_row
    d = OrderedDict()
    for column, extractor in column_extractors.items():
        d[column] = extractor.parse(marc_xml)
    if inherited is not None:
        add_inheritance_to_columns(d, *inherited)
    return d
 ```
 %% Cell type:markdown id: tags:
 ### Fetch Data of Parent Record
 Parents can be referenced by unique ID either in MARC 773 \$\$w or 830 \$\$w. In our case references can only be resolved if they are AC-numbers.
 %% Cell type:code id: tags:
 ``` python
 def find_parent_id_in_child_xml(marc_xml):
    for datafield in marc_xml:
        if datafield.attrib.items() >= {"tag": "773"}.items() or \
                datafield.attrib.items() >= {"tag": "830"}.items():
            for subfield in datafield:
                if subfield.attrib.items() >= {"code": "w"}.items():
                    try:
                        # Return inside the try block so a failed match
                        # never reaches an unbound parent_acnum
                        return ac_pattern.findall(subfield.text)[0]
                    except (IndexError, TypeError) as e:
                        print(f"ERROR: Couldn't find AC-Num in 773 or 830 of the child. {e}", file=sys.stderr)
    return None
 ```
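 The comparison `attrib.items() >= {...}.items()` used above relies on dictionary item views behaving like sets, so `>=` tests whether the element carries at least the given attributes. The sketch below demonstrates this on a made-up record without XML namespaces and then collects the parent ID. It assumes the record is an ElementTree-style element whose children are the MARC fields; the `(AT-OBV)` prefix in the sample is likewise an assumption.

```python
import re
import xml.etree.ElementTree as ET

ac_pattern = re.compile(r'(AC\d{8})')

# Made-up record without XML namespaces, for illustration only.
record = ET.fromstring("""
<record>
  <controlfield tag="009">AC00000002</controlfield>
  <datafield tag="245" ind1="0" ind2="0">
    <subfield code="a">Example title</subfield>
  </datafield>
  <datafield tag="773" ind1="0" ind2="8">
    <subfield code="w">(AT-OBV)AC00000001</subfield>
  </datafield>
</record>
""")

datafield = record[2]
# Dictionary item views behave like sets, so >= tests for a superset.
print(datafield.attrib.items() >= {"tag": "773"}.items())               # True
print(datafield.attrib.items() >= {"tag": "773", "ind1": "1"}.items())  # False

parent_ids = [
    ac_pattern.findall(subfield.text or "")[0]
    for field in record
    if field.get("tag") in ("773", "830")
    for subfield in field
    if subfield.get("code") == "w" and ac_pattern.search(subfield.text or "")
]
print(parent_ids)  # ['AC00000001']
```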
 %% Cell type:markdown id: tags:
 Now that we have the parent's ID we can fetch its MARC-record.
 %% Cell type:code id: tags:
 ``` python
 def fetch_parent_xml(parent_acnum):
    try:
        return alma.by_marc_009(parent_acnum)
    except Exception as e:
        print(f"ERROR: Fetching XML of parent {parent_acnum} caused an error. {e}", file=sys.stderr)
        # Signal the failure explicitly instead of returning an unbound name
        return None
 ```
 %% Cell type:markdown id: tags:
 Collect the inherited information (uniform title, subjects, genre) from the parent record. Values that are not present are returned as an empty string or an empty list.
 %% Cell type:code id: tags:
 ``` python
 def inherit_from_parent(parent_xml):
    parent_title = ""
    parent_categories = []
    parent_contents = []
    for p_datafield in parent_xml:
        if p_datafield.attrib.items() >= {"tag": "240"}.items() or \
                p_datafield.attrib.items() >= {"tag": "130"}.items():
            for p_subfield in p_datafield:
                if p_subfield.attrib.items() >= {"code": "a"}.items() or \
                        p_subfield.attrib.items() >= {"code": "t"}.items():
                    parent_title = p_subfield.text
        elif p_datafield.attrib.items() >= {"tag": "689"}.items():
            for p_subfield in p_datafield:
                if p_subfield.attrib.items() >= {"code": "a"}.items():
                    parent_category = p_subfield.text
                    parent_categories.append(parent_category)
        elif p_datafield.attrib.items() >= {"tag": "655", "ind1": " ", "ind2": "7"}.items():
            for p_subfield in p_datafield:
                # Indicators are attributes of the datafield; a subfield only carries a code
                if p_subfield.attrib.items() >= {"code": "a"}.items():
                    parent_content = p_subfield.text
                    parent_contents.append(parent_content)
    return parent_title, parent_categories, parent_contents
 ```
 %% Cell type:markdown id: tags:
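 Add the inherited information to the child's columns if available. The parent's title is only used when the child has none; subjects and genre terms of the parent are placed before those of the child.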
 %% Cell type:code id: tags:
 ``` python
 def add_inheritance_to_columns(d, parent_title, parent_categories, parent_contents):
    if parent_title != "" and not d["Werktitel"]:
        d["Werktitel"] = parent_title
    # Independent checks: title, subjects and genre can all be inherited at once
    if parent_categories:
        child_categories_string = d["Schlagworte"]
        # A missing or empty child value contributes no entries
        child_categories = child_categories_string.split(';') if child_categories_string else []
        all_categories = parent_categories + child_categories
        d["Schlagworte"] = ';'.join(all_categories)
    if parent_contents:
        child_contents_string = d["Art des Inhalts"]
        child_contents = child_contents_string.split(';') if child_contents_string else []
        all_contents = parent_contents + child_contents
        d["Art des Inhalts"] = ';'.join(all_contents)
 ```
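 The merging of parent and child values can be isolated in a small helper, sketched here under the assumption that a column value is either a semicolon-separated string, an empty string or `None`. The name `merge_values` and the sample terms are made up. The explicit check matters because `''.split(';')` returns `['']`, which would otherwise leave a trailing separator.

```python
def merge_values(parent_values, child_string, separator=";"):
    """Prepend parent values to a separator-joined child string.

    Hypothetical helper: a missing or empty child value counts as no entries.
    """
    child_values = child_string.split(separator) if child_string else []
    return separator.join(parent_values + child_values)


print(merge_values(["Reisebericht"], "Japan;Geschichte"))  # Reisebericht;Japan;Geschichte
print(merge_values(["Reisebericht"], None))                # Reisebericht
```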
 %% Cell type:markdown id: tags:
 ## Do the Actual Extract
 %% Cell type:markdown id: tags:
 Insert your input and output file names here. You could provide more than one pair of input and output files.
 %% Cell type:code id: tags:
 ``` python
 ac_list = load_ac_list('Input/TravelogueD17_Japan.xlsx')
 file_name_stem = 'Test_'
 ```
 %% Cell type:code id: tags:
 ``` python
 ac_list_to_excel(ac_list, file_name_stem)
 ```