# Alma Data Extractor

Takes a list of AC numbers, sends requests to SRU/Alma/ÖNB, extracts data from the returned MARC-XML, arranges it in tabular form, and writes it to an Excel file.

## Terminal Script

You can run the software from the terminal via `./script.py`. `cd` into this directory and run `python script.py --help` to list the options.

## Concept

The program is mostly written async, so that it can process data while waiting for responses to its requests.

The main controller is `travelogues_extraction.controller.main.FromAlmaOutputToExcel`.

1. It takes a path to an Excel file and generates AC numbers from a column in that file.
2. It makes requests to SRU/ÖNB, parses each response with lxml, and takes the first `marc:record`.
3. It runs a list of "Extractors" over each record; these populate the output dataframe.

### Change or Add to the software

You will need a solid knowledge of Python and XPath.

#### 1. AC-Number-Generator

If you would like to change the way the AC numbers are generated, see `travelogues_extraction/getrecords/acnumber_extractor.py`.

#### 2. Requests

`travelogues_extraction.getrecords.session.RecordRetriever`

Here you can change the URL, the requests, and the XPath for the records.

#### 3. Extractors

All extractors inherit from `travelogues_extraction.dataextractors.abstract.AbstractDataExtractor`.

There are several more specialised abstract extractors:

- `AbstractSingleDataExtractor`
- `AbstractXpathDirectlyToColumn`
- `AbstractXpathJoinDirectlyToColumn`
- `AbstractMultifield`
- `AbstractParentAsSecondCast`

All extractors in `travelogues_extraction/dataextractors/dataextractors` inherit from them.

Many of the classes only have properties and no methods, so it would be possible to pass these properties to their parent classes in the constructor instead. However, many of them do have custom methods. The controller only needs to rely on the interface of the top-level `AbstractDataExtractor`.

##### 3.1. Order, Include / Exclude Extractors

The columns in the dataframe are generated in the order of the classes in `travelogues_extraction/controller/main.py:33`. Each class contributes the column names returned by `travelogues_extraction.dataextractors.abstract.AbstractDataExtractor.get_columns_names_I_work_on`, in their order. You can reorder, include, or exclude extractors there.

##### 3.2. Write a data extractor

Inherit from `travelogues_extraction.dataextractors.abstract.AbstractDataExtractor` and implement a method `write` that takes a `travelogues_extraction.getrecords.session.RecordRetriever.Record` as input and writes to `self.target_dataframe.at[record.ac_number, self.column] = 'your data'`. You will find the lxml-parsed XML representation in the `lxmlelement` property of the record.

There are a few parent classes you can use:

###### `AbstractXpathDirectlyToColumn`

Child classes define the column and an XPath object, and this class writes the first text it finds to the target dataframe.

Example: `travelogues_extraction.dataextractors.dataextractors.index.MMSID`

###### `AbstractXpathJoinDirectlyToColumn`

Child classes define the column and an XPath object, and this class joins the text results using the property `join_string`.

###### `AbstractMultifield`

Looks up data in the record with `primary_xml_path: lxmletree.XPath`, then uses each of the `xpath_isgnd_tuples` to generate data. An `XpathIsGnd` consists of an XPath object and a flag that tells whether the datum contains GND data. If so, the output is rendered as URIs. The output is rendered as a string, joined with one of the `join_string_*_level` properties.

Example: `travelogues_extraction.dataextractors.dataextractors.combinedsubfields.VerfasserGND`

###### `AbstractParentAsSecondCast`

This class looks in other records if it does not find the data in the current record. It uses `parent_ac_xpath` to look up the other record, which is called "parent" because in all our cases it was a parent record. Implement your data extraction in `_write()`. `write()` will take care of the rest and use your `_write()`.

Example: `travelogues_extraction.dataextractors.dataextractors.übergeordnet.Schlagworte`
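
To make the difference between the "first match" and "join" parent classes more concrete, here is a minimal, self-contained sketch. It does not use the project's classes, lxml, or pandas: it relies only on the standard library's `xml.etree.ElementTree`, and a plain dict stands in for the target dataframe. The sample record, the AC number, the column names, and the helper functions `first_text` and `joined_text` are all hypothetical and only serve as illustration.

```python
import xml.etree.ElementTree as ET

# Official MARC21 "slim" namespace, bound to the prefix used in the paths below.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical, heavily shortened record (not real data from Alma).
SAMPLE_RECORD = """\
<marc:record xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:controlfield tag="001">example-id-1</marc:controlfield>
  <marc:datafield tag="650">
    <marc:subfield code="a">Reisebericht</marc:subfield>
  </marc:datafield>
  <marc:datafield tag="650">
    <marc:subfield code="a">Orient</marc:subfield>
  </marc:datafield>
</marc:record>
"""


def first_text(record, path):
    """Return the text of the first match, or None if nothing matches."""
    element = record.find(path, MARC_NS)
    return element.text if element is not None else None


def joined_text(record, path, join_string=" ; "):
    """Join the text of all matches with join_string."""
    matches = record.findall(path, MARC_NS)
    return join_string.join(m.text for m in matches if m.text)


record = ET.fromstring(SAMPLE_RECORD)

# Stand-in for the target dataframe: {ac_number: {column: value}}.
table = {}
ac_number = "AC00000001"  # hypothetical AC number
row = table.setdefault(ac_number, {})

# "First match" behaviour: one value per column.
row["ID"] = first_text(record, "marc:controlfield[@tag='001']")

# "Join" behaviour: all matches combined into one string.
row["Schlagworte"] = joined_text(
    record, "marc:datafield[@tag='650']/marc:subfield[@code='a']"
)

print(table)
```

In the actual extractors, the lookups are defined as XPath objects on the child class and the result is written to `self.target_dataframe.at[record.ac_number, self.column]`; the dict above only mimics that row-and-column addressing.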