philip.roeggla committed
# Alma Data Extractor

Takes a list of AC numbers, sends requests to SRU/Alma/ÖNB, extracts data from the returned MARC XML, arranges it in tabular form, and writes it to an Excel file.

## Terminal Script

You can run the software from the terminal via `./script.py`. `cd` into this directory and run `python script.py --help` to list the options.

## Concept

The program is mostly written asynchronously, so it can process data while waiting for responses to outstanding requests.

The main controller is `travelogues_extraction.controller.main.FromAlmaOutputToExcel`. 

1. It takes a path to an Excel file and generates AC numbers from a column in that file.
2. It makes requests to SRU/ÖNB, parses each response with lxml, and takes the first `marc:record`.
3. It runs a list of "extractors" over each record; these populate the output dataframe.
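
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the HTTP request is replaced by a canned response, the standard library's `ElementTree` stands in for lxml, and a plain dict stands in for the dataframe. The names `fetch`, `process`, `extract_control_number`, and `SAMPLE_RESPONSE` are hypothetical.

```python
import asyncio
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical canned response; real code would query the SRU endpoint.
SAMPLE_RESPONSE = """<searchRetrieveResponse xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:record><marc:controlfield tag="001">990001</marc:controlfield></marc:record>
  <marc:record><marc:controlfield tag="001">990002</marc:controlfield></marc:record>
</searchRetrieveResponse>"""


async def fetch(ac_number: str) -> str:
    """Stand-in for the HTTP request; yields control like real network I/O."""
    await asyncio.sleep(0.01)
    return SAMPLE_RESPONSE


def extract_control_number(record: ET.Element) -> str:
    """A toy 'extractor': read controlfield 001 from a marc:record."""
    return record.findtext("marc:controlfield[@tag='001']", namespaces=MARC_NS)


async def process(ac_number: str, table: dict) -> None:
    xml_text = await fetch(ac_number)
    root = ET.fromstring(xml_text)
    first_record = root.find(".//marc:record", MARC_NS)  # only the first record is used
    table[ac_number] = {"control_number": extract_control_number(first_record)}


async def main(ac_numbers: list) -> dict:
    table = {}
    # While one request is waiting, the others can be fetched and processed.
    await asyncio.gather(*(process(ac, table) for ac in ac_numbers))
    return table


result = asyncio.run(main(["AC00000001", "AC00000002"]))
```

Because each `process` call awaits its own request, parsing and extraction for one record can run while the other requests are still pending.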


### Change or extend the software

You will need solid knowledge of Python and XPath.

#### 1. AC-Number-Generator

To change how the AC numbers are generated, see `travelogues_extraction/getrecords/acnumber_extractor.py`.

#### 2. Requests

`travelogues_extraction.getrecords.session.RecordRetriever`

Here you can change the URL, the requests, and the XPath used to select the records.

#### 3. Extractors


All Extractors inherit from `travelogues_extraction.dataextractors.abstract.AbstractDataExtractor`.

There are several more specialized abstract extractors:

- `AbstractSingleDataExtractor`
- `AbstractXpathDirectlyToColumn`
- `AbstractXpathJoinDirectlyToColumn`
- `AbstractMultifield`
- `AbstractParentAsSecondCast`

All extractors in `travelogues_extraction/dataextractors/dataextractors` inherit from one of them.

Many of these classes only define properties and no methods, so it would be possible to pass those properties to
their parent classes in the constructor instead. However, many others do have custom methods. The controller only needs to rely on
the interface of the top-level `AbstractDataExtractor`.


##### 3.1. Order, Include / Exclude Extractors

The columns in the dataframe are generated in the order of the classes listed in `travelogues_extraction/controller/main.py:33`.
Each class contributes the column names given by `travelogues_extraction.dataextractors.abstract.AbstractDataExtractor.get_columns_names_I_work_on`, in that order.

You can reorder, include, or exclude extractors there.
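
As a rough illustration of how the class order determines the column order, consider the sketch below. It assumes `get_columns_names_I_work_on` can be called on the class and returns a list of names; the real signature may differ. `Title`, `Year`, `EXTRACTOR_CLASSES`, and `build_column_order` are hypothetical names, and the base class is a stand-in for the project's one.

```python
class AbstractDataExtractor:
    """Hypothetical stand-in for the project's abstract base class."""
    columns: tuple = ()

    @classmethod
    def get_columns_names_I_work_on(cls) -> list:
        return list(cls.columns)


class MMSID(AbstractDataExtractor):
    columns = ("MMS-ID",)


class Title(AbstractDataExtractor):
    columns = ("Title", "Subtitle")


class Year(AbstractDataExtractor):
    columns = ("Year",)


# Reorder, add, or remove classes here to change the output columns.
EXTRACTOR_CLASSES = [MMSID, Title, Year]


def build_column_order(extractor_classes) -> list:
    """Concatenate each extractor's column names in list order."""
    columns = []
    for cls in extractor_classes:
        columns.extend(cls.get_columns_names_I_work_on())
    return columns


column_order = build_column_order(EXTRACTOR_CLASSES)
```

With this layout, dropping `Title` from the list removes both of its columns, and moving `Year` to the front moves its column to the front.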

##### 3.2. Write a data extractor

Inherit from `travelogues_extraction.dataextractors.abstract.AbstractDataExtractor` and implement a method `write` that
takes a `travelogues_extraction.getrecords.session.RecordRetriever.Record` as input and writes its result with
`self.target_dataframe.at[record.ac_number, self.column] = 'your data'`. You will find the lxml-parsed XML representation
in the `lxmlelement` property of the record.
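
A minimal sketch of such a `write` method is shown below. To keep it runnable without dependencies, `TinyFrame` stands in for the pandas dataframe (it supports only `.at[row, column] = value`), `Record` stands in for `RecordRetriever.Record`, and `ElementTree` stands in for lxml. `TitleExtractor` and its constructor signature are hypothetical; in the project the class would inherit from `AbstractDataExtractor`.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


class _At:
    """Mimics the `.at[row, column] = value` accessor of a dataframe."""
    def __init__(self, cells: dict):
        self._cells = cells

    def __setitem__(self, key, value):
        row, column = key
        self._cells[(row, column)] = value


class TinyFrame:
    """Stand-in for pandas.DataFrame (assumption, for illustration only)."""
    def __init__(self):
        self.cells = {}
        self.at = _At(self.cells)


@dataclass
class Record:
    """Stand-in for RecordRetriever.Record."""
    ac_number: str
    lxmlelement: ET.Element


class TitleExtractor:
    """Hypothetical extractor that writes MARC 245 $a to the 'Title' column."""
    column = "Title"

    def __init__(self, target_dataframe):
        self.target_dataframe = target_dataframe

    def write(self, record: Record) -> None:
        title = record.lxmlelement.findtext(
            "marc:datafield[@tag='245']/marc:subfield[@code='a']",
            default="",
            namespaces=MARC_NS,
        )
        self.target_dataframe.at[record.ac_number, self.column] = title


RECORD_XML = """<marc:record xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:datafield tag="245" ind1="1" ind2="0">
    <marc:subfield code="a">Reise nach Wien</marc:subfield>
  </marc:datafield>
</marc:record>"""

frame = TinyFrame()
TitleExtractor(frame).write(Record("AC00000001", ET.fromstring(RECORD_XML)))
```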
 
 There are a few parent classes you can use:
 
 
###### `AbstractXpathDirectlyToColumn`

Child classes define the column and an XPath object; this class writes the first text match to the target dataframe.
 Example: `travelogues_extraction.dataextractors.dataextractors.index.MMSID`
 
###### `AbstractXpathJoinDirectlyToColumn`

Child classes define the column and an XPath object; this class joins all text results using the `join_string` property.

###### `AbstractMultifield`

Looks up data in the record with `primary_xml_path: lxmletree.XPath`, then applies each of the `xpath_isgnd_tuples`
to generate the data. Each `XpathIsGnd` consists of an XPath object and a flag that tells whether the datum contains GND data.
If so, the output is rendered as URIs. The result is rendered as a string joined with one of the `join_string_*_level`
properties.

Example: `travelogues_extraction.dataextractors.dataextractors.combinedsubfields.VerfasserGND`
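
The idea can be sketched as below. This is not the project's implementation: the field names of `XpathIsGnd`, the `render` function, and the two join parameters (which play the role of the `join_string_*_level` properties) are hypothetical. The sketch also assumes that GND identifiers appear as `(DE-588)<id>` and are turned into `https://d-nb.info/gnd/<id>` URIs. The sample names and identifier are made up, and `ElementTree` paths stand in for compiled lxml XPath objects.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}
GND_URI_PREFIX = "https://d-nb.info/gnd/"  # assumed URI form


class XpathIsGnd(NamedTuple):
    path: str      # path relative to the primary element
    is_gnd: bool   # True if the matched text is a GND identifier


def render(record, primary_path, tuples, inner_join=" ; ", outer_join=" | ") -> str:
    rendered_fields = []
    for field in record.findall(primary_path, MARC_NS):
        parts = []
        for path, is_gnd in tuples:
            for node in field.findall(path, MARC_NS):
                text = (node.text or "").strip()
                if not text:
                    continue
                if is_gnd:
                    # Assumption: strip the "(DE-588)" prefix and build a URI.
                    text = GND_URI_PREFIX + text.replace("(DE-588)", "")
                parts.append(text)
        if parts:
            rendered_fields.append(inner_join.join(parts))
    return outer_join.join(rendered_fields)


# Made-up sample data for illustration.
RECORD_XML = """<marc:record xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:datafield tag="700" ind1="1" ind2=" ">
    <marc:subfield code="a">Example, Anna</marc:subfield>
    <marc:subfield code="0">(DE-588)123456789</marc:subfield>
  </marc:datafield>
  <marc:datafield tag="700" ind1="1" ind2=" ">
    <marc:subfield code="a">Sample, Bert</marc:subfield>
  </marc:datafield>
</marc:record>"""

output = render(
    ET.fromstring(RECORD_XML),
    "marc:datafield[@tag='700']",
    [XpathIsGnd("marc:subfield[@code='a']", False),
     XpathIsGnd("marc:subfield[@code='0']", True)],
)
```

Here values within one field are joined with the inner separator and the fields themselves with the outer one, which mirrors the two-level joining described above.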

###### `AbstractParentAsSecondCast`

This class looks in other records if it does not find the data in the current record. It uses `parent_ac_xpath`
to look for the parent, so called because in all our cases it was a parent record. Implement your data
extraction in `_write()`; `write()` takes care of the rest and calls your `_write()`.

Example: `travelogues_extraction.dataextractors.dataextractors.übergeordnet.Schlagworte`
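
The split between `write()` and `_write()` might look roughly like the sketch below. Several things here are assumptions made for illustration: that `_write()` signals success with a boolean, that the parent link sits in MARC 773 $w, and that parent records are fetched through a `lookup_record` callable. `ParentFallbackExtractor` and `Keywords` are hypothetical classes, a dict stands in for the dataframe, and `ElementTree` stands in for lxml.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


class ParentFallbackExtractor:
    """Hypothetical base class: try the record first, then its parent."""
    # Illustrative choice of field for the parent link.
    parent_ac_path = "marc:datafield[@tag='773']/marc:subfield[@code='w']"

    def __init__(self, table: dict, lookup_record):
        self.table = table
        self.lookup_record = lookup_record  # callable: AC number -> element or None

    def _write(self, ac_number: str, element) -> bool:
        """Subclasses extract here and return True if data was written."""
        raise NotImplementedError

    def write(self, ac_number: str, element) -> None:
        if self._write(ac_number, element):
            return
        parent_ac = element.findtext(self.parent_ac_path, namespaces=MARC_NS)
        if parent_ac:
            parent = self.lookup_record(parent_ac)
            if parent is not None:
                # Data from the parent is stored under the child's AC number.
                self._write(ac_number, parent)


class Keywords(ParentFallbackExtractor):
    def _write(self, ac_number, element) -> bool:
        nodes = element.findall(
            "marc:datafield[@tag='650']/marc:subfield[@code='a']", MARC_NS)
        keywords = [n.text for n in nodes if n.text]
        if not keywords:
            return False
        self.table[ac_number] = "; ".join(keywords)
        return True


CHILD = ET.fromstring(
    '<marc:record xmlns:marc="http://www.loc.gov/MARC21/slim">'
    '<marc:datafield tag="773" ind1="0" ind2="8">'
    '<marc:subfield code="w">AC00000002</marc:subfield></marc:datafield>'
    '</marc:record>')
PARENT = ET.fromstring(
    '<marc:record xmlns:marc="http://www.loc.gov/MARC21/slim">'
    '<marc:datafield tag="650" ind1=" " ind2="7">'
    '<marc:subfield code="a">Reisebericht</marc:subfield></marc:datafield>'
    '</marc:record>')
RECORDS = {"AC00000001": CHILD, "AC00000002": PARENT}

table = {}
Keywords(table, RECORDS.get).write("AC00000001", CHILD)
```

The child record has no keyword field of its own, so the value written for it comes from the parent record.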