# Alma Data Extractor
Takes a list of AC numbers, sends requests to the SRU interface of Alma at the ÖNB, extracts data from the returned MARCXML, arranges it in tabular form and writes it to an Excel file.
## Terminal Script
You can run the software from the terminal via `./script.py`. `cd` into this directory and run `python script.py --help` to list the options.
The program is mostly written asynchronously, so it can process data while waiting for pending requests.
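
To illustrate the asynchronous idea, here is a minimal sketch using only `asyncio`. It is not the project's code: `fetch_record` and `process` are hypothetical stand-ins, and the network request is simulated with a short sleep.

```python
import asyncio


async def fetch_record(ac_number):
    # Hypothetical stand-in for one SRU request: sleeps instead of doing network I/O.
    await asyncio.sleep(0.01)
    return f"<record>{ac_number}</record>"


def process(xml_text):
    # Placeholder for the extraction work done on each response.
    return xml_text[len("<record>"):-len("</record>")]


async def run(ac_numbers):
    # Start every request up front, then handle each response as soon as it
    # arrives, so processing overlaps with the requests that are still pending.
    tasks = [asyncio.create_task(fetch_record(ac)) for ac in ac_numbers]
    results = []
    for finished in asyncio.as_completed(tasks):
        results.append(process(await finished))
    return results


results = asyncio.run(run(["AC00000001", "AC00000002"]))
print(sorted(results))
```

Responses may arrive in any order, which is why the sketch sorts the results before printing them.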
The main controller is `travelogues_extraction.controller.main.FromAlmaOutputToExcel`.
1. It takes a path to an Excel file and generates AC numbers from a column in that file.
2. It makes requests to the SRU interface of the ÖNB, parses each response with lxml and takes the first `marc:record`.
3. It runs a list of "extractors" over each record; these populate the output dataframe.
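
The three steps above can be sketched roughly as follows. This is a simplified illustration, not the controller itself: it uses the standard library's `xml.etree.ElementTree` in place of lxml and a plain dict in place of the dataframe so that it runs without extra dependencies. The sample response, the AC number and the helper names are invented for the example.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Invented sample response; a real SRU response carries more elements.
SAMPLE_RESPONSE = """
<searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/">
  <records><record><recordData>
    <marc:record xmlns:marc="http://www.loc.gov/MARC21/slim">
      <marc:controlfield tag="001">000000000000000001</marc:controlfield>
      <marc:datafield tag="245" ind1="1" ind2="0">
        <marc:subfield code="a">Reise nach Wien</marc:subfield>
      </marc:datafield>
    </marc:record>
  </recordData></record></records>
</searchRetrieveResponse>
"""


def first_marc_record(response_text):
    # Step 2: parse the response and take the first marc:record.
    return ET.fromstring(response_text).find(".//marc:record", MARC_NS)


def extract_mms_id(record):
    hit = record.find("marc:controlfield[@tag='001']", MARC_NS)
    return "MMS ID", hit.text if hit is not None else None


def extract_title(record):
    hit = record.find("marc:datafield[@tag='245']/marc:subfield[@code='a']", MARC_NS)
    return "Title", hit.text if hit is not None else None


def build_table(responses, extractors):
    # Step 3: run every extractor over every record; one row per AC number.
    table = {}
    for ac_number, response_text in responses.items():
        record = first_marc_record(response_text)
        table[ac_number] = dict(extractor(record) for extractor in extractors)
    return table


table = build_table({"AC00000001": SAMPLE_RESPONSE}, [extract_mms_id, extract_title])
print(table)
```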
### Change or Add to the software
You will need solid knowledge of Python and XPath.
#### 1. AC-Number-Generator
If you would like to change the way the AC numbers are generated, see `travelogues_extraction/getrecords/acnumber_extractor.py`.
#### 2. Requests
`travelogues_extraction.getrecords.session.RecordRetriever`
There you can change the URL, the requests and the XPath used to select the records.
#### 3. Extractors
All Extractors inherit from `travelogues_extraction.dataextractors.abstract.AbstractDataExtractor`.
There are several more specialized abstract extractors:
- `AbstractSingleDataExtractor`
- `AbstractXpathDirectlyToColumn`
- `AbstractXpathJoinDirectlyToColumn`
- `AbstractMultifield`
- `AbstractParentAsSecondCast`

All extractors in `travelogues_extraction/dataextractors/dataextractors` inherit from them.
Many of these classes define only properties and no methods, so it would be possible to pass these properties
to their parent classes in the constructor instead. However, many of them do have custom methods. The controller only needs to rely on
the interface of the top-level `AbstractDataExtractor`.
##### 3.1. Order, Include / Exclude Extractors
The columns in the dataframe are generated in the order of the classes listed in `travelogues_extraction/controller/main.py:33`.
Each class contributes the column names returned by `travelogues_extraction.dataextractors.abstract.AbstractDataExtractor.get_columns_names_I_work_on`, in that order.
You can reorder, include or exclude extractors there.
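
As a rough illustration of how the class order determines the column order, consider the sketch below. The classes are hypothetical stand-ins, not the project's extractors, and treating `get_columns_names_I_work_on` as a classmethod that returns a list is an assumption made for the example.

```python
class SketchExtractor:
    # Hypothetical base class; only mimics the column-name part of the interface.
    columns = ()

    @classmethod
    def get_columns_names_I_work_on(cls):
        return list(cls.columns)


class MmsIdSketch(SketchExtractor):
    columns = ("MMS ID",)


class AuthorSketch(SketchExtractor):
    columns = ("Author", "Author GND")


class TitleSketch(SketchExtractor):
    columns = ("Title",)


def column_order(extractor_classes):
    # Columns appear class by class, in the order each class reports them.
    names = []
    for cls in extractor_classes:
        names.extend(cls.get_columns_names_I_work_on())
    return names


extractor_classes = [MmsIdSketch, AuthorSketch, TitleSketch]
print(column_order(extractor_classes))
```

Reordering the list, or leaving a class out, changes the resulting columns accordingly.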
##### 3.2. Write a data extractor
Inherit from `travelogues_extraction.dataextractors.abstract.AbstractDataExtractor` and implement a method `write` that
takes a `travelogues_extraction.getrecords.session.RecordRetriever.Record` as input and writes to the target dataframe, e.g.
`self.target_dataframe.at[record.ac_number, self.column] = 'your data'`. You will find the lxml-parsed XML representation
in the `lxmlelement` property of the record.
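
A custom extractor might look roughly like the sketch below. To stay runnable without the project, pandas or lxml, it does not inherit from the real base class: `TinyTable` mimics only `DataFrame.at[row, column] = value`, `SketchRecord` stands in for the record type, and the element is parsed with the standard library's `ElementTree`. The `LanguageSketch` class and the sample record are invented.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


class _At:
    # Mimics only the `.at[row, column]` get/set access of a dataframe.
    def __init__(self):
        self.cells = {}

    def __setitem__(self, key, value):
        self.cells[key] = value

    def __getitem__(self, key):
        return self.cells[key]


class TinyTable:
    def __init__(self):
        self.at = _At()


@dataclass
class SketchRecord:
    ac_number: str
    lxmlelement: ET.Element  # an lxml element in the real project


class LanguageSketch:
    # Hypothetical extractor: writes MARC 041 $a (language code) to one column.
    column = "Language"

    def __init__(self, target_dataframe):
        self.target_dataframe = target_dataframe

    def write(self, record):
        hit = record.lxmlelement.find(
            "marc:datafield[@tag='041']/marc:subfield[@code='a']", MARC_NS
        )
        value = hit.text if hit is not None else None
        self.target_dataframe.at[record.ac_number, self.column] = value


RECORD_XML = """
<marc:record xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:datafield tag="041" ind1=" " ind2=" ">
    <marc:subfield code="a">ger</marc:subfield>
  </marc:datafield>
</marc:record>
"""

table = TinyTable()
record = SketchRecord("AC00000001", ET.fromstring(RECORD_XML))
LanguageSketch(table).write(record)
print(table.at["AC00000001", "Language"])
```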
There are a few parent classes you can use:
###### `AbstractXpathDirectlyToColumn`
Child classes define the column and an XPath object; this class writes the first text found to the target dataframe.
Example: `travelogues_extraction.dataextractors.dataextractors.index.MMSID`
###### `AbstractXpathJoinDirectlyToColumn`
Child classes define the column and an XPath object; this class joins all text results using the `join_string` property.
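
The difference between taking the first hit and joining all hits can be sketched as follows. The helper functions and the sample record are invented, and `ElementTree` from the standard library is used in place of lxml's compiled XPath objects.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Invented sample record with two subject fields.
RECORD = ET.fromstring("""
<marc:record xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:datafield tag="650" ind1=" " ind2="7">
    <marc:subfield code="a">Reisebericht</marc:subfield>
  </marc:datafield>
  <marc:datafield tag="650" ind1=" " ind2="7">
    <marc:subfield code="a">Orient</marc:subfield>
  </marc:datafield>
</marc:record>
""")

SUBJECT_PATH = "marc:datafield[@tag='650']/marc:subfield[@code='a']"


def first_text(record, path):
    # "Directly to column": only the first match is used.
    hit = record.find(path, MARC_NS)
    return hit.text if hit is not None else None


def joined_text(record, path, join_string=" ; "):
    # "Join directly to column": all matches, joined with join_string.
    hits = record.findall(path, MARC_NS)
    return join_string.join(hit.text for hit in hits if hit.text)


print(first_text(RECORD, SUBJECT_PATH))
print(joined_text(RECORD, SUBJECT_PATH))
```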
###### `AbstractMultifield`
Looks up data in the record with `primary_xml_path: lxmletree.XPath`, then uses each of the `xpath_isgnd_tuples`
to generate data. An `XpathIsGnd` consists of an XPath object and a flag that tells whether the datum contains GND data.
If so, the output is rendered as URIs. The output is rendered as a string joined with one of the `join_string_*_level`
properties.
Example: `travelogues_extraction.dataextractors.dataextractors.combinedsubfields.VerfasserGND`
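
The multifield idea could be sketched like this. Everything here is an assumption for illustration: the `XpathIsGnd` tuple is rebuilt with plain path strings, GND identifiers are assumed to carry a `(DE-588)` prefix and to map to `https://d-nb.info/gnd/` URIs, the two join parameters are hypothetical names, and the sample record is invented.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}
GND_PREFIX = "(DE-588)"                   # assumed identifier prefix
GND_URI_BASE = "https://d-nb.info/gnd/"   # assumed URI form


class XpathIsGnd(NamedTuple):
    path: str     # path relative to the primary field
    is_gnd: bool  # True if the hit holds a GND identifier


def render_multifield(record, primary_path, xpath_isgnd_tuples,
                      inner_join=" | ", outer_join=" ; "):
    rendered_fields = []
    for field in record.findall(primary_path, MARC_NS):
        parts = []
        for spec in xpath_isgnd_tuples:
            for hit in field.findall(spec.path, MARC_NS):
                text = (hit.text or "").strip()
                if spec.is_gnd:
                    if not text.startswith(GND_PREFIX):
                        continue  # skip identifiers from other authority files
                    text = GND_URI_BASE + text[len(GND_PREFIX):]
                if text:
                    parts.append(text)
        if parts:
            rendered_fields.append(inner_join.join(parts))
    return outer_join.join(rendered_fields)


# Invented sample record: two contributor fields, only one with a GND id.
RECORD = ET.fromstring("""
<marc:record xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:datafield tag="700" ind1="1" ind2=" ">
    <marc:subfield code="a">Mustermann, Erika</marc:subfield>
    <marc:subfield code="0">(DE-588)123456789</marc:subfield>
    <marc:subfield code="0">(XX-000)987654321</marc:subfield>
  </marc:datafield>
  <marc:datafield tag="700" ind1="1" ind2=" ">
    <marc:subfield code="a">Musterfrau, Max</marc:subfield>
  </marc:datafield>
</marc:record>
""")

SPECS = [
    XpathIsGnd("marc:subfield[@code='a']", False),
    XpathIsGnd("marc:subfield[@code='0']", True),
]
output = render_multifield(RECORD, "marc:datafield[@tag='700']", SPECS)
print(output)
```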
###### `AbstractParentAsSecondCast`
This one looks in other records if it does not find the data in the current record. It uses `parent_ac_xpath`
to look up the other record, which is called the parent because in all our cases it was a parent record. Implement your data
extraction in `_write()`; `write()` takes care of the rest and calls your `_write()`.
Example: `travelogues_extraction.dataextractors.dataextractors.übergeordnet.Schlagworte`
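
The fallback to a parent record can be pictured with the sketch below. It is a simplified model, not the real base class: records are plain dataclasses instead of parsed XML, `parent_ac` stands in for whatever `parent_ac_xpath` would find, `fetch_record` is a hypothetical lookup function, and the signature of `_write` is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchRecord:
    ac_number: str
    subjects: List[str] = field(default_factory=list)
    parent_ac: Optional[str] = None  # stands in for the parent_ac_xpath result


class SubjectsWithParentFallback:
    column = "Subjects"

    def __init__(self, table, fetch_record):
        self.table = table                # plain dict instead of a dataframe
        self.fetch_record = fetch_record  # hypothetical: AC number -> record or None

    def _write(self, source, target_ac):
        # The actual extraction; returns True if something was written.
        if not source.subjects:
            return False
        self.table[(target_ac, self.column)] = " ; ".join(source.subjects)
        return True

    def write(self, record):
        # First cast: the record itself. Second cast: its parent, if any.
        if self._write(record, record.ac_number):
            return
        if record.parent_ac is None:
            return
        parent = self.fetch_record(record.parent_ac)
        if parent is not None:
            self._write(parent, record.ac_number)


parent = SketchRecord("AC00000001", subjects=["Reisebericht"])
child = SketchRecord("AC00000002", parent_ac="AC00000001")
store = {parent.ac_number: parent}

table = {}
extractor = SubjectsWithParentFallback(table, store.get)
extractor.write(child)
print(table)
```

Note that the value taken from the parent is still written to the row of the child record.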