# Alma Data Extractor
Takes a list of AC numbers, requests the corresponding records from SRU/Alma/ÖNB, extracts data from the returned MARC XML, arranges it in tabular form and writes it to an Excel file.
## Concept
The program is mostly written asynchronously, so that it can process data that has already arrived while waiting for outstanding requests.
The main controller is `travelogues_extraction.controller.main.FromAlmaOutputToExcel`.
1. It takes a path to an Excel file and generates AC numbers from a column in that file.
2. It makes requests to SRU/ÖNB, parses each response with lxml, and takes the first `marc:record`.
3. It runs a list of "extractors" over each record; these populate the output dataframe.
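
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the project's controller: `fetch_record`, `extract_mms_id`, `run` and the canned response are hypothetical, `xml.etree.ElementTree` stands in for lxml, and a plain dict stands in for the output dataframe so that the sketch runs with the standard library only.

```python
import asyncio
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical canned response standing in for a reply from the SRU interface.
FAKE_RESPONSE = """<records xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:record>
    <marc:controlfield tag="001">{mms_id}</marc:controlfield>
  </marc:record>
</records>"""


async def fetch_record(ac_number: str) -> tuple[str, ET.Element]:
    """Pretend to request one record; the sleep stands in for network latency."""
    await asyncio.sleep(0.01)
    root = ET.fromstring(FAKE_RESPONSE.format(mms_id="99" + ac_number[2:]))
    # Step 2: take the first marc:record of the response.
    return ac_number, root.find(".//marc:record", MARC_NS)


def extract_mms_id(record: ET.Element) -> str:
    """Step 3: a single, very small 'extractor'."""
    return record.findtext("marc:controlfield[@tag='001']", namespaces=MARC_NS)


async def run(ac_numbers: list[str]) -> dict[str, dict[str, str]]:
    table: dict[str, dict[str, str]] = {}
    tasks = [asyncio.create_task(fetch_record(ac)) for ac in ac_numbers]
    # Each record is processed as soon as its request finishes,
    # while the remaining requests are still in flight.
    for finished in asyncio.as_completed(tasks):
        ac_number, record = await finished
        table[ac_number] = {"MMS-ID": extract_mms_id(record)}
    return table


result = asyncio.run(run(["AC00000001", "AC00000002"]))
```

In the real program the AC numbers come from the Excel column (step 1) and the table is written back to Excel at the end.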
### Change or Add to the software
You will need solid knowledge of Python and XPath.
#### 1. AC-Number-Generator
If you would like to change the way the AC numbers are generated, see `travelogues_extraction/getrecords/acnumber_extractor.py`.
#### 2. Requests
`travelogues_extraction.getrecords.session.RecordRetriever`
Here you can change the URL, the requests, and the XPath used to select the records.
#### 3. Extractors
All Extractors inherit from `travelogues_extraction.dataextractors.abstract.AbstractDataExtractor`.
There are several more specialized abstract extractors:
- `AbstractSingleDataExtractor`
- `AbstractXpathDirectlyToColumn`
- `AbstractXpathJoinDirectlyToColumn`
- `AbstractMultifield`
- `AbstractParentAsSecondCast`
All extractors in `travelogues_extraction/dataextractors/dataextractors` inherit from one of them.

Many of these classes only define properties and no methods, so it would be possible to pass these properties to their parent classes in the constructor instead. However, many of them do have custom methods. The controller only needs to rely on the interface of the top-level `AbstractDataExtractor`.
##### 3.1. Order, Include / Exclude Extractors
The columns in the dataframe are generated in the order of the classes listed in `travelogues_extraction/controller/main.py:33`.
Each class contributes the column names returned by `travelogues_extraction.dataextractors.abstract.AbstractDataExtractor.get_columns_names_I_work_on`, in their order.
You can reorder, include or exclude extractors there.
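
How the class order translates into column order can be illustrated with a small sketch. The classes, column names and the `column_order` helper below are hypothetical, and treating `get_columns_names_I_work_on` as a classmethod is an assumption made for brevity, not a statement about the real signature.

```python
class Extractor:
    """Stand-in for AbstractDataExtractor; only the column listing is modelled."""
    columns: tuple = ()

    @classmethod
    def get_columns_names_I_work_on(cls) -> list:
        # Assumption: each extractor reports its columns in a fixed order.
        return list(cls.columns)


class MmsId(Extractor):
    columns = ("MMS-ID",)


class Author(Extractor):
    columns = ("Verfasser", "Verfasser GND")


class Title(Extractor):
    columns = ("Titel",)


# Reordering, adding or removing entries here changes the output columns.
EXTRACTOR_CLASSES = [MmsId, Title, Author]


def column_order(classes) -> list:
    """Concatenate the columns of all extractors in list order."""
    return [name for cls in classes for name in cls.get_columns_names_I_work_on()]


columns = column_order(EXTRACTOR_CLASSES)
```

With the list above, `columns` is `["MMS-ID", "Titel", "Verfasser", "Verfasser GND"]`.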
##### 3.2. Write a data extractor
Inherit from `travelogues_extraction.dataextractors.abstract.AbstractDataExtractor` and implement a method `write` that
takes a `travelogues_extraction.getrecords.session.RecordRetriever.Record` as input and writes to
`self.target_dataframe.at[record.ac_number, self.column] = 'your data'`. The lxml-parsed XML representation is available
in the `lxmlelement` property of the record.
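
A custom extractor might look roughly like the sketch below. `Record` and `AbstractDataExtractor` are simplified stand-ins for the project's classes, the `Title` extractor and its column name are hypothetical, `xml.etree.ElementTree` replaces lxml, and a dict keyed by `(ac_number, column)` mimics `target_dataframe.at[...]` so that the example needs no third-party packages. Real MARC XML is namespaced; the namespace is left out here for brevity.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Record:
    """Stand-in for RecordRetriever.Record with the two attributes used here."""
    ac_number: str
    lxmlelement: ET.Element  # the real project stores an lxml element here


class AbstractDataExtractor:
    """Stand-in base class; the real one lives in dataextractors.abstract."""
    column: str = ""

    def __init__(self, target_dataframe: dict):
        # A dict keyed by (ac_number, column) mimics `target_dataframe.at[...]`.
        self.target_dataframe = target_dataframe

    def write(self, record: Record) -> None:
        raise NotImplementedError


class Title(AbstractDataExtractor):
    """Hypothetical extractor: main title from MARC field 245, subfield a."""
    column = "Titel"

    def write(self, record: Record) -> None:
        title = record.lxmlelement.findtext(
            "datafield[@tag='245']/subfield[@code='a']", default=""
        )
        self.target_dataframe[record.ac_number, self.column] = title


table = {}
record = Record(
    "AC00000001",
    ET.fromstring(
        "<record><datafield tag='245'>"
        "<subfield code='a'>Reise nach Wien</subfield>"
        "</datafield></record>"
    ),
)
Title(table).write(record)
```
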
There are a few parent classes you can use:
###### `AbstractXpathDirectlyToColumn`
Child classes define the column and an XPath object; this class writes the first text found to the target dataframe.
Example: `travelogues_extraction.dataextractors.dataextractors.index.MMSID`
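
The "first text found" behaviour can be sketched as follows. `FirstTextToColumn`, `MmsId` and `Record` are simplified stand-ins rather than the project's classes, the XPath for the MMS ID (control field 001) is an assumption, and the standard library replaces lxml and pandas.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Record:
    ac_number: str
    lxmlelement: ET.Element


class FirstTextToColumn:
    """Stand-in parent class: writes the text of the first XPath match."""
    column: str
    xpath: str

    def __init__(self, target: dict):
        self.target = target  # dict keyed by (ac_number, column)

    def write(self, record: Record) -> None:
        hits = record.lxmlelement.findall(self.xpath)
        self.target[record.ac_number, self.column] = hits[0].text if hits else None


class MmsId(FirstTextToColumn):
    # The child class only declares data, no methods.
    column = "MMS-ID"
    xpath = "controlfield[@tag='001']"


table = {}
xml = (
    "<record>"
    "<controlfield tag='001'>990001</controlfield>"
    "<controlfield tag='009'>AC00000001</controlfield>"
    "</record>"
)
MmsId(table).write(Record("AC00000001", ET.fromstring(xml)))
```
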
###### `AbstractXpathJoinDirectlyToColumn`
Child classes define the column and an XPath object; this class joins the text results using the `join_string` property.
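
A sketch of the joining variant, under the same assumptions: `JoinTextToColumn`, `Languages` and `Record` are hypothetical stand-ins, and the choice of MARC field 041 and of `" ; "` as separator is only for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Record:
    ac_number: str
    lxmlelement: ET.Element


class JoinTextToColumn:
    """Stand-in parent class: joins the text of all XPath matches."""
    column: str
    xpath: str
    join_string: str = " ; "

    def __init__(self, target: dict):
        self.target = target  # dict keyed by (ac_number, column)

    def write(self, record: Record) -> None:
        texts = [el.text for el in record.lxmlelement.findall(self.xpath) if el.text]
        self.target[record.ac_number, self.column] = self.join_string.join(texts)


class Languages(JoinTextToColumn):
    column = "Sprachen"
    xpath = "datafield[@tag='041']/subfield[@code='a']"


table = {}
xml = (
    "<record><datafield tag='041'>"
    "<subfield code='a'>ger</subfield>"
    "<subfield code='a'>lat</subfield>"
    "</datafield></record>"
)
Languages(table).write(Record("AC00000001", ET.fromstring(xml)))
```
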
###### `AbstractMultifield`
Looks up data in the record with `primary_xml_path: lxmletree.XPath`, then uses each of the `xpath_isgnd_tuples`
to generate data. An `XpathIsGnd` consists of an XPath object and a flag that tells whether this datum contains GND data.
If so, the output is rendered as URIs. The output is rendered as a string joined with one of the `join_string_*_level`
properties.
Example: `travelogues_extraction.dataextractors.dataextractors.combinedsubfields.VerfasserGND`
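
The two-level lookup can be sketched like this. Everything here is an illustration under assumptions: the `render` function and the two join parameters are hypothetical, the names and GND numbers are made up, and the sketch assumes that GND identifiers appear in subfield 0 with the prefix `(DE-588)` and are turned into `https://d-nb.info/gnd/` URIs. The project's actual rendering may differ.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


class XpathIsGnd(NamedTuple):
    xpath: str    # evaluated relative to each primary field
    is_gnd: bool  # True if the matched text is a GND identifier


GND_PREFIX = "(DE-588)"                  # assumed marker for GND identifiers
GND_BASE_URI = "https://d-nb.info/gnd/"  # assumed URI form


def render(record, primary_path, xpath_isgnd_tuples, join_inner=" ; ", join_outer=" | "):
    """Render every primary field as joined subfield values, then join the fields."""
    rendered_fields = []
    for field in record.findall(primary_path):
        parts = []
        for xpath, is_gnd in xpath_isgnd_tuples:
            for hit in field.findall(xpath):
                text = hit.text or ""
                if is_gnd:
                    if not text.startswith(GND_PREFIX):
                        continue  # skip identifiers from other authority files
                    text = GND_BASE_URI + text[len(GND_PREFIX):]
                parts.append(text)
        rendered_fields.append(join_inner.join(parts))
    return join_outer.join(rendered_fields)


record = ET.fromstring(
    "<record>"
    "<datafield tag='700'>"
    "<subfield code='a'>Mustermann, Max</subfield>"
    "<subfield code='0'>(DE-588)123456789</subfield>"
    "</datafield>"
    "<datafield tag='700'>"
    "<subfield code='a'>Musterfrau, Erika</subfield>"
    "<subfield code='0'>(AT-OBV)AC00000009</subfield>"
    "<subfield code='0'>(DE-588)987654321</subfield>"
    "</datafield>"
    "</record>"
)
authors = render(
    record,
    "datafield[@tag='700']",
    [XpathIsGnd("subfield[@code='a']", False), XpathIsGnd("subfield[@code='0']", True)],
)
```
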
###### `AbstractParentAsSecondCast`
This class looks in other records if it does not find the data in the current record. It uses `parent_ac_xpath`
to look for the parent, which is called so because in all our cases it was a parent record. Implement your data
extraction in `_write()`; `write()` will take care of the rest and call your `_write()`.
Example: `travelogues_extraction.dataextractors.dataextractors.übergeordnet.Schlagworte`
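
The fallback to the parent record can be sketched as below. This is a simplification under stated assumptions: the in-memory `RECORDS` dict replaces the request for the parent record, the XPaths (field 689 for subject headings, field 773 subfield w for the parent AC number) are chosen for illustration, and having `_write()` return the extracted values is one possible way for `write()` to notice that nothing was found, not necessarily how the project does it.

```python
import xml.etree.ElementTree as ET

# Hypothetical record store standing in for requests to the SRU interface.
RECORDS = {
    "AC00000001": ET.fromstring(
        "<record><datafield tag='689'>"
        "<subfield code='a'>Reisebericht</subfield>"
        "</datafield></record>"
    ),
    "AC00000002": ET.fromstring(
        "<record><datafield tag='773'>"
        "<subfield code='w'>AC00000001</subfield>"
        "</datafield></record>"
    ),
}

SUBJECT_XPATH = "datafield[@tag='689']/subfield[@code='a']"
PARENT_AC_XPATH = "datafield[@tag='773']/subfield[@code='w']"


def _write(record: ET.Element) -> list:
    """The actual extraction; knows nothing about parents."""
    return [el.text for el in record.findall(SUBJECT_XPATH) if el.text]


def write(table: dict, ac_number: str) -> None:
    """Try the record itself first, then fall back to its parent record."""
    record = RECORDS[ac_number]
    subjects = _write(record)
    if not subjects:
        parent_ac = record.findtext(PARENT_AC_XPATH)
        if parent_ac in RECORDS:
            subjects = _write(RECORDS[parent_ac])
    table[ac_number, "Schlagworte"] = " ; ".join(subjects)


table = {}
write(table, "AC00000001")  # found in the record itself
write(table, "AC00000002")  # found only in the parent record
```
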