Commit 050dd02f authored by philip.roeggla

README!

parent aae7cb2f
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
<component name="PySciProjectComponent">
<option name="PY_SCI_VIEW_SUGGESTED" value="true" />
</component>
</project>
\ No newline at end of file
# Alma Data Extractor
Takes a list of AC numbers, sends requests to SRU/Alma/ÖNB, extracts data from the returned MARCXML, arranges it in tabular form, and writes it to an Excel file.
## Concept
The program is mostly written asynchronously, so that it can process data while waiting for the responses to outstanding requests.
The main controller is `travelogues_extraction.controller.main.FromAlmaOutputToExcel`.
1. It takes a path to an Excel file and generates AC numbers from a column in that file.
2. It sends requests to SRU/ÖNB, parses each response with lxml, and takes the first `marc:record`.
3. It runs a list of "extractors" over each record; these populate the output dataframe.
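The three steps above can be sketched with plain `asyncio`. This is only an illustration of the overlap between waiting and processing, not the project's code: `fetch_record`, `extract` and `run` are hypothetical names, and the network request is simulated with a short sleep.

```python
import asyncio


async def fetch_record(ac_number: str) -> str:
    """Hypothetical stand-in for the SRU request; returns fake XML."""
    await asyncio.sleep(0.01)  # simulates network latency
    return f"<record><ac>{ac_number}</ac></record>"


def extract(xml: str) -> dict:
    """Hypothetical stand-in for running the extractors over one record."""
    ac_number = xml.split("<ac>")[1].split("</ac>")[0]
    return {"ac_number": ac_number, "length": len(xml)}


async def run(ac_numbers):
    # Start all requests at once, then process each response as soon as it
    # arrives while the remaining requests are still pending.
    tasks = [asyncio.create_task(fetch_record(ac)) for ac in ac_numbers]
    rows = []
    for finished in asyncio.as_completed(tasks):
        rows.append(extract(await finished))
    # Responses may arrive in any order, so sort for a stable result.
    return sorted(rows, key=lambda row: row["ac_number"])


rows = asyncio.run(run(["AC001", "AC002", "AC003"]))
```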
### Change or Add to the software
You will need solid knowledge of Python and XPath.
#### 1. AC-Number-Generator
To change how the AC numbers are generated, see `travelogues_extraction/getrecords/acnumber_extractor.py`.
#### 2. Requests
`travelogues_extraction.getrecords.session.RecordRetriever`
Here you can change the URL, the requests, and the XPath used to select the records.
#### 3. Extractors
All Extractors inherit from `travelogues_extraction.dataextractors.abstract.AbstractDataExtractor`.
There are several more specialized abstract extractors:
- `AbstractSingleDataExtractor`
- `AbstractXpathDirectlyToColumn`
- `AbstractXpathJoinDirectlyToColumn`
- `AbstractMultifield`
- `AbstractParentAsSecondCast`

All extractors in `travelogues_extraction/dataextractors/dataextractors` inherit from them.
Many of these classes define only properties and no methods, so it would be possible to pass those properties to
their parent classes' constructors instead. However, many others do have custom methods. The controller only relies on
the interface of the top-level `AbstractDataExtractor`.
##### 3.1. Order, Include / Exclude Extractors
The columns in the dataframe are generated in the order of the classes listed in `travelogues_extraction/controller/main.py:33`.
Each extractor contributes the column names returned by `travelogues_extraction.dataextractors.abstract.AbstractDataExtractor.get_columns_names_I_work_on`, in their order.
You can reorder, include, or exclude extractors there.
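A minimal sketch of how the class order can determine the column order. The classes, the column names and the `column_order` helper are hypothetical, and treating `get_columns_names_I_work_on` as a classmethod is an assumption made for this illustration.

```python
from typing import List


class ExtractorSketch:
    """Hypothetical stand-in for AbstractDataExtractor."""

    columns: List[str] = []

    @classmethod
    def get_columns_names_I_work_on(cls) -> List[str]:
        # Assumption: each extractor reports its columns in output order.
        return list(cls.columns)


class Index(ExtractorSketch):
    columns = ["MMS-ID"]


class Title(ExtractorSketch):
    columns = ["Title", "Subtitle"]


class Author(ExtractorSketch):
    columns = ["Author"]


def column_order(classes) -> List[str]:
    """Concatenate the column names of each class, preserving class order."""
    ordered: List[str] = []
    for cls in classes:
        ordered.extend(cls.get_columns_names_I_work_on())
    return ordered


# Reordering or removing entries in this list changes the resulting columns.
extractor_classes = [Index, Title, Author]
columns = column_order(extractor_classes)
```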
##### 3.2. Write a data extractor
Inherit from `travelogues_extraction.dataextractors.abstract.AbstractDataExtractor` and implement a method `write` that
takes a `travelogues_extraction.getrecords.session.RecordRetriever.Record` as input and writes to the target dataframe:
`self.target_dataframe.at[record.ac_number, self.column] = 'your data'`. You will find the lxml-parsed XML representation
in the `lxmlelement` property of the record.
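A hedged sketch of what such an extractor might look like. To keep it free of third-party dependencies it swaps in the standard library's `xml.etree.ElementTree` for lxml and a small dict-backed `TableSketch` for the pandas dataframe. `RecordSketch`, `TableSketch`, `PublicationYear`, the `<year>` element and the constructor signature are all hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class RecordSketch:
    """Hypothetical stand-in for RecordRetriever.Record."""

    ac_number: str
    lxmlelement: ET.Element  # the project stores an lxml element here


@dataclass
class TableSketch:
    """Stand-in for the dataframe: `.at[row, column] = value` works on a dict."""

    at: dict = field(default_factory=dict)


class PublicationYear:
    """Hypothetical extractor that writes the text of <year> to column 'Year'."""

    column = "Year"

    def __init__(self, target_dataframe):
        # Assumption: the extractor receives the target table on construction.
        self.target_dataframe = target_dataframe

    def write(self, record):
        node = record.lxmlelement.find("year")
        value = node.text if node is not None else None
        self.target_dataframe.at[record.ac_number, self.column] = value


table = TableSketch()
record = RecordSketch("AC001", ET.fromstring("<record><year>1785</year></record>"))
PublicationYear(table).write(record)
```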
There are a few parent classes you can use:
###### `AbstractXpathDirectlyToColumn`
Child classes define the column and an XPath object, and this class writes the first text found to the target dataframe.
Example: `travelogues_extraction.dataextractors.dataextractors.index.MMSID`
###### `AbstractXpathJoinDirectlyToColumn`
Child classes define the column and an XPath object, and this class joins the text results using the property `join_string`.
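The difference between taking the first text and joining all texts can be sketched as follows. The example uses the standard library's `xml.etree.ElementTree` instead of lxml. The `join_texts` helper, the `<place>` element and the default separator are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def join_texts(element, path, join_string="; "):
    """Join the text of every node matching `path` with `join_string`."""
    texts = [node.text for node in element.findall(path) if node.text]
    return join_string.join(texts)


record = ET.fromstring(
    "<record><place>Wien</place><place>Leipzig</place></record>"
)

# First match only, as a "directly to column" extractor would write it.
first = record.findtext("place")

# All matches joined, as a "join directly to column" extractor would write them.
joined = join_texts(record, "place")
```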
###### `AbstractMultifield`
Looks up data in the record with `primary_xml_path: lxmletree.XPath`, then uses each of the `xpath_isgnd_tuples`
to generate data. An `XpathIsGnd` consists of an XPath object and a flag that tells whether the datum contains GND data.
If so, the output is rendered as URIs. The output is rendered as a string using the `join_string_*_level`
properties.
Example: `travelogues_extraction.dataextractors.dataextractors.combinedsubfields.VerfasserGND`
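A rough sketch of the rendering idea, under several assumptions: values are already extracted (no XPath is evaluated here), GND identifiers carry a `(DE-588)` prefix, URIs use the `https://d-nb.info/gnd/` prefix, and the two separators stand in for the `join_string_*_level` properties. `FieldSketch`, `render` and `render_fields` are hypothetical names, and the identifier is a made-up placeholder.

```python
from typing import List, NamedTuple


class FieldSketch(NamedTuple):
    """Hypothetical pair of an extracted text and its is-GND flag."""

    text: str
    is_gnd: bool


GND_BASE = "https://d-nb.info/gnd/"  # assumed URI prefix


def render(value: FieldSketch) -> str:
    """Render GND values as URIs, everything else unchanged."""
    if not value.is_gnd:
        return value.text
    return GND_BASE + value.text.replace("(DE-588)", "")


def render_fields(fields: List[List[FieldSketch]], inner="; ", outer=" | ") -> str:
    """Join values within a field with `inner` and fields with `outer`."""
    return outer.join(inner.join(render(v) for v in group) for group in fields)


fields = [
    [FieldSketch("Example, Author", False), FieldSketch("(DE-588)123456789", True)],
    [FieldSketch("Other, Author", False)],
]
output = render_fields(fields)
```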
###### `AbstractParentAsSecondCast`
This class looks in other records if it does not find the data in the current record. It uses `parent_ac_xpath`
to look up the other record, which is called the parent because in all our cases it was a parent record. Implement your data
extraction in `_write()`; `write()` takes care of the rest and calls your `_write()`.
Example: `travelogues_extraction.dataextractors.dataextractors.übergeordnet.Schlagworte`
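The fallback to the parent record can be sketched as a template method. Records are plain dicts here, and the `records` store, the `KeywordsSketch` class and the `write` signature are hypothetical. In the project, the parent would be located via `parent_ac_xpath` rather than a dict key.

```python
from typing import Dict, Optional

# Hypothetical record store: AC002 has no keywords but points to its parent.
records: Dict[str, dict] = {
    "AC001": {"parent": None, "keywords": "Reisebericht"},
    "AC002": {"parent": "AC001", "keywords": None},
}


class KeywordsSketch:
    """Hypothetical extractor with a parent-record fallback."""

    def _write(self, record: dict) -> Optional[str]:
        # The subclass-specific extraction: return the value or None.
        return record["keywords"]

    def write(self, ac_number: str, store: Dict[str, dict]) -> Optional[str]:
        # Try the current record first, then fall back to the parent record.
        record = store[ac_number]
        value = self._write(record)
        if value is None and record["parent"] is not None:
            value = self._write(store[record["parent"]])
        return value


own = KeywordsSketch().write("AC001", records)
inherited = KeywordsSketch().write("AC002", records)
```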
@@ -4,7 +4,6 @@ from dataclasses import dataclass
 import re as regex
 import typing
-import httpcore
 import httpx
 from pandas import DataFrame
@@ -66,7 +65,6 @@ class AbstractXpathJoinDirectlyToColumn(AbstractSingleDataExtractor):
 return result
 class AbstractMultifield(AbstractDataExtractor):
 column: str
......
File deleted
%% Cell type:code id: tags:
``` python
import time
```
%% Cell type:code id: tags:
``` python
!pip install httpx
```
%% Output
Requirement already satisfied: httpx in /home/phylogram/Documents/onb-homeoffice-local/TraveloguesExtraktion/venv/lib/python3.7/site-packages (0.13.3)
Requirement already satisfied: certifi in /home/phylogram/Documents/onb-homeoffice-local/TraveloguesExtraktion/venv/lib/python3.7/site-packages (from httpx) (2020.6.20)
Requirement already satisfied: rfc3986<2,>=1.3 in /home/phylogram/Documents/onb-homeoffice-local/TraveloguesExtraktion/venv/lib/python3.7/site-packages (from httpx) (1.4.0)
Requirement already satisfied: httpcore==0.9.* in /home/phylogram/Documents/onb-homeoffice-local/TraveloguesExtraktion/venv/lib/python3.7/site-packages (from httpx) (0.9.1)
Requirement already satisfied: hstspreload in /home/phylogram/Documents/onb-homeoffice-local/TraveloguesExtraktion/venv/lib/python3.7/site-packages (from httpx) (2020.7.14)
Requirement already satisfied: chardet==3.* in /home/phylogram/Documents/onb-homeoffice-local/TraveloguesExtraktion/venv/lib/python3.7/site-packages (from httpx) (3.0.4)
Requirement already satisfied: sniffio in /home/phylogram/Documents/onb-homeoffice-local/TraveloguesExtraktion/venv/lib/python3.7/site-packages (from httpx) (1.1.0)
Requirement already satisfied: idna==2.* in /home/phylogram/Documents/onb-homeoffice-local/TraveloguesExtraktion/venv/lib/python3.7/site-packages (from httpx) (2.10)
Requirement already satisfied: h11<0.10,>=0.8 in /home/phylogram/Documents/onb-homeoffice-local/TraveloguesExtraktion/venv/lib/python3.7/site-packages (from httpcore==0.9.*->httpx) (0.9.0)
Requirement already satisfied: h2==3.* in /home/phylogram/Documents/onb-homeoffice-local/TraveloguesExtraktion/venv/lib/python3.7/site-packages (from httpcore==0.9.*->httpx) (3.2.0)
Requirement already satisfied: hpack<4,>=3.0 in /home/phylogram/Documents/onb-homeoffice-local/TraveloguesExtraktion/venv/lib/python3.7/site-packages (from h2==3.*->httpcore==0.9.*->httpx) (3.0.0)
Requirement already satisfied: hyperframe<6,>=5.2.0 in /home/phylogram/Documents/onb-homeoffice-local/TraveloguesExtraktion/venv/lib/python3.7/site-packages (from h2==3.*->httpcore==0.9.*->httpx) (5.2.0)
%% Cell type:code id: tags:
``` python
from controller.main import FromAlmaOutputToExcel
```
%% Output
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-4-c0104c72085e> in <module>
----> 1 from controller.main import FromAlmaOutputToExcel
~/Documents/onb-homeoffice-local/TraveloguesExtraktion/travelogues_extraction/controller/main.py in <module>
3 import typing
4
----> 5 import httpx
6
7 if typing.TYPE_CHECKING:
ModuleNotFoundError: No module named 'httpx'
%% Cell type:code id: tags:
``` python
```