# Data Processing Pipeline

This pipeline converts the raw Excel data into a TSV file ready for use by the application.
It corrects place names, extracts and enriches the shelf data, and generates a new index.

## Prerequisites

- Python 3.8+
- JupyterLab
- Required Python packages (install via `pip install -r requirements.txt`)

## Setup

1. Navigate to the data directory:
   ```bash
   cd /data
   ```

2. Copy your source data:
   - Copy your data file into the `/data` directory under the name `BE_final.xlsx`
   - The processing scripts require this exact filename

3. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```

4. Start JupyterLab:
   ```bash
   jupyter lab
   ```

5. Open the processing notebook:
   ```
   /data/data.ipynb
   ```
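
Before running the notebook, it can help to confirm that the input file from step 2 is in place. The following is a minimal sketch, not part of the pipeline itself: the function name `find_input_file` and its `data_dir` argument are assumptions made for illustration, while the filename `BE_final.xlsx` is the one required above.

```python
from pathlib import Path

# The filename comes from this README; the function name and the default
# directory argument are assumptions made for this sketch.
REQUIRED_INPUT = "BE_final.xlsx"


def find_input_file(data_dir="/data"):
    """Return the path to the input workbook, or raise if it is missing."""
    candidate = Path(data_dir) / REQUIRED_INPUT
    if not candidate.is_file():
        raise FileNotFoundError(
            f"Expected {REQUIRED_INPUT} in {data_dir}; copy your source file there first."
        )
    return candidate
```

Calling `find_input_file()` in the first notebook cell would fail early with a readable message instead of partway through processing.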

## Processing Steps

### Section 1: Place Name Correction
- Processes and standardizes place names in the dataset
- **Output**: `/data/data_ort_corr.tsv`

### Section 2: ZZ Section Year Computation

The top rows of the first six shelves (BE.1-6.ZZ) hold a large number of journal volumes spanning nearly 100 years.
The `process.sh` script computes the year of publication for each of these volumes.

The steps below describe how to recompute those years.

**Note**: You can usually skip this section if you already have a `resultzz.tsv` file and the top rows of BE.1-6.ZZ have not changed.

If changes are needed:
1. Make a backup of your existing `resultzz.tsv`
2. Navigate to the `Zz_codes` directory
3. Run `./process.sh` and wait for downloads to complete
4. Verify the new `resultzz.tsv`

**Output**: `/data/resultzz.tsv`
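
The backup in step 1 can be made from Python so that an earlier `resultzz.tsv` is never overwritten by accident. This is a sketch under stated assumptions: the helper name `backup_file` and the timestamped naming scheme are illustrative choices, not something `process.sh` does for you.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_file(path, now=None):
    """Copy `path` to a timestamped sibling file and return the backup path."""
    source = Path(path)
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    # Naming scheme is an assumption, e.g.
    # resultzz.tsv -> resultzz.20240131-120000.bak.tsv
    target = source.with_name(f"{source.stem}.{stamp}.bak{source.suffix}")
    shutil.copy2(source, target)  # copy2 also preserves the modification time
    return target
```

For example, `backup_file("/data/resultzz.tsv")` would place the copy next to the original, so it can be restored if the recomputed file fails verification in step 4.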

### Section 3: Section and Row Computation
- Calculates section and row assignments
- Displays section computation results for verification

### Section 4: Final Index Creation / Cleansing
- Generates the final index
- Removes temporary processing columns
- **Output**: `/data/data_final.tsv`
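
Removing temporary columns from a TSV file can be sketched with the standard library alone. This is only an illustration of the idea: the notebook may well use a different approach, and the function name `drop_columns` as well as any column names passed to it are hypothetical.

```python
import csv


def drop_columns(in_path, out_path, columns_to_drop):
    """Write a copy of a TSV file without the named columns.

    Returns the list of column names that were kept.
    """
    with open(in_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src, delimiter="\t")
        kept = [name for name in (reader.fieldnames or []) if name not in columns_to_drop]
        with open(out_path, "w", newline="", encoding="utf-8") as dst:
            writer = csv.DictWriter(
                dst,
                fieldnames=kept,
                delimiter="\t",
                lineterminator="\n",
                extrasaction="ignore",  # silently skip the dropped columns
            )
            writer.writeheader()
            for row in reader:
                writer.writerow(row)
    return kept
```

A call such as `drop_columns("working.tsv", "/data/data_final.tsv", {"tmp_section"})` would write the cleaned file; both `working.tsv` and `tmp_section` are placeholder names.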

## Verification

To verify the processed data:
1. Run `compare2tsvFiles.ipynb`
2. Review `output_report.xlsx`
3. If results are satisfactory, copy `data_final.tsv` to your `/app` directory
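
For a quick check outside the notebook, two TSV files can also be compared row by row in plain Python. The sketch below is not what `compare2tsvFiles.ipynb` does and does not produce `output_report.xlsx`; the function names `read_tsv` and `diff_tsv` are assumptions for illustration.

```python
import csv


def read_tsv(path):
    """Read a TSV file into a list of rows (each row is a list of strings)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter="\t"))


def diff_tsv(old_path, new_path):
    """Return (row_number, old_row, new_row) for every row that differs.

    Row numbers are 1-based and count the header row. A row that is
    missing from one of the files is reported as None.
    """
    old_rows, new_rows = read_tsv(old_path), read_tsv(new_path)
    differences = []
    for index in range(max(len(old_rows), len(new_rows))):
        old = old_rows[index] if index < len(old_rows) else None
        new = new_rows[index] if index < len(new_rows) else None
        if old != new:
            differences.append((index + 1, old, new))
    return differences
```

An empty result from `diff_tsv` means the two files have identical rows in the same order. Because the comparison is positional, a single inserted row will make every following row show up as different.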

## File Structure
```
/data/
├── BE_final.xlsx               # Input file
├── data.ipynb                  # Main processing notebook
├── data_ort_corr.tsv           # Corrected place names
├── resultzz.tsv                # ZZ section years
├── data_final.tsv              # Final processed data
├── compare2tsvFiles.ipynb      # Verification notebook
├── Zz_codes/
│   └── process.sh              # Script to compute years for ZZ rows
├── resources/                  # Supporting documents
│   └── *.pdf, *.jpg            # PDFs with additional information and some images 
└── geocoded_places/
    └── BED_geocoded_places.tsv # File with geocodes
```



