Commit 007f037d authored by smayer

Add forced encoding for txt files while running the notebook on Windows machines
%% Cell type:markdown id:cb89a18e-4bc5-45ba-aac3-660b53841d2f tags:
# Creating your own text using [Tesseract OCR](https://github.com/tesseract-ocr/tesseract)
If the available OCR is of poor quality, we can create our own. Here are some ideas on what to try:
**Improve image quality and text orientation**
We will use a custom Python script developed at the ONB for two projects supported by CLARIAH-AT (see https://clariah.at/en/projects/machine-learning-suite-iiif-resources/ and https://clariah.at/en/projects/esperanto-newspaper-excerpts/). The script is intended for historic scans: it removes scanning borders, deskews the image, and converts it to grayscale. See also the demonstration of this script at the ANNO event 2023 (https://labs.onb.ac.at/gitlab/labs-team/anno-event-2023).
**Improve data basis of OCR software**
For German Fraktur script, Tesseract's default data will not give good results. We can use specialized models trained on Fraktur from https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/. For example:
- https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/Fraktur_5000000/Fraktur_5000000_0.584_102422.traineddata
- https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_best/frak2021-0.905.traineddata
- https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/german_print/tessdata_best/german_print_0.877_1254744_7309067.traineddata
After downloading, move these files to the tessdata folder reported by the command `tesseract --list-langs`.
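The move can be scripted as below. This is a minimal sketch: `tessdata_dir` and `downloads` are placeholder paths you would replace with the tessdata directory reported by `tesseract --list-langs` and your actual download folder.

``` python
import shutil
from pathlib import Path

# Placeholder paths -- adjust to your system (see `tesseract --list-langs`).
tessdata_dir = Path('/usr/share/tesseract-ocr/5/tessdata')
downloads = Path('downloads')

def install_models(src_dir, dst_dir):
    """Move all .traineddata files from src_dir into dst_dir.

    Returns the list of moved file names.
    """
    moved = []
    for model in sorted(Path(src_dir).glob('*.traineddata')):
        shutil.move(str(model), str(Path(dst_dir) / model.name))
        moved.append(model.name)
    return moved
```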
# Improve image quality and text orientation
Here, we don't show additional debug data, but the corresponding images are supplied in the folder `img/debug`.
%% Cell type:code id:195c1f7d-9056-40f7-86cb-835d9a5052a9 tags:
``` python
from preprocessing import *
import cv2 as cv
```
%% Cell type:code id:54580312-c90d-49ec-a26d-aaaad34d7409 tags:
``` python
img_paths = ['img/kfz18151101_00000001.jpg', 'img/kfz18700224_00000001.jpg']
for path in img_paths:
    preprocessed_img = preprocess_pipeline(cv.imread(path), path, debug=False, debug_path='img/debug')
    cv.imwrite(path.replace('.jpg', '_preprocessed.jpg'), preprocessed_img)
```
%% Cell type:code id:c47423fe-02aa-4d61-bdd5-2c589f57f5ee tags:
``` python
from pathlib import Path
paths = sorted(Path('img').glob('*.jpg'))
gallery(paths, row_height='800px')
```
%% Output
<IPython.core.display.HTML object>
%% Cell type:markdown id:3bf8fe16-169e-46ce-8c77-eea4c4feac80 tags:
### Apply Tesseract to original and preprocessed images
This assumes Tesseract is installed on your system and available under the command `tesseract` (i.e. added to the PATH variable). We make the call with the following options:
- language data `-l deu+Fraktur+frk`: provided by Tesseract for German and Fraktur scripts (see https://github.com/tesseract-ocr/tessdata_best)
- page segmentation mode `--psm 3`: Fully automatic page segmentation, but no OSD. (Default). You can see other available modes using the command `tesseract --help-extra`
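The cell below shells out via `os.system`. Building the argument list explicitly (e.g. for `subprocess.run`) avoids quoting problems with paths that contain spaces; a sketch of such a helper (the name `build_tesseract_cmd` is our own, not part of Tesseract):

``` python
def build_tesseract_cmd(img_path, out_base, langs='deu+Fraktur+frk', psm=3,
                        formats=('alto', 'txt')):
    """Assemble the tesseract argument list as used in this notebook."""
    return ['tesseract', '-l', langs, '--psm', str(psm),
            str(img_path), str(out_base), *formats]
```

Calling `subprocess.run(build_tesseract_cmd(img, out_path), check=True)` would also raise on a nonzero exit code instead of failing silently.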
%% Cell type:code id:83847ba8-9c30-4499-9dbe-fc8552555194 tags:
``` python
import os

paths = sorted(Path('img').glob('*.jpg'))
for img in paths:
    out_path = str(img).replace('img', 'data').replace('.jpg', '_tess')
    os.system(f"tesseract -l deu+Fraktur+frk --psm 3 {img} {out_path} alto txt")
```
%% Output
Estimating resolution as 358
Estimating resolution as 359
Estimating resolution as 396
%% Cell type:markdown id:c16d1c55-b033-4124-a250-c0a00c3adc98 tags:
### Compare original OCR (made with ABBYY FineReader) with Tesseract applied to original and preprocessed images
%% Cell type:code id:1ac51927-4cba-457a-a8e9-9fd9d7b52f65 tags:
``` python
import pandas as pd
from IPython.display import display, HTML

txtpaths = sorted(Path('data').glob('*1815*.txt'))
strpaths = list(filter(lambda x: 'new' not in x, [str(p) for p in txtpaths]))
txts = [open(txt, 'r', encoding='utf-8').read() for txt in strpaths]
txt_df = pd.DataFrame([txts], columns=strpaths)
pd.set_option("display.max_colwidth", None)
display(HTML(txt_df.to_html().replace("\\n","<br>")))
```
%% Output
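Passing `encoding='utf-8'` matters here: without it, `open()` falls back to the locale's preferred encoding, which on Windows is often cp1252, and the UTF-8 text files written by Tesseract would come back with mangled umlauts. A small illustration of the failure mode (the sample string is our own):

``` python
# Tesseract writes its txt output as UTF-8.
utf8_bytes = 'Kaiserlich-königliche Wiener Zeitung'.encode('utf-8')

# Decoding with the correct codec round-trips cleanly:
assert utf8_bytes.decode('utf-8') == 'Kaiserlich-königliche Wiener Zeitung'

# Decoding with cp1252 (a common Windows locale default) turns every
# umlaut into two mojibake characters:
print(utf8_bytes.decode('cp1252'))  # 'Kaiserlich-kÃ¶nigliche Wiener Zeitung'
```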
%% Cell type:code id:b9542753-a848-421e-81f3-65b07f3818b4 tags:
``` python
txtpaths = sorted(Path('data').glob('*1870*.txt'))
strpaths = list(filter(lambda x: 'new' not in x, [str(p) for p in txtpaths]))
txts = [open(txt, 'r', encoding='utf-8').read() for txt in strpaths]
txt_df = pd.DataFrame([txts], columns=strpaths)
pd.set_option("display.max_colwidth", None)
display(HTML(txt_df.to_html().replace("\\n","<br>")))
```
%% Output
%% Cell type:markdown id:468939f5-c54c-4010-aac9-c61097e4a114 tags:
# Improve data basis of Tesseract OCR
After downloading the models listed above and moving them to the correct folder, we renamed them:
- `Fraktur_5000000_0.584_102422.traineddata` to `Fraktur_new.traineddata`
- `frak2021-0.905.traineddata` to `frak2021.traineddata`
- `german_print_0.877_1254744_7309067.traineddata` to `german_print.traineddata`
Then we apply Tesseract again (on the preprocessed images), hopefully producing the best results yet!
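The renaming step above can be sketched in a few lines (the `tessdata` path is a placeholder for your tessdata directory):

``` python
from pathlib import Path

# Placeholder for the tessdata directory on your system.
tessdata = Path('.')

RENAMES = {
    'Fraktur_5000000_0.584_102422.traineddata': 'Fraktur_new.traineddata',
    'frak2021-0.905.traineddata': 'frak2021.traineddata',
    'german_print_0.877_1254744_7309067.traineddata': 'german_print.traineddata',
}

def rename_models(directory, mapping):
    """Rename each downloaded model file to its short language name.

    Skips files that are not present; returns the new names applied.
    """
    renamed = []
    for old, new in mapping.items():
        src = Path(directory) / old
        if src.exists():
            src.rename(Path(directory) / new)
            renamed.append(new)
    return renamed
```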
%% Cell type:code id:cc69ad26-523f-450a-aabd-3dd30ab748dd tags:
``` python
preprocessed_paths = sorted(Path('img').glob('*preprocessed.jpg'))
for img in preprocessed_paths:
    out_path = str(img).replace('img', 'data').replace('.jpg', '_tess_new')
    os.system(f"tesseract -l Fraktur_new+frak2021+german_print --psm 3 {img} {out_path} alto txt")
```
%% Output
Estimating resolution as 359
Estimating resolution as 396
%% Cell type:code id:22f88237-22da-4fe1-8546-86ee4070f743 tags:
``` python
txtpaths = sorted(Path('data').glob('*new.txt'))
txts = [open(txt, 'r', encoding='utf-8').read() for txt in txtpaths]
txt_df = pd.DataFrame([txts], columns=txtpaths)
pd.set_option("display.max_colwidth", None)
display(HTML(txt_df.to_html().replace("\\n","<br>")))
```
%% Output
%% Cell type:code id:4d5f862a-b6c2-44e1-8449-2eb828a5f809 tags:
``` python
```