# NOTES ON OPTICAL PRINTER TECHNIQUE
Reproduction of the guide written by Dennis Couzin.
2022-07-26 20:53:44 +00:00
This reproduction loses some of the charm of the photocopied original floating around the internet, but it exists for the sake of readability and searchability of the text.
Tesseract does the majority of the heavy lifting, producing a transcription that is about 85% correct; the rest needed minor spelling corrections and slightly more effort to format as Markdown for rendering.
Pre-processing with OpenCV and tuning Tesseract for the typewritten font may produce even better text.
Alternate spellings in the original, as opposed to errors introduced by the OCR process, are preserved.
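
As a rough illustration of the kind of pre-processing and tuning meant here, the sketch below binarizes a grayscale page with Otsu's method and builds a Tesseract command line. It is not taken from this repository's scripts: the threshold is written in plain Python as a stand-in for what an OpenCV Otsu threshold would do, the function names are hypothetical, and page segmentation mode 6 (a single uniform block of text) is only an assumption about what might suit typewritten pages.

```python
def otsu_threshold(pixels):
    """Return an Otsu threshold for an iterable of 8-bit gray values (0-255)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of their gray values
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor); maximize it.
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(pixels, threshold):
    """Map gray values to pure black (0) or white (255)."""
    return [255 if p > threshold else 0 for p in pixels]


def tesseract_command(image_path, psm=6, lang="eng"):
    """Build an argument list that prints recognized text to stdout.

    psm=6 and lang="eng" are assumptions, not settings from this repository.
    """
    return ["tesseract", image_path, "stdout", "--psm", str(psm), "-l", lang]


# Hypothetical usage: dark typewriter ink (about 30) on light paper (about 200).
page = [30] * 50 + [200] * 50
clean = binarize(page, otsu_threshold(page))
cmd = tesseract_command("page_001.png")  # hypothetical file name
```

The argument list can be passed to `subprocess.run` once the binarized image has been written to disk; comparing the output across a few `--psm` values on a sample page is a cheap way to see whether the tuning helps.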
### PDF + HTML Dependencies
* pandoc
* wkhtmltopdf
```bash
bash compile.sh
```
### Text extraction dependencies
* Python 3.7
* OpenCV 2
* Tesseract
* PIL
* PyMuPDF
```bash
cd extract
python3 pdf.py > ../ocr/pdf_output.txt
python3 ocr.py > ../ocr/tesseract_output.txt
```
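
Once the extraction output exists, one way to put a number on transcription quality is to compare the raw OCR text against a hand-corrected version. The sketch below uses word-level similarity from Python's standard `difflib`. It is only an assumption about how such a figure could be estimated, not the method behind the percentage quoted above, and `transcription_ratio` is a hypothetical helper name.

```python
import difflib


def transcription_ratio(ocr_text, corrected_text):
    """Return a word-level similarity between 0.0 and 1.0."""
    ocr_words = ocr_text.split()
    ref_words = corrected_text.split()
    matcher = difflib.SequenceMatcher(None, ocr_words, ref_words, autojunk=False)
    return matcher.ratio()


# Hypothetical sample strings; in practice, read ../ocr/tesseract_output.txt
# and the corrected Markdown text instead.
raw = "the optlcal printer is a device"
fixed = "the optical printer is a device"
score = transcription_ratio(raw, fixed)
```

Comparing words rather than characters keeps the score easy to interpret as "share of words that came through intact", at the cost of counting a one-letter OCR slip as a full miss.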