# NOTES ON OPTICAL PRINTER TECHNIQUE

A reproduction of the guide written by Dennis Couzin.
It loses some of the charm of the photocopied original floating around the internet, but this reproduction exists for the sake of readability and searchability.

Tesseract does most of the heavy lifting, producing a roughly 85% accurate transcription that needs minor spelling corrections and slightly more effort to format into Markdown for rendering.
Pre-processing with OpenCV and tuning Tesseract for the typewritten font may produce even better text.

Alternate spellings that were not introduced by the OCR process are preserved.
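
The rough 85% figure above is an estimate. One way to put a number on it, sketched here with Python's standard `difflib` module, is to compare the raw OCR output against the hand-corrected text word by word. The function name `transcription_accuracy` is hypothetical and not part of this repository, and a word-level similarity ratio is only one of several reasonable definitions of accuracy. In practice the two arguments would be the contents of the raw Tesseract output and of the corrected Markdown file; the sample strings below are made up for illustration.

```python
import difflib


def transcription_accuracy(ocr_text: str, corrected_text: str) -> float:
    """Return a 0..1 similarity ratio between raw OCR output and corrected text.

    Hypothetical helper: compares word sequences, so differences in
    whitespace and line wrapping are ignored.
    """
    ocr_words = ocr_text.split()
    corrected_words = corrected_text.split()
    # autojunk=False disables difflib's heuristic that treats very frequent
    # items as junk, which could skew the ratio on long texts.
    matcher = difflib.SequenceMatcher(None, ocr_words, corrected_words, autojunk=False)
    return matcher.ratio()


if __name__ == "__main__":
    # Made-up sample strings; real input would be read from the OCR output
    # file and the corrected Markdown file.
    raw = "the optica1 printer"
    fixed = "the optical printer"
    print(f"word-level similarity: {transcription_accuracy(raw, fixed):.2f}")
```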

### PDF + HTML Dependencies

* pandoc
* wkhtmltopdf

```bash
bash compile.sh
```

### Text extraction dependencies

* Python 3.7
* OpenCV 2
* Tesseract
* PIL
* PyMuPDF

```bash
cd extract
python3 pdf.py > ../ocr/pdf_output.txt
python3 ocr.py > ../ocr/tesseract_output.txt
```
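
The pre-processing idea mentioned in the introduction could look roughly like the sketch below: grayscale, light denoise, Otsu binarization, then Tesseract with an explicit page segmentation mode. This is not the repository's `ocr.py`. It assumes the `pytesseract` wrapper is installed (the dependency list names only Tesseract itself), and the function names, the median-blur kernel size, and the `--oem 1` / `--psm 6` settings are illustrative starting points rather than tuned values.

```python
import sys


def tesseract_config(psm: int = 6, oem: int = 1) -> str:
    """Build a Tesseract option string (hypothetical helper).

    --psm 6 assumes a single uniform block of text;
    --oem 1 selects the LSTM engine.
    """
    return f"--oem {oem} --psm {psm}"


def preprocess(image_path: str):
    """Load a scanned page and return a binarized image for OCR."""
    import cv2  # imported lazily so tesseract_config works without OpenCV

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)
    # A small median blur removes photocopy speckle before thresholding.
    denoised = cv2.medianBlur(gray, 3)
    # Otsu's method picks the black/white threshold automatically.
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary


def ocr_page(image_path: str, psm: int = 6) -> str:
    """Run Tesseract on the pre-processed page and return the text."""
    import pytesseract  # assumption: the pytesseract wrapper is installed

    return pytesseract.image_to_string(
        preprocess(image_path), lang="eng", config=tesseract_config(psm)
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    # The image path comes from the command line; no file name is assumed.
    print(ocr_page(sys.argv[1]))
```

It can be run as `python3 preprocess_sketch.py page.png`, where both file names are hypothetical. For typewritten pages, `--psm 4` and `--psm 6` are worth comparing, since results vary with scan quality.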