° display as � + some other scrambled chars #869

Coming back to this after quite some time.
I have been experimenting with using Tesseract OCR together with PyMuPDF and tried it on your document again. Remember that all the °C in it were incorrectly encoded?
Well, here is a script that extracts the text, detects whether a line contains characters that could not be interpreted and, if so, invokes OCR to make that line readable again.
It requires that Tesseract OCR is installed and can be invoked via Python's subprocess module.
The material is attached. Because of the OCR invocations (about 80 across all pages), the total run time on my machine is about 30 seconds.
Maybe it helps.
issue-869.zip
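
For readers who do not want to open the archive, the approach described above might look roughly like the sketch below. It is not the attached script, only a minimal illustration under a few assumptions: that uninterpretable characters show up in the extracted text as U+FFFD (the replacement character), that a reasonably recent PyMuPDF is installed (one where `Page.get_pixmap` accepts `dpi`), and that the `tesseract` executable is on the PATH. The function names `needs_ocr`, `ocr_clip` and `extract_lines` are made up for this example.

```python
import subprocess

REPLACEMENT = "\ufffd"  # U+FFFD, assumed marker for characters that failed to decode


def needs_ocr(text: str) -> bool:
    """Return True if the extracted text contains the replacement character."""
    return REPLACEMENT in text


def ocr_clip(page, bbox, lang: str = "eng") -> str:
    """Render only the area `bbox` of a PyMuPDF page and OCR it with Tesseract."""
    # A high resolution helps Tesseract with small glyphs such as the degree sign.
    pix = page.get_pixmap(dpi=300, clip=bbox)
    result = subprocess.run(
        # "stdin"/"stdout" make Tesseract read the image from standard input
        # and write the recognized text to standard output;
        # --psm 7 treats the image as a single line of text.
        ["tesseract", "stdin", "stdout", "--psm", "7", "-l", lang],
        input=pix.tobytes("png"),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        check=True,
    )
    return result.stdout.decode("utf-8").strip()


def extract_lines(path: str) -> list:
    """Extract all text lines of a PDF, re-reading damaged lines via OCR."""
    import fitz  # PyMuPDF; imported here so needs_ocr() works without it

    lines = []
    with fitz.open(path) as doc:
        for page in doc:
            for block in page.get_text("dict")["blocks"]:
                # Image blocks carry no "lines" key, so they are skipped.
                for line in block.get("lines", []):
                    text = "".join(span["text"] for span in line["spans"])
                    if needs_ocr(text):
                        text = ocr_clip(page, line["bbox"])
                    lines.append(text)
    return lines
```

Calling `extract_lines("input.pdf")` (file name hypothetical) would return the page text line by line, with OCR output substituted only where the normal extraction produced U+FFFD. Restricting OCR to the damaged lines is what keeps the number of Tesseract invocations, and therefore the run time, small.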

@gbrault
Answer selected by JorjMcKie