Description of the bug

Description: I want to extract the exact characters from a PDF. First, I tried to extract words: the Greek letter lambda was replaced by l, and epsilon was replaced by 3. Then I tried to extract individual characters, followed by two further approaches. It seems difficult to get the exact characters from a PDF. In particular, when the g_use_extra option is enabled, the code that converts characters to U+FFFD lives in the compiled executable (the .so file), so I couldn't check the source code. Since even simple characters like the minus sign (-) fail, I suspect an error in the font. According to this issue, it seems possible to restore characters from their glyphs even if they are broken. I tried that, but I have no idea how to get the glyph_id from the fonts.

Expectation: I want to get the font information and its glyphs, and ultimately restore the original characters from those glyphs.

How to reproduce the bug

Here is my PDF. I am working on page 6 (the one with Table 1).

PyMuPDF version: 1.24.0
Operating system: Linux
Python version: 3.10
-
This is not a bug: everything seems to work as designed.
-
Wrapping up:
You have all the font information. You can extract each font by its xref and use the returned font binary as the font buffer to create a Font object. As explained above, there is no such thing as an "original character" beyond what the font delivers to us.
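The font-by-xref approach described above can be sketched roughly as follows. This is a minimal illustration rather than a definitive recipe: the helper names `page_fonts` and `missing_glyphs` are made up for this example, and it assumes the fonts on the page are embedded and in a format that `fitz.Font` accepts (creating the Font object can fail for some font types).

```python
from typing import Iterator


def page_fonts(pdf_path: str, page_number: int) -> Iterator[tuple[str, object]]:
    """Yield (font name, Font object) for each embedded font on a page.

    page_number is 0-based, as everywhere in PyMuPDF.
    """
    import fitz  # PyMuPDF; imported lazily so missing_glyphs() works without it

    doc = fitz.open(pdf_path)
    for item in doc.get_page_fonts(page_number, full=True):
        xref = item[0]  # first entry of each font tuple is the font's xref
        basename, _ext, _type, buffer = doc.extract_font(xref)
        if not buffer:
            # Font is not embedded (or not extractable): nothing to load.
            continue
        yield basename, fitz.Font(fontbuffer=buffer)


def missing_glyphs(font, text: str) -> list[str]:
    """Return the characters of `text` for which `font` has no glyph.

    `font` only needs a has_glyph(codepoint) method returning 0 when the
    glyph is absent, which is how PyMuPDF's Font.has_glyph behaves.
    """
    return [ch for ch in text if font.has_glyph(ord(ch)) == 0]
```

For page 6 of a file saved under the hypothetical name `paper.pdf`, one could loop over `page_fonts("paper.pdf", 5)` and call `missing_glyphs(font, "λε-")` on each font to see which of the problematic characters the font can actually represent.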
-
Thank you for your swift and comprehensive response. Here's the code snippet I used:
Its output:
Despite this approach, no glyphs were found. I have several questions regarding this issue:
Thank you for reviewing my questions. Any insights or suggestions you could provide would be greatly appreciated.
Here is a quick test run for OCR-ing this:
As you see, even a top resolution of 300 dpi will not deliver what you hope to get!