How to get glyph and convert to character? #3320

june6423 · 2024-03-28T06:36:35Z

Description of the bug

Description

I want to extract accurate characters from a PDF.
To keep things simple, let's discuss only the first line of Table 1.

First, I tried to extract words using page.get_text("words").
The return value is 'Dye lmax [nm]a ( 3 /104 M 1 cm 1) lmax [nm]b Eox [V]c E0e0 [eV]d Eox * [V]e'
(corresponding to page.get_text("words")[923:942])

The Greek letter lambda was replaced by l, and epsilon was replaced by 3.
The minus sign in the superscript is missing, and the minus sign in the subscript was replaced by e.

Then I tried extracting characters using page.get_textpage().extractRAWDICT()['blocks'].
Again, the Greek letter lambda was replaced by l, and epsilon by 3.
The minus signs were replaced by U+FFFD (�).

Next, I tried page.get_text("variant").
The Greek letter lambda was again replaced by l, and epsilon by 3.
The minus sign in the superscript was replaced by '\x01', and the minus sign in the subscript was replaced by e.

Finally, I tried page.get_texttrace().
I got the same result as in the second attempt. Instead of the character U+FFFD I got the integer 65533, and chr(65533) is U+FFFD.

It seems difficult to get the exact characters from a PDF. In particular, the code that converts to U+FFFD lives in the compiled library (the .so file) when the g_use_extra option is enabled, so I couldn't check the source code. Since even a simple character like the minus sign (-) fails, I suspect an error in the font.

According to this issue, it seems to be possible to restore characters using glyphs even if they are broken.

So I ran page.get_fonts() to get the font information.
I got 10 fonts; here is the list:

0:(479, 'cff', 'Type1', 'IHAPDB+AdvOT863180fb', 'F1', 'WinAnsiEncoding')
1:(480, 'cff', 'Type1', 'IHAPEJ+AdvOTb83ee1dd.B', 'F10', 'WinAnsiEncoding')
2:(733, 'cff', 'Type1', 'IHAPJK+AdvPS4721B4', 'F13', 'WinAnsiEncoding')
3:(734, 'cff', 'Type1', 'IHAPJL+AdvP4C4E51', 'F14', 'WinAnsiEncoding')
4:(483, 'cff', 'Type1', 'IHAPDC+AdvOTb92eb7df.I', 'F2', 'WinAnsiEncoding')
5:(484, 'cff', 'Type1', 'IHAPDD+AdvP4C4E59', 'F3', 'WinAnsiEncoding')
6:(486, 'cff', 'Type1', 'IHAPEE+AdvPS44A44B', 'F5', 'WinAnsiEncoding')
7:(487, 'cff', 'Type1', 'IHAPEF+AdvPS3F4C13', 'F6', 'WinAnsiEncoding')
8:(488, 'cff', 'Type1', 'IHAPEG+AdvOT863180fb+fb', 'F7', '')
9:(489, 'cff', 'Type1', 'IHAPEH+AdvP4C4E74', 'F8', '')

But I have no idea how to get a glyph_id from the fonts.
Googling the font names turns up nothing, and they are not among the default PDF fonts. I want to know how I can get the font information and its glyphs.

Expectation

I want to get the font information and its glyphs. Finally, I want to restore original character using the glyphs.

Environment

print(sys.version,"\n", sys.platform, "\n", fitz.__doc__)

3.10.8 (main, Nov  4 2022, 13:48:29) [GCC 11.2.0]  

linux 

PyMuPDF 1.24.0: Python bindings for the MuPDF 1.24.0 library (rebased implementation).

Python 3.10 running on linux (64-bit).

How to reproduce the bug

Here's my pdf and I am working on page 6. (with table1)
DyesandPigments2014102196_ZhuWong.pdf

PyMuPDF version

1.24.0

Operating system

Linux

Python version

3.10

JorjMcKie · 2024-03-28T08:53:29Z

JorjMcKie
Mar 28, 2024
Maintainer

This is not a bug - everything seems to work as designed.
I am going to convert this to a Discussions item.

JorjMcKie · 2024-03-28T09:24:37Z

JorjMcKie
Mar 28, 2024
Maintainer

The text line contains characters from multiple fonts in a mixture of normal, superscript and subscript text.
Not all fonts contain a complete reverse mapping from the visual appearance of a character (the "glyph") to the original Unicode code point. When the mapping is missing, PyMuPDF by default takes the glyph number and returns it as the character's Unicode code point. This glyph number may not yield a printable character either, which is the case for the minus signs you mention. When this behavior is switched off (by clearing the extraction flag bit TEXT_CID_FOR_UNKNOWN_UNICODE), the character � is returned instead.
Method page.get_texttrace() never does this sort of automatic character replacement. For that line, it delivers the following text: Dyelmax[nm]a(3/104M�1cm�1)lmax[nm]bEox[V]cE0e0[eV]dEox*[V]e. As you can see, the font "AdvP4C4E74" has no back-reference for the minus sign and thus correctly returns the "Undefined Unicode" character. Otherwise the returned text is correct.
Other phenomena are caused by the fonts themselves. For instance, font 'AdvOTb92eb7df.I' encodes character "3" into the visual appearance of an epsilon, for reasons which only the font creator can explain. The same argument applies to font 'AdvPS4721B4' encoding character "l" as a lambda.
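
Switching the fallback off, as described above, could look roughly like this. It is a minimal sketch rather than an official recipe: the helper names without_cid_fallback and words_without_fallback are made up for illustration, and it assumes a PyMuPDF version that exposes TEXT_CID_FOR_UNKNOWN_UNICODE and TEXTFLAGS_WORDS under the fitz import name used in this thread.

```python
def without_cid_fallback(flags: int, cid_flag: int) -> int:
    """Return the flag mask with the CID-fallback bit cleared."""
    return flags & ~cid_flag


def words_without_fallback(page):
    """Extract words so that unmapped glyphs come back as U+FFFD.

    Hypothetical helper: `page` is expected to be a PyMuPDF page object.
    """
    import fitz  # imported lazily; only needed when a real page is passed

    flags = without_cid_fallback(
        fitz.TEXTFLAGS_WORDS,  # default flag set for "words" extraction
        fitz.TEXT_CID_FOR_UNKNOWN_UNICODE,  # the fallback bit to clear
    )
    return page.get_text("words", flags=flags)
```

Calling words_without_fallback(page) on the page in question should then yield � wherever the font has no back-translation entry, instead of a glyph-number-derived character.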

Wrapping up:

I want to get the font information and its glyphs. Finally, I want to restore original character using the glyphs.

You have all font information. You can extract each font by its xref, take the returned font binary as the font buffer to create a fitz.Font object which you can inspect with Font methods.

As explained above, there is no such thing as an "original character" beyond what the font delivers to us.
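
If visual inspection confirms that a particular font consistently draws one character as another, a hand-maintained correction table applied after extraction is one possible workaround. The sketch below is only that: FONT_FIXES and fix_span_text are hypothetical names, the two entries simply mirror the examples named above and have not been verified against the whole document, and nothing here can be derived automatically from the font.

```python
# Hypothetical, hand-curated corrections: font name -> {extracted char: intended char}.
# Each entry is an assumption that must be confirmed by looking at the rendered page.
FONT_FIXES = {
    "AdvOTb92eb7df.I": {"3": "\u03b5"},  # "3" is drawn as an epsilon
    "AdvPS4721B4": {"l": "\u03bb"},      # "l" is drawn as a lambda
}


def fix_span_text(font_name, text, fixes=FONT_FIXES):
    """Apply the per-font character corrections to the text of one span."""
    table = fixes.get(font_name)
    if not table:
        return text  # unknown font: leave the text untouched
    return text.translate(str.maketrans(table))
```

The font name and text would come from the spans of a dict or rawdict extraction. A blanket replacement like this is only safe if the font never uses the same code for the ordinary character as well.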

june6423 · 2024-04-01T01:20:43Z

june6423
Apr 1, 2024
Author

Thank you for your swift and comprehensive response.
Based on your guidance, I attempted to analyze the glyphs within the fonts of a PDF.
However, I encountered an issue where no glyphs were extracted.
This leads me to question whether the fonts in my PDF actually contain any glyphs.

Here's the code snippet I used:

font_xref = [x[0] for x in page.get_fonts()]
for xref in font_xref:
    name, ext, _, content = doc.extract_font(xref)
    target_font = fitz.Font(fontbuffer=content)
    vuc = target_font.valid_codepoints()
    print(target_font.name, target_font.glyph_count, len(vuc))

Its output:

AdvOT863180fb Regular 80 0
AdvOTb83ee1dd.B Regular 44 0
AdvPS4721B4 Regular 3 0
AdvP4C4E51 Regular 3 0
AdvOTb92eb7df.I Regular 63 0
AdvP4C4E59 Regular 2 0
AdvPS44A44B Regular 2 0
AdvPS3F4C13 Regular 5 0
AdvOT863180fb+fb Regular 3 0
AdvP4C4E74 Regular 6 0

Despite this approach, no glyphs were found. I have several questions regarding this issue:

1. Glyph Extraction: Is my method for extracting glyphs flawed, or do the fonts in my document genuinely lack glyphs?
2. Font Substitution: To circumvent issues with unrecognized characters (resulting in U+FFFD), is it possible to force the substitution of all document fonts with a well-known font, such as Helvetica?
3. Font Names: I was unable to find any information on fonts named 'AdvOTb92eb7df.I' and 'AdvPS4721B4' online. Are the font names retrieved by PyMuPDF identical to their actual, commonly recognized names?
4. Extracting Epsilon: Ultimately, I aim to extract the character epsilon based on its visual representation within the document. Is there an alternative method to achieve this without resorting to OCR?

Thank you for reviewing my questions. Any insights or suggestions you could provide would be greatly appreciated.

JorjMcKie Apr 1, 2024
Maintainer

No, you are on the wrong track:
If a font writes visible things then it always has glyphs. A glyph is a little program containing draw instructions. Roughly speaking: to show a capital "D", the corresponding glyph program would contain commands drawing a vertical line "|" followed by a left-open semicircle.
If the programmer requests to write a "D", then these two draw commands are executed. There is however no rule or law for this: if the font creator decided to draw a "Z" in that glyph program (nothing looking like a "D" at all), then there is no way to find out.
For example, the awkward glyph looking like a chemical double bond ("==") is drawn if the programmer writes a "Q" using a certain font.

When text is extracted later, success depends on the presence of a back-translation table mapping "glyph program ==> Unicode number". If this table is missing or incomplete, the character "�" is returned.
For the last few versions, PyMuPDF/MuPDF has contained logic that returns the number of the glyph program whenever it encounters this situation: often (not always!) this delivers a usable result.
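
The behavior described above can be pictured with a small toy model. This is an illustration only, not MuPDF's actual implementation: decode_glyphs and its arguments are invented names, and real fonts store the back-translation table in font-specific structures.

```python
REPLACEMENT = "\ufffd"  # the "�" character


def decode_glyphs(glyph_numbers, to_unicode, cid_fallback=True):
    """Toy model: translate glyph numbers back to text.

    to_unicode   -- the (possibly incomplete) back-translation table
    cid_fallback -- if True, use the glyph number itself as the code point
    """
    out = []
    for number in glyph_numbers:
        if number in to_unicode:
            out.append(to_unicode[number])  # the table knows this glyph
        elif cid_fallback:
            out.append(chr(number))  # fallback: may be unprintable
        else:
            out.append(REPLACEMENT)  # no table entry, no fallback
    return "".join(out)
```

With a table that lacks an entry for glyph number 1, the fallback yields the unprintable '\x01', which would be consistent with the '\x01' reported for the minus sign earlier in this thread. With the fallback off, the same glyph comes back as �.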

june6423 Apr 1, 2024
Author

Hmm... I understand.
If we can see a rendered character, the font always has glyphs, but there is no guarantee that the font has the inverse mapping.
That is why I got U+FFFD, and the font is the problem.

I have found this to be a very difficult task and am considering partial OCR.
I have just two more questions for you.

1. If I replace all the fonts on a single page with well-known fonts, will this solve the problem of the missing back-translation mapping? (Maybe it will not work because the glyphs also change.)
2. If it is difficult to restore characters that are invalid (where the rendered character and the Unicode value do not match), is it possible to determine which characters are invalid? (I am considering checking for U+FFFD or checking suspect fonts.)

Thank you for your detailed response.

JorjMcKie Apr 1, 2024
Maintainer

Point 1 above will not work because you do not know and cannot know what to take instead of �. You also have things like "RQH" instead of "R==H" on that other file - and you will never know that this is nonsense.

This script does a partial OCR for text pieces containing �.
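
Locating the text pieces that contain � (the candidates for partial OCR) might be sketched as follows. This is not the script referred to above; find_suspect_spans and is_suspect are hypothetical helpers, and the input is assumed to have the blocks / lines / spans / chars layout of a rawdict extraction.

```python
REPLACEMENT = "\ufffd"  # the "�" character


def is_suspect(ch):
    """True for U+FFFD and for control characters other than tab/newline."""
    return ch == REPLACEMENT or (ord(ch) < 32 and ch not in "\t\n\r")


def find_suspect_spans(rawdict):
    """Return (font, bbox, text) for every span holding a suspect character."""
    hits = []
    for block in rawdict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line.get("spans", []):
                text = "".join(c["c"] for c in span.get("chars", []))
                if any(is_suspect(ch) for ch in text):
                    hits.append((span.get("font"), tuple(span.get("bbox", ())), text))
    return hits
```

A typical call would be find_suspect_spans(page.get_text("rawdict")), and each returned bbox could serve as the clip for an OCR pass. Note that this only catches � and unprintable characters. Wrong but printable results such as "RQH" stay invisible to it.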

JorjMcKie Apr 1, 2024
Maintainer

Here is a quick test run for OCR-ing this:

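import fitz

doc = fitz.open("test.pdf")
page = doc[31]
# Start with a clip of half the page width and height.
clip = page.rect / 2
# Move the top of the clip to just below the "hepg-2" hit.
rect = page.search_for("hepg-2", clip=clip)[0]
clip.y0 = rect.y1 + 5
# End the clip just above the "Table 2" caption.
rect = page.search_for("Table 2", clip=clip)[0]
clip.y1 = rect.y0
clip.x0 = 36
# Render the clip at 300 dpi and OCR it into a 1-page PDF (needs Tesseract).
pix = page.get_pixmap(clip=clip, dpi=300)
ocr = fitz.open("pdf", pix.pdfocr_tobytes())
print(ocr[0].get_text())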

R=—H
58.98 + 0.89
71.55 + 2.91
R—Me
49.60 + 2.03
63.48 + 2.11
R—OMe
49.65 + 2.08
62.41 + 2.23
R=Cl
44.71 + 1.92
43.81 + 1.83

As you see, even a top resolution of 300 dpi will not deliver what you hope to get!

Answer selected by june6423

serhii-brovarnyk Jul 25, 2024

Hello @JorjMcKie !
Can you give some advice on how to get glyph IDs of the undefined symbols in PyMuPDF?
I have encountered multiple undefined symbols during SVG parsing via get_svg_image(text_as_path=False) method.
So I want to insert them into another PDF document with exactly the same font but cannot due to unrecognizable characters.

Thank you for any help in advance!

JorjMcKie Jul 25, 2024
Maintainer

This is a no-can-do!
We (= nobody!) cannot know which original Unicode value has resulted in selecting a certain glyph.
The standard text extraction flags contain the option to deliver the glyph id whenever the backtranslation information "glyph-to-unicode" is missing.
This is the name of that flag bit: pymupdf.TEXT_CID_FOR_UNKNOWN_UNICODE.

JorjMcKie Jul 25, 2024
Maintainer

This "CID" is a fallback! There is no guarantee that it is correct or even anywhere close.

serhii-brovarnyk Jul 25, 2024

Thank you very much for the answer! I will try it.
