Use page.get_text("json", flags=2) to extract the text, and the bbox in the extraction result has a negative number #1104

zyc130130 · 2021-06-23T03:10:07Z


Describe the bug (mandatory)

When I extract text with page.get_text("json", flags=2), some of the bbox values in the result are negative.

To Reproduce (mandatory)

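    import fitz  # PyMuPDF

    # file_path and page_num (1-based) are defined elsewhere
    with open(file_path, "rb") as fr:
        pdf_bytes = fr.read()  # raw PDF bytes, not a file name
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    page = doc[page_num - 1]
    d2 = page.get_text("json", flags=2)  # JSON string with page size and blocks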

Your configuration (mandatory)

3.7.6 (default, Jan 8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)]
win32

PyMuPDF 1.18.14: Python bindings for the MuPDF 1.18.0 library.
Built for Python 3.7 on win32 (64-bit).

Additional context (optional)

I think the problem is caused by the page margins. After converting the page to an image, I found that the image shows more margin around the content than the PDF page does.
Original PDF page:

convert2img:

zyc130130 · 2021-06-23T03:15:03Z

Author

I want to get the correct height and width of the PDF page (including the page margins, like the size of the converted page), and I want the bbox values to be positive.
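One way to check how far the extracted text actually extends beyond the reported page size is to compare the union of all block bboxes with the page width and height. The following is only a minimal sketch: it assumes the JSON string has top-level "width", "height", and "blocks" keys with a "bbox" per block, and the helper name content_extent is hypothetical, not part of PyMuPDF.

```python
import json


def content_extent(text_json):
    """Return ((width, height), extent) for a get_text("json") string.

    extent is the union (x0, y0, x1, y1) of all block bboxes,
    or None if the page has no blocks.
    Hypothetical helper for illustration only.
    """
    page = json.loads(text_json)
    size = (page["width"], page["height"])
    boxes = [block["bbox"] for block in page["blocks"]]
    if not boxes:
        return size, None
    extent = (
        min(b[0] for b in boxes),  # leftmost x0
        min(b[1] for b in boxes),  # topmost y0
        max(b[2] for b in boxes),  # rightmost x1
        max(b[3] for b in boxes),  # bottommost y1
    )
    return size, extent
```

Called as content_extent(d2), a negative x0 or y0 in the extent, or an x1 or y1 larger than the page size, would show that some text lies outside the rectangle reported for the page.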

JorjMcKie · 2021-06-23T05:51:12Z

Maintainer

Negative coordinates are not necessarily a bug. They can occur, and when they do, the PDF's creator is responsible.
By the way: your width and height are both positive, so this is not the problem.
If you only want text with positive coordinates, specify a clip rectangle when extracting: page.get_text(..., clip=rect).
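The direct form of this suggestion would be something like page.get_text("json", flags=2, clip=page.rect), assuming the page rectangle is the area of interest. A similar effect can be approximated after extraction. The sketch below keeps only the blocks whose bbox lies fully inside (0, 0, width, height). It assumes the "width", "height", and "blocks" keys of the JSON output, and the helper name blocks_inside_page is made up for this example. Note that it filters whole blocks, so it is not an exact replacement for clip.

```python
import json


def blocks_inside_page(text_json):
    """Keep blocks of a get_text("json") string that lie fully on the page.

    Hypothetical helper: a block is kept only if its whole bbox
    falls inside the rectangle (0, 0, width, height).
    """
    page = json.loads(text_json)
    width, height = page["width"], page["height"]
    kept = []
    for block in page["blocks"]:
        x0, y0, x1, y1 = block["bbox"]
        if x0 >= 0 and y0 >= 0 and x1 <= width and y1 <= height:
            kept.append(block)
    return kept
```

With the d2 string from the report, blocks_inside_page(d2) would return a list of block dictionaries that have no negative coordinates.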
