Discard cropped / hidden content ("sanitize") or extract cropped images instead of raw images #1312

quanvinh · 2021-10-05T03:15:19Z

My PDF has tons of cropped images and AFAIK PyMuPDF only allows me to extract raw (uncropped) ones.

I was wondering if any of the following is possible?

a. Discard the hidden / cropped part of all images (similar to "Redact" -> "Sanitize" in Acrobat, without rasterizing) prior to extracting.
b. Obtain the cropbox of each image so I can crop the extracted raw images using another library.
c. (Preferably) Ignore cropped data during extraction (i.e. extract just the cropped images instead of the raw ones).

JorjMcKie (Maintainer) · 2021-10-05T08:45:37Z

You need to know which parts of the image are covered by other content. This is a sequence of rectangles that have a non-empty intersection with the image bbox on the page.
Then use the image's transformation matrix to compute those intersections in original image coordinates. For each intersection, blank out the corresponding rectangle in the image pixmap (i.e. set it to white).
Method page.get_image_bbox(item, transform=True) returns that transformation matrix alongside the image bbox.
See the PyMuPDF documentation for details on the image transformation matrix.

Let pix be the pixmap of the extracted image. Then we define matrix expand = fitz.Matrix(pix.width, 0, 0, pix.height, 0, 0).
Let intersect be an intersection area on the page in page coordinates.
Then xrect = (intersect * ~transform * expand).irect is an area in the pixmap we need to blank out.
Depending on the pixmap colorspace (GRAY, RGB, CMYK, with or without alpha), we use the correct pixel value for "white" and execute pix.set_rect(xrect, white). For example, for an RGB image without alpha you would use white = (255, 255, 255).
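
To make the arithmetic behind xrect explicit, here is a minimal plain-Python sketch of the same mapping, written without PyMuPDF. The helper names (invert, apply, page_rect_to_pixel_rect), the eps tolerance and the sample numbers are illustrative assumptions, not part of the library. It assumes that transform maps the unit square onto the image bbox and that matrices follow the PDF row-vector convention (a, b, c, d, e, f).

```python
import math

def invert(m):
    """Invert a PDF-style matrix given as (a, b, c, d, e, f)."""
    a, b, c, d, e, f = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is not invertible")
    return (d / det, -b / det, -c / det, a / det,
            (c * f - d * e) / det, (b * e - a * f) / det)

def apply(m, x, y):
    """Map a point using the row-vector convention: (x, y, 1) * m."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)

def page_rect_to_pixel_rect(rect, transform, width, height, eps=1e-6):
    """Map a page-coordinate rectangle into integer pixel coordinates.

    Mirrors (rect * ~transform * expand).irect: undo the image transform
    to reach the unit square, then scale by the pixmap size. Rounding
    outward with a small tolerance is an assumption of this sketch.
    """
    x0, y0, x1, y1 = rect
    inv = invert(transform)
    corners = [apply(inv, x, y)
               for x, y in ((x0, y0), (x1, y0), (x0, y1), (x1, y1))]
    xs = [u * width for u, _ in corners]
    ys = [v * height for _, v in corners]
    return (math.floor(min(xs) + eps), math.floor(min(ys) + eps),
            math.ceil(max(xs) - eps), math.ceil(max(ys) - eps))

# Hypothetical example: a 100 x 50 pixel image shown unrotated in the
# page rectangle (100, 200, 300, 300).
transform = (200, 0, 0, 100, 100, 200)
intersect = (150, 200, 200, 250)   # covered area in page coordinates
print(page_rect_to_pixel_rect(intersect, transform, 100, 50))
# -> (25, 0, 50, 25)
```

Mapping all four corners and taking their bounding box keeps the result sensible for rotated images as well, where the page rectangle's top-left corner does not map to the pixel rectangle's top-left corner.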

JorjMcKie (Maintainer) · 2021-10-05T08:48:10Z

Also note that there is Pixmap.copy(sourcepix, irect), which lets you make sub-pixmaps of a given source pixmap containing only the irect part.
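
As a library-independent illustration of what extracting such a sub-rectangle amounts to, the following plain-Python sketch crops a raw sample buffer. It is not the PyMuPDF implementation: crop_samples is a hypothetical helper, and the sketch assumes row-major samples with n bytes per pixel and no row padding (stride equal to width * n).

```python
def crop_samples(samples, width, height, n, irect):
    """Return (cropped_samples, new_width, new_height) for an integer rect.

    samples: bytes of a row-major image with n bytes per pixel.
    irect:   (x0, y0, x1, y1) in pixel coordinates, x1/y1 exclusive.
    """
    x0, y0, x1, y1 = irect
    # Clamp the rectangle to the image area.
    x0, y0 = max(x0, 0), max(y0, 0)
    x1, y1 = min(x1, width), min(y1, height)
    if x0 >= x1 or y0 >= y1:
        return b"", 0, 0
    stride = width * n  # assumption: rows are not padded
    rows = [samples[y * stride + x0 * n : y * stride + x1 * n]
            for y in range(y0, y1)]
    return b"".join(rows), x1 - x0, y1 - y0

# Hypothetical 4 x 3 grayscale image whose pixel values are 0..11.
gray = bytes(range(12))
print(crop_samples(gray, 4, 3, 1, (1, 1, 3, 3)))
# -> (b'\x05\x06\t\n', 2, 2)
```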

JorjMcKie (Maintainer) · 2021-10-15T04:30:37Z

With the coming v1.19.0, you can detect things like:

- whether text is covered / hidden by a drawing or an image
- whether an image is "above" or "below" another object on the page, such as an image, drawing or text
- etc.

For this to work, you must match the images (or other objects) in question against a new "bboxlog". This is a list of rectangles in the same sequence as they are used to build the page appearance. So an image rectangle with a higher index in that list will cover (parts of) every object that appears earlier and has an intersecting rectangle.
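
The ordering rule can be sketched in plain Python. This is only an illustration under assumptions: the log is taken to be a list of (type, rect) tuples in drawing order, the helper names intersection and covered_by are hypothetical, and the sample entries are made up. Because the log holds bounding boxes only, an overlap found this way indicates possible coverage, not certain coverage.

```python
def intersection(r1, r2):
    """Return the overlap of two (x0, y0, x1, y1) rects, or None if empty."""
    x0, y0 = max(r1[0], r2[0]), max(r1[1], r2[1])
    x1, y1 = min(r1[2], r2[2]), min(r1[3], r2[3])
    if x0 >= x1 or y0 >= y1:
        return None
    return (x0, y0, x1, y1)

def covered_by(bboxlog, index):
    """List later log entries whose rectangle overlaps entry `index`.

    Entries with a higher index are drawn later, so they may hide the
    overlapping part of the entry at `index`.
    """
    _, target = bboxlog[index]
    hits = []
    for j in range(index + 1, len(bboxlog)):
        kind, rect = bboxlog[j]
        overlap = intersection(target, rect)
        if overlap is not None:
            hits.append((j, kind, overlap))
    return hits

# Made-up log: a page background, an image, a drawing partly on top of
# the image, and a line of text elsewhere on the page.
log = [
    ("fill-path", (0, 0, 595, 842)),
    ("fill-image", (100, 100, 300, 300)),
    ("fill-path", (250, 250, 400, 400)),
    ("fill-text", (50, 500, 200, 520)),
]
print(covered_by(log, 1))
# -> [(2, 'fill-path', (250, 250, 300, 300))]
```

The overlap rectangles returned here are in page coordinates, so they would still need to be mapped into image coordinates before blanking anything in a pixmap.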

abe-mxff · 2023-12-21T15:41:20Z

Hi, I'm not able to detect cropped images using the get_bboxlog() method (fitz version 1.23.7).

I generated the attached PDF with two cropped images (one rotated 90°), but the extraction gives me the bounding boxes of the uncropped images:

Image 0 - bbox: Rect(266.25, 157.2283935546875, 328.5, 608.3989868164062)
Image 1 - bbox: Rect(73.5, 73.5, 568.5, 142.5)
1 - Type: 'fill-path', width=595.5 height=842.25 (raw = (0.0, 0.0, 595.5, 842.25))
2 - Type: 'fill-image', width=62.25 height=451.17059326171875 (raw = (266.25, 157.2283935546875, 328.5, 608.3989868164062))
3 - Type: 'fill-image', width=495.0 height=69.0 (raw = (73.5, 73.5, 568.5, 142.5))

Below are the rendered PDF page and the script used to reproduce the result. What am I doing wrong?

import fitz

fn_in = "test_page.pdf"

# Open the document directly by path; no separate file handle is needed
doc = fitz.open(fn_in)

page = doc.load_page(0)

# Extract images
imgs = []
for i, img in enumerate(page.get_image_info(xrefs=True)):
    xref = img["xref"]
    img["bbox"] = fitz.Rect(img["bbox"])
    print(f"Image {i} - bbox: {img['bbox']}")
    img["transform"] = fitz.Matrix(img["transform"])
    imgs.append(img)

# Get the bboxlog (renamed loop variable to avoid shadowing the builtin `type`)
for i, (kind, raw) in enumerate(page.get_bboxlog()):
    rect = fitz.Rect(raw)
    print(f"{i+1} - Type: '{kind}', width={rect.width} height={rect.height} (raw = {raw})")

# There are three elements
# 1) A rectangle occupying the full page (I don't know why it is there)
# 2) The first image
# 3) The second image (rotation correctly detected)
# PROBLEM: neither image's bbox reflects the crop

# Here images are correctly cropped
# page.get_pixmap().save('rendered_page.png')
