This repository has been archived by the owner on Jul 8, 2022. It is now read-only.
Mendicant University session 9 personal project

gjp/pdf-barber

PDF Barber

A command-line PDF cropping tool aimed specifically at adapting documents formatted for print to the different requirements of e-readers.

Submitted as a Mendicant University session 9 personal project

Books which are formatted for print often contain large margins which waste a lot of the limited screen real estate on non-print devices, especially e-readers. Because PDF is a final output format, designed to look nearly identical on any display device, it does not contain the semantic data necessary to tell a reader where those margins are. We have to cheat to find them.

The goal of this project is to identify a bounding box for a given PDF which contains most, but not all, of the "ink" on a range of pages within that PDF, and to create a new PDF with the CropBox adjusted to contain only those interesting bits. Page numbers, headers, and rare footnotes or marginal notes should be trimmed in order to maximize the size of the body text.

How does this work?

Barber renders a range of pages as very low resolution raster images and then composes them into a single image, somewhat like running the same piece of paper through a printer many times. For documents with obvious margins, this should produce a large black rectangle in the center of the page.
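The compositing step can be pictured with a minimal pure-Ruby sketch. It is an illustration only: Barber itself relies on GhostScript and ImageMagick for rendering and image processing, and the `compose_pages` helper and the tiny hand-written bitmaps below are hypothetical stand-ins for the rendered pages.

```ruby
# Each page is a hypothetical low-resolution bitmap: an array of rows,
# where 1 marks an "inked" pixel and 0 marks blank paper.
def compose_pages(pages)
  height = pages.first.length
  width  = pages.first.first.length
  composite = Array.new(height) { Array.new(width, 0) }

  pages.each do |page|
    page.each_with_index do |row, y|
      row.each_with_index do |pixel, x|
        # Ink accumulates: a pixel is dark if any page darkened it.
        composite[y][x] |= pixel
      end
    end
  end
  composite
end

page_a = [
  [0, 0, 0, 0, 0],
  [0, 1, 1, 0, 0],
  [0, 1, 0, 0, 0],
  [0, 0, 0, 0, 0]
]
page_b = [
  [0, 0, 0, 0, 0],
  [0, 0, 1, 1, 0],
  [0, 0, 1, 1, 0],
  [0, 0, 0, 0, 0]
]

# Print the composite with '#' for ink and '.' for blank paper.
compose_pages([page_a, page_b]).each do |row|
  puts row.map { |pixel| pixel == 1 ? '#' : '.' }.join
end
```

Even with two sparse pages, the overlaid ink already starts to fill in a rectangle; with ten or so real pages the body text area becomes a solid block.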

The composed image is then flood-filled from the center, and the pixels the fill does not reach are removed; this is a crude form of blob detection. The size and position of the remaining image are then compared to the original, and the resulting size adjustment and offset are scaled from the low-resolution raster to the dimensions of the original document. Finally, a new PDF is written with the CropBox set to the new values.
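As a rough sketch of the flood fill and scaling steps, the pure-Ruby example below works on a small bitmap. It is not Barber's code, which delegates the image work to ImageMagick. The helper names `center_blob_bbox` and `scale_to_pdf` are hypothetical, the fill is assumed to be 4-connected, and the y-axis flip assumes raster rows count down from the top while PDF coordinates count up from the bottom.

```ruby
# Flood fill from the center over inked pixels (4-connected) and return the
# bounding box [left, top, right, bottom] of the filled region in pixels.
# right and bottom are exclusive. Returns nil if the center pixel is blank.
def center_blob_bbox(bitmap)
  height = bitmap.length
  width  = bitmap.first.length
  start  = [width / 2, height / 2]
  return nil unless bitmap[start[1]][start[0]] == 1

  seen  = { start => true }
  queue = [start]
  until queue.empty?
    x, y = queue.shift
    [[x + 1, y], [x - 1, y], [x, y + 1], [x, y - 1]].each do |nx, ny|
      next if nx < 0 || ny < 0 || nx >= width || ny >= height
      next if bitmap[ny][nx] != 1 || seen[[nx, ny]]
      seen[[nx, ny]] = true
      queue << [nx, ny]
    end
  end

  xs = seen.keys.map(&:first)
  ys = seen.keys.map(&:last)
  [xs.min, ys.min, xs.max + 1, ys.max + 1]
end

# Scale a pixel bounding box (origin top-left) to document units
# (origin bottom-left), giving [left, bottom, right, top].
def scale_to_pdf(bbox, bitmap_size, media_size)
  left, top, right, bottom = bbox
  bitmap_w, bitmap_h = bitmap_size
  media_w, media_h   = media_size
  sx = media_w.to_f / bitmap_w
  sy = media_h.to_f / bitmap_h
  [(left * sx).round, ((bitmap_h - bottom) * sy).round,
   (right * sx).round, ((bitmap_h - top) * sy).round]
end

bitmap = [
  [0, 0, 0, 0, 0, 0, 0, 1], # stray mark, e.g. a page number
  [0, 0, 1, 1, 1, 1, 0, 0],
  [0, 0, 1, 1, 1, 1, 0, 0],
  [0, 0, 1, 1, 1, 1, 0, 0],
  [0, 0, 1, 1, 1, 1, 0, 0],
  [0, 0, 0, 0, 0, 0, 0, 0]
]

bbox = center_blob_bbox(bitmap)
p bbox                                   # => [2, 1, 6, 5]
p scale_to_pdf(bbox, [8, 6], [80, 60])   # => [20, 10, 60, 50]
```

The stray mark in the top-right corner is not connected to the central blob, so the fill never reaches it and it falls outside the bounding box. That is the behavior that lets page numbers and headers be trimmed.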

It's up to you to visually scan the document beforehand to find a good range of pages to use as a basis for the required --range parameter. It's best to skip titles, tables of contents, and pages whose content runs into the margins, such as large images or horizontal rules. A range of about ten pages will usually provide good results.

Sample run

pdf-barber$ ruby bin/barber.rb --range 1-8 pdfs/bookie-basic-feature.pdf 
MediaBox: [0, 0, 504, 661] CropBox: []
Rendering pages 1 to 8...
New CropBox for all pages: [66, 74, 440, 615] Page Size: [374, 541]
Writing PDF with new CropBox to cropped_bookie-basic-feature.pdf...
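The Page Size reported in the sample output is just the width and height of the new CropBox. A small sketch of that arithmetic follows, assuming the box is ordered as [left, bottom, right, top]; the `crop_size` helper is hypothetical and not part of Barber.

```ruby
# Width and height of a box given as [left, bottom, right, top].
def crop_size(crop_box)
  left, bottom, right, top = crop_box
  [right - left, top - bottom]
end

p crop_size([66, 74, 440, 615]) # => [374, 541]
```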

Options:

--separate: Process odd- and even-numbered pages separately. This is useful for books in which the binding edge and outside edge of each page have different margins.

--dryrun: Display the calculated CropBox without writing a new file.

--tmpdir DIR: Render the working files to the specified directory and retain them, so you can see what the renderer is doing. WARNING: Using the same tmpdir for multiple runs will cause odd behavior.

--verbose: Echoes all of the system commands to stdout.
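To picture what --separate implies, the sketch below splits a page range into odd- and even-numbered pages so that each group could be composited and measured on its own. How Barber actually implements the option is not shown here; the `split_odd_even` helper is a hypothetical illustration.

```ruby
# Split an inclusive page range into [odd_pages, even_pages].
def split_odd_even(first_page, last_page)
  (first_page..last_page).partition(&:odd?)
end

odd_pages, even_pages = split_odd_even(1, 8)
p odd_pages   # => [1, 3, 5, 7]
p even_pages  # => [2, 4, 6, 8]
```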

What else do I need?

  • A *nix-like system. This has been tested on Ubuntu. Believe it or not, I don't currently own a Mac.
  • GhostScript, to read and write PDF files.
  • ImageMagick, to process the raster files generated by GhostScript.

What about the API?

Barber is intended to be used by a person, from a command line. You must eyeball each document in order to find a good page range. If you really want to, you can tell the Barber to give himself a shave:

require_relative 'lib/barber'
Barber::Shaver.shave(filename: 'pdfs/bookie-basic-feature.pdf', range: [1,8])
