Secure PDF Redactions: A Script-Based Approach Using ImageMagick

In today’s digital landscape, PDF documents have become a ubiquitous medium for sharing information, ranging from professionally generated reports to scanned copies of vital documents. There are numerous scenarios, be it for privacy concerns, security protocols, or compliance with legal requirements, where redacting certain pieces of information from these PDFs becomes necessary.

A common method for redacting PDF documents involves overlaying black boxes or shapes over the text intended to be concealed. While this approach is straightforward and widely adopted due to its simplicity, it is alarmingly insecure. The fundamental issue is that the method is reversible: the black boxes or shapes can be moved or removed, revealing the supposedly hidden information underneath, and the covered text can often simply be selected and copied. Many users therefore have a false sense of security and share documents redacted with black overlays without realizing that the information beneath is still accessible.

This is by no means a new topic. There are many articles, YouTube videos, training materials, and guides dealing with it. Several software tools, including “Preview” for Mac OS X and “Adobe Acrobat”, provide a way to redact PDF documents securely, i.e. in an irreversible manner. However, to my knowledge there is no good tool for doing that on Linux machines.

This blog post introduces a simple method to process PDFs redacted with black overlays in bulk, using the free and open-source command-line tool “convert” from ImageMagick. The following sections present the script, explain how it works, and show how to use it.

A Script-Based Solution:

For those comfortable with scripting and basic command-line operations, we propose a script that uses ImageMagick, the powerful image processing tool, to convert PDF pages into images, which bakes the black overlays into the pixels, and then reassemble the pages into a PDF. This method ensures that the redacted information is irrecoverable, since all layers of the original PDF have been flattened, i.e. transformed into a single-layer image, removing any information about text that was present beneath the black overlays.

We added options to run the script in quiet mode (-q), to keep backups of the original files with the .bak extension (-b), and to process only files with a specific prefix within a given folder.

In comparison with other tools, the script allows us to process files in bulk. It is free and open-source, which gives the user the possibility to analyze it, understand it, and make sure there is nothing fishy about it. It also gives the power user the option to tweak and improve it as needed.

Please note that using this method, text or other elements in the resulting PDF cannot be “selected” even if that was possible in the original PDFs.
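One way to sanity-check the result is to try extracting text from a processed file. The sketch below assumes that pdftotext from poppler-utils is installed (it is not used by the script itself), and the function name has_text_layer is made up for illustration. An empty extraction suggests that no text layer survived, although it does not prove that the file is free of other hidden data such as metadata.

```shell
#!/bin/bash
# Hypothetical helper: succeeds (exit 0) if the PDF still has extractable text.
# Assumes pdftotext (poppler-utils) is on the PATH.
has_text_layer() {
    local pdf=$1
    local text
    # "-" sends the extracted text to stdout instead of a .txt file
    text=$(pdftotext "$pdf" - 2>/dev/null)
    # Strip all whitespace, including the form feeds between pages
    text=${text//[[:space:]]/}
    [ -n "$text" ]
}
```

For example, running has_text_layer on a processed file and printing a warning when it succeeds would flag any PDF whose text can still be extracted.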

The script

#!/bin/bash

# Function to display usage instructions
usage() {
    echo "Usage: $0 [-q] [-b] <folder> [prefix]"
    echo "Options:"
    echo "  -q    Quiet mode (less output)"
    echo "  -b    Backup original PDF files"
    exit 1
}

# Function to ensure ImageMagick is installed
check_dependencies() {
    if ! command -v convert &> /dev/null; then
        echo "ImageMagick is not installed. Please install it to use this script."
        exit 5
    fi
}

# Improved backup functionality
backup_file() {
    local file=$1
    local backup_dir="$TARGET_FOLDER/backups"
    mkdir -p "$backup_dir"
    local timestamp=$(date +"%Y%m%d_%H%M%S")
    cp "$file" "$backup_dir/$(basename "$file" .pdf)_$timestamp.bak"
    verbose_echo "Backup of $file created in $backup_dir."
}

# Function to print verbose messages
VERBOSE=1
verbose_echo() {
    if [ "$VERBOSE" -ne 0 ]; then
        echo "$@"
    fi
}

# Function to handle Ctrl-C
handle_sigint() {
    echo "Script interrupted by user. Exiting..."
    exit 2
}

# Set a trap for Ctrl-C
trap handle_sigint SIGINT

# check dependencies
check_dependencies

# Parse optional flags
BACKUP=0
while getopts ":qhb" option; do
    case $option in
        b)
            BACKUP=1
            ;;
        q)
            VERBOSE=0
            ;;
        h)
            usage
            ;;
        \?)
            echo "Invalid option: -$OPTARG" >&2
            usage
            ;;
    esac
done
shift $((OPTIND -1))

# Check if the correct number of arguments are provided
if [ "$#" -lt 1 ]; then
    usage
fi

# Get the target folder and optional prefix from the command-line arguments
TARGET_FOLDER="${1%/}"
PREFIX="${2:-}"

if [ ! -d "$TARGET_FOLDER" ]; then
    echo "The selected folder does not exist: $TARGET_FOLDER"
    exit 3
fi

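# Check for PDF files in the folder with the given prefix.
# nullglob makes an unmatched pattern expand to nothing; without it the
# array would contain the literal pattern and the check below would pass.
shopt -s nullglob
if [ -n "$PREFIX" ]; then
    PDF_FILES=("$TARGET_FOLDER"/"$PREFIX"*.pdf)
else
    PDF_FILES=("$TARGET_FOLDER"/*.pdf)
fi
shopt -u nullglob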

TOTAL_FILES=${#PDF_FILES[@]}
if [ "$TOTAL_FILES" -eq 0 ] || [ -z "${PDF_FILES[0]}" ]; then
    verbose_echo "No PDF files found in the folder."
    exit 4
else
    verbose_echo "Number of PDF files to be processed: $TOTAL_FILES"
fi

# Create a temporary directory for intermediate files
TEMP_DIR=$(mktemp -d)
verbose_echo "Temporary directory created at $TEMP_DIR"

# Process each PDF in the folder
CURRENT_FILE=0
for pdf in "${PDF_FILES[@]}"; do
    CURRENT_FILE=$((CURRENT_FILE + 1))
    base_name=$(basename "$pdf" .pdf)
    
    # Backup original PDF if backup option is set
    if [ "$BACKUP" -eq 1 ]; then
        backup_file "$pdf"
    fi
    
    verbose_echo "Processing $pdf ($CURRENT_FILE of $TOTAL_FILES)..."

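    # Render each page to a PNG. %04d zero-pads the page index so that the
    # pages sort correctly and single-page PDFs get an indexed name too.
    convert -density 300 "$pdf" "$TEMP_DIR/${base_name}-page-%04d.png"
    # Reassemble the page images, in order, into a PDF with the original name
    convert "$TEMP_DIR/${base_name}"-page-*.png "$TARGET_FOLDER/$base_name.pdf"
    rm "$TEMP_DIR"/"${base_name}"-page-*.png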

    verbose_echo "Completed processing for $pdf. Saved to $TARGET_FOLDER/$base_name.pdf"
done

# Remove the temporary directory
rm -rf "$TEMP_DIR"
verbose_echo "Temporary files cleaned up."

verbose_echo "All files processed successfully!"

Script breakdown

After defining its helper functions and checking that ImageMagick is installed, the script parses the flags and arguments passed to it. The flags quiet (-q), backup (-b), and help (-h) are accepted. The script takes at least one argument, which is the folder where the target PDFs are located; in the script, the variable that holds it is called TARGET_FOLDER. A second, optional argument gives the prefix of the PDF files and is stored in PREFIX.
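The parsing step can be sketched in isolation as follows. This is a simplified illustration rather than the script's exact code: the wrapper function parse_args is hypothetical (the script does this work at the top level) and the -h flag is left out for brevity.

```shell
#!/bin/bash
# Illustrative sketch: parse -q and -b, then read <folder> and [prefix].
# parse_args is a hypothetical wrapper used only for this example.
parse_args() {
    local OPTIND=1 option
    VERBOSE=1
    BACKUP=0
    # The leading ":" puts getopts in silent mode so we print our own errors
    while getopts ":qb" option; do
        case $option in
            q) VERBOSE=0 ;;
            b) BACKUP=1 ;;
            \?) echo "Invalid option: -$OPTARG" >&2; return 1 ;;
        esac
    done
    # Drop the parsed flags so that $1 is the folder
    shift $((OPTIND - 1))
    [ "$#" -ge 1 ] || return 1
    TARGET_FOLDER="${1%/}"   # strip one trailing slash
    PREFIX="${2:-}"          # empty when no prefix is given
}
```

Calling parse_args -b ~/Documents/Experiment/Redacted_PDFs/ to_redact_ would leave BACKUP set to 1, VERBOSE set to 1, and PREFIX set to to_redact_.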

Next, the script checks that TARGET_FOLDER exists and that it contains at least one PDF whose name starts with PREFIX. If PREFIX is empty, any PDF in the folder counts.
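One detail matters for this check: by default, bash leaves a glob pattern that matches nothing in place as a literal string, so an array built from it would still have one element. A minimal sketch of a match that handles the empty case, assuming bash and using a hypothetical helper named find_pdfs that fills an array called MATCHES:

```shell
#!/bin/bash
# Hypothetical helper: fill MATCHES with the PDFs in a folder, optionally
# restricted to names that start with a prefix.
find_pdfs() {
    local folder=${1%/} prefix=${2:-}
    local restore
    restore=$(shopt -p nullglob)   # remember the current nullglob setting
    shopt -s nullglob              # unmatched patterns expand to nothing
    MATCHES=("$folder"/"$prefix"*.pdf)
    $restore                       # put nullglob back the way it was
}
```

After find_pdfs "$TARGET_FOLDER" "$PREFIX", the value of ${#MATCHES[@]} is the real number of matching files, including zero.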

The script then loops over the selected files and processes them one by one. If the -b flag was passed, it first creates a backup of the file. It then renders every page of the PDF to an image in a temporary folder, which flattens the layers of each page and makes the redaction permanent, and finally reassembles those images into a PDF with the same name, overwriting the original.
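Because the original file is overwritten, a more defensive variant of this step may be worth considering. The sketch below is an assumption on our part, not what the script above does: the hypothetical function flatten_pdf writes the flattened PDF to the work directory first and only replaces the original once both convert calls have succeeded.

```shell
#!/bin/bash
# Hypothetical helper: flatten one PDF via page images and replace it only
# on success. Assumes ImageMagick's convert is on the PATH.
flatten_pdf() {
    local pdf=$1 work_dir=$2
    local base_name
    base_name=$(basename "$pdf" .pdf)
    # %04d zero-pads the page index so the glob below lists pages in order
    convert -density 300 "$pdf" "$work_dir/${base_name}-page-%04d.png" || return 1
    # Write to the work directory first; the original stays intact on failure
    convert "$work_dir/${base_name}"-page-*.png "$work_dir/${base_name}.pdf" || return 1
    mv "$work_dir/${base_name}.pdf" "$pdf"
}
```

Inside the loop this could be called as flatten_pdf "$pdf" "$TEMP_DIR", skipping to the next file with a warning when it returns a non-zero status.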

Finally, the script deletes the temporary folder used for the session and exits with status 0.
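Note that this cleanup only happens when the script runs to the end. If leftover page images are a concern, for example after an interruption with Ctrl-C, one possible refinement is an EXIT trap. The following is a sketch of that idea under our own assumptions, not part of the script above; the function name cleanup is illustrative.

```shell
#!/bin/bash
# Illustrative sketch: remove the temporary directory however the script ends.
TEMP_DIR=$(mktemp -d)

cleanup() {
    rm -rf "$TEMP_DIR"
}

trap cleanup EXIT    # runs when the script finishes or calls exit
trap 'exit 2' INT    # Ctrl-C becomes a regular exit, which triggers cleanup
```

With this in place, the explicit rm -rf at the end of the script becomes optional, since the trap covers the normal path as well.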

Usage Instructions

To use the script, navigate to the directory containing the script and run:

./pdf_redact.sh [-q] [-b] <folder> [prefix]

Options:

-q: Quiet mode (suppresses most of the output)
-b: Backup original files before processing
<folder>: The target folder containing PDF files
[prefix]: (Optional) A prefix to filter which PDF files to process

The basic usage is to pass the target folder to the script and let it run:

% ./pdf_redact.sh ~/Documents/Experiment/Redacted_PDFs/

Optionally, create backups of the original files:

% ./pdf_redact.sh -b ~/Documents/Experiment/Redacted_PDFs/

To process only the PDFs in the folder whose names start with the prefix “to_redact_”:

% ./pdf_redact.sh -b ~/Documents/Experiment/Redacted_PDFs/ to_redact_

Conclusion:

While this script-based approach requires a bit more technical skill than using standard PDF editors, it offers a significantly more secure way to handle PDF redactions. In scenarios where data confidentiality is paramount, taking this extra step can be crucial. By converting sensitive information into non-recoverable image formats, you can share your redacted documents with confidence, knowing that the redacted content remains secure.