Scanning Workflow and Tools

Background

This document describes a workflow for scanning documents which is being used at the Ma_Sys.ma. Additionally, the repository provides tools to aid with this workflow.

Being very specific, it is expected that the scripts are not immediately useful for general-purpose usage. However, they may still serve as an inspiration for developing an own workflow.

Technical Details

Scans are expected to be acquired in form of multiple PDF documents which may potentially contain multiple pages (e.g. from the same scan run through an automatic document feeder, short: ADF). Alternatively, scanning documents one-by-one by using a locally connected scanner are supported.

For storage, scans are converted to PNG files with a resolution of 150 DPI and 8 colors. If the mode of scanning can be influenced, it is set to grayscale by default.

As an additional feature, scanned pages can be processed by an OCR to retrieve a six-digit number which is used for constucting the file name of form madocYYYYYY.png where YYYYYY is the document’s number.

Workflow

The Ma_Sys.ma scanning workflow starts with (often hand-written) documents that are labelled by a pagination stamp with consecutive six-digit numbers near the begin of the document.

These documents are then processed by any of these two “ways”:

Way A: Scanning one-by-one

Script ma_scanner2 -n is invoked
First document put into the scanner
With [ENTER] the scan is triggered.
After scanning, the next document is processed by putting it in the scanner and continuing from step 3. If no more documents remain, the scanning is completed by entering q followed by [ENTER]

In parallel, background processes convert the scan results to indexed grayscale with 8 colors and OCR processes try to recognize the numbers and rename the files accordingly

Way B: Scanning with ADF

A local FTP server is started.
Using a networked-scanner with ADF, documents are scanned in multiple batches and uploaded to the FTP server in form of PDF documents.
In the directory with the PDF files, ma_scanner -n . is invoked.
Documents are converted and OCRed in parallel.

After the scanning

After processing documents this way, two types of faults can be observed:

Wrongly named files: These can often be identified by being numbered outside the interval of scanned pages. For instance, if documents 002001 to 002300 were processed, files named madoc992022.png are likely to be wrongly named. One can identify these files by using a file manager’s sort by name function.
Unidentified files: If the OCR did not return any six-digit numbers for a file, it likely did not recognize the stamped number. In this case, the file will be called like its origin PDF + page number or scan_YYY with YYY being a consecutive numbering scheme.

In both cases, the wrongly named files need to be renamed. To do this, program scanimgrename is invoked on all the files whose names are incorrect. For instance, a typical invocation is scanimgrename scan_???.png to process all the hand-scanned but unidentified files.

The `scanimgrename` tool

Name

scanimgrename – rename scanned image files

Synopsis

Usage: scanimgrename [file...]

Description

scanimgrename provides a minmalistic interface showing only the scanned document and a field to enter a number which will automatically be prefixed by a suitable number of zeroes if it is less than six digits. Upon pressing [ENTER], the file is renamed and the next document is presented. Having processed all files, the interface closes.

In case a file with the target file name already exists, pressing [ENTER] will save the file name to a stack and turn to the already existing file under the assumption that that file might have been mis-named. scanimage indicates this mode by showing the number being entered in red as opposed to the regular blue color. Once the rename conflict has been resolved, the color will turn blue again.

Configuration

In order to configure a different file name scheme, the source code ScanImgRename.java nees to be changed and the scanimgrename.jar needs to be rebuilt e.g. by invoking ant jar.

The `ma_scanner2` tool

Name

ma_scanner2 – Ma_Sys.ma Scanning Helper Script

Synopsis

Way A Usage ma_scanner2 [-n] [MODE RESOLUTION INDICES]
Way B Usage ma_scanner2 [-n] DIRECTORY

Description

Two different modes of invocation, corresponding to the different ways explained for the workflow are available.

Way A: To perform scans of individual documents, the scanimage tool is used. ma_scanner2 provides an interactive interface querying the user to press [ENTER] to perform the next scan. MODE can be one of Color, Gray or Lineart (default: Gray). RESOLUTION is the scanning resolution in DPI (default: 150) and INDICES is the number of colors to use for the output file. If -1 is given, all the colors from the scanner are retained.
Way B: To process scanned documents from an ADF, all pages from PDF files in DIRECTORY are processed using parallel processes.

Options

-n: If -n is given, numbers are attempted to be assigned to the files by processing them through the Tesseract OCR.

Examples

ma_scanner2: A plain invocation allows scanning documents without numbers.
ma_scanner2 -n: Scan individual pages and attempt to recognize the numbers.
ma_scanner2 Color 300 -1: Scan images without reducing colors and with elevated resolutions. This is useful for non-document scans.

Background

Technical Details

Workflow

Way A: Scanning one-by-one

Way B: Scanning with ADF

After the scanning

The scanimgrename tool

Name

Synopsis

Description

Configuration

The ma_scanner2 tool

Name

Synopsis

Description

Options

Examples

See also

The `scanimgrename` tool

The `ma_scanner2` tool