Optical Character Recognition

Case Study: Accelerating Error Detection in One Million PDF Documents with OCR and Machine Learning

Challenge:

How do you identify errors in approximately one million PDF documents within a tight deadline of one month? This daunting task posed significant challenges, particularly when considering the manual resources required and the inherent complexities in the data within these documents.

Solution:

We opted for a combination of Optical Character Recognition (OCR) and a custom machine learning model. This approach allowed us to process and analyze 30 documents in just 10 seconds on a medium-spec server. The solution was designed to be trained to identify specific errors within the documents, providing an efficient, scalable method for error detection.

Implementation:

Integration with Document Server: The first step was to integrate with a document server that could retrieve and manage the vast volume of documents efficiently. This allowed seamless access to the PDFs required for processing.
Optical Character Recognition (OCR): Given OCR's longstanding reliability, it was the obvious choice for reading and extracting text from the documents. OCR enabled the machine learning model to interpret the content of each document, making it possible to analyze and validate data accurately.
Custom Machine Learning Model: The complexity of the task was compounded by the nature of the data—specifically amounts—which presented challenges such as:
- OCR errors, where characters could be misinterpreted (e.g., '5' as 'S').
- Variability in amounts from line to line, with some being empty.
To overcome these challenges, we developed a custom algorithm that identified patterns in the text to determine if a particular string was likely an amount and whether it was accurate. Rather than converting everything into amounts and performing financial calculations, which could introduce additional errors, we integrated with a reliable data source. This source provided the expected values for amounts, and the model flagged any deviations.
Efficiency in Processing: By avoiding unnecessary conversions and calculations, the model worked directly with the OCR-read text. It utilized a small JSON document that specified which pages to examine, where on those pages to look, and what patterns to identify. This selective processing approach meant we didn't need to validate every piece of data or read every page of each document, drastically reducing the time and computing resources required.

Outcome:

The combination of OCR, machine learning, and targeted data validation proved to be highly effective. The system reliably validated the integrity of the documents, flagging any discrepancies for further review. This allowed human resources to focus on higher-value tasks rather than manually reviewing each document.

Conclusion:

By leveraging OCR and custom machine learning algorithms, we transformed a potentially overwhelming task into a manageable and efficient process. This solution not only met the tight deadline but also ensured high accuracy in detecting errors across a massive volume of documents.