During the last decade, companies have developed software solutions to collect clinical data, but there is no easy solution for existing documents, especially printed or scanned paper forms and reports.
Advancements in scanning and automatic character recognition techniques (e.g. OCR) have facilitated the transformation from printed material to electronic documents, but it is still challenging to develop techniques to extract, classify and interpret data from scanned documents automatically.
One good example of this difficulty is the parsing of regulatory documents and forms (CIOMS, BfArM, MedWatch, etc.) used in pharmacovigilance. These documents are supposed to have a standardized layout and well-identified information blocks. The major problem is that the content of each block is free text. Even the way blocks are identified (numbering and headers) varies from one report to another. Therefore, running an OCR program on such scanned documents is not enough to obtain usable information.
Some companies try to solve the problem by hiring a data entry service provider in a low-wage country to re-enter data manually. Besides being quite expensive, this method has a major drawback: data entry personnel need intensive training to understand the business comprehensively, and may have difficulty identifying misspelled information, inconsistencies and errors in the source documents.
To address the problem, we have developed a methodology to analyze existing documents, define regions and fields to extract, and interpret collected data based on customer requirements. This method has been implemented and validated on large, real-life projects.
Instead of a one-size-fits-all approach built on a heavily customized framework, we use a project-oriented strategy, which is more cost-effective and manageable. One major advantage of this method is that users are continuously involved in all project steps, from the definition of the functional specification to the validation/qualification process.
The final system we offer is able to:
- automatically extract raw data out of documents (MS Word, PDF, scanned documents) using rules defined by the customer,
- report parse errors and allow for manual corrections and/or modifications,
- load data into a relational database for further processing,
- automatically recode products (drug safety database's internal dictionary) and adverse events (MedDRA), according to user or regulatory agency specifications,
- give users from the business department a Web application to modify the parsed documents and make them E2B-compatible,
- automatically generate E2B documents (single or mass generation) according to E2B specification and customer requirements,
- generate additional reports as required by the business users,
- offer an optional hosting solution for application and data during the migration phase.
Extracting raw data out of regulatory documents
We use two-level parsing to extract raw data out of documents. The first parsing level is designed to identify valid blocks within documents. Advanced pattern-matching is used to cope with misspelled text. This is an iterative phase in which patterns are tuned until all variations and deviations are successfully recognized and parsed.
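As an illustration of how block identification can tolerate misspelled headers, here is a minimal sketch using approximate string matching from the Python standard library. The header list, the numbering pattern and the 0.8 similarity cutoff are assumptions made for this example; they are not the patterns used in the actual framework.

```python
import difflib
import re

# Hypothetical canonical block headers, loosely modelled on a CIOMS form.
KNOWN_HEADERS = [
    "PATIENT INITIALS",
    "DESCRIBE REACTION(S)",
    "SUSPECT DRUG(S)",
    "CONCOMITANT DRUG(S) AND DATES OF ADMINISTRATION",
]

# Optional block number in front of a header, e.g. "7.", "14" or "14)".
LEADING_NUMBER = re.compile(r"^\s*\d+[a-z]?\s*[.)-]?\s*")


def match_header(line, cutoff=0.8):
    """Return the canonical header that `line` resembles, or None."""
    candidate = LEADING_NUMBER.sub("", line).strip().rstrip(":").upper()
    if not candidate:
        return None
    hits = difflib.get_close_matches(candidate, KNOWN_HEADERS, n=1, cutoff=cutoff)
    return hits[0] if hits else None


def split_blocks(text):
    """Group body lines under the most recent recognized header."""
    blocks = {}
    current = None
    for line in text.splitlines():
        header = match_header(line)
        if header is not None:
            current = header
            blocks.setdefault(current, [])
        elif current is not None and line.strip():
            blocks[current].append(line.strip())
    return {name: " ".join(lines) for name, lines in blocks.items()}
```

A header read by OCR as "7. DESCR1BE REACTION(S)" is still assigned to the "DESCRIBE REACTION(S)" block, while ordinary body lines fall below the cutoff. Tuning the cutoff and the header list corresponds to the iterative pattern tuning described above.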
Supported input formats are MS Word or OpenOffice Writer documents, PDF files, and scanned documents, either as single files or wrapped in PDF documents. Other formats can be added on request.
Special attention is given to non-text components such as check boxes.
Once correctly parsed, blocks are loaded into a staging area for further processing.
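The staging step can be pictured as one table holding a row per document block. The sketch below uses SQLite from the Python standard library; the table name, the column names and the choice of SQLite are assumptions for illustration only and say nothing about the database actually used.

```python
import sqlite3


def create_staging(conn):
    """Create a hypothetical staging table: one row per (document, block)."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS staging_block (
            document_id TEXT NOT NULL,
            block_name  TEXT NOT NULL,
            raw_text    TEXT NOT NULL,
            PRIMARY KEY (document_id, block_name)
        )
        """
    )


def stage_blocks(conn, document_id, blocks):
    """Store the parsed blocks of one document; re-staging overwrites old rows."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO staging_block (document_id, block_name, raw_text) "
            "VALUES (?, ?, ?)",
            [(document_id, name, text) for name, text in blocks.items()],
        )


def load_block(conn, document_id, block_name):
    """Fetch the raw text of one staged block, or None if it is missing."""
    row = conn.execute(
        "SELECT raw_text FROM staging_block WHERE document_id = ? AND block_name = ?",
        (document_id, block_name),
    ).fetchone()
    return row[0] if row else None
```

Because the raw block text is kept as-is, later extraction rules can be rerun against the staged rows without touching the source documents again.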
Loading raw data into a relational database
The second parsing level is the fine parsing step. Its goal is to analyze the content of the logical blocks to extract low-level information. This step works on data in the staging area, so parsing and extraction rules can be refined without repeating the document parsing step itself.
Again, this is an iterative process in which patterns are tuned until all usable information is extracted from the text blocks.
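To give an idea of what fine parsing of a free-text block can look like, the following sketch pulls a dose and therapy dates out of a drug block with regular expressions. The field names, the list of units and the DD-MON-YYYY date layout are assumptions chosen for the example; real projects would tune such patterns against the customer's documents.

```python
import re
from datetime import date

# Illustrative patterns only: a dose such as "500 mg" and dates such as "12-JAN-2010".
DOSE = re.compile(r"(?P<amount>\d+(?:[.,]\d+)?)\s*(?P<unit>mg|g|ml|mcg|iu)\b", re.IGNORECASE)
DATE = re.compile(r"\b(\d{1,2})[-/ ]([A-Za-z]{3})[-/ ](\d{4})\b")
MONTHS = {
    name: number
    for number, name in enumerate(
        ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
         "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"],
        start=1,
    )
}


def parse_drug_block(text):
    """Extract the first dose and all valid dates from a free-text drug block."""
    result = {"dose_amount": None, "dose_unit": None, "dates": []}
    dose = DOSE.search(text)
    if dose:
        result["dose_amount"] = float(dose.group("amount").replace(",", "."))
        result["dose_unit"] = dose.group("unit").lower()
    for day, month, year in DATE.findall(text):
        month_number = MONTHS.get(month.upper())
        if month_number is None:
            continue  # three letters that are not a month abbreviation
        try:
            parsed = date(int(year), month_number, int(day))
        except ValueError:
            continue  # impossible calendar date, e.g. 31-FEB-2010
        result["dates"].append(parsed.strftime("%Y%m%d"))
    return result
```

Values that cannot be interpreted are skipped rather than guessed, so they can be reported as parse errors and corrected manually.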
Once successfully extracted, low-level data is stored in a custom data structure to easily generate reports, E2B documents or transfer files that will be imported into drug safety systems (Argus Safety, ARISg, etc.) or exchanged with health authorities worldwide.
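As a rough picture of how a structured case record can feed document generation, the sketch below serializes a small record to XML. The `ParsedCase` class and its fields are hypothetical, and the element names are a heavily simplified subset modelled on the E2B(R2) message layout; the output omits the message header and many mandatory elements, so it is not a valid E2B file.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class ParsedCase:
    """Hypothetical container for the low-level data of one report."""
    report_id: str
    country: str
    patient_initials: str
    reactions: list = field(default_factory=list)
    drugs: list = field(default_factory=list)


def to_e2b_like_xml(case):
    """Serialize a case to a simplified, E2B-like XML string (not DTD-valid)."""
    root = ET.Element("ichicsr")
    report = ET.SubElement(root, "safetyreport")
    ET.SubElement(report, "safetyreportid").text = case.report_id
    ET.SubElement(report, "primarysourcecountry").text = case.country
    patient = ET.SubElement(report, "patient")
    ET.SubElement(patient, "patientinitial").text = case.patient_initials
    for term in case.reactions:
        reaction = ET.SubElement(patient, "reaction")
        ET.SubElement(reaction, "primarysourcereaction").text = term
    for name in case.drugs:
        drug = ET.SubElement(patient, "drug")
        ET.SubElement(drug, "medicinalproduct").text = name
    return ET.tostring(root, encoding="unicode")
```

Keeping the extracted data in one structure means the same record can drive reports, E2B output and transfer files without re-parsing the source document.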
Data Mapping
This step is often necessary to standardize data fields. It is almost mandatory for predefined fields (time units, quantity units, etc.) that are supposed to match lookup tables in the target system or specification. Mapping can be done using predefined or customer-defined tables. Simple mapping values or regular expressions can be used, as well as complex matching rules.
Basic examples are the remapping of free-text country names to ISO 3166 codes, or the “Duration Unit” mapping in E2B documents (801=Year, 802=Month, etc.). Standard mapping rules such as E2B recoding are already implemented in the framework and can be used out-of-the-box.
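The two basic examples above can be sketched with a lookup table and a few regular expressions. The spellings listed here are assumptions for illustration; only the 801 and 802 codes are taken from the text, and the remaining duration units would have to be filled in from the E2B specification.

```python
import re

# Hypothetical lookup table: normalized free-text country name -> ISO 3166 alpha-2 code.
COUNTRY_TO_ISO = {
    "germany": "DE",
    "deutschland": "DE",
    "france": "FR",
    "switzerland": "CH",
    "united states": "US",
    "usa": "US",
}

# Regular-expression rules for the E2B "Duration Unit" field (801=Year, 802=Month).
DURATION_UNIT_RULES = [
    (re.compile(r"^(yr|yrs|year|years)$"), "801"),
    (re.compile(r"^(mo|mos|mth|mths|month|months)$"), "802"),
]


def map_country(raw):
    """Map a free-text country name to an ISO code, or None if unknown."""
    key = re.sub(r"[^a-z ]", "", raw.strip().lower())
    key = re.sub(r"\s+", " ", key).strip()
    return COUNTRY_TO_ISO.get(key)


def map_duration_unit(raw):
    """Map a free-text duration unit to its E2B code, or None if unknown."""
    key = raw.strip().lower().rstrip(".")
    for pattern, code in DURATION_UNIT_RULES:
        if pattern.match(key):
            return code
    return None
```

Returning `None` for unknown values, instead of a default code, keeps unmapped entries visible so they can be added to the tables or corrected by a user.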
Online browsing and editing
To allow users to control the whole process, a Web application (GUI) has been developed as part of the framework.
This application allows users to visualize the output of the parsing processes and to apply necessary changes and/or recoding to data. All changes are tracked in order to meet validation requirements.
The structure of the maintenance pages mirrors the structure of the parsed document to ease navigation.
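Change tracking of the kind mentioned above can be pictured as an append-only history kept next to the data. The following is a minimal sketch under that assumption; the class names and fields are hypothetical and do not describe the Web application's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ChangeRecord:
    """One immutable audit entry: who changed which field, from what to what."""
    field_name: str
    old_value: object
    new_value: object
    user: str
    changed_at: datetime


class TrackedCase:
    """Holds field values and records every effective modification."""

    def __init__(self, values):
        self._values = dict(values)
        self.history = []

    def get(self, field_name):
        return self._values.get(field_name)

    def set(self, field_name, new_value, user):
        old_value = self._values.get(field_name)
        if old_value == new_value:
            return  # nothing changed, so nothing to audit
        self.history.append(
            ChangeRecord(field_name, old_value, new_value, user, datetime.now(timezone.utc))
        )
        self._values[field_name] = new_value
```

Because entries are only ever appended, the history can be exported later as evidence of who changed what and when.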
The following screenshots show maintenance pages used to process CIOMS documents.
Maintenance panel #1
Maintenance panel #2
Maintenance panel #3
Maintenance panel #4
Maintenance panel #5
The generated E2B document can be seen here...
Although a lot of work can be done automatically, there are always situations where it is easier to change data manually. This is the case, for example, when the quality of scanned documents is so poor that the OCR program is unable to recognize parts of them. Using a GUI is much more user-friendly than editing text documents and can be done by end users without special training.
The GUI application also lets users trigger the generation of reports or E2B documents. In fact, once the basic parsing phase is complete, the iterative process of fine-tuning data and generating output can be done by end users without IT support.
Reporting and Validation
Using a central repository for parsed documents makes reporting easy. Ad-hoc and/or validation reports can be generated to support the whole process. An example of a validation report can be seen here...
All installation, configuration and execution tasks include validation and/or qualification documentation in compliance with governing regulatory requirements and industry standards.
Conclusion
This method has been successfully used to parse CIOMS documents and to load extracted data into an Argus Safety system using the standard E2B interface. This solution can be extended to any type of document with an identifiable structure (fixed layout or patterns), such as MedWatch or self-developed forms.
To discuss your specific needs and get a quote, please contact us.





