Contract Conversion & Analytics
To enable data-driven decisions, the Air Force needed a process for converting raw, non-searchable PDF contracts into machine-readable structured and unstructured formats so that actionable data could be extracted, analyzed, and visualized.
ILW created a processing pipeline that automated the extraction of semi-structured form and table data embedded within Air Force contracts and parsed this data into structured tabular outputs. The extracted contract text, forms, and tables are ingested into a NoSQL database, giving Air Force users an easy search capability. ILW uses text mining tools to search the converted contracts for compliance with various regulations.
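The flow described above (extract fields, produce structured records, index the text, search for compliance) can be illustrated with a minimal sketch. It is not ILW's implementation: plain strings stand in for OCR output, an in-memory dictionary stands in for the NoSQL database, and the sample contracts, field labels, function names, and the clause-number pattern (which assumes FAR/DFARS-style references such as 52.212-4) are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical OCR output for two contracts; a real pipeline would
# receive this text from the OCR stage rather than hard-coding it.
contracts = {
    "contract-001": "Item No: 0001\nQuantity: 12\nClauses: 52.212-4, 252.227-7013",
    "contract-002": "Item No: 0002\nQuantity: 3\nClauses: 52.204-21",
}

# "Label: value" lines, one per form field (assumed layout).
FIELD = re.compile(r"^[ \t]*([A-Za-z][A-Za-z .]*?)[ \t]*:[ \t]*(.+?)[ \t]*$", re.MULTILINE)
# Assumed shape of a FAR/DFARS-style clause number.
CLAUSE = re.compile(r"\b(?:52|252)\.\d{3}-\d{1,4}\b")


def parse_fields(text):
    """Turn semi-structured 'Label: value' lines into a flat record."""
    return {
        label.strip().lower().replace(" ", "_"): value
        for label, value in FIELD.findall(text)
    }


def build_index(docs):
    """Map each clause number to the set of contracts that cite it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for clause in CLAUSE.findall(text):
            index[clause].add(doc_id)
    return index


def missing_clause(clause, docs, index):
    """List contracts that do not cite the given clause."""
    return sorted(doc_id for doc_id in docs if doc_id not in index.get(clause, set()))


records = {doc_id: parse_fields(text) for doc_id, text in contracts.items()}
index = build_index(contracts)
```

Calling `missing_clause("52.204-21", contracts, index)` returns the contracts that never mention that clause, which is the kind of compliance question the search capability is meant to answer. In practice the records would be written to the document store and queried there instead of being held in memory.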
- Insight into an almost untapped source of data
- Converted 3.7 million Air Force contracts into machine-readable format
- Used 300,000 computational hours of processing
- Parsed 7 types of PDF forms and tables into a structured format
New search capability enables enterprise-level understanding of contract compliance, contract health, and data rights
- Data Science, NLP, ML, Text Mining
- Optical Character Recognition
- High Performance Computing
- Open-source Python solution using DoD-compatible libraries (Pandas, Tabula, Fitz, Scikit-learn, OpenCV)
- Tesseract and Couchbase