Semi-automatic matching of semi-structured data updates
Includes bibliographical references.
Data matching, also referred to as data linkage or field matching, is a technique used to combine multiple data sources into one data set. Data matching is used for data integration in a number of sectors and industries; from politics and health care to scientific applications. The motivation for this study was the observation of the day-to-day struggles of a large non-governmental organisation (NGO) in managing their membership database. With a membership base of close to 2.4 million, the challenges they face with regard to the capturing and processing of the semi-structured membership updates are monumental. Updates arrive from the field in a multitude of formats, often incomplete and unstructured, and expert knowledge is geographically localised. These issues are compounded by an extremely complex organisational hierarchy and a general lack of data validation processes. An online system was proposed for pre-processing input and then matching it against the membership database. Termed the Data Pre-Processing and Matching System (DPPMS), it allows for single or bulk updates. Based on the success of the DPPMS with the NGO’s membership database, it was subsequently used for pre-processing and data matching of semi-structured patient and financial customer data. Using the semi-automated DPPMS rather than a clerical data matching system, true positive matches increased by 21% while false negative matches decreased by 20%. The Recall, Precision and F-Measure values all improved and the risk of false positives diminished. The DPPMS was unable to match approximately 8% of provided records; this was largely due to human error during initial data capture. While the DPPMS greatly diminished the reliance on experts, their role remained pivotal during the final stage of the process.