Incremental Record Linkage. Anja Gruenheid, Xin Luna Dong, Divesh Srivastava


Introduction
What is record linkage? The task of linking records that refer to the same real-world entity.
Why do we need incremental record linkage? Batch computation of record linkage is costly. If the underlying data set is modified only slightly, it is more efficient to use an incremental approach.

Example: IRL

Initial data set D0:

BizID  ID   name              street address        city           phone
1      r1   Starbucks         23 MISSION ST STE ST  SAN FRANCISCO  4554350
1      r2   Starbucks         23 MISSION ST         SAN FRANCISCO  4554350
1      r3   Starbucks         23 Mission St         San Francisco  4554350
2      r4   Starbucks Coffee  340 MISSION ST        SAN FRANCISCO  4554350
3      r5   Starbucks Coffee  333 MARKET ST         SAN FRANCISCO  455434786
3      r6   Starbucks         MARKET ST             San Francisco
4      r7   Starbucks Coffee  52 California St      San Francisco  453988630
4      r8   Starbucks Coffee  52 CALIFORNIA ST      SAN FRANCISCO  453988630
5      r9   Starbucks Coffee  295 California St     San Francisco  459862349
5      r10  Starbucks         295 California St     San Francisco

[Figure: applying batch record linkage to D0 yields clusters C1 to C5 over records r1 to r10.]

Updates D1 to D4:

Update  BizID  ID   name              street address        city           phone
D1      6      r11  Starbucks Coffee  20 Spear Street       San Francisco  459745077
D2      3      r12  Starbucks Coffee  MARKET ST             San Francisco  455434786
D2      3      r13  Starbucks         333 MARKET ST         San Francisco  455434786
D3      1      r14  Starbucks         23 MISSION ST STE ST  SAN FRANCISCO  4554350
D3      1      r15  Starbucks         23 Mission St Ste St  San Francisco  4554350
D4      5      r16  Starbucks         295 CALIFORNIA ST     SAN FRANCISCO  459862349
D4      4      r17  Starbucks         52 California Street  SF             453988630

[Figure: applying incremental record linkage to the existing clusters C1 to C5 plus the new records r11 to r17 yields clusters C1 to C6.]

Optimal Approaches
Connected Component Approach: Update the connected component (set of clusters) that is or was connected to the modified record.
Iterative Approach: Iteratively propagate the update through clusters in the connected component.
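The connected component approach can be illustrated with a small sketch. It assumes clusters are stored as a dict from cluster id to a set of record ids, and that record-level similarity edges are already available; both the data layout and the function name `affected_component` are hypothetical and not taken from the slides.

```python
from collections import defaultdict, deque

def affected_component(clusters, edges, modified_record):
    """Return the ids of all clusters transitively connected to the
    cluster that contains `modified_record`.

    clusters: dict mapping cluster id -> set of record ids (assumed layout)
    edges:    iterable of (record, record) pairs judged similar (assumed given)
    """
    owner = {r: cid for cid, recs in clusters.items() for r in recs}
    # Lift record-level similarity edges to cluster-level adjacency.
    adjacent = defaultdict(set)
    for a, b in edges:
        ca, cb = owner[a], owner[b]
        if ca != cb:
            adjacent[ca].add(cb)
            adjacent[cb].add(ca)
    # Breadth-first search starting at the modified record's cluster.
    start = owner[modified_record]
    seen, queue = {start}, deque([start])
    while queue:
        cid = queue.popleft()
        for nxt in adjacent[cid] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen

# Toy input: C4 is not linked to anything, so it is left untouched.
clusters = {"C1": {"r1", "r2"}, "C2": {"r3"}, "C3": {"r4"}, "C4": {"r5"}}
edges = [("r1", "r2"), ("r2", "r3"), ("r3", "r4")]
component = affected_component(clusters, edges, "r1")
```

Only the clusters in `component` would then be handed to the batch algorithm; everything outside it keeps its existing clustering.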

Example: Iterative Approach
1) A modified record is associated with a modified cluster. A modified cluster can be a singleton cluster if the record cannot be associated with an existing cluster.
2) The directly connected component is evaluated with a batch algorithm. The directly connected component consists of the clusters that are directly connected to the modified cluster.
3) Iteratively proceed along modified clusters only. The modified clusters are explored iteratively to avoid unnecessary clustering of non-modified clusters.
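The three steps above can be sketched as a worklist loop. This is a minimal illustration under stated assumptions: clusters are frozensets of records, `connected` and `batch_cluster` are caller-supplied callbacks, and `max_steps` is a safety guard added here; none of these names come from the slides. The toy similarity (integers within distance 1) and the toy batch algorithm (`runs`) are invented for the example.

```python
from collections import deque

def iterative_update(clustering, connected, batch_cluster, modified, max_steps=1000):
    """Propagate an update through modified clusters only.

    clustering:    iterable of frozensets of records
    connected:     connected(c1, c2) -> True if two clusters are directly connected
    batch_cluster: batch_cluster(records) -> iterable of record groups
    modified:      the cluster holding the inserted or changed record
    """
    clustering = set(clustering)
    queue = deque([modified])
    for _ in range(max_steps):
        if not queue:
            break
        cluster = queue.popleft()
        if cluster not in clustering:  # absorbed by an earlier step
            continue
        # Step 2: the modified cluster plus its directly connected clusters.
        before = {cluster} | {c for c in clustering
                              if c != cluster and connected(cluster, c)}
        records = frozenset().union(*before)
        after = {frozenset(group) for group in batch_cluster(records)}
        clustering = (clustering - before) | after
        # Step 3: only clusters that actually changed are explored further.
        queue.extend(after - before)
    return clustering

# Toy setup: records are integers, similar if they differ by at most 1.
def connected(c1, c2):
    return any(abs(a - b) <= 1 for a in c1 for b in c2)

def runs(records):
    """Toy batch algorithm: group records into runs of consecutive integers."""
    groups, current = [], []
    for r in sorted(records):
        if current and r - current[-1] > 1:
            groups.append(current)
            current = []
        current.append(r)
    if current:
        groups.append(current)
    return groups

existing = {frozenset({1, 2}), frozenset({4, 5}), frozenset({10})}
new_cluster = frozenset({3})  # step 1: the new record starts as a singleton
result = iterative_update(existing | {new_cluster}, connected, runs, new_cluster)
```

In this toy run the cluster containing 10 is never passed to the batch algorithm, which is the saving the iterative approach aims for.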

Approximation Approach
The greedy variation of the iterative approach
uses the iterative mechanism of propagating modifications through modified clusters only.
uses a locally optimal decision function to create, merge, split, or move records across clusters.

Greedy Operations: Merge
If the benefits of merging the records into one cluster outweigh the penalties, then merge them.
[Figure: two clusters before and after a merge.]
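A merge decision of this kind might look as follows. The slides do not define the penalty, so this sketch assumes a correlation-clustering style cost: a pair of records in the same cluster costs 1 - sim, and a pair in different clusters costs sim. The function names and similarity values are hypothetical.

```python
from itertools import product

def merge_gain(a, b, sim):
    """Penalty reduction from merging clusters a and b.

    Each cross pair currently costs sim(x, y) (the records are apart) and
    would cost 1 - sim(x, y) after the merge, so the gain per pair is
    sim - (1 - sim) = 2 * sim - 1.
    """
    return sum(2 * sim(x, y) - 1 for x, y in product(a, b))

def maybe_merge(a, b, sim):
    """Merge only if the benefit outweighs the penalty."""
    return [a | b] if merge_gain(a, b, sim) > 0 else [a, b]

# Invented similarities for illustration; unlisted pairs default to 0.
scores = {frozenset({"x1", "y1"}): 0.9, frozenset({"x1", "z1"}): 0.2}
sim = lambda p, q: scores.get(frozenset({p, q}), 0.0)

merged = maybe_merge(frozenset({"x1"}), frozenset({"y1"}), sim)
kept_apart = maybe_merge(frozenset({"x1"}), frozenset({"z1"}), sim)
```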

Greedy Operations: Split
If the benefits of separating the records in one cluster into two clusters outweigh the penalties, then split them.
[Figure: one cluster before and after a split into two clusters.]
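A split check can be sketched under the same assumed cost model (same-cluster pairs cost 1 - sim, cross-cluster pairs cost sim). The exhaustive search over bipartitions below is exponential in the cluster size and only meant for illustration on small clusters; the slides do not say how split candidates are chosen, and the similarity values are invented.

```python
from itertools import combinations, product

def split_gain(part_a, part_b, sim):
    """Penalty reduction from separating part_a and part_b.

    Each cross pair currently costs 1 - sim (same cluster) and would cost
    sim after the split, so the gain per pair is 1 - 2 * sim.
    """
    return sum(1 - 2 * sim(x, y) for x, y in product(part_a, part_b))

def best_split(cluster, sim):
    """Return the most beneficial (part_a, part_b), or None if no split helps."""
    members = sorted(cluster)
    anchor, rest = members[0], members[1:]
    best_gain, best_parts = 0.0, None
    # The anchor always stays in part_a so each bipartition is tried once.
    for k in range(len(rest)):
        for stay in combinations(rest, k):
            part_a = frozenset((anchor, *stay))
            part_b = frozenset(cluster) - part_a
            gain = split_gain(part_a, part_b, sim)
            if gain > best_gain:
                best_gain, best_parts = gain, (part_a, part_b)
    return best_parts

# Invented similarities: a and b are close, c fits neither.
scores = {frozenset({"a", "b"}): 0.9,
          frozenset({"a", "c"}): 0.1,
          frozenset({"b", "c"}): 0.2}
sim = lambda p, q: scores[frozenset({p, q})]
parts = best_split({"a", "b", "c"}, sim)
```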

Greedy Operations: Move
If removing a record from one cluster and adding it to another decreases the overall penalty, then move the record.
[Figure: clusters C4 and C5 before and after a record is moved between them.]
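A move decision can be sketched with the same assumed cost model (same-cluster pairs cost 1 - sim, cross-cluster pairs cost sim). The record names and similarity values below are invented for illustration and do not reproduce the figure.

```python
def move_delta(record, source, target, sim):
    """Change in overall penalty if `record` moves from source to target.

    Pairs with the remaining source records switch from cost 1 - sim to sim;
    pairs with the target records switch from cost sim to 1 - sim.
    """
    leave = sum(2 * sim(record, x) - 1 for x in source if x != record)
    join = sum(1 - 2 * sim(record, y) for y in target)
    return leave + join

def maybe_move(record, source, target, sim):
    """Move the record only if doing so decreases the overall penalty."""
    if move_delta(record, source, target, sim) < 0:
        return source - {record}, target | {record}
    return source, target

# Invented similarities: m resembles the target records, not its own cluster.
scores = {frozenset({"m", "s1"}): 0.2, frozenset({"m", "s2"}): 0.1,
          frozenset({"m", "t1"}): 0.9, frozenset({"m", "t2"}): 0.9}
sim = lambda p, q: scores[frozenset({p, q})]

source = frozenset({"s1", "s2", "m"})
target = frozenset({"t1", "t2"})
new_source, new_target = maybe_move("m", source, target, sim)
```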

Experiments
3 (real-world and synthetic) datasets:
Business dataset - contains records from businesses registered in the SFO area
Cora dataset - widely used publications dataset
Febrl dataset - dataset generator
2 batch algorithms and 4 incremental approaches:
Batch algorithms: 1) Cautious correlation clustering 2) DB-Index
Incremental approaches: 1) Naive 2) Connected component (CC) 3) Iterative (IT) 4) Greedy

Experiments: Penalty
[Figure: Penalty for Business dataset with Correlation Clustering. Penalty (in K) versus Updates; series: Batch, Naïve, CC, IT, Greedy.]
[Figure: Penalty for Business dataset with DB-Index. Penalty (in K) versus Updates; series: Naïve, CC, IT, Greedy.]
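The slides do not spell out how the penalty is computed. As a rough sketch, one common correlation-clustering formulation charges 1 - sim for every pair placed in the same cluster and sim for every pair placed in different clusters; that formulation, the function name, and the similarity values below are assumptions made for illustration.

```python
from itertools import combinations

def correlation_penalty(clustering, sim):
    """Total penalty of a clustering under the assumed cost model.

    clustering: list of sets of records
    sim:        sim(x, y) -> similarity in [0, 1]
    """
    owner = {r: i for i, cluster in enumerate(clustering) for r in cluster}
    total = 0.0
    for x, y in combinations(sorted(owner), 2):
        s = sim(x, y)
        # Same cluster: pay for dissimilarity. Different clusters: pay for similarity.
        total += (1 - s) if owner[x] == owner[y] else s
    return total

# Invented similarities for three records.
scores = {frozenset({"a", "b"}): 0.9,
          frozenset({"a", "c"}): 0.1,
          frozenset({"b", "c"}): 0.2}
sim = lambda p, q: scores[frozenset({p, q})]

separate = correlation_penalty([{"a", "b"}, {"c"}], sim)
together = correlation_penalty([{"a", "b", "c"}], sim)
```

Under this model a lower value is better, so keeping c apart from a and b is the preferred clustering in the toy input.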

Experiments: Execution Time
[Figure: Execution time for Business dataset with Correlation Clustering. Time (in ms, log scale) versus Updates, with update sizes (Changed, Deleted, Inserted) on a secondary axis; series: Batch, Naïve, CC, IT, Greedy.]
[Figure: Execution time for Business dataset with DB-Index. Time (in ms, log scale) versus Updates, with update sizes (Changed, Deleted, Inserted) on a secondary axis; series: Naïve, CC, IT, Greedy.]

Conclusion
Incremental record linkage is an essential mechanism to improve the overall performance of linkage algorithms.
The performance and quality trade-offs for incremental record linkage are dependent on the applied objective function.
Greedy approximations provide a good alternative to optimal incremental record linkage algorithms.

Thank you! anja.gruenheid@inf.ethz.ch

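Experiments: Business
Measurements for Business dataset with Correlation Clustering and DB-Index:

Algorithm     Mode   Method  Time (s)  Impro.  Penalty
Corr. Clust.         Batch   3.7       -       988
Corr. Clust.  Cont   Naive   6         76.7%   3037
Corr. Clust.  Cont   CC                78.7%   988
Corr. Clust.  Cont   IT      0.6       8.4%    98
Corr. Clust.  Cont   Greedy  0.4       84.%    592
Corr. Clust.  Reset  Naive   0.79      79.7%   072
Corr. Clust.  Reset  CC      0.20      74.2%   987
Corr. Clust.  Reset  IT      0.7       77.7%   987
Corr. Clust.  Reset  Greedy  0.20      74.3%   922
DB-Index      Cont   Naive   997       99%     5426
DB-Index      Cont   CC      57.       94.3%   65
DB-Index      Cont   IT      4.4       98.6%   783
DB-Index      Cont   Greedy  .79       99%     94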

Experiments: Execution Time
[Figure: Execution time for Cora dataset with Correlation Clustering. Time (in ms, log scale) versus Updates, with Update Size (Deleted, Inserted) on a secondary axis; series: Batch, Naïve, CC, IT, Greedy.]

Experiments: Quality
[Figure: Penalty for Cora dataset with Correlation Clustering. Penalty (in K) versus Updates; series: Batch, Naïve, CC, IT, Greedy.]
[Figure: F-Measure for Cora dataset with DB-Index. F-Measure versus Update; series: Batch, Naïve, CC, IT, Greedy.]
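The slides do not state which F-measure variant is reported. A common choice in record linkage evaluation is the pairwise F-measure, which compares the record pairs linked by the predicted clustering with those linked by the ground truth; the sketch below assumes that variant, and the function names are hypothetical.

```python
from itertools import combinations

def linked_pairs(clustering):
    """All unordered record pairs that share a cluster."""
    return {frozenset(pair)
            for cluster in clustering
            for pair in combinations(sorted(cluster), 2)}

def pairwise_f_measure(predicted, truth):
    """Harmonic mean of pairwise precision and recall."""
    pred, gold = linked_pairs(predicted), linked_pairs(truth)
    if not pred or not gold:
        # No pairs on one side: perfect only if both sides link nothing.
        return 1.0 if pred == gold else 0.0
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

truth = [{1, 2, 3}, {4}]
predicted = [{1, 2}, {3, 4}]
score = pairwise_f_measure(predicted, truth)
```

In the toy input one of the two predicted pairs is correct (precision 1/2) and one of the three true pairs is found (recall 1/3).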

Experiments: Execution Time
[Figure: Execution time for Febrl dataset with Correlation Clustering and varying similarity thresholds. Time (in ms, log scale) versus Similarity Threshold; series: Naïve, CC, IT, Greedy.]
[Figure: Execution time for Febrl dataset with Correlation Clustering and varying update sizes. Time (in ms, log scale) versus Update Size; series: Naïve, CC, IT, Greedy.]

Experiments: Quality
[Figure: F-Measure for Febrl dataset with Correlation Clustering and varying similarity thresholds. F-Measure versus Similarity Threshold; series: Naïve, CC, IT, Greedy.]