Incremental Record Linkage. Anja Gruenheid, Xin Luna Dong, Divesh Srivastava

Introduction
What is record linkage? The task of linking records that refer to the same real-world entity.
Why do we need incremental record linkage? Computing record linkage in batch mode is costly. If the underlying data set is modified only slightly, it is more efficient to use an incremental approach.

Example: IRL
Initial data set D0:
BizId  ID   name              street address         city           phone
1      r1   Starbucks         23 MISSION ST STE ST   SAN FRANCISCO  4554350
1      r2   Starbucks         23 MISSION ST          SAN FRANCISCO  4554350
1      r3   Starbucks         23 Mission St          San Francisco  4554350
2      r4   Starbucks Coffee  340 MISSION ST         SAN FRANCISCO  4554350
3      r5   Starbucks Coffee  333 MARKET ST          SAN FRANCISCO  455434786
3      r6   Starbucks         MARKET ST              San Francisco
4      r7   Starbucks Coffee  52 California St       San Francisco  453988630
4      r8   Starbucks Coffee  52 CALIFORNIA ST       SAN FRANCISCO  453988630
5      r9   Starbucks Coffee  295 California St      San Francisco  459862349
5      r10  Starbucks         295 California St      San Francisco
Inserted records:
Update  BizId  ID   name              street address         city           phone
D1      6      r11  Starbucks Coffee  20 Spear Street        San Francisco  459745077
D2      3      r12  Starbucks Coffee  MARKET ST              San Francisco  455434786
D2      3      r13  Starbucks         333 MARKET ST          San Francisco  455434786
D3      1      r14  Starbucks         23 MISSION ST STE ST   SAN FRANCISCO  4554350
D3      1      r15  Starbucks         23 Mission St Ste St   San Francisco  4554350
D4      5      r16  Starbucks         295 CALIFORNIA ST      SAN FRANCISCO  459862349
D4      4      r17  Starbucks         52 California Street   SF             453988630

Example: IRL
Apply incremental record linkage: existing clustering (C1 to C5 over records r1 to r10) + inserted records (r11 to r17) = ?

Example: IRL
[Figure: the clustering that results from applying incremental record linkage to the inserted records r11 to r17, with clusters up to C6.]

Optimal Approaches
Connected Component Approach: update the connected component (set of clusters) that is or was connected to the modified record.
Iterative Approach: iteratively propagate the update through the clusters in the connected component.
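The connected component approach can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the similarity graph is given as an adjacency dictionary `edges`, and `toy_batch` is a hypothetical stand-in for the real batch algorithm (it simply groups linked records). All function names are invented for the example.

```python
from collections import deque

def reachable(edges, start):
    """Return every record reachable from `start` in the similarity graph."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in edges.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

def toy_batch(records, edges):
    """Hypothetical batch algorithm: one cluster per group of linked records."""
    remaining, result = set(records), []
    while remaining:
        group = reachable(edges, next(iter(remaining))) & remaining
        result.append(group)
        remaining -= group
    return result

def update_component(clusters, edges, modified, batch_cluster=toy_batch):
    """Re-cluster only the component that is or was connected to `modified`."""
    component = reachable(edges, modified)
    # Clusters that share a record with the component were connected before.
    touched = [c for c in clusters if c & component]
    untouched = [c for c in clusters if not (c & component)]
    records = component.union(*touched)
    return untouched + batch_cluster(records, edges)

edges = {
    "r1": {"r2"}, "r2": {"r1", "r14"}, "r14": {"r2"},
    "r7": {"r8"}, "r8": {"r7"},
}
clusters = [{"r1", "r2"}, {"r7", "r8"}]
updated = update_component(clusters, edges, "r14")
```

In this toy run only the cluster holding r1 and r2 is re-evaluated; the cluster holding r7 and r8 is carried over without being touched.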

Example: Iterative Approach
1) A modified record is associated with a modified cluster. A modified cluster can be a singleton cluster if the record cannot be associated with an existing cluster.
[Figure: a modified record and its modified cluster.]

Example: Iterative Approach
2) The directly connected component is evaluated with a batch algorithm. The directly connected component consists of the clusters that are directly connected to the modified cluster.

Example: Iterative Approach
3) Iteratively proceed along modified clusters only. The modified clusters are explored iteratively to avoid unnecessary clustering of unmodified clusters.
[Figure: a new modified cluster next to an unmodified cluster.]
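The three steps above can be sketched as a worklist over modified clusters. This is an illustrative sketch under stated assumptions: the similarity graph is an adjacency dictionary `edges`, `toy_batch` is a hypothetical stand-in for the batch algorithm, and termination relies on that batch algorithm being deterministic and stable (re-running it on its own output changes nothing).

```python
def toy_batch(records, edges):
    """Hypothetical batch algorithm: group records linked inside `records`."""
    remaining, result = set(records), []
    while remaining:
        group, frontier = set(), [remaining.pop()]
        while frontier:
            node = frontier.pop()
            group.add(node)
            for neighbour in edges.get(node, ()):
                if neighbour in remaining:
                    remaining.discard(neighbour)
                    frontier.append(neighbour)
        result.append(group)
    return result

def iterative_update(clusters, edges, new_record, batch_cluster=toy_batch):
    clusters = [frozenset(c) for c in clusters]
    start = frozenset([new_record])   # step 1: singleton modified cluster
    clusters.append(start)
    queue = [start]
    while queue:
        current = queue.pop()
        if current not in clusters:   # already replaced by an earlier round
            continue
        # Step 2: collect the clusters directly connected to the modified one
        # and evaluate them together with the batch algorithm.
        linked = {n for r in current for n in edges.get(r, ())}
        local = [c for c in clusters if c == current or c & linked]
        records = set().union(*local)
        result = [frozenset(c) for c in batch_cluster(records, edges)]
        # Step 3: only clusters that actually changed are explored further.
        changed = [c for c in result if c not in local]
        clusters = [c for c in clusters if c not in local] + result
        queue.extend(changed)
    return clusters

edges = {
    "r1": {"r2"}, "r2": {"r1", "r11"}, "r11": {"r2", "r3"}, "r3": {"r11"},
    "r7": {"r8"}, "r8": {"r7"},
}
final = iterative_update([{"r1", "r2"}, {"r3"}, {"r7", "r8"}], edges, "r11")
```

The cluster holding r7 and r8 is never passed to the batch algorithm, which is the saving the iterative approach aims for.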

Approximation Approach
The greedy variation of the iterative approach uses the iterative mechanism of propagating modifications through modified clusters only, and uses a locally optimal decision function to create, merge, split, or move records across clusters.

Greedy Operations: Merge
If the benefits of merging the records into one cluster outweigh the penalties, then merge them.
[Figure: merge example.]

Greedy Operations: Split
If the benefits of separating the records in one cluster into two clusters outweigh the penalties, then split them.
[Figure: split example.]

Greedy Operations: Move
If removing a record from one cluster and adding it to another decreases the overall penalty, then move the record.
[Figure: move example.]
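The merge, split, and move decisions can be sketched as "apply the operation only if it lowers the penalty". The slides do not define the penalty, so this example assumes a correlation-clustering style objective: a pair in the same cluster costs 1 minus its similarity, a pair in different clusters costs its similarity, and missing similarities count as 0. All function names and the `sim` dictionary layout are hypothetical.

```python
from itertools import combinations

def penalty(clusters, sim):
    """Assumed objective: cost of dissimilar pairs kept together plus
    cost of similar pairs kept apart."""
    where = {r: i for i, c in enumerate(clusters) for r in c}
    total = 0.0
    for a, b in combinations(sorted(where), 2):
        s = sim.get(frozenset((a, b)), 0.0)
        total += (1.0 - s) if where[a] == where[b] else s
    return total

def keep_if_better(candidate, clusters, sim):
    """Locally optimal decision: accept the change only if penalty drops."""
    return candidate if penalty(candidate, sim) < penalty(clusters, sim) else clusters

def try_merge(clusters, i, j, sim):
    rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
    return keep_if_better(rest + [clusters[i] | clusters[j]], clusters, sim)

def try_split(clusters, i, part, sim):
    remainder = clusters[i] - part
    if not part or not remainder:
        return clusters
    candidate = clusters[:i] + [remainder, set(part)] + clusters[i + 1:]
    return keep_if_better(candidate, clusters, sim)

def try_move(clusters, record, src, dst, sim):
    candidate = [set(c) for c in clusters]
    candidate[src].discard(record)
    candidate[dst].add(record)
    candidate = [c for c in candidate if c]   # drop a cluster left empty
    return keep_if_better(candidate, clusters, sim)

pair = lambda a, b: frozenset((a, b))
merged = try_merge(
    [{"r5", "r6"}, {"r13"}], 0, 1,
    {pair("r5", "r6"): 0.9, pair("r5", "r13"): 0.8, pair("r6", "r13"): 0.7},
)
split = try_split([{"r1", "r4"}], 0, {"r4"}, {pair("r1", "r4"): 0.2})
moved = try_move(
    [{"r7", "r8", "r9"}, {"r10"}], "r9", 0, 1,
    {pair("r7", "r8"): 0.9, pair("r9", "r10"): 0.9},
)
```

Each operation evaluates a single candidate change, so the decisions are local and cheap but not guaranteed to reach the globally optimal clustering.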

Experiments
3 datasets (real-world and synthetic):
Business dataset: contains records from businesses registered in the SFO area
Cora dataset: widely used publications dataset
Febrl dataset: dataset generator
2 batch algorithms: 1) Cautious correlation clustering, 2) DB-Index
4 incremental approaches: 1) Naive, 2) Connected component (CC), 3) Iterative (IT), 4) Greedy

Experiments: Penalty
[Figure: penalty (in K) per update for the Business dataset with Correlation Clustering; series: Batch, Naïve, CC, IT, Greedy.]
[Figure: penalty (in K) per update for the Business dataset with DB-Index; series: Naïve, CC, IT, Greedy.]

Experiments: Execution Time
[Figure: execution time (in ms, log scale) per update for the Business dataset with Correlation Clustering; update sizes shown as Changed, Deleted, Inserted; series: Batch, Naïve, CC, IT, Greedy.]
[Figure: execution time (in ms, log scale) per update for the Business dataset with DB-Index; update sizes shown as Changed, Deleted, Inserted; series: Naïve, CC, IT, Greedy.]

Conclusion
Incremental record linkage is an essential mechanism to improve the overall performance of linkage algorithms.
The performance and quality trade-offs for incremental record linkage depend on the applied objective function.
Greedy approximations provide a good alternative to optimal incremental record linkage algorithms.

Experiments: Business
Measurements for Business dataset with Correlation Clustering and DB-Index:
                      Method  Time (s)  Impro.  Penalty
Corr. Clust.          Batch   3.7       -       988
Corr. Clust.  Cont    Naive   6         76.7%   3037
Corr. Clust.  Cont    CC                78.7%   988
Corr. Clust.  Cont    IT      0.6       8.4%    98
Corr. Clust.  Cont    Greedy  0.4       84.%    592
Corr. Clust.  Reset   Naive   0.79      79.7%   072
Corr. Clust.  Reset   CC      0.20      74.2%   987
Corr. Clust.  Reset   IT      0.7       77.7%   987
Corr. Clust.  Reset   Greedy  0.20      74.3%   922
DB-Index      Cont    Naive   997       99%     5426
DB-Index      Cont    CC      57.       94.3%   65
DB-Index      Cont    IT      4.4       98.6%   783
DB-Index      Cont    Greedy  .79       99%     94

Experiments: Execution Time
[Figure: execution time (in ms, log scale) per update for the Cora dataset with Correlation Clustering; update sizes shown as Deleted, Inserted; series: Batch, Naïve, CC, IT, Greedy.]

Experiments: Quality
[Figure: penalty (in K) per update for the Cora dataset with Correlation Clustering; series: Batch, Naïve, CC, IT, Greedy.]
[Figure: F-Measure per update for the Cora dataset with DB-Index; series: Batch, Naïve, CC, IT, Greedy.]

Experiments: Execution Time
[Figure: execution time (in ms, log scale) for the Febrl dataset with Correlation Clustering and varying similarity thresholds; series: Naïve, CC, IT, Greedy.]
[Figure: execution time (in ms, log scale) for the Febrl dataset with Correlation Clustering and varying update sizes; series: Naïve, CC, IT, Greedy.]

Experiments: Quality
[Figure: F-Measure for the Febrl dataset with Correlation Clustering and varying similarity thresholds; series: Naïve, CC, IT, Greedy.]
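The slides do not state which F-Measure variant is plotted. As a hedged illustration, the sketch below computes the pairwise F-measure that is commonly used for clustering-based record linkage: precision and recall are taken over pairs of records placed in the same cluster. The function names are hypothetical.

```python
from itertools import combinations

def same_cluster_pairs(clusters):
    """All unordered record pairs that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}

def pairwise_f_measure(predicted, truth):
    """Harmonic mean of pairwise precision and recall (assumed variant)."""
    p, t = same_cluster_pairs(predicted), same_cluster_pairs(truth)
    correct = len(p & t)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(p), correct / len(t)
    return 2 * precision * recall / (precision + recall)

truth = [{"r1", "r2", "r3"}, {"r4"}]
predicted = [{"r1", "r2"}, {"r3", "r4"}]
score = pairwise_f_measure(predicted, truth)
```

Here one of the two predicted pairs is correct (precision 1/2) and one of the three true pairs is found (recall 1/3), so the score is 0.4.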