Annis on MonetDB
  Viktor Rosenfeld, rosenfel@informatik.hu-berlin.de
  14 January 2013
  Advisors: Prof. Dr. Ulf Leser and Dr. Stefan Manegold
http://www.flickr.com/photos/karola/3623768629
  1. What is Annis and how is it used?
  2. Current implementation on PostgreSQL
  3. What are Column-Stores? How can Annis benefit?
  4. New implementation on MonetDB and evaluation
1. What is Annis and how is it used?
What's a corpus?
  any principled collection of language
What's an annotation?
  classification and interpretation of the corpus data
  additional data to enrich the corpus
  Steilpass, Märkische Allgemeine Zeitung, 12.10.2001
  Potsdam Commentary Corpus (Stede, 2004)
Annis
  Query
  Annotations
  Corpus selection
  Export for statistical analysis
Annis query language
  cat="S" &      find a sentence
  "Wunder" &     and find the phrase Wunder
  #1 _i_ #2      the sentence includes the phrase Wunder
  RDBMS: SELECT id1, id2 FROM ... WHERE ...
2. Current implementation on PostgreSQL
Database schema
  node: id, namespace, name, text_ref, left, right, span, token_index, left_token, right_token, corpus_ref, toplevel_corpus
  node_annotation: node_ref, namespace, name, value
  rank: pre, post, parent, node_ref, component_ref, root, level
  edge_annotation: rank_ref, namespace, name, value
  component: id, type, name, namespace
Example 1: Text search
  AQL: "Wunder"
  SQL: node.span = 'Wunder'
Example 2: Annotation search
  AQL: cat="S"
  SQL: node_annotation.name = 'cat' AND node_annotation.value = 'S'
Example 3: Inclusion operator
  AQL: #1 _i_ #2
  SQL: node1.text_ref = node2.text_ref AND node1.left <= node2.left AND node1.right >= node2.right
  (node #1 includes node #2 when both lie in the same text and the span of #1 encloses the span of #2)
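The three examples can be combined into one runnable sketch. It is only an illustration under assumptions: it uses Python's built-in sqlite3 module instead of PostgreSQL, keeps only the columns the examples mention, fills the tables with made-up rows, and the helper name find_sentences_including is hypothetical. The inclusion condition follows the reading that the span of #1 (the sentence) encloses the span of #2 (the token).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER, text_ref INTEGER, "left" INTEGER, "right" INTEGER, span TEXT);
    CREATE TABLE node_annotation (node_ref INTEGER, name TEXT, value TEXT);
    -- Made-up sample data: one sentence node and two token nodes in different texts.
    INSERT INTO node VALUES (1, 1, 1, 30, NULL), (2, 1, 1, 6, 'Wunder'), (3, 2, 1, 6, 'Wunder');
    INSERT INTO node_annotation VALUES (1, 'cat', 'S'), (2, 'pos', 'NN'), (3, 'pos', 'NN');
""")

# One possible SQL translation of the AQL query: cat="S" & "Wunder" & #1 _i_ #2
QUERY = """
    SELECT node1.id AS id1, node2.id AS id2
    FROM node AS node1
    JOIN node_annotation AS anno1 ON anno1.node_ref = node1.id
    JOIN node AS node2 ON node2.text_ref = node1.text_ref
    WHERE anno1.name = 'cat' AND anno1.value = 'S'   -- annotation search
      AND node2.span = 'Wunder'                      -- text search
      AND node1."left" <= node2."left"               -- inclusion operator
      AND node1."right" >= node2."right"
"""

def find_sentences_including(connection):
    """Return (sentence id, token id) pairs that satisfy the query."""
    return connection.execute(QUERY).fetchall()

matches = find_sentences_including(conn)
print(matches)
```

Only the token in the same text as the sentence is paired with it; the token with text_ref 2 is filtered out by the join condition on text_ref.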
Many tables, many joins
  Annotation searches
  Edge annotations
  Binary relations on edges
  Bad performance on PostgreSQL
Solution 1: One big table
  node (id, span, text_ref, left, right, ...) and node_annotation (node_ref, namespace, name, value) are combined into one table:
  id  span    text_ref  left  right  ...  a_name  a_value      ...
  1           1         1     30          cat     S
  2   Wunder  1         1     5           morph   Acc.Pl.Neut
  2   Wunder  1         1     5           pos     NN
  2   Wunder  1         1     5           lemma   Wunder
  ...
  Pro: Fewer joins
  Contra: Increased redundancy, less extensible
Solution 2: Combined indexes
  Table columns: id, span, text_ref, left, right, ..., a_name, a_value, ...
  One index over 4 columns
  Find nodes spanning a certain word, in a certain text, at a certain position.
  Example query: cat="S" & "Wunder" & #1 _i_ #2
  Pro: Potentially very fast
  Contra: Uses lots of disk space
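Both solutions can be sketched in a few SQL statements. The following is a minimal illustration, not the Annis schema: it runs on Python's built-in sqlite3 module instead of PostgreSQL, the table name facts and the index name idx_facts_span_position are hypothetical, and the choice of span, text_ref, left and right as the four indexed columns is an assumption based on "a certain word, in a certain text, at a certain position".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER, span TEXT, text_ref INTEGER, "left" INTEGER, "right" INTEGER);
    CREATE TABLE node_annotation (node_ref INTEGER, name TEXT, value TEXT);
    INSERT INTO node VALUES (1, NULL, 1, 1, 30), (2, 'Wunder', 1, 1, 5);
    INSERT INTO node_annotation VALUES
        (1, 'cat', 'S'), (2, 'morph', 'Acc.Pl.Neut'), (2, 'pos', 'NN'), (2, 'lemma', 'Wunder');

    -- Solution 1: materialize the join once, so later queries need no join.
    CREATE TABLE facts AS
        SELECT n.id, n.span, n.text_ref, n."left", n."right",
               a.name AS a_name, a.value AS a_value
        FROM node AS n JOIN node_annotation AS a ON a.node_ref = n.id;

    -- Solution 2: one combined index over four columns of the big table.
    CREATE INDEX idx_facts_span_position ON facts (span, text_ref, "left", "right");
""")

rows = conn.execute(
    "SELECT id, span, a_name, a_value FROM facts ORDER BY id, a_name"
).fetchall()
# Redundancy: the node data of 'Wunder' is stored once per annotation.
copies_of_wunder = conn.execute(
    "SELECT COUNT(*) FROM facts WHERE span = 'Wunder'"
).fetchone()[0]
index_names = [row[1] for row in conn.execute("PRAGMA index_list('facts')")]

for row in rows:
    print(row)
```

The materialized table holds one row per node and annotation pair, which is where the increased redundancy comes from; every additional combined index then adds its own copy of the indexed columns on disk.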
Disk usage in PostgreSQL
  TIGER Treebank 2.1: ca. 50,000 sentences, 900,000 tokens, 3 million annotations, 1 million edges
  Data files                    280 MB
  Normalized (many tables)      525 MB
  Materialized (one table)      1.2 GB
  Materialized + indexes        7.7 GB
  Increase by factor 15 (or almost 30)
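The two factors can be checked with a short calculation. This sketch assumes decimal units (1 GB = 1000 MB) and reads "factor 15" as the ratio to the normalized schema and "almost 30" as the ratio to the raw data files; both readings are inferred from the numbers, not stated on the slide.

```python
# Sizes from the slide, in MB (assuming 1 GB = 1000 MB).
sizes_mb = {
    "data files": 280,
    "normalized": 525,
    "materialized": 1200,
    "materialized + indexes": 7700,
}

largest = sizes_mb["materialized + indexes"]
factor_vs_normalized = largest / sizes_mb["normalized"]
factor_vs_data_files = largest / sizes_mb["data files"]

print(f"vs. normalized schema: {factor_vs_normalized:.1f}x")
print(f"vs. data files:        {factor_vs_data_files:.1f}x")
```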
3. What are Column-Stores? How can Annis benefit?
What's a Column-Store?
  Conceptual model (table):
       node_ref  name   value
    1  123       pos    VVINF
    2  123       lemma  essen
    3  456       pos    NN
  Storage model, rows:
    1 123 pos VVINF | 2 123 lemma essen | 3 456 pos NN
  Storage model, columns:
    node_ref: 123 123 456
    name:     pos lemma pos
    value:    VVINF essen NN
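The two storage models can be illustrated with plain Python lists. This is a toy model under assumptions, not how any particular database lays out its files; the helper record_from_columns is hypothetical.

```python
# The conceptual table: three records with the attributes node_ref, name, value.
records = [(123, "pos", "VVINF"), (123, "lemma", "essen"), (456, "pos", "NN")]

# Row layout: all values of one record are stored next to each other.
row_store = [value for record in records for value in record]

# Column layout: all values of one attribute are stored next to each other.
column_store = {
    "node_ref": [record[0] for record in records],
    "name": [record[1] for record in records],
    "value": [record[2] for record in records],
}

def record_from_columns(store, position):
    """Rebuild one record by reading the same position in every column."""
    return (store["node_ref"][position], store["name"][position], store["value"][position])

print(row_store)
print(column_store["name"])
print(record_from_columns(column_store, 1))
```

In the column layout a record is identified by its position, so reading a whole record means touching every column, while reading one attribute of all records touches a single contiguous list.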
Why Column-Stores? Why Databases?
  [Diagram: data moves between two storage levels. One is very big but too slow (sequential access is fast(er) than random access); the other is very fast but too small. Bridging them is the job of the database (traditionally).]
Caches between RAM and CPU
  Size     Access time
  48 GB    33 ns
  12 MB    5.4 ns
  256 kB   1.7 ns
  32 kB    1.4 ns
  job of the database on a modern system (among others)
Cache usage of row layout
  Query: compare name attribute with value 'lemma'
  Data file (rows): 1 123 pos VVINF | 2 123 lemma essen | 3 456 pos NN
  L1 cache after the scan: 123 pos VVINF | 123 lemma essen | 456 pos NN
  1. load first row
  2. locate name attribute
  3. test name attribute
  4. load second row
  5. locate name attribute
  6. test name attribute
  7. load third row
  8. locate name attribute
  9. test name attribute
Cache usage of column layout
  Query: compare name attribute with value 'lemma'
  Data file (columns): node_ref: 123 123 456 | name: pos lemma pos | value: VVINF essen NN
  L1 cache: pos lemma pos
  1. load name column
  2. test first name attribute
  3. test second name attribute
  4. test third name attribute
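The difference between the two scans can be made concrete by counting how many values each one pulls into the cache. This is a toy model under assumptions: it counts values instead of cache lines, and the function names are hypothetical.

```python
ROWS = [(123, "pos", "VVINF"), (123, "lemma", "essen"), (456, "pos", "NN")]
NAME = 1  # position of the name attribute inside a row

def scan_row_layout(rows, wanted):
    """Load every row completely, then locate and test its name attribute."""
    values_loaded, matches = 0, []
    for row_number, row in enumerate(rows, start=1):
        values_loaded += len(row)      # the whole row ends up in the cache
        if row[NAME] == wanted:        # locate and test the name attribute
            matches.append(row_number)
    return matches, values_loaded

def scan_column_layout(name_column, wanted):
    """Load only the name column and test its entries one after another."""
    values_loaded = len(name_column)
    matches = [i for i, name in enumerate(name_column, start=1) if name == wanted]
    return matches, values_loaded

name_column = [row[NAME] for row in ROWS]
row_result = scan_row_layout(ROWS, "lemma")
column_result = scan_column_layout(name_column, "lemma")
print("row layout:   ", row_result)
print("column layout:", column_result)
```

Both scans find the same record, but the row scan loads nine values to test three of them, while the column scan loads only the three it tests.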
Column operations in Annis
  Search terms can be indexed: "Wunder"
  Regular expressions can often be indexed, but not always: morph=/.*\.Pl\.Neut/
  Binary operations (_=_, _i_, >) can be indexed, but they need many indexes and are slow if there are many index lookups
  id  span    text_ref  left  right  ...
  1           1         1     30
  2   Wunder  1         1     5
  2   Wunder  1         1     5
  2   Wunder  1         1     5
  ...
4. New implementation on MonetDB and evaluation
Prototype implementation
  Supported
    COUNT queries
    Annis 2 Query Language
  Not supported
    Annis 3 language features
    corpus selection
    ANNOTATE, MATRIX queries
Realistic test workload
  Corpus: TIGER Treebank 2.1
  Queries: 3 month query log of the Annis instance at the SFB 632
    337 TIGER queries (224 unique)
    up to 4 search terms
    up to 6 binary operators
  Random workload: 10000 queries
    original distribution
    excluded PostgreSQL timeout
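One plausible way to build such a random workload is to draw queries from the log with replacement, so that frequent queries stay frequent. The sketch below shows only that reading of "original distribution": the log contents are made up, and sample_workload is a hypothetical helper, not the procedure used in the evaluation.

```python
import random
from collections import Counter

# Made-up stand-in for the query log; repeated entries model repeated queries.
query_log = ['"der"', '"der"', '"der"', 'cat="S"', 'cat="S"', 'pos="NN"']

def sample_workload(log, size, seed=0):
    """Draw `size` queries uniformly from the log, with replacement.

    Because duplicates stay in the log, each distinct query is drawn
    in proportion to how often it occurred originally.
    """
    rng = random.Random(seed)
    return rng.choices(log, k=size)

workload = sample_workload(query_log, 10000)
frequencies = Counter(workload)
print(frequencies.most_common())
```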
Workload of 10000 queries
  [Bar chart: total run time in hours, MonetDB vs. PostgreSQL]
  Server (48 GB RAM): 25 minutes, 1 hour 37 minutes
  Laptop (4 GB RAM): 29 minutes, 5 hours 47 minutes
  Disk usage: data files 280 MB, MonetDB 396 MB, PostgreSQL 7.7 GB (factor 20)
Individual query performance
  [Chart: query time in seconds (log scale, 0.01 to 60) for MonetDB and PostgreSQL; DNC marked above 60]
Simple queries are fast
  Query              MonetDB   PostgreSQL
  node               6.6 ms    2926 ms
  "der"              5.1 ms    226 ms
  /der.*/            19 ms     383 ms
  cat="S"            41 ms     184 ms
  lemma="waschen"    43 ms     14 ms
Influence of result size
  Query          Results   MonetDB   PostgreSQL
  pos="VVIMP"    162       43 ms     15 ms
  pos="VVPP"     17770     43 ms     111 ms
  pos="VVFIN"    35628     44 ms     182 ms
  pos="ADJA"     54534     43 ms     246 ms
Queries with millions of results
  Query: lemma="müssen" & pos=/VV.*/ & pos="$." & #1 .* #2 & #2 .* #3
    Results: 4.5 M    MonetDB: 2 s    PostgreSQL: 35 s
  Query: pos=/VM.*/ & pos=/VV.*/ & pos=/.*/ & #1 .* #2 & #2 .* #3
    Results: 384 M    MonetDB: 175 s    PostgreSQL: > 1 h
Fast regular expressions
  A regular expression without a fixed prefix can't use an index; the entire column needs to be scanned.
  Query                                                                     MonetDB   PostgreSQL
  /.*sich.*/                                                                213 ms    4206 ms
  /[Kk]a.*/                                                                 219 ms    2812 ms
  pos="VVPP" & lemma=/(ge)?kommen/ & #1 _=_ #2                              229 ms    383 ms
  pos=/N.*/ & /[12][09][0-9][0-9]/ & #1 _=_ #2                              246 ms    2902 ms
  lemma=/[^äöü]+/ & /.+[äöü].+/ & pos="NN" & #1 _=_ #2 & #2 _=_ #3          469 ms    6246 ms
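Why a fixed prefix matters can be sketched with a sorted list standing in for an index. This is an illustration under assumptions: the word list is made up, the function names are hypothetical, a real B-tree index works on disk pages rather than a Python list, and the patterns are matched against the whole value.

```python
import bisect
import re

# Made-up span column; the sorted copy plays the role of an index.
column = ["Kanzler", "der", "deren", "die", "sich", "sichern", "versichert"]
sorted_index = sorted(column)

def prefix_range(index, prefix):
    """Binary search for the slice of entries that start with `prefix`.

    Assumes no entry contains the character U+FFFF.
    """
    low = bisect.bisect_left(index, prefix)
    high = bisect.bisect_left(index, prefix + "\uffff")
    return index[low:high]

def match_with_prefix(index, prefix, pattern):
    """Fixed prefix: only the candidates inside the prefix range are tested."""
    regex = re.compile(pattern)
    return [word for word in prefix_range(index, prefix) if regex.fullmatch(word)]

def match_by_scan(values, pattern):
    """No fixed prefix: every value in the column has to be tested."""
    regex = re.compile(pattern)
    return [word for word in values if regex.fullmatch(word)]

with_prefix = match_with_prefix(sorted_index, "der", r"der.*")
without_prefix = match_by_scan(column, r".*sich.*")
print(with_prefix)
print(without_prefix)
```

The prefix search inspects two candidates, while the pattern without a fixed prefix has to look at all seven values; a system that scans a compact column quickly pays much less for the second case.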
Advantages
  MonetDB
    better overall performance
    stable query performance
    fast regular expressions
    normalized schema
    greatly reduced disk consumption
  PostgreSQL
    queries with highly selective search term
    complete implementation
    bug-free SQL processing
    better use of limited resources
Summary
  prototypical implementation of Annis on MonetDB
  test scenario from an Annis installation in service
  in-depth performance comparison of Annis on MonetDB and PostgreSQL
  SELECT vielen FROM dank;