Annis on MonetDB. Viktor Rosenfeld 14. January Advisors: Prof. Dr. Ulf Leser and Dr.


Annis on MonetDB
Viktor Rosenfeld, rosenfel@informatik.hu-berlin.de
14. January 2013
Advisors: Prof. Dr. Ulf Leser and Dr. Stefan Manegold

http://www.flickr.com/photos/karola/3623768629

1. What is Annis and how is it used?
2. Current implementation on PostgreSQL
3. What are Column-Stores? How can Annis benefit?
4. New implementation on MonetDB and evaluation

1. What is Annis and how is it used?

What's a corpus?
  any principled collection of language

What's an annotation?
  classification and interpretation of the corpus data
  additional data to enrich the corpus
  Example text: "Steilpass", Märkische Allgemeine Zeitung, 12.10.2001, Potsdam Commentary Corpus (Stede, 2004)


Annis
  Query
  Annotations
  Corpus selection
  Export for statistical analysis

Annis query language
  cat="S" &       find a sentence
  "Wunder" &      and find the phrase Wunder
  #1 _i_ #2       the sentence includes the phrase Wunder
The query is translated to SQL for an RDBMS: SELECT id1, id2 FROM ... WHERE ...

2. Current implementation on PostgreSQL

Database schema
  node: id, namespace, name, text_ref, left, right, span, token_index, left_token, right_token, corpus_ref, toplevel_corpus
  node_annotation: node_ref, namespace, name, value
  rank: pre, post, parent, node_ref, component_ref, root, level
  edge_annotation: rank_ref, namespace, name, value
  component: id, type, name, namespace

Example 1: Text search
  AQL: "Wunder"
  SQL: nodeN.span = 'Wunder'

Example 2: Annotation search
  AQL: cat="S"
  SQL: node_annotationN.name = 'cat'
       node_annotationN.value = 'S'

Example 3: Inclusion operator
  AQL: #1 _i_ #2
  SQL: node1.text_ref = node2.text_ref
       node1.left <= node2.left
       node1.right >= node2.right
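The three examples can be combined into one query. The following is a minimal sketch, not the SQL that Annis actually generates: it uses SQLite in place of PostgreSQL, a hypothetical miniature of the node and node_annotation tables with invented rows, and it assumes that #1 is the including node, so its span covers that of #2.

```python
import sqlite3

# Hypothetical miniature of the node and node_annotation tables; the rows are
# invented for illustration and only reuse values that appear on the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node (id INTEGER, text_ref INTEGER, "left" INTEGER,
                   "right" INTEGER, span TEXT);
CREATE TABLE node_annotation (node_ref INTEGER, name TEXT, value TEXT);
INSERT INTO node VALUES (1, 1, 1, 30, NULL),
                        (2, 1, 1, 5, 'Wunder'),
                        (3, 2, 1, 5, 'Wunder');
INSERT INTO node_annotation VALUES (1, 'cat', 'S'),
                                   (2, 'pos', 'NN'),
                                   (3, 'pos', 'NN');
""")

# AQL: cat="S" & "Wunder" & #1 _i_ #2
query = """
SELECT node1.id, node2.id
FROM node AS node1
JOIN node_annotation AS node_annotation1
  ON node_annotation1.node_ref = node1.id
JOIN node AS node2
  ON node1.text_ref = node2.text_ref          -- same text
WHERE node_annotation1.name = 'cat'           -- annotation search
  AND node_annotation1.value = 'S'
  AND node2.span = 'Wunder'                   -- text search
  AND node1."left" <= node2."left"            -- inclusion: #1 covers #2
  AND node1."right" >= node2."right"
"""
matches = conn.execute(query).fetchall()
print(matches)
```

Node 3 carries the same word but belongs to a different text, so the text_ref condition excludes it. Even this small query needs two joins, which is the cost the following slides address.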

Many tables, many joins
  Annotation searches
  Edge annotations
  Binary relations on edges
Bad performance on PostgreSQL

Solution 1: One big table
node (id, span, text_ref, left, right, ...) and node_annotation (node_ref, namespace, name, value) are combined into one table:

  id  span    text_ref  left  right  ...  a_name  a_value       ...
  1           1         1     30          cat     S
  2   Wunder  1         1     5           morph   Acc.Pl.Neut
  2   Wunder  1         1     5           pos     NN
  2   Wunder  1         1     5           lemma   Wunder
  ...

Pro: Fewer joins
Contra: Increased redundancy, less extensible

Solution 2: Combined indexes

  id  span    text_ref  left  right  ...  a_name  a_value       ...
  1           1         1     30          cat     S
  2   Wunder  1         1     5           morph   Acc.Pl.Neut
  2   Wunder  1         1     5           pos     NN
  2   Wunder  1         1     5           lemma   Wunder
  ...

One index over 4 columns: find nodes spanning a certain word, in a certain text, at a certain position.
  cat="S" & "Wunder" & #1 _i_ #2
Pro: Potentially very fast
Contra: Uses lots of disk space
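A combined index of this kind might look as follows. This is a sketch under assumptions: SQLite stands in for PostgreSQL, the index name idx_node_combined and the two rows are made up, and the choice of span, text_ref, left and right as the four columns is read off the slide's description (word, text, position).

```python
import sqlite3

# SQLite stands in for PostgreSQL; the rows and the index name
# idx_node_combined are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node (id INTEGER, span TEXT, text_ref INTEGER,
                   "left" INTEGER, "right" INTEGER);
INSERT INTO node VALUES (1, NULL, 1, 1, 30), (2, 'Wunder', 1, 1, 5);
-- One index over 4 columns: word, text, and position.
CREATE INDEX idx_node_combined ON node (span, text_ref, "left", "right");
""")

# Find nodes spanning a certain word, in a certain text, inside a given range.
lookup = """
SELECT id FROM node
WHERE span = 'Wunder' AND text_ref = 1 AND "left" >= 1 AND "right" <= 30
"""
plan = conn.execute("EXPLAIN QUERY PLAN " + lookup).fetchall()
plan_details = [row[-1] for row in plan]   # last column holds the description
result = conn.execute(lookup).fetchall()
print(plan_details)
print(result)
```

The query plan should name the combined index, so the lookup does not have to scan the table. The price is one more copy of four columns on disk for each such index.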

Disk usage in PostgreSQL
TIGER Treebank 2.1: ca. 50,000 sentences, 900,000 tokens, 3 million annotations, 1 million edges

  Data files                  280 MB
  Normalized (many tables)    525 MB
  Materialized (one table)    1.2 GB
  Materialized + indexes      7.7 GB

Increase by factor 15 over the normalized schema (or almost 30 over the data files)

3. What are Column-Stores? How can Annis benefit?

What's a Column-Store?
Conceptual model: a table

       node_ref  name   value
  1    123       pos    VVINF
  2    123       lemma  essen
  3    456       pos    NN

Storage model, rows:
  1 123 pos VVINF | 2 123 lemma essen | 3 456 pos NN

Storage model, columns:
  node_ref: 123 123 456
  name:     pos lemma pos
  value:    VVINF essen NN
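The two storage models can be illustrated with plain Python lists. This is only a sketch of the idea; the variable names are made up and no real database stores its data this way.

```python
# Conceptual table from the slide: (node_ref, name, value)
rows = [
    (123, "pos", "VVINF"),
    (123, "lemma", "essen"),
    (456, "pos", "NN"),
]

# Row layout: all attributes of one row are stored next to each other.
row_store = [field for row in rows for field in row]

# Column layout: all values of one attribute are stored next to each other.
column_names = ("node_ref", "name", "value")
column_store = {
    name: [row[index] for row in rows]
    for index, name in enumerate(column_names)
}

print(row_store)
print(column_store["name"])
```

Both layouts hold the same nine values. They differ only in which values end up adjacent, and that decides what has to be loaded when a query touches a single attribute.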

Why Column-Stores? Why Databases?
[Diagram: data on two storage levels. One level is very big but too slow for random access; sequential access is fast(er). The other level is very fast but too small. Bridging the two is (traditionally) the job of the database.]

Caches between RAM and CPU

  Size      Access time
  48 GB     33 ns
  12 MB     5.4 ns
  256 kB    1.7 ns
  32 kB     1.4 ns

Job of the database on a modern system (among others)

Cache usage of row layout
Query: compare name attribute with value 'lemma'
Data file: 1 123 pos VVINF | 2 123 lemma essen | 3 456 pos NN
L1 cache: one row at a time (123 pos VVINF, then 123 lemma essen, then 456 pos NN)
  1. load first row
  2. locate name attribute
  3. test name attribute
  4. load second row
  5. locate name attribute
  6. test name attribute
  7. load third row
  8. locate name attribute
  9. test name attribute

Cache usage of column layout
Query: compare name attribute with value 'lemma'
Data file:
  node_ref: 123 123 456
  name:     pos lemma pos
  value:    VVINF essen NN
L1 cache: pos lemma pos
  1. load name column
  2. test first name attribute
  3. test second name attribute
  4. test third name attribute
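The difference between the two walkthroughs can be made concrete by counting how many values each scan has to bring into the cache. This is a toy model under a strong assumption: one "load" per stored value, ignoring real cache-line sizes and prefetching. The function names are hypothetical.

```python
def scan_rows(rows, attribute_index, wanted):
    """Row layout: load each row in full, then locate and test one attribute."""
    loaded = 0
    hits = []
    for position, row in enumerate(rows):
        loaded += len(row)                   # the whole row enters the cache
        if row[attribute_index] == wanted:   # locate and test the attribute
            hits.append(position)
    return hits, loaded


def scan_column(column, wanted):
    """Column layout: load only the one column, then test its values in turn."""
    loaded = len(column)                     # only this column enters the cache
    hits = [position for position, value in enumerate(column) if value == wanted]
    return hits, loaded


rows = [(123, "pos", "VVINF"), (123, "lemma", "essen"), (456, "pos", "NN")]
name_column = [row[1] for row in rows]

row_hits, row_loaded = scan_rows(rows, attribute_index=1, wanted="lemma")
column_hits, column_loaded = scan_column(name_column, wanted="lemma")
print(row_hits, row_loaded)
print(column_hits, column_loaded)
```

Both scans find the same row, but the row layout touches every attribute of every row while the column layout touches only the name values. The gap grows with the number of columns in the table.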

Column operations in Annis
Search terms can be indexed
  "Wunder"
Regular expressions can often be indexed, but not always
  morph=/.*\.Pl\.Neut/
Binary operations can be indexed
  need many indexes
  slow if there are many index lookups
  _=_   _i_   >

  id  span    text_ref  left  right  ...
  1           1         1     30
  2   Wunder  1         1     5
  2   Wunder  1         1     5
  2   Wunder  1         1     5
  ...

4. New implementation on MonetDB and evaluation

Prototype implementation
Supported:
  COUNT queries
  Annis 2 Query Language
Not supported:
  Annis 3 language features
  corpus selection
  ANNOTATE, MATRIX queries

Realistic test workload
Corpus: TIGER Treebank 2.1
Queries: 3 month query log of the Annis instance at the SFB 632
  337 TIGER queries (224 unique)
  up to 4 search terms
  up to 6 binary operators
Random workload: 10000 queries
  original distribution
  excluded PostgreSQL timeout

Workload of 10000 queries
Run time:

                        MonetDB       PostgreSQL
  Server (48 GB RAM)    25 minutes    1 hour 37 minutes
  Laptop (4 GB RAM)     29 minutes    5 hours 47 minutes

Disk usage:

  Data files    280 MB
  MonetDB       396 MB
  PostgreSQL    7.7 GB   (factor 20)
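The ratios behind these numbers can be checked with a few lines of arithmetic. The sketch assumes that the bar labels pair up as server: MonetDB 25 minutes, PostgreSQL 1 hour 37 minutes, and laptop: MonetDB 29 minutes, PostgreSQL 5 hours 47 minutes, and that 1 GB equals 1024 MB.

```python
from datetime import timedelta

# Run times for the 10000-query workload, as read from the chart labels
# (the pairing of labels to systems is an assumption).
runtimes = {
    "server": {"MonetDB": timedelta(minutes=25),
               "PostgreSQL": timedelta(hours=1, minutes=37)},
    "laptop": {"MonetDB": timedelta(minutes=29),
               "PostgreSQL": timedelta(hours=5, minutes=47)},
}

# Dividing two timedeltas yields a plain float ratio.
speedup = {
    machine: times["PostgreSQL"] / times["MonetDB"]
    for machine, times in runtimes.items()
}

# Disk usage: 7.7 GB for PostgreSQL versus 396 MB for MonetDB.
disk_factor = (7.7 * 1024) / 396

for machine, factor in speedup.items():
    print(f"{machine}: PostgreSQL takes {factor:.1f}x as long as MonetDB")
print(f"disk: PostgreSQL needs {disk_factor:.1f}x the space of MonetDB")
```

Under these assumptions the speedup is a little under 4 on the server and about 12 on the laptop, and the disk ratio matches the "factor 20" on the slide.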

Individual query performance
[Chart: run time per query in seconds (log scale, 0.01 to 60) for MonetDB and PostgreSQL; DNC marked above the 60 second line.]

Simple queries are fast

  Query              MonetDB    PostgreSQL
  node               6.6 ms     2926 ms
  "der"              5.1 ms     226 ms
  /der.*/            19 ms      383 ms
  cat="S"            41 ms      184 ms
  lemma="waschen"    43 ms      14 ms

Influence of result size

  Query          Results    MonetDB    PostgreSQL
  pos="VVIMP"    162        43 ms      15 ms
  pos="VVPP"     17770      43 ms      111 ms
  pos="VVFIN"    35628      44 ms      182 ms
  pos="ADJA"     54534      43 ms      246 ms

Queries with millions of results

  Query: lemma="müssen" & pos=/VV.*/ & pos="$." & #1 .* #2 & #2 .* #3
    Results: 4.5 M    MonetDB: 2 s      PostgreSQL: 35 s
  Query: pos=/VM.*/ & pos=/VV.*/ & pos=/.*/ & #1 .* #2 & #2 .* #3
    Results: 384 M    MonetDB: 175 s    PostgreSQL: > 1 h

Fast regular expressions
A regular expression without a fixed prefix can't use an index; the entire column has to be scanned.

  Query                                                                    MonetDB    PostgreSQL
  /.*sich.*/                                                               213 ms     4206 ms
  /[Kk]a.*/                                                                219 ms     2812 ms
  pos="VVPP" & lemma=/(ge)?kommen/ & #1 _=_ #2                             229 ms     383 ms
  pos=/N.*/ & /[12][09][0-9][0-9]/ & #1 _=_ #2                             246 ms     2902 ms
  lemma=/[^äöü]+/ & /.+[äöü].+/ & pos="NN" & #1 _=_ #2 & #2 _=_ #3         469 ms     6246 ms
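Why a fixed prefix matters can be sketched with a sorted list standing in for an index. The sample values and both function names are hypothetical, and the sketch assumes that a pattern has to match the whole value. A pattern such as /der.*/ becomes a range lookup, while /.*sich.*/ has to test every value.

```python
import bisect
import re

# Hypothetical span values; the sorted list stands in for an index.
spans = sorted(["Wunder", "das", "der", "deren", "derzeit",
                "die", "sich", "sicher", "Ansicht"])


def prefix_lookup(sorted_values, prefix):
    """Range lookup for /prefix.*/: two binary searches, no full scan."""
    start = bisect.bisect_left(sorted_values, prefix)
    # The largest code point serves as an upper bound for the prefix range.
    end = bisect.bisect_left(sorted_values, prefix + "\U0010ffff")
    return sorted_values[start:end]


def full_scan(values, pattern):
    """No fixed prefix: every value has to be tested against the pattern."""
    regex = re.compile(pattern)
    return [value for value in values if regex.fullmatch(value)]


with_prefix = prefix_lookup(spans, "der")
without_prefix = full_scan(spans, r".*sich.*")
print(with_prefix)
print(without_prefix)
```

The range lookup visits only the matching entries, whereas the scan cost grows with the size of the column. A column store makes that scan cheaper because it reads only the one column involved.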

Advantages
MonetDB:
  better overall performance
  stable query performance
  fast regular expressions
  normalized schema
  greatly reduced disk consumption
PostgreSQL:
  queries with highly selective search term
  complete implementation
  bug-free SQL processing
  better use of limited resources

Summary
  prototypical implementation of Annis on MonetDB
  test scenario from an Annis installation in service
  in-depth performance comparison of Annis on MonetDB and PostgreSQL

SELECT vielen FROM dank;