The Extension of Weight Determining Method for Weighted Zone Scoring in Information Retrieval

Sergey Sakulin*1, Alexander Alfimtsev2
1,2 Bauman Moscow State Technical University, 2nd Baumanskaya st., 5-1, Moscow, 105005, Russian Federation
*1 sakulin@bmstu.ru; 2 alfim@bmstu.ru

Abstract- Information retrieval based on weighted zone scoring assigns a weight to each zone or each field in the document metadata. All these weights are obtained using machine learning methods. The paper presents a method of determining the weights using the fuzzy Choquet integral. This makes it possible to take into account interdependence between the zone parameters when calculating relevance, and to obtain higher scoring accuracy.

Keywords- Information Retrieval; Aggregation Operator; Fuzzy Measure; Choquet Fuzzy Integral

I. INTRODUCTION

Information retrieval is the search for documents that are relevant to a text query using various techniques [1]. When working with large collections of documents, the search results can be so large that the user will simply not be able to see them all. So one of the important tasks of information retrieval is to rank search results according to their relevance to the query. If we use the document metadata in this ranking, we need to take into account expert knowledge about the metadata structure and its characteristics. Document metadata are the fields (such as the date the document was created, the type of document, the book cost, etc.) and the zones (title, author, publisher, abstract, keywords, the text, etc.). The difference between zones and fields lies in the fact that a field may have a limited predefined set of values, while the set of values of a zone is not limited. Further, we consider a field as a special case of a zone.

A search results ranking method was described in [1]. This method is based on allocating a weight $g_i$ to each zone. The weights are set using machine learning based on training examples. Denote the text query as $q$ and the document as $d$. In weighted zone scoring, each pair $(q, d)$ is assigned a value on the unit interval by calculating a linear combination of the zone scores. Consider a set of documents each of which has $H$ zones.
Let $g_i \in [0, 1]$, $i = 1, \dots, H$, such that $\sum_{i=1}^{H} g_i = 1$, and let $s_i \in [0, 1]$ be the zone score expressing the degree of compliance (or non-compliance) between the query and the $i$-th zone of the document. This value can be calculated in different ways for each of the zones [1]. Consider one of the most common ways to calculate it. If all the query terms are contained in the particular zone, $s_i$ equals 1; if only one term is contained in the zone, $s_i$ equals $1/r$; if no term is contained in the zone, $s_i$ equals zero, where $r$ is the number of terms in the query. Other ways to compute this value use the frequency with which the query terms occur in the particular zone as input information, or may be based on quality indicators of the document, the age of the document, its length and so on. In particular, there is a zone score calculating method based on the ranking function BM25F [2], which takes into account the frequency of occurrence of the query terms in the document zones. BM25F is based on the function BM25 [3], which combines three main attributes: the term frequency, the document frequency, and the length of the document.

In this paper, the focus is not on how to calculate the zone scores but on the aggregation of these scores into a single score of document relevance to the query, $\mathrm{score}(q, d)$. This aggregation was performed by a linear combination of zone scores [1]:

$$\mathrm{score}(q, d) = \sum_{i=1}^{H} g_i s_i \qquad (1)$$

Suppose that we have a set of training examples, each of which is a tuple consisting of the query $q$, the document $d$, and a relevance rating for the pair $(q, d)$. Usually each query $q$ is linked with a set of documents which is completely ordered by an expert according to their relevance. In accordance with this order, the relevance ratings can be assigned by the expert within the unit interval. Then the weights $g_i$ are determined by machine learning using the available examples, so that the resulting weights approximate the relevance ratings of the training examples. Obtaining the weight coefficients reduces to an optimization problem whose objective function is the total error over the training examples.

There are also empirical rules for assigning weights to the document zones. For example, the authors of paper [4] believe that
they can achieve higher ranking accuracy by assigning a relatively high weight to the document title zone. The authors of paper [5] made the assumption that the ranking accuracy of news documents can be increased by separating the first sentence into a separate zone and assigning an increased weight to this zone. These and other similar rules can be applied in machine learning within the zone score aggregation using a weighted arithmetic mean aggregation operator [6].

The approach described above, in all its varieties, makes an implicit assumption of the mutual independence of the values $s_i$. However, it can be shown that the values $s_i$ can depend on each other. For example, if a query term occurs in the title of a news document, it is most likely to occur in the first sentence of the document as well. In this case, we are dealing with a positive correlation between the values $s_i$, and if we calculate the relevance score by the weighted arithmetic mean (1) we obviously get some redundancy in the result. This phenomenon of positive correlation of aggregated values, and ways of compensating the corresponding redundancy of the result, are discussed in detail, for example, in [7].

A possible example of a more complex dependence is the following. Suppose an expert knows the following: query terms occur both in the body and in the abstract of several documents. These documents are ordered by relevance using the following rule: a document with the same terms in the head zone is more relevant to the query than a document with the same terms in the document type zone. Such dependence between the zone scores is known as the preferential dependence of criteria [7]. This dependence cannot be expressed by any additive operator, including the weighted arithmetic mean operator. Such knowledge cannot be formalized in the form of rules for obtaining the zone weights using machine learning with the weighted average aggregation operator. Thus, we coarsen the results when we apply the weighted average operator to compute the relevance of documents to the query and assume that the values $s_i$ are always independent of each other.
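The weighted zone score of Eq. (1) can be illustrated with a minimal sketch. The function names, the sample document, and the weights below are hypothetical, and the zone score generalizes the rule from the text by assuming that a zone containing k of the r distinct query terms scores k/r.

```python
import math
from typing import Sequence


def zone_score(query_terms: Sequence[str], zone_text: str) -> float:
    """Share of distinct query terms found in the zone (assumed k/r rule)."""
    terms = {t.lower() for t in query_terms}
    if not terms:
        return 0.0
    tokens = set(zone_text.lower().split())
    return len(terms & tokens) / len(terms)


def weighted_zone_score(weights: Sequence[float], scores: Sequence[float]) -> float:
    """Eq. (1): linear combination of zone scores with weights summing to 1."""
    if len(weights) != len(scores):
        raise ValueError("one weight per zone is required")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(g * s for g, s in zip(weights, scores))


# Hypothetical document with three zones and hand-picked weights.
query = ["fuzzy", "integral"]
zones = {
    "title": "Fuzzy measures",
    "abstract": "the Choquet integral is a fuzzy integral",
    "body": "nothing relevant here",
}
weights = [0.5, 0.3, 0.2]
scores = [zone_score(query, text) for text in zones.values()]
relevance = weighted_zone_score(weights, scores)
```

Because the operator is additive, a term that appears in two correlated zones is simply counted twice, which is the redundancy discussed above.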
II. FUZZY MEASURES AND THE CHOQUET INTEGRAL

If the aggregated criteria are interdependent, then the Choquet integral with respect to a fuzzy measure can be used instead of the weighted arithmetic mean operator to formalize that dependence. The Choquet integral is a generalization of the weighted arithmetic mean operator to the case of dependence between the values $s_i$ (which we call the criteria of aggregation, following established terminology [7, 8]). A fuzzy measure expresses the subjective weight or importance of each subset of criteria and is defined as follows [7].

A fuzzy (discrete) measure is a function $\mu: 2^J \to [0, 1]$, where $2^J$ is the set of all subsets of the criteria index set $J = \{1, \dots, H\}$, which satisfies the following conditions:

1) $\mu(\emptyset) = 0$, $\mu(J) = 1$;
2) $\forall D, B \subseteq J: D \subseteq B \Rightarrow \mu(D) \le \mu(B)$.

Further, we will omit the curly brackets, writing $i$, $ij$ instead of $\{i\}$, $\{i, j\}$ respectively. Instead of the criterion $s_i$ of index $i \in J$ we will also say the criterion $i$, and instead of the criteria index set $J$ we will say the set of criteria $J$, both for brevity.

First, we consider the basic concepts used in fuzzy measure theory. Shapley [9] proposed a definition of the criterion importance coefficient based on several natural axioms. In the context of fuzzy measure theory, the Shapley index of the criterion $i \in J$ with respect to the fuzzy measure $\mu$ is determined by the following expression:

$$\phi_S(i) := \sum_{D \subseteq J \setminus \{i\}} \frac{(|J| - |D| - 1)!\,|D|!}{|J|!} \left[ \mu(D \cup i) - \mu(D) \right]$$

Murofushi and Soneda proposed an interaction index between criteria [10]. This index is used to express the sign and degree of interaction between criteria and is determined by the following expression:

$$I(i, j) := \sum_{D \subseteq J \setminus \{i, j\}} \frac{(|J| - |D| - 2)!\,|D|!}{(|J| - 1)!} \left[ \mu(D \cup ij) - \mu(D \cup i) - \mu(D \cup j) + \mu(D) \right]$$

The use of the Choquet integral for the aggregation of dependent criteria was considered in [7, 8]. In particular, modeling the preferential dependence of criteria by the Choquet integral is discussed in [8]. The application of a new machine learning method based on the Choquet integral in different application areas is discussed in detail in [11], which concludes that its use is feasible. In the
field of information retrieval, the Choquet integral can be used for modeling expert preferences formalized by rules similar to the rules described in the previous section.

The Choquet integral of the criteria $s_1, \dots, s_H$ with respect to $\mu$ is defined by

$$CH_\mu(s_1, \dots, s_H) := \sum_{i=1}^{H} s_{(i)} \left[ \mu(A_{(i)}) - \mu(A_{(i+1)}) \right],$$

where $(\cdot)$ indicates a permutation of $J$ such that $s_{(1)} \le \dots \le s_{(H)}$. Also $A_{(i)} = \{(i), \dots, (H)\}$ and $A_{(H+1)} = \emptyset$ [7].

III. FUZZY MEASURE IDENTIFICATION FOR WEIGHTED ZONE SCORING

If we use the weighted arithmetic mean operator for weighted zone scoring, then the weights $g_i$ can be set directly by the expert. But due to the great complexity of this task, in most cases these weights are determined by machine learning [1]. If we use the Choquet integral for weighted zone scoring, it is required to obtain a fuzzy measure instead of the weights $g_i$. Direct assignment of a fuzzy measure by an expert is an even more difficult task than weight setting, due to exponentially increasing complexity. For example, for four criteria an expert would have to set $2^4 = 16$ fuzzy measure coefficients. Such setting is impossible in practice. Therefore, the coefficients of the fuzzy measure are obtained using machine learning, as is done for the weighted arithmetic mean operator.

To realize such a machine learning procedure, it is necessary to form a set of training examples and a set of formal empirical rules like those described above. Each training example is a triple $(d, q, r(q, d))$ in which the assessment of relevance $r(q, d)$ of the document $d$ to the query $q$ is assigned by an expert on the unit interval, or these assessments are ranked by an expert. The rules are limitations both on the fuzzy measure and on the Choquet integral, expressed as weak partial orders on the set of zone score realizations, on the results of aggregation (the final relevance of the documents), on the Shapley indices, and on the interaction indices of criteria. Methods used to formalize these rules were considered in detail in [7]. In particular, if a rule states that zone scores are positively correlated, then it is formalized by assigning a negative sign to the interaction index of these scores.

In practice, to enable the expert to create such rules it is common to use 2nd-order fuzzy measures and, accordingly, the 2nd-order Choquet integral.
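The Shapley index, the interaction index, and the Choquet integral defined above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the fuzzy measure is stored as a dictionary keyed by `frozenset` (a representation chosen here), criteria are indexed from 0, and the two-zone measure `mu` is a made-up example of two redundant zones.

```python
from itertools import combinations
from math import factorial
from typing import Dict, FrozenSet, Iterable, Iterator, Sequence

Measure = Dict[FrozenSet[int], float]


def subsets(items: Iterable[int]) -> Iterator[FrozenSet[int]]:
    """Yield every subset of the given indices, the empty set included."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def choquet(scores: Sequence[float], mu: Measure) -> float:
    """Discrete Choquet integral of the zone scores with respect to mu."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # ascending
    total = 0.0
    for pos, i in enumerate(order):
        upper = frozenset(order[pos:])      # A_(pos): current and larger scores
        rest = frozenset(order[pos + 1:])   # A_(pos+1): empty for the last term
        total += scores[i] * (mu[upper] - mu[rest])
    return total


def shapley(i: int, mu: Measure, n: int) -> float:
    """Shapley importance index of criterion i among n criteria."""
    others = [j for j in range(n) if j != i]
    value = 0.0
    for d in subsets(others):
        coeff = factorial(n - len(d) - 1) * factorial(len(d)) / factorial(n)
        value += coeff * (mu[d | {i}] - mu[d])
    return value


def interaction(i: int, j: int, mu: Measure, n: int) -> float:
    """Murofushi-Soneda interaction index of the pair (i, j)."""
    others = [k for k in range(n) if k not in (i, j)]
    value = 0.0
    for d in subsets(others):
        coeff = factorial(n - len(d) - 2) * factorial(len(d)) / factorial(n - 1)
        value += coeff * (mu[d | {i, j}] - mu[d | {i}] - mu[d | {j}] + mu[d])
    return value


# Hypothetical measure for two redundant zones: 0.7 + 0.6 > mu(both) = 1.
mu: Measure = {
    frozenset(): 0.0,
    frozenset({0}): 0.7,
    frozenset({1}): 0.6,
    frozenset({0, 1}): 1.0,
}
score_one_zone = choquet([1.0, 0.0], mu)   # term found only in zone 0
pair_interaction = interaction(0, 1, mu, 2)
```

In this toy measure the interaction index of the two zones is negative, which is how redundancy between positively correlated zone scores shows up. For an additive measure the same `choquet` function reduces to the weighted arithmetic mean.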
While remaining relatively simple, a 2nd-order fuzzy measure allows modeling the interaction between the criteria described by rules similar to those mentioned above. The paper [12] is entirely devoted to the question of under what conditions such a simplification (the use of the 2nd-order Choquet integral) is correct. That paper presents necessary conditions that the expert preferences should satisfy in order to be formalized using the 2nd-order Choquet integral.

For each training example, we have the values $s_i$ assigned to each zone of the document. The relevance of the document $d$ to the query $q$ will be determined as $\mathrm{score}(q, d) = CH_\mu(s_1, \dots, s_H)$. Because the available information has the form of the rules described above, we need to choose a method of fuzzy measure identification. Methods based on minimization of the fuzzy measure variance or maximization of the fuzzy measure entropy are the most suited to solving many practical problems [13]. One of the advantages of this method is the lack of any strict requirements on the input information, in contrast to other methods of fuzzy measure identification. This method is based on the principle of maximum entropy proposed in 1957 by Jaynes [14]. In relation to the construction of aggregation operators, that principle involves using all available information about the aggregation criteria while keeping the most unbiased attitude to the unavailable information. We will follow this principle in weighted zone scoring of documents; that is, taking into account the expert knowledge in the form of training examples and rules, we will treat the missing information without bias.

Kojadinovic [13] extended the principle of maximum entropy to utility theory and developed a fuzzy measure identification method based on it. The objective function of this method is defined as the variance of the fuzzy measure:

$$F_{MV}(\mu) := \frac{1}{|J|} \sum_{i \in J} \sum_{G \subseteq J \setminus \{i\}} \frac{(|J| - |G| - 1)!\,|G|!}{|J|!} \left( \sum_{D \subseteq G} a(D \cup i) - \frac{1}{|J|} \right)^2$$

The corresponding optimization problem takes the following form. Minimize $F_{MV}(\mu)$ under the following constraints:
$$\sum_{D \subseteq G} a(D \cup i) \ge 0, \quad \forall i \in J,\ \forall G \subseteq J \setminus \{i\},$$

$$\sum_{D \subseteq J,\ 0 < |D| \le k} a(D) = 1,$$

$$CH_\mu(s) - CH_\mu(s') \ge \delta_{CH}, \quad \dots$$

Here $G \subseteq J$; $k$ is the order of the fuzzy measure $\mu$; $s$ and $s'$ are realizations of the zone scores; $\delta_{CH}$ is the indifference threshold set by an expert to compare two results of aggregation; $a(D)$ is a set function on $J$, called the Möbius function of $\mu$, which is given by

$$a(D) = \sum_{G \subseteq D} (-1)^{|D \setminus G|} \mu(G).$$

IV. THE PROCEDURE FOR DETERMINING THE WEIGHTS FOR WEIGHTED ZONE SCORING

If the aggregation operator is the Choquet integral with respect to the fuzzy measure, this procedure consists of the following steps.

Step 1. Form a set of zones for the documents and a method of calculating the zone scores.

Step 2. Generate training examples using a collection of documents, these examples being relevance estimates and (or) a non-strict partial order on the set of the estimates, i.e. implement expert ranking of the documents relative to the query. Create rules in the form of weak partial orders on sets of Choquet integral parameters.

Step 3. Formalize the information obtained in the previous step as restrictions on the Choquet integral parameters in the form of inequalities with indifference thresholds. Set the indifference thresholds from the training examples and the scales that have been applied.

Step 4. Identify the fuzzy measure on the basis of the information obtained in the previous steps by the minimum variance (dispersion minimizing) method.

When newly available information is added to the set of training examples and the set of rules, the procedure is repeated from step 3. The Choquet integral with respect to the identified fuzzy measure is the aggregation operator for zone scores through which the documents are ranked according to their relevance to the query.

V. EXPERIMENT

During the experiment we did not attempt to create a complete search engine. The purpose of the experimental study was to answer the question of the practical applicability of fuzzy measures and the Choquet integral in the field of information retrieval. The set of training examples included 30 queries, 100 terms, and 300 documents (publications in the field of artificial intelligence). The procedure discussed above was put into practice to determine the fuzzy measure for weighted zone scoring.

Step 1. We considered five zones of the documents: title ($s_1$), abstract ($s_2$), keywords ($s_3$), main text ($s_4$), and references ($s_5$).
These zones correspond to the zone indicators, which are calculated based on the function BM25F [2].

Step 2. The initial data for machine learning comprised both the set of training examples and empirical rules similar to those discussed above. The set of training examples, indexed $1, \dots, 1000$, was obtained with expert support. As noted above, each of these examples is a triple $(q, d, r(q, d))$. The relevance of the document $d$ to the query $q$ was evaluated on a scale given by the set $S = \{0, 1, 2, 3, 4\}$, in the same manner as in [6]. In this set, 0 means that the document does not match the query at all (no relevance), 4 means full compliance (the document is relevant to the query), and the other values correspond to intermediate gradations of relevance.
Also, we obtained the following empirical rules in this step with expert support.

Rule 1. If a query term occurs in the title, the same term is likely to occur both in the abstract and in the main text. This rule means that the corresponding criteria are positively correlated and their interaction indices are less than zero. The interaction indices of these criteria are then constrained by the following inequalities:

$$I(1, 3) < 0; \quad I(1, 4) < 0; \quad I(3, 4) < 0 \qquad (2)$$

Rule 2. For the document to be relevant to the query, it is least important that the query term is contained in the list of references; more important that the query term is contained in the main text; more important still that the term occurs in the keywords; and, finally, most important that the query term occurs in the title and (or) in the annotation. This rule means the following. The importance of criterion 5 is less than the importance of criterion 4. Similarly, the importance of criterion 4 is less than the importance of criterion 2. The importance of criterion 3 is the same as the importance of criterion 2 and less than the importance of criterion 1. This reasoning can be expressed by a partial weak order $\preceq_J$ on the set $J$ of document zones:

$$5 \prec_J 4 \prec_J 2 \sim_J 3 \prec_J 1 \qquad (3)$$

Rule 3. If the query term is found in the main text and in the abstract, then for the document to be more relevant to the query it is preferable that the same term is contained in the title rather than in the keywords. This rule can be expressed by the following preference relation on the set $S$ of available realizations of criteria:

$$s^1 \succ_S s^2, \quad s^3 \succ_S s^2$$

Here $s^1$, $s^2$, $s^3$ are the realizations of criteria for three documents from the training set.

Step 3. Inequalities (2) are translated into inequalities with indifference thresholds:

$$I(1, 3) + \delta_I \le 0; \quad I(1, 4) + \delta_I \le 0; \quad I(3, 4) + \delta_I \le 0$$

Here $\delta_I$ is the indifference threshold defined by an expert. This threshold is interpreted as the minimum significantly non-zero absolute value of the interaction index. The partial weak order (3) is translated into inequalities on the Shapley indices of the criteria:

$$\phi_S(4) - \phi_S(5) \ge \delta_S; \quad \phi_S(2) - \phi_S(4) \ge \delta_S; \quad \phi_S(1) - \phi_S(2) \ge \delta_S; \quad |\phi_S(3) - \phi_S(2)| \le \delta_S; \quad \phi_S(3) - \phi_S(1) \le -\delta_S$$

Here $\delta_S$ is the indifference threshold defined by an expert. Shapley indices are significantly distinguished if their absolute difference exceeds the indifference threshold $\delta_S$.

Step 4.
The training examples and rules formed the restrictions imposed on the Choquet integral and its parameters during the fuzzy measure identification process. The fuzzy measure was identified by the minimum variance method, using the specialized package Kappalab [7] to solve the optimization problem described above. An important question that arose in the identification process relates to the need for expert assignment of indifference thresholds. These values were chosen on the basis of the document relevance scale: for the aggregation results the indifference threshold was taken to be $\delta_{CH} = 0.25$. In addition, the restrictions imposed on the indifference thresholds have been met (these thresholds can be set so that the fuzzy measure identification problem obviously does not have a solution); the inequalities proposed in [15] are constraints whose satisfaction excludes such a situation.

Experiments evaluating the accuracy of the proposed method were performed on a statistically significant sample of 500 search queries containing the terms of the training examples in various combinations.
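The minimum variance objective used in Step 4 can be evaluated for a given fuzzy measure with a short sketch. It only computes the Möbius function and $F_{MV}$; it does not solve the constrained optimization problem, which the paper delegates to Kappalab. The dictionary representation, the function names, and the two example measures are assumptions made for illustration.

```python
from itertools import combinations
from math import factorial
from typing import Dict, FrozenSet, Iterable, Iterator

Measure = Dict[FrozenSet[int], float]


def subsets(items: Iterable[int]) -> Iterator[FrozenSet[int]]:
    """Yield every subset of the given indices, the empty set included."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def mobius(mu: Measure, n: int) -> Measure:
    """Moebius function a(D) = sum over G subset of D of (-1)^|D minus G| * mu(G)."""
    return {
        d: sum((-1) ** (len(d) - len(g)) * mu[g] for g in subsets(d))
        for d in subsets(range(n))
    }


def variance(mu: Measure, n: int) -> float:
    """Objective F_MV: spread of marginal contributions around 1/n."""
    a = mobius(mu, n)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for g in subsets(others):
            coeff = factorial(n - len(g) - 1) * factorial(len(g)) / factorial(n)
            marginal = sum(a[d | {i}] for d in subsets(g))
            total += coeff * (marginal - 1.0 / n) ** 2
    return total / n


# Hypothetical two-zone measures: a uniform additive one and a redundant one.
uniform: Measure = {
    frozenset(): 0.0,
    frozenset({0}): 0.5,
    frozenset({1}): 0.5,
    frozenset({0, 1}): 1.0,
}
redundant: Measure = {
    frozenset(): 0.0,
    frozenset({0}): 0.7,
    frozenset({1}): 0.6,
    frozenset({0, 1}): 1.0,
}
variance_uniform = variance(uniform, 2)
variance_redundant = variance(redundant, 2)
```

The uniform additive measure gives zero variance, so the optimizer moves away from it only as far as the training examples and rules require. This is the unbiased treatment of missing information described in Section III.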
Initially we calculated the document relevance $\mathrm{score}(q, d)$ for these search queries using the set of training examples, the empirical rules 1-3, and the method described above. Then we calculated the relevance $\mathrm{score}(q, d)$ on the basis of the machine learning method described in [1] and the set of training examples. Finally, it was found that the accuracy of search results ranking when using 2nd-order Choquet integral aggregation improved by an average of 4.5% compared to the weighted average aggregation operator. The ranking accuracy was measured as the difference between the relevance assigned by an expert and the document relevance computed by weighted zone scoring with one of the two aggregation operators considered in this paper.

VI. CONCLUSIONS

The paper considers the practical application of fuzzy measures and the Choquet integral in the field of information retrieval. Experimental results have shown that the accuracy of document relevance ranking can be increased by using the Choquet integral as an aggregation operator for zone scores. The increase in the accuracy of document relevance ranking is about 4.5% compared to using the weighted average operator. In future work, we intend to investigate the application of the proposed weight determining method to various document collections, as well as the practical applicability of the Choquet integral and fuzzy measures in other tasks of information retrieval such as automatic error correction, automatic abstracting, and annotating of texts.

REFERENCES

[1] Manning C., Raghavan P., and Schutze H., Introduction to Information Retrieval, Cambridge University Press, 2008, 544 p.
[2] Robertson S., Zaragoza H., and Taylor M., Simple BM25 Extension to Multiple Weighted Fields, Proc. of ACM Conference on Information Knowledge Management (CIKM), pp. 42-49, Nov. 2004.
[3] Robertson S. and Walker S., Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval, Proc. of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 232-241, 1994.
[4] Cohen W. and Singer Y., Context-sensitive learning methods for text categorization, ACM Transactions on Information Systems, vol.
17(2), pp. 141-173, 1999.
[5] Murata M., Ma Q., Uchimoto K., Ozaku H., Utiyama M., and Isahara H., Japanese probabilistic information retrieval using location and category information, Proc. of the Fifth International Workshop on Information Retrieval with Asian Languages, pp. 81-88, 2000.
[6] Svore K. and Burges C., A Machine Learning Approach for Improved BM25 Retrieval, Proc. of the 18th ACM Conference on Information and Knowledge Management, pp. 1811-1814, 2009.
[7] Grabisch M., Kojadinovic I., and Meyer P., A review of methods for capacity identification in Choquet integral based multi-attribute utility theory: Applications of the Kappalab R package, European Journal of Operational Research, vol. 186(2), pp. 766-785, 2008.
[8] Marichal J.-L., An axiomatic approach to the discrete Choquet integral as a tool to aggregate interacting criteria, IEEE Transactions on Fuzzy Systems, vol. 8(6), pp. 800-807, 2000.
[9] Shapley L., A value for n-person games, Kuhn H., Tucker A., Eds., Contributions to the Theory of Games, Princeton: Princeton University Press, pp. 307-317, 1953.
[10] Murofushi T. and Soneda S., Techniques for reading fuzzy measures (III): interaction index, 9th Fuzzy System Symposium, pp. 693-696, 1993.
[11] Fallah Tehrani A., Cheng W., and Hüllermeier E., Preference Learning using the Choquet Integral: The Case of Multipartite Ranking, IEEE Transactions on Fuzzy Systems, pp. 5-28, 2012.
[12] Mayag B., Grabisch M., and Labreuche C., A representation of preferences by the Choquet integral with respect to a 2-additive capacity, Theory and Decision, 71, pp. 297-324, 2011.
[13] Kojadinovic I., Minimum variance capacity identification, European Journal of Operational Research, vol. 177(1), pp. 498-514, 2007.
[14] Jaynes E., Information theory and statistical mechanics, Physical Review, 106, pp. 620-630, 1957.
[15] Alfimtsev A., Sakulin S., and Devyatkov V., Web personalization based on fuzzy aggregation and recognition of user activity, International Journal of Web Portals, vol. 4(1), pp. 33-41, 2012.

Sergey Sakulin graduated from the Bauman Moscow State Technical University in 2001. He received his Ph.D. from MSTU n.a. N.E. Bauman in 2009. Today he is an assistant professor of the information systems and telecommunications department. He has ten scientific papers.
His scientific interests lie in the field of artificial intelligence methods and expert knowledge formalization and visualization.
Alexander Alfimtsev graduated from the Bauman Moscow State Technical University in 2005. He is an associate professor at BMSTU, information systems and telecommunications department. He has fifty scientific papers, including three patents for inventions. His scientific interests lie in the field of intelligent multimodal interfaces, pattern recognition and computer vision.