A Simple Additive Re-weighting Strategy for Improving Margins

Fabio Aiolli and Alessandro Sperduti
Department of Computer Science, Corso Italia 40, Pisa, Italy
e-mail: {aiolli, perso}@di.unipi.it

Abstract

We present a sample re-weighting scheme inspired by recent results in margin theory. The basic idea is to add to the training set replicas of samples which are not classified with a sufficient margin. We prove the convergence of the input distribution obtained in this way. As a case study, we consider an instance of the scheme involving a 1-NN classifier implementing a Vector Quantization algorithm that accommodates tangent distance models. The tangent distance models created in this way have shown a significant improvement in generalization power with respect to the standard tangent models. Moreover, the obtained models were able to outperform state-of-the-art algorithms, such as SVM.

1 Introduction

In this paper we introduce a simple additive re-weighting method that is able to improve the margin distribution on the training set. Recent results in computational learning theory [Vapnik, 1998; Schapire et al., 1998; Bartlett, 1998] have tightly linked the expected risk of a classifier (i.e. the probability of misclassification of a pattern drawn from an independent random distribution) with the distribution of the margins in the training set. In general, we can expect the best generalization performance (minimal error on test data) when most of the patterns have high margins.

The aforementioned results are at the basis of the theory of two of the most impressive algorithms: Support Vector Machines and Boosting. The effectiveness of both SVMs and Boosting is largely due to the fact that they, directly or not, improve the margins on the training set. In particular, SVM explicitly finds the hyper-plane with the largest minimum margin in a dimensionally augmented space where training points are mapped by a kernel function. In this case, margin theory explains impressive performances even in very high dimensional spaces where data are supposed to be more separated. Most of the recent efforts in SVMs are in the choice of the right kernels for particular applications. For example, in OCR problems, the polynomial kernel was proven to be very effective.

On the other side, boosting algorithms, and in particular the most famous version, AdaBoost, produce a weighted ensemble of hypotheses, each one trained so as to minimize the empirical error in a given "difficult" distribution of the training set. Again, it has been shown [Schapire, 1999] that boosting essentially is a procedure for finding a linear combination of weak hypotheses which minimizes a particular loss function dependent on the margins on the training set, namely an exponential function of the margins. Recently, research efforts related to boosting algorithms faced the direct optimization of the margins on the training set. For example, this has been done by defining different margin-based cost functions and searching for combinations of weak hypotheses that minimize these functions [Mason et al., 1998].

We will follow a related approach that aims to find a single (possibly non-linear) optimal hypothesis, where the optimality is defined in terms of a loss function dependent on the distribution of the margins on the training set. In order to minimize this loss we propose a re-weighting algorithm that maintains a set of weights associated with the patterns in the training set. The weight associated with a pattern is iteratively updated when the margin of the current hypothesis does not reach a predefined threshold on it. In this way a new distribution on the training data is induced. Furthermore, a new hypothesis is then computed that improves the expectation of the margin on the new distribution. In the following we prove that the distribution converges to a uniform distribution on a subset of the training set.

We apply the above scheme to an OCR pattern recognition problem, where the classification is based on a 1-NN tangent distance classifier [Simard et al., 1993], obtaining a significant improvement in generalization. Basically, the algorithm builds a set of models for each class by an extended version of the Learning Vector Quantization procedure (LVQ [Kohonen et al., 1996]) adapted to tangent distance. In the following we will refer to this new algorithm as Tangent Vector Quantization (TVQ).

The paper is organized as follows. In Section 2, we introduce the concept of margin regularization via the input distribution on the training set. Specifically, we present the $\theta$-Margin Re-weighting Strategy, which guarantees the convergence of the input distribution. In Section 3, we introduce a definition for the margins in a 1-NN scheme that considers the discriminative ratio observed for a particular pattern, and in Section 4 we define the TVQ algorithm. Finally, in Section 5 we present empirical results
comparing TVQ with other 1-NN based algorithms, including SVM.

2 Regularization of the margins

When learning takes place, the examples tend to influence the discriminant function of a classifier in different ways. A discriminant function can be viewed as a resource that has to be shared among different clients (the examples). Often, when the pure Empirical Risk Minimization (ERM) principle is applied, that resource is used in a wrong way since, with high probability, it is almost entirely used by a fraction of the training set. Margin theory formally tells us that it is preferable to regularize the discriminant function in such a way that the examples share its support more equally.

Inspired by the basic ideas of margin optimization, we propose here a simple general procedure applicable, potentially, to any ERM-based algorithm. It regularizes the parameters of a discriminant function so as to obtain hypotheses with large margins for many examples in the training set.

Without loss of generality we consider the margin for a training example as a real number, taking values in $[-1, +1]$, representing a measure of the confidence shown by a classifier in the prediction of the correct label. In a binary classifier, e.g. the perceptron, the margin is usually defined as $\rho = t\,y$, where $t$ is the target and $y$ is the output computed by the classifier. It can be easily brought back to the range $[-1, +1]$ by a monotonic (linear or sigmoidal) transformation of the output. In any case, a positive value of the margin must correspond to a correct classification of the example.

Given the function $\rho_h(x_p)$ that, provided a hypothesis $h$, associates to each pattern its margin, we want to define a loss function that, when minimized, yields hypotheses with large margins (greater than a fixed threshold $\theta$) for many examples in the training set. For this, we propose to minimize a function that, basically, is a re-formulation of SVM's slack variables:

$$L = \sum_{x_p \in S} \big(\theta - \rho_h(x_p)\big)\, K_\theta\big(\rho_h(x_p)\big) \qquad (1)$$

where $S$ is a training set with $N$ examples, and $K_\theta(\rho) = 1$ if $\rho < \theta$, and $0$ otherwise. The function $L$ is null for margins higher than the threshold $\theta$ and is linear with respect to the values of the margins when they are below this threshold.

We suggest minimizing $L$ indirectly via a two-step iterative method that simultaneously (1) searches for an a priori distribution $\{\gamma_p\}$ for the examples that, given the current hypothesis $h$, better approximates the function $K_\theta(\rho_h(x_p))$, and (2) searches for a hypothesis $h$ (e.g. by a gradient based procedure) that, provided the distribution $\{\gamma_p\}$, improves the weighted function

$$H = \sum_{p} \gamma_p\, \rho_h(x_p). \qquad (2)$$

This new formulation is equivalent to that given in eq. (1) provided that $\{\gamma_p\}$ converges to the uniform distribution on the $\theta$-mistakes (patterns that have margin less than the threshold).

2.1 $\theta$-Margin Re-weighting Strategy

  Input:
    $T$: number of iterations
    $\mathcal{H}$: hypotheses space
    $\theta$: margin threshold
    $k(\cdot)$: bounded function
    $S = \{(x_1, t_1), \ldots, (x_N, t_N)\}$: training set
  Initialize $h_0 \in \mathcal{H}$ (initial hypothesis)
    for $p = 1, \ldots, N$: $w_p^1 = 1$, $\gamma_p^1 = 1/N$
  for $t = 1, \ldots, T$
  begin
    find $h_t \in \mathcal{H}$ that improves $H$ of eq. (2) under the distribution $\{\gamma_p^t\}$
    $w_p^{t+1} = w_p^t + k(t)\, K_\theta(\rho_{h_t}(x_p))$
    $\gamma_p^{t+1} = w_p^{t+1} / \sum_q w_q^{t+1}$
  end
  return $h_T$

Figure 1: The $\theta$-Margin Re-weighting Strategy.

The algorithm, shown in Figure 1, consists of a series of trials. An optimization process, which explicitly maximizes the function $H$ according to the current distribution for the examples, works on an artificial training set $S'$, initialized to be equal to the original training set $S$. For each $t$, $k(t)$ replicas of those patterns in $S$ that have margin below the fixed threshold $\theta$ are added to $S'$, augmenting their density in $S'$ and consequently their contribution in the optimization process. Note that $w_p^t$ denotes the number of occurrences of the pattern $x_p$ in the extended training set $S'$. In the following, we prove that the simple iterative procedure just described makes the distribution approach a uniform distribution on the $\theta$-mistakes, provided that $k(t)$ is bounded.

2.2 Convergence of the distribution

For each trial $t$, given the margin of each example in the training set, we can partition the training sample as $S = M_t \cup C_t$, where $M_t$ is the set of $\theta$-mistakes and $C_t = S \setminus M_t$ is the complementary set of $\theta$-correct patterns. Let $W_t$ denote $|S'|$ at time $t$ and let $w_p^t$ be the number of occurrences of pattern $x_p$ in $S'$ at time $t$, with density $\gamma_p^t = w_p^t / W_t$. Moreover, let $k(t)$ be a suitable function of $t$, and let

$$w_p^{t+1} = w_p^t + k(t)\, K_\theta(\rho_{h_t}(x_p))$$

be the update rule for the number of occurrences in $S'$, where $k(t)$ is bounded and takes positive values (note that $k(t)$ may change at different iterations but it is independent of $p$). It is easy to verify that $W_{t+1} \ge W_t$ for each $t$, because of the monotonicity of $w_p^t$, and that $W_t$ grows with the number of iterations as long as $\theta$-mistakes occur. In fact, at time $t+1$ we have

$$\gamma_p^{t+1} = \frac{w_p^t + k(t)\, K_\theta(\rho_{h_t}(x_p))}{W_t + k(t)\, |M_t|}.$$

First of all we show that the distribution converges. This can be shown by demonstrating that the changes $\Delta\gamma_p^t = \gamma_p^{t+1} - \gamma_p^t$ tend to zero with the number of iterations. We have

$$\Delta\gamma_p^t = \frac{k(t)}{W_{t+1}} \Big( K_\theta(\rho_{h_t}(x_p)) - |M_t|\, \gamma_p^t \Big),$$

which can be easily bounded in modulus by a quantity that tends to 0: $|\Delta\gamma_p^t| \le k(t)\,|M_t| / W_{t+1}$, which is zero when there are no $\theta$-mistakes and otherwise vanishes as $W_{t+1}$ grows.

We now show to which values the densities converge. Let $m_p^t$ and $\mu_p^t = m_p^t / t$ be, respectively, the cumulative number and the mean ratio of $\theta$-mistakes for $x_p$ on the first $t$ epochs. Then, for a constant $k(t) = k$,

$$\gamma_p^{t+1} = \frac{1 + k\, m_p^t}{N + k \sum_q m_q^t} = \frac{1 + k\, t\, \mu_p^t}{N + k\, t \sum_q \mu_q^t}.$$

Given the convergence of the optimization process that maximizes $H$ in eq. (2), the two sets $M_t$ and $C_t$ are going to become stable, and the distribution on $S$ will tend to a uniform distribution in $M_t$ (where $\mu_p = 1$) and will be null elsewhere (where $\mu_p = 0$).

This can be understood in the following way as well. Given the definition of the changes made on the gamma values on each iteration of the algorithm, we calculate the function that we indeed minimize. Since $\Delta\gamma_p^t = -\partial E_t / \partial \gamma_p$, after some algebra, we can write $E_t$ as

$$E_t = \frac{k(t)}{W_{t+1}} \Big( \frac{|M_t|}{2} \sum_{p} \gamma_p^2 - \sum_{x_p \in M_t} \gamma_p \Big),$$

with $\partial E_t / \partial \gamma_p = 0$ when $\gamma_p = K_\theta(\rho_{h_t}(x_p)) / |M_t|$, for which the minimum of the function $E_t$ is reached. Note that the minimum of $E_t$ is consistent with the constraint $\sum_p \gamma_p = 1$. In general, the energy function is modulated by a term decreasing with the number of iterations, dependent on the $k(t)$ used but independent of gamma, that can be viewed as a sort of annealing introduced in the process.

In the following, we study a specific instance of the $\theta$-Margin Re-weighting Strategy.

3 Margins in a 1-NN framework and tangent distance

Given a training example $x_p$ and a fixed number of models for each class, we give below a definition of the margin for the example when classified by a distance based 1-NN classifier. Given the example $x_p$, let $d_p^+$ and $d_p^-$ be the squared distances from the nearest of the positive set of models and from the nearest of the negative sets of models, respectively. We can define the margin of a pattern in the training set as:

$$\rho_p = \frac{d_p^- - d_p^+}{d_p^- + d_p^+} \qquad (3)$$
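As a minimal sketch of how the pieces introduced so far fit together, the Python fragment below computes a 1-NN margin from two squared distances and applies the additive re-weighting step of Section 2.1 to a fixed set of margins. It assumes that the margin is the normalized difference $(d^- - d^+)/(d^- + d^+)$, which is one form consistent with the properties stated in the text (values in $[-1, +1]$, positive exactly when the pattern is correctly classified). It also keeps the hypothesis fixed across iterations, which the actual strategy does not do. The names `nn_margin` and `reweight`, the toy distances, and the choice of one replica per step are illustrative assumptions.

```python
def nn_margin(d_pos, d_neg):
    """Margin in [-1, +1]; positive iff the nearest positive model is closer.

    d_pos, d_neg: squared distances to the nearest positive / negative model
    (assumed non-negative and not both zero).
    """
    return (d_neg - d_pos) / (d_neg + d_pos)


def reweight(weights, margins, theta, k=1.0):
    """One additive step: add k replicas of every pattern with margin < theta."""
    new_weights = [w + k if m < theta else w for w, m in zip(weights, margins)]
    total = sum(new_weights)
    gamma = [w / total for w in new_weights]  # normalized input distribution
    return new_weights, gamma


# Toy data (hypothetical): squared distances (d_pos, d_neg) for three patterns.
distances = [(1.0, 3.0), (2.0, 2.5), (4.0, 1.0)]
margins = [nn_margin(dp, dn) for dp, dn in distances]

weights = [1.0] * len(margins)  # every pattern starts with one occurrence
gamma = [1.0 / len(margins)] * len(margins)
for _ in range(1000):  # the hypothesis (hence the margins) is held fixed here
    weights, gamma = reweight(weights, margins, theta=0.2)
```

With the margins held fixed, the two patterns whose margin is below the threshold end up sharing almost all of the probability mass in equal parts, which is the limiting behavior argued for in Section 2.2.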
The margin in equation (3) takes values in the interval $[-1, +1]$, representing the confidence in the prediction of the 1-NN classifier. Higher values of the $\rho_p$'s can also be viewed as an indication of a higher discriminative power of the set of models with respect to the pattern. Moreover, a pattern is correctly classified in the 1-NN scheme if and only if its margin is greater than zero.

In this paper, we are particularly interested in distances that are invariant to given transformations. Specifically, we refer to the one-sided tangent distance [Simard, 1994; Hastie et al., 1995; Schwenk and Milgram, 1995b; 1995a], which computes the distance between a pattern $x_1$ and a pattern $x_2$ as the minimum distance between $x_1$ and the linear subspace approximating the manifold induced by transforming the pattern $x_2$ according to a given set of transformations:

$$d_T(x_1, x_2) = \min_{\alpha} \Big\| x_1 - x_2 - \sum_j \alpha_j T_j \Big\| \qquad (4)$$

where the $T_j$ are the tangent vectors spanning that subspace. If the transformations are not known a priori, we can learn them by defining, for each class $c$, a one-sided tangent distance model $M_c$, composed of a centroid $C_c$ (i.e., a prototype vector for class $c$) and a set of tangent vectors $T_{c,j}$ (i.e., an orthonormal base of the linear subspace), that can be written as $M_c(\alpha) = C_c + \sum_j \alpha_j T_{c,j}$. This model can be determined by just using the $N_c$ positive examples of class $c$, i.e.

$$M_c = \arg\min_{M} \sum_{p=1}^{N_c} \min_{\alpha} \big\| M(\alpha) - x_p \big\|^2 \qquad (5)$$

which can easily be solved by resorting to principal component analysis (PCA) theory, also called Karhunen-Loève Expansion. In fact, equation (5) can be minimized by choosing $C_c$ as the average over all available positive samples $x_p$, and the $T_{c,j}$ as the most representative eigenvectors (principal components) of the covariance matrix $\Sigma_c = \frac{1}{N_c} \sum_p (x_p - C_c)(x_p - C_c)^\top$.

The corresponding problem for the two-sided tangent distance can be solved by an iterative algorithm, called (two-sided) HSS, based on Singular Value Decomposition, proposed by Hastie et al. [Hastie et al., 1995]. When the one-sided version of tangent distance is used, HSS and PCA coincide. So, in the following, the one-sided version of this algorithm will be simply referred to as HSS.
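The PCA-based construction of a one-sided tangent model described above can be illustrated with the following sketch, which takes the centroid as the mean of the positive samples and the tangents as the leading principal directions of their covariance. It is not the authors' implementation: the function name `fit_tangent_model`, the use of power iteration with deflation in place of a full eigendecomposition, the start vector, and the toy samples are all assumptions made to keep the example dependency-free.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_tangent_model(samples, n_tangents, n_iter=200):
    """Return (centroid, tangents) for one class.

    centroid: mean of the samples; tangents: leading principal directions of
    the sample covariance, found by power iteration with deflation.
    """
    n, dim = len(samples), len(samples[0])
    centroid = [sum(x[i] for x in samples) / n for i in range(dim)]
    centered = [[x[i] - centroid[i] for i in range(dim)] for x in samples]

    tangents = []
    for _ in range(n_tangents):
        # Deterministic, non-symmetric start vector (an arbitrary choice).
        v = [float(i + 1) for i in range(dim)]
        for _step in range(n_iter):
            # w = Cov * v, computed without forming the covariance matrix.
            w = [0.0] * dim
            for x in centered:
                p = dot(x, v) / n
                for i in range(dim):
                    w[i] += p * x[i]
            # Deflation: remove components along tangents already found.
            for t in tangents:
                c = dot(w, t)
                w = [wi - c * ti for wi, ti in zip(w, t)]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:
                # No variance left: return fewer tangents than requested.
                return centroid, tangents
            v = [wi / norm for wi in w]
        tangents.append(v)
    return centroid, tangents


# Toy samples (hypothetical) with most variance along the first axis.
samples = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0),
           (0.0, 0.5, 0.0), (0.0, -0.5, 0.0),
           (0.0, 0.0, 0.1), (0.0, 0.0, -0.1)]
centroid, tangents = fit_tangent_model(samples, n_tangents=2)
```

On these samples the two recovered tangents align with the first and second coordinate axes. Power iteration can stall if the start vector happens to be orthogonal to a leading direction, so a production version would more likely rely on a library eigendecomposition or SVD.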

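For a model with orthonormal tangent vectors, the minimum distance to the linear subspace used in the definition of the one-sided tangent distance is the length of the residual left after projecting the pattern onto the tangents. The sketch below computes the squared distance in that way. The function name `tangent_distance_sq` and the toy model are hypothetical, and orthonormality of the tangents is assumed rather than checked.

```python
def tangent_distance_sq(x, centroid, tangents):
    """Squared one-sided tangent distance from pattern x to a model.

    Assumes the tangent vectors are orthonormal, so the closest point of the
    subspace is the orthogonal projection of (x - centroid) onto them.
    """
    delta = [xi - ci for xi, ci in zip(x, centroid)]
    sq_norm = sum(d * d for d in delta)
    sq_proj = sum(sum(d * t for d, t in zip(delta, tv)) ** 2 for tv in tangents)
    return sq_norm - sq_proj


# Hypothetical model in 3 dimensions: centroid at the origin, one tangent.
centroid = [0.0, 0.0, 0.0]
tangents = [[1.0, 0.0, 0.0]]
d_off = tangent_distance_sq([1.0, 2.0, 3.0], centroid, tangents)  # off the subspace
d_on = tangent_distance_sq([5.0, 0.0, 0.0], centroid, tangents)   # on the subspace
```

A pattern lying on the subspace has distance zero regardless of how far it is from the centroid, which is the invariance the tangent model is meant to capture.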
Given a one-sided tangent distance model $M = (C, \{T_j\})$, it is quite easy to verify that the squared tangent distance between a pattern $x$ and the model $M$ can be written as:

$$d_T^2(x, M) = \delta^\top \delta - \sum_j \big(\delta^\top T_j\big)^2 \qquad (6)$$

where $\delta = x - C$, and $\delta^\top$ denotes the transpose of $\delta$. Consequently, in our definition of margin, we have $d_p^+ = d_T^2(x_p, M^+)$ and $d_p^- = d_T^2(x_p, M^-)$, where $M^+$ and $M^-$ are the nearest positive and negative models. Given this definition of margin, we can implement the choice of the new hypothesis in the $\theta$-Margin Re-weighting Strategy by maximizing the margin using gradient ascent on the current input distribution.

3.1 Improving margins as a driven gradient ascent

Considering the tangent distance formulation as given in equation (6), we can verify that it is defined by scalar products. Thus, we can differentiate it with respect to the centroid $C^+$ and the tangent vectors $T_j^+$ of the nearest positive model, obtaining (with $\delta^+ = x_p - C^+$):

$$\frac{\partial d_p^+}{\partial C^+} = -2\Big(\delta^+ - \sum_j \big(\delta^{+\top} T_j^+\big)\, T_j^+\Big), \qquad \frac{\partial d_p^+}{\partial T_j^+} = -2\,\big(\delta^{+\top} T_j^+\big)\,\delta^+ .$$

Considering that

$$\frac{\partial \rho_p}{\partial d_p^+} = -\frac{2\, d_p^-}{(d_p^+ + d_p^-)^2} \quad \text{and} \quad \frac{\partial \rho_p}{\partial d_p^-} = \frac{2\, d_p^+}{(d_p^+ + d_p^-)^2},$$

we can compute the derivative of the margin with respect to changes in the nearest positive model:

$$\frac{\partial \rho_p}{\partial C^+} = \frac{4\, d_p^-}{(d_p^+ + d_p^-)^2} \Big(\delta^+ - \sum_j \big(\delta^{+\top} T_j^+\big)\, T_j^+\Big), \qquad \frac{\partial \rho_p}{\partial T_j^+} = \frac{4\, d_p^-}{(d_p^+ + d_p^-)^2}\, \big(\delta^{+\top} T_j^+\big)\, \delta^+ .$$

A similar solution is obtained for the nearest negative model, since it only differs in changing the sign and in exchanging the indexes $+$ and $-$. Moreover, the derivatives are null for all the other models. Thus, we can easily maximize the average margin in the training set if, for each pattern presented to the classifier, we move the nearest models in the direction suggested by the gradient. Note that, like in the LVQ algorithm, for each training example only the nearest models are changed.

When maximizing the expected margin on the current distribution $\{\gamma_p\}$, i.e., $H$, for each model $M$ we have:

$$\Delta M = \eta \sum_p \gamma_p\, \frac{\partial \rho_p}{\partial M}$$

where $\eta$ is the usual learning rate parameter. In the algorithm (see Figure 2), for brevity, we group the above variations by referring to the whole model, i.e., $\Delta M$ stands for the changes to the centroid and to all the tangent vectors of $M$.

  TVQ Algorithm
  Input:
    $T$: no. of iterations
    $Q$: no. of models per class
    $\theta$: margin threshold
  Initialize $w_p = 1$, $\gamma_p = 1/N$ for all $p$
    initialize the models with random values
  for $t = 1, \ldots, T$
    for all $p$: select $M^+$ and $M^-$ such that they are the nearest, positive and negative, models
      compute $\rho_p$ as in eq. (3) and accumulate the changes on the nearest models
    for all $M$: $M \leftarrow M + \Delta M$ and orthonormalize its tangents
    for all $p$: update the distribution by the rule $w_p \leftarrow w_p + k\, K_\theta(\rho_p)$
    normalize the $\gamma_p$'s such that $\sum_p \gamma_p = 1$
  End

Figure 2: The TVQ Algorithm.

4 The TVQ algorithm

The algorithm (see Figure 2) starts with random models and a uniform distribution on the training set. For each pattern, the variations on the closest positive and the closest negative models are computed according to the density of that pattern on the training set $S'$. When all the patterns in $S$ are processed, the models are updated by performing a weighted gradient ascent on the values of the margin. Moreover, for each pattern in the training set such that the value of the margin is smaller than a fixed value $\theta$, the distribution is augmented. The effect is to force the gradient ascent to concentrate on the hardest examples in the training set. As we saw in Section 2, the increment $k$ to the distribution is simply the effect of adding a replica of incorrectly classified patterns to the augmented training set.

The initialization of the algorithm may be done in different ways. The default choice is to use randomly generated models. However, when the training set size is not prohibitive, we can drastically speed up the algorithm by taking as initial models the ones generated by another algorithm (e.g., HSS). However, in the case of multiple models per class, the initialization through the HSS method would generate identical models for each class, and that would invalidate the procedure. A possible choice in this case is to generate HSS models by using different random conditional distributions for different models associated with the same class. Another solution, which is useful when the size of the training set is relatively large, is to initialize the centroids as the average of the positive instances and then generate random tangents. Experimental results have shown that the differences in performance obtained by using different initialization criteria are negligible. As we could expect, the speed of convergence with different initialization methods may be drastically different. This is due to the fact that when TVQ is initialized with HSS models it starts with a good approximation of the optimal hypothesis (see Figure 3-(a)), while random initializations implicitly introduce an initial poor estimate of the final distribution, due to the mistakes that most of the examples make on the first few iterations.

  Method            Parameters    Err
  HSS 1-sided       tangents      3
  LVQ 2             codebooks     32
  TD-Neuron         tangents      3
  HSS 2-sided       9 tangents    34
  Euclidean 1-NN    prototypes    3
  SVM Linear                      4
  SVM Poly d=2                    22
  SVM Poly d=3                    323
  SVM Poly d=4                    42

Table 1: Test results for different 1-NN methods.

5 Results

We compared the TVQ algorithm versus SVMs and other 1-NN based algorithms: 1-sided HSS, 2-sided HSS, TD-Neuron [Sona et al., 2000], and LVQ. The comparison was performed using exactly the same split of a dataset consisting of digits randomly taken from the NIST-3 dataset. The binary 128x128 digits were transformed into 64-grey level 16x16 images by a simple local counting procedure¹. The only preprocessing performed was the elimination of empty borders. The training set consisted of randomly chosen digits, while the remaining digits were used in the test set.

The obtained results for the test data are summarized in Table 1. For each algorithm, we reported the best result, without rejection, obtained for the dataset. Specifically, for the SVM training we used the SVM-light package available on the internet². Different kernels were considered for the SVMs: linear and polynomial
with degrees 2, 3 and 4 (we used the default for the other parameters). Since SVMs are binary classifiers, we built 10 SVMs, one for each class against all the others, and we considered the overall prediction as the label with the highest margin. The best performance has been obtained with a polynomial kernel of degree 2.

¹ The number of pixels with value equal to 1 is used as the grey value for the corresponding pixel in the new image.
² SVM-light, hosted at www-ai.cs.uni-dortmund.de.

We ran the TVQ algorithm with two different values for $\theta$ and for different architectures. Moreover, we also ran an experiment just using a single centroid for each class (i.e., with no tangent vectors). The smaller value for $\theta$ has been chosen just to account for the far smaller complexity of the model.

Table 2: Test results for TVQ.

In almost all the experiments the TVQ algorithm obtained the best performance. Results on the test data are reported in Table 2. Specifically, the best result for SVM is worse than almost all the results obtained with TVQ. Particularly interesting is the result obtained by just using a single centroid for each class. This corresponds to performing an LVQ with just 10 codebooks, one for each class.

In addition, TVQ returns far more compact models, allowing a reduced response time in classification. In fact, the classifier using polynomial SVMs with $d = 2$ needs 223 support vectors, while in the worst case the models returned by the TVQ involve a total of 4 vectors (one centroid plus tangents for each model).

In Figure 3, typical error curves for the training and test errors (3-(b)), as well as the margin distributions on the training set (3-(c)) and the induced margin distribution on the test set (3-(d)), are reported. From these plots it is easy to see that the TVQ does not show overfitting. This was also confirmed by the experiments involving the models with higher complexity and smaller values of $\theta$. Moreover, the impact of the $\theta$-margin on the final margin distribution on the training set is clearly shown in 3-(c), where a steep increase of the distribution is observed in correspondence of $\theta$ at the expense of higher values of margin. Even if to a minor extent, a similar impact on the margin distribution is observed for the test data.

In Figure 4 we have reported the rejection curves for the different algorithms. As expected, the TVQ algorithm was competitive with the best SVM, resulting in the best algorithm for almost the whole error range.

6 Conclusions

We proposed a provably convergent re-weighting scheme for improving margins, which focuses on difficult examples. On the basis of this general approach, we defined a Vector Quantization algorithm based on tangent distance, which experimentally outperformed state-of-the-art classifiers both in generalization and model compactness. These results confirm that the control of the shape of the margin distribution has a great effect on the generalization performance. When comparing the proposed approach with SVM, we may observe that, while our approach shares with SVM the Statistical Learning Theory concept of uniform convergence of the empirical risk to the ideal risk, it exploits the input distribution to directly work on non-linear models instead of resorting to predefined kernels. This way to proceed is
very similar to the approach adopted by Boosting algorithms. However, in Boosting algorithms, several hypotheses are generated and combined, while in our approach the focus is on a single hypothesis. This justifies the adoption of an additive re-weighting scheme, instead of a multiplicative scheme which is more appropriate for committee machines.

Figure 3: TVQ with tangents: (a) comparison with different initialization methods (HSS, random, centroid); (b) test and training error; (c) cumulative margins on the training set at different iterations; (d) cumulative margins on the test set at different iterations.

Figure 4: Detail of rejection curves (rejection versus error) for the different 1-NN methods. The rejection criterion is the difference between the distances of the input pattern with respect to the first and the second nearest models.

Acknowledgments

Fabio Aiolli wishes to thank Centro META - Consorzio Pisa Ricerche for supporting his Ph.D. fellowship.

References

[Bartlett, 1998] P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Trans. on Inform. Theory, 44(2):525-536, 1998.

[Hastie et al., 1995] T. Hastie, P. Y. Simard, and E. Säckinger. Learning prototype models for tangent distance. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Inform. Proc. Systems, volume 7, pages 999-1006. MIT Press, 1995.

[Kohonen et al., 1996] T. Kohonen, J. Hynninen, J. Kangas, J. Laaksonen, and K. Torkkola. LVQ_PAK: The learning vector quantization program package. Technical Report A30, Helsinki Univ. of Tech., Lab. of Computer and Inform. Sci., January 1996.

[Mason et al., 1998] L. Mason, P. Bartlett, and J. Baxter. Improved generalization through explicit optimization of margins. Technical report, Dept. of Sys. Eng., Australian National University, 1998.

[Schapire et al., 1998] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Stat., 26(5):1651-1686, 1998.

[Schapire, 1999] R. Schapire. Theoretical views of boosting. In Computational Learning Theory: Proc. of the 4th European Conference, EuroCOLT'99, 1999.

[Schwenk and Milgram, 1995a] H. Schwenk and M. Milgram. Learning discriminant tangent models for handwritten character recognition. In Intern. Conf. on Artif. Neur. Netw. Springer-Verlag, 1995.

[Schwenk and Milgram, 1995b] H. Schwenk and M. Milgram. Transformation invariant autoassociation with application to handwritten character recognition. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Inform. Proc. Systems, volume 7, pages 991-998. MIT Press, 1995.

[Simard et al., 1993] P. Y. Simard, Y. LeCun, and J. Denker. Efficient pattern recognition using a new transformation distance. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems, volume 5, pages 50-58. Morgan Kaufmann, 1993.

[Simard, 1994] P. Y. Simard. Efficient computation of complex distance metrics using hierarchical filtering. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Inform. Proc. Systems, volume 6, pages 168-175. Morgan Kaufmann, 1994.

[Sona et al., 2000] D. Sona, A. Sperduti, and A. Starita. Discriminant pattern recognition using transformation invariant neurons. Neural Comput., 12(6):1355-1370, 2000.

[Vapnik, 1998] V. Vapnik. Statistical Learning Theory. Wiley, 1998.