Using tree-grammars for training set expansion in page classification

Stefano Baldi, Simone Marinai, Giovanni Soda
DSI - University of Florence - Italy
Email: marinai@dsi.unifi.it

Abstract

In this paper we describe a method for the expansion of training sets made of XY trees representing page layout. This approach is appropriate when dealing with page classification based on MXY tree page representations. The basic idea is to use tree grammars to model the variations in the tree that are caused by segmentation algorithms. A set of general grammatical rules is defined and used to expand the training set. Pages are classified with a k-NN approach where the distance between pages is computed by means of the tree-edit distance.

1. Introduction

Document image classification has a large number of applications such as document organization, retrieval, routing, and understanding. An efficient initial classification can be achieved by representing the layout with XY-trees [9] and with their extension dealing with ruling lines (MXY-tree [2]). In XY-trees the root carries information about the entire page and each child contains a portion of the image related to its father. Every portion is recursively obtained by XY-cuts. An XY-cut is a horizontal or vertical cut following blank spaces (or thin lines in the MXY-tree), which extends from side to side of the image. MXY trees have recently been used for page classification by means of a vectorial representation of the tree that is classified with artificial neural networks [3]. MXY-tree descriptions of document images have also been used for the retrieval of relevant pages in the image domain [4].

Unfortunately, the segmentation algorithms that build the XY-tree do not always produce similar trees from similar pages (e.g. Fig. 1). In some cases the related trees are very different from each other, and this gives rise to unexpected difficulties when trying to compare them.
When working in the domain of trainable classifiers, a basic assumption is the availability of a training set large enough to generalize over the differences introduced by segmentation algorithms and obtain a correct classification. For practical uses it is frequently too expensive to produce large hand-labeled training sets. Moreover, in some application domains large training sets are simply unavailable regardless of the effort required for groundtruthing.

In this paper we propose a solution to this problem that is based on a set of tree-grammar rules, which are used by an expansion algorithm to enlarge an initial set of XY-trees. The larger training set obtained in this way contains both the initial samples (natural samples) and the artificial ones. This set will be the new learning set. This approach is somewhat similar to the distortion of graph models proposed in [8] in order to model real-world distortions in attributed graphs. Training set expansion has recently been applied also in the domain of handwritten character recognition [1].

The paper is organized as follows: in Section 2 we describe the proposed method for training set expansion, which in turn contains a description of the proposed tree grammar (Sec. 2.1) and a discussion of the expansion algorithm (Sec. 2.2). The use of tree-edit distance for page classification is analyzed in Section 3, and the related classes are analyzed in Section 3.1. The experimental results are reported in Section 4, whereas a final discussion is drawn in Section 5.

2. Training set expansion

Page classification is based on two main steps: an off-line training set expansion and an on-line page classification. In the off-line step new trees are added to the training set by modifying the labeled ones to simulate actual distortions occurring in real segmentations. The distortions are modeled with an appropriate tree grammar. In the second step an unknown tree is classified by comparing it with the trees in the expanded training set.
To this purpose, in this paper we use a k-NN classification approach where the distance from the documents in the training set is computed with the tree-edit distance.

Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003)

Figure 1. MXY trees obtained from similar pages with different spacings between regions.

2.1. MXY tree grammar

Tree grammars [7] are similar to string grammars except that the basic objects are trees instead of strings. More precisely, a tree grammar G = (S, N, T, P) is defined by a starting symbol S (S ∈ N), a set N of nonterminal symbols, a set T of terminal symbols, and a set P of production rules of the form α → β, where α contains at least one nonterminal. In the following we will refer to α as the left-hand side (LHS) member and to β as the right-hand side (RHS) one. Generally speaking, the LHS member identifies the objects the rule has to be applied to, and the RHS describes how to build the related output.

It is well known that each labeled tree can be represented as a string by using a prefix notation, where the label of a node precedes, in the related string, the list of the sub-strings that represent the sub-trees originating from the node's children (see Fig. 2).

Figure 2. Example of prefix notation: a tree with leaves I, T, T written as [I,[T,T]].

This property becomes very useful when defining rules with a prefix notation. Similarly to string grammars, we also allow the use of wildcards such as the star mark (*) or the plus mark (+), meaning zero or more repetitions and one or more repetitions of a tree, respectively. In this way we describe the structures of the trees we want to detect or build. In addition, we are interested in working with labeled trees, where a label describes a region: the type of XY-cut for internal nodes (horizontal or vertical cut along spaces or lines) and the region content for leaves (image, text block, or ruling line). The labels can contain additional information such as the block size and the number of children.

Several packages exist for the definition of tree grammars and subsequent language generation (e.g. [5, 6]); however, their language generation algorithms are not appropriate for our problem. Instead, we developed our own system in order to integrate the grammar definition with the expansion algorithm described in Section 2.2.
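The prefix notation described above can be illustrated with a minimal sketch. It assumes a labeled tree is stored as a pair (label, children); this representation, the label strings "hs", "vs", "I", "T" and the function names to_prefix and parse_prefix are illustrative choices, not part of the system described in the paper.

```python
# A labeled tree is a pair (label, children); children is a tuple of trees.
# The representation and the label strings are assumptions for illustration.

def to_prefix(tree):
    """Write a tree in prefix notation: the label, then the list of children."""
    label, children = tree
    if not children:
        return label
    return label + "[" + ",".join(to_prefix(c) for c in children) + "]"


def parse_prefix(text):
    """Rebuild a tree from the string produced by to_prefix."""
    pos = 0

    def tree():
        nonlocal pos
        start = pos
        # The label runs until the next structural character.
        while pos < len(text) and text[pos] not in "[],":
            pos += 1
        label = text[start:pos]
        children = []
        if pos < len(text) and text[pos] == "[":
            pos += 1  # skip '['
            while True:
                children.append(tree())
                if text[pos] == ",":
                    pos += 1  # another child follows
                else:
                    pos += 1  # skip ']'
                    break
        return (label, tuple(children))

    return tree()


# A page cut horizontally into an image and a two-column text region.
page = ("hs", (("I", ()), ("vs", (("T", ()), ("T", ())))))
print(to_prefix(page))  # prints hs[I,vs[T,T]]
```

Because the string form and the tree form are interchangeable, rules can be written as short strings while the matching and rewriting operate on the tree structure.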
Our main goal in designing the expansion program is to allow the user to describe the rules in the most flexible way. Several kinds of tree grammars can be described, such as regular or expansive ones [7], and the user can define (in a specific compiled Java file) logical predicates or alteration functions to be used in rules. A logical predicate (Table 1) works in the LHS member of a rule: it checks a local property of one tree (related to the arrangement, number and label of nodes) and returns a boolean value. The LHS conditions are satisfied only when all the contained predicates are satisfied. One example of a predicate is the exclamation mark (!), which works as a negation. Suppose it precedes a label identifying an image in the LHS member; in this case the rule will be applied to all the trees that do not contain an image in the related node. Conversely, an alteration function (Table 1) works in the RHS member of a rule and modifies the tree structure. The user can change the type of a set of nodes (for example, replacing blocks of text with images) or their arrangement.

Predicate     Meaning
Ldw value     Node level lower than value
Lup value     Node level greater than value

Function      Meaning
b+(trees)     The sub-trees listed in trees are added as right-brothers to the current node.
b-(trees)     The sub-trees listed in trees are added as left-brothers to the current node.
+(trees)      The sub-trees listed in trees are added as right-children to the current node.
/(tree)       The tree starting in the current node is substituted with the tree described in tree.
rnd(tree)     The sub-tree described in tree is added to the children list of the current node at a random point.

Table 1. Some predicates and alteration functions.

We use the following set of labels for describing the meaning of nodes.

hs (vs): cut along a horizontal (vertical) space.
hl (vl): cut along a horizontal (vertical) line.
T: text block (leaf).
I: image block (leaf).
hL (vL): horizontal (vertical) line (leaf).
x: identifies any kind of block.

Rules are defined with a prefix notation. For instance, a rule containing l[i,[t,t]] in the LHS will be applied to trees like the one in Figure 2. By using wildcards it is possible to define more general rules. For instance, the expression l[i,[t*]] still detects the tree described in Figure 2, as well as trees with a different number of leaves in the right branch.

Let us now describe some typical examples of rules. Rule (1) adds an image leaf in the right branch.

l[i,[t,t]] → l[i,[t,i,t]]    (1)

Rule (2) shows how predicates and alteration functions work. Ldw0 is a logical predicate which is true when the related node belongs to the zero level of the tree (i.e. it is the root). ! is a logical predicate which is a negation. rcb is an alteration function which changes the order of the children of the corresponding node (recombination). The effect of rule (2) is to transform a tree without an image in the first branch and with two text-block leaves in the other one into a tree with an image in the first block and the two text blocks swapped with each other.

l(ldw0)[!i,[t,t]] → l[i,(rcb)[t,t]]    (2)

2.2. Language generation

After defining an appropriate set of rules, it is possible to expand the training set. In other words, we can generate the language defined by the grammar.
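To illustrate how an LHS pattern with wildcards and negation (Section 2.1) can be checked against a tree, the following is a minimal sketch. The tuple representation, the helper names pat, match and match_seq, and the rule that a pattern without children matches only leaves are assumptions made here for illustration; the system described in the paper defines its predicates in Java and may behave differently.

```python
# Trees are (label, children); patterns are (label, children, rep), where rep
# is "" (exactly one tree), "*" (zero or more) or "+" (one or more).
# All names and conventions here are hypothetical, for illustration only.

def pat(label, *children, rep=""):
    """Build a pattern node; 'x' matches any label, '!I' matches any label but I."""
    return (label, children, rep)


def label_ok(plabel, label):
    if plabel.startswith("!"):
        return label != plabel[1:]
    return plabel == "x" or plabel == label


def match(pattern, tree):
    """True when the tree has the label and the child structure of the pattern."""
    plabel, pchildren, _rep = pattern
    label, children = tree
    return label_ok(plabel, label) and match_seq(pchildren, children)


def match_seq(pats, trees):
    """Match a sequence of child patterns against a sequence of child trees."""
    if not pats:
        return not trees
    head, rest = pats[0], pats[1:]
    if head[2] == "":
        return bool(trees) and match(head, trees[0]) and match_seq(rest, trees[1:])
    least = 1 if head[2] == "+" else 0
    # Let the wildcard consume k leading trees, for growing k.
    for k in range(len(trees) + 1):
        if k >= least and match_seq(rest, trees[k:]):
            return True
        if k == len(trees) or not match(head, trees[k]):
            return False
    return False


T, I = ("T", ()), ("I", ())
# An image leaf followed by a node holding zero or more text leaves.
lhs = pat("x", pat("I"), pat("x", pat("T", rep="*")))

print(match(lhs, ("hl", (I, ("vs", (T, T))))))     # True
print(match(lhs, ("hl", (I, ("vs", (T, T, T))))))  # True
print(match(lhs, ("hl", (I, ("vs", (T, I, T))))))  # False
```

Replacing rep="*" with rep="+" makes the same pattern reject a right branch with no text leaves.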
To avoid an excessive distortion in the generated trees, we assign a threshold to each tree and a cost to the deformation made by each rule. A given rule whose LHS matches a tree will be applied when the application cost (Cost) is below the current Threshold. This approach is described in Algorithm 1, where t is the tree to expand, r is the expansion rule, and T is the result of the possible application of r to t. ApplicationCostOf checks the LHS conditions and computes the application cost of r to t by considering the cost assigned to the rule (if the LHS conditions do not hold it returns an overflow value). TrOf associates each tree with the related threshold, and Apply is a function that returns a modified tree.

Algorithm 1 Expand(t, r; T)
    Cost = ApplicationCostOf(t, r)
    if Cost < TrOf(t) then
        T ← Apply(t, r)
        return true
    else
        return false

When expanding a training set we start with an initial working set (TSet) containing only hand-labeled trees (the NaturalSet), then we try to apply each rule r in the rule set to each tree t ∈ TSet by means of the Expand algorithm. The modified tree T is added to TSet and its threshold is updated (see Algorithm 2).

Algorithm 2 ExpandTreeSet
    TSet ← NaturalSet
    foreach t ∈ TSet do
        foreach r ∈ RulesSet do
            if Expand(t, r; T) then
                TSet ← TSet ∪ {T}
                TrOf(T) ← TrOf(t) − ApplicationCostOf(t, r)

Note that an expanded tree can be expanded again (like a natural one). However, its threshold will be lower than the original one, so it will probably generate a lower number of expansions. This process is repeated until no rule is applicable any more, or every applicable rule is too expensive with respect to the remaining threshold. The scheme in Figure 3 shows an example of the hierarchy of the language terms: it looks like a tree whose root is a natural tree. We will call this hierarchy a dictionary. Every tree of a dictionary is assigned to the class of the natural tree in the root. During language generation a dictionary is built for each natural tree, and the corresponding trees are added to the expanded training set.

Figure 3. Example of hierarchy of generated terms.

3. Tree classification

MXY tree representations for page layout classification have been used in [3] in conjunction with a vectorial representation of trees and MLP-based classifiers. In this paper we check the effectiveness of the proposed expansion method with a classification approach that is more appropriate for incremental learning. Basically, we compare the unknown tree with the trees in the training set by means of a tree-edit distance, and the class is found with a k-NN mechanism. This approach is computationally expensive; however, the principal aim is to demonstrate the advantages that can be obtained when expanding the training set (see Section 4).

Similarly to the string edit distance, the tree edit distance is a method for evaluating the distance between labeled trees by counting the number of edit operations (with an associated cost) needed to transform one tree into another. We can define the distance between two trees as the cost of the minimum-cost set of operations that are required to transform the first tree into the second one. Zhang and Shasha [10] proposed an efficient algorithm to compute the tree edit distance. Using the tree edit distance we can build a k-NN classifier where each tree is assigned to the most common class among the k trees of the training set that are nearest (in the sense of Zhang's distance) to the unknown tree.

Using such a classifier, we can classify a tree using either an expanded training set or a non-expanded (natural) one, and evaluate the differences between the classifications obtained. If the rules work properly, the classifications on the expanded training set will be better than the classifications on the natural set. First of all we have to identify some useful expansion rules. To remain as general as possible, we look for a different rule set for each class, since different classes have different peculiarities to be considered.

3.1. Classes

In a classification problem the training set contains a different number of samples for each class. Some classes are highly populated and some others are not. This is related to problems of a practical nature. For instance, in a book there are fewer pages containing titles or illustrations than full-text pages.

Class name    Description of pages in the class
Image         An image with or without caption
ImageText2    An image on two-column text
Issue2        Start of an issue
SecE2         End-of-section page
SecM2         Section mark page
Text2         Text on two columns (no images)
Text2Image    Image and two text columns

Table 2. Main features of classes.
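The k-NN classification with tree-edit distance described in Section 3 can be sketched as follows. This is not the Zhang and Shasha algorithm [10]: for brevity it uses a plain memoized recursion over forests, which computes the same ordered tree edit distance but less efficiently. Unit costs for insertion, deletion and relabeling, the (label, children) tuple representation, and the names forest_dist, tree_dist and knn_classify are assumptions made for illustration.

```python
from collections import Counter
from functools import lru_cache

# A tree is (label, children); a forest is a tuple of trees.
# Unit edit costs are an assumption; the paper only says costs are associated.


def size(tree):
    """Number of nodes in a tree."""
    return 1 + sum(size(c) for c in tree[1])


@lru_cache(maxsize=None)
def forest_dist(f, g):
    """Edit distance between two ordered forests (memoized recursion)."""
    if not f or not g:
        # Only deletions or only insertions remain.
        return sum(size(t) for t in f) + sum(size(t) for t in g)
    (lv, cv), (lw, cw) = f[-1], g[-1]
    return min(
        forest_dist(f[:-1] + cv, g) + 1,  # delete rightmost root of f
        forest_dist(f, g[:-1] + cw) + 1,  # insert rightmost root of g
        # map the two rightmost roots onto each other (relabel if needed)
        forest_dist(cv, cw) + forest_dist(f[:-1], g[:-1]) + (lv != lw),
    )


def tree_dist(a, b):
    return forest_dist((a,), (b,))


def knn_classify(query, training, k=3):
    """Return the most common class among the k training trees nearest to query."""
    nearest = sorted(training, key=lambda item: tree_dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


T, I = ("T", ()), ("I", ())
training = [
    (("hs", (I, T)), "Image"),
    (("hs", (I,)), "Image"),
    (("vs", (T, T)), "Text2"),
    (("vs", (("hs", (T, T)), T)), "Text2"),
]
query = ("hs", (I, T, T))
print(knn_classify(query, training, k=3))  # prints Image
```

With an expanded training set, the same function is simply called with the natural trees plus the generated ones, each generated tree carrying the class of the natural tree at the root of its dictionary.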
We are obviously interested in the expansion of the low-populated classes rather than the highly populated ones. In our experiments we consider seven different classes, briefly described in Table 2. Text2 and ImageText2 are the most populated classes, and we did not write expansion rules for them. For the other classes we designed a priori some appropriate rules following common sense. For instance, in classes with images one appropriate rule allows us to add a text block (caption) below an image when it does not appear (see Figure 4). In classes with two-column text it is useful to provide rules splitting one text block into two or more blocks in order to simulate the presence of sub-paragraphs in the text.

Figure 4. Example of a rule used to add a caption to a figure.

4. Experimental results

In this section we describe the experiments carried out on a set of labeled pages contained in two books of the 19th Century (books of the 19th Century are particularly interesting for Digital Libraries since they are copyright-free), downloaded from the web site of the National Library of France. Each book contains roughly 650 pages belonging to the classes described in Table 2. An average of five rules has been designed for each class.

With these experiments we want to answer two questions. First, verify whether the generated pages are correct samples of the corresponding class. Second, evaluate the smallest size of the expanded set of pages. We classified all the pages of one book (test set) with a training set built starting from natural pages of the other book only. The test set is fixed, whereas we varied the number of natural pages from 10% to 100% of the pages in the second book.

Figure 5. Left: average classification error (for the whole test set) when varying the percentage of pages in the natural set. Center and right: classification error in class Image and Issue2, respectively.

Figure 5 reports the average relative classification error when growing the size of the natural set. The two lines report the error obtained when classifying with a training set of natural pages only (continuous line) and with a training set of natural and expanded pages (dotted line). In the figure we also report the relative classification error for two typical classes. We can observe that the average error on the whole set of classes is always lowest when using the expanded set, regardless of the size of the natural set used for training. In some classes the error already reaches its best value when the percentage rises to 0.3 (classes Image, Text2Image).

5. Conclusions

The experimental results confirm that the artificially generated pages are representative of the corresponding class. Moreover, we can estimate that when using a set of pages containing less than half of a book, we can get a good generalization from the rules. Of course these results are related to the database and to the set of rules we have used, but they nevertheless validate the effectiveness of the proposed approach.
The rule set is easy to redefine, so changes in the database can be described with the grammar. However, currently every change in the rule set is left to the user. We are working on a system that automatically improves the performance of the rule set by acting on its composition and excluding useless rules. This is done by computing the contribution of the application of each rule to the overall classification error on a training set. On the other hand, a fully automatic learning of rules seems to be difficult because of the high-level nature of the rules.

References

[1] J. Cano, J. C. Pérez-Cortes, J. Arlandis, and R. Llobet. Training set expansion in handwritten character recognition. In Proc. SSPR/SPR 2002, pages 548-556, 2002.
[2] F. Cesarini, M. Gori, S. Marinai, and G. Soda. Structured document segmentation and representation by the modified X-Y tree. In Proc. Fifth ICDAR, pages 563-566, 1999.
[3] F. Cesarini, M. Lastri, S. Marinai, and G. Soda. Encoding of modified X-Y trees for document classification. In Proc. Sixth ICDAR, pages 1131-1136, 2001.
[4] F. Cesarini, S. Marinai, and G. Soda. Retrieval by layout similarity of documents represented with MXY trees. In Document Analysis Systems V, pages 353-364, 2002.
[5] F. Drews. The TREEBAG Manual. Department of Computer Science, Umeå University, S-90187 Umeå, Sweden.
[6] C. Ermel and T. Schultzke. The AGG Environment: A Short Manual. TU Berlin.
[7] R. C. Gonzalez and M. C. Thomason. Syntactic Pattern Recognition - An Introduction. Addison-Wesley, 1978.
[8] B. T. Messmer and H. Bunke. Error-correcting graph isomorphism using decision trees. IJPRAI, 12(6):721-742, 1998.
[9] G. Nagy and S. Seth. Hierarchical representation of optically scanned documents. In Proc. of ICPR, pages 347-349, 1984.
[10] K. Zhang and D. Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Computing, 18(6):1245-1262, December 1989.