An Effective Approach for Compression of Bengali Text

COPYRIGHT 2011 IJCIT, ISSN 2078-5828 (PRINT), ISSN 2218-5224 (ONLINE), VOLUME 01, ISSUE 02, MANUSCRIPT CODE: 110111

S. A. Ahsan Rajon and Md. Rafiqul Islam

S. A. Ahsan Rajon is with the Computer Science and Engineering Discipline, Khulna University, Khulna-9208, and the Faculty of Computer Science, KPCbd, Khulna-9100, Bangladesh. E-mail: ahsan.rajon@gmail.com.
Md. Rafiqul Islam is with the Computer Science and Engineering Discipline, Khulna University, Khulna-9208, Bangladesh.

Abstract: In this paper, we propose an effective and efficient approach for compressing Bengali text. The paper presents a methodical study of Bengali text compression techniques. The main target of this research is to provide a framework for Bengali text compression that ensures a simple, computationally inexpensive and effective scheme. The proposed scheme is aimed at a low-overhead data communication and management framework for battery-powered, energy-constrained devices. The current approaches to data compression, and their correspondence, usage and efficiency for compressing Bengali text, are also presented. A comparative analysis of existing compression techniques and the proposed approach in terms of time and space complexity, along with compression ratio, has been integrated. We also present an effective scheme for constructing the training platform or knowledgebase used to obtain compression, as no specific tertiary dictionary-based Bengali text compression scheme is available for research. The main aspect of the proposed scheme is the integration of a string ranking scheme for indexing the source text to achieve a hierarchical compression scheme. Static coding is also employed in the proposed scheme to encode the data for compression. The paper also incorporates a power consumption analysis of the proposed compression scheme along with the performance analysis in terms of compression ratio.

Index Terms: Bengali Text, Data Compression, Content or Component Ranking, Data Model.

1 INTRODUCTION

There are hundreds and thousands of languages in this world. However, Bengali is the only language in the pages of history that has been established at the cost of the valuable lives of martyrs. Though the dignity of the language was saved through terrible bloodshed and extreme sacrifice, the steps taken so far to establish Bengali as a powerful language in the digital world are not noteworthy. The standardization of Bengali text representation techniques or encoding methods has not arrived at a unique base, despite a long debate. Consequently, enhanced Efficient Data Representation Schemes (EDRS) have not developed smoothly. Bengali text compression still depends mostly on ordinary data compression techniques, which often results in expansion of Bengali text.

Compression is the process of reducing the size of a file or data by identifying and removing redundancy in its structure. Data compression offers an effective approach to reducing communication costs by using the available bandwidth effectively. Data compression techniques are generally divided into two categories, namely lossless data compression and lossy data compression [1]. For lossless schemes, the recovery of data should be exact. Lossless compression algorithms are to some extent essential for all kinds of text processing, scientific and statistical database applications, medical and biological image processing, DNA and other biological data management, and so on. A lossy data compression technique, however, does not ensure the exact recovery of data. Lossy data compression is widely used for image compression and multimedia data compression [3], [6].

Our aim is to develop a lossless compression technique for compressing Bengali text. Here, we have employed a new statistical model with a novel approach of integrating a text ranking or component categorization scheme for building the model. A new dictionary matching scheme and static coding are used to obtain the proposed compression. Moreover, we have used a new theoretical concept of choosing the knowledgebase entries, which has enabled us to obtain a notable compression ratio using a smaller number of knowledgebase entries than other methods, while consuming fewer resources.

The prime aspect of the proposed scheme is to ensure compression of small and moderate volumes of text rather than of huge texts, since we aim to develop the scheme for battery-powered smart devices rather than for giant computers. Short text compression for battery-powered small devices is much more challenging because the devices possess low computational capability, small memory and limited processing power, which restricts the applicable schemes to simple, low-resource ones. In terms of communication channel usage, the devices should be efficient enough to generate minimal traffic while communicating. Though a wide variety of devices (especially mobile phones) provide data services for communicating in Bengali text, to the best of our knowledge there is no sophisticated compression scheme for this purpose. This motivates us to develop a Bengali small text compression scheme for power-constrained devices.
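As a small illustration of the lossless requirement and of the byte cost of Unicode Bengali text discussed in the introduction, the following sketch encodes a short Bengali string in UTF-8 and round-trips it through a general-purpose compressor. The sample string, the function names and the use of Python's `zlib` module are assumptions made for illustration; `zlib` only stands in for an "ordinary" compressor and is not the scheme proposed in this paper.

```python
import zlib

# Hypothetical sample: the two words "Bangla bhasha", written with escapes so
# the exact code points are visible (9 Bengali code points and 1 space).
sample = "\u09ac\u09be\u0982\u09b2\u09be \u09ad\u09be\u09b7\u09be"


def utf8_cost(text):
    """Average number of UTF-8 bytes needed per character of `text`."""
    return len(text.encode("utf-8")) / len(text)


def lossless_roundtrip(text):
    """Compress and decompress `text`; return (raw size, packed size, exact?)."""
    raw = text.encode("utf-8")
    packed = zlib.compress(raw, 9)
    restored = zlib.decompress(packed).decode("utf-8")
    # A lossless scheme must give back exactly the original text.
    return len(raw), len(packed), restored == text


raw_len, packed_len, exact = lossless_roundtrip(sample)
print("characters:", len(sample), "UTF-8 bytes:", raw_len)
print("bytes per character:", utf8_cost(sample))
print("zlib output bytes:", packed_len, "exact recovery:", exact)
```

Every code point in the Unicode Bengali block takes three bytes in UTF-8, so a short message starts at roughly three times its character count before any compressor is applied; comparing `raw_len` with `packed_len` for very short inputs is a quick way to see whether a general-purpose compressor helps at that size.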

2 BASIC CONCEPTS OVERVIEW

2.1 Overview of Text Compression

There are mainly two streams of text compression techniques: statistical modeling and dictionary-based techniques. The dictionary-based technique is the most widely adopted approach for English text compression. In dictionary-based compression, a text-base is formed with regard to specific text properties (i.e., frequency of text or syllables, patterns of frequent text, etc.). A dictionary coder works by searching for a match between the text to be compressed and a string in the dictionary [6], [12]. Whenever a match is found, the text is substituted by a reference (i.e., a pointer or index) to the string in the predefined dictionary. Examples of dictionary-based compression schemes include the Lempel-Ziv algorithms (LZ, LZW). In statistical modeling, the statistical encoder encodes each component separately, taking its previous symbols (i.e., its context) into consideration. LZW simplifies the transmission of pointers into the phrasebook, since the phrasebook initially contains the one-character strings [14], resulting in faster compression than predictive coders. Context modeling is employed for text compression together with an entropy coder.

Generally speaking, a compression technique comprises two activities: construction of a model that takes symbol probabilities into account, and coding to form a minimal representation of each symbol or component. An efficient coder such as an arithmetic coder returns a binary stream for each set of symbol probabilities [12].

Compression techniques are generally characterized as static, semi-adaptive and adaptive schemes. In the semi-adaptive approach, an initial pass over the source data is used to collect statistics, and the coding is performed in the next pass. For this semi-adaptive approach, the sender first transmits the codebook and then sends the encoded message. Static modeling uses the same static statistics for symbol coding at both the sender and the receiver [15]. Text compression based on n-grams is an ideal example of semi-static modeling. In this modeling, the total text to be compressed is considered as a series of strings of some length [12].
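To make the dictionary-coding idea in this overview concrete, here is a minimal LZW sketch in which the phrasebook starts with the one-character strings and every emitted code is an index into it. This is a textbook-style illustration, not the scheme proposed in this paper; the function names and the convention that the alphabet is handed to the decoder are assumptions.

```python
def lzw_compress(text):
    """Return (alphabet, codes) for `text` using a growing phrasebook."""
    # Assumption: the phrasebook starts with the distinct characters of the
    # input, and the decoder is given the same alphabet.
    alphabet = sorted(set(text))
    book = {ch: i for i, ch in enumerate(alphabet)}
    codes, phrase = [], ""
    for ch in text:
        if phrase + ch in book:
            phrase += ch                      # keep extending the current match
        else:
            codes.append(book[phrase])        # emit index of the longest match
            book[phrase + ch] = len(book)     # learn the new phrase
            phrase = ch
    if phrase:
        codes.append(book[phrase])
    return alphabet, codes


def lzw_decompress(alphabet, codes):
    """Rebuild the text from the alphabet and the list of phrasebook indices."""
    if not codes:
        return ""
    book = {i: ch for i, ch in enumerate(alphabet)}
    prev = book[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in book:
            entry = book[code]
        else:
            # The code refers to the phrase that is being defined right now.
            entry = prev + prev[0]
        out.append(entry)
        book[len(book)] = prev + entry[0]
        prev = entry
    return "".join(out)


alphabet, codes = lzw_compress("abababab")
print(codes)
print(lzw_decompress(alphabet, codes))
```

For the eight-character input above the encoder emits five indices, which shows how references to previously seen strings replace the text itself.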
Adaptive modeling is an approach to data compression in which a few hundred bytes are used for gathering the statistics. The main classes of adaptive coders include Lempel-Ziv adaptive dictionary coders and predictive coders. Predictive coders are concerned with the probability of each symbol and the sequence of symbols preceding the current symbol (the context). The number of preceding symbols (i.e., the context length) also determines the order of the model.

2.2 Characteristics of Bengali Text

Bengali text compression differs from English text compression from two main points of view. Firstly, compression techniques involving pseudo-coding of uppercase (or lowercase) letters are not applicable to Bengali text. Secondly, in the case of Bengali we may employ a specific mechanism for coding dependent vowel signs to remove redundancy, which is absent in the case of English. Bengali has 91 distinct symbol units, including independent vowels, consonants, dependent vowel signs, two-part independent vowel signs, additional consonants, various signs, additional signs and Bengali numerals. A detailed list of Bengali symbols is available in [13]. Moreover, Bengali makes heavy use of conjuncts, which also offers scope for redundancy removal.

Though English has had a fixed encoding base for a long time, Bengali has still not adopted a unique encoding scheme in practical applications. Bengali Unicode has not yet come into widespread use. This is a great limitation for research in Bengali, and Bengali text compression suffers from the same problem.

3 LITERATURE REVIEW

Compression is an elementary concern of data engineering and management. Although large-scale English text compression has already passed its infancy, the total number of research activities in the specific field of Bengali text compression is still not notable.

The most recent study on text compression in [15] uses a context model with a one-at-a-time hashing based statistical model. Another fascinating study in [7] provides a concept of compressing small texts using syllables. The scheme provided in [11] makes use of a greedy sequential grammar transform for text compression.
A study of short text compression in [10] uses a text-ranking scheme with syllable-based dictionary matching. The recent studies of text compression in [6], [7] make use of syllables for the compression of middle-sized files. They adapted the well-known Adaptive Huffman and Lempel-Ziv-Welch coding to use syllables instead of characters. They provide an elaborate definition of a syllable, emphasizing that a syllable is a sequence of sounds that contains exactly one maximal subsequence of vowels. Words formed from non-letters are also considered syllables. In order to improve the compression of the alphabet of syllables or words, a database of frequent word-sets for each language concerned has also been provided. Initialization of the compression algorithms is performed using words from the database, and the words of the source text are coded from the defined database. The database of syllables formed from letters contains approximately three thousand syllables; the condition for adding a syllable to the database was that its frequency is greater than 1:65000. They do not employ a text ranking scheme for constructing the dictionary, and the number of initial entries could be reduced substantially by using proper substring weighting approaches.

The existing literature on Bengali text compression in [2], proposed by Hossain et al., presents a comparative analysis of Bengali text compression with WinZip. They employ static Huffman coding to achieve the compression. The test-bed was adapted from a collection of Bengali newspapers. This scheme achieves reduced transmission cost in terms of compression and decompression time. For bandwidth-aware applications, this scheme is not applicable because it requires submitting the codebook along with the source text, which puts a greater load on bandwidth.

Mukherjee et al. [4, 5] proposed a dictionary-based compression scheme titled star encoding. According to this scheme, words are replaced with a sequence of * symbols accompanied by a reference to an external dictionary. The dictionary is arranged according to the length of words and is known to both sender and receiver. The proper sub-dictionary is selected by the length of the sequence of * symbols. Length Index Preserving Transformation (LIPT) is a variation of star encoding by the same authors. This algorithm improves Prediction by Partial Matching (PPM), Burrows-Wheeler Coding Applications (BWCA) and Lempel-Ziv (LZ) based compression schemes. Another related work, known as StarNT, works with a ternary search tree and is faster than the previous ones. The first 312 words of its dictionary are the most frequently used words of the English language. The remaining part of the dictionary is filled with words sorted first by their length and then by their frequency. This scheme also does not use a substring weighting approach. Moreover, the substrings are selected only from the point of view of aggregated frequency.

Prediction by Partial Matching (PPM) is a major lossless data compression scheme in which each symbol is coded by taking account of the previous symbols. A context model is employed that gives statistical information about the symbol with its context. In order to signal the context to the decoder, specific symbols are used by the encoder. The model order in PPM is a vital parameter of compression performance. However, PPM is computationally more complex and its overhead is also greater.

Block Sorting is an innovative compression mechanism proposed by Burrows and Wheeler in 1994 [1]. Block Sorting works in three steps: permuting the input one block at a time through the Burrows-Wheeler Transform (BWT), applying a Move-to-Front Transform (MFT) to each of the permuted blocks, and then entropy coding the output with a Huffman or arithmetic coder. Run-length coding is also often introduced prior to any or all of the three stages [16].
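The first two Block Sorting steps described above can be sketched as follows. This is a deliberately naive illustration (it sorts all rotations of the block, which is quadratic in memory) and not an implementation taken from [1]; the end-of-block marker `"\0"` and the function names are assumptions, and the final entropy-coding stage is omitted.

```python
def bwt(block, eos="\0"):
    """Burrows-Wheeler Transform: last column of the sorted rotations."""
    s = block + eos                  # assumption: eos does not occur in block
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(last, eos="\0"):
    """Invert the transform by repeatedly prepending and sorting columns."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    row = next(r for r in table if r.endswith(eos))
    return row[:-1]


def move_to_front(text, alphabet):
    """Move-to-Front: emit each symbol's current position, then move it to 0."""
    symbols = list(alphabet)
    out = []
    for ch in text:
        idx = symbols.index(ch)
        out.append(idx)
        symbols.insert(0, symbols.pop(idx))
    return out


transformed = bwt("banana")
ranks = move_to_front(transformed, sorted(set(transformed)))
print(repr(transformed), ranks)
print(inverse_bwt(transformed))
```

The transform groups equal symbols together, so the Move-to-Front output contains many small values and zeros, which is what makes the subsequent Huffman or arithmetic coding stage effective.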
4 PROPOSED BENGALI TEXT COMPRESSION SCHEME

In this paper, we propose a new dictionary for Bengali text compression. To facilitate efficient searching of the text, we employ term weighting or string ranking in indexing the dictionary entries. The total compression scheme is divided into two stages:

Stage 1: Build the knowledgebase.
Stage 2: Apply the proposed text ranking approach to compress the source text.

The test-bed is formed from standard Bengali text collections from various sources. We consider a collection of texts of various categories and themes (such as news documents, papers, essays, poems and advertising documents) as the test-bed. The documents were chosen as representatives of the corresponding groups. The groups were selected to demonstrate the interrelation of compression performance for different themes and motives of text. The changed sentence structure and wording that came with the evolution of smart devices, especially mobile phones, is also a reason for choosing various groups of texts of varying size. It is necessary to mention that, though a few field-specific text collections are available, no sophisticated Bengali text compression evaluation test-bed is available yet. As data compression, and especially dictionary-based text compression, depends greatly on the structure, wording and context of texts, a collection of different types of texts is a must for evaluating the compression. In constructing the dictionary, we use a test-bed of 109 text files varying from 4 KB to 1800 KB.

Construction of the Knowledgebase

The text-base is employed in two steps. Firstly, we calculate the statistics of the text, taking the context and frequency of the text into consideration. Secondly, we insert into the dictionary the symbols or text selected using the collected statistical data and a threshold.

Let the test-bed contain a total of n documents. Assume that document d_i contains tc_i distinct characters and that the total number of distinct words in document d_i is tw_i. Further, character c_i occurs fc_i times and word w_i occurs fw_i times.
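The per-document statistics just defined can be gathered with a sketch like the one below. The function name, the dictionary keys (which mirror tc_i, tw_i, fc_i and fw_i), splitting words on whitespace and excluding whitespace from the character counts are all assumptions made for illustration; the paper does not specify these details.

```python
from collections import Counter


def document_statistics(text):
    """Collect character and word statistics for one document."""
    # Assumption: whitespace separates words and is not counted as a symbol.
    chars = [ch for ch in text if not ch.isspace()]
    words = text.split()
    char_freq = Counter(chars)   # fc: occurrences of each character
    word_freq = Counter(words)   # fw: occurrences of each word
    return {
        "tc": len(char_freq),    # number of distinct characters
        "tw": len(word_freq),    # number of distinct words
        "fc": char_freq,
        "fw": word_freq,
    }


stats = document_statistics("aba ab aba")
print(stats["tc"], stats["tw"])
print(stats["fc"].most_common())
print(stats["fw"].most_common())
```

Running the same function over every document of a test-bed gives the raw counts from which a ranking of symbols and substrings can then be derived.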
At first, we rank the individual symbols in terms of their occurrence. Here, the rank of a character is its frequency. We consider the total statistics of the individual symbols as the level-one index. Then we proceed to substrings with length greater than two. Each substring is considered from both directions: first we consider the string for ranking from the forward direction and then from the backward direction. A word longer than seven symbols is partitioned into several substrings whose lengths are multiples of seven. Starting from the initial symbol, we proceed to rank each symbol positioned at $p$ in the word concerned, of length $u$, where $1 < p \le u \le 7$. This ranking is expressed as the summation of the ranks of the previous symbols. We take the previous ranks into account because, if we simply considered the frequency of the substring, then for discrete and sophisticated documents the motive and sense of the document, i.e., its repeated terms, would strongly influence the ranking and make the overall ranking and indexing scheme fluctuate. Considering the context of the symbol instead, that is, embodying the ranks of the preceding symbols in the current symbol, provides cumulative statistics and context for the current substring and hence makes the dictionary an unbiased collection. In this way, we rank the dictionary completely. It is noteworthy that we take seven as the threshold, based on the simple hypothesis that the average length of a Bengali word may be seven characters.

The next step involves selecting 256 entries from the total selection. If there are a total of $e$ entries in the resulting statistics, where $e > 256$, and level $t$ ($t = 1, 2, 3, \ldots, 7$) holds a total of $n_t$ entries, then the entries are sorted within each level. For level $t$, $(256 / e) \times n_t$ entries are selected in ascending order, and these entries are stored in the temporary database together with their corresponding rank (as a percentage of the total words for levels above 1, and of the total symbols for level 1) and their level id. In the same way, all the documents are ranked and stored in the temporary database. The final step of constructing the dictionary involves selecting 256 entries from the combined database by relational operations, using the same criteria as for choosing 256 entries from each temporary database.

The next step of the proposed scheme, i.e., application of the proposed text ranking approach for compressing the source text, consists primarily of two stages. In the first stage, the source text is successively compared with the entries of the knowledgebase, starting from the maximum level. For a successful match, the substrings are marked with the corresponding words taken from the dictionary. If there is no successful match at a specific level, the substrings are passed to the level below it. By repeating this step, the entire file is converted, since level one is composed of single characters and symbols.

Fig. 1 Construction of the knowledgebase (flowchart; box labels in extraction order: select corpus, select file, remove texts other than Bengali, check string threshold, forward check, calculate the indexing, sort the statistics, define the threshold, organize into levels, backward check, insert into knowledge-base)

5 PERFORMANCE ANALYSIS OF PROPOSED BENGALI TEXT COMPRESSION SCHEME

Though it is generally assumed that compression and decompression time should be inter-related, the proposed scheme is a slight exception. The reasons may be summarized through the following discussion.

5.1 Performance Analysis of Compression Process with respect to time

Let the total number of training entries for the statistical model be $N$, where $N$ is a non-negative integer, and let the maximum level for statistical modeling be $L$. The first level of the statistical model must contain the single characters, where the total number of distinct characters is $l_1$.
For levels $1, 2, 3, \ldots, n$, the total numbers of distinct multi-gram entries are $l_1, l_2, l_3, \ldots, l_n$ respectively. When a text is to be compressed, it is compared hierarchically with each level of the statistical model, starting from the highest order. If there is a match, the corresponding static code for the multi-gram entry is assigned to the text. If the multi-gram entry is not found anywhere in the level, it is forwarded to the next level. This assignment uses efficient searching procedures. Let the code $m$ be found at the $i$-th level with offset $k$, resulting in a search cost of

$$\sum_{j=L}^{i-1} S_m(l_j) + k_m, \qquad k_m < l_i, \quad j = L, L-1, \ldots, i-1,$$

with respect to the search space. Here $j$ runs from $L$ to $(i-1)$ rather than to $i$, in decreasing order, because the code is found somewhere within the $i$-th level: the whole element space of the $i$-th level need not be searched, only an offset $k$ into it. The total search therefore consists of the search overhead for $(i-1)$ levels plus the additional search overhead of $k$ elements. Here the term search overhead means the complexity of searching as well as other related computational requirements.

When the code matches, it is placed in the output stream as a character representation. This step requires padding the bit-stream and then converting it into a character stream. Assume that the overall conversion for each successful entry occurs with overhead $B$. That is, for one multi-gram match, the required overhead is

$$C_1 = \sum_{j=L}^{i-1} S_1(l_j) + k_1 + B_1 .$$

Similarly,

$$C_2 = \sum_{j=L}^{i-1} S_2(l_j) + k_2 + B_2 ,$$

and

$$C_n = \sum_{j=L}^{i-1} S_n(l_j) + k_n + B_n .$$

In this way, if a total of $n$ multi-grams are identified and then encoded, the resulting number of operations in the compression process is

$$T = \sum_{m=1}^{n} C_m = \sum_{m=1}^{n} \left( \sum_{j=L}^{i-1} S_m(l_j) + k_m + B_m \right).$$

5.2 Performance Analysis of Decompression Process with respect to time

For the decompression process, the text to be decompressed is converted into a binary stream. If the largest code is of length $c_{\max}$ and the smallest code is of length $c_{\min}$, then the decompression process starts searching with $c_{\max}$ bits and searches through the codes down to $c_{\min}$ bits, reducing the length by one bit per step after each unsuccessful match. It is necessary to mention that the codes with the same bit length do not necessarily comprise a specific level. So, to reveal the character representation for each entry $d$, a switch of levels may be required. For $c_{\min} \le c_{\max}$, where the maximum level is $p$, the matching offset for the corresponding level is $k_d$, and the assignment of the character representation to the code for each successful match requires an overhead of $B_d$, the overall requirement for comparing through each level is

(overhead of searching through levels) + (overhead of searching through the offset) + (overhead of representation).

For detecting the first character, the overhead is

$$E_1 = \sum_{q=p}^{p_{\min}} S_1(l_q) + k_1 + B_1 .$$

Similarly, for detecting the second character, the level-wise overhead will be

$$E_2 = \sum_{q=p}^{p_{\min}} S_2(l_q) + k_2 + B_2 ,$$

and, in general,

$$E_f = \sum_{q=p}^{p_{\min}} S_f(l_q) + k_f + B_f .$$

Here, $p$ is the maximum level, $S$ is a function that denotes the overhead for searching in the element space given as its parameter, and $p_{\min}$ is the minimum level. The computation progresses through $p, p-1, p-2, \ldots, p_{\min}+2, p_{\min}+1, p_{\min}$. The subscript $f$ denotes the level-wise overhead for detecting one character representation with respect to level. In order to detect a single multi-gram $f$, the search space for the level-wise calculation runs from $p$ down to $p_{\min}$, because we start with the maximum level $p$ and then proceed downwards through the levels (as explained above). If $S$ is the search-overhead function, then searching from level $p$ to $p_{\min}$ results in

$$\sum_{q=p}^{p_{\min}} S_f(l_q),$$

where $f$ is the multi-gram being revealed. For the matching level, as only part of the elements have to be searched, the offset $k$ is used to denote the offset for language L.

After checking through the levels, the procedure continues by searching through the bit-wise statistics after an unsuccessful match in the level-wise statistics. If there are a total of $u$ bit phases, we search through the search space from the maximum bit phase to the minimum bit phase in decreasing order. After an unsuccessful match in a bit phase, a bit switch is performed and the level-wise calculation for that level is carried forward. That is, an overhead of $\sum_{b=d}^{d_{\min}} E_b$ is incurred for each level-wise analysis. Consequently, the overhead of a unit step will be

$$C_1 = \sum_{b=d}^{d_{\min}} E_b .$$

Here, $d$ and $d_{\min}$ are the maximum and minimum bit phases respectively, and $d \ge d_{\min}$. Substituting the value of $E_b$, we get

$$C_1 = \sum_{b=d}^{d_{\min}} \sum_{q=p}^{p_{\min}} S_{1,b}(l_q) + \sum_{b=d}^{d_{\min}} k_{1,b} + \sum_{b=d}^{d_{\min}} B_{1,b} .$$

Similarly, we get

$$C_2 = \sum_{b=d}^{d_{\min}} \sum_{q=p}^{p_{\min}} S_{2,b}(l_q) + \sum_{b=d}^{d_{\min}} k_{2,b} + \sum_{b=d}^{d_{\min}} B_{2,b} ,$$

and

$$C_n = \sum_{b=d}^{d_{\min}} \sum_{q=p}^{p_{\min}} S_{n,b}(l_q) + \sum_{b=d}^{d_{\min}} k_{n,b} + \sum_{b=d}^{d_{\min}} B_{n,b} .$$

Here we use the subscript 1 with $k$ and $B$ to indicate that the calculations are for detecting a unit code only, where the calculation is performed from $d$ to $d_{\min}$ in decreasing order, that is, in the order $d, (d-1), (d-2), \ldots, (d_{\min}+2), (d_{\min}+1), d_{\min}$. If we are to reveal $n$ codes, then the total overhead becomes

$$T = \sum_{m=1}^{n} C_m .$$

As each bit-wise overhead calculation must include the level-wise calculations, the total can be written out as

$$T = \sum_{m=1}^{n} \sum_{b=d}^{d_{\min}} \sum_{q=p}^{p_{\min}} S_{m,b}(l_q) + \sum_{m=1}^{n} \sum_{b=d}^{d_{\min}} k_{m,b} + \sum_{m=1}^{n} \sum_{b=d}^{d_{\min}} B_{m,b} .$$

6 EXPERIMENTAL RESULTS AND DISCUSSIONS

The proposed compression is basically a multi-stage compression technique. In this scheme, we use a new knowledgebase for Bengali text compression. This knowledgebase is formed by analyzing the test-bed discussed in Section 4 and is treated as a static base. In the initial step, the Bengali source text is passed into a Unicode converter. Bengali text in a traditionally used encoding is converted into Unicode, whereas Unicode source text is simply passed on to the next step. The second step involves matching against the proposed dictionary (i.e., the knowledgebase). We employ a text-base of only 256 entries. As the proposed dictionary contains a small number of entries, the search space is greatly reduced, resulting in faster retrieval of index data from the dictionary for the queried string or substring. Consequently, we get a stream of integers representing the source text, which may already be reduced at this step. After re-converting the stream of integers into the corresponding character (and symbol) representation, it is passed as input to an arithmetic coder.

The compression ratio is a metric that describes how many compressed units are required to describe one unit of data; a lower value indicates better compression. A general observation is that higher modes lead to better compression ratios, even though the difference between higher orders becomes smaller. We have also compared the performance of our proposed scheme with existing domain-independent text compaction schemes, namely A Modification of Greedy Sequential Grammar Transform based Universal Lossless Data Compression (mgsgt) by R. Islam et al. [8] and Word-Based Block-Sorting Text Compression (WBBSTC) by Isal and Moffat [16]. The performance analysis was carried out on a quad-core 2.0 GHz personal computer with 1.0 GB RAM and threading support. The object-oriented programming language Java was used to simulate the total scheme.

TABLE 1: COMPARISON OF COMPRESSION RATIO

File Name      Proposed Scheme   mgsgt   WBBSTC
Article        3.748             3.89    3.96
Poem           4.014             4.48    4.51
Advertise      3.928             4.36    4.34
Speech         3.624             3.81    3.88
News           3.818             4.11    3.87
SMS            3.416             4.76    4.59
Email          3.718             3.87    4.01
Particulars    3.941             4.21    4.04
Story          3.371             3.59    3.77
Report         3.749             3.76    3.79

Our proposed scheme for knowledgebase formation is also applicable to single-language text compression for languages other than Bengali. Since the proposed scheme builds the knowledgebase in a hierarchical manner with provision to define the levels of the hierarchy, re-defining the span of levels may optimize the compression effectiveness. Since the proposed scheme demonstrates a unified way [...]

Though the proposed scheme demonstrates better performance for compression of Bengali text, it is not efficient for compression of multilingual text.
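The dictionary matching step described in this section, where the source text is compared against the knowledgebase from the highest level downwards and replaced by a stream of integers, can be sketched as below. This is a minimal illustration under stated assumptions: the toy knowledgebase uses Latin letters, level t is assumed to hold substrings of exactly t symbols, and the names `build_index`, `encode` and `decode` are hypothetical. The ranking-based selection of entries and the final arithmetic coding stage are not shown.

```python
def build_index(levels):
    """Number the knowledgebase entries: level 1 first, then level 2, ..."""
    index = {}
    for level in sorted(levels):
        for entry in levels[level]:
            index[entry] = len(index)
    return index


def encode(text, levels):
    """Greedy match from the highest level down to single symbols."""
    index = build_index(levels)
    max_level = max(levels)
    out, pos = [], 0
    while pos < len(text):
        # Try the longest candidate first, then fall back level by level.
        for level in range(min(max_level, len(text) - pos), 0, -1):
            piece = text[pos:pos + level]
            if piece in index:
                out.append(index[piece])
                pos += level
                break
        else:
            raise KeyError("symbol missing from level 1: " + repr(text[pos]))
    return out


def decode(codes, levels):
    """Map every integer back to its knowledgebase entry."""
    entries = list(build_index(levels))   # dicts keep insertion order
    return "".join(entries[c] for c in codes)


# Hypothetical toy knowledgebase; a real one would hold 256 Bengali entries.
toy_levels = {1: ["a", "b", "n"], 2: ["an"], 3: ["ban"]}
codes = encode("banana", toy_levels)
print(codes)
print(decode(codes, toy_levels))
```

With at most 256 entries every integer in the stream fits in a single byte, which is consistent with the small, fixed search space argued for above; because level 1 contains every single symbol, the fallback always terminates.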
Future work may be dedicated to attaining a multilingual text compression scheme that includes Bengali by applying the core concept of the proposed scheme. Let the source text comprise languages l1 and l2 with total character sets of a1 and a2. In order to code a source text containing components from both languages, the knowledgebase must comprise all the unit components of l1 and l2 so that every unit component can be coded. This requires a minimum knowledgebase space of a1+a2. Again, the levels from level 2 upward for each language will also require certain space for formation of the knowledge-base. If the sum of these spaces exceeds the threshold value for optimal coding using the static-coding scheme, the actual performance may fall below the expected performance.

Fig. 2 Compression Ratio of the proposed scheme

The proposed scheme achieves a better compression ratio for Bengali text compression by means of an efficient dictionary-mapper. The proposed dictionary is completely different from conventional dictionaries as it employs a modified scheme for selecting dictionary entries with enhanced ranking criteria. Such adaptation is a first for Bengali text. The constraint of keeping the dictionary span fixed ensures an optimal search space. Moreover, our proposed scheme is designed for small to medium sized text files, which is the span most needed for widespread use.

7 CONCLUSIONS AND RECOMMENDATIONS

The proposed scheme is one of the initiating steps of the Enhanced Text Representation Scheme for Bengali Text. In this step, a novel approach to constructing a data compression dictionary has been proposed, which is also an innovative approach to Bengali text compression. The proposed approach yields encouraging outcomes in terms of compression time, compression ratio and overall overhead requirements. As the proposed scheme is adapted for both conventional encoding and the Unicode standard, it may be employed very easily for any Bengali text compression. Being low in memory consumption, the proposed approach may also be adapted for text compaction in small memory devices.

Communication of Bengali Small Text Messages may also be immensely facilitated by the presented approach to Bengali text compression. The proposed scheme is also, to some extent, an initialization of Bengali text compression approaches with a few pathfinders.

REFERENCES

[1] M. Burrows and D. J. Wheeler. A block sorting lossless data compression algorithm. Technical report, Digital Equipment Corporation, Palo Alto, CA, 1994.
[2] Md. Sazzad Hossain and R. C. Debnath, A Comparative Study of Bangla Text Compression with Winzip, Information Technology Journal, 3 (1): 93-94, 2004.
[3] A. Moffat, R. M. Neal and I. H. Witten. Arithmetic coding revisited. ACM Transactions on Information Systems, 16:256-294, 1998.
[4] F. Awan and A. Mukherjee, "LIPT: A Lossless Text Transform to improve compression", Proceedings of International Conference on Information and Theory: Coding and Computing, IEEE Computer Society, Las Vegas Nevada, 2001.
[5] H. Kruse and A. Mukherjee, Preprocessing Text to Improve Compression Ratios, Proceedings of Data Compression Conference, IEEE Computer Society, Snowbird Utah, 1998, pp. 556.
[6] Lansky, J., Zemlicka, M., Compression of a Dictionary. Proceedings of the DATESO 2006 Annual International Workshop on DAtabases, TExts, Specifications and Objects. CEUR-WS, Vol. 176, (2006) 11-20.
[7] Lansky, J., Zemlicka, M.: Compression of Small Text Files Using Syllables. IEEE Data Compression Conference 2006, IEEE CS Press, Los Alamitos, CA, USA (2006) 458.
[8] Md. Rafiqul Islam, Sajib Kumar Saha, Mrinal Kanti Baowaly. A modification of Greedy Sequential Grammar Transform based Universal Lossless data Compression. Published in Proceedings of 9th International Conference on Computer and Information Technology (ICCIT 2006), 28-30 December, 2006, Dhaka, Bangladesh.
[9] S. Rein and C. Guhmann, Arithmetic coding - a short tutorial, Wavelet Application Group, Technical Report, April 2005.
[10] Md. Rafiqul Islam, S. A. Ahsan Rajon and Anonda Podder, Small Text Compression for Smart Devices, In the proceedings of International Conference on Computer and Communication Technology, ICCIT 2008, Khulna, Bangladesh.
[11] E. H. Yang and J. C. Kieffer. Efficient universal lossless data compression algorithms based on a greedy sequential grammar transform. Part one: Without context models. IEEE Transactions on Information Theory, 46(3): 755-777, 2000.
[12] J. Abel, Universal text preprocessing for data compression, IEEE 2005.
[13] Unicode List of Bengali: Available at Official website of Unicode Standard 5.0, Unicode Inc. http://www.unicode.org. Retrieved on November 02, 2009.
[14] Phil Vines and Justin Zobel: Compression Techniques for Chinese text, Department of Computer Science, RMIT, Melbourne, Australia.
[15] Stephan Rein, Clemens Guhmann, Frank H. P. Fitzek: Compression of Short Text on Embedded Systems, Journal of Computers: Volume 1, No: 06, September 2006.
[16] R. Yugo Kartono Isal and Alistair Moffat, Word-Based Block-Sorting Text Compression, Proc. of 24th Australian Computer Science Conference, Gold Coast, Australia, pp. 92-99.
[17] Md. Rafiqul Islam and S. A. Ahsan Rajon, An Enhanced Short Text Compression Scheme for smart devices, Journal of Computers, Vol. 5, No. 1, January 2010, pp. 49-58.

Prof. Dr. Md. Rafiqul Islam obtained a Master of Science (M. S.) in Engineering (Computers) from Azerbaijan Polytechnic Institute (Azerbaijan Technical University at present) in 1987 and a Ph.D. in Computer Science from Universiti Teknologi Malaysia (UTM) in 1999. His research areas include design and analysis of algorithms and Information Security. Dr. Islam has around 75 papers related to these areas published in national and international journals as well as in refereed conference proceedings. He is currently working as a professor of Computer Science and Engineering Discipline, Khulna University, Bangladesh.

S. A. Ahsan Rajon is currently working as a Senior Lecturer of Department of Computer Science, KPCbd, Khulna-9100, Bangladesh. Er. Rajon is also an Adjunct Faculty of Computer Science and Engineering Discipline, Khulna University, Bangladesh. After completion of B.Sc. Engineering from CSE discipline, Science, Engineering and Technology School, Khulna University, Bangladesh in April 2008, he was appointed in his native discipline. Rajon has made three journal and eight conference publications in International conferences and Journals.
He is also a reviewer of Journal of Engineering and Technology Research and International Journal of Information Systems. His research interests include data engineering and management, information systems and ubiquitous computing. Currently he is working on robotics. He is a member of Institute of Engineers, Bangladesh (IEB). For more information about Rajon, please visit: http://sites.google.com/site/ahsanrajon