lucene-java-3.5.0/lucene/contrib/analyzers/stempel/src/java/overview.html

   1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
   2 <html>
   3 <head>
   4   <meta content="text/html; charset=UTF-8" http-equiv="content-type">
   5   <title>Stempel - Algorithmic Stemmer for Polish Language</title>
   6   <meta content="Andrzej Bialecki" name="author">
   7   <meta name="keywords"
   8  content="stemming, stemmer, algorithmic stemmer, Polish stemmer">
   9   <meta
  10  content="This page describes a software package consisting of high-quality stemming tables for Polish, and a universal algorithmic stemmer, which operates using these tables."
  11  name="description">
  12 </head>
  13 <body style="font-family: Arial,SansSerif;">
  14 <h1><i>Stempel</i> - Algorithmic Stemmer for Polish Language</h1>
  15 <h2>Introduction</h2>
  16 <p>A method for conflation of different inflected word forms is an
  17 important component of many Information Retrieval systems. It helps to
  18 improve the system's recall and can significantly reduce the index
  19 size. This is especially true for highly-inflectional languages like
  20 those from the Slavic language family (Czech, Slovak, Polish, Russian,
  21 Bulgarian, etc).</p>
  22 <p>This page describes a software package consisting of high-quality
  23 stemming tables for Polish, and a universal algorithmic stemmer, which
  24 operates using these tables. The stemmer code is taken virtually
  25 unchanged from the <a href="http://www.egothor.org">Egothor project</a>.</p>
  26 <p>The software distribution includes stemmer
  27 tables prepared using an extensive corpus of the Polish language (see
  28 details below).</p>
  29 <p>This work is available under Apache-style Open Source licenses: the
  30 stemmer code is covered by the Egothor License, the tables and other
  31 additions are covered by the Apache License 2.0. Both licenses permit use of
  32 the code in Open Source as well as commercial (closed source) projects.</p>
  33 <h3>Terminology</h3>
  34 <p>A short explanation is in order about the terminology used in this
  35 text.</p>
  36 <p>In the following sections I make a distinction between <b>stem</b>
  37 and <b>lemma</b>.</p>
  38 <p>A lemma is the base grammatical form (dictionary form, headword) of a
  39 word. A lemma is an existing, grammatically correct word in some human
  40 language.</p>
  41 <p>A stem, on the other hand, is just a unique token, not necessarily
  42 making any sense in any human language, but which can serve as a unique
  43 label instead of the lemma for the same set of inflected forms. Quite often
  44 a stem is referred to as the "root" of the word - which is incorrect and
  45 misleading (stems sometimes have very little to do with the linguistic
  46 root of a word, i.e. a pattern found in a word which is common to all
  47 inflected forms or within a family of languages).</p>
  48 <p>For an IR system stems are usually sufficient, whereas for a morphological
  49 analysis system lemmas are a must. In practice, various
  50 stemmers produce a mix of stems and lemmas, as is the case with the
  51 stemmer described here. Additionally, for some languages that use
  52 suffix-based inflection rules, many stemmers based on suffix-stripping
  53 will produce a large percentage of stems equivalent to lemmas. This is
  54 however not the case for languages with complex, irregular inflection
  55 rules (such as Slavic languages) - here simplistic suffix-stripping
  56 stemmers produce very poor results.</p>
  57 <h3>Background</h3>
  58 <p>Lemmatization is a process of finding the base, non-inflected form
  59 of a word. The result of lemmatization is a correct existing word,
  60 often in nominative case for nouns and infinitive form for verbs. A
  61 given inflected form may correspond to several lemmas (e.g. "found"
  62 -&gt; find, found) - the correct choice depends on the context.<br>
  63 <br>
  64 Stemming is concerned mostly with finding a unique "root" of a word,
  65 which does not necessarily result in any existing word or lemma. The
  66 quality of stemming is measured by the rate of collisions (overstemming
  67 - which causes words with different lemmas to be incorrectly conflated
  68 into one "root"), and the rate of superfluous word "roots"
  69 (understemming - which assigns several "roots" to words with the same
  70 lemma). <br>
  71 <br>
  72 Both stemmer and lemmatizer can be implemented in various ways. The two
  73 most common approaches are:<br>
  74 </p>
  75 <ul>
  76   <li>dictionary-based: where the stemmer uses an extensive dictionary
  77 of morphological forms in order to find the corresponding stem or lemma</li>
  78   <li>algorithmic: where the stemmer uses an algorithm, based on
  79 general morphological properties of a given language plus a set of
  80 heuristic rules<br>
  81   </li>
  82 </ul>
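<p>The two quality measures described above (overstemming and
understemming) can be made concrete with a small sketch. The function name
<code>conflation_errors</code>, the toy word pairs and the exact way the
rates are normalized are assumptions made for illustration only; they are
not taken from the Stempel or Egothor code.</p>

```python
from collections import defaultdict


def conflation_errors(pairs):
    """Return (overstemming rate, understemming rate).

    pairs: iterable of (stem, lemma) tuples, one per inflected form.
    Assumes at least one pair is supplied.
    """
    lemmas_per_stem = defaultdict(set)
    stems_per_lemma = defaultdict(set)
    for stem, lemma in pairs:
        lemmas_per_stem[stem].add(lemma)
        stems_per_lemma[lemma].add(stem)
    # Overstemming: one stem is shared by words with different lemmas.
    overstemmed = sum(1 for lemmas in lemmas_per_stem.values() if len(lemmas) > 1)
    # Understemming: words with the same lemma received several stems.
    understemmed = sum(1 for stems in stems_per_lemma.values() if len(stems) > 1)
    return overstemmed / len(lemmas_per_stem), understemmed / len(stems_per_lemma)


# Toy data (hypothetical): each tuple is (stem produced, true lemma).
sample = [
    ("kot", "kot"), ("kot", "kot"),      # consistent
    ("mor", "morze"), ("mor", "mora"),   # collision: overstemming
    ("pies", "pies"), ("ps", "pies"),    # split: understemming
]
over, under = conflation_errors(sample)
print(over, under)
```

<p>Here each rate is the fraction of stems (or lemmas) affected; other
normalizations, for example per inflected form, are equally plausible.</p>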
  83 There are many existing and well-known implementations of stemmers for
  84 English (Porter, Lovins, Krovetz) and other European languages
  85 (<a href="http://snowball.tartarus.org">Snowball</a>). There are also
  86 good quality commercial lemmatizers for Polish. However, there is only
  87 one
  88 freely available Polish stemmer, implemented by
  89 <a
  90  href="http://www.cs.put.poznan.pl/dweiss/xml/projects/lametyzator/index.xml?lang=en">Dawid
  91 Weiss</a>, based on the "ispell" dictionary and Jan Daciuk's <a
  92  href="http://www.eti.pg.gda.pl/%7Ejandac/">FSA package</a>. That
  93 stemmer is dictionary-based. This means that even
  94 though it can achieve
  95 perfect accuracy for previously known word forms found in its
  96 dictionary, it
  97 fails completely for all other word forms. This deficiency is
  98 somewhat mitigated by the comprehensive dictionary distributed with
  99 this stemmer (so there is a high probability that most of the words in
 100 the input text will be found in the dictionary), but the problem
 101 still remains (please see the page above for a more detailed description).<br>
 102 <br>
 103 The implementation described here uses an algorithmic method. This
 104 method
 105 and particular algorithm implementation are described in detail in
 106 [1][2].
 107 The main advantage of algorithmic stemmers is their ability to process
 108 previously
 109 unseen word forms with high accuracy. This particular algorithm uses a
 110 set
 111 of
 112 transformation rules (patch commands), which describe how a word with a
 113 given pattern should be transformed to its stem. These rules are first
 114 learned from a training corpus. They don't
 115 cover
 116 all possible cases, so there is always some loss of precision/recall
 117 (which
 118 means that even the words from the training corpus are sometimes
 119 incorrectly stemmed).<br>
 120 <h2>Algorithm and implementation</h2>
 121 The algorithm and its Java implementation are described in detail in the
 122 publications cited below. Here's just a short excerpt from [2]:<br>
 123 <br>
 124 <center>
 125 <div style="width: 80%;" align="justify">"The aim is separation of the
 126 stemmer execution code from the data
 127 structures [...]. In other words, a static algorithm configurable by
 128 data must be developed. The word transformations that happen in the
 129 stemmer must be then encoded to the data tables.<br>
 130 <br>
 131 The tacit input of our method is a sample set (a so-called dictionary)
 132 of words (as keys) and their stems. Each record can be equivalently
 133 stored as a key and the record of key's transformation to its
 134 respective stem. The transformation record is termed a patch command
 135 (P-command). It must be ensured that P-commands are universal, and that
 136 P-commands can transform any word to its stem. Our solution[6,8] is
 137 based on the Levenstein metric [10], which produces P-command as the
 138 minimum cost path in a directed graph.<br>
 139 <br>
 140 One can imagine the P-command as an algorithm for an operator (editor)
 141 that rewrites a string to another string. The operator can use these
 142 instructions (PP-command's): <span style="font-weight: bold;">removal </span>-
 143 deletes a sequence of characters starting at the current cursor
 144 position and moves the cursor to the next character. The length of this
 145 sequence is the parameter; <span style="font-weight: bold;">insertion </span>-
 146 inserts a character ch, without moving the cursor. The character ch is
 147 a parameter; <span style="font-weight: bold;">substitution&nbsp;</span>
 148 - rewrites a character at the current cursor position to the character
 149 ch and moves the cursor to the next character. The character ch is a
 150 parameter; <span style="font-weight: bold;">no operation</span> (NOOP)
 151 - skip a sequence of characters starting at the current cursor
 152 position. The length of this sequence is the parameter.<br>
 153 <br>
 154 The P-commands are applied from the end of a word (right to left). This
 155 assumption can reduce the set of P-command's, because the last NOOP,
 156 moving the cursor to the end of a string without any changes, need not
 157 be stored."</div>
 158 </center>
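<p>The four instructions from the excerpt can be illustrated with a minimal
sketch. The tuple encoding of the commands, the function name
<code>apply_patch</code> and the example words are assumptions made for
illustration; the actual Egothor tables use their own compact encoding, which
is not reproduced here. The cursor starts on the last character and "next"
means one position to the left, following the right-to-left rule quoted
above.</p>

```python
def apply_patch(word, commands):
    """Apply a list of (operation, parameter) pairs to word, right to left."""
    chars = list(word)
    pos = len(chars) - 1  # cursor starts on the last character
    for op, arg in commands:
        if op == "remove":
            # Delete `arg` characters ending at the cursor, then step left.
            start = pos - arg + 1
            if start < 0:
                raise ValueError("removal runs past the start of the word")
            del chars[start:pos + 1]
            pos = start - 1
        elif op == "insert":
            # Insert to the right of the cursor; the cursor does not move,
            # so consecutive inserts end up in reverse order.
            chars.insert(pos + 1, arg)
        elif op == "substitute":
            if pos < 0:
                raise ValueError("no character under the cursor")
            chars[pos] = arg
            pos -= 1
        elif op == "noop":
            # Skip `arg` characters without changing them.
            pos -= arg
            if pos < -1:
                raise ValueError("skip runs past the start of the word")
        else:
            raise ValueError("unknown operation: " + op)
    return "".join(chars)


print(apply_patch("kotami", [("remove", 3)]))
print(apply_patch("psem", [("substitute", "s"), ("noop", 1), ("substitute", "i")]))
print(apply_patch("psa", [("remove", 1), ("noop", 1), ("insert", "e"), ("insert", "i")]))
```

<p>Note that none of the example command lists ends with a NOOP: as the
excerpt explains, a final skip over the unchanged beginning of the word does
not need to be stored.</p>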
 159 <br>
 160 The data structure used to keep the dictionary (words and their P-commands)
 161 is a trie. Several optimization steps are applied in turn to reduce and
 162 optimize the initial trie, by eliminating useless information and
 163 shortening the paths in the trie.<br>
 164 <br>
 165 Finally, in order to obtain a stem from the input word, the word is
 166 passed once through a matching path in the trie (applying at each node
 167 the P-commands stored there). The result is a word stem.<br>
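<p>The lookup step can be sketched as follows. This is a simplified
illustration, not the Egothor trie: the class and function names are
hypothetical, the trie is keyed on characters read from the end of the word,
and a patch is reduced to a pair (number of trailing characters to strip,
suffix to append) standing in for a full P-command. The keys are inserted as
bare suffixes to mimic a trie whose paths have already been shortened.</p>

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.patch = None  # (chars_to_strip, suffix_to_append) or None


def insert(root, key, patch):
    """Store a patch under key, walking the key from its last character."""
    node = root
    for ch in reversed(key):
        node = node.children.setdefault(ch, TrieNode())
    node.patch = patch


def last_patch_on_path(root, word):
    """Follow the word from its end and return the deepest patch found."""
    node, found = root, None
    for ch in reversed(word):
        node = node.children.get(ch)
        if node is None:
            break
        if node.patch is not None:
            found = node.patch
    return found


def stem(root, word):
    patch = last_patch_on_path(root, word)
    if patch is None:
        return None  # corresponds to the "missing" case in the tests
    strip, append = patch
    return word[:len(word) - strip] + append


root = TrieNode()
insert(root, "ami", (3, ""))  # hypothetical rule: drop the ending "ami"
insert(root, "em", (2, ""))   # hypothetical rule: drop the ending "em"
print(stem(root, "kotami"), stem(root, "domem"), stem(root, "xyz"))
```

<p>Because the walk stops as soon as no child matches, a word that was never
seen in training can still be stemmed when its ending matches a stored
path.</p>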
 168 <h2>Corpus</h2>
 169 <p><i>(to be completed...)</i></p>
 170 <p>The following Polish corpora have been used:</p>
 171 <ul>
 172   <li><a
 173  href="http://sourceforge.net/project/showfiles.php?group_id=49316&amp;package_id=65354">Polish
 174 dictionary
 175 from ispell distribution</a></li>
 176   <li><a href="http://www.mimuw.edu.pl/polszczyzna/">Wzbogacony korpus
 177 słownika frekwencyjnego</a></li>
 178 <!--<li><a href="http://www.korpus.pl">Korpus IPI PAN</a></li>-->
 179 <!--<li>The Bible (so called "Warsaw Bible" or "Brytyjka")</li>--><li>The
 180 Bible (so called "Tysiąclecia") - unauthorized electronic version</li>
 181   <li><a
 182  href="http://www.mimuw.edu.pl/polszczyzna/Debian/sam34_3.4a.02-1_i386.deb">Analizator
 183 morfologiczny SAM v. 3.4</a> - this was used to recover lemmas
 184 missing from other texts</li>
 185 </ul>
 186 <p>This step was the most time-consuming - and it would probably be
 187 even more tedious and difficult if not for the
 188 help of
 189 <a href="http://www.python.org/">Python</a>. The source texts had to be
 190 brought to a common encoding (UTF-8) - some of them used quite ancient
 191 encodings like Mazovia or DHN - and then scripts were written to
 192 collect all lemmas and
 193 inflected forms from the source texts. In cases when the source text
 194 was not
 195 tagged,
 196 I used the SAM analyzer to produce lemmas. In cases of ambiguous
 197 lemmatization I decided to put references to inflected forms from all
 198 base forms.<br>
 199 </p>
 200 <p>All grammatical categories were allowed to appear in the corpus,
 201 i.e. nouns, verbs, adjectives, numerals, and pronouns. The resulting
 202 corpus consisted of roughly 87,000+ inflection sets, i.e. each set
 203 consisted of one base form (lemma) and many inflected forms. However,
 204 because of the nature of the training method I restricted these sets to
 205 include only those where there were at least 4 inflected forms. Sets
 206 with 3 or fewer inflected forms were removed, so that the final corpus
 207 consisted of ~69,000 unique sets, which in turn contained ~1.5 million
 208 inflected forms. <br>
 209 </p>
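<p>The size restriction described above amounts to a simple filter. The
sketch below assumes the inflection sets are held as a mapping from lemma to
a set of inflected forms; that representation, the function name
<code>filter_sets</code> and the sample words are illustrative assumptions,
not the scripts actually used to build the corpus.</p>

```python
MIN_FORMS = 4  # threshold stated in the text: at least 4 inflected forms


def filter_sets(inflection_sets, min_forms=MIN_FORMS):
    """Keep only the lemmas that have at least min_forms inflected forms."""
    return {
        lemma: forms
        for lemma, forms in inflection_sets.items()
        if len(forms) >= min_forms
    }


# Toy input (hypothetical): lemma -> set of inflected forms.
raw = {
    "kot": {"kota", "kotu", "kotem", "kocie", "koty"},
    "pies": {"psa", "psu", "psem", "psie"},
    "tam": {"tam"},
    "oraz": set(),
}
kept = filter_sets(raw)
print(sorted(kept))
```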
 210 <h2>Testing</h2>
 211 <p>I tested the stemmer tables produced using the implementation
 212 described above. The following sections give some details about
 213 the testing setup.
 214 </p>
 215 <h3>Testing procedure</h3>
 216 <p>The testing procedure was as follows:
 217 </p>
 218 <ul>
 219   <li>the whole corpus of ~69,000 unique sets was shuffled, so that the
 220 input sets were in random order.</li>
 221   <li>the corpus was split into two parts - one with 30,000 sets (Part
 222 1), the other with ~39,000 sets (Part 2).</li>
 223   <li>Training samples were drawn in sequential order from Part 1.
 224 Since the sets were already randomized, the training samples were also
 225 randomized, but this procedure ensured that each larger training sample
 226 contained all smaller samples.</li>
 227   <li>Part 2 was used for testing. Note: this means that the testing
 228 run used <em>only</em> words previously unseen during the training
 229 phase. This is the worst-case scenario, because it means that the stemmer must
 230 extrapolate the learned rules to unknown cases. This also means that in
 231 a real-life case (where the input is a mix of known and unknown
 232 words) the F-measure of the stemmer should be even higher than the
 233 results in the table below suggest.</li>
 234 </ul>
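<p>The procedure above can be sketched in a few lines. The function name,
the fixed random seed and the scaled-down sizes (69 sets instead of ~69,000)
are assumptions chosen so the example runs instantly; only the structure
(shuffle, split, nested prefixes of Part 1) follows the description.</p>

```python
import random


def split_and_sample(sets, part1_size, sample_sizes, seed=0):
    """Shuffle, split into Part 1 / Part 2, and draw nested training samples."""
    shuffled = list(sets)
    random.Random(seed).shuffle(shuffled)
    part1, part2 = shuffled[:part1_size], shuffled[part1_size:]
    # Each sample is a prefix of Part 1, so a larger sample always
    # contains every smaller one.
    samples = {n: part1[:n] for n in sample_sizes}
    return samples, part2


# Scaled-down stand-in for the corpus: 69 sets, 30 of them in Part 1.
sets = ["set%d" % i for i in range(69)]
samples, test_part = split_and_sample(sets, 30, [1, 5, 10, 20])
print(len(test_part), [len(s) for s in samples.values()])
```

<p>Taking prefixes of an already shuffled list keeps every sample random
while guaranteeing the nesting property, and Part 2 never overlaps with any
training sample.</p>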
 235 <h3>Test results</h3>
 236 <p>The following table summarizes test results for varying sizes
 237 of training samples. The meaning of the table columns is
 238 described below:
 239 </p>
 240 <ul>
 241   <li><b>training sets:</b> the number of training sets. One set
 242 consists of one lemma and at least 4 and up to ~80 inflected forms
 243 (including pre- and suffixed forms).</li>
 244   <li><b>testing forms:</b> the number of testing forms. Only inflected
 245 forms were used in testing.</li>
 246   <li><b>stem OK:</b> the number of cases when the produced output was a
 247 correct (unique) stem. Note: quite often correct stems were also
 248 correct lemmas.</li>
 249   <li><b>lemma OK:</b> the number of cases when the produced output was a
 250 correct lemma.</li>
 251   <li><b>missing:</b> the number of cases when the stemmer was unable to
 252 provide any output.</li>
 253   <li><b>stem bad:</b> the number of cases when the produced output was a
 254 stem, but one already in use identifying a different set.</li>
 255   <li><b>lemma bad:</b> the number of cases when the produced output was an
 256 incorrect lemma. Note: quite often in such cases the output was a
 257 correct stem.</li>
 258   <li><b>table size:</b> the size in bytes of the stemmer table.</li>
 259 </ul>
 260 <div align="center">
 261 <table border="1" cellpadding="2" cellspacing="0">
 262   <tbody>
 263     <tr bgcolor="#a0b0c0">
 264       <th>Training sets</th>
 265       <th>Testing forms</th>
 266       <th>Stem OK</th>
 267       <th>Lemma OK</th>
 268       <th>Missing</th>
 269       <th>Stem Bad</th>
 270       <th>Lemma Bad</th>
 271       <th>Table size [B]</th>
 272     </tr>
 273     <tr align="right">
 274       <td>100</td>
 275       <td>1022985</td>
 276       <td>842209</td>
 277       <td>593632</td>
 278       <td>172711</td>
 279       <td>22331</td>
 280       <td>256642</td>
 281       <td>28438</td>
 282     </tr>
 283     <tr align="right">
 284       <td>200</td>
 285       <td>1022985</td>
 286       <td>862789</td>
 287       <td>646488</td>
 288       <td>153288</td>
 289       <td>16306</td>
 290       <td>223209</td>
 291       <td>48660</td>
 292     </tr>
 293     <tr align="right">
 294       <td>500</td>
 295       <td>1022985</td>
 296       <td>885786</td>
 297       <td>685009</td>
 298       <td>130772</td>
 299       <td>14856</td>
 300       <td>207204</td>
 301       <td>108798</td>
 302     </tr>
 303     <tr align="right">
 304       <td>700</td>
 305       <td>1022985</td>
 306       <td>909031</td>
 307       <td>704609</td>
 308       <td>107084</td>
 309       <td>15442</td>
 310       <td>211292</td>
 311       <td>139291</td>
 312     </tr>
 313     <tr align="right">
 314       <td>1000</td>
 315       <td>1022985</td>
 316       <td>926079</td>
 317       <td>725720</td>
 318       <td>90117</td>
 319       <td>14941</td>
 320       <td>207148</td>
 321       <td>183677</td>
 322     </tr>
 323     <tr align="right">
 324       <td>2000</td>
 325       <td>1022985</td>
 326       <td>942886</td>
 327       <td>746641</td>
 328       <td>73429</td>
 329       <td>14903</td>
 330       <td>202915</td>
 331       <td>313516</td>
 332     </tr>
 333     <tr align="right">
 334       <td>5000</td>
 335       <td>1022985</td>
 336       <td>954721</td>
 337       <td>759930</td>
 338       <td>61476</td>
 339       <td>14817</td>
 340       <td>201579</td>
 341       <td>640969</td>
 342     </tr>
 343     <tr align="right">
 344       <td>7000</td>
 345       <td>1022985</td>
 346       <td>956165</td>
 347       <td>764033</td>
 348       <td>60364</td>
 349       <td>14620</td>
 350       <td>198588</td>
 351       <td>839347</td>
 352     </tr>
 353     <tr align="right">
 354       <td>10000</td>
 355       <td>1022985</td>
 356       <td>965427</td>
 357       <td>775507</td>
 358       <td>50797</td>
 359       <td>14662</td>
 360       <td>196681</td>
 361       <td>1144537</td>
 362     </tr>
 363     <tr align="right">
 364       <td>12000</td>
 365       <td>1022985</td>
 366       <td>967664</td>
 367       <td>782143</td>
 368       <td>48722</td>
 369       <td>14284</td>
 370       <td>192120</td>
 371       <td>1313508</td>
 372     </tr>
 373     <tr align="right">
 374       <td>15000</td>
 375       <td>1022985</td>
 376       <td>973188</td>
 377       <td>788867</td>
 378       <td>43247</td>
 379       <td>14349</td>
 380       <td>190871</td>
 381       <td>1567902</td>
 382     </tr>
 383     <tr align="right">
 384       <td>17000</td>
 385       <td>1022985</td>
 386       <td>974203</td>
 387       <td>791804</td>
 388       <td>42319</td>
 389       <td>14333</td>
 390       <td>188862</td>
 391       <td>1733957</td>
 392     </tr>
 393     <tr align="right">
 394       <td>20000</td>
 395       <td>1022985</td>
 396       <td>976234</td>
 397       <td>791554</td>
 398       <td>40058</td>
 399       <td>14601</td>
 400       <td>191373</td>
 401       <td>1977615</td>
 402     </tr>
 403   </tbody>
 404 </table>
 405 </div>
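<p>The raw counts are easier to compare as fractions of the testing forms.
The sketch below copies three rows from the table (100, 1000 and 20000
training sets) and converts them to rates; the function name and the choice
of rows are illustrative. In these three rows the lemma OK, lemma bad and
missing counts add up exactly to the number of testing forms, which the code
checks.</p>

```python
ROWS = [
    # (training sets, testing forms, stem OK, lemma OK, missing, stem bad, lemma bad)
    (100, 1022985, 842209, 593632, 172711, 22331, 256642),
    (1000, 1022985, 926079, 725720, 90117, 14941, 207148),
    (20000, 1022985, 976234, 791554, 40058, 14601, 191373),
]


def rates(row):
    """Express stem OK, lemma OK and missing as fractions of testing forms."""
    _, total, stem_ok, lemma_ok, missing, _, _ = row
    return {
        "stem_ok": stem_ok / total,
        "lemma_ok": lemma_ok / total,
        "missing": missing / total,
    }


for row in ROWS:
    # Sanity check: lemma OK + missing + lemma bad == testing forms.
    assert row[3] + row[4] + row[6] == row[1]
    r = rates(row)
    print("%6d sets: stem OK %.1f%%, lemma OK %.1f%%, missing %.1f%%"
          % (row[0], 100 * r["stem_ok"], 100 * r["lemma_ok"], 100 * r["missing"]))
```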
 406 <p>I also measured the time to produce a stem (which involves
 407 traversing a trie,
 408 retrieving a patch command and applying the patch command to the input
 409 string).
 410 On a machine running Windows XP (Pentium 4, 1.7 GHz, JDK 1.4.2_03
 411 HotSpot),
 412 for tables ranging in size from 1,000 to 20,000 cells, the time to
 413 produce a
 414 single stem varies between 5 and 10 microseconds.<br>
 415 </p>
 416 <p>This means that the stemmer can process up to <span
 417  style="font-weight: bold;">200,000 words per second</span>, an
 418 outstanding result when compared to other stemmers (Morfeusz - ~2,000
 419 w/s, FormAN (MS Word analyzer) - ~1,000 w/s).<br>
 420 </p>
 421 <p>The package contains a class <code>org.getopt.stempel.Benchmark</code>,
 422 which you can use to produce reports
 423 like the one below:<br>
 424 </p>
 425 <pre>--------- Stemmer benchmark report: -----------<br>Stemmer table:  /res/tables/stemmer_2000.out<br>Input file:     ../test3.txt<br>Number of runs: 3<br><br> RUN NUMBER:            1       2       3<br> Total input words      1378176 1378176 1378176<br> Missed output words    112     112     112<br> Time elapsed [ms]      6989    6940    6640<br> Hit rate percent       99.99%  99.99%  99.99%<br> Miss rate percent      00.01%  00.01%  00.01%<br> Words per second       197192  198584  207557<br> Time per word [us]     5.07    5.04    4.82<br></pre>
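<p>The throughput figures in the report follow directly from the word count
and the elapsed time. The short sketch below recomputes them from the values
printed above; the function name <code>throughput</code> is an assumption and
the code is not part of the <code>Benchmark</code> class.</p>

```python
def throughput(total_words, elapsed_ms):
    """Return (words per second, microseconds per word)."""
    words_per_second = total_words * 1000 / elapsed_ms
    us_per_word = elapsed_ms * 1000 / total_words
    return words_per_second, us_per_word


TOTAL_WORDS = 1378176          # "Total input words" from the report
RUNS_MS = [6989, 6940, 6640]   # "Time elapsed [ms]" for runs 1-3

for ms in RUNS_MS:
    wps, us = throughput(TOTAL_WORDS, ms)
    print("%.0f words/s, %.2f us/word" % (wps, us))
```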
 426 <h2>Summary</h2>
 427 <p>The results of these tests are very encouraging. It seems that using
 428 the
 429 training corpus and the stemming algorithm described above results in a
 430 high-quality stemmer useful for most applications. Moreover, it can
 431 also
 432 be used as a better-than-average lemmatizer.</p>
 433 <p>Both the author of the implementation
 434 (Leo Galambos, &lt;leo.galambos AT egothor DOT org&gt;) and the author
 435 of this
 436 compilation (Andrzej Bialecki &lt;ab AT getopt DOT org&gt;) would
 437 appreciate any
 438 feedback and suggestions for further improvements.</p>
 439 <h2>Bibliography</h2>
 440 <ol>
 441   <li>Galambos, L.: Multilingual Stemmer in Web Environment, PhD
 442 Thesis,
 443 Faculty of Mathematics and Physics, Charles University in Prague, in
 444 press.</li>
 445   <li>Galambos, L.: Semi-automatic Stemmer Evaluation. International
 446 Intelligent Information Processing and Web Mining Conference, 2004,
 447 Zakopane, Poland.</li>
 448   <li>Galambos, L.: Lemmatizer for Document Information Retrieval
 449 Systems in JAVA.<span style="text-decoration: underline;"> </span><a
 450  class="moz-txt-link-rfc2396E"
 451  href="http://www.informatik.uni-trier.de/%7Eley/db/conf/sofsem/sofsem2001.html#Galambos01">&lt;http://www.informatik.uni-trier.de/%7Eley/db/conf/sofsem/sofsem2001.html#Galambos01&gt;</a>
 452 SOFSEM 2001, Piestany, Slovakia. <br>
 453   </li>
 454 </ol>
 455 <br>
 456 <br>
 457 </body>
 458 </html>