
Stempel - Algorithmic Stemmer for Polish Language


Introduction


A method for conflating different inflected word forms is an important component of many Information Retrieval systems. It helps to improve the system's recall and can significantly reduce the index size. This is especially true for highly inflectional languages like those from the Slavic language family (Czech, Slovak, Polish, Russian, Bulgarian, etc.).


This page describes a software package consisting of high-quality stemming tables for Polish, and a universal algorithmic stemmer, which operates using these tables. The stemmer code is taken virtually unchanged from the Egothor project.


The software distribution includes stemmer tables prepared using an extensive corpus of the Polish language (see details below).


This work is available under an Apache-style Open Source license - the stemmer code is covered by the Egothor License, the tables and other additions are covered by the Apache License 2.0. Both licenses allow the code to be used in Open Source as well as commercial (closed-source) projects.


Terminology


A short explanation is in order about the terminology used in this text.


In the following sections I make a distinction between stem and lemma.


A lemma is the base grammatical form (dictionary form, headword) of a word: an existing, grammatically correct word in some human language.


A stem, on the other hand, is just a unique token, not necessarily a meaningful word in any human language, but one that can serve as a unique label in place of the lemma for the same set of inflected forms. Quite often a stem is referred to as the "root" of a word - which is incorrect and misleading (stems sometimes have very little to do with the linguistic root of a word, i.e. the pattern found in a word that is common to all its inflected forms or within a family of languages).


For an IR system stems are usually sufficient; for a morphological analysis system lemmas are obviously a must. In practice, various stemmers produce a mix of stems and lemmas, as is the case with the stemmer described here. Additionally, for languages that use suffix-based inflection rules, many stemmers based on suffix-stripping will produce a large percentage of stems equivalent to lemmas. This is not the case, however, for languages with complex, irregular inflection rules (such as the Slavic languages) - there, simplistic suffix-stripping stemmers produce very poor results.


Background


Lemmatization is the process of finding the base, non-inflected form of a word. The result of lemmatization is a correct existing word, often in the nominative case for nouns and the infinitive form for verbs. A given inflected form may correspond to several lemmas (e.g. "found" -> find, found) - the correct choice depends on the context.
Stemming, in contrast, is concerned mostly with finding a unique "root" of a word, which does not necessarily result in any existing word or lemma. The quality of stemming is measured by the rate of collisions (overstemming - which causes words with different lemmas to be incorrectly conflated into one "root") and the rate of superfluous word "roots" (understemming - which assigns several "roots" to words with the same lemma).
Both stemmers and lemmatizers can be implemented in various ways. The two most common approaches are:

- dictionary-based: each word, together with its stem or lemma, is looked up in a precompiled dictionary, so only word forms present in the dictionary can be handled;
- algorithmic (rule-based): transformation rules are applied to the word itself, so previously unseen word forms can be processed as well.

There are many existing and well-known implementations of stemmers for English (Porter, Lovins, Krovetz) and other European languages (Snowball). There are also good-quality commercial lemmatizers for Polish. However, there is only one freely available Polish stemmer, implemented by Dawid Weiss, based on the "ispell" dictionary and Jan Daciuk's FSA package. That stemmer is dictionary-based. This means that even though it can achieve perfect accuracy for previously known word forms found in its dictionary, it completely fails for all other word forms. This deficiency is somewhat mitigated by the comprehensive dictionary distributed with that stemmer (so there is a high probability that most of the words in the input text will be found in the dictionary), but the problem still remains (see the project's page for a more detailed description).
The implementation described here uses an algorithmic method. This method and the particular algorithm implementation are described in detail in [1][2]. The main advantage of algorithmic stemmers is their ability to process previously unseen word forms with high accuracy. This particular algorithm uses a set of transformation rules (patch commands), which describe how a word with a given pattern should be transformed to its stem. These rules are first learned from a training corpus. They don't cover all possible cases, so there is always some loss of precision/recall (which means that even words from the training corpus are sometimes incorrectly stemmed).

Algorithm and implementation

The algorithm and its Java implementation are described in detail in the publications cited below. Here is just a short excerpt from [2]:
"The aim is separation of the +stemmer execution code from the data +structures [...]. In other words, a static algorithm configurable by +data must be developed. The word transformations that happen in the +stemmer must be then encoded to the data tables.
The tacit input of our method is a sample set (a so-called dictionary) of words (as keys) and their stems. Each record can be equivalently stored as a key and the record of the key's transformation to its respective stem. The transformation record is termed a patch command (P-command). It must be ensured that P-commands are universal, and that P-commands can transform any word to its stem. Our solution [6,8] is based on the Levenstein metric [10], which produces a P-command as the minimum cost path in a directed graph.
One can imagine the P-command as an algorithm for an operator (editor) that rewrites a string to another string. The operator can use these instructions (PP-commands):

- removal - deletes a sequence of characters starting at the current cursor position and moves the cursor to the next character. The length of this sequence is the parameter;
- insertion - inserts a character ch, without moving the cursor. The character ch is a parameter;
- substitution - rewrites a character at the current cursor position to the character ch and moves the cursor to the next character. The character ch is a parameter;
- no operation (NOOP) - skips a sequence of characters starting at the current cursor position. The length of this sequence is the parameter.
The P-commands are applied from the end of a word (right to left). This assumption can reduce the set of P-commands, because the last NOOP, moving the cursor to the end of a string without any changes, need not be stored."
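To make this concrete, here is a minimal, self-contained Java sketch of such an editor. The one-letter command encoding used below (e.g. "D3" = delete three characters, "Rb" = substitute 'b') is invented for this illustration - the actual Egothor tables use their own compact encoding - but the four instruction types and the right-to-left application follow the description above.

// Toy interpreter for P-commands, illustrating the four instructions
// described in the excerpt. The command encoding is hypothetical: each
// instruction is one letter followed by a one-character parameter.
public class PatchCommandDemo {

  static String apply(String word, String commands) {
    StringBuilder buf = new StringBuilder(word);
    int cursor = buf.length() - 1;              // start at the end of the word
    for (int i = 0; i < commands.length(); i += 2) {
      char op = commands.charAt(i);
      char param = commands.charAt(i + 1);
      switch (op) {
        case 'D': {                             // removal: delete 'param' chars, move left
          int len = param - '0';
          buf.delete(cursor - len + 1, cursor + 1);
          cursor -= len;
          break;
        }
        case 'I':                               // insertion: insert ch, cursor stays
          buf.insert(cursor + 1, param);
          break;
        case 'R':                               // substitution: overwrite char, move left
          buf.setCharAt(cursor, param);
          cursor--;
          break;
        case 'N':                               // NOOP: skip 'param' chars to the left
          cursor -= param - '0';
          break;
      }
    }
    return buf.toString();
  }

  public static void main(String[] args) {
    // A hypothetical rule for an inflected form of "dom" (house):
    // delete the 3-character suffix "ami", leaving the stem "dom".
    System.out.println(apply("domami", "D3")); // prints "dom"
  }
}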
The data structure used to keep the dictionary (words and their P-commands) is a trie. Several optimization steps are applied in turn to reduce and optimize the initial trie, by eliminating useless information and shortening the paths in the trie.
Finally, in order to obtain the stem of an input word, the word is passed once along a matching path in the trie, applying at each node the P-commands stored there. The result is the word's stem.
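In the Lucene source tree that ships this package, the trie lookup and the application of P-commands are wrapped by the org.apache.lucene.analysis.stempel.StempelStemmer class. A minimal usage sketch follows; the table file name is an assumption made for this example (the distribution includes a precompiled Polish table):

import java.io.FileInputStream;
import org.apache.lucene.analysis.stempel.StempelStemmer;

public class StemDemo {
  public static void main(String[] args) throws Exception {
    // Load a compiled stemmer table; the file name here is only an
    // example - use the table file shipped with your distribution.
    StempelStemmer stemmer =
        new StempelStemmer(new FileInputStream("stemmer_20000.tbl"));

    // stem() walks the matching path in the trie and applies the
    // P-commands stored at the visited nodes; it returns null when
    // no stem can be produced.
    StringBuilder stem = stemmer.stem("domami");
    System.out.println(stem != null ? stem : "(no stem)");
  }
}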

Corpus


(to be completed...)


The following Polish corpora have been used:


This step was the most time-consuming - and it would probably have been even more tedious and difficult if not for the help of Python. The source texts had to be brought to a common encoding (UTF-8) - some of them used quite ancient encodings like Mazovia or DHN - and then scripts were written to collect all lemmas and inflected forms from the source texts. In cases where the source text was not tagged, I used the SAM analyzer to produce lemmas. In cases of ambiguous lemmatization I decided to link the inflected forms to all of their possible base forms.


All grammatical categories were allowed to appear in the corpus, i.e. nouns, verbs, adjectives, numerals, and pronouns. The resulting corpus consisted of roughly 87,000 inflection sets, i.e. each set consisted of one base form (lemma) and many inflected forms. However, because of the nature of the training method I restricted these sets to those containing at least 4 inflected forms. Sets with 3 or fewer inflected forms were removed, so that the final corpus consisted of ~69,000 unique sets, which in turn contained ~1.5 mln inflected forms.


Testing


I tested the stemmer tables produced using the implementation described above. The following sections give some details about the testing setup.


Testing procedure


The testing procedure used stemmer tables trained on samples of increasing size, which were then applied to a fixed set of test word forms.


Test results


The following table summarizes the test results for varying sizes of the training sample. The columns give the number of training sets used, the total number of test word forms, the numbers of correctly produced stems and lemmas, the number of forms for which no output was produced (Missing), the numbers of incorrectly produced stems and lemmas, and the size of the resulting stemmer table in bytes.

Training sets  Testing forms  Stem OK  Lemma OK  Missing  Stem Bad  Lemma Bad  Table size [B]
          100        1022985   842209    593632   172711     22331     256642           28438
          200        1022985   862789    646488   153288     16306     223209           48660
          500        1022985   885786    685009   130772     14856     207204          108798
          700        1022985   909031    704609   107084     15442     211292          139291
         1000        1022985   926079    725720    90117     14941     207148          183677
         2000        1022985   942886    746641    73429     14903     202915          313516
         5000        1022985   954721    759930    61476     14817     201579          640969
         7000        1022985   956165    764033    60364     14620     198588          839347
        10000        1022985   965427    775507    50797     14662     196681         1144537
        12000        1022985   967664    782143    48722     14284     192120         1313508
        15000        1022985   973188    788867    43247     14349     190871         1567902
        17000        1022985   974203    791804    42319     14333     188862         1733957
        20000        1022985   976234    791554    40058     14601     191373         1977615

I also measured the time to produce a stem (which involves traversing the trie, retrieving a patch command and applying the patch command to the input string). On a machine running Windows XP (Pentium 4, 1.7 GHz, JDK 1.4.2_03 HotSpot), for tables ranging in size from 1,000 to 20,000 cells, the time to produce a single stem varies between 5-10 microseconds.


This means that the stemmer can process up to 200,000 words per second, an outstanding result when compared to other stemmers (Morfeusz - ~2,000 w/s, FormAN (MS Word analyzer) - ~1,000 w/s).
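For a rough reproduction of such measurements, a simple timing loop around the stemmer is sufficient. The sketch below assumes the same StempelStemmer wrapper and (hypothetical) table file name as in the earlier example; the word list is arbitrary:

import java.io.FileInputStream;
import org.apache.lucene.analysis.stempel.StempelStemmer;

public class StemTiming {
  public static void main(String[] args) throws Exception {
    StempelStemmer stemmer =
        new StempelStemmer(new FileInputStream("stemmer_20000.tbl"));
    String[] words = { "domami", "domach", "domowi", "domy", "domem" };

    // Warm up the JIT so that the timed loop measures steady-state speed.
    for (int i = 0; i < 100000; i++) {
      stemmer.stem(words[i % words.length]);
    }

    int n = 1000000;
    long start = System.nanoTime();
    for (int i = 0; i < n; i++) {
      stemmer.stem(words[i % words.length]);
    }
    long elapsedNs = System.nanoTime() - start;
    System.out.printf("%.2f us per stem, %.0f words/s%n",
        elapsedNs / 1000.0 / n, n * 1e9 / elapsedNs);
  }
}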


The package contains a class org.getopt.stempel.Benchmark, which you can use to produce reports like the one below:

--------- Stemmer benchmark report: -----------
Stemmer table: /res/tables/stemmer_2000.out
Input file: ../test3.txt
Number of runs: 3

RUN NUMBER: 1 2 3
Total input words 1378176 1378176 1378176
Missed output words 112 112 112
Time elapsed [ms] 6989 6940 6640
Hit rate percent 99.99% 99.99% 99.99%
Miss rate percent 00.01% 00.01% 00.01%
Words per second 197192 198584 207557
Time per word [us] 5.07 5.04 4.82

Summary


The results of these tests are very encouraging. It seems that using the training corpus and the stemming algorithm described above results in a high-quality stemmer useful for most applications. Moreover, it can also be used as a better-than-average lemmatizer.


Both the author of the implementation (Leo Galambos, <leo.galambos AT egothor DOT org>) and the author of this compilation (Andrzej Bialecki <ab AT getopt DOT org>) would appreciate any feedback and suggestions for further improvements.


Bibliography

    +
1. Galambos, L.: Multilingual Stemmer in Web Environment. PhD Thesis, Faculty of Mathematics and Physics, Charles University in Prague, in press.
2. Galambos, L.: Semi-automatic Stemmer Evaluation. International Intelligent Information Processing and Web Mining Conference, 2004, Zakopane, Poland.
3. Galambos, L.: Lemmatizer for Document Information Retrieval Systems in JAVA. SOFSEM 2001, Piestany, Slovakia. <http://www.informatik.uni-trier.de/%7Eley/db/conf/sofsem/sofsem2001.html#Galambos01>