X-Git-Url: https://git.mdrn.pl/pylucene.git/blobdiff_plain/a2e61f0c04805cfcb8706176758d1283c7e3a55c..aaeed5504b982cf3545252ab528713250aa33eed:/lucene-java-3.4.0/lucene/contrib/analyzers/stempel/src/java/overview.html?ds=sidebyside diff --git a/lucene-java-3.4.0/lucene/contrib/analyzers/stempel/src/java/overview.html b/lucene-java-3.4.0/lucene/contrib/analyzers/stempel/src/java/overview.html deleted file mode 100644 index c5c3b12..0000000 --- a/lucene-java-3.4.0/lucene/contrib/analyzers/stempel/src/java/overview.html +++ /dev/null @@ -1,458 +0,0 @@ - - - - - Stempel - Algorithmic Stemmer for Polish Language - - - - - -

Stempel - Algorithmic Stemmer for Polish Language

-

Introduction

-

A method for conflation of different inflected word forms is an -important component of many Information Retrieval systems. It helps to -improve the system's recall and can significantly reduce the index -size. This is especially true for highly-inflectional languages like -those from the Slavic language family (Czech, Slovak, Polish, Russian, -Bulgarian, etc).

-

This page describes a software package consisting of high-quality -stemming tables for Polish, and a universal algorithmic stemmer, which -operates using these tables. The stemmer code is taken virtually -unchanged from the Egothor project.

-

The software distribution includes stemmer -tables prepared using an extensive corpus of Polish language (see -details below).

-

This work is available under Apache-style Open Source license - the -stemmer code is covered by Egothor License, the tables and other -additions are covered by Apache License 2.0. Both licenses allow to use -the code in Open Source as well as commercial (closed source) projects.

-

Terminology

-

A short explanation is in order about the terminology used in this -text.

-

In the following sections I make a distinction between stem -and lemma.

-

Lemma is a base grammatical form (dictionary form, headword) of a -word. Lemma is an existing, grammatically correct word in some human -language.

-

Stem on the other hand is just a unique token, not necessarily -making any sense in any human language, but which can serve as a unique -label instead of lemma for the same set of inflected forms. Quite often -stem is referred to as a "root" of the word - which is incorrect and -misleading (stems sometimes have very little to do with the linguistic -root of a word, i.e. a pattern found in a word which is common to all -inflected forms or within a family of languages).

-

For an IR system stems are usually sufficient, for a morphological -analysis system obviously lemmas are a must. In practice, various -stemmers produce a mix of stems and lemmas, as is the case with the -stemmer described here. Additionally, for some languages, which use -suffix-based inflection rules many stemmers based on suffix-stripping -will produce a large percentage of stems equivalent to lemmas. This is -however not the case for languages with complex, irregular inflection -rules (such as Slavic languages) - here simplistic suffix-stripping -stemmers produce very poor results.

-

Background

-

Lemmatization is a process of finding the base, non-inflected form -of a word. The result of lemmatization is a correct existing word, -often in nominative case for nouns and infinitive form for verbs. A -given inflected form may correspond to several lemmas (e.g. "found" --> find, found) - the correct choice depends on the context.
-
-Stemming is concerned mostly with finding a unique "root" of a word, -which not necessarily results in any existing word or lemma. The -quality of stemming is measured by the rate of collisions (overstemming -- which causes words with different lemmas to be incorrectly conflated -into one "root"), and the rate of superfluous word "roots" -(understemming - which assigns several "roots" to words with the same -lemma).
-
-Both stemmer and lemmatizer can be implemented in various ways. The two -most common approaches are:
-

- -There are many existing and well-known implementations of stemmers for -English (Porter, Lovins, Krovetz) and other European languages -(Snowball). There are also -good quality commercial lemmatizers for Polish. However, there is only -one -freely available Polish stemmer, implemented by -Dawid -Weiss, based on the "ispell" dictionary and Jan Daciuk's FSA package. That -stemmer is dictionary-based. This means that even -though it can achieve -perfect accuracy for previously known word forms found in its -dictionary, it -completely fails in case of all other word forms. This deficiency is -somewhat mitigated by the comprehensive dictionary distributed with -this stemmer (so there is a high probability that most of the words in -the input text will be found in the dictionary), however the problem -still remains (please see the page above for more detailed description).
-
-The implementation described here uses an algorithmic method. This -method -and particular algorithm implementation are described in detail in -[1][2]. -The main advantage of algorithmic stemmers is their ability to process -previously -unseen word forms with high accuracy. This particular algorithm uses a -set -of -transformation rules (patch commands), which describe how a word with a -given pattern should be transformed to its stem. These rules are first -learned from a training corpus. They don't -cover -all possible cases, so there is always some loss of precision/recall -(which -means that even the words from the training corpus are sometimes -incorrectly stemmed).
-

Algorithm and implementation

-The algorithm and its Java implementation is described in detail in the -publications cited below. Here's just a short excerpt from [2]:
-
-
-
"The aim is separation of the -stemmer execution code from the data -structures [...]. In other words, a static algorithm configurable by -data must be developed. The word transformations that happen in the -stemmer must be then encoded to the data tables.
-
-The tacit input of our method is a sample set (a so-called dictionary) -of words (as keys) and their stems. Each record can be equivalently -stored as a key and the record of key's transformation to its -respective stem. The transformation record is termed a patch command -(P-command). It must be ensured that P-commands are universal, and that -P-commands can transform any word to its stem. Our solution[6,8] is -based on the Levenstein metric [10], which produces P-command as the -minimum cost path in a directed graph.
-
-One can imagine the P-command as an algorithm for an operator (editor) -that rewrites a string to another string. The operator can use these -instructions (PP-command's): removal - -deletes a sequence of characters starting at the current cursor -position and moves the cursor to the next character. The length of this -sequence is the parameter; insertion - -inserts a character ch, without moving the cursor. The character ch is -a parameter; substitution  -- rewrites a character at the current cursor position to the character -ch and moves the cursor to the next character. The character ch is a -parameter; no operation (NOOP) -- skip a sequence of characters starting at the current cursor -position. The length of this sequence is the parameter.
-
-The P-commands are applied from the end of a word (right to left). This -assumption can reduce the set of P-command's, because the last NOOP, -moving the cursor to the end of a string without any changes, need not -be stored."
-
-
-Data structure used to keep the dictionary (words and their P-commands) -is a trie. Several optimization steps are applied in turn to reduce and -optimize the initial trie, by eliminating useless information and -shortening the paths in the trie.
-
-Finally, in order to obtain a stem from the input word, the word is -passed once through a matching path in the trie (applying at each node -the P-commands stored there). The result is a word stem.
-

Corpus

-

(to be completed...)

-

The following Polish corpora have been used:

- -

This step was the most time-consuming - and it would probably be -even more tedious and difficult if not for the -help of -Python. The source texts had to be -brought to a common encoding (UTF-8) - some of them used quite ancient -encodings like Mazovia or DHN - and then scripts were written to -collect all lemmas and -inflected forms from the source texts. In cases when the source text -was not -tagged, -I used the SAM analyzer to produce lemmas. In cases of ambiguous -lemmatization I decided to put references to inflected forms from all -base forms.
-

-

All grammatical categories were allowed to appear in the corpus, -i.e. nouns, verbs, adjectives, numerals, and pronouns. The resulting -corpus consisted of roughly 87,000+ inflection sets, i.e. each set -consisted of one base form (lemma) and many inflected forms. However, -because of the nature of the training method I restricted these sets to -include only those where there were at least 4 inflected forms. Sets -with 3 or less inflected forms were removed, so that the final corpus -consisted of ~69,000 unique sets, which in turn contained ~1.5 mln -inflected forms.
-

-

Testing

-

I tested the stemmer tables produced using the implementation -described above. The following sections give some details about -the testing setup. -

-

Testing procedure

-

The testing procedure was as follows: -

- -

Test results

-

The following table summarizes test results for varying sizes -of training samples. The meaning of the table columns is -described below: -

- -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Training setsTesting formsStem OKLemma OKMissingStem BadLemma BadTable size [B]
10010229858422095936321727112233125664228438
20010229858627896464881532881630622320948660
500102298588578668500913077214856207204108798
700102298590903170460910708415442211292139291
100010229859260797257209011714941207148183677
200010229859428867466417342914903202915313516
500010229859547217599306147614817201579640969
700010229859561657640336036414620198588839347
10000102298596542777550750797146621966811144537
12000102298596766478214348722142841921201313508
15000102298597318878886743247143491908711567902
17000102298597420379180442319143331888621733957
20000102298597623479155440058146011913731977615
-
-

I also measured the time to produce a stem (which involves -traversing a trie, -retrieving a patch command and applying the patch command to the input -string). -On a machine running Windows XP (Pentium 4, 1.7 GHz, JDK 1.4.2_03 -HotSpot), -for tables ranging in size from 1,000 to 20,000 cells, the time to -produce a -single stem varies between 5-10 microseconds.
-

-

This means that the stemmer can process up to 200,000 words per second, an -outstanding result when compared to other stemmers (Morfeusz - ~2,000 -w/s, FormAN (MS Word analyzer) - ~1,000 w/s).
-

-

The package contains a class org.getopt.stempel.Benchmark, -which you can use to produce reports -like the one below:
-

-
--------- Stemmer benchmark report: -----------
Stemmer table: /res/tables/stemmer_2000.out
Input file: ../test3.txt
Number of runs: 3

RUN NUMBER: 1 2 3
Total input words 1378176 1378176 1378176
Missed output words 112 112 112
Time elapsed [ms] 6989 6940 6640
Hit rate percent 99.99% 99.99% 99.99%
Miss rate percent 00.01% 00.01% 00.01%
Words per second 197192 198584 207557
Time per word [us] 5.07 5.04 4.82
-

Summary

-

The results of these tests are very encouraging. It seems that using -the -training corpus and the stemming algorithm described above results in a -high-quality stemmer useful for most applications. Moreover, it can -also -be used as a better than average lemmatizer.

-

Both the author of the implementation -(Leo Galambos, <leo.galambos AT egothor DOT org>) and the author -of this -compilation (Andrzej Bialecki <ab AT getopt DOT org>) would -appreciate any -feedback and suggestions for further improvements.

-

Bibliography

-
    -
  1. Galambos, L.: Multilingual Stemmer in Web Environment, PhD -Thesis, -Faculty of Mathematics and Physics, Charles University in Prague, in -press.
  2. -
  3. Galambos, L.: Semi-automatic Stemmer Evaluation. International -Intelligent Information Processing and Web Mining Conference, 2004, -Zakopane, Poland.
  4. -
  5. Galambos, L.: Lemmatizer for Document Information Retrieval -Systems in JAVA. <http://www.informatik.uni-trier.de/%7Eley/db/conf/sofsem/sofsem2001.html#Galambos01> -SOFSEM 2001, Piestany, Slovakia.
    -
  6. -
-
-
- -