lucene-java-3.4.0/lucene/contrib/analyzers/stempel/src/java/overview.html
A method for conflation of different inflected word forms is an important component of many Information Retrieval systems. It helps to improve the system's recall and can significantly reduce the index size. This is especially true for highly inflectional languages like those from the Slavic language family (Czech, Slovak, Polish, Russian, Bulgarian, etc.).
This page describes a software package consisting of high-quality stemming tables for Polish and a universal algorithmic stemmer that operates on these tables. The stemmer code is taken virtually unchanged from the Egothor project.
The software distribution includes stemmer tables prepared using an extensive corpus of the Polish language (see details below).
This work is available under an Apache-style Open Source license: the stemmer code is covered by the Egothor License, while the tables and other additions are covered by the Apache License 2.0. Both licenses allow the code to be used in Open Source as well as commercial (closed-source) projects.
A short explanation is in order about the terminology used in this text.
In the following sections I make a distinction between stem and lemma.
A lemma is the base grammatical form (dictionary form, headword) of a word: an existing, grammatically correct word in some human language.
A stem, on the other hand, is just a unique token, not necessarily meaningful in any human language, which can serve instead of the lemma as a unique label for the same set of inflected forms. Quite often a stem is referred to as the "root" of the word, which is incorrect and misleading: stems sometimes have very little to do with the linguistic root of a word, i.e. the pattern found in a word that is common to all inflected forms or within a family of languages.
For an IR system stems are usually sufficient; for a morphological analysis system lemmas are obviously a must. In practice, various stemmers produce a mix of stems and lemmas, as is the case with the stemmer described here. Additionally, for some languages that use suffix-based inflection rules, many stemmers based on suffix stripping will produce a large percentage of stems equivalent to lemmas. This is, however, not the case for languages with complex, irregular inflection rules (such as the Slavic languages), where simplistic suffix-stripping stemmers produce very poor results.
Lemmatization is the process of finding the base, non-inflected form of a word. The result of lemmatization is a correct existing word, often in the nominative case for nouns and the infinitive form for verbs. A given inflected form may correspond to several lemmas (e.g. "found" -> find, found); the correct choice depends on the context.
Stemming is concerned mostly with finding a unique "root" of a word, which does not necessarily result in any existing word or lemma. The quality of stemming is measured by the rate of collisions (overstemming, which causes words with different lemmas to be incorrectly conflated under one "root") and the rate of superfluous word "roots" (understemming, which assigns several "roots" to words with the same lemma).
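These two error types can be counted directly from (inflected form, lemma, stem) triples. The sketch below is an illustrative check, not part of the package; the English examples and the `stemming_quality` helper are invented for illustration:

```python
# Toy illustration (not the Egothor algorithm): measuring stemmer quality
# on (inflected form, lemma, stem) triples. Overstemming conflates words
# with different lemmas under one stem; understemming splits one lemma's
# forms across several stems.
from collections import defaultdict

def stemming_quality(triples):
    """triples: iterable of (form, lemma, stem) tuples."""
    lemmas_per_stem = defaultdict(set)
    stems_per_lemma = defaultdict(set)
    for form, lemma, stem in triples:
        lemmas_per_stem[stem].add(lemma)
        stems_per_lemma[lemma].add(stem)
    # a stem covering >1 lemma is an overstemming collision;
    # a lemma spread over >1 stem is an understemming split
    overstem = sum(1 for s in lemmas_per_stem.values() if len(s) > 1)
    understem = sum(1 for s in stems_per_lemma.values() if len(s) > 1)
    return overstem, understem

data = [
    ("found", "find", "found"),   # past tense of "find"
    ("found", "found", "found"),  # "to found" (establish): same stem, other lemma
    ("running", "run", "run"),
    ("ran", "run", "ran"),        # same lemma, different stem
]
print(stemming_quality(data))  # -> (1, 1)
```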
Both stemmers and lemmatizers can be implemented in various ways. The two most common approaches are:
(to be completed...)
The following Polish corpora have been used:
This step was the most time-consuming, and it would probably have been even more tedious and difficult if not for the help of Python. The source texts had to be brought to a common encoding (UTF-8) - some of them used quite ancient encodings like Mazovia or DHN - and then scripts were written to collect all lemmas and inflected forms from the source texts. In cases where the source text was not tagged, I used the SAM analyzer to produce lemmas. In cases of ambiguous lemmatization I decided to put references to the inflected forms under all base forms.
All grammatical categories were allowed to appear in the corpus, i.e. nouns, verbs, adjectives, numerals, and pronouns. The resulting corpus consisted of roughly 87,000+ inflection sets, where each set consisted of one base form (lemma) and many inflected forms. However, because of the nature of the training method I restricted these sets to those with at least 4 inflected forms. Sets with 3 or fewer inflected forms were removed, so that the final corpus consisted of ~69,000 unique sets, which in turn contained ~1.5 million inflected forms.
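The grouping and filtering steps described above can be sketched as follows. This is an illustrative reconstruction, not one of the original scripts; the tab-separated `form<TAB>lemma` input format and the `build_inflection_sets` helper are assumptions made for the example:

```python
# Sketch of the corpus-preparation step: group inflected forms into
# inflection sets keyed by lemma, then drop sets with fewer than 4
# inflected forms, as was done for the final training corpus.
from collections import defaultdict

def build_inflection_sets(lines, min_forms=4):
    """lines: "inflected_form<TAB>lemma" strings (assumed format)."""
    sets = defaultdict(set)
    for line in lines:
        form, lemma = line.rstrip("\n").split("\t")
        sets[lemma].add(form)
    return {lemma: forms for lemma, forms in sets.items()
            if len(forms) >= min_forms}

sample = ["domu\tdom", "domy\tdom", "domem\tdom", "domach\tdom",
          "kota\tkot", "koty\tkot"]
sets = build_inflection_sets(sample)
# "dom" has 4 inflected forms and is kept; "kot" has only 2 and is dropped
```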
I tested the stemmer tables produced using the implementation described above. The following sections give some details about the testing setup.
The testing procedure was as follows:
The following table summarizes test results for varying sizes of training samples. The meaning of the table columns is described below:
Training sets | Testing forms | Stem OK | Lemma OK | Missing | Stem Bad | Lemma Bad | Table size [B]
---|---|---|---|---|---|---|---
100 | 1022985 | 842209 | 593632 | 172711 | 22331 | 256642 | 28438
200 | 1022985 | 862789 | 646488 | 153288 | 16306 | 223209 | 48660
500 | 1022985 | 885786 | 685009 | 130772 | 14856 | 207204 | 108798
700 | 1022985 | 909031 | 704609 | 107084 | 15442 | 211292 | 139291
1000 | 1022985 | 926079 | 725720 | 90117 | 14941 | 207148 | 183677
2000 | 1022985 | 942886 | 746641 | 73429 | 14903 | 202915 | 313516
5000 | 1022985 | 954721 | 759930 | 61476 | 14817 | 201579 | 640969
7000 | 1022985 | 956165 | 764033 | 60364 | 14620 | 198588 | 839347
10000 | 1022985 | 965427 | 775507 | 50797 | 14662 | 196681 | 1144537
12000 | 1022985 | 967664 | 782143 | 48722 | 14284 | 192120 | 1313508
15000 | 1022985 | 973188 | 788867 | 43247 | 14349 | 190871 | 1567902
17000 | 1022985 | 974203 | 791804 | 42319 | 14333 | 188862 | 1733957
20000 | 1022985 | 976234 | 791554 | 40058 | 14601 | 191373 | 1977615
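The raw counts translate directly into accuracy rates (correct stems or lemmas as a fraction of all testing forms). A quick sanity check on the first and last rows, using only numbers copied from the table above:

```python
# Stem and lemma accuracy as percentages of all testing forms,
# for the smallest and largest training samples in the table.
rows = {
    100:   dict(forms=1022985, stem_ok=842209, lemma_ok=593632),
    20000: dict(forms=1022985, stem_ok=976234, lemma_ok=791554),
}
rates = {n: (round(100 * r["stem_ok"] / r["forms"], 1),
             round(100 * r["lemma_ok"] / r["forms"], 1))
         for n, r in rows.items()}
print(rates)
# 100 training sets already stem ~82% of forms correctly;
# 20000 sets reach ~95% correct stems and ~77% correct lemmas.
```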
I also measured the time to produce a stem (which involves traversing a trie, retrieving a patch command, and applying the patch command to the input string). On a machine running Windows XP (Pentium 4, 1.7 GHz, JDK 1.4.2_03 HotSpot), for tables ranging in size from 1,000 to 20,000 cells, the time to produce a single stem varies between 5 and 10 microseconds.
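The lookup-and-patch idea can be illustrated with a drastically simplified stand-in: a longest-matching-suffix table in place of the real Egothor trie and patch commands. The suffix rules below are hypothetical examples, not entries from the trained Polish tables:

```python
# Greatly simplified stand-in for the trie + patch-command lookup:
# find the longest suffix that has a rule and apply its replacement.
# These rules are invented for illustration only.
RULES = {"ami": "", "ach": "", "em": "", "y": "", "a": ""}

def stem(word, rules=RULES):
    # try suffixes from longest to shortest
    for i in range(len(word)):
        suffix = word[i:]
        if suffix in rules:
            return word[:i] + rules[suffix]
    return word  # no rule matched; leave the word unchanged

print(stem("domami"))  # -> "dom"
```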
This means that the stemmer can process up to 200,000 words per second, an outstanding result when compared to other stemmers (Morfeusz: ~2,000 w/s, FormAN (MS Word analyzer): ~1,000 w/s).
The package contains a class org.getopt.stempel.Benchmark, which you can use to produce reports like the one below:
--------- Stemmer benchmark report: ------------
Stemmer table: /res/tables/stemmer_2000.out
Input file: ../test3.txt
Number of runs: 3
RUN NUMBER: 1 2 3
Total input words 1378176 1378176 1378176
Missed output words 112 112 112
Time elapsed [ms] 6989 6940 6640
Hit rate percent 99.99% 99.99% 99.99%
Miss rate percent 00.01% 00.01% 00.01%
Words per second 197192 198584 207557
Time per word [us] 5.07 5.04 4.82
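The derived figures in the report follow arithmetically from the raw counts; for example, the first run's throughput and per-word time can be reproduced from its word count and elapsed time:

```python
# Reproducing the derived figures of run 1 from the report above:
# 1378176 words processed in 6989 ms.
words, ms = 1378176, 6989
words_per_second = int(words / (ms / 1000))
time_per_word_us = ms * 1000 / words  # microseconds per word
print(words_per_second)            # -> 197192, as reported
print(round(time_per_word_us, 2))  # -> 5.07, as reported
```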
The results of these tests are very encouraging. It seems that using the training corpus and the stemming algorithm described above results in a high-quality stemmer useful for most applications. Moreover, it can also be used as a better-than-average lemmatizer.
Both the author of the implementation (Leo Galambos, <leo.galambos AT egothor DOT org>) and the author of this compilation (Andrzej Bialecki <ab AT getopt DOT org>) would appreciate any feedback and suggestions for further improvements.