A method for conflation of different inflected word forms is an important component of many Information Retrieval systems. It helps to improve the system's recall and can significantly reduce the index size. This is especially true for highly inflectional languages like those from the Slavic language family (Czech, Slovak, Polish, Russian, Bulgarian, etc.).
This page describes a software package consisting of high-quality stemming tables for Polish, and a universal algorithmic stemmer which operates using these tables. The stemmer code is taken virtually unchanged from the Egothor project.
The software distribution includes stemmer tables prepared using an extensive corpus of the Polish language (see details below).
This work is available under an Apache-style Open Source license: the stemmer code is covered by the Egothor License, while the tables and other additions are covered by the Apache License 2.0. Both licenses allow the code to be used in Open Source as well as in commercial (closed source) projects.
A short explanation is in order about the terminology used in this text.

In the following sections I make a distinction between stem and lemma.
A lemma is the base grammatical form (dictionary form, headword) of a word. A lemma is an existing, grammatically correct word in some human language.
A stem, on the other hand, is just a unique token, not necessarily a meaningful word in any human language, which can serve as a unique label in place of the lemma for the same set of inflected forms. Quite often a stem is referred to as the "root" of the word, which is incorrect and misleading (stems sometimes have very little to do with the linguistic root of a word, i.e. the pattern found in a word that is common to all inflected forms or across a family of languages).
For an IR system stems are usually sufficient; for a morphological analysis system lemmas are obviously a must. In practice, various stemmers produce a mix of stems and lemmas, as is the case with the stemmer described here. Additionally, for languages that use suffix-based inflection rules, many suffix-stripping stemmers will produce a large percentage of stems equivalent to lemmas. This is, however, not the case for languages with complex, irregular inflection rules (such as the Slavic languages), where simplistic suffix-stripping stemmers produce very poor results.
Lemmatization is the process of finding the base, non-inflected form of a word. The result of lemmatization is a correct existing word, often in the nominative case for nouns and the infinitive form for verbs. A given inflected form may correspond to several lemmas (e.g. "found" -> find, found); the correct choice depends on the context.
Stemming is concerned mostly with finding a unique "root" of a word, which does not necessarily result in any existing word or lemma. The quality of stemming is measured by the rate of collisions (overstemming, which causes words with different lemmas to be incorrectly conflated into one "root") and the rate of superfluous word "roots" (understemming, which assigns several "roots" to words with the same lemma).
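As a rough illustration of how these two error rates can be estimated against a corpus of known lemma/inflected-form pairs, here is a minimal Java sketch; the corpus content and the placeholder stem() function are invented for illustration and are not part of the package described on this page.

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Sketch of estimating over- and understemming against a corpus of known
 * lemma -> inflected-forms pairs.  Corpus content and stem() are placeholders.
 */
public class StemmingErrorRates {

  /** Placeholder stemmer: naive prefix truncation, for illustration only. */
  static String stem(String word) {
    return word.length() > 4 ? word.substring(0, 4) : word;
  }

  public static void main(String[] args) {
    Map<String, List<String>> corpus = new HashMap<>();
    corpus.put("pies", Arrays.asList("pies", "psa", "psu", "psem", "psy"));
    corpus.put("dom",  Arrays.asList("dom", "domu", "domem", "domy"));

    // stem -> lemmas whose forms were conflated under that stem
    Map<String, Set<String>> lemmasPerStem = new HashMap<>();
    // lemma -> distinct stems produced for its forms
    Map<String, Set<String>> stemsPerLemma = new HashMap<>();

    for (Map.Entry<String, List<String>> e : corpus.entrySet()) {
      for (String form : e.getValue()) {
        String s = stem(form);
        lemmasPerStem.computeIfAbsent(s, k -> new HashSet<>()).add(e.getKey());
        stemsPerLemma.computeIfAbsent(e.getKey(), k -> new HashSet<>()).add(s);
      }
    }

    // Overstemming: a single stem shared by forms of more than one lemma.
    long overstemming = lemmasPerStem.values().stream()
        .filter(lemmas -> lemmas.size() > 1).count();
    // Understemming: a single lemma whose forms were split across several stems.
    long understemming = stemsPerLemma.values().stream()
        .filter(stems -> stems.size() > 1).count();

    System.out.println("overstemmed stems:   " + overstemming);
    System.out.println("understemmed lemmas: " + understemming);
  }
}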
Both a stemmer and a lemmatizer can be implemented in various ways. The two most common approaches are:
(to be completed...)
The following Polish corpora have been used:
This step was the most time-consuming, and it would probably have been even more tedious and difficult if not for the help of Python. The source texts had to be brought to a common encoding (UTF-8) - some of them used quite ancient encodings like Mazovia or DHN - and then scripts were written to collect all lemmas and inflected forms from the source texts. In cases when the source text was not tagged, I used the SAM analyzer to produce lemmas. In cases of ambiguous lemmatization I decided to reference the inflected forms from all of their possible base forms.
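As an aside, the re-encoding step can be done in a few lines of Java. The sketch below assumes the legacy encoding is one the JRE ships with (IBM852 is used here only as a stand-in - Mazovia and DHN are not standard JDK charsets and would need a hand-written mapping table), and the file names are hypothetical.

import java.io.*;
import java.nio.charset.Charset;

/** Sketch: re-encode a legacy 8-bit text file to UTF-8 (file names are hypothetical). */
public class Recode {
  public static void main(String[] args) throws IOException {
    // IBM852 is a stand-in; Mazovia/DHN would need a custom mapping table.
    try (Reader in = new InputStreamReader(new FileInputStream("corpus-legacy.txt"),
                                           Charset.forName("IBM852"));
         Writer out = new OutputStreamWriter(new FileOutputStream("corpus-utf8.txt"),
                                             Charset.forName("UTF-8"))) {
      char[] buf = new char[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);   // copy characters, transcoding on the way
      }
    }
  }
}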
All grammatical categories were allowed to appear in the corpus, i.e. nouns, verbs, adjectives, numerals, and pronouns. The resulting corpus consisted of roughly 87,000 inflection sets, where each set consisted of one base form (lemma) and many inflected forms. However, because of the nature of the training method I restricted these sets to include only those with at least 4 inflected forms. Sets with 3 or fewer inflected forms were removed, so that the final corpus consisted of ~69,000 unique sets, which in turn contained ~1.5 million inflected forms.
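The filtering rule is trivial to express in code; a minimal sketch, assuming (purely for illustration) that the corpus is held in memory as a map from each lemma to its set of inflected forms:

import java.util.Map;
import java.util.Set;

/** Sketch: keep only inflection sets with at least 4 inflected forms. */
public class FilterInflectionSets {
  static void filterInPlace(Map<String, Set<String>> corpus) {
    // Removing from the values() view also removes the owning map entries,
    // i.e. every lemma with 3 or fewer inflected forms is dropped.
    corpus.values().removeIf(forms -> forms.size() < 4);
  }
}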
I tested the stemmer tables produced using the implementation described above. The following sections give some details about the testing setup.
The testing procedure was as follows:
The following table summarizes test results for varying sizes of training samples. The meaning of the table columns is described below:
Training sets | Testing forms | Stem OK | Lemma OK | Missing | Stem Bad | Lemma Bad | Table size [B]
--------------|---------------|---------|----------|---------|----------|-----------|---------------
100           | 1022985       | 842209  | 593632   | 172711  | 22331    | 256642    | 28438
200           | 1022985       | 862789  | 646488   | 153288  | 16306    | 223209    | 48660
500           | 1022985       | 885786  | 685009   | 130772  | 14856    | 207204    | 108798
700           | 1022985       | 909031  | 704609   | 107084  | 15442    | 211292    | 139291
1000          | 1022985       | 926079  | 725720   | 90117   | 14941    | 207148    | 183677
2000          | 1022985       | 942886  | 746641   | 73429   | 14903    | 202915    | 313516
5000          | 1022985       | 954721  | 759930   | 61476   | 14817    | 201579    | 640969
7000          | 1022985       | 956165  | 764033   | 60364   | 14620    | 198588    | 839347
10000         | 1022985       | 965427  | 775507   | 50797   | 14662    | 196681    | 1144537
12000         | 1022985       | 967664  | 782143   | 48722   | 14284    | 192120    | 1313508
15000         | 1022985       | 973188  | 788867   | 43247   | 14349    | 190871    | 1567902
17000         | 1022985       | 974203  | 791804   | 42319   | 14333    | 188862    | 1733957
20000         | 1022985       | 976234  | 791554   | 40058   | 14601    | 191373    | 1977615
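To read the table with a concrete example: with 20,000 training sets the stemmer produced a correct stem for 976234 / 1022985 ≈ 95.4% of the test forms and a correct lemma for 791554 / 1022985 ≈ 77.4% of them, at the cost of a table of roughly 1.9 MB.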
I also measured the time to produce a stem (which involves traversing a trie, retrieving a patch command and applying the patch command to the input string). On a machine running Windows XP (Pentium 4, 1.7 GHz, JDK 1.4.2_03 HotSpot), for tables ranging in size from 1,000 to 20,000 cells, the time to produce a single stem varies between 5 and 10 microseconds.
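The per-word work described above can be sketched roughly as follows; the class and method names are simplified stand-ins invented for this illustration (the real Egothor/Stempel classes and their patch-command format are richer), shown only to make the three steps - trie traversal, patch retrieval, patch application - concrete.

/**
 * Illustration only: simplified stand-ins for the real Egothor/Stempel API,
 * showing the steps involved in producing a single stem.
 */
public class StemSketch {

  /** Stand-in for the stemmer table: a trie mapping words to patch commands. */
  interface PatchTrie {
    /** Walks the trie along the characters of the word and returns the patch
     *  command stored at the deepest matching node, or null if none matches. */
    String lookup(CharSequence word);
  }

  static String stem(PatchTrie trie, String word) {
    String patch = trie.lookup(word);     // 1. traverse the trie
    if (patch == null) {
      return null;                        // unknown word - no stem produced
    }
    StringBuilder sb = new StringBuilder(word);
    applyPatch(sb, patch);                // 2. retrieve and 3. apply the patch
    return sb.length() > 0 ? sb.toString() : null;
  }

  /** Hypothetical patch format: "Dn" deletes n characters from the end,
   *  "Axyz" appends "xyz".  The real command set is more elaborate. */
  static void applyPatch(StringBuilder sb, String patch) {
    char op = patch.charAt(0);
    if (op == 'D') {
      int n = Integer.parseInt(patch.substring(1));
      sb.setLength(Math.max(0, sb.length() - n));
    } else if (op == 'A') {
      sb.append(patch.substring(1));
    }
  }
}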
This means that the stemmer can process up to 200,000 words per second, an outstanding result when compared to other stemmers (Morfeusz - ~2,000 w/s, FormAN (MS Word analyzer) - ~1,000 w/s).
The package contains a class org.getopt.stempel.Benchmark, which you can use to produce reports like the one below:
--------- Stemmer benchmark report: -----------
Stemmer table:  /res/tables/stemmer_2000.out
Input file:     ../test3.txt
Number of runs: 3

RUN NUMBER:            1        2        3
Total input words      1378176  1378176  1378176
Missed output words    112      112      112
Time elapsed [ms]      6989     6940     6640
Hit rate percent       99.99%   99.99%   99.99%
Miss rate percent      00.01%   00.01%   00.01%
Words per second       197192   198584   207557
Time per word [us]     5.07     5.04     4.82
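As a quick sanity check, the throughput rows follow directly from the totals: 1378176 input words processed in 6989 ms give 1378176 / 6.989 ≈ 197,192 words per second, i.e. about 5.07 microseconds per word, matching the figures reported above.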
The results of these tests are very encouraging. It seems that using the training corpus and the stemming algorithm described above results in a high-quality stemmer useful for most applications. Moreover, it can also be used as a better-than-average lemmatizer.
Both the author of the implementation (Leo Galambos, <leo.galambos AT egothor DOT org>) and the author of this compilation (Andrzej Bialecki <ab AT getopt DOT org>) would appreciate any feedback and suggestions for further improvements.