1 <!doctype html public "-//w3c//dtd html 4.0 transitional//en">
3 Licensed to the Apache Software Foundation (ASF) under one or more
4 contributor license agreements. See the NOTICE file distributed with
5 this work for additional information regarding copyright ownership.
6 The ASF licenses this file to You under the Apache License, Version 2.0
7 (the "License"); you may not use this file except in compliance with
8 the License. You may obtain a copy of the License at
10 http://www.apache.org/licenses/LICENSE-2.0
12 Unless required by applicable law or agreed to in writing, software
13 distributed under the License is distributed on an "AS IS" BASIS,
14 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 See the License for the specific language governing permissions and
16 limitations under the License.
20 <title>CompoundWordTokenFilter</title>
21 <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"></meta>
24 A filter that decomposes compound words you find in many Germanic
25 languages into the word parts. This example shows what it does:
28 <th>Input token stream</th>
31 <td>Rindfleischüberwachungsgesetz Drahtschere abba</td>
37 <th>Output token stream</th>
40 <td>(Rindfleischüberwachungsgesetz,0,29)</td>
43 <td>(Rind,0,4,posIncr=0)</td>
46 <td>(fleisch,4,11,posIncr=0)</td>
49 <td>(überwachung,11,22,posIncr=0)</td>
52 <td>(gesetz,23,29,posIncr=0)</td>
55 <td>(Drahtschere,30,41)</td>
58 <td>(Draht,30,35,posIncr=0)</td>
61 <td>(schere,35,41,posIncr=0)</td>
68 The input token is always preserved and the filters do not alter the case of word parts. There are two variants of the
71 <li><i>HyphenationCompoundWordTokenFilter</i>: it uses a
72 hyphenation grammar based approach to find potential word parts of a
74 <li><i>DictionaryCompoundWordTokenFilter</i>: it uses a
75 brute-force dictionary-only based approach to find the word parts of a given
79 <h3>Compound word token filters</h3>
80 <h4>HyphenationCompoundWordTokenFilter</h4>
82 org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter
83 HyphenationCompoundWordTokenFilter} uses hyphenation grammars to find
84 potential subwords that are worth checking against the dictionary. It can be used
85 without a dictionary as well but then produces a lot of "nonword" tokens.
86 The quality of the output tokens is directly connected to the quality of the
87 grammar file you use. For languages like German they are quite good.
89 Unfortunately we cannot bundle the hyphenation grammar files with Lucene
90 because they do not use an ASF-compatible license (they use the LaTeX
91 Project Public License instead). You can find the XML based grammar
93 <a href="http://offo.sourceforge.net/hyphenation/index.html">Objects
94 For Formatting Objects</a>
95 (OFFO) Sourceforge project (direct link to download the pattern files:
96 <a href="http://downloads.sourceforge.net/offo/offo-hyphenation.zip">http://downloads.sourceforge.net/offo/offo-hyphenation.zip</a>
97 ). The files you need are in the subfolder
98 <i>offo-hyphenation/hyph/</i>
101 Credits for the hyphenation code go to the
102 <a href="http://xmlgraphics.apache.org/fop/">Apache FOP project</a>
105 <h4>DictionaryCompoundWordTokenFilter</h4>
107 org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter
108 DictionaryCompoundWordTokenFilter} uses a dictionary-only approach to
109 find subwords in a compound word. It is much slower than the one that
110 uses the hyphenation grammars. You can use it as a first start to
111 see if your dictionary is good or not because it is much simpler in design.
114 The output quality of both token filters is directly connected to the
115 quality of the dictionary you use. They are language dependent of course.
116 You should always use a dictionary
117 that fits the text you want to index. If you index medical text for
118 example then you should use a dictionary that contains medical words.
119 Good starting points for general text are the dictionaries you find at the
120 <a href="http://wiki.services.openoffice.org/wiki/Dictionaries">OpenOffice
124 <h3>Which variant should I use?</h3>
125 This decision matrix should help you:
128 <th>Token filter</th>
129 <th>Output quality</th>
133 <td>HyphenationCompoundWordTokenFilter</td>
134 <td>good if grammar file is good – acceptable otherwise</td>
138 <td>DictionaryCompoundWordTokenFilter</td>
145 public void testHyphenationCompoundWordsDE() throws Exception {
146 String[] dict = { "Rind", "Fleisch", "Draht", "Schere", "Gesetz",
147 "Aufgabe", "Überwachung" };
149 Reader reader = new FileReader("de_DR.xml");
151 HyphenationTree hyphenator = HyphenationCompoundWordTokenFilter
152 .getHyphenationTree(reader);
154 HyphenationCompoundWordTokenFilter tf = new HyphenationCompoundWordTokenFilter(
155 new WhitespaceTokenizer(new StringReader(
156 "Rindfleischüberwachungsgesetz Drahtschere abba")), hyphenator,
157 dict, CompoundWordTokenFilterBase.DEFAULT_MIN_WORD_SIZE,
158 CompoundWordTokenFilterBase.DEFAULT_MIN_SUBWORD_SIZE,
159 CompoundWordTokenFilterBase.DEFAULT_MAX_SUBWORD_SIZE, false);
161 CharTermAttribute t = tf.addAttribute(CharTermAttribute.class);
162 while (tf.incrementToken()) {
163 System.out.println(t);
167 public void testHyphenationCompoundWordsWithoutDictionaryDE() throws Exception {
168 Reader reader = new FileReader("de_DR.xml");
170 HyphenationTree hyphenator = HyphenationCompoundWordTokenFilter
171 .getHyphenationTree(reader);
173 HyphenationCompoundWordTokenFilter tf = new HyphenationCompoundWordTokenFilter(
174 new WhitespaceTokenizer(new StringReader(
175 "Rindfleischüberwachungsgesetz Drahtschere abba")), hyphenator);
177 CharTermAttribute t = tf.addAttribute(CharTermAttribute.class);
178 while (tf.incrementToken()) {
179 System.out.println(t);
183 public void testDumbCompoundWordsSE() throws Exception {
184 String[] dict = { "Bil", "Dörr", "Motor", "Tak", "Borr", "Slag", "Hammar",
185 "Pelar", "Glas", "Ögon", "Fodral", "Bas", "Fiol", "Makare", "Gesäll",
186 "Sko", "Vind", "Rute", "Torkare", "Blad" };
188 DictionaryCompoundWordTokenFilter tf = new DictionaryCompoundWordTokenFilter(
189 new WhitespaceTokenizer(
191 "Bildörr Bilmotor Biltak Slagborr Hammarborr Pelarborr Glasögonfodral Basfiolsfodral Basfiolsfodralmakaregesäll Skomakare Vindrutetorkare Vindrutetorkarblad abba")),
193 CharTermAttribute t = tf.addAttribute(CharTermAttribute.class);
194 while (tf.incrementToken()) {
195 System.out.println(t);