<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Apache Lucene ICU integration module</title>
<p>
This module exposes functionality from
<a href="http://site.icu-project.org/">ICU</a> to Apache Lucene. ICU4J is a Java
library that enhances Java's internationalization support by improving
performance, keeping current with the Unicode Standard, and providing richer
APIs. This module exposes the following functionality:
</p>
<ul>
  <li><a href="#segmentation">Text Segmentation</a>: Tokenizes text based on
  properties and rules defined in Unicode.</li>
  <li><a href="#collation">Collation</a>: Compares strings according to the
  conventions and standards of a particular language, region or country.</li>
  <li><a href="#normalization">Normalization</a>: Converts text to a unique,
  preferred form.</li>
  <li><a href="#casefolding">Case Folding</a>: Removes case distinctions with
  Unicode's Default Caseless Matching algorithm.</li>
  <li><a href="#searchfolding">Search Term Folding</a>: Removes distinctions
  (such as accent marks) between similar characters for a loose or fuzzy search.</li>
  <li><a href="#transform">Text Transformation</a>: Transforms Unicode text in
  a context-sensitive fashion: e.g. mapping Traditional to Simplified Chinese.</li>
</ul>
<h1><a name="segmentation">Text Segmentation</a></h1>
<p>
Text Segmentation (Tokenization) divides document and query text into index terms
(typically words). Unicode provides special properties and rules so that this can
be done in a manner that works well with most languages.
</p>
<p>
Text Segmentation implements the word segmentation specified in
<a href="http://unicode.org/reports/tr29/">Unicode Text Segmentation</a>.
Additionally the algorithm can be tailored based on writing system, for example
text in the Thai script is automatically delegated to a dictionary-based segmentation
algorithm.
</p>
<h2>Use Cases</h2>
<ul>
  <li>
    As a more thorough replacement for StandardTokenizer that works well for
    most languages.
  </li>
</ul>
<h2>Example Usages</h2>
<h3>Tokenizing multilanguage text</h3>
<pre class="prettyprint">
  /**
   * This tokenizer will work well in general for most languages.
   */
  Tokenizer tokenizer = new ICUTokenizer(reader);
</pre>
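<p>
ICUTokenizer itself requires the Lucene and ICU4J jars, but the boundary-based
approach it builds on can be sketched with the JDK's own
<code>java.text.BreakIterator</code>. The class and helper names below are
illustrative, not part of this module; ICU's UAX#29 implementation is
considerably richer than the JDK's.
</p>

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch of locale-aware word segmentation using only the JDK:
// walk boundary pairs and keep the spans that contain letters or digits.
public class WordSegmentationSketch {
  public static List<String> words(String text, Locale locale) {
    BreakIterator it = BreakIterator.getWordInstance(locale);
    it.setText(text);
    List<String> words = new ArrayList<>();
    int start = it.first();
    for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
      String candidate = text.substring(start, end);
      // keep only spans containing a letter or digit (skip spaces and punctuation)
      if (candidate.codePoints().anyMatch(Character::isLetterOrDigit)) {
        words.add(candidate);
      }
    }
    return words;
  }

  public static void main(String[] args) {
    System.out.println(words("Testing, one two three.", Locale.ENGLISH));
    // [Testing, one, two, three]
  }
}
```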
<h1><a name="collation">Collation</a></h1>
<p>
<code>ICUCollationKeyFilter</code>
converts each token into its binary <code>CollationKey</code> using the
provided <code>Collator</code>, and then encodes the <code>CollationKey</code>
with {@link org.apache.lucene.util.IndexableBinaryStringTools}, to allow it to be
stored as an index term.
</p>
<p>
<code>ICUCollationKeyFilter</code> depends on ICU4J 4.4 to produce the
<code>CollationKey</code>s. <code>icu4j-4.4.jar</code>
is included in Lucene's Subversion repository at <code>contrib/icu/lib/</code>.
</p>
<h2>Use Cases</h2>
<ul>
  <li>
    Efficient sorting of terms in languages that use non-Unicode character
    orderings. (Lucene Sort using a Locale can be very slow.)
  </li>
  <li>
    Efficient range queries over fields that contain terms in languages that
    use non-Unicode character orderings. (Range queries using a Locale can be
    very slow.)
  </li>
  <li>
    Effective Locale-specific normalization (case differences, diacritics, etc.).
    ({@link org.apache.lucene.analysis.LowerCaseFilter} and
    {@link org.apache.lucene.analysis.ASCIIFoldingFilter} provide these services
    in a generic way that doesn't take into account locale-specific needs.)
  </li>
</ul>
<h2>Example Usages</h2>
<h3>Farsi Range Queries</h3>
<pre class="prettyprint">
  Collator collator = Collator.getInstance(new Locale("ar"));
  ICUCollationKeyAnalyzer analyzer = new ICUCollationKeyAnalyzer(collator);
  RAMDirectory ramDir = new RAMDirectory();
  IndexWriter writer = new IndexWriter
    (ramDir, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
  Document doc = new Document();
  doc.add(new Field("content", "\u0633\u0627\u0628",
                    Field.Store.YES, Field.Index.ANALYZED));
  writer.addDocument(doc);
  writer.close();
  IndexSearcher is = new IndexSearcher(ramDir, true);

  // The AnalyzingQueryParser in Lucene's contrib allows terms in range queries
  // to be passed through an analyzer - Lucene's standard QueryParser does not
  // allow this.
  AnalyzingQueryParser aqp = new AnalyzingQueryParser("content", analyzer);
  aqp.setLowercaseExpandedTerms(false);

  // Unicode order would include U+0633 in [ U+062F - U+0698 ], but Farsi
  // orders the U+0698 character before the U+0633 character, so the single
  // indexed Term above should NOT be returned by a ConstantScoreRangeQuery
  // with a Farsi Collator (or an Arabic one for the case when Farsi is not
  // supported).
  ScoreDoc[] result
    = is.search(aqp.parse("[ \u062F TO \u0698 ]"), null, 1000).scoreDocs;
  assertEquals("The index Term should not be included.", 0, result.length);
</pre>
<h3>Danish Sorting</h3>
<pre class="prettyprint">
  Analyzer analyzer
    = new ICUCollationKeyAnalyzer(Collator.getInstance(new Locale("da", "dk")));
  RAMDirectory indexStore = new RAMDirectory();
  IndexWriter writer = new IndexWriter
    (indexStore, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
  String[] tracer = new String[] { "A", "B", "C", "D", "E" };
  String[] data = new String[] { "HAT", "HUT", "H\u00C5T", "H\u00D8T", "HOT" };
  String[] sortedTracerOrder = new String[] { "A", "E", "B", "D", "C" };
  for (int i = 0 ; i &lt; data.length ; ++i) {
    Document doc = new Document();
    doc.add(new Field("tracer", tracer[i], Field.Store.YES, Field.Index.NO));
    doc.add(new Field("contents", data[i], Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(doc);
  }
  writer.close();
  Searcher searcher = new IndexSearcher(indexStore, true);
  Sort sort = new Sort();
  sort.setSort(new SortField("contents", SortField.STRING));
  Query query = new MatchAllDocsQuery();
  ScoreDoc[] result = searcher.search(query, null, 1000, sort).scoreDocs;
  for (int i = 0 ; i &lt; result.length ; ++i) {
    Document doc = searcher.doc(result[i].doc);
    assertEquals(sortedTracerOrder[i], doc.getValues("tracer")[0]);
  }
</pre>
<h3>Turkish Case Normalization</h3>
<pre class="prettyprint">
  Collator collator = Collator.getInstance(new Locale("tr", "TR"));
  collator.setStrength(Collator.PRIMARY);
  Analyzer analyzer = new ICUCollationKeyAnalyzer(collator);
  RAMDirectory ramDir = new RAMDirectory();
  IndexWriter writer = new IndexWriter
    (ramDir, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
  Document doc = new Document();
  doc.add(new Field("contents", "DIGY", Field.Store.NO, Field.Index.ANALYZED));
  writer.addDocument(doc);
  writer.close();
  IndexSearcher is = new IndexSearcher(ramDir, true);
  QueryParser parser = new QueryParser("contents", analyzer);
  Query query = parser.parse("d\u0131gy");   // U+0131: dotless i
  ScoreDoc[] result = is.search(query, null, 1000).scoreDocs;
  assertEquals("The index Term should be included.", 1, result.length);
</pre>
<h2>Caveats and Comparisons</h2>
<p>
<strong>WARNING:</strong> Make sure you use exactly the same
<code>Collator</code> at index and query time -- <code>CollationKey</code>s
are only comparable when produced by
the same <code>Collator</code>. Since {@link java.text.RuleBasedCollator}s
are not independently versioned, it is unsafe to search against stored
<code>CollationKey</code>s unless the following are exactly the same (best
practice is to store this information with the index and check that it
remains the same at query time):
</p>
<ol>
  <li>JVM vendor</li>
  <li>JVM version, including patch version</li>
  <li>
    The language (and country and variant, if specified) of the Locale
    used when constructing the collator via
    {@link java.text.Collator#getInstance(java.util.Locale)}.
  </li>
  <li>
    The collation strength used - see {@link java.text.Collator#setStrength(int)}
  </li>
</ol>
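<p>
The strength setting matters because it changes which strings a collator
treats as equal, and therefore which keys it produces. A JDK-only sketch of
this (using <code>java.text.Collator</code>; the class and method names below
are illustrative):
</p>

```java
import java.text.Collator;
import java.util.Locale;

// Illustrates why the exact Collator configuration must match at index
// and query time: changing only the strength changes how strings compare.
public class CollatorStrengthDemo {
  public static boolean primaryEqual(String a, String b, Locale locale) {
    Collator c = Collator.getInstance(locale);
    c.setStrength(Collator.PRIMARY);   // ignore case and accent differences
    return c.compare(a, b) == 0;
  }

  public static boolean tertiaryEqual(String a, String b, Locale locale) {
    Collator c = Collator.getInstance(locale);
    c.setStrength(Collator.TERTIARY);  // the default: case differences count
    return c.compare(a, b) == 0;
  }

  public static void main(String[] args) {
    // "resume" vs "RESUME": equal at PRIMARY strength, not at TERTIARY
    System.out.println(primaryEqual("resume", "RESUME", Locale.ENGLISH));
    System.out.println(tertiaryEqual("resume", "RESUME", Locale.ENGLISH));
  }
}
```

An index built with one strength and queried with another would compare keys
produced under two incompatible equivalence rules.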
<p>
<code>ICUCollationKeyFilter</code> uses ICU4J's <code>Collator</code>, which
makes its version available, thus allowing collation to be versioned
independently from the JVM. <code>ICUCollationKeyFilter</code> is also
significantly faster and generates significantly shorter keys than
<code>CollationKeyFilter</code>. See
<a href="http://site.icu-project.org/charts/collation-icu4j-sun"
>http://site.icu-project.org/charts/collation-icu4j-sun</a> for key
generation timing and key length comparisons between ICU4J and
<code>java.text.Collator</code> over several languages.
</p>
<p>
<code>CollationKey</code>s generated by <code>java.text.Collator</code>s are
not compatible with those generated by ICU Collators. Specifically, if
you use <code>CollationKeyFilter</code> to generate index terms, do not use
<code>ICUCollationKeyFilter</code> on the query side, or vice versa.
</p>
<h1><a name="normalization">Normalization</a></h1>
<p>
<code>ICUNormalizer2Filter</code> normalizes term text to a
<a href="http://unicode.org/reports/tr15/">Unicode Normalization Form</a>, so
that <a href="http://en.wikipedia.org/wiki/Unicode_equivalence">equivalent</a>
forms are standardized to a unique form.
</p>
<h2>Use Cases</h2>
<ul>
  <li>
    Removing differences in width for Asian-language text.
  </li>
  <li>
    Standardizing complex text with non-spacing marks so that characters are
    ordered consistently.
  </li>
</ul>
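<p>
The equivalence being standardized here can be seen with the JDK's own
<code>java.text.Normalizer</code> (which implements the same standard
normalization forms as ICU): a decomposed and a precomposed "&eacute;" differ
as code-point sequences but normalize to the same NFC string. The class below
is an illustrative sketch, not part of this module.
</p>

```java
import java.text.Normalizer;

// Demonstrates Unicode canonical equivalence: "e" + U+0301 (combining
// acute accent) and the precomposed U+00E9 are the same text, and NFC
// maps both to a single unique form.
public class NfcDemo {
  public static String nfc(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFC);
  }

  public static void main(String[] args) {
    String decomposed = "e\u0301"; // 'e' + combining acute accent
    String composed = "\u00E9";    // precomposed 'é'
    System.out.println(decomposed.equals(composed));           // false: different code points
    System.out.println(nfc(decomposed).equals(nfc(composed))); // true after NFC
  }
}
```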
<h2>Example Usages</h2>
<h3>Normalizing text to NFC</h3>
<pre class="prettyprint">
  /**
   * Normalizer2 objects are unmodifiable and immutable.
   */
  Normalizer2 normalizer = Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.COMPOSE);
  /**
   * This filter will normalize to NFC.
   */
  TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, normalizer);
</pre>
<h1><a name="casefolding">Case Folding</a></h1>
<p>
Default caseless matching, or case-folding, is more than just conversion to
lowercase. For example, it handles cases such as the Greek sigma, so that
"Μάϊος" and "ΜΆΪΟΣ" will match correctly.
</p>
<p>
Case-folding is still only an approximation of the language-specific rules
governing case. If the specific language is known, consider using
ICUCollationKeyFilter and indexing collation keys instead. This implementation
performs the "full" case-folding specified in the Unicode standard, and this
may change the length of the term. For example, the German ß is case-folded
to the string 'ss'.
</p>
<p>
Case folding is related to normalization, and as such is coupled with it in
this integration. To perform case-folding, you use normalization with the form
"nfkc_cf" (which is the default).
</p>
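<p>
ICU's full case folding is not in the JDK, but both behaviors noted above --
full mappings that change a term's length, and mappings that depend on the
language -- are visible in the JDK's own locale-sensitive case mapping. The
class below is an illustrative sketch, not part of this module.
</p>

```java
import java.util.Locale;

// Two pitfalls of naive lowercasing that case folding must handle:
// length-changing full mappings and language-dependent mappings.
public class CaseMappingDemo {
  // German sharp s: the full uppercase mapping is "SS", so the length changes
  public static String upperRoot(String s) {
    return s.toUpperCase(Locale.ROOT);
  }

  // Turkish: 'I' lowercases to dotless i (U+0131), not to 'i'
  public static String lowerTurkish(String s) {
    return s.toLowerCase(new Locale("tr"));
  }

  public static void main(String[] args) {
    System.out.println(upperRoot("stra\u00DFe")); // STRASSE (7 chars from 6)
    System.out.println(lowerTurkish("I"));        // ı (U+0131)
  }
}
```

This is why a plain LowerCaseFilter can give wrong matches for Turkish text,
as in the "DIGY" collation example earlier in this document.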
<h2>Use Cases</h2>
<ul>
  <li>
    As a more thorough replacement for LowerCaseFilter that has good behavior
    for most languages.
  </li>
</ul>
<h2>Example Usages</h2>
<h3>Lowercasing text</h3>
<pre class="prettyprint">
  /**
   * This filter will case-fold and normalize to NFKC.
   */
  TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer);
</pre>
<h1><a name="searchfolding">Search Term Folding</a></h1>
<p>
Search term folding removes distinctions (such as accent marks) between
similar characters. It is useful for a fuzzy or loose search.
</p>
<p>
Search term folding implements many of the foldings specified in
<a href="http://www.unicode.org/reports/tr30/tr30-4.html">Character Foldings</a>
as a special normalization form. This folding applies NFKC, Case Folding, and
many character foldings recursively.
</p>
<h2>Use Cases</h2>
<ul>
  <li>
    As a more thorough replacement for ASCIIFoldingFilter and LowerCaseFilter
    that applies the same ideas to many more languages.
  </li>
</ul>
<h2>Example Usages</h2>
<h3>Removing accents</h3>
<pre class="prettyprint">
  /**
   * This filter will case-fold, remove accents and other distinctions, and
   * normalize to NFKC.
   */
  TokenStream tokenstream = new ICUFoldingFilter(tokenizer);
</pre>
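<p>
A rough JDK-only approximation of the accent-removal part of this folding is
to decompose to NFD and drop the combining marks that decomposition exposes.
ICUFoldingFilter applies many more foldings (width, case, and so on); the
class below is an illustrative sketch, not part of this module.
</p>

```java
import java.text.Normalizer;

// Accent stripping via canonical decomposition: NFD splits "é" into
// 'e' + U+0301, and the regex then removes the combining mark.
public class AccentStripDemo {
  public static String stripAccents(String s) {
    String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
    // \p{M} matches combining marks (the accents exposed by decomposition)
    return decomposed.replaceAll("\\p{M}", "");
  }

  public static void main(String[] args) {
    System.out.println(stripAccents("r\u00E9sum\u00E9")); // resume
    System.out.println(stripAccents("\u00FCber"));        // uber
  }
}
```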
<h1><a name="transform">Text Transformation</a></h1>
<p>
ICU provides text-transformation functionality via its Transliteration API. This allows
you to transform text in a variety of ways, taking context into account.
</p>
<p>
For more information, see the
<a href="http://userguide.icu-project.org/transforms/general">User's Guide</a>
and the
<a href="http://userguide.icu-project.org/transforms/general/rules">Rule Tutorial</a>.
</p>
<h2>Use Cases</h2>
<ul>
  <li>
    Convert Traditional to Simplified
  </li>
  <li>
    Transliterate between different writing systems: e.g. Romanization
  </li>
</ul>
<h2>Example Usages</h2>
<h3>Convert Traditional to Simplified</h3>
<pre class="prettyprint">
  /**
   * This filter will map Traditional Chinese to Simplified Chinese
   */
  TokenStream tokenstream = new ICUTransformFilter(tokenizer, Transliterator.getInstance("Traditional-Simplified"));
</pre>
<h3>Transliterate Serbian Cyrillic to Serbian Latin</h3>
<pre class="prettyprint">
  /**
   * This filter will map Serbian Cyrillic to Serbian Latin according to BGN rules
   */
  TokenStream tokenstream = new ICUTransformFilter(tokenizer, Transliterator.getInstance("Serbian-Latin/BGN"));
</pre>
<h1><a name="backcompat">Backwards Compatibility</a></h1>
<p>
This module exists to provide up-to-date Unicode functionality that supports
the most recent version of Unicode (currently 6.0). However, some users who wish
for stronger backwards compatibility can restrict
{@link org.apache.lucene.analysis.icu.ICUNormalizer2Filter} to operate on only
a specific Unicode Version by using a {@link com.ibm.icu.text.FilteredNormalizer2}.
</p>
<h2>Example Usages</h2>
<h3>Restricting normalization to Unicode 5.0</h3>
<pre class="prettyprint">
  /**
   * This filter will do NFC normalization, but will ignore any characters that
   * did not exist as of Unicode 5.0. Because of the normalization stability policy
   * of Unicode, this is an easy way to force normalization to a specific version.
   */
  Normalizer2 normalizer = Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.COMPOSE);
  UnicodeSet set = new UnicodeSet("[:age=5.0:]");
  // see FilteredNormalizer2 docs, the set should be frozen or performance will suffer
  set.freeze();
  FilteredNormalizer2 unicode50 = new FilteredNormalizer2(normalizer, set);
  TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, unicode50);
</pre>