+++ /dev/null
-<!--
- Licensed to the Apache Software Foundation (ASF) under one or more
- contributor license agreements. See the NOTICE file distributed with
- this work for additional information regarding copyright ownership.
- The ASF licenses this file to You under the Apache License, Version 2.0
- (the "License"); you may not use this file except in compliance with
- the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
--->
-<html>
- <head>
- <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
- <title>
- Apache Lucene ICU integration module
- </title>
- </head>
-<body>
-<p>
-This module exposes functionality from
-<a href="http://site.icu-project.org/">ICU</a> to Apache Lucene. ICU4J is a Java
-library that enhances Java's internationalization support by improving
-performance, keeping current with the Unicode Standard, and providing richer
-APIs. This module exposes the following functionality:
-</p>
-<ul>
- <li><a href="#segmentation">Text Segmentation</a>: Tokenizes text based on
- properties and rules defined in Unicode.</li>
- <li><a href="#collation">Collation</a>: Compares strings according to the
- conventions and standards of a particular language, region or country.</li>
- <li><a href="#normalization">Normalization</a>: Converts text to a unique,
- equivalent form.</li>
- <li><a href="#casefolding">Case Folding</a>: Removes case distinctions with
- Unicode's Default Caseless Matching algorithm.</li>
- <li><a href="#searchfolding">Search Term Folding</a>: Removes distinctions
- (such as accent marks) between similar characters for a loose or fuzzy search.</li>
- <li><a href="#transform">Text Transformation</a>: Transforms Unicode text in
- a context-sensitive fashion, e.g. mapping Traditional to Simplified Chinese.</li>
-</ul>
-<hr/>
-<h1><a name="segmentation">Text Segmentation</a></h1>
-<p>
-Text Segmentation (Tokenization) divides document and query text into index terms
-(typically words). Unicode provides special properties and rules so that this can
-be done in a manner that works well with most languages.
-</p>
-<p>
-Text Segmentation implements the word segmentation specified in
-<a href="http://unicode.org/reports/tr29/">Unicode Text Segmentation</a>.
-Additionally, the algorithm can be tailored based on writing system; for example,
-text in the Thai script is automatically delegated to a dictionary-based segmentation
-algorithm.
-</p>
-<h2>Use Cases</h2>
-<ul>
- <li>
- As a more thorough replacement for StandardTokenizer that works well for
- most languages.
- </li>
-</ul>
-<h2>Example Usages</h2>
-<h3>Tokenizing multilanguage text</h3>
-<pre class="prettyprint">
- /**
- * This tokenizer will work well in general for most languages.
- */
- Tokenizer tokenizer = new ICUTokenizer(reader);
-</pre>
-<hr/>
-<h1><a name="collation">Collation</a></h1>
-<p>
- <code>ICUCollationKeyFilter</code>
- converts each token into its binary <code>CollationKey</code> using the
- provided <code>Collator</code>, and then encodes the <code>CollationKey</code>
- as a String using
- {@link org.apache.lucene.util.IndexableBinaryStringTools}, to allow it to be
- stored as an index term.
-</p>
-<p>
- <code>ICUCollationKeyFilter</code> depends on ICU4J 4.4 to produce the
- <code>CollationKey</code>s. <code>icu4j-4.4.jar</code>
- is included in Lucene's Subversion repository at <code>contrib/icu/lib/</code>.
-</p>
-
-<h2>Use Cases</h2>
-
-<ul>
- <li>
- Efficient sorting of terms in languages that use non-Unicode character
- orderings. (Lucene Sort using a Locale can be very slow.)
- </li>
- <li>
- Efficient range queries over fields that contain terms in languages that
- use non-Unicode character orderings. (Range queries using a Locale can be
- very slow.)
- </li>
- <li>
- Effective Locale-specific normalization (case differences, diacritics, etc.).
- ({@link org.apache.lucene.analysis.LowerCaseFilter} and
- {@link org.apache.lucene.analysis.ASCIIFoldingFilter} provide these services
- in a generic way that doesn't take into account locale-specific needs.)
- </li>
-</ul>
-
-<h2>Example Usages</h2>
-
-<h3>Farsi Range Queries</h3>
-<pre class="prettyprint">
- Collator collator = Collator.getInstance(new Locale("ar"));
- ICUCollationKeyAnalyzer analyzer = new ICUCollationKeyAnalyzer(collator);
- RAMDirectory ramDir = new RAMDirectory();
- IndexWriter writer = new IndexWriter
- (ramDir, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
- Document doc = new Document();
- doc.add(new Field("content", "\u0633\u0627\u0628",
- Field.Store.YES, Field.Index.ANALYZED));
- writer.addDocument(doc);
- writer.close();
- IndexSearcher is = new IndexSearcher(ramDir, true);
-
- // The AnalyzingQueryParser in Lucene's contrib allows terms in range queries
- // to be passed through an analyzer - Lucene's standard QueryParser does not
- // allow this.
- AnalyzingQueryParser aqp = new AnalyzingQueryParser("content", analyzer);
- aqp.setLowercaseExpandedTerms(false);
-
- // Unicode order would include U+0633 in [ U+062F - U+0698 ], but Farsi
- // orders the U+0698 character before the U+0633 character, so the single
- // indexed Term above should NOT be returned by a ConstantScoreRangeQuery
- // with a Farsi Collator (or an Arabic one for the case when Farsi is not
- // supported).
- ScoreDoc[] result
- = is.search(aqp.parse("[ \u062F TO \u0698 ]"), null, 1000).scoreDocs;
- assertEquals("The index Term should not be included.", 0, result.length);
-</pre>
-
-<h3>Danish Sorting</h3>
-<pre class="prettyprint">
- Analyzer analyzer
- = new ICUCollationKeyAnalyzer(Collator.getInstance(new Locale("da", "dk")));
- RAMDirectory indexStore = new RAMDirectory();
- IndexWriter writer = new IndexWriter
- (indexStore, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
- String[] tracer = new String[] { "A", "B", "C", "D", "E" };
- String[] data = new String[] { "HAT", "HUT", "H\u00C5T", "H\u00D8T", "HOT" };
- String[] sortedTracerOrder = new String[] { "A", "E", "B", "D", "C" };
- for (int i = 0 ; i < data.length ; ++i) {
- Document doc = new Document();
- doc.add(new Field("tracer", tracer[i], Field.Store.YES, Field.Index.NO));
- doc.add(new Field("contents", data[i], Field.Store.NO, Field.Index.ANALYZED));
- writer.addDocument(doc);
- }
- writer.close();
- Searcher searcher = new IndexSearcher(indexStore, true);
- Sort sort = new Sort();
- sort.setSort(new SortField("contents", SortField.STRING));
- Query query = new MatchAllDocsQuery();
- ScoreDoc[] result = searcher.search(query, null, 1000, sort).scoreDocs;
- for (int i = 0 ; i < result.length ; ++i) {
- Document doc = searcher.doc(result[i].doc);
- assertEquals(sortedTracerOrder[i], doc.getValues("tracer")[0]);
- }
-</pre>
-
-<h3>Turkish Case Normalization</h3>
-<pre class="prettyprint">
- Collator collator = Collator.getInstance(new Locale("tr", "TR"));
- collator.setStrength(Collator.PRIMARY);
- Analyzer analyzer = new ICUCollationKeyAnalyzer(collator);
- RAMDirectory ramDir = new RAMDirectory();
- IndexWriter writer = new IndexWriter
- (ramDir, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
- Document doc = new Document();
- doc.add(new Field("contents", "DIGY", Field.Store.NO, Field.Index.ANALYZED));
- writer.addDocument(doc);
- writer.close();
- IndexSearcher is = new IndexSearcher(ramDir, true);
- QueryParser parser = new QueryParser("contents", analyzer);
- Query query = parser.parse("d\u0131gy"); // U+0131: dotless i
- ScoreDoc[] result = is.search(query, null, 1000).scoreDocs;
- assertEquals("The index Term should be included.", 1, result.length);
-</pre>
-
-<h2>Caveats and Comparisons</h2>
-<p>
- <strong>WARNING:</strong> Make sure you use exactly the same
- <code>Collator</code> at index and query time -- <code>CollationKey</code>s
- are only comparable when produced by
- the same <code>Collator</code>. Since {@link java.text.RuleBasedCollator}s
- are not independently versioned, it is unsafe to search against stored
- <code>CollationKey</code>s unless the following are exactly the same (best
- practice is to store this information with the index and check that they
- remain the same at query time):
-</p>
-<ol>
- <li>JVM vendor</li>
- <li>JVM version, including patch version</li>
- <li>
- The language (and country and variant, if specified) of the Locale
- used when constructing the collator via
- {@link java.text.Collator#getInstance(java.util.Locale)}.
- </li>
- <li>
- The collation strength used - see {@link java.text.Collator#setStrength(int)}
- </li>
-</ol>
-<p>
- <code>ICUCollationKeyFilter</code> uses ICU4J's <code>Collator</code>, which
- makes its version available, thus allowing collation to be versioned
- independently from the JVM. <code>ICUCollationKeyFilter</code> is also
- significantly faster and generates significantly shorter keys than
- <code>CollationKeyFilter</code>. See
- <a href="http://site.icu-project.org/charts/collation-icu4j-sun"
- >http://site.icu-project.org/charts/collation-icu4j-sun</a> for key
- generation timing and key length comparisons between ICU4J and
- <code>java.text.Collator</code> over several languages.
-</p>
-<p>
- <code>CollationKey</code>s generated by <code>java.text.Collator</code>s are
- not compatible with those generated by ICU Collators. Specifically, if
- you use <code>CollationKeyFilter</code> to generate index terms, do not use
- <code>ICUCollationKeyFilter</code> on the query side, or vice versa.
-</p>
-<hr/>
-<h1><a name="normalization">Normalization</a></h1>
-<p>
- <code>ICUNormalizer2Filter</code> normalizes term text to a
- <a href="http://unicode.org/reports/tr15/">Unicode Normalization Form</a>, so
- that <a href="http://en.wikipedia.org/wiki/Unicode_equivalence">equivalent</a>
- forms are standardized to a unique form.
-</p>
-<h2>Use Cases</h2>
-<ul>
- <li> Removing differences in width for Asian-language text.
- </li>
- <li> Standardizing complex text with non-spacing marks so that characters are
- ordered consistently.
- </li>
-</ul>
-<h2>Example Usages</h2>
-<h3>Normalizing text to NFC</h3>
-<pre class="prettyprint">
- /**
- * Normalizer2 objects are immutable.
- */
- Normalizer2 normalizer = Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.COMPOSE);
- /**
- * This filter will normalize to NFC.
- */
- TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, normalizer);
-</pre>
-<hr/>
-<h1><a name="casefolding">Case Folding</a></h1>
-<p>
-Default caseless matching, or case-folding, is more than just conversion to
-lowercase. For example, it handles cases such as the Greek sigma, so that
-"Μάϊος" and "ΜΆΪΟΣ" will match correctly.
-</p>
-<p>
-Case-folding is still only an approximation of the language-specific rules
-governing case. If the specific language is known, consider using
-ICUCollationKeyFilter and indexing collation keys instead. This implementation
-performs the "full" case-folding specified in the Unicode standard, and this
-may change the length of the term. For example, the German ß is case-folded
-to the string 'ss'.
-</p>
-<p>
-Case folding is related to normalization, and as such is coupled with it in
-this integration. To perform case-folding, you use normalization with the form
-"nfkc_cf" (which is the default).
-</p>
-<h2>Use Cases</h2>
-<ul>
- <li>
- As a more thorough replacement for LowerCaseFilter that has good behavior
- for most languages.
- </li>
-</ul>
-<h2>Example Usages</h2>
-<h3>Lowercasing text</h3>
-<pre class="prettyprint">
- /**
- * This filter will case-fold and normalize to NFKC.
- */
- TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer);
-</pre>
-<hr/>
-<h1><a name="searchfolding">Search Term Folding</a></h1>
-<p>
-Search term folding removes distinctions (such as accent marks) between
-similar characters. It is useful for a fuzzy or loose search.
-</p>
-<p>
-Search term folding implements many of the foldings specified in
-<a href="http://www.unicode.org/reports/tr30/tr30-4.html">Character Foldings</a>
-as a special normalization form. This folding applies NFKC, Case Folding, and
-many character foldings recursively.
-</p>
-<h2>Use Cases</h2>
-<ul>
- <li>
- As a more thorough replacement for ASCIIFoldingFilter and LowerCaseFilter
- that applies the same ideas to many more languages.
- </li>
-</ul>
-<h2>Example Usages</h2>
-<h3>Removing accents</h3>
-<pre class="prettyprint">
- /**
- * This filter will case-fold, remove accents and other distinctions, and
- * normalize to NFKC.
- */
- TokenStream tokenstream = new ICUFoldingFilter(tokenizer);
-</pre>
-<hr/>
-<h1><a name="transform">Text Transformation</a></h1>
-<p>
-ICU provides text-transformation functionality via its Transliteration API. This allows
-you to transform text in a variety of ways, taking context into account.
-</p>
-<p>
-For more information, see the
-<a href="http://userguide.icu-project.org/transforms/general">User's Guide</a>
-and
-<a href="http://userguide.icu-project.org/transforms/general/rules">Rule Tutorial</a>.
-</p>
-<h2>Use Cases</h2>
-<ul>
- <li>
- Convert Traditional Chinese to Simplified Chinese
- </li>
- <li>
- Transliterate between different writing systems: e.g. Romanization
- </li>
-</ul>
-<h2>Example Usages</h2>
-<h3>Convert Traditional Chinese to Simplified Chinese</h3>
-<pre class="prettyprint">
- /**
- * This filter will map Traditional Chinese to Simplified Chinese
- */
- TokenStream tokenstream = new ICUTransformFilter(tokenizer, Transliterator.getInstance("Traditional-Simplified"));
-</pre>
-<h3>Transliterate Serbian Cyrillic to Serbian Latin</h3>
-<pre class="prettyprint">
- /**
- * This filter will map Serbian Cyrillic to Serbian Latin according to BGN rules
- */
- TokenStream tokenstream = new ICUTransformFilter(tokenizer, Transliterator.getInstance("Serbian-Latin/BGN"));
-</pre>
-<hr/>
-<h1><a name="backcompat">Backwards Compatibility</a></h1>
-<p>
-This module exists to provide up-to-date Unicode functionality that supports
-the most recent version of Unicode (currently 6.0). However, users who need
-stronger backwards compatibility can restrict
-{@link org.apache.lucene.analysis.icu.ICUNormalizer2Filter} to operate on only
-a specific Unicode Version by using a {@link com.ibm.icu.text.FilteredNormalizer2}.
-</p>
-<h2>Example Usages</h2>
-<h3>Restricting normalization to Unicode 5.0</h3>
-<pre class="prettyprint">
- /**
- * This filter will do NFC normalization, but will ignore any characters that
- * did not exist as of Unicode 5.0. Because of the normalization stability policy
- * of Unicode, this is an easy way to force normalization to a specific version.
- */
- Normalizer2 normalizer = Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.COMPOSE);
- UnicodeSet set = new UnicodeSet("[:age=5.0:]");
- // see FilteredNormalizer2 docs, the set should be frozen or performance will suffer
- set.freeze();
- FilteredNormalizer2 unicode50 = new FilteredNormalizer2(normalizer, set);
- TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, unicode50);
-</pre>
-</body>
-</html>