<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Apache Lucene ICU integration module</title>
<p>
This module exposes functionality from
<a href="http://site.icu-project.org/">ICU</a> to Apache Lucene. ICU4J is a Java
library that enhances Java's internationalization support by improving
performance, keeping current with the Unicode Standard, and providing richer
APIs. This module exposes the following functionality:
</p>
<ul>
  <li><a href="#segmentation">Text Segmentation</a>: Tokenizes text based on
  properties and rules defined in Unicode.</li>
  <li><a href="#collation">Collation</a>: Compares strings according to the
  conventions and standards of a particular language, region or country.</li>
  <li><a href="#normalization">Normalization</a>: Converts text to a unique,
  preferred form.</li>
  <li><a href="#casefolding">Case Folding</a>: Removes case distinctions with
  Unicode's Default Caseless Matching algorithm.</li>
  <li><a href="#searchfolding">Search Term Folding</a>: Removes distinctions
  (such as accent marks) between similar characters for a loose or fuzzy search.</li>
  <li><a href="#transform">Text Transformation</a>: Transforms Unicode text in
  a context-sensitive fashion: e.g. mapping Traditional to Simplified Chinese.</li>
</ul>
<h1><a name="segmentation">Text Segmentation</a></h1>
<p>
Text Segmentation (Tokenization) divides document and query text into index terms
(typically words). Unicode provides special properties and rules so that this can
be done in a manner that works well with most languages.
</p>
<p>
Text Segmentation implements the word segmentation specified in
<a href="http://unicode.org/reports/tr29/">Unicode Text Segmentation</a>.
Additionally the algorithm can be tailored based on writing system, for example
text in the Thai script is automatically delegated to a dictionary-based segmentation
algorithm.
</p>
<h2>Use Cases</h2>
<ul>
  <li>
    As a more thorough replacement for StandardTokenizer that works well for
    most languages.
  </li>
</ul>
<h2>Example Usages</h2>
<h3>Tokenizing multilanguage text</h3>
<pre class="prettyprint">
  /**
   * This tokenizer will work well in general for most languages.
   */
  Tokenizer tokenizer = new ICUTokenizer(reader);
</pre>
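<p>
ICUTokenizer itself requires the Lucene and ICU4J jars, but the boundary-based
approach it builds on can be sketched with the JDK's own
<code>java.text.BreakIterator</code>. The class and helper names below are
illustrative, not part of this module; ICU's UAX#29 implementation is
considerably richer than the JDK's.
</p>

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch of locale-aware word segmentation using only the JDK:
// walk boundary pairs and keep the spans that contain letters or digits.
public class WordSegmentationSketch {
  public static List<String> words(String text, Locale locale) {
    BreakIterator it = BreakIterator.getWordInstance(locale);
    it.setText(text);
    List<String> words = new ArrayList<>();
    int start = it.first();
    for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
      String candidate = text.substring(start, end);
      // keep only spans containing a letter or digit (skip spaces and punctuation)
      if (candidate.codePoints().anyMatch(Character::isLetterOrDigit)) {
        words.add(candidate);
      }
    }
    return words;
  }

  public static void main(String[] args) {
    System.out.println(words("Testing, one two three.", Locale.ENGLISH));
    // [Testing, one, two, three]
  }
}
```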
<h1><a name="collation">Collation</a></h1>
<p>
<code>ICUCollationKeyFilter</code>
converts each token into its binary <code>CollationKey</code> using the
provided <code>Collator</code>, and then encodes the <code>CollationKey</code>
with {@link org.apache.lucene.util.IndexableBinaryStringTools}, to allow it to be
stored as an index term.
</p>
<p>
<code>ICUCollationKeyFilter</code> depends on ICU4J 4.4 to produce the
<code>CollationKey</code>s. <code>icu4j-4.4.jar</code>
is included in Lucene's Subversion repository at <code>contrib/icu/lib/</code>.
</p>
<h2>Use Cases</h2>
<ul>
  <li>
    Efficient sorting of terms in languages that use non-Unicode character
    orderings. (Lucene Sort using a Locale can be very slow.)
  </li>
  <li>
    Efficient range queries over fields that contain terms in languages that
    use non-Unicode character orderings. (Range queries using a Locale can be
    very slow.)
  </li>
  <li>
    Effective Locale-specific normalization (case differences, diacritics, etc.).
    ({@link org.apache.lucene.analysis.LowerCaseFilter} and
    {@link org.apache.lucene.analysis.ASCIIFoldingFilter} provide these services
    in a generic way that doesn't take into account locale-specific needs.)
  </li>
</ul>
<h2>Example Usages</h2>
<h3>Farsi Range Queries</h3>
<pre class="prettyprint">
  Collator collator = Collator.getInstance(new Locale("ar"));
  ICUCollationKeyAnalyzer analyzer = new ICUCollationKeyAnalyzer(collator);
  RAMDirectory ramDir = new RAMDirectory();
  IndexWriter writer = new IndexWriter
    (ramDir, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
  Document doc = new Document();
  doc.add(new Field("content", "\u0633\u0627\u0628",
                    Field.Store.YES, Field.Index.ANALYZED));
  writer.addDocument(doc);
  writer.close();
  IndexSearcher is = new IndexSearcher(ramDir, true);

  // The AnalyzingQueryParser in Lucene's contrib allows terms in range queries
  // to be passed through an analyzer - Lucene's standard QueryParser does not
  // allow this.
  AnalyzingQueryParser aqp = new AnalyzingQueryParser("content", analyzer);
  aqp.setLowercaseExpandedTerms(false);

  // Unicode order would include U+0633 in [ U+062F - U+0698 ], but Farsi
  // orders the U+0698 character before the U+0633 character, so the single
  // indexed Term above should NOT be returned by a ConstantScoreRangeQuery
  // with a Farsi Collator (or an Arabic one for the case when Farsi is not
  // supported).
  ScoreDoc[] result
    = is.search(aqp.parse("[ \u062F TO \u0698 ]"), null, 1000).scoreDocs;
  assertEquals("The index Term should not be included.", 0, result.length);
</pre>
<h3>Danish Sorting</h3>
<pre class="prettyprint">
  Analyzer analyzer
    = new ICUCollationKeyAnalyzer(Collator.getInstance(new Locale("da", "dk")));
  RAMDirectory indexStore = new RAMDirectory();
  IndexWriter writer = new IndexWriter
    (indexStore, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
  String[] tracer = new String[] { "A", "B", "C", "D", "E" };
  String[] data = new String[] { "HAT", "HUT", "H\u00C5T", "H\u00D8T", "HOT" };
  String[] sortedTracerOrder = new String[] { "A", "E", "B", "D", "C" };
  for (int i = 0 ; i &lt; data.length ; ++i) {
    Document doc = new Document();
    doc.add(new Field("tracer", tracer[i], Field.Store.YES, Field.Index.NO));
    doc.add(new Field("contents", data[i], Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(doc);
  }
  writer.close();
  Searcher searcher = new IndexSearcher(indexStore, true);
  Sort sort = new Sort();
  sort.setSort(new SortField("contents", SortField.STRING));
  Query query = new MatchAllDocsQuery();
  ScoreDoc[] result = searcher.search(query, null, 1000, sort).scoreDocs;
  for (int i = 0 ; i &lt; result.length ; ++i) {
    Document doc = searcher.doc(result[i].doc);
    assertEquals(sortedTracerOrder[i], doc.getValues("tracer")[0]);
  }
</pre>
<h3>Turkish Case Normalization</h3>
<pre class="prettyprint">
  Collator collator = Collator.getInstance(new Locale("tr", "TR"));
  collator.setStrength(Collator.PRIMARY);
  Analyzer analyzer = new ICUCollationKeyAnalyzer(collator);
  RAMDirectory ramDir = new RAMDirectory();
  IndexWriter writer = new IndexWriter
    (ramDir, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
  Document doc = new Document();
  doc.add(new Field("contents", "DIGY", Field.Store.NO, Field.Index.ANALYZED));
  writer.addDocument(doc);
  writer.close();
  IndexSearcher is = new IndexSearcher(ramDir, true);
  QueryParser parser = new QueryParser("contents", analyzer);
  Query query = parser.parse("d\u0131gy");   // U+0131: dotless i
  ScoreDoc[] result = is.search(query, null, 1000).scoreDocs;
  assertEquals("The index Term should be included.", 1, result.length);
</pre>
<h2>Caveats and Comparisons</h2>
<p>
<strong>WARNING:</strong> Make sure you use exactly the same
<code>Collator</code> at index and query time -- <code>CollationKey</code>s
are only comparable when produced by
the same <code>Collator</code>. Since {@link java.text.RuleBasedCollator}s
are not independently versioned, it is unsafe to search against stored
<code>CollationKey</code>s unless the following are exactly the same (best
practice is to store this information with the index and check that it
remains the same at query time):
</p>
<ol>
  <li>JVM vendor</li>
  <li>JVM version, including patch version</li>
  <li>
    The language (and country and variant, if specified) of the Locale
    used when constructing the collator via
    {@link java.text.Collator#getInstance(java.util.Locale)}.
  </li>
  <li>
    The collation strength used - see {@link java.text.Collator#setStrength(int)}
  </li>
</ol>
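<p>
The strength setting matters because it changes which strings a collator
treats as equal, and therefore which keys it produces. A JDK-only sketch of
this (using <code>java.text.Collator</code>; the class and method names below
are illustrative):
</p>

```java
import java.text.Collator;
import java.util.Locale;

// Illustrates why the exact Collator configuration must match at index
// and query time: changing only the strength changes how strings compare.
public class CollatorStrengthDemo {
  public static boolean primaryEqual(String a, String b, Locale locale) {
    Collator c = Collator.getInstance(locale);
    c.setStrength(Collator.PRIMARY);   // ignore case and accent differences
    return c.compare(a, b) == 0;
  }

  public static boolean tertiaryEqual(String a, String b, Locale locale) {
    Collator c = Collator.getInstance(locale);
    c.setStrength(Collator.TERTIARY);  // the default: case differences count
    return c.compare(a, b) == 0;
  }

  public static void main(String[] args) {
    // "resume" vs "RESUME": equal at PRIMARY strength, not at TERTIARY
    System.out.println(primaryEqual("resume", "RESUME", Locale.ENGLISH));
    System.out.println(tertiaryEqual("resume", "RESUME", Locale.ENGLISH));
  }
}
```

An index built with one strength and queried with another would compare keys
produced under two incompatible equivalence rules.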
<p>
<code>ICUCollationKeyFilter</code> uses ICU4J's <code>Collator</code>, which
makes its version available, thus allowing collation to be versioned
independently from the JVM. <code>ICUCollationKeyFilter</code> is also
significantly faster and generates significantly shorter keys than
<code>CollationKeyFilter</code>. See
<a href="http://site.icu-project.org/charts/collation-icu4j-sun"
>http://site.icu-project.org/charts/collation-icu4j-sun</a> for key
generation timing and key length comparisons between ICU4J and
<code>java.text.Collator</code> over several languages.
</p>
<p>
<code>CollationKey</code>s generated by <code>java.text.Collator</code>s are
not compatible with those generated by ICU Collators. Specifically, if
you use <code>CollationKeyFilter</code> to generate index terms, do not use
<code>ICUCollationKeyFilter</code> on the query side, or vice versa.
</p>
<h1><a name="normalization">Normalization</a></h1>
<p>
<code>ICUNormalizer2Filter</code> normalizes term text to a
<a href="http://unicode.org/reports/tr15/">Unicode Normalization Form</a>, so
that <a href="http://en.wikipedia.org/wiki/Unicode_equivalence">equivalent</a>
forms are standardized to a unique form.
</p>
<h2>Use Cases</h2>
<ul>
  <li>
    Removing differences in width for Asian-language text.
  </li>
  <li>
    Standardizing complex text with non-spacing marks so that characters are
    ordered consistently.
  </li>
</ul>
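<p>
The equivalence being standardized here can be seen with the JDK's own
<code>java.text.Normalizer</code> (which implements the same standard
normalization forms as ICU): a decomposed and a precomposed "&eacute;" differ
as code-point sequences but normalize to the same NFC string. The class below
is an illustrative sketch, not part of this module.
</p>

```java
import java.text.Normalizer;

// Demonstrates Unicode canonical equivalence: "e" + U+0301 (combining
// acute accent) and the precomposed U+00E9 are the same text, and NFC
// maps both to a single unique form.
public class NfcDemo {
  public static String nfc(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFC);
  }

  public static void main(String[] args) {
    String decomposed = "e\u0301"; // 'e' + combining acute accent
    String composed = "\u00E9";    // precomposed 'é'
    System.out.println(decomposed.equals(composed));           // false: different code points
    System.out.println(nfc(decomposed).equals(nfc(composed))); // true after NFC
  }
}
```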
<h2>Example Usages</h2>
<h3>Normalizing text to NFC</h3>
<pre class="prettyprint">
  /**
   * Normalizer2 objects are unmodifiable and immutable.
   */
  Normalizer2 normalizer = Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.COMPOSE);
  /**
   * This filter will normalize to NFC.
   */
  TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, normalizer);
</pre>
<h1><a name="casefolding">Case Folding</a></h1>
<p>
Default caseless matching, or case-folding, is more than just conversion to
lowercase. For example, it handles cases such as the Greek sigma, so that
"Μάϊος" and "ΜΆΪΟΣ" will match correctly.
</p>
<p>
Case-folding is still only an approximation of the language-specific rules
governing case. If the specific language is known, consider using
ICUCollationKeyFilter and indexing collation keys instead. This implementation
performs the "full" case-folding specified in the Unicode standard, and this
may change the length of the term. For example, the German ß is case-folded
to the string 'ss'.
</p>
<p>
Case folding is related to normalization, and as such is coupled with it in
this integration. To perform case-folding, you use normalization with the form
"nfkc_cf" (which is the default).
</p>
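<p>
ICU's full case folding is not in the JDK, but both behaviors noted above --
full mappings that change a term's length, and mappings that depend on the
language -- are visible in the JDK's own locale-sensitive case mapping. The
class below is an illustrative sketch, not part of this module.
</p>

```java
import java.util.Locale;

// Two pitfalls of naive lowercasing that case folding must handle:
// length-changing full mappings and language-dependent mappings.
public class CaseMappingDemo {
  // German sharp s: the full uppercase mapping is "SS", so the length changes
  public static String upperRoot(String s) {
    return s.toUpperCase(Locale.ROOT);
  }

  // Turkish: 'I' lowercases to dotless i (U+0131), not to 'i'
  public static String lowerTurkish(String s) {
    return s.toLowerCase(new Locale("tr"));
  }

  public static void main(String[] args) {
    System.out.println(upperRoot("stra\u00DFe")); // STRASSE (7 chars from 6)
    System.out.println(lowerTurkish("I"));        // ı (U+0131)
  }
}
```

This is why a plain LowerCaseFilter can give wrong matches for Turkish text,
as in the "DIGY" collation example earlier in this document.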
<h2>Use Cases</h2>
<ul>
  <li>
    As a more thorough replacement for LowerCaseFilter that has good behavior
    for most languages.
  </li>
</ul>
<h2>Example Usages</h2>
<h3>Lowercasing text</h3>
<pre class="prettyprint">
  /**
   * This filter will case-fold and normalize to NFKC.
   */
  TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer);
</pre>
<h1><a name="searchfolding">Search Term Folding</a></h1>
<p>
Search term folding removes distinctions (such as accent marks) between
similar characters. It is useful for a fuzzy or loose search.
</p>
<p>
Search term folding implements many of the foldings specified in
<a href="http://www.unicode.org/reports/tr30/tr30-4.html">Character Foldings</a>
as a special normalization form. This folding applies NFKC, Case Folding, and
many character foldings recursively.
</p>
<h2>Use Cases</h2>
<ul>
  <li>
    As a more thorough replacement for ASCIIFoldingFilter and LowerCaseFilter
    that applies the same ideas to many more languages.
  </li>
</ul>
<h2>Example Usages</h2>
<h3>Removing accents</h3>
<pre class="prettyprint">
  /**
   * This filter will case-fold, remove accents and other distinctions, and
   * normalize to NFKC.
   */
  TokenStream tokenstream = new ICUFoldingFilter(tokenizer);
</pre>
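<p>
A rough JDK-only approximation of the accent-removal part of this folding is
to decompose to NFD and drop the combining marks that decomposition exposes.
ICUFoldingFilter applies many more foldings (width, case, and so on); the
class below is an illustrative sketch, not part of this module.
</p>

```java
import java.text.Normalizer;

// Accent stripping via canonical decomposition: NFD splits "é" into
// 'e' + U+0301, and the regex then removes the combining mark.
public class AccentStripDemo {
  public static String stripAccents(String s) {
    String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
    // \p{M} matches combining marks (the accents exposed by decomposition)
    return decomposed.replaceAll("\\p{M}", "");
  }

  public static void main(String[] args) {
    System.out.println(stripAccents("r\u00E9sum\u00E9")); // resume
    System.out.println(stripAccents("\u00FCber"));        // uber
  }
}
```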
<h1><a name="transform">Text Transformation</a></h1>
<p>
ICU provides text-transformation functionality via its Transliteration API. This allows
you to transform text in a variety of ways, taking context into account.
</p>
<p>
For more information, see the
<a href="http://userguide.icu-project.org/transforms/general">User's Guide</a>
and the
<a href="http://userguide.icu-project.org/transforms/general/rules">Rule Tutorial</a>.
</p>
<h2>Use Cases</h2>
<ul>
  <li>
    Convert Traditional to Simplified
  </li>
  <li>
    Transliterate between different writing systems: e.g. Romanization
  </li>
</ul>
<h2>Example Usages</h2>
<h3>Convert Traditional to Simplified</h3>
<pre class="prettyprint">
  /**
   * This filter will map Traditional Chinese to Simplified Chinese
   */
  TokenStream tokenstream = new ICUTransformFilter(tokenizer, Transliterator.getInstance("Traditional-Simplified"));
</pre>
<h3>Transliterate Serbian Cyrillic to Serbian Latin</h3>
<pre class="prettyprint">
  /**
   * This filter will map Serbian Cyrillic to Serbian Latin according to BGN rules
   */
  TokenStream tokenstream = new ICUTransformFilter(tokenizer, Transliterator.getInstance("Serbian-Latin/BGN"));
</pre>
<h1><a name="backcompat">Backwards Compatibility</a></h1>
<p>
This module exists to provide up-to-date Unicode functionality that supports
the most recent version of Unicode (currently 6.0). However, some users who wish
for stronger backwards compatibility can restrict
{@link org.apache.lucene.analysis.icu.ICUNormalizer2Filter} to operate on only
a specific Unicode Version by using a {@link com.ibm.icu.text.FilteredNormalizer2}.
</p>
<h2>Example Usages</h2>
<h3>Restricting normalization to Unicode 5.0</h3>
<pre class="prettyprint">
  /**
   * This filter will do NFC normalization, but will ignore any characters that
   * did not exist as of Unicode 5.0. Because of the normalization stability policy
   * of Unicode, this is an easy way to force normalization to a specific version.
   */
  Normalizer2 normalizer = Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.COMPOSE);
  UnicodeSet set = new UnicodeSet("[:age=5.0:]");
  // see FilteredNormalizer2 docs, the set should be frozen or performance will suffer
  set.freeze();
  FilteredNormalizer2 unicode50 = new FilteredNormalizer2(normalizer, set);
  TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, unicode50);
</pre>