<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<title>Lucene Collation Package</title>
<p>
  <code>CollationKeyFilter</code>
  converts each token into its binary <code>CollationKey</code> using the
  provided <code>Collator</code>, and then encodes the <code>CollationKey</code>
  as a String using
  {@link org.apache.lucene.util.IndexableBinaryStringTools}, to allow it to be
  stored as an index term.
</p>
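<p>
  The idea behind the filter can be sketched with the JDK alone: a
  <code>Collator</code> maps a string to a <code>CollationKey</code> whose byte
  array, compared bytewise as unsigned values, orders the same way as the
  <code>Collator</code> itself. This is a minimal sketch, not Lucene code. The
  names <code>CollationKeySketch</code>, <code>keyBytes</code> and
  <code>compareUnsigned</code> are hypothetical helpers introduced for
  illustration, and the step that encodes the bytes as an indexable String is
  left out.
</p>

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

public class CollationKeySketch {

    // Hypothetical helper: the binary sort key the Collator assigns to a term.
    static byte[] keyBytes(Collator collator, String term) {
        CollationKey key = collator.getCollationKey(term);
        return key.toByteArray();
    }

    // Lexicographic comparison that treats each byte as unsigned (0..255).
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        String[] terms = { "apple", "Banana", "cherry" };
        for (int i = 0; i + 1 < terms.length; i++) {
            // Both comparisons should agree in sign for every pair.
            int byCollator = Integer.signum(collator.compare(terms[i], terms[i + 1]));
            int byKeyBytes = Integer.signum(compareUnsigned(
                keyBytes(collator, terms[i]), keyBytes(collator, terms[i + 1])));
            System.out.println(terms[i] + " vs " + terms[i + 1]
                + ": collator=" + byCollator + ", keyBytes=" + byKeyBytes);
        }
    }
}
```

<p>
  Because the ordering lives entirely in the key bytes, terms derived from
  them can be sorted and range-compared without consulting the
  <code>Collator</code> again at search time.
</p>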
<h2>Use Cases</h2>

<ul>
  <li>
    Efficient sorting of terms in languages whose collation order differs
    from Unicode code point order. (Lucene Sort using a Locale can be very slow.)
  </li>
  <li>
    Efficient range queries over fields that contain terms in languages whose
    collation order differs from Unicode code point order. (Range queries
    using a Locale can be very slow.)
  </li>
  <li>
    Effective Locale-specific normalization (case differences, diacritics, etc.).
    ({@link org.apache.lucene.analysis.LowerCaseFilter} and
    {@link org.apache.lucene.analysis.ASCIIFoldingFilter} provide these services
    in a generic way that doesn't take into account locale-specific needs.)
  </li>
</ul>
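<p>
  The normalization use case rests on collation strength. As a rough
  JDK-only sketch, two terms that differ only in case produce identical key
  bytes at <code>Collator.PRIMARY</code> strength and different key bytes at
  <code>Collator.TERTIARY</code> strength. The names
  <code>StrengthSketch</code> and <code>sameKey</code> are hypothetical and
  are not part of Lucene.
</p>

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class StrengthSketch {

    // Hypothetical helper: true when both terms map to identical collation
    // key bytes, i.e. they would end up as the same index term.
    static boolean sameKey(Collator collator, String a, String b) {
        return Arrays.equals(
            collator.getCollationKey(a).toByteArray(),
            collator.getCollationKey(b).toByteArray());
    }

    public static void main(String[] args) {
        // Collator.getInstance returns an independent instance on each call,
        // so the two strength settings do not interfere.
        Collator primary = Collator.getInstance(Locale.ENGLISH);
        primary.setStrength(Collator.PRIMARY);   // case differences are ignored

        Collator tertiary = Collator.getInstance(Locale.ENGLISH);
        tertiary.setStrength(Collator.TERTIARY); // case differences are significant

        System.out.println(sameKey(primary, "lucene", "LUCENE"));  // true
        System.out.println(sameKey(tertiary, "lucene", "LUCENE")); // false
    }
}
```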
<h2>Example Usages</h2>

<h3>Farsi Range Queries</h3>
<pre class="prettyprint">
  // "fa" Locale is not supported by Sun JDK 1.4 or 1.5
  Collator collator = Collator.getInstance(new Locale("ar"));
  CollationKeyAnalyzer analyzer = new CollationKeyAnalyzer(collator);
  RAMDirectory ramDir = new RAMDirectory();
  IndexWriter writer = new IndexWriter
    (ramDir, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
  Document doc = new Document();
  doc.add(new Field("content", "\u0633\u0627\u0628",
                    Field.Store.YES, Field.Index.ANALYZED));
  writer.addDocument(doc);
  writer.close(); // commit the document so the searcher below can see it
  IndexSearcher is = new IndexSearcher(ramDir, true);

  // The AnalyzingQueryParser in Lucene's contrib allows terms in range queries
  // to be passed through an analyzer - Lucene's standard QueryParser does not
  // allow this.
  AnalyzingQueryParser aqp = new AnalyzingQueryParser("content", analyzer);
  aqp.setLowercaseExpandedTerms(false);

  // Unicode order would include U+0633 in [ U+062F - U+0698 ], but Farsi
  // orders the U+0698 character before the U+0633 character, so the single
  // indexed Term above should NOT be returned by a ConstantScoreRangeQuery
  // with a Farsi Collator (or an Arabic one for the case when Farsi is not
  // supported).
  ScoreDoc[] result
    = is.search(aqp.parse("[ \u062F TO \u0698 ]"), null, 1000).scoreDocs;
  assertEquals("The index Term should not be included.", 0, result.length);
</pre>
<h3>Danish Sorting</h3>
<pre class="prettyprint">
  Analyzer analyzer
    = new CollationKeyAnalyzer(Collator.getInstance(new Locale("da", "dk")));
  RAMDirectory indexStore = new RAMDirectory();
  IndexWriter writer = new IndexWriter
    (indexStore, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
  String[] tracer = new String[] { "A", "B", "C", "D", "E" };
  String[] data = new String[] { "HAT", "HUT", "H\u00C5T", "H\u00D8T", "HOT" };
  // Danish sorts U+00D8 and U+00C5 after Z, in that order
  String[] sortedTracerOrder = new String[] { "A", "E", "B", "D", "C" };
  for (int i = 0 ; i &lt; data.length ; ++i) {
    Document doc = new Document();
    doc.add(new Field("tracer", tracer[i], Field.Store.YES, Field.Index.NO));
    doc.add(new Field("contents", data[i], Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(doc);
  }
  writer.close(); // commit the documents so the searcher below can see them
  Searcher searcher = new IndexSearcher(indexStore, true);
  Sort sort = new Sort();
  sort.setSort(new SortField("contents", SortField.STRING));
  Query query = new MatchAllDocsQuery();
  ScoreDoc[] result = searcher.search(query, null, 1000, sort).scoreDocs;
  for (int i = 0 ; i &lt; result.length ; ++i) {
    Document doc = searcher.doc(result[i].doc);
    assertEquals(sortedTracerOrder[i], doc.getValues("tracer")[0]);
  }
</pre>
<h3>Turkish Case Normalization</h3>
<pre class="prettyprint">
  Collator collator = Collator.getInstance(new Locale("tr", "TR"));
  collator.setStrength(Collator.PRIMARY);
  Analyzer analyzer = new CollationKeyAnalyzer(collator);
  RAMDirectory ramDir = new RAMDirectory();
  IndexWriter writer = new IndexWriter
    (ramDir, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
  Document doc = new Document();
  doc.add(new Field("contents", "DIGY", Field.Store.NO, Field.Index.ANALYZED));
  writer.addDocument(doc);
  writer.close(); // commit the document so the searcher below can see it
  IndexSearcher is = new IndexSearcher(ramDir, true);
  QueryParser parser = new QueryParser("contents", analyzer);
  Query query = parser.parse("d\u0131gy");   // U+0131: dotless i
  ScoreDoc[] result = is.search(query, null, 1000).scoreDocs;
  assertEquals("The index Term should be included.", 1, result.length);
</pre>
<h2>Caveats and Comparisons</h2>
<p>
  <strong>WARNING:</strong> Make sure you use exactly the same
  <code>Collator</code> at index and query time -- <code>CollationKey</code>s
  are only comparable when produced by
  the same <code>Collator</code>.  Since {@link java.text.RuleBasedCollator}s
  are not independently versioned, it is unsafe to search against stored
  <code>CollationKey</code>s unless the following are exactly the same (best
  practice is to store these values with the index and check that they
  remain the same at query time):
</p>
<ul>
  <li>JVM version, including patch version</li>
  <li>
    The language (and country and variant, if specified) of the Locale
    used when constructing the collator via
    {@link java.text.Collator#getInstance(java.util.Locale)}.
  </li>
  <li>
    The collation strength used - see {@link java.text.Collator#setStrength(int)}
  </li>
</ul>
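<p>
  One possible way to follow this practice, sketched with the JDK alone, is to
  reduce the values listed above to a single string at index time and refuse
  to create the query-time <code>Collator</code> if the string no longer
  matches. The class <code>CollatorFingerprint</code> and its methods
  <code>describe</code> and <code>openCollator</code> are hypothetical, and
  where the string is persisted alongside the index is left open.
</p>

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorFingerprint {

    // Hypothetical helper: one string capturing the JVM version (including
    // patch version), the Locale, and the collation strength.
    static String describe(Locale locale, int strength) {
        return "jvm=" + System.getProperty("java.version")
            + ";locale=" + locale.toString()
            + ";strength=" + strength;
    }

    // Refuse to build a query-time Collator when the stored description
    // differs from the current runtime's description.
    static Collator openCollator(String stored, Locale locale, int strength) {
        String current = describe(locale, strength);
        if (!current.equals(stored)) {
            throw new IllegalStateException(
                "Collator settings changed: index has [" + stored
                + "], runtime has [" + current + "]");
        }
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength);
        return collator;
    }

    public static void main(String[] args) {
        Locale danish = Locale.forLanguageTag("da-DK");
        // At index time: persist this string next to the index (storage not shown).
        String stored = describe(danish, Collator.PRIMARY);
        // At query time: verify the settings before searching.
        Collator collator = openCollator(stored, danish, Collator.PRIMARY);
        System.out.println("strength=" + collator.getStrength());
    }
}
```

<p>
  A mismatch then surfaces as an immediate error instead of silently wrong
  sort order or range results.
</p>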
<p>
  <code>ICUCollationKeyFilter</code>, available in the icu package in Lucene's contrib area,
  uses ICU4J's <code>Collator</code>, which
  makes its version available, thus allowing collation to be versioned
  independently from the JVM.  <code>ICUCollationKeyFilter</code> is also
  significantly faster and generates significantly shorter keys than
  <code>CollationKeyFilter</code>.  See
  <a href="http://site.icu-project.org/charts/collation-icu4j-sun"
    >http://site.icu-project.org/charts/collation-icu4j-sun</a> for key
  generation timing and key length comparisons between ICU4J and
  <code>java.text.Collator</code> over several languages.
</p>
<p>
  <code>CollationKey</code>s generated by <code>java.text.Collator</code>s are
  not compatible with those generated by ICU Collators.  Specifically, if
  you use <code>CollationKeyFilter</code> to generate index terms, do not use
  <code>ICUCollationKeyFilter</code> on the query side, or vice versa.
</p>