X-Git-Url: https://git.mdrn.pl/pylucene.git/blobdiff_plain/a2e61f0c04805cfcb8706176758d1283c7e3a55c..aaeed5504b982cf3545252ab528713250aa33eed:/lucene-java-3.5.0/lucene/contrib/icu/src/java/overview.html diff --git a/lucene-java-3.5.0/lucene/contrib/icu/src/java/overview.html b/lucene-java-3.5.0/lucene/contrib/icu/src/java/overview.html new file mode 100644 index 0000000..0e55ea7 --- /dev/null +++ b/lucene-java-3.5.0/lucene/contrib/icu/src/java/overview.html @@ -0,0 +1,382 @@ + + + + + + Apache Lucene ICU integration module + + + +

+This module exposes functionality from +ICU to Apache Lucene. ICU4J is a Java +library that enhances Java's internationalization support by improving +performance, keeping current with the Unicode Standard, and providing richer +APIs. This module exposes the following functionality: +

Text Segmentation: Tokenizes text based on + properties and rules defined in Unicode.
Collation: Compares strings according to the + conventions and standards of a particular language, region or country.
Normalization: Converts text to a unique, + equivalent form.
Case Folding: Removes case distinctions with + Unicode's Default Caseless Matching algorithm.
Search Term Folding: Removes distinctions + (such as accent marks) between similar characters for a loose or fuzzy search.
Text Transformation: Transforms Unicode text in + a context-sensitive fashion, e.g. mapping Traditional to Simplified Chinese.

Text Segmentation

+Text Segmentation (Tokenization) divides document and query text into index terms +(typically words). Unicode provides special properties and rules so that this can +be done in a manner that works well with most languages. +

+Text Segmentation implements the word segmentation specified in +Unicode Text Segmentation. +Additionally, the algorithm can be tailored by writing system: for example, +text in the Thai script is automatically delegated to a dictionary-based segmentation +algorithm. +

Use Cases

+ As a more thorough replacement for StandardTokenizer that works well for + most languages. +

Example Usages

Tokenizing multilanguage text

+  /**
+   * This tokenizer will work well in general for most languages.
+   */
+  Tokenizer tokenizer = new ICUTokenizer(reader);
+
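To make rule-based word segmentation concrete without the Lucene and ICU4J jars on the classpath, the following sketch uses the JDK's java.text.BreakIterator as a stand-in. It is not ICUTokenizer and does not include the script-specific tailoring described above; the class name WordSegmentationSketch and its words helper are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordSegmentationSketch {

  /** Splits text at word boundaries and keeps only segments that start with a letter or digit. */
  static List<String> words(String text) {
    BreakIterator boundaries = BreakIterator.getWordInstance(Locale.ROOT);
    boundaries.setText(text);
    List<String> result = new ArrayList<>();
    int start = boundaries.first();
    for (int end = boundaries.next(); end != BreakIterator.DONE; end = boundaries.next()) {
      String segment = text.substring(start, end);
      // Punctuation and whitespace are also returned as segments; skip them.
      if (Character.isLetterOrDigit(segment.codePointAt(0))) {
        result.add(segment);
      }
      start = end;
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(words("Hello, world")); // prints [Hello, world]
  }
}
```

The boundary iterator reports every segment, including punctuation and spaces, so a tokenizer built on it has to decide which segments become index terms. The filter on the first code point is one simple choice for this sketch.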

Collation

+ ICUCollationKeyFilter + converts each token into its binary CollationKey using the + provided Collator, and then encodes the CollationKey + as a String using + {@link org.apache.lucene.util.IndexableBinaryStringTools}, to allow it to be + stored as an index term. +

+ ICUCollationKeyFilter depends on ICU4J 4.4 to produce the + CollationKeys. icu4j-4.4.jar + is included in Lucene's Subversion repository at contrib/icu/lib/. +


Use Cases


+ Efficient sorting of terms in languages that use non-Unicode character + orderings. (Lucene Sort using a Locale can be very slow.) +
+ Efficient range queries over fields that contain terms in languages that + use non-Unicode character orderings. (Range queries using a Locale can be + very slow.) +
+ Effective Locale-specific normalization (case differences, diacritics, etc.). + ({@link org.apache.lucene.analysis.LowerCaseFilter} and + {@link org.apache.lucene.analysis.ASCIIFoldingFilter} provide these services + in a generic way that doesn't take into account locale-specific needs.) +
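The use cases above rest on one property: once terms are replaced by collation keys, plain byte-wise comparison of the keys reproduces the collator's ordering. The sketch below illustrates that with the JDK's java.text.Collator rather than ICU4J, so the key bytes differ from what ICUCollationKeyFilter would index. The class CollationKeySketch and its helpers are hypothetical names.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

final class CollationKeySketch {

  /** Compares two byte arrays lexicographically, treating each byte as unsigned. */
  static int compareUnsigned(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (diff != 0) {
        return diff;
      }
    }
    return a.length - b.length;
  }

  /** Compares two strings using only the raw bytes of their collation keys. */
  static int compareByKeyBytes(Collator collator, String left, String right) {
    CollationKey l = collator.getCollationKey(left);
    CollationKey r = collator.getCollationKey(right);
    return compareUnsigned(l.toByteArray(), r.toByteArray());
  }

  public static void main(String[] args) {
    Collator english = Collator.getInstance(Locale.ENGLISH);
    // Code point order puts "Banana" first because 'B' (U+0042) < 'a' (U+0061).
    System.out.println("code points: " + Integer.signum("apple".compareTo("Banana")));
    // Collation order puts "apple" first, and the key bytes agree with the collator.
    System.out.println("collator:    " + Integer.signum(english.compare("apple", "Banana")));
    System.out.println("key bytes:   " + Integer.signum(compareByKeyBytes(english, "apple", "Banana")));
  }
}
```

Because the comparison needs nothing but the stored bytes, sorting and range queries over key-encoded terms do not have to invoke a collator for each comparison at search time.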


Example Usages


Farsi Range Queries

+  Collator collator = Collator.getInstance(new Locale("ar"));
+  ICUCollationKeyAnalyzer analyzer = new ICUCollationKeyAnalyzer(collator);
+  RAMDirectory ramDir = new RAMDirectory();
+  IndexWriter writer = new IndexWriter
+    (ramDir, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
+  Document doc = new Document();
+  doc.add(new Field("content", "\u0633\u0627\u0628", 
+                    Field.Store.YES, Field.Index.ANALYZED));
+  writer.addDocument(doc);
+  writer.close();
+  IndexSearcher is = new IndexSearcher(ramDir, true);
+
+  // The AnalyzingQueryParser in Lucene's contrib allows terms in range queries
+  // to be passed through an analyzer - Lucene's standard QueryParser does not
+  // allow this.
+  AnalyzingQueryParser aqp = new AnalyzingQueryParser("content", analyzer);
+  aqp.setLowercaseExpandedTerms(false);
+  
+  // Unicode order would include U+0633 in [ U+062F - U+0698 ], but Farsi
+  // orders the U+0698 character before the U+0633 character, so the single
+  // indexed Term above should NOT be returned by a range query that uses
+  // a Farsi Collator (or an Arabic one for the case when Farsi is not
+  // supported).
+  ScoreDoc[] result
+    = is.search(aqp.parse("[ \u062F TO \u0698 ]"), null, 1000).scoreDocs;
+  assertEquals("The index Term should not be included.", 0, result.length);
+


Danish Sorting

+  Analyzer analyzer 
+    = new ICUCollationKeyAnalyzer(Collator.getInstance(new Locale("da", "dk")));
+  RAMDirectory indexStore = new RAMDirectory();
+  IndexWriter writer = new IndexWriter 
+    (indexStore, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
+  String[] tracer = new String[] { "A", "B", "C", "D", "E" };
+  String[] data = new String[] { "HAT", "HUT", "H\u00C5T", "H\u00D8T", "HOT" };
+  String[] sortedTracerOrder = new String[] { "A", "E", "B", "D", "C" };
+  for (int i = 0 ; i < data.length ; ++i) {
+    Document doc = new Document();
+    doc.add(new Field("tracer", tracer[i], Field.Store.YES, Field.Index.NO));
+    doc.add(new Field("contents", data[i], Field.Store.NO, Field.Index.ANALYZED));
+    writer.addDocument(doc);
+  }
+  writer.close();
+  Searcher searcher = new IndexSearcher(indexStore, true);
+  Sort sort = new Sort();
+  sort.setSort(new SortField("contents", SortField.STRING));
+  Query query = new MatchAllDocsQuery();
+  ScoreDoc[] result = searcher.search(query, null, 1000, sort).scoreDocs;
+  for (int i = 0 ; i < result.length ; ++i) {
+    Document doc = searcher.doc(result[i].doc);
+    assertEquals(sortedTracerOrder[i], doc.getValues("tracer")[0]);
+  }
+


Turkish Case Normalization

+  Collator collator = Collator.getInstance(new Locale("tr", "TR"));
+  collator.setStrength(Collator.PRIMARY);
+  Analyzer analyzer = new ICUCollationKeyAnalyzer(collator);
+  RAMDirectory ramDir = new RAMDirectory();
+  IndexWriter writer = new IndexWriter
+    (ramDir, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
+  Document doc = new Document();
+  doc.add(new Field("contents", "DIGY", Field.Store.NO, Field.Index.ANALYZED));
+  writer.addDocument(doc);
+  writer.close();
+  IndexSearcher is = new IndexSearcher(ramDir, true);
+  QueryParser parser = new QueryParser("contents", analyzer);
+  Query query = parser.parse("d\u0131gy");   // U+0131: dotless i
+  ScoreDoc[] result = is.search(query, null, 1000).scoreDocs;
+  assertEquals("The index Term should be included.", 1, result.length);
+
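To see why a locale-independent lowercase step cannot produce this match, the sketch below compares generic and Turkish lowercasing using only the JDK's String.toLowerCase(Locale). It does not use a Collator or any Lucene class; TurkishLowercaseSketch and its helpers are hypothetical names.

```java
import java.util.Locale;

final class TurkishLowercaseSketch {

  private static final Locale TURKISH = new Locale("tr", "TR");

  /** Lowercases without any language-specific rules. */
  static String lowerGeneric(String s) {
    return s.toLowerCase(Locale.ROOT);
  }

  /** Lowercases with Turkish rules, where capital I maps to dotless i (U+0131). */
  static String lowerTurkish(String s) {
    return s.toLowerCase(TURKISH);
  }

  public static void main(String[] args) {
    String query = "d\u0131gy"; // the query term from the example above
    System.out.println(lowerGeneric("DIGY").equals(query)); // prints false
    System.out.println(lowerTurkish("DIGY").equals(query)); // prints true
  }
}
```

Generic lowercasing yields "digy" with a dotted i, which is a different term from the query. A collator for the Turkish locale at primary strength treats the two spellings as equivalent without the application having to pick a lowercasing rule.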


Caveats and Comparisons

+ WARNING: Make sure you use exactly the same + Collator at index and query time -- CollationKeys + are only comparable when produced by + the same Collator. Since {@link java.text.RuleBasedCollator}s + are not independently versioned, it is unsafe to search against stored + CollationKeys unless the following are exactly the same (best + practice is to store this information with the index and check that they + remain the same at query time): +

JVM vendor
JVM version, including patch version
+ The language (and country and variant, if specified) of the Locale + used when constructing the collator via + {@link java.text.Collator#getInstance(java.util.Locale)}. +
+ The collation strength used - see {@link java.text.Collator#setStrength(int)} +
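One way to follow the practice of storing this information with the index is to record a small fingerprint at index time and compare it before searching. The sketch below is an assumption about how that could look, not part of Lucene: the class CollatorFingerprintSketch, its methods, and the fingerprint format are all hypothetical, and where the string is persisted is left open.

```java
import java.text.Collator;
import java.util.Locale;

final class CollatorFingerprintSketch {

  /** Builds a string from the four items listed above: JVM vendor, JVM version, locale, strength. */
  static String fingerprint(Locale locale, int strength) {
    return "vendor=" + System.getProperty("java.vendor")
        + ";version=" + System.getProperty("java.version")
        + ";locale=" + locale
        + ";strength=" + strength;
  }

  /** Throws if the current settings differ from the fingerprint recorded at index time. */
  static void checkCompatible(String storedFingerprint, Locale locale, int strength) {
    String current = fingerprint(locale, strength);
    if (!current.equals(storedFingerprint)) {
      throw new IllegalStateException(
          "Collator settings changed: indexed with [" + storedFingerprint
              + "] but querying with [" + current + "]");
    }
  }

  public static void main(String[] args) {
    Locale locale = new Locale("tr", "TR");
    // At index time: compute the fingerprint and keep it next to the index.
    String stored = fingerprint(locale, Collator.PRIMARY);
    // At query time: verify before trusting the stored collation keys.
    checkCompatible(stored, locale, Collator.PRIMARY);
    System.out.println(stored);
  }
}
```

Failing fast on a mismatch is deliberate in this sketch: keys produced under different settings compare without error but in a meaningless order, so a silent mismatch would show up only as wrong search results.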

+ ICUCollationKeyFilter uses ICU4J's Collator, which + makes its version available, thus allowing collation to be versioned + independently from the JVM. ICUCollationKeyFilter is also + significantly faster and generates significantly shorter keys than + CollationKeyFilter. See + http://site.icu-project.org/charts/collation-icu4j-sun for key + generation timing and key length comparisons between ICU4J and + java.text.Collator over several languages. +

+ CollationKeys generated by java.text.Collators are + not compatible with those generated by ICU Collators. Specifically, if + you use CollationKeyFilter to generate index terms, do not use + ICUCollationKeyFilter on the query side, or vice versa. +

Normalization

+ ICUNormalizer2Filter normalizes term text to a + Unicode Normalization Form, so + that equivalent + forms are standardized to a unique form. +

Use Cases

Removing differences in width for Asian-language text. +
Standardizing complex text with non-spacing marks so that characters are + ordered consistently. +

Example Usages

Normalizing text to NFC

+  /**
+   * Normalizer2 objects are unmodifiable and immutable.
+   */
+  Normalizer2 normalizer = Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.COMPOSE);
+  /**
+   * This filter will normalize to NFC.
+   */
+  TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, normalizer);
+
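The effect of normalization on the two use cases can be checked with the JDK's java.text.Normalizer, used here as a stand-in for ICU4J's Normalizer2 so that the sketch runs without extra jars. NormalizationSketch and its helpers are hypothetical names. Note that removing width differences requires a compatibility form such as NFKC; NFC alone leaves fullwidth letters untouched.

```java
import java.text.Normalizer;

final class NormalizationSketch {

  /** Canonical composition (NFC). */
  static String nfc(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFC);
  }

  /** Compatibility composition (NFKC), which also folds width variants. */
  static String nfkc(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFKC);
  }

  public static void main(String[] args) {
    // 'e' followed by a combining acute accent composes to the single code point U+00E9.
    System.out.println(nfc("e\u0301").equals("\u00E9"));                   // prints true
    // Two combining marks typed in different orders normalize to the same sequence.
    System.out.println(nfc("a\u0301\u0323").equals(nfc("a\u0323\u0301"))); // prints true
    // Fullwidth Latin letters map to ASCII under NFKC but not under NFC.
    System.out.println(nfkc("\uFF21\uFF22\uFF23"));                        // prints ABC
    System.out.println(nfc("\uFF21").equals("\uFF21"));                    // prints true
  }
}
```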

Case Folding

+Default caseless matching, or case-folding, is more than just conversion to +lowercase. For example, it handles cases such as the Greek sigma, so that +"Μάϊος" and "ΜΆΪΟΣ" will match correctly. +

+Case-folding is still only an approximation of the language-specific rules +governing case. If the specific language is known, consider using +ICUCollationKeyFilter and indexing collation keys instead. This implementation +performs the "full" case-folding specified in the Unicode standard, and this +may change the length of the term. For example, the German ß is case-folded +to the string 'ss'. +

+Case folding is related to normalization, and as such is coupled with it in +this integration. To perform case-folding, you use normalization with the form +"nfkc_cf" (which is the default). +
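The difference between lowercasing and full case folding can be seen with the German sharp s (U+00DF) using only JDK string methods. The JDK has no case-folding API, so the roughFold helper below substitutes an uppercase-then-lowercase round trip. That is a coarse approximation chosen for this sketch and is not the Unicode case-folding algorithm that the "nfkc_cf" form implements. The class and method names are hypothetical.

```java
import java.util.Locale;

final class CaseFoldingSketch {

  /** Plain lowercasing, as a simple lowercase filter would do. */
  static String lower(String s) {
    return s.toLowerCase(Locale.ROOT);
  }

  /**
   * Rough stand-in for full case folding: uppercase first so that multi-character
   * mappings (sharp s to "SS") are applied, then lowercase the result.
   */
  static String roughFold(String s) {
    return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    String a = "Stra\u00DFe"; // spelled with sharp s
    String b = "STRASSE";
    System.out.println(lower(a).equals(lower(b)));          // prints false
    System.out.println(roughFold(a).equals(roughFold(b)));  // prints true
    System.out.println(roughFold(a).length() - a.length()); // prints 1
  }
}
```

The last line shows the length change mentioned above: the six-character input becomes a seven-character term once the sharp s is expanded.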

Use Cases

+ As a more thorough replacement for LowerCaseFilter that has good behavior + for most languages. +

Example Usages

Lowercasing text

+  /**
+   * This filter will case-fold and normalize to NFKC.
+   */
+  TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer);
+

Search Term Folding

+Search term folding removes distinctions (such as accent marks) between +similar characters. It is useful for a fuzzy or loose search. +

+Search term folding implements many of the foldings specified in +Character Foldings +as a special normalization form. This folding applies NFKC, Case Folding, and +many character foldings recursively. +

Use Cases

+ As a more thorough replacement for ASCIIFoldingFilter and LowerCaseFilter + that applies the same ideas to many more languages. +

Example Usages

Removing accents

+  /**
+   * This filter will case-fold, remove accents and other distinctions, and
+   * normalize to NFKC.
+   */
+  TokenStream tokenstream = new ICUFoldingFilter(tokenizer);
+
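A much narrower version of the same idea can be written with the JDK alone: decompose the text, drop the combining marks, and lowercase. The sketch below is only an illustration of accent removal and covers far fewer cases than ICUFoldingFilter. AccentStrippingSketch and its methods are hypothetical names.

```java
import java.text.Normalizer;
import java.util.Locale;

final class AccentStrippingSketch {

  /** Decomposes to NFD and removes all combining marks (Unicode category M). */
  static String stripAccents(String s) {
    String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
    return decomposed.replaceAll("\\p{M}+", "");
  }

  /** Strips accents and lowercases, giving a loose matching key for a term. */
  static String looseKey(String s) {
    return stripAccents(s).toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    System.out.println(looseKey("R\u00E9sum\u00E9")); // prints resume
    System.out.println(looseKey("resume"));           // prints resume
    // Characters without a canonical decomposition pass through unchanged:
    // U+00F8 (o with stroke) is not turned into "o" by this approach.
    System.out.println(stripAccents("\u00F8").equals("\u00F8")); // prints true
  }
}
```

The last line marks the limit of this approach: letters that carry their distinction inside a single undecomposable code point need explicit folding rules, which is what a dedicated folding form provides.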

Text Transformation

+ICU provides text-transformation functionality via its Transliteration API. This allows +you to transform text in a variety of ways, taking context into account. +

+For more information, see the +User's Guide +and +Rule Tutorial. +

Use Cases

+ Convert Traditional to Simplified +
+ Transliterate between different writing systems: e.g. Romanization +

Example Usages

Convert Traditional to Simplified

+  /**
+   * This filter will map Traditional Chinese to Simplified Chinese
+   */
+  TokenStream tokenstream = new ICUTransformFilter(tokenizer, Transliterator.getInstance("Traditional-Simplified"));
+

Transliterate Serbian Cyrillic to Serbian Latin

+  /**
+   * This filter will map Serbian Cyrillic to Serbian Latin according to BGN rules
+   */
+  TokenStream tokenstream = new ICUTransformFilter(tokenizer, Transliterator.getInstance("Serbian-Latin/BGN"));
+

Backwards Compatibility

+This module exists to provide up-to-date Unicode functionality that supports +the most recent version of Unicode (currently 6.0). However, users who need +stronger backwards compatibility can restrict +{@link org.apache.lucene.analysis.icu.ICUNormalizer2Filter} to operate on only +a specific Unicode version by using a {@link com.ibm.icu.text.FilteredNormalizer2}. +

Example Usages

Restricting normalization to Unicode 5.0

+  /**
+   * This filter will do NFC normalization, but will ignore any characters that
+   * did not exist as of Unicode 5.0. Because of the normalization stability policy
+   * of Unicode, this is an easy way to force normalization to a specific version.
+   */
+    Normalizer2 normalizer = Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.COMPOSE);
+    UnicodeSet set = new UnicodeSet("[:age=5.0:]");
+    // see FilteredNormalizer2 docs, the set should be frozen or performance will suffer
+    set.freeze(); 
+    FilteredNormalizer2 unicode50 = new FilteredNormalizer2(normalizer, set);
+    TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, unicode50);
+
