X-Git-Url: https://git.mdrn.pl/pylucene.git/blobdiff_plain/a2e61f0c04805cfcb8706176758d1283c7e3a55c..aaeed5504b982cf3545252ab528713250aa33eed:/lucene-java-3.5.0/lucene/src/java/org/apache/lucene/analysis/package.html?ds=sidebyside

diff --git a/lucene-java-3.5.0/lucene/src/java/org/apache/lucene/analysis/package.html b/lucene-java-3.5.0/lucene/src/java/org/apache/lucene/analysis/package.html
new file mode 100644
index 0000000..c849a48
--- /dev/null
+++ b/lucene-java-3.5.0/lucene/src/java/org/apache/lucene/analysis/package.html
@@ -0,0 +1,635 @@

API and code to convert text into indexable/searchable tokens. Covers {@link org.apache.lucene.analysis.Analyzer} and related classes.

+

Parsing? Tokenization? Analysis!

+

Lucene, an indexing and search library, accepts only plain text input.

+

Parsing

+

Applications that build their search capabilities upon Lucene may support documents in various formats – HTML, XML, PDF, Word – just to name a few.
Lucene does not handle the parsing of these and other document formats; it is the responsibility of the
application using Lucene to use an appropriate parser to convert the original format into plain text before passing that plain text to Lucene.
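For illustration only, here is a minimal sketch of that hand-off; the parser and its extractPlainText() method are hypothetical stand-ins for whatever parsing library the application uses, and the IndexWriter is assumed to already exist:

      // The application's own parser (not part of Lucene) produces plain text ...
      String plainText = myPdfParser.extractPlainText(new File("manual.pdf")); // hypothetical parser API

      // ... and only that plain text is handed to Lucene for analysis and indexing.
      Document doc = new Document();
      doc.add(new Field("contents", plainText, Field.Store.NO, Field.Index.ANALYZED));
      indexWriter.addDocument(doc);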

+

Tokenization

+

Plain text passed to Lucene for indexing goes through a process generally called tokenization. Tokenization is the process
of breaking input text into small indexing elements – tokens.
The way input text is broken into tokens heavily influences how people will then be able to search for that text.
For instance, sentence beginnings and endings can be identified to provide for more accurate phrase
and proximity searches (though sentence identification is not provided by Lucene).

In some cases simply breaking the input text into tokens is not enough – a deeper analysis may be needed.
There are many post-tokenization steps that can be done, including (but not limited to):

1. Stemming – replacing words with their stems. For instance, with English stemming "bikes" is replaced with "bike"; now a query for "bike" can find both documents containing "bike" and those containing "bikes".
2. Stop Words Filtering – common words like "the", "and" and "a" rarely add any value to a search. Removing them shrinks the index size and increases performance. It may also reduce some "noise" and make the search results look cleaner.
3. Text Normalization – stripping accents and other character markings can make for better searching.
4. Synonym Expansion – adding in synonyms at the same token position as the current word can mean better matching when users search with words in the synonyms set.

+

Core Analysis

+

The analysis package provides the mechanism to convert Strings and Readers into tokens that can be indexed by Lucene. There
are three main classes in the package from which all analysis processes are derived. These are:

1. Analyzer – an Analyzer is responsible for building a TokenStream that can be consumed by the indexing and searching code.
2. Tokenizer – a Tokenizer is a TokenStream and is responsible for breaking up incoming text into tokens. In most cases, an Analyzer will use a Tokenizer as the first step in the analysis process.
3. TokenFilter – a TokenFilter is also a TokenStream and is responsible for modifying tokens that have been created by the Tokenizer.

+ Lucene 2.9 introduces a new TokenStream API. Please see the section "New TokenStream API" below for more details. +

+

Hints, Tips and Traps

+

The synergy between {@link org.apache.lucene.analysis.Analyzer} and {@link org.apache.lucene.analysis.Tokenizer}
is sometimes confusing. To ease this confusion, some clarifications:

1. The Analyzer is responsible for the entire task of creating tokens out of the input text, while the Tokenizer is only responsible for breaking the input text into tokens. Tokens created by the Tokenizer may well be modified or even omitted by the Analyzer (via one or more TokenFilters) before being returned.
2. Tokenizer is a TokenStream, but Analyzer is not.
3. Analyzer is "field aware", but Tokenizer is not.

+ Lucene Java provides a number of analysis capabilities, the most commonly used one being the {@link + org.apache.lucene.analysis.standard.StandardAnalyzer}. Many applications will have a long and industrious life with nothing more + than the StandardAnalyzer. However, there are a few other classes/packages that are worth mentioning: +

    +
1. {@link org.apache.lucene.analysis.PerFieldAnalyzerWrapper} – most Analyzers perform the same operation on all
   {@link org.apache.lucene.document.Field}s. The PerFieldAnalyzerWrapper can be used to associate a different Analyzer with different
   {@link org.apache.lucene.document.Field}s (a short configuration sketch follows after this list).
2. The contrib/analyzers library located at the root of the Lucene distribution has a number of different Analyzer implementations to solve a variety
   of different problems related to searching. Many of the Analyzers are designed to analyze non-English languages.
3. The contrib/snowball library located at the root of the Lucene distribution has Analyzer and TokenFilter
   implementations for a variety of Snowball stemmers.
   See http://snowball.tartarus.org for more information on Snowball stemmers.
4. There are a variety of Tokenizer and TokenFilter implementations in this package. Take a look around; chances are someone has already implemented what you need.
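To make the first item concrete, here is a hedged sketch of wiring up a PerFieldAnalyzerWrapper; the field name "partnum", the choice of KeywordAnalyzer and the 'directory' variable are just examples and assumptions:

      // Use StandardAnalyzer for most fields, but keep "partnum" as a single untokenized term.
      PerFieldAnalyzerWrapper wrapper =
          new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_35));
      wrapper.addAnalyzer("partnum", new KeywordAnalyzer());

      // The wrapper is then used wherever a plain Analyzer would be, e.g. for indexing:
      IndexWriter writer =
          new IndexWriter(directory, new IndexWriterConfig(Version.LUCENE_35, wrapper));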

+

Analysis is one of the main causes of performance degradation during indexing. Simply put, the more you analyze, the slower the indexing (in most cases).
Perhaps your application would be just fine using the simple {@link org.apache.lucene.analysis.WhitespaceTokenizer} combined with a
{@link org.apache.lucene.analysis.StopFilter}. The contrib/benchmark library can be useful for testing out the speed of the analysis process.
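If that simple combination is indeed enough, the analyzer might look roughly like the following sketch (the class name is made up; the stop word set shown is just the default English set):

      public class SimpleStopAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
          TokenStream stream = new WhitespaceTokenizer(Version.LUCENE_35, reader);
          // Drop common English stop words; any Set of stop words could be used here.
          stream = new StopFilter(Version.LUCENE_35, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
          return stream;
        }
      }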

+

Invoking the Analyzer

+

Applications usually do not invoke analysis – Lucene does it for them:

1. At indexing, as a consequence of addDocument(doc), the Analyzer in effect for indexing is invoked for each indexed field of the added document.
2. At search, a QueryParser may invoke the Analyzer during parsing. Note that for some queries analysis does not take place, e.g. wildcard queries.

However, an application might invoke analysis of any text for testing or for any other purpose, something like:
+      Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_35); // or any other analyzer
+      TokenStream ts = analyzer.tokenStream("myfield", new StringReader("some text goes here"));
+      TermAttribute termAtt = ts.addAttribute(TermAttribute.class);
+      ts.reset();
+      while (ts.incrementToken()) {
+        System.out.println("token: " + termAtt.term());
+      }
+      ts.end();
+      ts.close();
+  
+

+

Indexing Analysis vs. Search Analysis

+

+ Selecting the "correct" analyzer is crucial + for search quality, and can also affect indexing and search performance. + The "correct" analyzer differs between applications. + Lucene java's wiki page + AnalysisParalysis + provides some data on "analyzing your analyzer". + Here are some rules of thumb: +

    +
1. Test, test, test... (did we say test?)
2. Beware of over-analysis – it might hurt indexing performance.
3. Start with the same analyzer for indexing and search; otherwise searches will not find what they are supposed to...
4. In some cases a different analyzer is required for indexing and search – for instance, when certain searches require more stop words to be filtered than were filtered at indexing, or when synonyms should be injected at only one of the two stages.
   This might sometimes require a modified analyzer – see the next section on how to do that; a short configuration sketch also follows below.
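Mechanically, using different analyzers at the two stages only requires handing different Analyzer instances to the indexing and the query-parsing code; a hedged sketch (largerStopWordSet and directory are assumed to exist):

      Analyzer indexAnalyzer  = new StandardAnalyzer(Version.LUCENE_35);
      Analyzer searchAnalyzer = new StandardAnalyzer(Version.LUCENE_35, largerStopWordSet);

      // The indexing analyzer is supplied through the IndexWriter configuration ...
      IndexWriter writer =
          new IndexWriter(directory, new IndexWriterConfig(Version.LUCENE_35, indexAnalyzer));

      // ... while the search analyzer is supplied to the QueryParser.
      QueryParser parser = new QueryParser(Version.LUCENE_35, "f", searchAnalyzer);
      Query query = parser.parse("blue sky");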

+

Implementing your own Analyzer

+

Creating your own Analyzer is straightforward. It usually involves either wrapping an existing Tokenizer and set of TokenFilters to create a new Analyzer,
or implementing both your own Analyzer and your own Tokenizer or TokenFilter. Before pursuing this approach, you may find it worthwhile
to explore the contrib/analyzers library and/or ask on the java-user@lucene.apache.org mailing list first to see if what you need already exists.
If you are still committed to creating your own Analyzer or TokenStream derivation (Tokenizer or TokenFilter), have a look at
the source code of any one of the many samples located in this package.

+

+ The following sections discuss some aspects of implementing your own analyzer. +

+

Field Section Boundaries

+

+ When {@link org.apache.lucene.document.Document#add(org.apache.lucene.document.Fieldable) document.add(field)} + is called multiple times for the same field name, we could say that each such call creates a new + section for that field in that document. + In fact, a separate call to + {@link org.apache.lucene.analysis.Analyzer#tokenStream(java.lang.String, java.io.Reader) tokenStream(field,reader)} + would take place for each of these so called "sections". + However, the default Analyzer behavior is to treat all these sections as one large section. + This allows phrase search and proximity search to seamlessly cross + boundaries between these "sections". + In other words, if a certain field "f" is added like this: +

+      document.add(new Field("f","first ends",...);
+      document.add(new Field("f","starts two",...);
+      indexWriter.addDocument(document);
+  
+ Then, a phrase search for "ends starts" would find that document. + Where desired, this behavior can be modified by introducing a "position gap" between consecutive field "sections", + simply by overriding + {@link org.apache.lucene.analysis.Analyzer#getPositionIncrementGap(java.lang.String) Analyzer.getPositionIncrementGap(fieldName)}: +
+      Analyzer myAnalyzer = new Analyzer() {
+         private final Analyzer delegate = new StandardAnalyzer(Version.LUCENE_35);
+         public TokenStream tokenStream(String fieldName, Reader reader) {
+           return delegate.tokenStream(fieldName, reader);
+         }
+         public int getPositionIncrementGap(String fieldName) {
+           return 10;
+         }
+      };
+  
+
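To see the effect, consider a phrase query that crosses the section boundary; roughly speaking, with a gap of 10 the query only matches if its slop is large enough to cover the gap (a small sketch):

      PhraseQuery query = new PhraseQuery();
      query.add(new Term("f", "ends"));
      query.add(new Term("f", "starts"));
      // With the default slop of 0 the document above is no longer found;
      // allowing a slop that covers the position gap makes it match again.
      query.setSlop(10);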

+

Token Position Increments

+

+ By default, all tokens created by Analyzers and Tokenizers have a + {@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute#getPositionIncrement() position increment} of one. + This means that the position stored for that token in the index would be one more than + that of the previous token. + Recall that phrase and proximity searches rely on position info. +

+

If the selected analyzer filters the stop words "is" and "the", then for a document
containing the string "blue is the sky", only the tokens "blue" and "sky" are indexed,
with position("sky") = 1 + position("blue"). Now, a phrase query "blue is the sky"
would find that document, because the same analyzer filters the same stop words from
that query. But the phrase query "blue sky" would also find that document.

+

If this behavior does not fit the application's needs,
a modified analyzer can be used that further increments the positions of
tokens following a removed stop word, using
{@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute#setPositionIncrement(int)}.
This can be done with something like:

+      public TokenStream tokenStream(final String fieldName, Reader reader) {
+        final TokenStream ts = someAnalyzer.tokenStream(fieldName, reader);
+        TokenStream res = new TokenFilter(ts) {
+          TermAttribute termAtt = addAttribute(TermAttribute.class);
+          PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
+        
+          public boolean incrementToken() throws IOException {
+            int extraIncrement = 0;
+            while (true) {
+              boolean hasNext = input.incrementToken();
+              if (hasNext) {
+                if (stopWords.contains(termAtt.term())) {
+                  extraIncrement++; // filter this word
+                  continue;
+                } 
+                if (extraIncrement>0) {
+                  posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement()+extraIncrement);
+                }
+              }
+              return hasNext;
+            }
+          }
+        };
+        return res;
+      }
+   
Now, with this modified analyzer, the phrase query "blue sky" would find that document.
But note that this is not yet a perfect solution, because any phrase query "blue w1 w2 sky"
where both w1 and w2 are stop words would also match that document.

+

A few more use cases for modifying position increments are:

    +
1. Inhibiting phrase and proximity matches across sentence boundaries – for this, a tokenizer that
   identifies a new sentence can add 1 to the position increment of the first token of the new sentence.
2. Injecting synonyms – here, synonyms of a token should be added after that token,
   and their position increment should be set to 0.
   As a result, all synonyms of a token are considered to appear in exactly the
   same position as that token, and that is how phrase and proximity searches will see them.
   (A sketch of such a filter follows below.)
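As a rough illustration of the synonym case, here is a hedged sketch of such a filter; the lookupSynonyms() helper is hypothetical, and real synonym handling (multi-word synonyms, offsets, etc.) is more involved:

      public final class SynonymInjectingFilter extends TokenFilter {
        private final Stack<String> pending = new Stack<String>();
        private AttributeSource.State current;
        private final TermAttribute termAtt = addAttribute(TermAttribute.class);
        private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);

        public SynonymInjectingFilter(TokenStream input) {
          super(input);
        }

        public boolean incrementToken() throws IOException {
          if (!pending.isEmpty()) {
            // Emit a synonym at the same position as the original token.
            restoreState(current);
            termAtt.setTermBuffer(pending.pop());
            posIncrAtt.setPositionIncrement(0);
            return true;
          }
          if (!input.incrementToken()) {
            return false;
          }
          pending.addAll(lookupSynonyms(termAtt.term())); // hypothetical synonym lookup
          if (!pending.isEmpty()) {
            current = captureState(); // remember the original token's attributes
          }
          return true;
        }
      }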

+

New TokenStream API

+

+ With Lucene 2.9 we introduce a new TokenStream API. The old API used to produce Tokens. A Token + has getter and setter methods for different properties like positionIncrement and termText. + While this approach was sufficient for the default indexing format, it is not versatile enough for + Flexible Indexing, a term which summarizes the effort of making the Lucene indexer pluggable and extensible for custom + index formats. +

+

+A fully customizable indexer means that users will be able to store custom data structures on disk. Therefore an API +is necessary that can transport custom types of data from the documents to the indexer. +

+

Attribute and AttributeSource

+Lucene 2.9 therefore introduces a new pair of classes called {@link org.apache.lucene.util.Attribute} and +{@link org.apache.lucene.util.AttributeSource}. An Attribute serves as a +particular piece of information about a text token. For example, {@link org.apache.lucene.analysis.tokenattributes.TermAttribute} + contains the term text of a token, and {@link org.apache.lucene.analysis.tokenattributes.OffsetAttribute} contains the start and end character offsets of a token. +An AttributeSource is a collection of Attributes with a restriction: there may be only one instance of each attribute type. TokenStream now extends AttributeSource, which +means that one can add Attributes to a TokenStream. Since TokenFilter extends TokenStream, all filters are also +AttributeSources. +

Lucene now provides six Attributes out of the box, which replace the variables the Token class has:

1. TermAttribute – the term text of a token.
2. OffsetAttribute – the start and end character offsets of a token.
3. PositionIncrementAttribute – see above for detailed information about position increments.
4. PayloadAttribute – the payload that a token can optionally carry.
5. TypeAttribute – the type of the token; the default is 'word'.
6. FlagsAttribute – optional flags a token can have.

Using the new TokenStream API

There are a few important things to know in order to use the new API efficiently; they are summarized here. You may want
to walk through the example below first and come back to this section afterwards.

1. Please keep in mind that an AttributeSource can only have one instance of a particular Attribute. Furthermore, if
   a chain of a TokenStream and multiple TokenFilters is used, then all TokenFilters in that chain share the Attributes
   with the TokenStream.
2. Attribute instances are reused for all tokens of a document. Thus, a TokenStream/-Filter needs to update
   the appropriate Attribute(s) in incrementToken(). The consumer, commonly the Lucene indexer, consumes the data in the
   Attributes and then calls incrementToken() again until it returns false, which indicates that the end of the stream
   was reached. This means that in each call of incrementToken() a TokenStream/-Filter can safely overwrite the data in
   the Attribute instances.
3. For performance reasons a TokenStream/-Filter should add/get Attributes during instantiation, i.e. create an attribute in the
   constructor and store a reference to it in an instance variable. Using an instance variable instead of calling addAttribute()/getAttribute()
   in incrementToken() avoids attribute lookups for every token in the document.
4. All methods in AttributeSource are idempotent, which means calling them multiple times always yields the same
   result. This is especially important to know for addAttribute(). The method takes the type (Class)
   of an Attribute as an argument and returns an instance. If an Attribute of the same type was previously added, then
   the already existing instance is returned; otherwise a new instance is created and returned. Therefore TokenStreams/-Filters
   can safely call addAttribute() with the same Attribute type multiple times. Even consumers of TokenStreams should
   normally call addAttribute() instead of getAttribute(), because it will not fail if the TokenStream does not have the
   Attribute (getAttribute() would throw an IllegalArgumentException if the Attribute is missing). More advanced code
   could simply check with hasAttribute() whether a TokenStream has a given Attribute, and conditionally leave out processing for
   extra performance. (A tiny illustration follows below.)
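A tiny illustration of the last two points ('stream' stands for any TokenStream instance):

      TermAttribute first  = stream.addAttribute(TermAttribute.class);
      TermAttribute second = stream.addAttribute(TermAttribute.class);
      // addAttribute() is idempotent: the second call returns the instance created by the first.
      assert first == second;

      // More advanced consumers can check for optional attributes before using them:
      if (stream.hasAttribute(PositionIncrementAttribute.class)) {
        PositionIncrementAttribute posIncr = stream.getAttribute(PositionIncrementAttribute.class);
        // ... use posIncr ...
      }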

Example

In this example we will create a WhitespaceTokenizer and use a LengthFilter to suppress all words that have only
two or fewer characters. The LengthFilter is part of the Lucene core and its implementation will be explained
here to illustrate the usage of the new TokenStream API.
Then we will develop a custom Attribute, a PartOfSpeechAttribute, and add another filter to the chain, the
PartOfSpeechTaggingFilter, which makes use of the new custom attribute.

Whitespace tokenization

+
+public class MyAnalyzer extends Analyzer {
+
+  public TokenStream tokenStream(String fieldName, Reader reader) {
+    TokenStream stream = new WhitespaceTokenizer(reader);
+    return stream;
+  }
+  
+  public static void main(String[] args) throws IOException {
+    // text to tokenize
+    final String text = "This is a demo of the new TokenStream API";
+    
+    MyAnalyzer analyzer = new MyAnalyzer();
+    TokenStream stream = analyzer.tokenStream("field", new StringReader(text));
+    
+    // get the TermAttribute from the TokenStream
+    TermAttribute termAtt = stream.addAttribute(TermAttribute.class);
+
+    stream.reset();
+    
+    // print all tokens until stream is exhausted
+    while (stream.incrementToken()) {
+      System.out.println(termAtt.term());
+    }
+    
+    stream.end();
+    stream.close();
+  }
+}
+
In this simple example, plain whitespace tokenization is performed. In main() a loop consumes the stream and
prints the term text of the tokens by accessing the TermAttribute that the WhitespaceTokenizer provides.
Here is the output:
+This
+is
+a
+demo
+of
+the
+new
+TokenStream
+API
+
+

Adding a LengthFilter

We want to suppress all tokens that have two or fewer characters. We can do that easily by adding a LengthFilter
to the chain. Only the tokenStream() method in our analyzer needs to be changed:
+  public TokenStream tokenStream(String fieldName, Reader reader) {
+    TokenStream stream = new WhitespaceTokenizer(reader);
+    stream = new LengthFilter(stream, 3, Integer.MAX_VALUE);
+    return stream;
+  }
+
+Note how now only words with 3 or more characters are contained in the output: +
+This
+demo
+the
+new
+TokenStream
+API
+
Now let's take a look at how the LengthFilter is implemented (it is part of Lucene's core):
+public final class LengthFilter extends TokenFilter {
+
+  final int min;
+  final int max;
+  
+  private TermAttribute termAtt;
+
+  /**
+   * Build a filter that removes words that are too long or too
+   * short from the text.
+   */
+  public LengthFilter(TokenStream in, int min, int max)
+  {
+    super(in);
+    this.min = min;
+    this.max = max;
+    termAtt = addAttribute(TermAttribute.class);
+  }
+  
+  /**
+   * Returns the next input token whose term() is of the right length.
+   */
+  public final boolean incrementToken() throws IOException
+  {
+    assert termAtt != null;
+    // skip tokens that are too short or too long; return the first token within bounds
+    while (input.incrementToken()) {
+      int len = termAtt.termLength();
+      if (len >= min && len <= max) {
+          return true;
+      }
+      // note: else we ignore it but should we index each part of it?
+    }
+    // reached EOS -- return false
+    return false;
+  }
+}
+
The TermAttribute is added in the constructor and stored in the instance variable termAtt.
Remember that there can only be a single instance of TermAttribute in the chain, so in our example the
addAttribute() call in LengthFilter returns the TermAttribute that the WhitespaceTokenizer already added. The tokens
are retrieved from the input stream in the incrementToken() method. By looking at the term text
in the TermAttribute, the length of the term can be determined and too-short or too-long tokens are skipped.
Note how incrementToken() can efficiently access the instance variable; no attribute lookup
is necessary. The same is true for the consumer, which can simply use local references to the Attributes.

Adding a custom Attribute

Now we're going to implement our own custom Attribute for part-of-speech tagging and, consequently, call it
PartOfSpeechAttribute. First we need to define the interface of the new Attribute:
+  public interface PartOfSpeechAttribute extends Attribute {
+    public static enum PartOfSpeech {
+      Noun, Verb, Adjective, Adverb, Pronoun, Preposition, Conjunction, Article, Unknown
+    }
+  
+    public void setPartOfSpeech(PartOfSpeech pos);
+  
+    public PartOfSpeech getPartOfSpeech();
+  }
+
Now we also need to write the implementing class. The name of that class is important here: by default, Lucene
checks if there is a class with the name of the Attribute plus the suffix 'Impl'. In this example, we would
consequently call the implementing class PartOfSpeechAttributeImpl.
This is the usual approach. However, there is also an expert API that allows changing these naming conventions:
{@link org.apache.lucene.util.AttributeSource.AttributeFactory}. The factory accepts an Attribute interface as argument
and returns an actual instance. You can implement your own factory if you need to change the default behavior.
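For example, a custom factory might look roughly like this sketch (MyFasterPartOfSpeechAttributeImpl is a made-up alternative implementation):

      AttributeSource.AttributeFactory myFactory = new AttributeSource.AttributeFactory() {
        public AttributeImpl createAttributeInstance(Class<? extends Attribute> attClass) {
          if (attClass == PartOfSpeechAttribute.class) {
            return new MyFasterPartOfSpeechAttributeImpl(); // hypothetical alternative implementation
          }
          // Fall back to the default "interface name + Impl" convention for everything else.
          return AttributeSource.AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY.createAttributeInstance(attClass);
        }
      };

      // Tokenizers accept an AttributeFactory, so the whole chain will use it:
      TokenStream stream = new WhitespaceTokenizer(Version.LUCENE_35, myFactory, reader);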

+ +Now here is the actual class that implements our new Attribute. Notice that the class has to extend +{@link org.apache.lucene.util.AttributeImpl}: + +
+public final class PartOfSpeechAttributeImpl extends AttributeImpl 
+                            implements PartOfSpeechAttribute{
+  
+  private PartOfSpeech pos = PartOfSpeech.Unknown;
+  
+  public void setPartOfSpeech(PartOfSpeech pos) {
+    this.pos = pos;
+  }
+  
+  public PartOfSpeech getPartOfSpeech() {
+    return pos;
+  }
+
+  public void clear() {
+    pos = PartOfSpeech.Unknown;
+  }
+
+  public void copyTo(AttributeImpl target) {
+    ((PartOfSpeechAttributeImpl) target).pos = pos;
+  }
+
+  public boolean equals(Object other) {
+    if (other == this) {
+      return true;
+    }
+    
+    if (other instanceof PartOfSpeechAttributeImpl) {
+      return pos == ((PartOfSpeechAttributeImpl) other).pos;
+    }
+ 
+    return false;
+  }
+
+  public int hashCode() {
+    return pos.ordinal();
+  }
+}
+
This simple Attribute implementation has only a single variable that stores the part-of-speech of a token. It extends the
new AttributeImpl class and therefore must implement its abstract methods clear(), copyTo(), equals(), and hashCode().
Now we need a TokenFilter that can set this new PartOfSpeechAttribute for each token. In this example we show a very naive filter
that tags every word with a leading upper-case letter as a 'Noun' and all other words as 'Unknown'.
+  public static class PartOfSpeechTaggingFilter extends TokenFilter {
+    PartOfSpeechAttribute posAtt;
+    TermAttribute termAtt;
+    
+    protected PartOfSpeechTaggingFilter(TokenStream input) {
+      super(input);
+      posAtt = addAttribute(PartOfSpeechAttribute.class);
+      termAtt = addAttribute(TermAttribute.class);
+    }
+    
+    public boolean incrementToken() throws IOException {
+      if (!input.incrementToken()) {return false;}
+      posAtt.setPartOfSpeech(determinePOS(termAtt.termBuffer(), 0, termAtt.termLength()));
+      return true;
+    }
+    
+    // determine the part of speech for the given term
+    protected PartOfSpeech determinePOS(char[] term, int offset, int length) {
+      // naive implementation that tags every uppercased word as noun
+      if (length > 0 && Character.isUpperCase(term[0])) {
+        return PartOfSpeech.Noun;
+      }
+      return PartOfSpeech.Unknown;
+    }
+  }
+
Just like the LengthFilter, this new filter accesses the attributes it needs in the constructor and
stores references to them in instance variables. Notice how you only need to pass in the interface of the new
Attribute; instantiating the correct implementing class is taken care of automatically.
Now we need to add the filter to the chain:
+  public TokenStream tokenStream(String fieldName, Reader reader) {
+    TokenStream stream = new WhitespaceTokenizer(reader);
+    stream = new LengthFilter(stream, 3, Integer.MAX_VALUE);
+    stream = new PartOfSpeechTaggingFilter(stream);
+    return stream;
+  }
+
+Now let's look at the output: +
+This
+demo
+the
+new
+TokenStream
+API
+
The output apparently hasn't changed, which shows that adding a custom attribute to a TokenStream/Filter chain does not
affect any existing consumers, simply because they don't know about the new Attribute. Now let's change the consumer
to make use of the new PartOfSpeechAttribute and print it out:
+  public static void main(String[] args) throws IOException {
+    // text to tokenize
+    final String text = "This is a demo of the new TokenStream API";
+    
+    MyAnalyzer analyzer = new MyAnalyzer();
+    TokenStream stream = analyzer.tokenStream("field", new StringReader(text));
+    
+    // get the TermAttribute from the TokenStream
+    TermAttribute termAtt = stream.addAttribute(TermAttribute.class);
+    
+    // get the PartOfSpeechAttribute from the TokenStream
+    PartOfSpeechAttribute posAtt = stream.addAttribute(PartOfSpeechAttribute.class);
+    
+    stream.reset();
+
+    // print all tokens until stream is exhausted
+    while (stream.incrementToken()) {
+      System.out.println(termAtt.term() + ": " + posAtt.getPartOfSpeech());
+    }
+    
+    stream.end();
+    stream.close();
+  }
+
+The change that was made is to get the PartOfSpeechAttribute from the TokenStream and print out its contents in +the while loop that consumes the stream. Here is the new output: +
+This: Noun
+demo: Unknown
+the: Unknown
+new: Unknown
+TokenStream: Noun
+API: Noun
+
Each word is now followed by its assigned PartOfSpeech tag. Of course this is naive
part-of-speech tagging. The word 'This' should not even be tagged as a noun; it is only capitalized because it
is the first word of a sentence. Actually this is a good opportunity for an exercise. To practice the usage of the new
API, the reader could now write an Attribute and TokenFilter that can specify for each word whether it was the first token
of a sentence or not. Then the PartOfSpeechTaggingFilter can make use of this knowledge and only tag capitalized words
as nouns if they are not the first word of a sentence (we know this is still not correct behavior, but hey, it's a good exercise).
As a small hint, this is how the new Attribute class could begin:
+  public class FirstTokenOfSentenceAttributeImpl extends AttributeImpl
+                   implements FirstTokenOfSentenceAttribute {
+    
+    private boolean firstToken;
+    
+    public void setFirstToken(boolean firstToken) {
+      this.firstToken = firstToken;
+    }
+    
+    public boolean getFirstToken() {
+      return firstToken;
+    }
+
+    public void clear() {
+      firstToken = false;
+    }
+
+  ...
+
+ +
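For completeness, the matching Attribute interface might look like this (a sketch; completing the filter is left as the exercise):

      public interface FirstTokenOfSentenceAttribute extends Attribute {
        public void setFirstToken(boolean firstToken);
        public boolean getFirstToken();
      }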