API and code to convert text into indexable/searchable tokens. Covers {@link org.apache.lucene.analysis.Analyzer} and related classes.

Lucene, an indexing and search library, accepts only plain text input.

Applications that build their search capabilities upon Lucene may support documents in various formats – HTML, XML, PDF, Word – just to name a few. Lucene does not care about the parsing of these and other document formats, and it is the responsibility of the application using Lucene to use an appropriate parser to convert the original format into plain text before passing that plain text to Lucene.
Plain text passed to Lucene for indexing goes through a process generally called tokenization. Tokenization is the process of breaking input text into small indexing elements – tokens. The way input text is broken into tokens heavily influences how people will then be able to search for that text. For instance, sentence beginnings and endings can be identified to provide for more accurate phrase and proximity searches (though sentence identification is not provided by Lucene).
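For illustration (assuming a typical analyzer that splits on whitespace and punctuation and down-cases – the exact result depends on the analyzer chosen), the input "The quick brown fox!" might be tokenized as:

  [the]  [quick]  [brown]  [fox]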
In some cases simply breaking the input text into tokens is not enough – a deeper Analysis may be needed. There are many post-tokenization steps that can be done, including (but not limited to):

Stemming – Replacing words with their stems. For instance with English stemming "bikes" is replaced with "bike"; now query "bike" can find both documents containing "bike" and those containing "bikes".

Stop Words Filtering – Common words like "the", "and" and "a" rarely add any value to a search. Removing them shrinks the index size and increases performance. It may also reduce some "noise" and actually improve search quality.

Text Normalization – Stripping accents and other character markings can make for better searching.

Synonym Expansion – Adding in synonyms at the same token position as the current word can mean better matching when users search with words in the thesaurus.
The analysis package provides the mechanism to convert Strings and Readers into tokens that can be indexed by Lucene. There are three main classes in the package from which all analysis processes are derived. These are:

{@link org.apache.lucene.analysis.Analyzer} – An Analyzer is responsible for supplying a {@link org.apache.lucene.analysis.TokenStream} which can be consumed by the indexing and searching processes.

{@link org.apache.lucene.analysis.Tokenizer} – A Tokenizer is a TokenStream and is responsible for breaking up incoming text into tokens. In most cases, an Analyzer will use a Tokenizer as the first step in the analysis process.

{@link org.apache.lucene.analysis.TokenFilter} – A TokenFilter is also a TokenStream and is responsible for modifying tokens that have been created by the Tokenizer. Common modifications performed by a TokenFilter are: deletion, stemming, synonym injection, and down casing. A typical chain is sketched below.
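As a minimal sketch of such a chain (the version constant and stop set are illustrative choices, not prescribed by this package):

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new StandardTokenizer(Version.LUCENE_35, reader); // Tokenizer: break text into tokens
    stream = new LowerCaseFilter(Version.LUCENE_35, stream);               // TokenFilter: down-case terms
    stream = new StopFilter(Version.LUCENE_35, stream,
                            StopAnalyzer.ENGLISH_STOP_WORDS_SET);          // TokenFilter: delete stop words
    return stream;
  }

An Analyzer would return this chain from its tokenStream() method, so that indexing and searching consume the filtered tokens rather than the raw ones.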
The synergy between {@link org.apache.lucene.analysis.Analyzer} and {@link org.apache.lucene.analysis.Tokenizer} is sometimes confusing. To ease this confusion, some clarifications:

The Analyzer is responsible for the entire task of creating tokens out of the input text, while the Tokenizer is only responsible for breaking the input text into tokens. Very likely, tokens created by the Tokenizer would be modified or even omitted by the Analyzer (via one or more TokenFilters) before being returned.

Tokenizer is a TokenStream, but Analyzer is not.

Analyzer is "field aware", but Tokenizer is not.
Lucene Java provides a number of analysis capabilities, the most commonly used one being the {@link org.apache.lucene.analysis.standard.StandardAnalyzer}. Many applications will have a long and industrious life with nothing more than the StandardAnalyzer. However, there are a few other classes/packages that are worth mentioning:

{@link org.apache.lucene.analysis.PerFieldAnalyzerWrapper} – Most Analyzers perform the same operation on all Fields. The PerFieldAnalyzerWrapper can be used to associate a different Analyzer with different Fields.

The contrib/analyzers library located at the root of the Lucene distribution has a number of different Analyzer implementations to solve a variety of different problems related to searching. Many of the Analyzers are designed to analyze non-English languages.

There are a variety of Tokenizer and TokenFilter implementations in this package. Take a look around, chances are someone has implemented what you need.
Analysis is one of the main causes of performance degradation during indexing. Simply put, the more you analyze, the slower the indexing (in most cases). Perhaps your application would be just fine using the simple {@link org.apache.lucene.analysis.WhitespaceTokenizer} combined with a {@link org.apache.lucene.analysis.StopFilter}. The contrib/benchmark library can be useful for testing out the speed of the analysis process.
Applications usually do not invoke analysis – Lucene does it for them:

At indexing, as a consequence of adding a document via the IndexWriter, the Analyzer in effect for indexing is invoked for each indexable field of the added document.

At search, a QueryParser may invoke the Analyzer during query parsing. Note that for some queries, analysis does not take place, e.g. wildcard queries.

However, an application might invoke analysis of any text for testing or for any other purpose, something like:

  Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_35); // or any other analyzer
  TokenStream ts = analyzer.tokenStream("myfield", new StringReader("some text goes here"));
  ts.reset();
  while (ts.incrementToken()) {
    System.out.println("token: " + ts);
  }
  ts.end();
  ts.close();
Selecting the "correct" analyzer is crucial for search quality, and can also affect indexing and search performance. The "correct" analyzer differs between applications. Lucene java's wiki page AnalysisParalysis provides some data on "analyzing your analyzer". Here are some rules of thumb:

1. Test test test... (did we say test?)

2. Beware of over analysis – it might hurt indexing performance.

3. Start with the same analyzer for indexing and search, otherwise searches would not find what they are looking for.

4. In some cases a different analyzer is required for indexing and search, for instance: certain searches require more stop words to be filtered (i.e. more than those that were filtered at indexing); see the sketch after this list.
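A minimal sketch of rule 4 (the extra stop word and the version constant are assumptions for illustration):

  Set<Object> extendedStopWords = new HashSet<Object>(StopAnalyzer.ENGLISH_STOP_WORDS_SET);
  extendedStopWords.add("sky"); // hypothetical extra search-time stop word
  Analyzer indexAnalyzer  = new StandardAnalyzer(Version.LUCENE_35);                    // indexing keeps more tokens
  Analyzer searchAnalyzer = new StandardAnalyzer(Version.LUCENE_35, extendedStopWords); // search filters the extended set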
Creating your own Analyzer is straightforward. It usually involves either wrapping an existing Tokenizer and set of TokenFilters to create a new Analyzer, or creating both the Analyzer and a Tokenizer or TokenFilter. Before pursuing this approach, you may find it worthwhile to explore the contrib/analyzers library and/or ask on the java-user@lucene.apache.org mailing list first to see if what you need already exists. If you are still committed to creating your own Analyzer or TokenStream derivation (Tokenizer or TokenFilter), have a look at the source code of any one of the many samples located in this package.

The following sections discuss some aspects of implementing your own analyzer.
When {@link org.apache.lucene.document.Document#add(org.apache.lucene.document.Fieldable) document.add(field)} is called multiple times for the same field name, we could say that each such call creates a new section for that field in that document. In fact, a separate call to {@link org.apache.lucene.analysis.Analyzer#tokenStream(java.lang.String, java.io.Reader) tokenStream(field,reader)} would take place for each of these so-called "sections". However, the default Analyzer behavior is to treat all these sections as one large section. This allows phrase search and proximity search to seamlessly cross boundaries between these "sections". In other words, if a certain field "f" is added like this:

  document.add(new Field("f","first ends",...));
  document.add(new Field("f","starts two",...));
  indexWriter.addDocument(document);

then a phrase search for "ends starts" would find that document. Where desired, this behavior can be modified by introducing a "position gap" between consecutive field "sections", simply by overriding {@link org.apache.lucene.analysis.Analyzer#getPositionIncrementGap(java.lang.String) Analyzer.getPositionIncrementGap(fieldName)}:
  Analyzer myAnalyzer = new StandardAnalyzer(Version.LUCENE_35) {
    @Override
    public int getPositionIncrementGap(String fieldName) {
      return 10;
    }
  };
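A note on choosing the gap: a phrase or proximity query can still match across sections if its slop is at least as large as the gap, so pick a gap larger than any slop the application will ever use (the 10 above comfortably exceeds typical slops of 0-3).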
By default, all tokens created by Analyzers and Tokenizers have a {@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute#getPositionIncrement() position increment} of one. This means that the position stored for that token in the index would be one more than that of the previous token. Recall that phrase and proximity searches rely on position info.
If the selected analyzer filters the stop words "is" and "the", then for a document containing the string "blue is the sky", only the tokens "blue" and "sky" are indexed, with position("sky") = 1 + position("blue"). Now, a phrase query "blue is the sky" would find that document, because the same analyzer filters the same stop words from that query. But the phrase query "blue sky" would also find that document.
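To see what positions a given analyzer actually assigns, a small diagnostic sketch like the following can help (the "analyzer" variable and field name are placeholders; whether removed stop words leave position holes depends on the analyzer):

  TokenStream ts = analyzer.tokenStream("f", new StringReader("blue is the sky"));
  TermAttribute termAtt = ts.addAttribute(TermAttribute.class);
  PositionIncrementAttribute posIncrAtt = ts.addAttribute(PositionIncrementAttribute.class);
  ts.reset();
  int position = -1;
  while (ts.incrementToken()) {
    position += posIncrAtt.getPositionIncrement(); // holes left by removed words enlarge the step
    System.out.println(termAtt.term() + " at position " + position);
  }
  ts.end();
  ts.close();

With the default behavior described above, this prints "blue at position 0" and "sky at position 1".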
If this behavior does not fit the application's needs, a modified analyzer can be used that further increments the positions of tokens following a removed stop word, using {@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute#setPositionIncrement(int)}. This can be done with something like:
  public TokenStream tokenStream(final String fieldName, Reader reader) {
    final TokenStream ts = someAnalyzer.tokenStream(fieldName, reader);
    // extend TokenFilter so the anonymous class shares its Attributes with ts
    TokenStream res = new TokenFilter(ts) {
      TermAttribute termAtt = addAttribute(TermAttribute.class);
      PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);

      public boolean incrementToken() throws IOException {
        int extraIncrement = 0;
        while (true) {
          boolean hasNext = input.incrementToken();
          if (hasNext) {
            if (stopWords.contains(termAtt.term())) {
              extraIncrement++; // filter this word
              continue;
            }
            if (extraIncrement > 0) {
              posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + extraIncrement);
            }
          }
          return hasNext;
        }
      }
    };
    return res;
  }

Now, with this modified analyzer, the phrase query "blue is the sky" would find that document, while "blue sky" would not. But note that this is not yet a perfect solution, because any phrase query "blue w1 w2 sky" where both w1 and w2 are stop words would match that document.
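Note that Lucene's bundled {@link org.apache.lucene.analysis.StopFilter} can already produce this behavior: its enablePositionIncrements setting (on by default when a 2.9+ Version is used) makes the filter skip positions over removed stop words, so a custom filter like the one above is often unnecessary.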
A few more use cases for modifying position increments are:

Inhibiting phrase and proximity matches in sentence boundaries – for this, a tokenizer that identifies a new sentence can add 1 to the position increment of the first token of the new sentence.

Injecting synonyms – here, synonyms of a token should be added after that token, and their position increment should be set to 0. As a result, all synonyms of a token will be considered to appear in exactly the same position as that token, and so will be seen by phrase and proximity searches. A sketch of such a synonym filter follows.
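A minimal synonym-injection sketch (the SimpleSynonymFilter name and the one-to-one synonym Map are assumptions for illustration; real synonym handling, e.g. multi-word synonyms, needs considerably more care):

  public final class SimpleSynonymFilter extends TokenFilter {
    private final Map<String, String> synonyms; // e.g. {"quick" -> "fast"}, supplied by the caller
    private final TermAttribute termAtt = addAttribute(TermAttribute.class);
    private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
    private State pending;          // captured attributes of the original token
    private String pendingSynonym;  // synonym text to emit next

    public SimpleSynonymFilter(TokenStream input, Map<String, String> synonyms) {
      super(input);
      this.synonyms = synonyms;
    }

    public boolean incrementToken() throws IOException {
      if (pending != null) {
        restoreState(pending);                 // replay offsets, type, etc. of the original token
        pending = null;
        termAtt.setTermBuffer(pendingSynonym); // replace the term text with the synonym
        posIncrAtt.setPositionIncrement(0);    // same position as the original token
        return true;
      }
      if (!input.incrementToken()) {
        return false;
      }
      String synonym = synonyms.get(termAtt.term());
      if (synonym != null) {
        pending = captureState();              // remember this token for the synonym copy
        pendingSynonym = synonym;
      }
      return true;
    }

    public void reset() throws IOException {
      super.reset();
      pending = null;
    }
  }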
With Lucene 2.9 we introduce a new TokenStream API. The old API used to produce Tokens. A Token has getter and setter methods for different properties like positionIncrement and termText. While this approach was sufficient for the default indexing format, it is not versatile enough for Flexible Indexing, a term which summarizes the effort of making the Lucene indexer pluggable and extensible for custom index formats.

A fully customizable indexer means that users will be able to store custom data structures on disk. Therefore an API is necessary that can transport custom types of data from the documents to the indexer.
Lucene now provides six Attributes out of the box, which replace the variables the Token class has:

{@link org.apache.lucene.analysis.tokenattributes.TermAttribute} – The term text of a token.

{@link org.apache.lucene.analysis.tokenattributes.OffsetAttribute} – The start and end offset of a token in characters.

{@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute} – See above for detailed information about position increment.

{@link org.apache.lucene.analysis.tokenattributes.PayloadAttribute} – The payload that a Token can optionally have.

{@link org.apache.lucene.analysis.tokenattributes.TypeAttribute} – The type of the token. Default is 'word'.

{@link org.apache.lucene.analysis.tokenattributes.FlagsAttribute} – Optional flags a token can have.
Attributes are added to, and acquired from, a TokenStream by calling {@link org.apache.lucene.util.AttributeSource#addAttribute(Class) addAttribute(Class)}, which takes the type (Class) of an Attribute as an argument and returns an instance. If an Attribute of the same type was previously added, then the already existing instance is returned; otherwise a new instance is created and returned. Therefore TokenStreams/-Filters can safely call addAttribute() with the same Attribute type multiple times. Even consumers of TokenStreams should normally call addAttribute() instead of getAttribute(), because it will not fail if the TokenStream does not have this Attribute (getAttribute() would throw an IllegalArgumentException if the Attribute is missing). More advanced code could simply check with hasAttribute() whether a TokenStream has it, and may conditionally leave out processing for extra performance.
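For example, a consumer might do the following (a sketch; the "analyzer" variable, field name and text are placeholders):

  TokenStream ts = analyzer.tokenStream("f", new StringReader("some text"));
  TermAttribute termAtt = ts.addAttribute(TermAttribute.class);  // always succeeds
  PayloadAttribute payloadAtt = null;
  if (ts.hasAttribute(PayloadAttribute.class)) {
    payloadAtt = ts.getAttribute(PayloadAttribute.class);        // safe: checked above
  }
  ts.reset();
  while (ts.incrementToken()) {
    // process termAtt; consult payloadAtt only if it is non-null
  }
  ts.end();
  ts.close();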
The following example shows a simple analyzer that performs whitespace tokenization and a main() method that consumes its TokenStream:

public class MyAnalyzer extends Analyzer {

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new WhitespaceTokenizer(reader);
    return stream;
  }

  public static void main(String[] args) throws IOException {
    // text to tokenize
    final String text = "This is a demo of the new TokenStream API";

    MyAnalyzer analyzer = new MyAnalyzer();
    TokenStream stream = analyzer.tokenStream("field", new StringReader(text));

    // get the TermAttribute from the TokenStream
    TermAttribute termAtt = stream.addAttribute(TermAttribute.class);

    stream.reset();

    // print all tokens until stream is exhausted
    while (stream.incrementToken()) {
      System.out.println(termAtt.term());
    }

    stream.end();
    stream.close();
  }
}

In this simple example a whitespace tokenization is performed. In main() a loop consumes the stream and prints the term text of the tokens by accessing the TermAttribute that the WhitespaceTokenizer provides. Here is the output:
This
is
a
demo
of
the
new
TokenStream
API

Now we want to add a LengthFilter to the chain that keeps only tokens of a certain length; only the tokenStream() method in our analyzer needs to change:
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new WhitespaceTokenizer(reader);
    stream = new LengthFilter(stream, 3, Integer.MAX_VALUE);
    return stream;
  }

Note how only words with 3 or more characters are now contained in the output:
This
demo
the
new
TokenStream
API

Now let's take a look at how the LengthFilter is implemented (it is part of Lucene's core):
public final class LengthFilter extends TokenFilter {

  final int min;
  final int max;

  private TermAttribute termAtt;

  /**
   * Build a filter that removes words that are too long or too
   * short from the text.
   */
  public LengthFilter(TokenStream in, int min, int max)
  {
    super(in);
    this.min = min;
    this.max = max;
    termAtt = addAttribute(TermAttribute.class);
  }

  /**
   * Returns the next input Token whose term() is the right length
   */
  public final boolean incrementToken() throws IOException
  {
    assert termAtt != null;
    // skip tokens that are too short or too long
    while (input.incrementToken()) {
      int len = termAtt.termLength();
      if (len >= min && len <= max) {
        return true;
      }
    }
    // reached EOS -- return false
    return false;
  }
}

The TermAttribute is added in the constructor and stored in the instance variable termAtt. Remember that there can only be a single instance of TermAttribute in the chain, so in our example the addAttribute() call in LengthFilter returns the TermAttribute that the WhitespaceTokenizer already added. The tokens are retrieved from the input stream in the incrementToken() method. By looking at the term text in the TermAttribute, the length of the term can be determined and tokens that are too short or too long are skipped. Note how incrementToken() can efficiently access the instance variable; no attribute lookup is necessary. The same is true for the consumer, which can simply use local references to the Attributes.
Now we're going to implement our own custom Attribute for part-of-speech tagging and call it PartOfSpeechAttribute. First we need to define the interface of the new Attribute:
  public interface PartOfSpeechAttribute extends Attribute {
    public static enum PartOfSpeech {
      Noun, Verb, Adjective, Adverb, Pronoun, Preposition, Conjunction, Article, Unknown
    }

    public void setPartOfSpeech(PartOfSpeech pos);

    public PartOfSpeech getPartOfSpeech();
  }

Now we also need to write the implementing class. The name of that class is important here: by default, Lucene checks if there is a class with the name of the Attribute with the postfix 'Impl'. In this example, we would consequently call the implementing class PartOfSpeechAttributeImpl.

public final class PartOfSpeechAttributeImpl extends AttributeImpl
                implements PartOfSpeechAttribute {

  private PartOfSpeech pos = PartOfSpeech.Unknown;

  public void setPartOfSpeech(PartOfSpeech pos) {
    this.pos = pos;
  }

  public PartOfSpeech getPartOfSpeech() {
    return pos;
  }

  public void clear() {
    pos = PartOfSpeech.Unknown;
  }

  public void copyTo(AttributeImpl target) {
    ((PartOfSpeechAttributeImpl) target).pos = pos;
  }

  public boolean equals(Object other) {
    if (other == this) {
      return true;
    }

    if (other instanceof PartOfSpeechAttributeImpl) {
      return pos == ((PartOfSpeechAttributeImpl) other).pos;
    }

    return false;
  }

  public int hashCode() {
    return pos.ordinal();
  }
}

This simple Attribute implementation has only a single variable that stores the part-of-speech of a token. It extends the new AttributeImpl class and therefore implements its abstract methods clear(), copyTo(), equals(), and hashCode().
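As an aside (based on the AttributeSource API rather than this example): if the 'Impl' naming convention is inconvenient for an application, a custom AttributeSource.AttributeFactory can be supplied to a TokenStream to control how Attribute instances are created; see the AttributeSource documentation for details.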
Now we need a TokenFilter that can set this new PartOfSpeechAttribute for each token. In this example we show a very naive filter that tags every word with a leading upper-case letter as a 'Noun' and all other words as 'Unknown':
  public static class PartOfSpeechTaggingFilter extends TokenFilter {
    PartOfSpeechAttribute posAtt;
    TermAttribute termAtt;

    protected PartOfSpeechTaggingFilter(TokenStream input) {
      super(input);
      posAtt = addAttribute(PartOfSpeechAttribute.class);
      termAtt = addAttribute(TermAttribute.class);
    }

    public boolean incrementToken() throws IOException {
      if (!input.incrementToken()) { return false; }
      posAtt.setPartOfSpeech(determinePOS(termAtt.termBuffer(), 0, termAtt.termLength()));
      return true;
    }

    // determine the part of speech for the given term
    protected PartOfSpeech determinePOS(char[] term, int offset, int length) {
      // naive implementation that tags every uppercased word as noun
      if (length > 0 && Character.isUpperCase(term[0])) {
        return PartOfSpeech.Noun;
      }
      return PartOfSpeech.Unknown;
    }
  }

Just like the LengthFilter, this new filter accesses the attributes it needs in the constructor and stores references in instance variables. Notice how you only need to pass in the interface of the new Attribute; instantiating the correct class is taken care of automatically. Now we need to add the filter to the chain:
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new WhitespaceTokenizer(reader);
    stream = new LengthFilter(stream, 3, Integer.MAX_VALUE);
    stream = new PartOfSpeechTaggingFilter(stream);
    return stream;
  }

Now let's look at the output:
This
demo
the
new
TokenStream
API

Apparently it hasn't changed, which shows that adding a custom attribute to a TokenStream/Filter chain does not affect any existing consumers, simply because they don't know about the new Attribute. Now let's change the consumer to make use of the new PartOfSpeechAttribute and print it out:
  public static void main(String[] args) throws IOException {
    // text to tokenize
    final String text = "This is a demo of the new TokenStream API";

    MyAnalyzer analyzer = new MyAnalyzer();
    TokenStream stream = analyzer.tokenStream("field", new StringReader(text));

    // get the TermAttribute from the TokenStream
    TermAttribute termAtt = stream.addAttribute(TermAttribute.class);

    // get the PartOfSpeechAttribute from the TokenStream
    PartOfSpeechAttribute posAtt = stream.addAttribute(PartOfSpeechAttribute.class);

    stream.reset();

    // print all tokens until stream is exhausted
    while (stream.incrementToken()) {
      System.out.println(termAtt.term() + ": " + posAtt.getPartOfSpeech());
    }

    stream.end();
    stream.close();
  }

The change that was made is to get the PartOfSpeechAttribute from the TokenStream and print out its contents in the while loop that consumes the stream. Here is the new output:
This: Noun
demo: Unknown
the: Unknown
new: Unknown
TokenStream: Noun
API: Noun

Each word is now followed by its assigned PartOfSpeech tag. Of course this is naive part-of-speech tagging. The word 'This' should not even be tagged as a noun; it is only capitalized because it is the first word of a sentence. Actually this is a good opportunity for an exercise. To practice the usage of the new API, the reader could now write an Attribute and TokenFilter that can specify for each word whether it was the first token of a sentence or not. Then the PartOfSpeechTaggingFilter can make use of this knowledge and only tag capitalized words as nouns if they are not the first word of a sentence (we know, this is still not correct behavior, but hey, it's a good exercise). As a small hint, this is how the new Attribute class could begin:
  public class FirstTokenOfSentenceAttributeImpl extends AttributeImpl
                implements FirstTokenOfSentenceAttribute {

    private boolean firstToken;

    public void setFirstToken(boolean firstToken) {
      this.firstToken = firstToken;
    }

    public boolean getFirstToken() {
      return firstToken;
    }

    public void clear() {
      firstToken = false;
    }

    ...