API and code to convert text into indexable/searchable tokens. Covers {@link org.apache.lucene.analysis.Analyzer} and related classes.
Lucene, an indexing and search library, accepts only plain text input.

Applications that build their search capabilities upon Lucene may support documents in various formats – HTML, XML, PDF, Word – just to name a few. Lucene does not care about the parsing of these and other document formats; it is the responsibility of the application using Lucene to use an appropriate parser to convert the original format into plain text before passing that plain text to Lucene.
Plain text passed to Lucene for indexing goes through a process generally called tokenization. Tokenization is the process of breaking input text into small indexing elements – tokens. The way input text is broken into tokens heavily influences how people will then be able to search for that text. For instance, sentence beginnings and endings can be identified to provide for more accurate phrase and proximity searches (though sentence identification is not provided by Lucene).

In some cases simply breaking the input text into tokens is not enough – a deeper analysis may be needed. There are many post-tokenization steps that can be done, including (but not limited to):

- Stemming – Replacing words with their stems. For instance with English stemming "bikes" is replaced by "bike"; now a query for "bike" can find both documents containing "bike" and those containing "bikes".
- Stop Words Filtering – Common words like "the", "and" and "a" rarely add any value to a search. Removing them shrinks the index size and increases performance. It may also reduce some "noise" and actually improve search quality.
- Text Normalization – Stripping accents and other character markings can make for better searching.
- Synonym Expansion – Adding in synonyms at the same token position as the current word can mean better matching when users search with words in the synonym set.
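To make these steps concrete, here is a minimal sketch of an analysis chain built from classes shipped with Lucene 3.x (the class name StemmingAnalyzer is ours, not part of Lucene):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

public class StemmingAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // break the input on whitespace
    TokenStream stream = new WhitespaceTokenizer(reader);
    // drop common English stop words
    stream = new StopFilter(Version.LUCENE_34, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    // reduce the remaining tokens to their stems
    return new PorterStemFilter(stream);
  }
}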
The analysis package provides the mechanism to convert Strings and Readers into tokens that can be indexed by Lucene. There are three main classes in the package from which all analysis processes are derived. These are:

- {@link org.apache.lucene.analysis.Analyzer} – An Analyzer is responsible for supplying a TokenStream that can be consumed by the indexing and searching processes.
- {@link org.apache.lucene.analysis.Tokenizer} – A Tokenizer is a TokenStream and is responsible for breaking up incoming text into tokens. In most cases, an Analyzer will use a Tokenizer as the first step in the analysis process.
- {@link org.apache.lucene.analysis.TokenFilter} – A TokenFilter is also a TokenStream and is responsible for modifying tokens that have been created by the Tokenizer. Common modifications performed by a TokenFilter are: deletion, stemming, synonym injection, and down casing.

The synergy between {@link org.apache.lucene.analysis.Analyzer} and {@link org.apache.lucene.analysis.Tokenizer} is sometimes confusing. To clear up some of this confusion:

- The Analyzer is responsible for the entire task of creating tokens out of the input text, while the Tokenizer is only responsible for breaking the input text into tokens. Tokens created by the Tokenizer may well be modified or even omitted by the Analyzer (via one or more TokenFilters) before being returned.
- Tokenizer is a TokenStream, but Analyzer is not.
- Analyzer is "field aware", but Tokenizer is not.

Lucene Java provides a number of analysis capabilities, the most commonly used one being the {@link org.apache.lucene.analysis.standard.StandardAnalyzer}. Many applications will have a long and industrious life with nothing more than the StandardAnalyzer. However, there are a few other classes/packages that are worth mentioning:

- {@link org.apache.lucene.analysis.PerFieldAnalyzerWrapper} – Most Analyzers perform the same operation on all {@link org.apache.lucene.document.Field}s. The PerFieldAnalyzerWrapper can be used to associate a different Analyzer with different Fields.
- The contrib/analyzers library located at the root of the Lucene distribution has a number of different Analyzer implementations to solve a variety of different problems related to searching/indexing.
- There are a variety of Tokenizer and TokenFilter implementations in this package. Take a look around; chances are someone has implemented what you need.
Analysis is one of the main causes of performance degradation during indexing. Simply put, the more you analyze, the slower the indexing (in most cases). Perhaps your application would be just fine using the simple {@link org.apache.lucene.analysis.WhitespaceTokenizer} combined with a {@link org.apache.lucene.analysis.StopFilter}. The contrib/benchmark library can be useful for testing the speed of the analysis process.

Applications usually do not invoke analysis – Lucene does it for them:

- At indexing, as a consequence of {@link org.apache.lucene.index.IndexWriter#addDocument(org.apache.lucene.document.Document) addDocument(doc)}, the Analyzer in effect for indexing is invoked for each indexed field of the added document.
- At search, a QueryParser may invoke the Analyzer on the query text. Note that for some queries, analysis does not take place, e.g. wildcard queries.

However, an application might invoke analysis of any text for testing or for any other purpose, something like:
  Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_34); // or any other analyzer
  TokenStream ts = analyzer.tokenStream("myfield", new StringReader("some text goes here"));
  ts.reset();
  while (ts.incrementToken()) {
    System.out.println("token: " + ts);
  }
  ts.end();
  ts.close();
Selecting the "correct" analyzer is crucial for search quality, and can also affect indexing and search performance. The "correct" analyzer differs between applications. Lucene java's wiki page AnalysisParalysis (http://wiki.apache.org/lucene-java/AnalysisParalysis) provides some data on "analyzing your analyzer". Here are some rules of thumb:

1. Test, test, test (did we say test?).
2. Beware of over-analysis – it might hurt indexing performance.
3. Start with the same analyzer for indexing and search; otherwise searches may not find what they are looking for. A sketch of this follows the list.
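As a hedged illustration of rule 3 – a fragment, not a complete program, where directory stands for an open Directory of your choosing – the same analyzer instance can be handed both to the IndexWriter and to the QueryParser:

  Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);

  // indexing: the analyzer is applied to each analyzed field of added documents
  IndexWriter writer = new IndexWriter(directory,
      new IndexWriterConfig(Version.LUCENE_34, analyzer));

  // searching: the same analyzer tokenizes the query text in the same way
  // (QueryParser.parse() declares ParseException)
  QueryParser parser = new QueryParser(Version.LUCENE_34, "myfield", analyzer);
  Query query = parser.parse("some text goes here");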
Creating your own Analyzer is straightforward. It usually involves either wrapping an existing Tokenizer and set of TokenFilters to create a new Analyzer, or creating both the Analyzer and a Tokenizer or TokenFilter. Before pursuing this approach, you may find it worthwhile to explore the contrib/analyzers library and/or ask on the java-user@lucene.apache.org mailing list first to see if what you need already exists. If you are still committed to creating your own Analyzer or TokenStream derivation (Tokenizer or TokenFilter), have a look at the source code of any one of the many samples located in this package.

The following sections discuss some aspects of implementing your own analyzer.
When {@link org.apache.lucene.document.Document#add(org.apache.lucene.document.Fieldable) document.add(field)} is called multiple times for the same field name, we could say that each such call creates a new section for that field in that document. In fact, a separate call to {@link org.apache.lucene.analysis.Analyzer#tokenStream(java.lang.String, java.io.Reader) tokenStream(field,reader)} would take place for each of these so-called "sections". However, the default Analyzer behavior is to treat all these sections as one large section. This allows phrase search and proximity search to seamlessly cross boundaries between these "sections". In other words, if a certain field "f" is added like this:

  document.add(new Field("f","first ends",...));
  document.add(new Field("f","starts two",...));
  indexWriter.addDocument(document);

then a phrase search for "ends starts" would find that document. Where desired, this behavior can be modified by introducing a "position gap" between consecutive field "sections", simply by overriding {@link org.apache.lucene.analysis.Analyzer#getPositionIncrementGap(java.lang.String) Analyzer.getPositionIncrementGap(fieldName)}:
  // StandardAnalyzer is final in Lucene 3.x, so delegate to it instead of subclassing
  Analyzer myAnalyzer = new Analyzer() {
    private final Analyzer delegate = new StandardAnalyzer(Version.LUCENE_34);

    public TokenStream tokenStream(String fieldName, Reader reader) {
      return delegate.tokenStream(fieldName, reader);
    }

    public int getPositionIncrementGap(String fieldName) {
      return 10;
    }
  };
By default, all tokens created by Analyzers and Tokenizers have a {@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute#getPositionIncrement() position increment} of one. This means that the position stored for a token in the index is one more than that of the previous token. Recall that phrase and proximity searches rely on position info.

If the selected analyzer filters the stop words "is" and "the", then for a document containing the string "blue is the sky", only the tokens "blue" and "sky" are indexed, with position("sky") = 1 + position("blue"). Now, a phrase query "blue is the sky" would find that document, because the same analyzer filters the same stop words from that query. But the phrase query "blue sky" would also find that document.

If this behavior does not fit the application's needs, a modified analyzer can be used that further increments the positions of tokens following a removed stop word, using {@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute#setPositionIncrement(int)}. This can be done with something like:
  public TokenStream tokenStream(final String fieldName, Reader reader) {
    final TokenStream ts = someAnalyzer.tokenStream(fieldName, reader);
    // a TokenFilter shares the Attributes of the stream it wraps, so termAtt
    // and posIncrAtt below see the tokens produced by ts
    TokenStream res = new TokenFilter(ts) {
      TermAttribute termAtt = addAttribute(TermAttribute.class);
      PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);

      public boolean incrementToken() throws IOException {
        int extraIncrement = 0;
        while (true) {
          boolean hasNext = input.incrementToken();
          if (hasNext) {
            if (stopWords.contains(termAtt.term())) {
              extraIncrement++; // filter this word
              continue;
            }
            if (extraIncrement > 0) {
              // account for the removed stop words
              posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + extraIncrement);
            }
          }
          return hasNext;
        }
      }
    };
    return res;
  }

Now, with this modified analyzer, the phrase query "blue sky" would no longer find that document, while "blue is the sky" still would. But note that this is not yet a perfect solution, because any phrase query "blue w1 w2 sky" where both w1 and w2 are stop words would match that document.
A few more use cases for modifying position increments are:

1. Inhibiting phrase and proximity matches across sentence boundaries – for this, a tokenizer that identifies a new sentence can add 1 to the position increment of the first token of the new sentence.
2. Injecting synonyms – here, synonyms of a token should be created at the same position as the original token, by emitting them with a position increment of 0. As a result, all synonyms of a token are considered to appear in exactly the same position as that token, and so are seen by phrase and proximity searches. A sketch of such a filter follows this list.
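Here is a hedged sketch of use case 2: a synonym-injecting TokenFilter. SingleSynonymFilter and its Map-based lookup are our own illustration, not a Lucene class; it assumes at most one synonym per term:

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.AttributeSource;

public final class SingleSynonymFilter extends TokenFilter {
  private final Map<String,String> synonyms;  // assumed: at most one synonym per term
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);
  private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
  private String pendingSynonym;              // synonym waiting to be emitted
  private AttributeSource.State savedState;   // captured state of the original token

  public SingleSynonymFilter(TokenStream input, Map<String,String> synonyms) {
    super(input);
    this.synonyms = synonyms;
  }

  public boolean incrementToken() throws IOException {
    if (pendingSynonym != null) {
      restoreState(savedState);               // reuse offsets/type of the original token
      termAtt.setTermBuffer(pendingSynonym);
      posIncrAtt.setPositionIncrement(0);     // same position as the original token
      pendingSynonym = null;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    pendingSynonym = synonyms.get(termAtt.term());
    if (pendingSynonym != null) {
      savedState = captureState();            // remember the token we just returned
    }
    return true;
  }

  public void reset() throws IOException {
    super.reset();
    pendingSynonym = null;
  }
}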
With Lucene 2.9 we introduced a new TokenStream API. The old API used to produce Tokens. A Token has getter and setter methods for different properties like positionIncrement and termText. While this approach was sufficient for the default indexing format, it is not versatile enough for Flexible Indexing, a term which summarizes the effort of making the Lucene indexer pluggable and extensible for custom index formats.

A fully customizable indexer means that users will be able to store custom data structures on disk. Therefore an API is necessary that can transport custom types of data from the documents to the indexer.
Lucene now provides six Attributes out of the box, which replace the variables the Token class has:

- TermAttribute – The term text of a token.
- OffsetAttribute – The start and end offset of a token in characters.
- PositionIncrementAttribute – See above for detailed information about position increments.
- PayloadAttribute – The payload that a Token can optionally have.
- TypeAttribute – The type of the token. Default is 'word'.
- FlagsAttribute – Optional flags a token can have.
Attributes are added to a TokenStream or TokenFilter by calling addAttribute(), which takes the type (Class) of an Attribute as an argument and returns an instance. If an Attribute of the same type was previously added, then the already existing instance is returned; otherwise a new instance is created and returned. Therefore TokenStreams/-Filters can safely call addAttribute() with the same Attribute type multiple times. Even consumers of TokenStreams should normally call addAttribute() instead of getAttribute(), because it does not fail if the TokenStream does not have this Attribute (getAttribute() would throw an IllegalArgumentException if the Attribute is missing). More advanced code could simply check with hasAttribute() whether a TokenStream has a given Attribute, and may conditionally leave out processing for extra performance.
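As a small hedged sketch of that last point (variable names are ours), a consumer can probe for an optional Attribute once, before the consumption loop:

  TokenStream stream = analyzer.tokenStream("myfield", new StringReader(text));
  // addAttribute() never fails: it creates the Attribute if the stream lacks it
  TermAttribute termAtt = stream.addAttribute(TermAttribute.class);
  // hasAttribute()/getAttribute() let us skip payload handling when no payloads are produced
  PayloadAttribute payloadAtt = stream.hasAttribute(PayloadAttribute.class)
      ? stream.getAttribute(PayloadAttribute.class)
      : null;
  stream.reset();
  while (stream.incrementToken()) {
    System.out.println(termAtt.term()
        + (payloadAtt == null ? "" : " payload: " + payloadAtt.getPayload()));
  }
  stream.end();
  stream.close();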
The following example shows a simple analyzer and a consumer of the new TokenStream API:

public class MyAnalyzer extends Analyzer {

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new WhitespaceTokenizer(reader);
    return stream;
  }

  public static void main(String[] args) throws IOException {
    // text to tokenize
    final String text = "This is a demo of the new TokenStream API";

    MyAnalyzer analyzer = new MyAnalyzer();
    TokenStream stream = analyzer.tokenStream("field", new StringReader(text));

    // get the TermAttribute from the TokenStream
    TermAttribute termAtt = stream.addAttribute(TermAttribute.class);

    stream.reset();

    // print all tokens until stream is exhausted
    while (stream.incrementToken()) {
      System.out.println(termAtt.term());
    }

    stream.end();
    stream.close();
  }
}

In this example a simple whitespace tokenization is performed. In main() a loop consumes the stream and prints the term text of the tokens by accessing the TermAttribute that the WhitespaceTokenizer provides. Here is the output:
This
is
a
demo
of
the
new
TokenStream
API

Now we want to add a LengthFilter to the chain that only keeps tokens which are at least 3 characters long. For this, only the tokenStream() method in MyAnalyzer needs to change:
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new WhitespaceTokenizer(reader);
    stream = new LengthFilter(stream, 3, Integer.MAX_VALUE);
    return stream;
  }

Note how now only words with 3 or more characters are contained in the output:
This
demo
the
new
TokenStream
API

Now let's take a look at how the LengthFilter is implemented (it is part of Lucene's core):
public final class LengthFilter extends TokenFilter {

  final int min;
  final int max;

  private TermAttribute termAtt;

  /**
   * Build a filter that removes words that are too long or too
   * short from the text.
   */
  public LengthFilter(TokenStream in, int min, int max)
  {
    super(in);
    this.min = min;
    this.max = max;
    termAtt = addAttribute(TermAttribute.class);
  }

  /**
   * Returns the next input token whose term() is of the right length.
   */
  public final boolean incrementToken() throws IOException
  {
    assert termAtt != null;
    // return the first token whose length is within [min, max]
    while (input.incrementToken()) {
      int len = termAtt.termLength();
      if (len >= min && len <= max) {
        return true;
      }
      // note: we skip the token, but should we index each part of it?
    }
    // reached end of stream
    return false;
  }
}
The TermAttribute is added in the constructor and stored in the instance variable termAtt. Remember that there can only be a single instance of TermAttribute in the chain, so in our example the addAttribute() call in LengthFilter returns the TermAttribute that the WhitespaceTokenizer already added. The tokens are retrieved from the input stream in the incrementToken() method. By looking at the term text in the TermAttribute, the length of the term can be determined and too short or too long tokens are skipped. Note how incrementToken() can efficiently access the instance variable; no attribute lookup is necessary. The same is true for the consumer, which can simply use local references to the Attributes.
Now we're going to implement our own custom Attribute for part-of-speech tagging and call it PartOfSpeechAttribute. First we need to define the interface of the new Attribute:
  public interface PartOfSpeechAttribute extends Attribute {
    public static enum PartOfSpeech {
      Noun, Verb, Adjective, Adverb, Pronoun, Preposition, Conjunction, Article, Unknown
    }

    public void setPartOfSpeech(PartOfSpeech pos);

    public PartOfSpeech getPartOfSpeech();
  }

Now we also need to write the implementing class. The name of that class is important here: by default, Lucene checks if there is a class with the name of the Attribute plus the suffix 'Impl'. In this example, we would consequently call the implementing class PartOfSpeechAttributeImpl.

public final class PartOfSpeechAttributeImpl extends AttributeImpl
                implements PartOfSpeechAttribute {

  private PartOfSpeech pos = PartOfSpeech.Unknown;

  public void setPartOfSpeech(PartOfSpeech pos) {
    this.pos = pos;
  }

  public PartOfSpeech getPartOfSpeech() {
    return pos;
  }

  public void clear() {
    pos = PartOfSpeech.Unknown;
  }

  public void copyTo(AttributeImpl target) {
    ((PartOfSpeechAttributeImpl) target).pos = pos;
  }

  public boolean equals(Object other) {
    if (other == this) {
      return true;
    }

    if (other instanceof PartOfSpeechAttributeImpl) {
      return pos == ((PartOfSpeechAttributeImpl) other).pos;
    }

    return false;
  }

  public int hashCode() {
    return pos.ordinal();
  }
}

This is a simple Attribute implementation that has only a single variable which stores the part-of-speech of a token. It extends the new AttributeImpl class and therefore implements its abstract methods clear(), copyTo(), equals(), and hashCode().
Now we need a TokenFilter that can set this new PartOfSpeechAttribute for each token. In this example we show a very naive filter that tags every word with a leading upper-case letter as a 'Noun' and all other words as 'Unknown'.
  public static class PartOfSpeechTaggingFilter extends TokenFilter {
    PartOfSpeechAttribute posAtt;
    TermAttribute termAtt;

    protected PartOfSpeechTaggingFilter(TokenStream input) {
      super(input);
      posAtt = addAttribute(PartOfSpeechAttribute.class);
      termAtt = addAttribute(TermAttribute.class);
    }

    public boolean incrementToken() throws IOException {
      if (!input.incrementToken()) { return false; }
      posAtt.setPartOfSpeech(determinePOS(termAtt.termBuffer(), 0, termAtt.termLength()));
      return true;
    }

    // determine the part of speech for the given term
    protected PartOfSpeech determinePOS(char[] term, int offset, int length) {
      // naive implementation that tags every uppercased word as noun
      if (length > 0 && Character.isUpperCase(term[0])) {
        return PartOfSpeech.Noun;
      }
      return PartOfSpeech.Unknown;
    }
  }

Just like the LengthFilter, this new filter accesses the attributes it needs in the constructor and stores references in instance variables. Notice how you only need to pass in the interface of the new Attribute; instantiating the correct implementing class is automatically taken care of. Now we need to add the filter to the chain:
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new WhitespaceTokenizer(reader);
    stream = new LengthFilter(stream, 3, Integer.MAX_VALUE);
    stream = new PartOfSpeechTaggingFilter(stream);
    return stream;
  }

Now let's look at the output:
This
demo
the
new
TokenStream
API

Apparently it hasn't changed, which shows that adding a custom attribute to a TokenStream/Filter chain does not affect any existing consumers, simply because they don't know about the new Attribute. Now let's change the consumer to make use of the new PartOfSpeechAttribute and print it out:
  public static void main(String[] args) throws IOException {
    // text to tokenize
    final String text = "This is a demo of the new TokenStream API";

    MyAnalyzer analyzer = new MyAnalyzer();
    TokenStream stream = analyzer.tokenStream("field", new StringReader(text));

    // get the TermAttribute from the TokenStream
    TermAttribute termAtt = stream.addAttribute(TermAttribute.class);

    // get the PartOfSpeechAttribute from the TokenStream
    PartOfSpeechAttribute posAtt = stream.addAttribute(PartOfSpeechAttribute.class);

    stream.reset();

    // print all tokens until stream is exhausted
    while (stream.incrementToken()) {
      System.out.println(termAtt.term() + ": " + posAtt.getPartOfSpeech());
    }

    stream.end();
    stream.close();
  }

The change that was made is to get the PartOfSpeechAttribute from the TokenStream and print out its contents in the while loop that consumes the stream. Here is the new output:
This: Noun
demo: Unknown
the: Unknown
new: Unknown
TokenStream: Noun
API: Noun

Each word is now followed by its assigned PartOfSpeech tag. Of course this is naive part-of-speech tagging. The word 'This' should not even be tagged as a noun; it is only capitalized because it is the first word of a sentence. Actually this is a good opportunity for an exercise. To practice the usage of the new API, the reader could now write an Attribute and TokenFilter that can specify for each word whether it was the first token of a sentence or not. Then the PartOfSpeechTaggingFilter can make use of this knowledge and only tag capitalized words as nouns if they are not the first word of a sentence (we know, this is still not correct behavior, but hey, it's a good exercise). As a small hint, this is how the new Attribute class could begin:
  public class FirstTokenOfSentenceAttributeImpl extends AttributeImpl
                  implements FirstTokenOfSentenceAttribute {

    private boolean firstToken;

    public void setFirstToken(boolean firstToken) {
      this.firstToken = firstToken;
    }

    public boolean getFirstToken() {
      return firstToken;
    }

    public void clear() {
      firstToken = false;
    }

    ...
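For completeness, here is a hedged sketch of the matching interface for the exercise (our illustration, not a class shipped with Lucene), following the same naming convention as PartOfSpeechAttribute:

  public interface FirstTokenOfSentenceAttribute extends Attribute {

    public void setFirstToken(boolean firstToken);

    public boolean getFirstToken();
  }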