<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements. See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
Code to search indices.

<h2>Table Of Contents</h2>
<ol>
    <li><a href="#search">Search Basics</a></li>
    <li><a href="#query">The Query Classes</a></li>
    <li><a href="#scoring">Changing the Scoring</a></li>
</ol>
<a name="search"></a>
<h2>Search Basics</h2>
<p>
Applications usually call {@link
org.apache.lucene.search.Searcher#search(Query,int)} or {@link
org.apache.lucene.search.Searcher#search(Query,Filter,int)}.

    <!-- FILL IN MORE HERE -->
</p>
<a name="query"></a>
<h2>Query Classes</h2>
<h4>
    <a href="TermQuery.html">TermQuery</a>
</h4>

<p>Of the various implementations of
    <a href="Query.html">Query</a>, the
    <a href="TermQuery.html">TermQuery</a>
    is the easiest to understand and the most often used in applications. A <a
        href="TermQuery.html">TermQuery</a> matches all the documents that contain the
    specified
    <a href="../index/Term.html">Term</a>,
    which is a word that occurs in a certain
    <a href="../document/Field.html">Field</a>.
    Thus, a <a href="TermQuery.html">TermQuery</a> identifies and scores all
    <a href="../document/Document.html">Document</a>s that have a <a
        href="../document/Field.html">Field</a> with the specified string in it.
    Constructing a <a
        href="TermQuery.html">TermQuery</a>
    is as simple as:</p>
<pre>
    TermQuery tq = new TermQuery(new Term("fieldName", "term"));
</pre>
<p>In this example, the <a href="Query.html">Query</a> identifies all <a
        href="../document/Document.html">Document</a>s that have the <a
        href="../document/Field.html">Field</a> named <tt>"fieldName"</tt>
    containing the word <tt>"term"</tt>.
</p>
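<p>As a rough mental model only, the lookup a TermQuery performs can be pictured as a map from a field/term pair
    to the documents that contain it. The sketch below is not Lucene code: the class <tt>TermLookupSketch</tt>
    and its methods are hypothetical, and scoring is left out entirely.</p>

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical toy index; it only models the lookup side of a term query. */
final class TermLookupSketch {
    // Key is "field:text"; value is the sorted set of doc ids containing that term.
    private final Map<String, SortedSet<Integer>> postings =
            new HashMap<String, SortedSet<Integer>>();

    /** Records that the document has the given word in the given field. */
    void add(int docId, String field, String text) {
        String key = field + ":" + text;
        SortedSet<Integer> docs = postings.get(key);
        if (docs == null) {
            docs = new TreeSet<Integer>();
            postings.put(key, docs);
        }
        docs.add(docId);
    }

    /** Returns the ids of all documents containing the term, or an empty set. */
    SortedSet<Integer> termQuery(String field, String text) {
        SortedSet<Integer> docs = postings.get(field + ":" + text);
        if (docs == null) {
            return new TreeSet<Integer>();
        }
        return Collections.unmodifiableSortedSet(docs);
    }
}
```

<p>The same word indexed under a different field is a different term, so a lookup on <tt>title</tt> never
    returns documents that only contain the word in <tt>body</tt>.</p>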
<h4>
    <a href="BooleanQuery.html">BooleanQuery</a>
</h4>

<p>Things start to get interesting when one combines multiple
    <a href="TermQuery.html">TermQuery</a> instances into a <a
        href="BooleanQuery.html">BooleanQuery</a>.
    A <a href="BooleanQuery.html">BooleanQuery</a> contains multiple
    <a href="BooleanClause.html">BooleanClause</a>s,
    where each clause contains a sub-query (<a href="Query.html">Query</a>
    instance) and an operator (from <a
        href="BooleanClause.Occur.html">BooleanClause.Occur</a>)
    describing how that sub-query is combined with the other clauses:</p>
<ol>
    <li><p>SHOULD &mdash; Use this operator when a clause can occur in the result set, but is not required.
        If a query is made up of all SHOULD clauses, then every document in the result
        set matches at least one of these clauses.</p></li>

    <li><p>MUST &mdash; Use this operator when a clause is required to occur in the result set. Every
        document in the result set will match
        all such clauses.</p></li>

    <li><p>MUST NOT &mdash; Use this operator when a
        clause must not occur in the result set. No
        document in the result set will match
        any such clauses.</p></li>
</ol>
<p>Boolean queries are constructed by adding two or more
    <a href="BooleanClause.html">BooleanClause</a>
    instances. If too many clauses are added, a <a href="BooleanQuery.TooManyClauses.html">TooManyClauses</a>
    exception will be thrown during searching. This most often occurs
    when a <a href="Query.html">Query</a>
    is rewritten into a <a href="BooleanQuery.html">BooleanQuery</a> with many
    <a href="TermQuery.html">TermQuery</a> clauses,
    for example by <a href="WildcardQuery.html">WildcardQuery</a>.
    The default setting for the maximum number
    of clauses is 1024, but this can be changed via the
    static method <a href="BooleanQuery.html#setMaxClauseCount(int)">setMaxClauseCount</a>
    in <a href="BooleanQuery.html">BooleanQuery</a>.
</p>
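<p>The three operators described above can be pictured as a filter over sets of document ids. The following
    is a minimal sketch under that assumption, not the Lucene implementation: <tt>BooleanClauseSketch</tt>,
    <tt>Clause</tt> and <tt>search</tt> are hypothetical names, each sub-query is reduced to the set of ids it
    matches, and scoring is ignored.</p>

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical model of the SHOULD / MUST / MUST NOT semantics; not Lucene code. */
final class BooleanClauseSketch {
    enum Occur { SHOULD, MUST, MUST_NOT }

    /** One clause: the doc ids its sub-query matches, plus the operator. */
    static final class Clause {
        final Set<Integer> matches;
        final Occur occur;

        Clause(Set<Integer> matches, Occur occur) {
            this.matches = matches;
            this.occur = occur;
        }
    }

    /** Returns the ids in [0, maxDoc) that satisfy the combination of clauses. */
    static Set<Integer> search(List<Clause> clauses, int maxDoc) {
        Set<Integer> hits = new TreeSet<Integer>();
        for (int doc = 0; doc < maxDoc; doc++) {
            boolean hasMust = false;
            boolean anyShould = false;
            boolean rejected = false;
            for (Clause c : clauses) {
                boolean m = c.matches.contains(doc);
                switch (c.occur) {
                    case MUST:
                        hasMust = true;
                        if (!m) rejected = true;   // every MUST clause has to match
                        break;
                    case MUST_NOT:
                        if (m) rejected = true;    // no MUST NOT clause may match
                        break;
                    case SHOULD:
                        if (m) anyShould = true;
                        break;
                }
            }
            // Without any MUST clause, at least one SHOULD clause has to match.
            if (!rejected && (hasMust || anyShould)) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```

<p>In this model a query made only of MUST NOT clauses matches nothing, because no clause ever selects a
    document in the first place.</p>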
<h4>Phrases</h4>

<p>Another common search is to find documents containing certain phrases. This
    is handled in two different ways:</p>
<ol>
    <li>
        <p><a href="PhraseQuery.html">PhraseQuery</a>
            &mdash; Matches a sequence of
            <a href="../index/Term.html">Terms</a>.
            <a href="PhraseQuery.html">PhraseQuery</a> uses a slop factor to determine
            how many positions may occur between any two terms in the phrase and still be considered a match.</p>
    </li>
    <li>
        <p><a href="spans/SpanNearQuery.html">SpanNearQuery</a>
            &mdash; Matches a sequence of other
            <a href="spans/SpanQuery.html">SpanQuery</a>
            instances. <a href="spans/SpanNearQuery.html">SpanNearQuery</a> allows for
            more
            complicated phrase queries since it is constructed from other <a
                href="spans/SpanQuery.html">SpanQuery</a>
            instances, instead of only <a href="TermQuery.html">TermQuery</a>
            instances.</p>
    </li>
</ol>
<h4>
    <a href="TermRangeQuery.html">TermRangeQuery</a>
</h4>

<p>The
    <a href="TermRangeQuery.html">TermRangeQuery</a>
    matches all documents that occur in the
    exclusive range of a lower
    <a href="../index/Term.html">Term</a>
    and an upper
    <a href="../index/Term.html">Term</a>
    according to {@link java.lang.String#compareTo(String)}. It is not intended
    for numerical ranges; use <a href="NumericRangeQuery.html">NumericRangeQuery</a> instead.

    For example, one could find all documents
    that have terms beginning with the letters <tt>a</tt> through <tt>c</tt>. This type of <a
        href="Query.html">Query</a> is frequently used to
    find
    documents that occur in a specific date range.
</p>
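<p>Because the range is evaluated with <tt>String.compareTo</tt>, terms compare character by character rather
    than by numeric value. The sketch below illustrates the consequence with plain JDK calls; the helper names
    <tt>inExclusiveRange</tt> and <tt>pad</tt> and the <tt>yyyyMMdd</tt> date layout are assumptions made for the
    example, not part of the Lucene API.</p>

```java
import java.util.Locale;

/** Hypothetical helper showing why term ranges suit strings, not raw numbers. */
final class TermRangeSketch {
    /** True if lower < term < upper under String.compareTo (an exclusive range). */
    static boolean inExclusiveRange(String term, String lower, String upper) {
        return term.compareTo(lower) > 0 && term.compareTo(upper) < 0;
    }

    /** Left-pads a non-negative number with zeros so string order follows numeric order. */
    static String pad(int value, int width) {
        return String.format(Locale.ROOT, "%0" + width + "d", value);
    }
}
```

<p>Dates written as fixed-width strings such as <tt>20060115</tt> sort correctly, whereas the unpadded string
    <tt>"9"</tt> sorts after <tt>"10"</tt> and so falls outside the range <tt>"1"</tt> to <tt>"10"</tt>. Padding
    every value to the same width restores the expected order.</p>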
<h4>
    <a href="NumericRangeQuery.html">NumericRangeQuery</a>
</h4>

<p>The
    <a href="NumericRangeQuery.html">NumericRangeQuery</a>
    matches all documents that occur in a numeric range.
    For NumericRangeQuery to work, you must index the values
    using a special <a href="../document/NumericField.html">
    NumericField</a>.
</p>
<h4>
    <a href="PrefixQuery.html">PrefixQuery</a>,
    <a href="WildcardQuery.html">WildcardQuery</a>
</h4>

<p>While the
    <a href="PrefixQuery.html">PrefixQuery</a>
    has a different implementation, it is essentially a special case of the
    <a href="WildcardQuery.html">WildcardQuery</a>.
    The <a href="PrefixQuery.html">PrefixQuery</a> allows an application
    to identify all documents with terms that begin with a certain string. The <a
        href="WildcardQuery.html">WildcardQuery</a> generalizes this by allowing
    for the use of <tt>*</tt> (matches 0 or more characters) and <tt>?</tt> (matches exactly one character) wildcards.
    Note that the <a href="WildcardQuery.html">WildcardQuery</a> can be quite slow. Also
    note that a
    <a href="WildcardQuery.html">WildcardQuery</a> pattern should
    not start with <tt>*</tt> or <tt>?</tt>, as such queries are extremely slow.
    QueryParser rejects leading wildcards by default.
    To remove this protection and allow a wildcard at the beginning of a term, see method
    <a href="../queryParser/QueryParser.html#setAllowLeadingWildcard(boolean)">setAllowLeadingWildcard</a> in
    <a href="../queryParser/QueryParser.html">QueryParser</a>.
</p>
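<p>The meaning of the two wildcards can be sketched by translating a pattern into a regular expression. This is
    only an illustration of the matching rule for a single term, assuming hypothetical names
    (<tt>WildcardSketch</tt>, <tt>compile</tt>, <tt>matches</tt>); it says nothing about how Lucene enumerates
    terms in the index.</p>

```java
import java.util.regex.Pattern;

/** Hypothetical matcher for the * and ? wildcards described above. */
final class WildcardSketch {
    /** Translates a wildcard pattern into an equivalent regular expression. */
    static Pattern compile(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);
            if (c == '*') {
                regex.append(".*");   // zero or more characters
            } else if (c == '?') {
                regex.append('.');    // exactly one character
            } else {
                // Quote everything else so regex metacharacters are taken literally.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** True if the whole term matches the wildcard pattern. */
    static boolean matches(String wildcard, String term) {
        return compile(wildcard).matcher(term).matches();
    }
}
```

<p>A prefix search for <tt>lu</tt> corresponds to the wildcard pattern <tt>lu*</tt>, which is why PrefixQuery
    can be seen as a special case.</p>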
<h4>
    <a href="FuzzyQuery.html">FuzzyQuery</a>
</h4>

<p>A
    <a href="FuzzyQuery.html">FuzzyQuery</a>
    matches documents that contain terms similar to the specified term. Similarity is
    determined using
    <a href="http://en.wikipedia.org/wiki/Levenshtein">Levenshtein (edit) distance</a>.
    This type of query can be useful when accounting for spelling variations in the collection.
</p>
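<p>For readers unfamiliar with edit distance, here is a minimal sketch of the classic dynamic-programming
    computation. The class name <tt>EditDistanceSketch</tt> is hypothetical, and the way FuzzyQuery turns a
    distance into a similarity threshold is not shown.</p>

```java
/** Hypothetical stand-alone Levenshtein distance; not the FuzzyQuery implementation. */
final class EditDistanceSketch {
    /** Returns the number of single-character inserts, deletes or substitutions turning a into b. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j characters of b takes j inserts.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

<p>A spelling variation such as <tt>lucine</tt> for <tt>lucene</tt> is one substitution away, so its distance
    is 1.</p>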
<a name="changingSimilarity"></a>
<h2>Changing Similarity</h2>

<p>Chances are <a href="DefaultSimilarity.html">DefaultSimilarity</a> is sufficient for all
    your searching needs.
    However, in some applications it may be necessary to customize your <a
        href="Similarity.html">Similarity</a> implementation. For instance, some
    applications do not need to
    distinguish between shorter and longer documents (see <a
        href="http://www.gossamer-threads.com/lists/lucene/java-user/38967#38967">a "fair" similarity</a>).</p>

<p>To change <a href="Similarity.html">Similarity</a>, one must do so for both indexing and
    searching, and the changes must happen before
    either of these actions takes place. Although in theory there is nothing stopping you from changing mid-stream, it
    just isn't well-defined what is going to happen.
</p>

<p>To make this change, implement your own <a href="Similarity.html">Similarity</a> (likely
    you'll want to simply subclass
    <a href="DefaultSimilarity.html">DefaultSimilarity</a>) and then use the new
    class by calling
    <a href="../index/IndexWriter.html#setSimilarity(org.apache.lucene.search.Similarity)">IndexWriter.setSimilarity</a>
    before indexing and
    <a href="Searcher.html#setSimilarity(org.apache.lucene.search.Similarity)">Searcher.setSimilarity</a>
    before searching.
</p>
<p>If you are interested in use cases for changing your similarity, see the Lucene users' mailing list at <a
        href="http://www.nabble.com/Overriding-Similarity-tf2128934.html">Overriding Similarity</a>.
    In summary, here are a few use cases:</p>
<ol>
    <li><p><a href="api/org/apache/lucene/misc/SweetSpotSimilarity.html">SweetSpotSimilarity</a> &mdash; <a
            href="api/org/apache/lucene/misc/SweetSpotSimilarity.html">SweetSpotSimilarity</a> gives small increases
        as the frequency increases a small amount
        and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is
        more significant.</p></li>
    <li><p>Overriding tf &mdash; In some applications, it doesn't matter what the score of a document is as long as a
        matching term occurs. In these
        cases people have overridden Similarity to return 1 from the tf() method.</p></li>
    <li><p>Changing Length Normalization &mdash; By overriding <a
            href="Similarity.html#lengthNorm(java.lang.String,%20int)">lengthNorm</a>,
        it is possible to discount how the length of a field contributes
        to a score. In <a href="DefaultSimilarity.html">DefaultSimilarity</a>,
        lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be
        1 / (numTerms in field), all fields will be treated
        <a href="http://www.gossamer-threads.com/lists/lucene/java-user/38967#38967">"fairly"</a>.</p></li>
</ol>
<p>In general, Chris Hostetter sums it up best in saying (from <a
        href="http://www.gossamer-threads.com/lists/lucene/java-user/39125#39125">the Lucene users' mailing list</a>):</p>
<blockquote>[One would override the Similarity in] ... any situation where you know more about your data then just
    that
    it's "text" is a situation where it *might* make sense to to override your
    Similarity method.</blockquote>
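<p>To make the length-normalization use case above concrete, the two formulas quoted there can be compared
    numerically. The sketch is plain arithmetic under hypothetical names (<tt>LengthNormSketch</tt>,
    <tt>defaultNorm</tt>, <tt>inverseNorm</tt>); it does not subclass Similarity and ignores how Lucene stores
    norms in the index.</p>

```java
/** Hypothetical side-by-side of the two lengthNorm formulas quoted in the text. */
final class LengthNormSketch {
    /** 1 / (numTerms in field)^0.5, the DefaultSimilarity formula given above. */
    static float defaultNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /** 1 / (numTerms in field), the alternative formula given above. */
    static float inverseNorm(int numTerms) {
        return 1.0f / numTerms;
    }
}
```

<p>For a 4-term field and a 100-term field, <tt>defaultNorm</tt> gives 0.5 and 0.1, while
    <tt>inverseNorm</tt> gives 0.25 and 0.01, so the alternative penalizes long fields far more steeply.</p>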
<a name="scoring"></a>
<h2>Changing Scoring &mdash; Expert Level</h2>

<p>Changing scoring is an expert level task, so tread carefully and be prepared to share your code if
    you want help.
</p>

<p>With the warning out of the way, it is possible to change a lot more than just the Similarity
    when it comes to scoring in Lucene. Lucene's scoring is a complex mechanism that is grounded by
    <span>three main classes</span>:</p>
<ol>
    <li>
        <a href="Query.html">Query</a> &mdash; The abstract object representation of the
        user's information need.</li>
    <li>
        <a href="Weight.html">Weight</a> &mdash; The internal interface representation of
        the user's Query, so that Query objects may be reused.</li>
    <li>
        <a href="Scorer.html">Scorer</a> &mdash; An abstract class containing common
        functionality for scoring. Provides both scoring and explanation capabilities.</li>
</ol>
<p>Details on each of these classes, and their children, can be found in the subsections below.</p>
<h4>The Query Class</h4>
<p>In some sense, the
    <a href="Query.html">Query</a>
    class is where it all begins. Without a Query, there would be
    nothing to score. Furthermore, the Query class is the catalyst for the other scoring classes as it
    is often responsible
    for creating them or coordinating the functionality between them. The
    <a href="Query.html">Query</a> class has several methods that are important for
    derived classes:</p>
<ol>
    <li>createWeight(Searcher searcher) &mdash; A
        <a href="Weight.html">Weight</a> is the internal representation of the
        Query, so each Query implementation must
        provide an implementation of Weight. See the subsection on <a
            href="#The Weight Interface">The Weight Interface</a> below for details on implementing the Weight
        interface.</li>
    <li>rewrite(IndexReader reader) &mdash; Rewrites queries into primitive queries. Primitive queries are:
        <a href="TermQuery.html">TermQuery</a>,
        <a href="BooleanQuery.html">BooleanQuery</a>, <span
            >and other queries that implement <tt>createWeight(Searcher searcher)</tt></span>.</li>
</ol>
<h4>The Weight Interface</h4>
<p>The
    <a href="Weight.html">Weight</a>
    interface provides an internal representation of the Query so that it can be reused. Any
    <a href="Searcher.html">Searcher</a>
    dependent state should be stored in the Weight implementation,
    not in the Query class. The interface defines six methods that must be implemented:</p>
<ol>
    <li>
        <a href="Weight.html#getQuery()">Weight#getQuery()</a> &mdash; Pointer to the
        Query that this Weight represents.</li>
    <li>
        <a href="Weight.html#getValue()">Weight#getValue()</a> &mdash; The weight for
        this Query. For example, the TermQuery.TermWeight value is
        equal to the idf^2 * boost * queryNorm <!-- DOUBLE CHECK THIS --></li>
    <li>
        <a href="Weight.html#sumOfSquaredWeights()">
            Weight#sumOfSquaredWeights()</a> &mdash; The sum of squared weights. For TermQuery, this is (idf *
        boost)^2</li>
    <li>
        <a href="Weight.html#normalize(float)">
            Weight#normalize(float)</a> &mdash; Determine the query normalization factor. The query normalization may
        allow for comparing scores between queries.</li>
    <li>
        <a href="Weight.html#scorer(org.apache.lucene.index.IndexReader, boolean, boolean)">
            Weight#scorer(IndexReader, boolean, boolean)</a> &mdash; Construct a new
        <a href="Scorer.html">Scorer</a>
        for this Weight. See
        <a href="#The Scorer Class">The Scorer Class</a>
        below for help defining a Scorer. As the name implies, the
        Scorer is responsible for doing the actual scoring of documents given the Query.</li>
    <li>
        <a href="Weight.html#explain(org.apache.lucene.search.Searcher, org.apache.lucene.index.IndexReader, int)">
            Weight#explain(Searcher, IndexReader, int)</a> &mdash; Provide a means for explaining why a given document was
        scored the way it was.</li>
</ol>
<h4>The Scorer Class</h4>
<p>The
    <a href="Scorer.html">Scorer</a>
    abstract class provides common scoring functionality for all Scorer implementations and
    is the heart of the Lucene scoring process. The Scorer defines the following abstract (some of them are not
    yet abstract, but will be in future versions and should be considered as such now) methods which
    must be implemented (some of them inherited from <a href="DocIdSetIterator.html">DocIdSetIterator</a>):</p>
<ol>
    <li>
        <a href="DocIdSetIterator.html#nextDoc()">DocIdSetIterator#nextDoc()</a> &mdash; Advances to the next
        document that matches this Query, returning its id, or
        NO_MORE_DOCS if there are no more matching documents.</li>
    <li>
        <a href="DocIdSetIterator.html#docID()">DocIdSetIterator#docID()</a> &mdash; Returns the id of the
        <a href="../document/Document.html">Document</a>
        that contains the match. It is not valid until nextDoc() or advance(int) has been called at least once.
    </li>
    <li>
        <a href="Scorer.html#score(org.apache.lucene.search.Collector)">Scorer#score(Collector)</a> &mdash;
        Scores and collects all matching documents using the given Collector.
    </li>
    <li>
        <a href="Scorer.html#score()">Scorer#score()</a> &mdash; Return the score of the
        current document. This value can be determined in any
        appropriate way for an application. For instance, the
        <a href="http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/search/TermScorer.java?view=log">TermScorer</a>
        returns the tf * Weight.getValue() * fieldNorm.
    </li>
    <li>
        <a href="DocIdSetIterator.html#advance(int)">DocIdSetIterator#advance(int)</a> &mdash; Skip ahead in
        the document matches to the document whose id is greater than
        or equal to the passed in value. In many instances, advance can be
        implemented more efficiently than simply looping through all the matching documents until
        the target document is identified.</li>
</ol>
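<p>The iteration contract of <tt>nextDoc()</tt>, <tt>docID()</tt> and <tt>advance(int)</tt> listed above can be
    sketched over a sorted array of document ids. This is a stand-alone illustration that only borrows the method
    names: <tt>SortedDocIterator</tt> is a hypothetical class that does not extend DocIdSetIterator, and returning
    -1 from <tt>docID()</tt> before iteration starts is a choice made for this sketch. It shows why
    <tt>advance</tt> can beat a linear scan, here by using binary search.</p>

```java
import java.util.Arrays;

/** Hypothetical iterator over a sorted array of doc ids; it mirrors the method names only. */
final class SortedDocIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs;  // strictly increasing doc ids
    private int index = -1;    // -1 means neither nextDoc() nor advance() was called yet

    SortedDocIterator(int[] sortedDocs) {
        this.docs = sortedDocs.clone();
    }

    /** Current doc id: -1 before iteration starts, NO_MORE_DOCS once exhausted. */
    int docID() {
        if (index < 0) {
            return -1;
        }
        return index < docs.length ? docs[index] : NO_MORE_DOCS;
    }

    /** Moves to the next match and returns its id, or NO_MORE_DOCS. */
    int nextDoc() {
        if (index < docs.length) {
            index++;
        }
        return docID();
    }

    /** Jumps to the first match after the current one whose id is >= target. */
    int advance(int target) {
        int from = index + 1;
        if (from >= docs.length) {
            index = docs.length;
            return NO_MORE_DOCS;
        }
        // Binary search over the remaining ids instead of calling nextDoc() repeatedly.
        int pos = Arrays.binarySearch(docs, from, docs.length, target);
        index = pos >= 0 ? pos : -pos - 1;
        return docID();
    }
}
```

<p>Given the ids 2, 5, 9 and 14, calling <tt>nextDoc()</tt> returns 2 and a following <tt>advance(6)</tt>
    returns 9 without visiting 5.</p>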
<h4>Why would I want to add my own Query?</h4>

<p>In a nutshell, you want to add your own custom Query implementation when you think that Lucene's
    existing queries aren't appropriate for the
    task that you want to do. You might be doing some cutting-edge research or you need more information
    out of Lucene (similar to Doug adding SpanQuery functionality).</p>