lucene-java-3.5.0/lucene/src/java/org/apache/lucene/search/package.html

   1 <!doctype html public "-//w3c//dtd html 4.0 transitional//en">
   2 <!--
   3  Licensed to the Apache Software Foundation (ASF) under one or more
   4  contributor license agreements.  See the NOTICE file distributed with
   5  this work for additional information regarding copyright ownership.
   6  The ASF licenses this file to You under the Apache License, Version 2.0
   7  (the "License"); you may not use this file except in compliance with
   8  the License.  You may obtain a copy of the License at
   9
  10      http://www.apache.org/licenses/LICENSE-2.0
  11
  12  Unless required by applicable law or agreed to in writing, software
  13  distributed under the License is distributed on an "AS IS" BASIS,
  14  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  15  See the License for the specific language governing permissions and
  16  limitations under the License.
  17 -->
  18 <html>
  19 <head>
  20    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
  21 </head>
  22 <body>
  23 Code to search indices.
  24
  25 <h2>Table Of Contents</h2>
  26 <p>
  27     <ol>
  28         <li><a href="#search">Search Basics</a></li>
  29         <li><a href="#query">The Query Classes</a></li>
  30         <li><a href="#scoring">Changing the Scoring</a></li>
  31     </ol>
  32 </p>
  33 <a name="search"></a>
  34 <h2>Search</h2>
  35 <p>
  36 Search over indices.
  37
  38 Applications usually call {@link
  39 org.apache.lucene.search.Searcher#search(Query,int)} or {@link
  40 org.apache.lucene.search.Searcher#search(Query,Filter,int)}.
  41
  42     <!-- FILL IN MORE HERE -->
  43 </p>
  44 <a name="query"></a>
  45 <h2>Query Classes</h2>
  46 <h4>
  47     <a href="TermQuery.html">TermQuery</a>
  48 </h4>
  49
  50 <p>Of the various implementations of
  51     <a href="Query.html">Query</a>, the
  52     <a href="TermQuery.html">TermQuery</a>
  53     is the easiest to understand and the most often used in applications. A <a
  54         href="TermQuery.html">TermQuery</a> matches all the documents that contain the
  55     specified
  56     <a href="../index/Term.html">Term</a>,
  57     which is a word that occurs in a certain
  58     <a href="../document/Field.html">Field</a>.
  59     Thus, a <a href="TermQuery.html">TermQuery</a> identifies and scores all
  60     <a href="../document/Document.html">Document</a>s that have a <a
  61         href="../document/Field.html">Field</a> with the specified string in it.
  62     Constructing a <a
  63         href="TermQuery.html">TermQuery</a>
  64     is as simple as:
  65     <pre>
  66         TermQuery tq = new TermQuery(new Term("fieldName", "term"));
  67     </pre>In this example, the <a href="Query.html">Query</a> identifies all <a
  68         href="../document/Document.html">Document</a>s that have the <a
  69         href="../document/Field.html">Field</a> named <tt>"fieldName"</tt>
  70     containing the word <tt>"term"</tt>.
  71 </p>
  72 <h4>
  73     <a href="BooleanQuery.html">BooleanQuery</a>
  74 </h4>
  75
  76 <p>Things start to get interesting when one combines multiple
  77     <a href="TermQuery.html">TermQuery</a> instances into a <a
  78         href="BooleanQuery.html">BooleanQuery</a>.
  79     A <a href="BooleanQuery.html">BooleanQuery</a> contains multiple
  80     <a href="BooleanClause.html">BooleanClause</a>s,
  81     where each clause contains a sub-query (<a href="Query.html">Query</a>
  82     instance) and an operator (from <a
  83         href="BooleanClause.Occur.html">BooleanClause.Occur</a>)
  84     describing how that sub-query is combined with the other clauses:
  85     <ol>
  86
  87         <li><p>SHOULD &mdash; Use this operator when a clause can occur in the result set, but is not required.
  88             If a query is made up of all SHOULD clauses, then every document in the result
  89             set matches at least one of these clauses.</p></li>
  90
  91         <li><p>MUST &mdash; Use this operator when a clause is required to occur in the result set. Every
  92             document in the result set will match
  93             all such clauses.</p></li>
  94
  95         <li><p>MUST NOT &mdash; Use this operator when a
  96             clause must not occur in the result set. No
  97             document in the result set will match
  98             any such clauses.</p></li>
  99     </ol>
 100     Boolean queries are constructed by adding two or more
 101     <a href="BooleanClause.html">BooleanClause</a>
 102     instances. If too many clauses are added, a <a href="BooleanQuery.TooManyClauses.html">TooManyClauses</a>
 103     exception will be thrown during searching. This most often occurs
 104     when a <a href="Query.html">Query</a>
 105     is rewritten into a <a href="BooleanQuery.html">BooleanQuery</a> with many
 106     <a href="TermQuery.html">TermQuery</a> clauses,
 107     for example by <a href="WildcardQuery.html">WildcardQuery</a>.
 108     The default setting for the maximum number
  109     of clauses is 1024, but this can be changed via the
 110     static method <a href="BooleanQuery.html#setMaxClauseCount(int)">setMaxClauseCount</a>
 111     in <a href="BooleanQuery.html">BooleanQuery</a>.
 112 </p>
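<p>The three operators can be illustrated without any Lucene classes. The sketch below is a plain Java model, not Lucene's implementation: <tt>BooleanClauseSketch</tt>, <tt>Clause</tt> and <tt>matches</tt> are hypothetical names, each clause is reduced to a single term, and a document is reduced to the set of terms it contains. It also assumes that a query with only MUST NOT clauses selects nothing, since there is no positive clause to match.</p>

```java
import java.util.List;
import java.util.Set;

/** Illustrative model of SHOULD / MUST / MUST_NOT matching; not Lucene code. */
final class BooleanClauseSketch {
    enum Occur { MUST, SHOULD, MUST_NOT }

    /** A hypothetical clause: one term plus the operator applied to it. */
    static final class Clause {
        final String term;
        final Occur occur;

        Clause(String term, Occur occur) {
            this.term = term;
            this.occur = occur;
        }
    }

    /** Returns true if a document holding docTerms would be in the result set. */
    static boolean matches(Set<String> docTerms, List<Clause> clauses) {
        boolean sawMust = false;    // at least one MUST clause exists (and matched)
        boolean shouldHit = false;  // at least one SHOULD clause matched
        for (Clause clause : clauses) {
            boolean hit = docTerms.contains(clause.term);
            switch (clause.occur) {
                case MUST:
                    if (!hit) return false;  // every MUST clause is required
                    sawMust = true;
                    break;
                case MUST_NOT:
                    if (hit) return false;   // any MUST_NOT hit excludes the document
                    break;
                case SHOULD:
                    shouldHit |= hit;        // optional, but counts when nothing is required
                    break;
            }
        }
        // With no MUST clause, at least one SHOULD clause has to match.
        return sawMust || shouldHit;
    }
}
```

<p>With the clauses MUST <tt>lucene</tt> and MUST NOT <tt>solr</tt>, a document containing only <tt>lucene</tt> matches, while one containing both terms does not.</p>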
 113
 114 <h4>Phrases</h4>
 115
 116 <p>Another common search is to find documents containing certain phrases. This
  117     is handled in two different ways:
 118     <ol>
 119         <li>
 120             <p><a href="PhraseQuery.html">PhraseQuery</a>
 121                 &mdash; Matches a sequence of
 122                 <a href="../index/Term.html">Terms</a>.
 123                 <a href="PhraseQuery.html">PhraseQuery</a> uses a slop factor to determine
 124                 how many positions may occur between any two terms in the phrase and still be considered a match.</p>
 125         </li>
 126         <li>
 127             <p><a href="spans/SpanNearQuery.html">SpanNearQuery</a>
 128                 &mdash; Matches a sequence of other
 129                 <a href="spans/SpanQuery.html">SpanQuery</a>
 130                 instances. <a href="spans/SpanNearQuery.html">SpanNearQuery</a> allows for
 131                 much more
 132                 complicated phrase queries since it is constructed from other <a
 133                     href="spans/SpanQuery.html">SpanQuery</a>
 134                 instances, instead of only <a href="TermQuery.html">TermQuery</a>
 135                 instances.</p>
 136         </li>
 137     </ol>
 138 </p>
 139
 140 <h4>
 141     <a href="TermRangeQuery.html">TermRangeQuery</a>
 142 </h4>
 143
 144 <p>The
 145     <a href="TermRangeQuery.html">TermRangeQuery</a>
  146     matches all documents that contain a term in the
  147     range between a lower
  148     <a href="../index/Term.html">Term</a>
  149     and an upper
  150     <a href="../index/Term.html">Term</a>
  151     (each bound may be inclusive or exclusive), compared according to {@link java.lang.String#compareTo(String)}. It is not intended
  152     for numerical ranges; use <a href="NumericRangeQuery.html">NumericRangeQuery</a> instead.
 153
 154     For example, one could find all documents
 155     that have terms beginning with the letters <tt>a</tt> through <tt>c</tt>. This type of <a
 156         href="Query.html">Query</a> is frequently used to
 157     find
 158     documents that occur in a specific date range.
 159 </p>
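<p>Because the range is defined by <tt>String.compareTo</tt>, a small plain Java check shows both why fixed-width date strings work and why raw numbers do not. This is only a sketch of the comparison, not Lucene code: <tt>TermRangeSketch</tt>, <tt>inRange</tt> and its parameters are hypothetical names, and the <tt>yyyyMMdd</tt> date format is an assumption chosen for the example.</p>

```java
/** Illustrative lexicographic range test based on String.compareTo; not Lucene code. */
final class TermRangeSketch {
    /** Returns true if term lies between lower and upper in String.compareTo order. */
    static boolean inRange(String term, String lower, String upper,
                           boolean includeLower, boolean includeUpper) {
        int lo = term.compareTo(lower);
        int hi = term.compareTo(upper);
        boolean aboveLower = includeLower ? lo >= 0 : lo > 0;
        boolean belowUpper = includeUpper ? hi <= 0 : hi < 0;
        return aboveLower && belowUpper;
    }
}
```

<p>Here <tt>inRange("20061115", "20060101", "20061231", true, true)</tt> is true, because fixed-width date strings sort in date order. By contrast, <tt>inRange("10", "2", "9", true, true)</tt> is false, because <tt>"10"</tt> sorts before <tt>"2"</tt> as a string.</p>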
 160
 161 <h4>
 162     <a href="NumericRangeQuery.html">NumericRangeQuery</a>
 163 </h4>
 164
 165 <p>The
 166     <a href="NumericRangeQuery.html">NumericRangeQuery</a>
 167     matches all documents that occur in a numeric range.
 168     For NumericRangeQuery to work, you must index the values
 169     using a special <a href="../document/NumericField.html">
 170     NumericField</a>.
 171 </p>
 172
 173 <h4>
 174     <a href="PrefixQuery.html">PrefixQuery</a>,
 175     <a href="WildcardQuery.html">WildcardQuery</a>
 176 </h4>
 177
 178 <p>While the
 179     <a href="PrefixQuery.html">PrefixQuery</a>
 180     has a different implementation, it is essentially a special case of the
 181     <a href="WildcardQuery.html">WildcardQuery</a>.
 182     The <a href="PrefixQuery.html">PrefixQuery</a> allows an application
 183     to identify all documents with terms that begin with a certain string. The <a
 184         href="WildcardQuery.html">WildcardQuery</a> generalizes this by allowing
 185     for the use of <tt>*</tt> (matches 0 or more characters) and <tt>?</tt> (matches exactly one character) wildcards.
 186     Note that the <a href="WildcardQuery.html">WildcardQuery</a> can be quite slow. Also
  187     note that a
  188     <a href="WildcardQuery.html">WildcardQuery</a> should
  189     not start with <tt>*</tt> or <tt>?</tt>, as such queries are extremely slow.
  190         By default the query parser rejects such terms; to allow a wildcard at the beginning of a term, see method
 191         <a href="../queryParser/QueryParser.html#setAllowLeadingWildcard(boolean)">setAllowLeadingWildcard</a> in
 192         <a href="../queryParser/QueryParser.html">QueryParser</a>.
 193 </p>
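<p>The meaning of the two wildcards can be shown with a small matcher over a single term. This is a plain Java sketch of the pattern semantics only (<tt>*</tt> for zero or more characters, <tt>?</tt> for exactly one); <tt>WildcardSketch</tt> and <tt>matches</tt> are hypothetical names, and the sketch says nothing about how Lucene enumerates the terms in an index.</p>

```java
/** Illustrative matcher for '*' (zero or more chars) and '?' (exactly one char); not Lucene code. */
final class WildcardSketch {
    static boolean matches(String pattern, String text) {
        int p = 0;          // position in pattern
        int t = 0;          // position in text
        int star = -1;      // index of the most recent '*' in pattern, or -1
        int mark = 0;       // text position where that '*' started matching
        while (t < text.length()) {
            if (p < pattern.length() && pattern.charAt(p) == '*') {
                star = p++;             // remember the star, first try matching zero chars
                mark = t;
            } else if (p < pattern.length()
                    && (pattern.charAt(p) == '?' || pattern.charAt(p) == text.charAt(t))) {
                p++;                    // single-character match
                t++;
            } else if (star != -1) {
                p = star + 1;           // backtrack: let the last '*' absorb one more char
                t = ++mark;
            } else {
                return false;
            }
        }
        // Any trailing '*' can match the empty string.
        while (p < pattern.length() && pattern.charAt(p) == '*') {
            p++;
        }
        return p == pattern.length();
    }
}
```

<p>A prefix search is the special case of a single trailing <tt>*</tt>: <tt>matches("luc*", term)</tt> gives the same answer as <tt>term.startsWith("luc")</tt>.</p>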
 194 <h4>
 195     <a href="FuzzyQuery.html">FuzzyQuery</a>
 196 </h4>
 197
 198 <p>A
 199     <a href="FuzzyQuery.html">FuzzyQuery</a>
 200     matches documents that contain terms similar to the specified term. Similarity is
 201     determined using
 202     <a href="http://en.wikipedia.org/wiki/Levenshtein">Levenshtein (edit) distance</a>.
 203     This type of query can be useful when accounting for spelling variations in the collection.
 204 </p>
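<p>For reference, Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below is the textbook dynamic-programming version in plain Java; <tt>LevenshteinSketch</tt> and <tt>distance</tt> are hypothetical names, and the sketch does not show how FuzzyQuery turns the distance into a similarity threshold.</p>

```java
/** Textbook Levenshtein (edit) distance using two rolling rows; not Lucene code. */
final class LevenshteinSketch {
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b: j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;  // i deletions turn a[0..i) into the empty string
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = cur[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                cur[j] = Math.min(Math.min(insert, delete), substitute);
            }
            int[] tmp = prev;  // reuse the two rows
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }
}
```

<p>For example, <tt>distance("lucene", "lucent")</tt> is 1 (one substitution), so a fuzzy search for one spelling could plausibly find the other.</p>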
 205 <a name="changingSimilarity"></a>
 206 <h2>Changing Similarity</h2>
 207
 208 <p>Chances are <a href="DefaultSimilarity.html">DefaultSimilarity</a> is sufficient for all
 209     your searching needs.
 210     However, in some applications it may be necessary to customize your <a
 211         href="Similarity.html">Similarity</a> implementation. For instance, some
 212     applications do not need to
 213     distinguish between shorter and longer documents (see <a
 214         href="http://www.gossamer-threads.com/lists/lucene/java-user/38967#38967">a "fair" similarity</a>).</p>
 215
 216 <p>To change <a href="Similarity.html">Similarity</a>, one must do so for both indexing and
  217     searching, and the changes must happen before
  218     either of these actions takes place. Although in theory there is nothing stopping you from changing mid-stream, the
  219     result is not well-defined.
 220 </p>
 221
 222 <p>To make this change, implement your own <a href="Similarity.html">Similarity</a> (likely
 223     you'll want to simply subclass
 224     <a href="DefaultSimilarity.html">DefaultSimilarity</a>) and then use the new
 225     class by calling
 226     <a href="../index/IndexWriter.html#setSimilarity(org.apache.lucene.search.Similarity)">IndexWriter.setSimilarity</a>
 227     before indexing and
 228     <a href="Searcher.html#setSimilarity(org.apache.lucene.search.Similarity)">Searcher.setSimilarity</a>
 229     before searching.
 230 </p>
 231
 232 <p>
  233     If you are interested in use cases for changing your similarity, see the Lucene users' mailing list at <a
 234         href="http://www.nabble.com/Overriding-Similarity-tf2128934.html">Overriding Similarity</a>.
 235     In summary, here are a few use cases:
 236     <ol>
 237         <li><p><a href="api/org/apache/lucene/misc/SweetSpotSimilarity.html">SweetSpotSimilarity</a> &mdash; <a
 238                 href="api/org/apache/lucene/misc/SweetSpotSimilarity.html">SweetSpotSimilarity</a> gives small increases
 239             as the frequency increases a small amount
 240             and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is
 241             more significant.</p></li>
 242         <li><p>Overriding tf &mdash; In some applications, it doesn't matter what the score of a document is as long as a
 243             matching term occurs. In these
 244             cases people have overridden Similarity to return 1 from the tf() method.</p></li>
 245         <li><p>Changing Length Normalization &mdash; By overriding <a
 246                 href="Similarity.html#lengthNorm(java.lang.String,%20int)">lengthNorm</a>,
 247             it is possible to discount how the length of a field contributes
 248             to a score. In <a href="DefaultSimilarity.html">DefaultSimilarity</a>,
 249             lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be
 250             1 / (numTerms in field), all fields will be treated
 251             <a href="http://www.gossamer-threads.com/lists/lucene/java-user/38967#38967">"fairly"</a>.</p></li>
 252     </ol>
 253     In general, Chris Hostetter sums it up best in saying (from <a
  254         href="http://www.gossamer-threads.com/lists/lucene/java-user/39125#39125">the Lucene users' mailing list</a>):
 255     <blockquote>[One would override the Similarity in] ... any situation where you know more about your data then just
 256         that
 257         it's "text" is a situation where it *might* make sense to to override your
 258         Similarity method.</blockquote>
 259 </p>
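<p>The two length normalization formulas and the flat tf from the use cases above can be compared numerically. The sketch below only evaluates the formulas as written in this document; it is not a <tt>Similarity</tt> subclass, and <tt>SimilarityFormulaSketch</tt> and its method names are hypothetical. It deliberately leaves out anything else the real classes may apply, such as boosts or the way norms are stored in the index.</p>

```java
/** Evaluates the formulas quoted in the text; not a Lucene Similarity implementation. */
final class SimilarityFormulaSketch {
    /** lengthNorm = 1 / (numTerms in field)^0.5, the DefaultSimilarity formula quoted above. */
    static float sqrtLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /** lengthNorm = 1 / (numTerms in field), the alternative formula from the list above. */
    static float linearLengthNorm(int numTerms) {
        return 1.0f / numTerms;
    }

    /** Flat tf: any positive frequency counts the same (assumption: 0 when the term is absent). */
    static float flatTf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }
}
```

<p>For a field of 100 terms the first formula gives 0.1 and the second gives 0.01, so the second penalizes long fields much more heavily.</p>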
 260 <a name="scoring"></a>
 261 <h2>Changing Scoring &mdash; Expert Level</h2>
 262
 263 <p>Changing scoring is an expert level task, so tread carefully and be prepared to share your code if
 264     you want help.
 265 </p>
 266
 267 <p>With the warning out of the way, it is possible to change a lot more than just the Similarity
 268     when it comes to scoring in Lucene. Lucene's scoring is a complex mechanism that is grounded by
 269     <span >three main classes</span>:
 270     <ol>
 271         <li>
 272             <a href="Query.html">Query</a> &mdash; The abstract object representation of the
 273             user's information need.</li>
 274         <li>
 275             <a href="Weight.html">Weight</a> &mdash; The internal interface representation of
 276             the user's Query, so that Query objects may be reused.</li>
 277         <li>
 278             <a href="Scorer.html">Scorer</a> &mdash; An abstract class containing common
 279             functionality for scoring. Provides both scoring and explanation capabilities.</li>
 280     </ol>
 281     Details on each of these classes, and their children, can be found in the subsections below.
 282 </p>
 283 <h4>The Query Class</h4>
 284     <p>In some sense, the
 285         <a href="Query.html">Query</a>
 286         class is where it all begins. Without a Query, there would be
 287         nothing to score. Furthermore, the Query class is the catalyst for the other scoring classes as it
 288         is often responsible
 289         for creating them or coordinating the functionality between them. The
 290         <a href="Query.html">Query</a> class has several methods that are important for
 291         derived classes:
 292         <ol>
 293             <li>createWeight(Searcher searcher) &mdash; A
 294                 <a href="Weight.html">Weight</a> is the internal representation of the
 295                 Query, so each Query implementation must
 296                 provide an implementation of Weight. See the subsection on <a
 297                     href="#The Weight Interface">The Weight Interface</a> below for details on implementing the Weight
 298                 interface.</li>
 299             <li>rewrite(IndexReader reader) &mdash; Rewrites queries into primitive queries. Primitive queries are:
 300                 <a href="TermQuery.html">TermQuery</a>,
 301                 <a href="BooleanQuery.html">BooleanQuery</a>, <span
  302                     >and other queries that implement <a href="Query.html#createWeight(org.apache.lucene.search.Searcher)">createWeight(Searcher searcher)</a>.</span></li>
 303         </ol>
 304     </p>
 305 <h4>The Weight Interface</h4>
 306     <p>The
 307         <a href="Weight.html">Weight</a>
 308         interface provides an internal representation of the Query so that it can be reused. Any
 309         <a href="Searcher.html">Searcher</a>
 310         dependent state should be stored in the Weight implementation,
 311         not in the Query class. The interface defines six methods that must be implemented:
 312         <ol>
 313             <li>
 314                 <a href="Weight.html#getQuery()">Weight#getQuery()</a> &mdash; Pointer to the
 315                 Query that this Weight represents.</li>
 316             <li>
 317                 <a href="Weight.html#getValue()">Weight#getValue()</a> &mdash; The weight for
 318                 this Query. For example, the TermQuery.TermWeight value is
 319                 equal to the idf^2 * boost * queryNorm <!-- DOUBLE CHECK THIS --></li>
 320             <li>
 321                 <a href="Weight.html#sumOfSquaredWeights()">
 322                     Weight#sumOfSquaredWeights()</a> &mdash; The sum of squared weights. For TermQuery, this is (idf *
 323                 boost)^2</li>
 324             <li>
 325                 <a href="Weight.html#normalize(float)">
 326                     Weight#normalize(float)</a> &mdash; Determine the query normalization factor. The query normalization may
 327                 allow for comparing scores between queries.</li>
 328             <li>
 329                 <a href="Weight.html#scorer(org.apache.lucene.index.IndexReader, boolean, boolean)">
 330                     Weight#scorer(IndexReader, boolean, boolean)</a> &mdash; Construct a new
 331                 <a href="Scorer.html">Scorer</a>
 332                 for this Weight. See
 333                 <a href="#The Scorer Class">The Scorer Class</a>
 334                 below for help defining a Scorer. As the name implies, the
 335                 Scorer is responsible for doing the actual scoring of documents given the Query.
 336             </li>
 337             <li>
 338                 <a href="Weight.html#explain(org.apache.lucene.search.Searcher, org.apache.lucene.index.IndexReader, int)">
 339                     Weight#explain(Searcher, IndexReader, int)</a> &mdash; Provide a means for explaining why a given document was
 340                 scored
 341                 the way it was.</li>
 342         </ol>
 343     </p>
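<p>The arithmetic quoted for TermQuery in the list above can be traced with a small plain Java model. This is a sketch of the formulas only, not the real <tt>Weight</tt> type: <tt>TermWeightSketch</tt> is a hypothetical class, and the normalization factor passed to <tt>normalize</tt> is assumed to be supplied by the caller.</p>

```java
/** Models the TermQuery weight formulas quoted in the text; not Lucene's Weight class. */
final class TermWeightSketch {
    private final float idf;
    private final float boost;
    private float queryNorm = 1.0f;  // assumption: neutral until normalize(float) is called

    TermWeightSketch(float idf, float boost) {
        this.idf = idf;
        this.boost = boost;
    }

    /** (idf * boost)^2, as stated for sumOfSquaredWeights(). */
    float sumOfSquaredWeights() {
        float w = idf * boost;
        return w * w;
    }

    /** Stores the query normalization factor chosen by the caller. */
    void normalize(float norm) {
        this.queryNorm = norm;
    }

    /** idf^2 * boost * queryNorm, as stated for getValue(). */
    float getValue() {
        return idf * idf * boost * queryNorm;
    }
}
```

<p>With <tt>idf = 2</tt> and <tt>boost = 3</tt>, <tt>sumOfSquaredWeights()</tt> is 36, and after <tt>normalize(0.5f)</tt> the value is 2 * 2 * 3 * 0.5 = 6.</p>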
 344 <h4>The Scorer Class</h4>
 345     <p>The
 346         <a href="Scorer.html">Scorer</a>
 347         abstract class provides common scoring functionality for all Scorer implementations and
  348         is the heart of the Lucene scoring process. The Scorer defines the following methods, which
  349         must be implemented. Some of them are not yet abstract, but will be in future versions and should be
  350         treated as abstract now; some are inherited from <a href="DocIdSetIterator.html">DocIdSetIterator</a>:
 351         <ol>
 352             <li>
 353                 <a href="DocIdSetIterator.html#nextDoc()">DocIdSetIterator#nextDoc()</a> &mdash; Advances to the next
  354                 document that matches this Query and returns its id, or
  355                 <tt>NO_MORE_DOCS</tt> if there are no more matching documents.</li>
 356             <li>
 357                 <a href="DocIdSetIterator.html#docID()">DocIdSetIterator#docID()</a> &mdash; Returns the id of the
 358                 <a href="../document/Document.html">Document</a>
  359                 that contains the match. It is not valid until nextDoc() or advance(int) has been called at least once.
 360             </li>
 361             <li>
 362                 <a href="Scorer.html#score(org.apache.lucene.search.Collector)">Scorer#score(Collector)</a> &mdash;
 363                 Scores and collects all matching documents using the given Collector.
 364             </li>
 365             <li>
 366                 <a href="Scorer.html#score()">Scorer#score()</a> &mdash; Return the score of the
 367                 current document. This value can be determined in any
 368                 appropriate way for an application. For instance, the
 369                 <a href="http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/search/TermScorer.java?view=log">TermScorer</a>
 370                 returns the tf * Weight.getValue() * fieldNorm.
 371             </li>
 372             <li>
  373                 <a href="DocIdSetIterator.html#advance(int)">DocIdSetIterator#advance(int)</a> &mdash; Skips ahead to
  374                 the first matching document whose id is greater than
  375                 or equal to the passed-in target. In many instances, advance can be
 376                 implemented more efficiently than simply looping through all the matching documents until
 377                 the target document is identified.</li>
 378         </ol>
 379     </p>
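<p>The iteration methods in the list above can be modelled over a sorted array of matching document ids. The sketch below is plain Java and does not extend the real <tt>DocIdSetIterator</tt>; <tt>SortedDocsIterator</tt> is a hypothetical class, and returning -1 from <tt>docID()</tt> before the first call is this sketch's convention. It shows why <tt>advance</tt> can beat repeated <tt>nextDoc</tt> calls: a binary search replaces a linear scan.</p>

```java
import java.util.Arrays;

/** Iterator over a sorted array of doc ids, modelled on nextDoc/docID/advance; not Lucene code. */
final class SortedDocsIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs;  // matching doc ids, sorted ascending, no duplicates
    private int idx = -1;      // -1 means iteration has not started

    SortedDocsIterator(int[] sortedDocs) {
        this.docs = sortedDocs;
    }

    /** Current doc id: -1 before iteration starts, NO_MORE_DOCS once exhausted. */
    int docID() {
        if (idx < 0) {
            return -1;
        }
        return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
    }

    /** Moves to the next matching doc and returns its id, or NO_MORE_DOCS. */
    int nextDoc() {
        if (idx < docs.length) {
            idx++;
        }
        return docID();
    }

    /** Moves to the first doc after the current one whose id is >= target. */
    int advance(int target) {
        int from = Math.min(idx + 1, docs.length);
        int pos = Arrays.binarySearch(docs, from, docs.length, target);
        // A negative result encodes the insertion point as -(point + 1).
        idx = pos >= 0 ? pos : -(pos + 1);
        return docID();
    }
}
```

<p>Over the ids 2, 5, 9 and 14, calling <tt>nextDoc()</tt> returns 2, a following <tt>advance(6)</tt> returns 9, and two further <tt>nextDoc()</tt> calls return 14 and then <tt>NO_MORE_DOCS</tt>.</p>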
 380 <h4>Why would I want to add my own Query?</h4>
 381
 382     <p>In a nutshell, you want to add your own custom Query implementation when you think that Lucene's
  383         built-in queries aren't appropriate for the
  384         task that you want to do. You might be doing some cutting-edge research, or you might need more information
 385         back
 386         out of Lucene (similar to Doug adding SpanQuery functionality).</p>
 387
 388 </body>
 389 </html>