pylucene 3.5.0-3
[pylucene.git] / lucene-java-3.4.0 / lucene / src / site / src / documentation / content / xdocs / scoring.xml
diff --git a/lucene-java-3.4.0/lucene/src/site/src/documentation/content/xdocs/scoring.xml b/lucene-java-3.4.0/lucene/src/site/src/documentation/content/xdocs/scoring.xml
deleted file mode 100644 (file)
index 0fd2e1a..0000000
+++ /dev/null
@@ -1,277 +0,0 @@
-<?xml version="1.0"?>
-
-<document>
-       <header>
-        <title>
-       Apache Lucene - Scoring
-               </title>
-       </header>
-    <properties>
-        <author email="gsingers at apache.org">Grant Ingersoll</author>
-    </properties>
-
-    <body>
-
-        <section id="Introduction"><title>Introduction</title>
-            <p>Lucene scoring is the heart of why we all love Lucene.  It is blazingly fast and it hides almost all of the complexity from the user.
-                In a nutshell, it works.  At least, that is, until it doesn't work, or doesn't work as one would expect it to
-            work.  Then we are left digging into Lucene internals or asking for help on java-user@lucene.apache.org to figure out why a document with five of our query terms
-            scores lower than a different document with only one of the query terms. </p>
-            <p>While this document won't answer your specific scoring issues, it will, hopefully, point you to the places that can
-            help you figure out the what and why of Lucene scoring.</p>
-            <p>Lucene scoring uses a combination of the
-                <a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model (VSM) of Information
-                    Retrieval</a> and the <a href="http://en.wikipedia.org/wiki/Standard_Boolean_model">Boolean model</a>
-                to determine
-                how relevant a given Document is to a User's query.  In general, the idea behind the VSM is the more
-                times a query term appears in a document relative to
-                the number of times the term appears in all the documents in the collection, the more relevant that
-                document is to the query.  It uses the Boolean model to first narrow down the documents that need to
-                be scored based on the use of boolean logic in the Query specification.  Lucene also adds some
-                capabilities and refinements onto this model to support boolean and fuzzy searching, but it
-                essentially remains a VSM based system at the heart.
-                For some valuable references on VSM and IR in general refer to the
-                <a href="http://wiki.apache.org/lucene-java/InformationRetrieval">Lucene Wiki IR references</a>.
-            </p>
-            <p>The rest of this document will cover <a href="#Scoring">Scoring</a> basics and how to change your
-                <a href="api/core/org/apache/lucene/search/Similarity.html">Similarity</a>.  Next it will cover ways you can
-                customize the Lucene internals in <a href="#Changing your Scoring -- Expert Level">Changing your Scoring
-                -- Expert Level</a> which gives details on implementing your own
-                <a href="api/core/org/apache/lucene/search/Query.html">Query</a> class and related functionality.  Finally, we
-                will finish up with some reference material in the <a href="#Appendix">Appendix</a>.
-            </p>
-        </section>
-        <section id="Scoring"><title>Scoring</title>
-            <p>Scoring is very much dependent on the way documents are indexed,
-                so it is important to understand indexing (see
-                <a href="gettingstarted.html">Apache Lucene - Getting Started Guide</a>
-                and the Lucene
-                <a href="fileformats.html">file formats</a>
-                before continuing on with this section.)  It is also assumed that readers know how to use the
-                <a href="api/core/org/apache/lucene/search/Searcher.html#explain(Query query, int doc)">Searcher.explain(Query query, int doc)</a> functionality,
-                which can go a long way in informing why a score is returned.
-            </p>
-            <section id="Fields and Documents"><title>Fields and Documents</title>
-                <p>In Lucene, the objects we are scoring are
-                    <a href="api/core/org/apache/lucene/document/Document.html">Documents</a>.  A Document is a collection
-                of
-                    <a href="api/core/org/apache/lucene/document/Field.html">Fields</a>.  Each Field has semantics about how
-                it is created and stored (i.e. tokenized, untokenized, raw data, compressed, etc.)  It is important to
-                    note that Lucene scoring works on Fields and then combines the results to return Documents.  This is
-                    important because two Documents with the exact same content, but one having the content in two Fields
-                    and the other in one Field will return different scores for the same query due to length normalization
-                    (assumming the
-                    <a href="api/core/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a>
-                    on the Fields).
-                </p>
-            </section>
-            <section id="Score Boosting"><title>Score Boosting</title>
-                <p>Lucene allows influencing search results by "boosting" in more than one level:
-                  <ul>
-                    <li><b>Document level boosting</b>
-                    - while indexing - by calling
-                    <a href="api/core/org/apache/lucene/document/Document.html#setBoost(float)">document.setBoost()</a>
-                    before a document is added to the index.
-                    </li>
-                    <li><b>Document's Field level boosting</b>
-                    - while indexing - by calling
-                    <a href="api/core/org/apache/lucene/document/Fieldable.html#setBoost(float)">field.setBoost()</a>
-                    before adding a field to the document (and before adding the document to the index).
-                    </li>
-                    <li><b>Query level boosting</b>
-                     - during search, by setting a boost on a query clause, calling
-                     <a href="api/core/org/apache/lucene/search/Query.html#setBoost(float)">Query.setBoost()</a>.
-                    </li>
-                  </ul>
-                </p>
-                <p>Indexing time boosts are preprocessed for storage efficiency and written to
-                  the directory (when writing the document) in a single byte (!) as follows:
-                  For each field of a document, all boosts of that field
-                  (i.e. all boosts under the same field name in that doc) are multiplied.
-                  The result is multiplied by the boost of the document,
-                  and also multiplied by a "field length norm" value
-                  that represents the length of that field in that doc
-                  (so shorter fields are automatically boosted up).
-                  The result is decoded as a single byte
-                  (with some precision loss of course) and stored in the directory.
-                  The similarity object in effect at indexing computes the length-norm of the field.
-                </p>
-                <p>This composition of 1-byte representation of norms
-                (that is, indexing time multiplication of field boosts &amp; doc boost &amp; field-length-norm)
-                is nicely described in
-                <a href="api/core/org/apache/lucene/document/Fieldable.html#setBoost(float)">Fieldable.setBoost()</a>.
-                </p>
-                <p>Encoding and decoding of the resulted float norm in a single byte are done by the
-                static methods of the class Similarity:
-                <a href="api/core/org/apache/lucene/search/Similarity.html#encodeNorm(float)">encodeNorm()</a> and
-                <a href="api/core/org/apache/lucene/search/Similarity.html#decodeNorm(byte)">decodeNorm()</a>.
-                Due to loss of precision, it is not guaranteed that decode(encode(x)) = x,
-                e.g. decode(encode(0.89)) = 0.75.
-                At scoring (search) time, this norm is brought into the score of document
-                as <b>norm(t, d)</b>, as shown by the formula in
-                <a href="api/core/org/apache/lucene/search/Similarity.html">Similarity</a>.
-                </p>
-            </section>
-            <section id="Understanding the Scoring Formula"><title>Understanding the Scoring Formula</title>
-
-                <p>
-                This scoring formula is described in the
-                    <a href="api/core/org/apache/lucene/search/Similarity.html">Similarity</a> class.  Please take the time to study this formula, as it contains much of the information about how the
-                    basics of Lucene scoring work, especially the
-                    <a href="api/core/org/apache/lucene/search/TermQuery.html">TermQuery</a>.
-                </p>
-            </section>
-            <section id="The Big Picture"><title>The Big Picture</title>
-                <p>OK, so the tf-idf formula and the
-                    <a href="api/core/org/apache/lucene/search/Similarity.html">Similarity</a>
-                    is great for understanding the basics of Lucene scoring, but what really drives Lucene scoring are
-                    the use and interactions between the
-                    <a href="api/core/org/apache/lucene/search/Query.html">Query</a> classes, as created by each application in
-                    response to a user's information need.
-                </p>
-                <p>In this regard, Lucene offers a wide variety of <a href="api/core/org/apache/lucene/search/Query.html">Query</a> implementations, most of which are in the
-                    <a href="api/core/org/apache/lucene/search/package-summary.html">org.apache.lucene.search</a> package.
-                    These implementations can be combined in a wide variety of ways to provide complex querying
-                    capabilities along with
-                    information about where matches took place in the document collection. The <a href="#Query Classes">Query</a>
-                    section below
-                    highlights some of the more important Query classes.  For information on the other ones, see the
-                    <a href="api/core/org/apache/lucene/search/package-summary.html">package summary</a>.  For details on implementing
-                    your own Query class, see <a href="#Changing your Scoring -- Expert Level">Changing your Scoring --
-                    Expert Level</a> below.
-                </p>
-                <p>Once a Query has been created and submitted to the
-                    <a href="api/core/org/apache/lucene/search/IndexSearcher.html">IndexSearcher</a>, the scoring process
-                begins.  (See the <a
-                href="#Appendix">Appendix</a> Algorithm section for more notes on the process.)  After some infrastructure setup,
-                control finally passes to the <a href="api/core/org/apache/lucene/search/Weight.html">Weight</a> implementation and its
-                    <a href="api/core/org/apache/lucene/search/Scorer.html">Scorer</a> instance.  In the case of any type of
-                    <a href="api/core/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a>, scoring is handled by the
-                    <a href="http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/search/BooleanQuery.java?view=log">BooleanWeight2</a>
-                    (link goes to ViewVC BooleanQuery java code which contains the BooleanWeight2 inner class) or
-                    <a href="http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/search/BooleanQuery.java?view=log">BooleanWeight</a>
-                    (link goes to ViewVC BooleanQuery java code, which contains the BooleanWeight inner class).
-                </p>
-                <p>
-                    Assuming the use of the BooleanWeight2, a
-                    BooleanScorer2 is created by bringing together
-                    all of the
-                    <a href="api/core/org/apache/lucene/search/Scorer.html">Scorer</a>s from the sub-clauses of the BooleanQuery.
-                    When the BooleanScorer2 is asked to score it delegates its work to an internal Scorer based on the type
-                    of clauses in the Query.  This internal Scorer essentially loops over the sub scorers and sums the scores
-                    provided by each scorer while factoring in the coord() score.
-                    <!-- Do we want to fill in the details of the counting sum scorer, disjunction scorer, etc.? -->
-                </p>
-            </section>
-            <section id="Query Classes"><title>Query Classes</title>
-                <p>For information on the Query Classes, refer to the
-                    <a href="api/core/org/apache/lucene/search/package-summary.html#query">search package javadocs</a>
-                </p>
-            </section>
-            <section id="Changing Similarity"><title>Changing Similarity</title>
-                <p>One of the ways of changing the scoring characteristics of Lucene is to change the similarity factors.  For information on
-                how to do this, see the
-                    <a href="api/core/org/apache/lucene/search/package-summary.html#changingSimilarity">search package javadocs</a></p>
-            </section>
-
-        </section>
-        <section id="Changing your Scoring -- Expert Level"><title>Changing your Scoring -- Expert Level</title>
-            <p>At a much deeper level, one can affect scoring by implementing their own Query classes (and related scoring classes.)  To learn more
-                about how to do this, refer to the
-                <a href="api/core/org/apache/lucene/search/package-summary.html#scoring">search package javadocs</a>
-            </p>
-        </section>
-
-        <section id="Appendix"><title>Appendix</title>
-            <section id="Algorithm"><title>Algorithm</title>
-                <p>This section is mostly notes on stepping through the Scoring process and serves as
-                    fertilizer for the earlier sections.</p>
-                <p>In the typical search application, a
-                    <a href="api/core/org/apache/lucene/search/Query.html">Query</a>
-                    is passed to the
-                    <a
-                            href="api/core/org/apache/lucene/search/Searcher.html">Searcher</a>
-                    , beginning the scoring process.
-                </p>
-                <p>Once inside the Searcher, a
-                    <a href="api/core/org/apache/lucene/search/Collector.html">Collector</a>
-                    is used for the scoring and sorting of the search results.
-                    These important objects are involved in a search:
-                    <ol>
-                        <li>The
-                            <a href="api/core/org/apache/lucene/search/Weight.html">Weight</a>
-                            object of the Query. The Weight object is an internal representation of the Query that
-                            allows the Query to be reused by the Searcher.
-                        </li>
-                        <li>The Searcher that initiated the call.</li>
-                        <li>A
-                            <a href="api/core/org/apache/lucene/search/Filter.html">Filter</a>
-                            for limiting the result set. Note, the Filter may be null.
-                        </li>
-                        <li>A
-                            <a href="api/core/org/apache/lucene/search/Sort.html">Sort</a>
-                            object for specifying how to sort the results if the standard score based sort method is not
-                            desired.
-                        </li>
-                    </ol>
-                </p>
-                <p> Assuming we are not sorting (since sorting doesn't
-                    effect the raw Lucene score),
-                    we call one of the search methods of the Searcher, passing in the
-                    <a href="api/core/org/apache/lucene/search/Weight.html">Weight</a>
-                    object created by Searcher.createWeight(Query),
-                    <a href="api/core/org/apache/lucene/search/Filter.html">Filter</a>
-                    and the number of results we want. This method
-                    returns a
-                    <a href="api/core/org/apache/lucene/search/TopDocs.html">TopDocs</a>
-                    object, which is an internal collection of search results.
-                    The Searcher creates a
-                    <a href="api/core/org/apache/lucene/search/TopScoreDocCollector.html">TopScoreDocCollector</a>
-                    and passes it along with the Weight, Filter to another expert search method (for more on the
-                    <a href="api/core/org/apache/lucene/search/Collector.html">Collector</a>
-                    mechanism, see
-                    <a href="api/core/org/apache/lucene/search/Searcher.html">Searcher</a>
-                    .) The TopDocCollector uses a
-                    <a href="api/core/org/apache/lucene/util/PriorityQueue.html">PriorityQueue</a>
-                    to collect the top results for the search.
-                </p>
-                <p>If a Filter is being used, some initial setup is done to determine which docs to include. Otherwise,
-                    we ask the Weight for
-                    a
-                    <a href="api/core/org/apache/lucene/search/Scorer.html">Scorer</a>
-                    for the
-                    <a href="api/core/org/apache/lucene/index/IndexReader.html">IndexReader</a>
-                    of the current searcher and we proceed by
-                    calling the score method on the
-                    <a href="api/core/org/apache/lucene/search/Scorer.html">Scorer</a>
-                    .
-                </p>
-                <p>At last, we are actually going to score some documents. The score method takes in the Collector
-                    (most likely the TopScoreDocCollector or TopFieldCollector) and does its business.
-                    Of course, here is where things get involved. The
-                    <a href="api/core/org/apache/lucene/search/Scorer.html">Scorer</a>
-                    that is returned by the
-                    <a href="api/core/org/apache/lucene/search/Weight.html">Weight</a>
-                    object depends on what type of Query was submitted. In most real world applications with multiple
-                    query terms,
-                    the
-                    <a href="api/core/org/apache/lucene/search/Scorer.html">Scorer</a>
-                    is going to be a
-                    <a href="http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/search/BooleanScorer2.java?view=log">BooleanScorer2</a>
-                    (see the section on customizing your scoring for info on changing this.)
-
-                </p>
-                <p>Assuming a BooleanScorer2 scorer, we first initialize the Coordinator, which is used to apply the
-                    coord() factor. We then
-                    get a internal Scorer based on the required, optional and prohibited parts of the query.
-                    Using this internal Scorer, the BooleanScorer2 then proceeds
-                    into a while loop based on the Scorer#next() method. The next() method advances to the next document
-                    matching the query. This is an
-                    abstract method in the Scorer class and is thus overriden by all derived
-                    implementations.  <!-- DOUBLE CHECK THIS -->If you have a simple OR query
-                    your internal Scorer is most likely a DisjunctionSumScorer, which essentially combines the scorers
-                    from the sub scorers of the OR'd terms.</p>
-            </section>
-        </section>
-    </body>
-</document>