lucene-java-3.5.0/lucene/src/site/src/documentation/content/xdocs/demo2.xml

<?xml version="1.0"?>
<document>
        <header>
        <title>
        Apache Lucene - Basic Demo Sources Walk-through
        </title>
        </header>
<properties>
<author email="acoliver@apache.org">Andrew C. Oliver</author>
</properties>
<body>

<section id="About the Code"><title>About the Code</title>
<p>
In this section we walk through the sources behind the command-line Lucene demo: where to find them,
their parts and their function.  This section is intended for Java developers wishing to understand
how to use Lucene in their applications.
</p>
</section>


<section id="Location of the source"><title>Location of the source</title>

<p>
NOTE: to examine the sources, you need to download and extract a source distribution of
Lucene (lucene-{version}-src.zip).
</p>

<p>
Relative to the directory created when you extracted Lucene, you
should see a directory called <code>lucene/contrib/demo/</code>.  This is the root for the Lucene
demo.  Under this directory is <code>src/java/org/apache/lucene/demo/</code>, which is where all
the Java sources for the demo live.
</p>

<p>
Within this directory you should see <code>IndexFiles.java</code>, the source for the class we executed earlier.
Bring it up in <code>vi</code> or your editor of choice and let's take a look at it.
</p>

</section>

<section id="IndexFiles"><title>IndexFiles</title>

<p>
As we discussed in the previous walk-through, the <a
href="api/contrib-demo/org/apache/lucene/demo/IndexFiles.html">IndexFiles</a> class creates a Lucene
index. Let's take a look at how it does this.
</p>

<p>
The <code>main()</code> method parses the command-line parameters and then, in preparation for
instantiating <a href="api/core/org/apache/lucene/index/IndexWriter.html">IndexWriter</a>, opens a
<a href="api/core/org/apache/lucene/store/Directory.html">Directory</a> and instantiates
<a href="api/module-analysis-common/org/apache/lucene/analysis/standard/StandardAnalyzer.html"
>StandardAnalyzer</a> and
<a href="api/core/org/apache/lucene/index/IndexWriterConfig.html">IndexWriterConfig</a>.
</p>

<p>
The value of the <code>-index</code> command-line parameter is the name of the filesystem directory
where all index information should be stored.  If <code>IndexFiles</code> is invoked with a
relative path for <code>-index</code>, or if <code>-index</code> is omitted so that the default
relative index path "<code>index</code>" is used, the index path will be created as a subdirectory
of the current working directory (if it does not already exist).  On some platforms, the index
path may be created in a different directory (such as the user's home directory).
</p>
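
<p>
The resolution rule above can be sketched with plain <code>java.io.File</code>.  This is a minimal
illustration, not the demo's code: the class <code>IndexPathSketch</code> and its <code>resolve</code>
helper are hypothetical names, and the sketch assumes the JVM resolves relative paths against the
<code>user.dir</code> system property, which is typically the directory the command was launched from.
</p>
<source><![CDATA[
```java
import java.io.File;

final class IndexPathSketch {
    // Hypothetical helper: fall back to the default relative path "index"
    // when no -index value was supplied (modelled here as null).
    static File resolve(String indexArg) {
        String path = (indexArg == null) ? "index" : indexArg;
        // A relative path is made absolute against the current working directory.
        return new File(path).getAbsoluteFile();
    }

    public static void main(String[] args) {
        System.out.println(resolve(null));                         // default path
        System.out.println(resolve(args.length > 0 ? args[0] : "my-index"));
    }
}
```
]]></source>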

<p>
The <code>-docs</code> command-line parameter value is the location of the directory containing
files to be indexed.
</p>

<p>
The <code>-update</code> command-line parameter tells <code>IndexFiles</code> not to delete the
index if it already exists.  When <code>-update</code> is not given, <code>IndexFiles</code> will
first delete any existing index before indexing any documents.
</p>
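
<p>
To make the interplay of the three parameters concrete, here is a minimal sketch of how such
command-line handling could look.  It is an assumption-laden illustration rather than the demo's
actual source: <code>DemoArgsSketch</code> is a hypothetical class, and the nested <code>Mode</code>
enum is only a stand-in for Lucene's <code>OpenMode</code> so that the sketch compiles without Lucene
on the classpath.
</p>
<source><![CDATA[
```java
final class DemoArgsSketch {
    // Stand-in for Lucene's OpenMode; defined locally so no Lucene jar is needed.
    enum Mode { CREATE, CREATE_OR_APPEND }

    String indexPath = "index"; // default relative index path
    String docsPath = null;     // directory containing the files to index
    boolean create = true;      // replace any existing index unless -update is given

    static DemoArgsSketch parse(String[] args) {
        DemoArgsSketch parsed = new DemoArgsSketch();
        for (int i = 0; i < args.length; i++) {
            if ("-index".equals(args[i]) && i + 1 < args.length) {
                parsed.indexPath = args[++i];
            } else if ("-docs".equals(args[i]) && i + 1 < args.length) {
                parsed.docsPath = args[++i];
            } else if ("-update".equals(args[i])) {
                parsed.create = false;
            }
        }
        return parsed;
    }

    // Without -update the index is recreated; with it, an existing index is kept.
    Mode openMode() {
        return create ? Mode.CREATE : Mode.CREATE_OR_APPEND;
    }

    public static void main(String[] args) {
        DemoArgsSketch parsed = parse(args);
        System.out.println(parsed.indexPath + " " + parsed.docsPath + " " + parsed.openMode());
    }
}
```
]]></source>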

<p>
Lucene <a href="api/core/org/apache/lucene/store/Directory.html">Directory</a>s are used by the
<code>IndexWriter</code> to store information in the index.  In addition to the
<a href="api/core/org/apache/lucene/store/FSDirectory.html">FSDirectory</a> implementation we are using,
there are several other <code>Directory</code> subclasses that can write to RAM, to databases, etc.
</p>

<p>
Lucene <a href="api/core/org/apache/lucene/analysis/Analyzer.html">Analyzer</a>s are processing pipelines
that break up text into indexed tokens, a.k.a. terms, and optionally perform other operations on these
tokens, e.g. downcasing, synonym insertion, filtering out unwanted tokens, etc.  The <code>Analyzer</code>
we are using is <code>StandardAnalyzer</code>, which creates tokens using the Word Break rules from the
Unicode Text Segmentation algorithm specified in <a href="http://unicode.org/reports/tr29/">Unicode
Standard Annex #29</a>; converts tokens to lowercase; and then filters out stopwords.  Stopwords are
common language words such as articles (a, an, the, etc.) and other tokens that may have less value for
searching.  Note that analysis rules differ from language to language, so you should use an
analyzer appropriate to each language.  Lucene currently provides Analyzers for a number of different
languages (see the javadocs under
<a href="api/all/org/apache/lucene/analysis/"
>lucene/contrib/analyzers/common/src/java/org/apache/lucene/analysis</a>).
</p>
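
<p>
The three stages described above (tokenize, lowercase, drop stopwords) can be illustrated with a
deliberately simplified sketch.  It is <strong>not</strong> how <code>StandardAnalyzer</code> is
implemented: the class <code>AnalysisPipelineSketch</code> is hypothetical, the tokenizer merely
splits on runs of characters that are neither letters nor digits instead of applying the UAX #29
rules, and the stopword list is a tiny made-up sample.
</p>
<source><![CDATA[
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class AnalysisPipelineSketch {
    // Tiny illustrative stopword list; not Lucene's default set.
    private static final Set<String> STOPWORDS =
        new HashSet<String>(Arrays.asList("a", "an", "the", "of", "and"));

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<String>();
        // Stage 1: crude tokenization on runs of non-letter, non-digit characters.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token when the text starts with a separator
            }
            // Stage 2: lowercase the token.
            String lower = token.toLowerCase(Locale.ROOT);
            // Stage 3: filter out stopwords.
            if (!STOPWORDS.contains(lower)) {
                terms.add(lower);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [quick, brown, fox, lazy, dog]
        System.out.println(analyze("The Quick Brown Fox and the Lazy Dog"));
    }
}
```
]]></source>
<p>
Because the same pipeline must be applied to both documents and queries for terms to match, a
helper like <code>analyze</code> would be called on each side in a design of this shape.
</p>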

<p>
The <code>IndexWriterConfig</code> instance holds all configuration for <code>IndexWriter</code>.  For
example, we set the <code>OpenMode</code> to use here based on the value of the <code>-update</code>
command-line parameter.
</p>

<p>
Looking further down in the file, after <code>IndexWriter</code> is instantiated, you should see the
<code>indexDocs()</code> code.  This recursive method crawls the directories and creates
<a href="api/core/org/apache/lucene/document/Document.html">Document</a> objects.  The
<code>Document</code> is simply a data object to represent the text content from the file as well as
its last-modified time and location.  These instances are added to the <code>IndexWriter</code>.  If
the <code>-update</code> command-line parameter is given, the <code>IndexWriter</code>
<code>OpenMode</code> will be set to <code>OpenMode.CREATE_OR_APPEND</code>, and rather than
adding documents to the index, the <code>IndexWriter</code> will <strong>update</strong> them:
it attempts to find an already-indexed document with the same identifier (in our case, the file
path serves as the identifier), deletes it from the index if it exists, and then adds the new
document to the index.
</p>
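
<p>
The recursive crawl and the delete-then-add update rule can be sketched without Lucene.  In the
hypothetical <code>CrawlSketch</code> class below, a map keyed by file path stands in for the index,
and each "document" is reduced to the file's last-modified time; both choices are assumptions made
for brevity and do not reflect how <code>IndexWriter</code> stores documents.
</p>
<source><![CDATA[
```java
import java.io.File;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

final class CrawlSketch {
    // Stand-in for an index: identifier (file path) -> "document" (last-modified time).
    final Map<String, Long> docsByPath = new LinkedHashMap<String, Long>();

    // Recursively visit a file or directory, recording every readable regular file.
    void indexDocs(File file) {
        if (!file.canRead()) {
            return; // skip unreadable entries
        }
        if (file.isDirectory()) {
            File[] children = file.listFiles();
            if (children == null) {
                return; // an I/O error occurred while listing the directory
            }
            Arrays.sort(children); // visit in a deterministic order
            for (File child : children) {
                indexDocs(child);
            }
        } else {
            // Update semantics: remove any entry with the same identifier, then add the new one.
            docsByPath.remove(file.getPath());
            docsByPath.put(file.getPath(), file.lastModified());
        }
    }

    public static void main(String[] args) {
        CrawlSketch sketch = new CrawlSketch();
        sketch.indexDocs(new File(args.length > 0 ? args[0] : "."));
        System.out.println(sketch.docsByPath.size() + " files recorded");
    }
}
```
]]></source>
<p>
Running <code>indexDocs</code> twice over the same tree leaves one entry per file, which mirrors
the effect the update mode is meant to have on an existing index.
</p>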

</section>

<section id="Searching Files"><title>Searching Files</title>

<p>
The <a href="api/contrib-demo/org/apache/lucene/demo/SearchFiles.html">SearchFiles</a> class is
quite simple.  It primarily collaborates with an
<a href="api/core/org/apache/lucene/search/IndexSearcher.html">IndexSearcher</a>,
<a href="api/modules-analysis-common/org/apache/lucene/analysis/standard/StandardAnalyzer.html"
>StandardAnalyzer</a> (which is used in the
<a href="api/contrib-demo/org/apache/lucene/demo/IndexFiles.html">IndexFiles</a> class as well)
and a <a href="api/core/org/apache/lucene/queryParser/QueryParser.html">QueryParser</a>.  The
query parser is constructed with an analyzer used to interpret your query text in the same way the
documents are interpreted: finding word boundaries, downcasing, and removing stopwords such as
'a', 'an' and 'the'.  The <a href="api/core/org/apache/lucene/search/Query.html">Query</a>
object holds the parsed query produced by the
<a href="api/core/org/apache/lucene/queryParser/QueryParser.html">QueryParser</a> and is passed
to the searcher.  Note that it's also possible to programmatically construct a rich
<a href="api/core/org/apache/lucene/search/Query.html">Query</a> object without using the query
parser.  The query parser just enables decoding the <a href="queryparsersyntax.html">Lucene query
syntax</a> into the corresponding <a href="api/core/org/apache/lucene/search/Query.html">Query</a>
object.
</p>

<p>
<code>SearchFiles</code> uses the <code>IndexSearcher.search(query,n)</code> method, which returns
<a href="api/core/org/apache/lucene/search/TopDocs.html">TopDocs</a> with at most <code>n</code> hits.
The results are printed in pages, sorted by score (i.e. relevance).
</p>
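
<p>
The idea of keeping at most <code>n</code> hits ranked by score and then printing them page by
page can be sketched as follows.  This is a Lucene-free illustration under stated assumptions:
<code>TopHitsSketch</code>, its nested <code>Hit</code> type, and the <code>topN</code> and
<code>page</code> helpers are hypothetical names, and the scores are made-up values rather than
anything <code>IndexSearcher</code> would compute.
</p>
<source><![CDATA[
```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class TopHitsSketch {
    // Hypothetical stand-in for one scored hit; Lucene's own result types are richer.
    static final class Hit {
        final String path;
        final float score;

        Hit(String path, float score) {
            this.path = path;
            this.score = score;
        }
    }

    // Keep at most n hits, best score first.
    static List<Hit> topN(List<Hit> candidates, int n) {
        List<Hit> sorted = new ArrayList<Hit>(candidates);
        Collections.sort(sorted, new Comparator<Hit>() {
            public int compare(Hit x, Hit y) {
                return Float.compare(y.score, x.score); // descending by score
            }
        });
        int limit = Math.max(0, Math.min(n, sorted.size()));
        return new ArrayList<Hit>(sorted.subList(0, limit));
    }

    // Slice one zero-based page out of the already-ranked hits.
    static List<Hit> page(List<Hit> hits, int pageIndex, int hitsPerPage) {
        int from = Math.min(pageIndex * hitsPerPage, hits.size());
        int to = Math.min(from + hitsPerPage, hits.size());
        return hits.subList(from, to);
    }

    public static void main(String[] args) {
        List<Hit> all = new ArrayList<Hit>();
        all.add(new Hit("docs/a.txt", 0.2f));
        all.add(new Hit("docs/b.txt", 0.9f));
        all.add(new Hit("docs/c.txt", 0.5f));
        for (Hit hit : page(topN(all, 2), 0, 10)) {
            System.out.println(hit.path + " score=" + hit.score);
        }
    }
}
```
]]></source>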
</section>
</body>
</document>