pylucene 3.5.0-3
diff --git a/lucene-java-3.5.0/lucene/src/site/src/documentation/content/xdocs/fileformats.xml b/lucene-java-3.5.0/lucene/src/site/src/documentation/content/xdocs/fileformats.xml
new file mode 100644
index 0000000..02a66d2
--- /dev/null
@@ -0,0 +1,1936 @@
+<?xml version="1.0"?>
+
+<document>
+    <header>
+        <title>
+            Apache Lucene - Index File Formats
+        </title>
+    </header>
+
+    <body>
+        <section id="Index File Formats"><title>Index File Formats</title>
+
+            <p>
+                This document defines the index file formats used
+                in this version of Lucene. If you are using a different
+                version of Lucene, please consult the copy of
+                <code>docs/fileformats.html</code>
+                that was distributed
+                with the version you are using.
+            </p>
+
+            <p>
+                Apache Lucene is written in Java, but several
+                efforts are underway to write
+                <a href="http://wiki.apache.org/lucene-java/LuceneImplementations">versions
+                    of Lucene in other programming
+                languages</a>.  If these versions are to remain compatible with Apache
+                Lucene, then a language-independent definition of the Lucene index
+                format is required.  This document thus attempts to provide a
+                complete and independent definition of the Apache Lucene file
+                formats.
+            </p>
+
+            <p>
+                As Lucene evolves, this document should evolve.
+                Versions of Lucene in different programming languages should endeavor
+                to agree on file formats, and generate new versions of this document.
+            </p>
+
+            <p>
+                Compatibility notes are provided in this document,
+                describing how file formats have changed from prior versions.
+            </p>
+
+            <p>
+                In version 2.1, the file format was changed to allow
+                lock-less commits (i.e., no more commit lock). The
+                change is fully backwards compatible: you can open a
+                pre-2.1 index for searching or for adding and deleting
+                documents. When the new segments file is saved
+                (committed), it will be written in the new file format
+                (meaning no specific "upgrade" process is needed).
+                But note that once a commit has occurred, pre-2.1
+                Lucene will not be able to read the index.
+            </p>
+
+            <p>
+                In version 2.3, the file format was changed to allow
+               segments to share a single set of doc store (vectors &amp;
+               stored fields) files.  This allows for faster indexing
+               in certain cases.  The change is fully backwards
+               compatible (in the same way as the lock-less commits
+               change in 2.1).
+            </p>
+
+            <p>
+               In version 2.4, strings are written as true UTF-8
+               byte sequences, not Java's modified UTF-8.  See issue
+               LUCENE-510 for details.
+            </p>
+
+           <p>
+               In version 2.9, an optional opaque Map&lt;String,String&gt;
+               CommitUserData may be passed to IndexWriter's commit
+               methods (and later retrieved), which is recorded in
+               the segments_N file.  See issue LUCENE-1382 for
+               details.  Also, diagnostics were added to each segment
+               written, recording details about why it was written
+               (due to flush, merge; which OS/JRE was used; etc.).
+               See issue LUCENE-1654 for details.
+            </p>
+           
+           <p>
+               In version 3.0, compressed fields are no longer
+               written to the index (they can still be read, but on
+               merge the new segment will write them
+               uncompressed). See issue LUCENE-1960 for details.
+            </p>
+
+        <p>
+            In version 3.1, segments record the code version
+            that created them. See LUCENE-2720 for details.
+
+            Additionally, segments explicitly track whether or
+            not they have term vectors. See LUCENE-2811 for details.
+           </p>
+        <p>
+            In version 3.2, numeric fields are written natively
+            to the stored fields file; previously they were stored
+            in text format only.
+           </p>
+        <p>
+            In version 3.4, fields can omit position data while
+            still indexing term frequencies.
+        </p>
+        </section>
+
+        <section id="Definitions"><title>Definitions</title>
+
+            <p>
+                The fundamental concepts in Lucene are index,
+                document, field and term.
+            </p>
+
+
+            <p>
+                An index contains a sequence of documents.
+            </p>
+
+            <ul>
+                <li>
+                    <p>
+                        A document is a sequence of fields.
+                    </p>
+                </li>
+
+                <li>
+                    <p>
+                        A field is a named sequence of terms.
+                    </p>
+                </li>
+
+                <li>
+                    A term is a string.
+                </li>
+            </ul>
+
+            <p>
+                The same string in two different fields is
+                considered a different term.  Thus terms are represented as a pair of
+                strings, the first naming the field, and the second naming text
+                within the field.
+            </p>
+
+            <section id="Inverted Indexing"><title>Inverted Indexing</title>
+
+                <p>
+                    The index stores statistics about terms in order
+                    to make term-based search more efficient.  Lucene's
+                    index belongs to the family of indexes known as <i>inverted
+                        indexes</i>, because it can list, for a term, the documents that contain
+                    it.  This is the inverse of the natural relationship, in which
+                    documents list terms.
+                </p>
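+                <p>
+                    As a minimal sketch of this idea, and not Lucene's actual
+                    implementation, the following Python fragment inverts a small
+                    collection of documents. The <code>invert</code> function and the
+                    dict-based document layout are assumptions made for illustration;
+                    terms are keyed by (field name, text) pairs as defined above.
+                </p>
+                <source><![CDATA[
```python
from collections import defaultdict


def invert(docs):
    """Map each (field, text) term to the list of document numbers containing it.

    `docs` is a list of dicts from field name to a list of term strings; a
    document's number is its position in the list (a hypothetical layout).
    """
    postings = defaultdict(list)
    for doc_number, doc in enumerate(docs):
        for field, terms in doc.items():
            # set() so that a document is listed once per term,
            # however often the term occurs in it
            for text in set(terms):
                postings[(field, text)].append(doc_number)
    return postings


docs = [
    {"title": ["lucene"], "body": ["lucene", "index"]},
    {"body": ["index"]},
]
postings = invert(docs)
print(postings[("body", "index")])    # documents containing the term body:index
print(postings[("title", "lucene")])  # same text in another field is a different term
```
+                ]]></source>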
+            </section>
+            <section id="Types of Fields">
+                <title>Types of Fields</title>
+                <p>
+                    In Lucene, fields may be <i>stored</i>, in which
+                    case their text is stored in the index literally, in a non-inverted
+                    manner.  Fields that are inverted are called <i>indexed</i>. A field
+                    may be both stored and indexed.</p>
+
+                <p>The text of a field may be <i>tokenized</i> into terms to be
+                    indexed, or the text of a field may be used literally as a term to be indexed.
+                    Most fields are
+                    tokenized, but sometimes it is useful for certain identifier fields
+                    to be indexed literally.
+                </p>
+                <p>See the <a href="api/core/org/apache/lucene/document/Field.html">Field</a> Javadocs for more information on Fields.</p>
+            </section>
+
+            <section id="Segments"><title>Segments</title>
+
+                <p>
+                    Lucene indexes may be composed of multiple sub-indexes, or
+                    <i>segments</i>. Each segment is a fully independent index, which could be searched
+                    separately. Indexes evolve by:
+                </p>
+
+                <ol>
+                    <li>
+                        <p>Creating new segments for newly added documents.</p>
+                    </li>
+                    <li>
+                        <p>Merging existing segments.</p>
+                    </li>
+                </ol>
+
+                <p>
+                    Searches may involve multiple segments and/or multiple indexes, each
+                    index potentially composed of a set of segments.
+                </p>
+            </section>
+
+            <section id="Document Numbers"><title>Document Numbers</title>
+
+                <p>
+                    Internally, Lucene refers to documents by an integer <i>document
+                        number</i>. The first document added to an index is numbered zero, and each
+                    subsequent document added gets a number one greater than the previous.
+                </p>
+
+                <p>
+                    Note that a document's number may change, so caution should be taken
+                    when storing these numbers outside of Lucene. In particular, numbers may
+                    change in the following situations:
+                </p>
+
+
+                <ul>
+                    <li>
+                        <p>
+                            The
+                            numbers stored in each segment are unique only within the segment,
+                            and must be converted before they can be used in a larger context.
+                            The standard technique is to allocate each segment a range of
+                            values, based on the range of numbers used in that segment.  To
+                            convert a document number from a segment to an external value, the
+                            segment's <i>base</i> document
+                            number is added.  To convert an external value back to a
+                            segment-specific value, the  segment is identified by the range that
+                            the external value is in, and the segment's base value is
+                            subtracted.  For example, two five-document segments might be
+                            combined, so that the first segment has a base value of zero, and
+                            the second of five.  Document three from the second segment would
+                            then have an external value of eight.
+                        </p>
+                    </li>
+                    <li>
+                        <p>
+                            When documents are deleted, gaps are created
+                            in the numbering. These are eventually removed as the index evolves
+                            through merging. Deleted documents are dropped when segments are
+                            merged. A freshly-merged segment thus has no gaps in its numbering.
+                        </p>
+                    </li>
+                </ul>
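+                <p>
+                    The base-value conversion described in the first item can be
+                    sketched as follows. This is an illustration only: the function
+                    names are hypothetical, segments are assumed to be non-empty,
+                    and deleted documents are ignored.
+                </p>
+                <source><![CDATA[
```python
from bisect import bisect_right


def segment_bases(segment_sizes):
    """Return the base document number of each segment, in segment order."""
    bases, total = [], 0
    for size in segment_sizes:
        bases.append(total)
        total += size
    return bases


def to_external(bases, segment, local_doc):
    """Segment-specific number -> external number: add the segment's base."""
    return bases[segment] + local_doc


def to_local(bases, external_doc):
    """External number -> (segment, segment-specific number).

    The segment is the last one whose base is not greater than the
    external value; its base is then subtracted.
    """
    segment = bisect_right(bases, external_doc) - 1
    return segment, external_doc - bases[segment]


# Two five-document segments, as in the example above.
bases = segment_bases([5, 5])
print(to_external(bases, 1, 3))  # document three of the second segment
print(to_local(bases, 8))
```
+                ]]></source>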
+
+            </section>
+
+        </section>
+
+        <section id="Overview"><title>Overview</title>
+
+            <p>
+                Each segment index maintains the following:
+            </p>
+            <ul>
+                <li>
+                    <p>Field names. This
+                        contains the set of field names used in the index.
+
+                    </p>
+                </li>
+                <li>
+                    <p>Stored Field
+                        values. This contains, for each document, a list of attribute-value
+                        pairs, where the attributes are field names. These are used to
+                        store auxiliary information about the document, such as its title,
+                        URL, or an identifier to access a
+                        database. The set of stored fields is what is returned for each hit
+                        when searching. This is keyed by document number.
+                    </p>
+                </li>
+                <li>
+                    <p>Term dictionary.
+                        A dictionary containing all of the terms used in all of the indexed
+                        fields of all of the documents. The dictionary also contains the
+                        number of documents which contain the term, and pointers to the
+                        term's frequency and proximity data.
+                    </p>
+                </li>
+
+                <li>
+                    <p>Term Frequency
+                        data. For each term in the dictionary, the numbers of all the
+                        documents that contain that term, and the frequency of the term in
+                        each such document, unless frequencies are omitted (IndexOptions.DOCS_ONLY).
+                    </p>
+                </li>
+
+                <li>
+                    <p>Term Proximity
+                        data. For each term in the dictionary, the positions at which the term
+                        occurs in each document.  Note that this will
+                        not exist if all fields in all documents omit position data.
+                    </p>
+                </li>
+
+                <li>
+                    <p>Normalization
+                        factors. For each field in each document, a value is stored that is
+                        multiplied into the score for hits on that field.
+                    </p>
+                </li>
+                <li>
+                    <p>Term Vectors. For each field in each document, the term vector
+                        (sometimes called document vector) may be stored. A term vector consists
+                        of term text and term frequency. To add Term Vectors to your index, see the
+                        <a href="api/core/org/apache/lucene/document/Field.html">Field</a>
+                        constructors.
+                    </p>
+                </li>
+                <li>
+                    <p>Deleted documents.
+                        An optional file indicating which documents are deleted.
+                    </p>
+                </li>
+            </ul>
+
+            <p>Details on each of these are provided in subsequent sections.
+            </p>
+        </section>
+
+        <section id="File Naming"><title>File Naming</title>
+
+            <p>
+                All files belonging to a segment have the same name with varying
+                extensions. The extensions correspond to the different file formats
+                described below. When using the Compound File format (default in 1.4 and greater), these files are
+                collapsed into a single .cfs file (see below for details).
+            </p>
+
+            <p>
+                Typically, all segments
+                in an index are stored in a single directory, although this is not
+                required.
+            </p>
+
+            <p>
+                As of version 2.1 (lock-less commits), file names are
+                never re-used (there is one exception, "segments.gen",
+                see below). That is, when any file is saved to the
+                Directory, it is given a never-before-used filename.
+                This is achieved using a simple generations approach.
+                For example, the first segments file is segments_1,
+                then segments_2, etc. The generation is a sequential
+                long integer represented in alpha-numeric (base 36)
+                form.
+            </p>
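+            <p>
+                As a hedged illustration of the generation naming scheme, the
+                following Python sketch derives a segments file name from a
+                generation number. The helper names are hypothetical, and the
+                use of lowercase letters for digits above nine is an assumption
+                of this sketch.
+            </p>
+            <source><![CDATA[
```python
import string

# Digits for base 36: 0-9 followed by a-z (lowercase assumed here).
DIGITS = string.digits + string.ascii_lowercase


def to_base36(n):
    """Render a non-negative integer in base 36."""
    if n < 0:
        raise ValueError("generation must be non-negative")
    if n == 0:
        return "0"
    out = []
    while n:
        n, remainder = divmod(n, 36)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


def segments_file_name(generation):
    """Name of the segments file for the given generation."""
    return "segments_" + to_base36(generation)


for generation in (1, 2, 35, 36):
    print(generation, segments_file_name(generation))
```
+            ]]></source>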
+
+        </section>
+      <section id="file-names"><title>Summary of File Extensions</title>
+        <p>The following table summarizes the names and extensions of the files in Lucene:
+          <table>
+            <tr>
+              <th>Name</th>
+              <th>Extension</th>
+              <th>Brief Description</th>
+            </tr>
+            <tr>
+              <td><a href="#Segments File">Segments File</a></td>
+              <td>segments.gen, segments_N</td>
+              <td>Stores information about segments</td>
+            </tr>
+            <tr>
+              <td><a href="#Lock File">Lock File</a></td>
+              <td>write.lock</td>
+              <td>The Write lock prevents multiple IndexWriters from writing to the same file.</td>
+            </tr>
+            <tr>
+              <td><a href="#Compound Files">Compound File</a></td>
+              <td>.cfs</td>
+              <td>An optional "virtual" file consisting of all the other index files for systems
+              that frequently run out of file handles.</td>
+            </tr>
+              <tr>
+              <td><a href="#Compound Files">Compound File Entry table</a></td>
+              <td>.cfe</td>
+              <td>The "virtual" compound file's entry table holding all entries in the corresponding .cfs file (since 3.4)</td>
+            </tr>
+            <tr>
+              <td><a href="#Fields">Fields</a></td>
+              <td>.fnm</td>
+              <td>Stores information about the fields</td>
+            </tr>
+            <tr>
+              <td><a href="#field_index">Field Index</a></td>
+              <td>.fdx</td>
+              <td>Contains pointers to field data</td>
+            </tr>
+            <tr>
+              <td><a href="#field_data">Field Data</a></td>
+              <td>.fdt</td>
+              <td>The stored fields for documents</td>
+            </tr>
+            <tr>
+              <td><a href="#tis">Term Infos</a></td>
+              <td>.tis</td>
+              <td>Part of the term dictionary, stores term info</td>
+            </tr>
+            <tr>
+              <td><a href="#tii">Term Info Index</a></td>
+              <td>.tii</td>
+              <td>The index into the Term Infos file</td>
+            </tr>
+            <tr>
+              <td><a href="#Frequencies">Frequencies</a></td>
+              <td>.frq</td>
+              <td>Contains the list of docs which contain each term along with frequency</td>
+            </tr>
+            <tr>
+              <td><a href="#Positions">Positions</a></td>
+              <td>.prx</td>
+              <td>Stores position information about where a term occurs in the index</td>
+            </tr>
+            <tr>
+              <td><a href="#Normalization Factors">Norms</a></td>
+              <td>.nrm</td>
+              <td>Encodes length and boost factors for docs and fields</td>
+            </tr>
+            <tr>
+              <td><a href="#tvx">Term Vector Index</a></td>
+              <td>.tvx</td>
+              <td>Stores offset into the document data file</td>
+            </tr>
+            <tr>
+              <td><a href="#tvd">Term Vector Documents</a></td>
+              <td>.tvd</td>
+              <td>Contains information about each document that has term vectors</td>
+            </tr>
+            <tr>
+              <td><a href="#tvf">Term Vector Fields</a></td>
+              <td>.tvf</td>
+              <td>The field level info about term vectors</td>
+            </tr>
+            <tr>
+              <td><a href="#Deleted Documents">Deleted Documents</a></td>
+              <td>.del</td>
+              <td>Info about which documents are deleted</td>
+            </tr>
+          </table>
+
+        </p>
+      </section>
+
+        <section id="Primitive Types"><title>Primitive Types</title>
+
+            <section id="Byte"><title>Byte</title>
+
+                <p>
+                    The most primitive type
+                    is an eight-bit byte. Files are accessed as sequences of bytes. All
+                    other data types are defined as sequences
+                    of bytes, so file formats are byte-order independent.
+                </p>
+
+            </section>
+
+            <section id="UInt32"><title>UInt32</title>
+
+                <p>
+                    32-bit unsigned integers are written as four
+                    bytes, high-order bytes first.
+                </p>
+                <p>
+                    UInt32    --&gt; &lt;Byte&gt;<sup>4</sup>
+                </p>
+
+            </section>
+
+            <section id="Uint64"><title>UInt64</title>
+
+                <p>
+                    64-bit unsigned integers are written as eight
+                    bytes, high-order bytes first.
+                </p>
+
+                <p>UInt64    --&gt; &lt;Byte&gt;<sup>8</sup>
+                </p>
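+                <p>
+                    A minimal Python sketch of the UInt32 and UInt64 layouts,
+                    using the standard <code>struct</code> module with an explicit
+                    big-endian ("high-order bytes first") format. The function
+                    names are hypothetical and chosen only for this example.
+                </p>
+                <source><![CDATA[
```python
import struct


def write_uint32(value):
    """Encode a 32-bit unsigned integer as four bytes, high-order byte first."""
    return struct.pack(">I", value)


def read_uint32(data):
    """Decode four big-endian bytes into an unsigned integer."""
    return struct.unpack(">I", data)[0]


def write_uint64(value):
    """Encode a 64-bit unsigned integer as eight bytes, high-order byte first."""
    return struct.pack(">Q", value)


def read_uint64(data):
    """Decode eight big-endian bytes into an unsigned integer."""
    return struct.unpack(">Q", data)[0]


print(write_uint32(0x01020304).hex())
print(read_uint64(write_uint64(2**40 + 7)))
```
+                ]]></source>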
+
+            </section>
+
+            <section id="VInt"><title>VInt</title>
+
+                <p>
+                    A variable-length format for non-negative integers is
+                    defined where the high-order bit of each byte indicates whether more
+                    bytes remain to be read. The low-order seven bits are appended as
+                    increasingly more significant bits in the resulting integer value.
+                    Thus values from zero to 127 may be stored in a single byte, values
+                    from 128 to 16,383 may be stored in two bytes, and so on.
+                </p>
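+                <p>
+                    The encoding rule above can be sketched in Python as follows.
+                    This is an illustration under the stated rule, not Lucene's
+                    own code; the names <code>write_vint</code> and
+                    <code>read_vint</code> are hypothetical.
+                </p>
+                <source><![CDATA[
```python
def write_vint(value):
    """Encode a non-negative integer as a VInt byte sequence."""
    if value < 0:
        raise ValueError("VInt cannot represent negative values")
    out = bytearray()
    while value > 0x7F:
        # Low-order seven bits first, with the high bit set: more bytes follow.
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)  # final byte, high bit clear
    return bytes(out)


def read_vint(data, pos=0):
    """Decode a VInt starting at `pos`; return (value, position after it)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        # Each byte contributes seven increasingly significant bits.
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7


for n in (0, 127, 128, 16383):
    print(n, [format(b, "08b") for b in write_vint(n)])
```
+                ]]></source>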
+
+                <p>
+                    <b>VInt Encoding Example</b>
+                </p>
+
+                <table width="100%" border="0" cellpadding="4" cellspacing="0">
+                    <col width="64*"/>
+                    <col width="64*"/>
+                    <col width="64*"/>
+                    <col width="64*"/>
+                    <tr valign="TOP">
+                        <td width="25%">
+                            <p align="RIGHT">
+                                <b>Value</b>
+                            </p>
+                        </td>
+                        <td width="25%">
+                            <p align="RIGHT">
+                                <b>First byte</b>
+                            </p>
+                        </td>
+                        <td width="25%">
+                            <p align="RIGHT">
+                                <b>Second byte</b>
+                            </p>
+                        </td>
+                        <td width="25%">
+                            <p align="RIGHT">
+                                <b>Third byte</b>
+                            </p>
+                        </td>
+                    </tr>
+                    <tr valign="BOTTOM">
+                        <td width="25%" sdval="0" sdnum="1033;0;#,##0">
+                            <p align="RIGHT">0
+                            </p>
+                        </td>
+                        <td width="25%" sdval="0" sdnum="1033;0;00000000">
+                            <p class="western" align="RIGHT" style="margin-left: 0.11cm;
+                               margin-right: 0.01cm">
+                                00000000
+                            </p>
+                        </td>
+                        <td width="25%" sdnum="1033;0;00000000">
+                            <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
+                               0.01cm">
+                                <br/>
+
+                            </p>
+                        </td>
+                        <td width="25%" sdnum="1033;0;00000000">
+                            <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
+                               0.01cm">
+                                <br/>
+
+                            </p>
+                        </td>
+                    </tr>
+                    <tr valign="BOTTOM">
+                        <td width="25%" sdval="1" sdnum="1033;0;#,##0">
+                            <p align="RIGHT">1
+                            </p>
+                        </td>
+                        <td width="25%" sdval="1" sdnum="1033;0;00000000">
+                            <p class="western" align="RIGHT" style="margin-left: 0.11cm;
+                               margin-right: 0.01cm">
+                                00000001
+                            </p>
+                        </td>
+                        <td width="25%" sdnum="1033;0;00000000">
+                            <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
+                               0.01cm">
+                                <br/>
+
+                            </p>
+                        </td>
+                        <td width="25%" sdnum="1033;0;00000000">
+                            <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
+                               0.01cm">
+                                <br/>
+
+                            </p>
+                        </td>
+                    </tr>
+                    <tr valign="BOTTOM">
+                        <td width="25%" sdval="2" sdnum="1033;0;#,##0">
+                            <p align="RIGHT">2
+                            </p>
+                        </td>
+                        <td width="25%" sdval="10" sdnum="1033;0;00000000">
+                            <p class="western" align="RIGHT" style="margin-left: 0.11cm;
+                               margin-right: 0.01cm">
+                                00000010
+                            </p>
+                        </td>
+                        <td width="25%" sdnum="1033;0;00000000">
+                            <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
+                               0.01cm">
+                                <br/>
+
+                            </p>
+                        </td>
+                        <td width="25%" sdnum="1033;0;00000000">
+                            <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
+                               0.01cm">
+                                <br/>
+
+                            </p>
+                        </td>
+                    </tr>
+                    <tr>
+                        <td width="25%" valign="TOP">
+                            <p align="RIGHT">...
+                            </p>
+                        </td>
+                        <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
+                            <p align="RIGHT" style="margin-left: 0.11cm; margin-right:
+                               0.01cm">
+                                <br/>
+
+                            </p>
+                        </td>
+                        <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
+                            <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
+                               0.01cm">
+                                <br/>
+
+                            </p>
+                        </td>
+                        <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
+                            <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
+                               0.01cm">
+                                <br/>
+
+                            </p>
+                        </td>
+                    </tr>
+                    <tr valign="BOTTOM">
+                        <td width="25%" sdval="127" sdnum="1033;0;#,##0">
+                            <p align="RIGHT">127
+                            </p>
+                        </td>
+                        <td width="25%" sdval="1111111" sdnum="1033;0;00000000">
+                            <p class="western" align="RIGHT" style="margin-left: 0.11cm;
+                               margin-right: 0.01cm">
+                                01111111
+                            </p>
+                        </td>
+                        <td width="25%" sdnum="1033;0;00000000">
+                            <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
+                               0.01cm">
+                                <br/>
+
+                            </p>
+                        </td>
+                        <td width="25%" sdnum="1033;0;00000000">
+                            <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
+                               0.01cm">
+                                <br/>
+
+                            </p>
+                        </td>
+                    </tr>
+                    <tr valign="BOTTOM">
+                        <td width="25%" sdval="128" sdnum="1033;0;#,##0">
+                            <p align="RIGHT">128
+                            </p>
+                        </td>
+                        <td width="25%" sdval="10000000" sdnum="1033;0;00000000">
+                            <p class="western" align="RIGHT" style="margin-left: 0.11cm;
+                               margin-right: 0.01cm">
+                                10000000
+                            </p>
+                        </td>
+                        <td width="25%" sdval="1" sdnum="1033;0;00000000">
+                            <p class="western" align="RIGHT" style="margin-left: -0.07cm;
+                               margin-right: 0.01cm">
+                                00000001
+                            </p>
+                        </td>
+                        <td width="25%" sdnum="1033;0;00000000">
+                            <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
+                               0.01cm">
+                                <br/>
+
+                            </p>
+                        </td>
+                    </tr>
+                    <tr valign="BOTTOM">
+                        <td width="25%" sdval="129" sdnum="1033;0;#,##0">
+                            <p align="RIGHT">129
+                            </p>
+                        </td>
+                        <td width="25%" sdval="10000001" sdnum="1033;0;00000000">
+                            <p class="western" align="RIGHT" style="margin-left: 0.11cm;
+                               margin-right: 0.01cm">
+                                10000001
+                            </p>
+                        </td>
+                        <td width="25%" sdval="1" sdnum="1033;0;00000000">
+                            <p class="western" align="RIGHT" style="margin-left: -0.07cm;
+                               margin-right: 0.01cm">
+                                00000001
+                            </p>
+                        </td>
+                        <td width="25%" sdnum="1033;0;00000000">
+                            <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
+                               0.01cm">
+                                <br/>
+
+                            </p>
+                        </td>
+                    </tr>
+                    <tr valign="BOTTOM">
+                        <td width="25%" sdval="130" sdnum="1033;0;#,##0">
+                            <p align="RIGHT">130
+                            </p>
+                        </td>
+                        <td width="25%" sdval="10000010" sdnum="1033;0;00000000">
+                            <p class="western" align="RIGHT" style="margin-left: 0.11cm;
+                               margin-right: 0.01cm">
+                                10000010
+                            </p>
+                        </td>
+                        <td width="25%" sdval="1" sdnum="1033;0;00000000">
+                            <p class="western" align="RIGHT" style="margin-left: -0.07cm;
+                               margin-right: 0.01cm">
+                                00000001
+                            </p>
+                        </td>
+                        <td width="25%" sdnum="1033;0;00000000">
+                            <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
+                               0.01cm">
+                                <br/>
+
+                            </p>
+                        </td>
+                    </tr>
+                    <tr>
+                        <td width="25%" valign="TOP">
+                            <p align="RIGHT">...
+                            </p>
+                        </td>
+                        <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
+                            <p align="RIGHT" style="margin-left: 0.11cm; margin-right:
+                               0.01cm">
+                                <br/>
+
+                            </p>
+                        </td>
+                        <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
+                            <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
+                               0.01cm">
+                                <br/>
+
+                            </p>
+                        </td>
+                        <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
+                            <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
+                               0.01cm">
+                                <br/>
+
+                            </p>
+                        </td>
+                    </tr>
+                    <tr valign="BOTTOM">
+                        <td width="25%" sdval="16383" sdnum="1033;0;#,##0">
+                            <p align="RIGHT">16,383
+                            </p>
+                        </td>
+                        <td width="25%" sdval="11111111" sdnum="1033;0;00000000">
+                            <p class="western" align="RIGHT" style="margin-left: 0.11cm;
+                               margin-right: 0.01cm">
+                                11111111
+                            </p>
+                        </td>
+                        <td width="25%" sdval="1111111" sdnum="1033;0;00000000">
+                            <p class="western" align="RIGHT" style="margin-left: -0.07cm;
+                               margin-right: 0.01cm">
+                                01111111
+                            </p>
+                        </td>
+                        <td width="25%" sdnum="1033;0;00000000">
+                            <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
+                               0.01cm">
+                                <br/>
+
+                            </p>
+                        </td>
+                    </tr>
+                    <tr valign="BOTTOM">
+                        <td width="25%" sdval="16384" sdnum="1033;0;#,##0">
+                            <p align="RIGHT">16,384
+                            </p>
+                        </td>
+                        <td width="25%" sdval="10000000" sdnum="1033;0;00000000">
+                            <p class="western" align="RIGHT" style="margin-left: 0.11cm;
+                               margin-right: 0.01cm">
+                                10000000
+                            </p>
+                        </td>
+                        <td width="25%" sdval="10000000" sdnum="1033;0;00000000">
+                            <p class="western" align="RIGHT" style="margin-left: -0.07cm;
+                               margin-right: 0.01cm">
+                                10000000
+                            </p>
+                        </td>
+                        <td width="25%" sdval="1" sdnum="1033;0;00000000">
+                            <p class="western" align="RIGHT" style="margin-left: -0.47cm;
+                               margin-right: 0.01cm">
+                                00000001
+                            </p>
+                        </td>
+                    </tr>
+                    <tr valign="BOTTOM">
+                        <td width="25%" sdval="16385" sdnum="1033;0;#,##0">
+                            <p align="RIGHT">16,385
+                            </p>
+                        </td>
+                        <td width="25%" sdval="10000001" sdnum="1033;0;00000000">
+                            <p class="western" align="RIGHT" style="margin-left: 0.11cm;
+                               margin-right: 0.01cm">
+                                10000001
+                            </p>
+                        </td>
+                        <td width="25%" sdval="10000000" sdnum="1033;0;00000000">
+                            <p class="western" align="RIGHT" style="margin-left: -0.07cm;
+                               margin-right: 0.01cm">
+                                10000000
+                            </p>
+                        </td>
+                        <td width="25%" sdval="1" sdnum="1033;0;00000000">
+                            <p class="western" align="RIGHT" style="margin-left: -0.47cm;
+                               margin-right: 0.01cm">
+                                00000001
+                            </p>
+                        </td>
+                    </tr>
+                    <tr>
+                        <td width="25%" valign="TOP">
+                            <p align="RIGHT">...
+                            </p>
+                        </td>
+                        <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
+                            <p class="western" align="RIGHT" style="margin-left: 0.11cm;
+                               margin-right: 0.01cm">
+                                <br/>
+
+                            </p>
+                        </td>
+                        <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
+                            <p class="western" align="RIGHT" style="margin-left: -0.07cm;
+                               margin-right: 0.01cm">
+                                <br/>
+
+                            </p>
+                        </td>
+                        <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
+                            <p class="western" align="RIGHT" style="margin-left: -0.47cm;
+                               margin-right: 0.01cm">
+                                <br/>
+
+                            </p>
+                        </td>
+                    </tr>
+                </table>
+
+                <p>
+                    This provides compression while still being
+                    efficient to decode.
+                </p>
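+
+                <p>
+                    As an illustration only, the rows of the table above can be
+                    reproduced with a minimal Java sketch. The class
+                    <code>VIntSketch</code> and its methods are hypothetical
+                    names, not part of Lucene's API. The sketch assumes what the
+                    table shows: the low-order seven bits are written first, and
+                    the high bit of each byte signals that more bytes follow.
+                </p>
+<source><![CDATA[
```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, not Lucene's API: a sketch of the VInt layout shown in the table.
final class VIntSketch {

    // Encodes an int as a VInt, low-order seven bits first.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit set: more bytes follow
            value >>>= 7;
        }
        out.write(value); // final byte has the high bit clear
        return out.toByteArray();
    }

    // Decodes the VInt that starts at the first byte of the array.
    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // high bit clear: this was the last byte
            }
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        // Prints each value with its encoded bytes in binary, first byte first.
        for (int v : new int[] {130, 16383, 16384, 16385}) {
            StringBuilder bits = new StringBuilder();
            for (byte b : encode(v)) {
                String s = Integer.toBinaryString(b & 0xFF);
                bits.append("00000000".substring(s.length())).append(s).append(' ');
            }
            System.out.println(v + " -> " + bits.toString().trim());
        }
    }
}
```
+]]></source>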
+
+            </section>
+
+            <section id="Chars"><title>Chars</title>
+
+                <p>
+                    Lucene writes Unicode
+                    character sequences as UTF-8 encoded bytes.
+                </p>
+
+
+            </section>
+
+            <section id="String"><title>String</title>
+
+                <p>
+                   Lucene writes strings as UTF-8 encoded bytes.
+                    First the length, in bytes, is written as a VInt,
+                    followed by the bytes.
+                </p>
+
+                <p>
+                    String --&gt; VInt, Chars
+                </p>
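+
+                <p>
+                    The String layout described here (a VInt byte count
+                    followed by the UTF-8 bytes) might be sketched in Java as
+                    follows. <code>StringSketch</code> is a hypothetical class
+                    written for this illustration and is not Lucene's own
+                    reader or writer.
+                </p>
+<source><![CDATA[
```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Lucene's API: String --> VInt (length in bytes), Chars (UTF-8).
final class StringSketch {

    // Writes the UTF-8 byte length as a VInt, then the UTF-8 bytes.
    static byte[] write(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int len = utf8.length;
        while ((len & ~0x7F) != 0) {
            out.write((len & 0x7F) | 0x80); // more length bytes follow
            len >>>= 7;
        }
        out.write(len);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    // Reads one String that starts at the first byte of the array.
    static String read(byte[] data) {
        int pos = 0;
        int len = 0;
        int shift = 0;
        byte b;
        do {
            b = data[pos++];
            len |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return new String(data, pos, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] encoded = write("h\u00e9llo");
        // The length prefix counts bytes, not characters.
        System.out.println("prefix=" + encoded[0] + " text=" + read(encoded));
    }
}
```
+]]></source>
+
+                <p>
+                    Note that the prefix counts encoded bytes, so a string
+                    containing non-ASCII characters has a prefix larger than
+                    its character count.
+                </p>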
+
+            </section>
+        </section>
+
+        <section id="Compound Types"><title>Compound Types</title>
+            <section id="MapStringString"><title>Map&lt;String,String&gt;</title>
+
+                <p>
+                   In a couple of places Lucene stores a
+                    Map&lt;String,String&gt;.
+                </p>
+
+                <p>
+                   Map&lt;String,String&gt; --&gt; Count, &lt;String,String&gt;<sup>Count</sup>
+                </p>
+
+            </section>
+
+        </section>
+
+        <section id="Per-Index Files"><title>Per-Index Files</title>
+
+            <p>
+                The files in this section exist one-per-index.
+            </p>
+
+            <section id="Segments File"><title>Segments File</title>
+
+                <p>
+                    The active segments in the index are stored in the
+                    segment info file,
+                    <tt>segments_N</tt>.
+                    There may be one or more
+                    <tt>segments_N</tt>
+                    files in the index; however, the one with the largest
+                    generation is the active one. Older
+                    <tt>segments_N</tt>
+                    files are present only when they temporarily
+                    cannot be deleted, when a writer is in the
+                    process of committing, or when a custom
+                    <a href="api/core/org/apache/lucene/index/IndexDeletionPolicy.html">IndexDeletionPolicy</a>
+                    is in use. This file lists each
+                    segment by name, has details about the separate
+                    norms and deletion files, and also contains the
+                    size of each segment.
+                </p>
+
+                <p>
+                    As of 2.1, there is also a file
+                    <tt>segments.gen</tt>.
+                    This file contains the
+                    current generation (the
+                    <tt>_N</tt>
+                    in
+                    <tt>segments_N</tt>)
+                    of the index. This is
+                    used only as a fallback in case the current
+                    generation cannot be accurately determined by
+                    directory listing alone (as is the case for some
+                    NFS clients with time-based directory cache
+                    expiration). This file simply contains an Int32
+                    version header (SegmentInfos.FORMAT_LOCKLESS =
+                    -2), followed by the generation recorded as Int64,
+                    written twice.
+                </p>
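+
+                <p>
+                    A reader of <tt>segments.gen</tt> could be sketched as
+                    below. This is an illustration, not Lucene's
+                    implementation: the class and method names are
+                    hypothetical, the sketch assumes Int32 and Int64 are
+                    stored high-order byte first, and treating a mismatch
+                    between the two copies as "unusable" (returning -1) is
+                    an assumption of this example.
+                </p>
+<source><![CDATA[
```java
import java.nio.ByteBuffer;

// Hypothetical helper, not Lucene's API: segments.gen is Int32 header, then Int64 generation twice.
final class SegmentsGenSketch {
    static final int FORMAT_LOCKLESS = -2;

    // Builds the 20-byte file body for the given generation.
    static byte[] write(long generation) {
        return ByteBuffer.allocate(4 + 8 + 8) // ByteBuffer is big-endian by default
                .putInt(FORMAT_LOCKLESS)
                .putLong(generation)
                .putLong(generation)
                .array();
    }

    // Returns the recorded generation, or -1 if the contents look unusable.
    static long readGeneration(byte[] fileBytes) {
        if (fileBytes.length < 20) {
            return -1; // too short to hold header plus two generations
        }
        ByteBuffer in = ByteBuffer.wrap(fileBytes);
        if (in.getInt() != FORMAT_LOCKLESS) {
            return -1; // unexpected version header
        }
        long first = in.getLong();
        long second = in.getLong();
        // Assumption: the value is only trusted when both copies agree.
        return first == second ? first : -1;
    }

    public static void main(String[] args) {
        System.out.println(readGeneration(write(7L)));
    }
}
```
+]]></source>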
+                <p>
+                    <b>3.1</b>
+                    Segments --&gt; Format, Version, NameCounter, SegCount, &lt;SegVersion, SegName, SegSize, DelGen, DocStoreOffset, [DocStoreSegment, DocStoreIsCompoundFile], HasSingleNormFile, NumField,
+                    NormGen<sup>NumField</sup>,
+                    IsCompoundFile, DeletionCount, HasProx, Diagnostics, HasVectors&gt;<sup>SegCount</sup>, CommitUserData, Checksum
+                </p>
+
+                <p>
+                    Format, NameCounter, SegCount, SegSize, NumField,
+                    DocStoreOffset, DeletionCount --&gt; Int32
+                </p>
+
+               <p>
+                    Version, DelGen, NormGen, Checksum --&gt; Int64
+                </p>
+
+                <p>
+                   SegVersion, SegName, DocStoreSegment --&gt; String
+                </p>
+
+               <p>
+                  Diagnostics --&gt; Map&lt;String,String&gt;
+               </p>
+
+                <p>
+                    IsCompoundFile, HasSingleNormFile,
+                    DocStoreIsCompoundFile, HasProx, HasVectors --&gt; Int8
+                </p>
+
+               <p>
+                   CommitUserData --&gt; Map&lt;String,String&gt;
+                </p>
+
+                <p>
+                    Format is -9 (SegmentInfos.FORMAT_DIAGNOSTICS).
+                </p>
+
+                <p>
+                    Version counts how often the index has been
+                    changed by adding or deleting documents.
+                </p>
+
+                <p>
+                    NameCounter is used to generate names for new segment files.
+                </p>
+
+                <p>
+                    SegVersion is the code version that created the segment.
+                </p>
+
+                <p>
+                    SegName is the name of the segment, and is used as the file name prefix
+                    for all of the files that compose the segment's index.
+                </p>
+
+                <p>
+                    SegSize is the number of documents contained in the segment index.
+                </p>
+
+                <p>
+                    DelGen is the generation count of the separate
+                    deletes file. If this is -1, there are no
+                    separate deletes. If it is 0, this is a pre-2.1
+                    segment and you must check the filesystem for the
+                    existence of _X.del. Anything above zero means
+                    there are separate deletes (_X_N.del).
+                </p>
+
+                <p>
+                    NumField is the size of the array for NormGen, or
+                    -1 if there are no NormGens stored.
+                </p>
+
+                <p>
+                    NormGen records the generation of the separate
+                    norms files. If NumField is -1, no NormGens are
+                    stored: they are all assumed to be 0 if the
+                    segments file was written by a pre-2.1 version,
+                    and -1 if it was written by 2.1 or above.
+                    The generation then has the same meaning
+                    as DelGen (above).
+                </p>
+
+                <p>
+                    IsCompoundFile records whether the segment is
+                    written as a compound file or not. If this is -1,
+                    the segment is not a compound file. If it is 1,
+                    the segment is a compound file. Otherwise it is 0,
+                    which means the filesystem must be checked to see
+                    whether _X.cfs exists.
+                </p>
+
+                <p>
+                    If HasSingleNormFile is 1, then the field norms are
+                    written as a single joined file (with extension
+                    <tt>.nrm</tt>); if it is 0 then each field's norms
+                    are stored as separate <tt>.fN</tt> files.  See
+                    "Normalization Factors" below for details.
+                </p>
+
+                <p>
+                   DocStoreOffset, DocStoreSegment,
+                    DocStoreIsCompoundFile: If DocStoreOffset is -1,
+                    this segment has its own doc store (stored field
+                    values and term vectors) files and DocStoreSegment
+                    and DocStoreIsCompoundFile are not stored.  In
+                    this case all files for stored field values
+                    (<tt>*.fdt</tt> and <tt>*.fdx</tt>) and term
+                    vectors (<tt>*.tvf</tt>, <tt>*.tvd</tt> and
+                    <tt>*.tvx</tt>) will be stored with this segment.
+                    Otherwise, DocStoreSegment is the name of the
+                    segment that has the shared doc store files;
+                    DocStoreIsCompoundFile is 1 if that segment is
+                    stored in compound file format (as a <tt>.cfx</tt>
+                    file); and DocStoreOffset is the starting document
+                    in the shared doc store files where this segment's
+                    documents begin.  In this case, this segment does
+                    not store its own doc store files but instead
+                    shares a single set of these files with other
+                    segments.
+                </p>
+
+                <p>
+                   Checksum contains the CRC32 checksum of all bytes
+                   in the segments_N file up until the checksum.
+                   This is used to verify integrity of the file on
+                   opening the index.
+               </p>
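+
+               <p>
+                   The integrity check described above could be sketched
+                   with <code>java.util.zip.CRC32</code> as follows. The
+                   class <code>SegmentsChecksumSketch</code> is hypothetical;
+                   the sketch assumes the Checksum is the final Int64 of the
+                   file, stored high-order byte first, and covers every byte
+                   before it.
+               </p>
+<source><![CDATA[
```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical helper, not Lucene's API: checks a trailing Int64 CRC32 over the preceding bytes.
final class SegmentsChecksumSketch {

    // Appends the CRC32 of body as a big-endian Int64 (useful for building test data).
    static byte[] appendChecksum(byte[] body) {
        CRC32 crc = new CRC32();
        crc.update(body, 0, body.length);
        return ByteBuffer.allocate(body.length + 8)
                .put(body)
                .putLong(crc.getValue())
                .array();
    }

    // Returns true if the last eight bytes equal the CRC32 of everything before them.
    static boolean verify(byte[] fileBytes) {
        if (fileBytes.length < 8) {
            return false; // no room for a checksum
        }
        int bodyLength = fileBytes.length - 8;
        CRC32 crc = new CRC32();
        crc.update(fileBytes, 0, bodyLength);
        long stored = ByteBuffer.wrap(fileBytes, bodyLength, 8).getLong();
        return stored == crc.getValue();
    }

    public static void main(String[] args) {
        byte[] file = appendChecksum(new byte[] {1, 2, 3, 4});
        System.out.println("intact: " + verify(file));
        file[0] ^= 1; // corrupt one bit
        System.out.println("corrupted: " + verify(file));
    }
}
```
+]]></source>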
+
+               <p>
+                   DeletionCount records the number of deleted
+                   documents in this segment.
+               </p>
+
+               <p>
+                   HasProx is 1 if any fields in this segment have
+                   position data (IndexOptions.DOCS_AND_FREQS_AND_POSITIONS); else, it's 0.
+               </p>
+
+               <p>
+                   CommitUserData stores an optional user-supplied
+                   opaque Map&lt;String,String&gt; that was passed to
+                   IndexWriter's commit or prepareCommit, or
+                   IndexReader's flush methods.
+                </p>
+               <p>
+                   The Diagnostics Map is privately written by
+                   IndexWriter, as a debugging aid, for each segment
+                   it creates.  It includes metadata like the current
+                   Lucene version, OS, Java version, why the segment
+                   was created (merge, flush, addIndexes), etc.
+                </p>
+         
+                <p>
+                    HasVectors is 1 if this segment stores term vectors;
+                    else, it's 0.
+                </p>
+
+            </section>
+
+            <section id="Lock File"><title>Lock File</title>
+
+                <p>
+                    The write lock, which is stored in the index
+                    directory by default, is named "write.lock".  If
+                    the lock directory is different from the index
+                    directory then the write lock will be named
+                    "XXXX-write.lock" where XXXX is a unique prefix
+                    derived from the full path to the index directory.
+                    When this file is present, a writer is currently
+                    modifying the index (adding or removing
+                    documents).  This lock file ensures that only one
+                    writer is modifying the index at a time.
+                </p>
+            </section>
+
+            <section id="Deletable File"><title>Deletable File</title>
+
+                <p>
+                    No deletable file is written. Instead, a writer
+                    dynamically computes the files that are
+                    deletable.
+                </p>
+
+            </section>
+
+            <section id="Compound Files"><title>Compound Files</title>
+
+                <p>Starting with Lucene 1.4 the compound file format became the default. This
+                    is simply a container for all files described in the next section
+                    (except for the .del file).</p>
+
+                <p>Compound Entry Table (.cfe) --&gt; Version, FileCount, &lt;FileName, DataOffset, DataLength&gt;
+                    <sup>FileCount</sup>
+                </p>
+
+                <p>Compound (.cfs) --&gt; FileData <sup>FileCount</sup>
+                </p>
+
+                <p>Version --&gt; Int</p>
+
+                <p>FileCount --&gt; VInt</p>
+
+                <p>DataOffset --&gt; Long</p>
+                
+                <p>DataLength --&gt; Long</p>
+
+                <p>FileName --&gt; String</p>
+
+                <p>FileData --&gt; raw file data</p>
+                <p>The raw file data is the data from the individual files named above.</p>
+
+               <p>Starting with Lucene 2.3, doc store files (stored
+               field values and term vectors) can be shared in a
+               single set of files for more than one segment.  When
+               the compound file format is enabled, these shared files will be
+               added into a single compound file (same format as
+               above) but with the extension <tt>.cfx</tt>.
+               </p>
+
+            </section>
+
+        </section>
+
+        <section id="Per-Segment Files"><title>Per-Segment Files</title>
+
+            <p>
+                The remaining files are all per-segment, and are
+                thus defined by suffix.
+            </p>
+            <section id="Fields"><title>Fields</title>
+                <p>
+                    <br/>
+                    <b>Field Info</b>
+                    <br/>
+                </p>
+
+                <p>
+                    Field names are
+                    stored in the field info file, with suffix .fnm.
+                </p>
+                <p>
+                    FieldInfos
+                    (.fnm) --&gt; FNMVersion, FieldsCount, &lt;FieldName,
+                    FieldBits&gt;
+                    <sup>FieldsCount</sup>
+                </p>
+
+                <p>
+                    FNMVersion, FieldsCount --&gt; VInt
+                </p>
+
+                <p>
+                    FieldName --&gt; String
+                </p>
+
+                <p>
+                    FieldBits --&gt; Byte
+                </p>
+
+                <p>
+                    <ul>
+                        <li>
+                            The low-order bit is one for
+                            indexed fields, and zero for non-indexed fields.
+                        </li>
+                        <li>
+                            The second lowest-order
+                            bit is one for fields that have term vectors stored, and zero for fields
+                            without term vectors.
+                        </li>
+                        <li>If the third lowest-order bit is set (0x04), term positions are stored with the term vectors.</li>
+                        <li>If the fourth lowest-order bit is set (0x08), term offsets are stored with the term vectors.</li>
+                        <li>If the fifth lowest-order bit is set (0x10), norms are omitted for the indexed field.</li>
+                        <li>If the sixth lowest-order bit is set (0x20), payloads are stored for the indexed field.</li>
+                        <li>If the seventh lowest-order bit is set (0x40), term frequencies and positions are omitted for the indexed field.</li>
+                        <li>If the eighth lowest-order bit is set (0x80), positions are omitted for the indexed field.</li>
+                    </ul>
+                </p>
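+
+                <p>
+                    To make the bit layout concrete, the following Java
+                    sketch names each flag of a FieldBits byte using the
+                    masks from the list above. The class and constant names
+                    are illustrative assumptions and are not claimed to
+                    match Lucene's own identifiers.
+                </p>
+<source><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Lucene's API: names each flag set in a FieldBits byte.
final class FieldBitsSketch {
    // Mask values come from the list above; the constant names are illustrative.
    static final int INDEXED = 0x01;
    static final int TERM_VECTORS = 0x02;
    static final int TV_POSITIONS = 0x04;
    static final int TV_OFFSETS = 0x08;
    static final int OMIT_NORMS = 0x10;
    static final int PAYLOADS = 0x20;
    static final int OMIT_TF_AND_POSITIONS = 0x40;
    static final int OMIT_POSITIONS = 0x80;

    // Returns the names of all flags set in the byte, lowest bit first.
    static List<String> describe(byte fieldBits) {
        int bits = fieldBits & 0xFF; // treat the byte as unsigned
        List<String> flags = new ArrayList<String>();
        if ((bits & INDEXED) != 0) flags.add("indexed");
        if ((bits & TERM_VECTORS) != 0) flags.add("termVectors");
        if ((bits & TV_POSITIONS) != 0) flags.add("termVectorPositions");
        if ((bits & TV_OFFSETS) != 0) flags.add("termVectorOffsets");
        if ((bits & OMIT_NORMS) != 0) flags.add("omitNorms");
        if ((bits & PAYLOADS) != 0) flags.add("payloads");
        if ((bits & OMIT_TF_AND_POSITIONS) != 0) flags.add("omitTermFreqAndPositions");
        if ((bits & OMIT_POSITIONS) != 0) flags.add("omitPositions");
        return flags;
    }

    public static void main(String[] args) {
        // 0x13 = indexed + term vectors + norms omitted
        System.out.println(describe((byte) 0x13));
    }
}
```
+]]></source>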
+
+               <p>
+                  FNMVersion (added in 2.9) is -2 for indexes from 2.9 through 3.3. It is -3 for indexes from Lucene 3.4 onward.
+               </p>
+
+                <p>
+                    Fields are numbered by their order in this file. Thus field zero is
+                    the
+                    first field in the file, field one the next, and so on. Note that,
+                    like document numbers, field numbers are segment relative.
+                </p>
+
+
+
+                <p>
+                    <br/>
+                    <b>Stored Fields</b>
+                    <br/>
+                </p>
+
+                <p>
+                    Stored fields are represented by two files:
+                </p>
+
+                <ol>
+                    <li><a name="field_index"/>
+                        <p>
+                            The field index, or .fdx file.
+                        </p>
+
+                        <p>
+                            This contains, for each document, a pointer to
+                            its field data, as follows:
+                        </p>
+
+                        <p>
+                            FieldIndex
+                            (.fdx) --&gt;
+                            &lt;FieldValuesPosition&gt;
+                            <sup>SegSize</sup>
+                        </p>
+                        <p>FieldValuesPosition
+                            --&gt; Uint64
+                        </p>
+                        <p>This
+                            is used to find the location within the field data file of the
+                            fields of a particular document. Because it contains fixed-length
+                            data, this file may be easily randomly accessed. The position of
+                            document <i>n</i>'s field data is the Uint64 at
+                            <i>n*8</i>
+                            in this file.
+                        </p>
+                    </li>
+                    <li>
+                        <p><a name="field_data"/>
+                            The field data, or .fdt file.
+
+                        </p>
+
+                        <p>
+                            This contains the stored fields of each document,
+                            as follows:
+                        </p>
+
+                        <p>
+                            FieldData (.fdt) --&gt;
+                            &lt;DocFieldData&gt;
+                            <sup>SegSize</sup>
+                        </p>
+                        <p>DocFieldData --&gt;
+                            FieldCount, &lt;FieldNum, Bits, Value&gt;
+                            <sup>FieldCount</sup>
+                        </p>
+                        <p>FieldCount --&gt;
+                            VInt
+                        </p>
+                        <p>FieldNum --&gt;
+                            VInt
+                        </p>
+                        <p>Bits --&gt;
+                            Byte
+                        </p>
+                        <p>
+                            <ul>
+                                <li>low order bit is one for tokenized fields</li>
+                                <li>second bit is one for fields containing binary data</li>
+                                <li>third bit is one for fields with compression option enabled
+                                    (if compression is enabled, the algorithm used is ZLIB),
+                                    only available for indexes until Lucene version 2.9.x</li>
+                                <li>4th to 6th bits (mask: 0x7&lt;&lt;3) define the type of a
+                                numeric field: <ul>
+                                  <li>all bits in the mask are cleared if the field is not numeric</li>
+                                  <li>1&lt;&lt;3: Value is Int</li>
+                                  <li>2&lt;&lt;3: Value is Long</li>
+                                  <li>3&lt;&lt;3: Value is Int as Float (as of Float.intBitsToFloat)</li>
+                                  <li>4&lt;&lt;3: Value is Long as Double (as of Double.longBitsToDouble)</li>
+                                </ul></li>
+                            </ul>
+                        </p>
+                        <p>Value --&gt;
+                            String | BinaryValue | Int | Long (depending on Bits)
+                        </p>
+                        <p>BinaryValue --&gt;
+                            ValueSize, &lt;Byte&gt;<sup>ValueSize</sup>
+                        </p>
+                        <p>ValueSize --&gt;
+                            VInt
+                        </p>
+
+                    </li>
+                </ol>
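+
+                <p>
+                    As a sketch of how the Bits byte of the field data file
+                    selects the form of Value, the Java below maps a Bits
+                    byte to one of the alternatives listed above. The class
+                    name, constant names and returned labels are
+                    hypothetical; only the mask values and the
+                    <code>Float.intBitsToFloat</code> /
+                    <code>Double.longBitsToDouble</code> conversions come
+                    from the description above.
+                </p>
+<source><![CDATA[
```java
// Hypothetical helper, not Lucene's API: interprets the Bits byte of a stored field.
final class StoredFieldBitsSketch {
    static final int TOKENIZED = 0x1;
    static final int BINARY = 0x2;
    static final int COMPRESSED = 0x4;        // only in indexes up to 2.9.x
    static final int NUMERIC_MASK = 0x7 << 3; // 4th to 6th bits

    static boolean isTokenized(byte bitsByte) {
        return (bitsByte & TOKENIZED) != 0;
    }

    // Returns a label for the on-disk form of Value selected by the Bits byte.
    static String valueType(byte bitsByte) {
        int bits = bitsByte & 0xFF;
        switch ((bits & NUMERIC_MASK) >>> 3) {
            case 0:
                // Not numeric: the binary flag decides between BinaryValue and String.
                return (bits & BINARY) != 0 ? "BinaryValue" : "String";
            case 1:
                return "Int";
            case 2:
                return "Long";
            case 3:
                return "Float (stored as Int bits)";
            case 4:
                return "Double (stored as Long bits)";
            default:
                return "unknown";
        }
    }

    // Recovers the numeric value from the stored Int or Long bit pattern.
    static float floatValue(int storedInt) {
        return Float.intBitsToFloat(storedInt);
    }

    static double doubleValue(long storedLong) {
        return Double.longBitsToDouble(storedLong);
    }

    public static void main(String[] args) {
        System.out.println(valueType((byte) (3 << 3)));
        System.out.println(floatValue(Float.floatToIntBits(1.5f)));
    }
}
```
+]]></source>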
+
+            </section>
+            <section id="Term Dictionary"><title>Term Dictionary</title>
+
+                <p>
+                    The term dictionary is represented as two files:
+                </p>
+                <ol>
+                    <li><a name="tis"/>
+                        <p>
+                            The term infos, or .tis file.
+                        </p>
+
+                        <p>
+                            TermInfoFile (.tis) --&gt;
+                            TIVersion, TermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermInfos
+                        </p>
+                        <p>TIVersion --&gt;
+                            UInt32
+                        </p>
+                        <p>TermCount --&gt;
+                            UInt64
+                        </p>
+                        <p>IndexInterval --&gt;
+                            UInt32
+                        </p>
+                        <p>SkipInterval --&gt;
+                            UInt32
+                        </p>
+                        <p>MaxSkipLevels --&gt;
+                            UInt32
+                        </p>
+                        <p>TermInfos --&gt;
+                            &lt;TermInfo&gt;
+                            <sup>TermCount</sup>
+                        </p>
+                        <p>TermInfo --&gt;
+                            &lt;Term, DocFreq, FreqDelta, ProxDelta, SkipDelta&gt;
+                        </p>
+                        <p>Term --&gt;
+                            &lt;PrefixLength, Suffix, FieldNum&gt;
+                        </p>
+                        <p>Suffix --&gt;
+                            String
+                        </p>
+                        <p>PrefixLength,
+                            DocFreq, FreqDelta, ProxDelta, SkipDelta
+                            <br/>
+                            --&gt; VInt
+                        </p>
+                        <p>
+                           This file is sorted by Term. Terms are
+                            ordered first lexicographically (by UTF-16
+                            character code) by the term's field name,
+                            and within that lexicographically (by
+                            UTF-16 character code) by the term's text.
+                        </p>
+                        <p>TIVersion names the version of the format
+                            of this file and is equal to TermInfosWriter.FORMAT_CURRENT.
+                        </p>
+                        <p>Term
+                            text prefixes are shared. The PrefixLength is the number of initial
+                            characters from the previous term which must be prepended to a
+                            term's suffix in order to form the term's text. Thus, if the
+                            previous term's text was "bone" and the term is "boy",
+                            the PrefixLength is two and the suffix is "y".
+                        </p>
+                        <p>FieldNum
+                            determines the term's field, whose name is stored in the .fnm file.
+                        </p>
+                        <p>DocFreq
+                            is the count of documents which contain the term.
+                        </p>
+                        <p>FreqDelta
+                            determines the position of this term's TermFreqs within the .frq
+                            file. In particular, it is the difference between the position of
+                            this term's data in that file and the position of the previous
+                            term's data (or zero, for the first term in the file).
+                        </p>
+                        <p>ProxDelta
+                            determines the position of this term's TermPositions within the .prx
+                            file. In particular, it is the difference between the position of
+                            this term's data in that file and the position of the previous
+                            term's data (or zero, for the first term in the file).  For fields
+                            that omit position data, this will be 0 since
+                            prox information is not stored.
+                        </p>
+                        <p>SkipDelta determines the position of this
+                            term's SkipData within the .frq file. In
+                            particular, it is the number of bytes
+                            from the start of this term's TermFreqs
+                            to the start of its SkipData.
+                            In other words, it is the length of the
+                            TermFreq data. SkipDelta is only stored
+                            if DocFreq is not smaller than SkipInterval.
+                        </p>
+                    </li>
+                    <li>
+                        <p><a name="tii"/>
+                            The term info index, or .tii file.
+                        </p>
+
+                        <p>
+                            This contains every IndexInterval<sup>th</sup>
+                            entry from the .tis
+                            file, along with its location in the .tis file. This is
+                            designed to be read entirely into memory and used to provide random
+                            access to the .tis file.
+                        </p>
+
+                        <p>
+                            The structure of this file is very similar to the
+                            .tis file, with the addition of one item per record, the IndexDelta.
+                        </p>
+
+                        <p>
+                            TermInfoIndex (.tii)--&gt;
+                            TIVersion, IndexTermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermIndices
+                        </p>
+                        <p>TIVersion --&gt;
+                            UInt32
+                        </p>
+                        <p>IndexTermCount --&gt;
+                            UInt64
+                        </p>
+                        <p>IndexInterval --&gt;
+                            UInt32
+                        </p>
+                        <p>SkipInterval --&gt;
+                            UInt32
+                        </p>
+                        <p>MaxSkipLevels --&gt;
+                            UInt32
+                        </p>
+                        <p>TermIndices --&gt;
+                            &lt;TermInfo, IndexDelta&gt;
+                            <sup>IndexTermCount</sup>
+                        </p>
+                        <p>IndexDelta --&gt;
+                            VLong
+                        </p>
+                        <p>IndexDelta
+                            determines the position of this term's TermInfo within the .tis file. In
+                            particular, it is the difference between the position of this term's
+                            entry in that file and the position of the previous term's entry.
+                        </p>
+                        <p>SkipInterval is the fraction of TermDocs stored in skip tables. It is used to accelerate TermDocs.skipTo(int).
+                            Larger values result in smaller indexes, greater acceleration, but fewer accelerable cases, while
+                            smaller values result in bigger indexes, less acceleration (in case of a small value for MaxSkipLevels) and more
+                            accelerable cases.</p>
+                        <p>MaxSkipLevels is the maximum number of skip levels stored for each term in the .frq file. A low value results in
+                           smaller indexes but less acceleration; a larger value results in slightly larger indexes but greater acceleration.
+                           See format of .frq file for more information about skip levels.</p>
+                    </li>
+                </ol>
+            </section>
+
+            <section id="Frequencies"><title>Frequencies</title>
+
+                <p>
+                    The .frq file contains the lists of documents
+                    which contain each term, along with the frequency of the term in that
+                    document (except when frequencies are omitted: IndexOptions.DOCS_ONLY).
+                </p>
+                <p>FreqFile (.frq) --&gt;
+                    &lt;TermFreqs, SkipData&gt;
+                    <sup>TermCount</sup>
+                </p>
+                <p>TermFreqs --&gt;
+                    &lt;TermFreq&gt;
+                    <sup>DocFreq</sup>
+                </p>
+                <p>TermFreq --&gt;
+                    DocDelta[, Freq?]
+                </p>
+                <p>SkipData --&gt;
+                    &lt;&lt;SkipLevelLength, SkipLevel&gt;
+                    <sup>NumSkipLevels-1</sup>, SkipLevel&gt;
+                    &lt;SkipDatum&gt;
+                </p>
+                <p>SkipLevel --&gt;
+                    &lt;SkipDatum&gt;
+                    <sup>DocFreq/(SkipInterval^(Level + 1))</sup>
+                </p>
+                <p>SkipDatum --&gt;
+                    DocSkip,PayloadLength?,FreqSkip,ProxSkip,SkipChildLevelPointer?
+                </p>
+                <p>DocDelta,Freq,DocSkip,PayloadLength,FreqSkip,ProxSkip --&gt;
+                    VInt
+                </p>
+                <p>SkipChildLevelPointer --&gt;
+                    VLong
+                </p>
+                <p>TermFreqs
+                    are ordered by term (the term is implicit, from the .tis file).
+                </p>
+                <p>TermFreq
+                    entries are ordered by increasing document number.
+                </p>
+                <p>DocDelta: if frequencies are indexed, this determines both
+                    the document number and the frequency. In
+                    particular, DocDelta/2 is the difference between
+                    this document number and the previous document
+                    number (or zero when this is the first document in
+                    a TermFreqs). When DocDelta is odd, the frequency
+                    is one. When DocDelta is even, the frequency is
+                    read as another VInt.  If frequencies are omitted, DocDelta
+                    contains the gap (not multiplied by 2) between
+                    document numbers and no frequency information is
+                    stored.
+                </p>
+                <p>For example, the TermFreqs for a term which occurs
+                    once in document seven and three times in document
+                    eleven, with frequencies indexed, would be the following
+                    sequence of VInts:
+                </p>
+                <p>15, 8, 3
+                </p>
+               <p> If frequencies were omitted (IndexOptions.DOCS_ONLY) it would be this sequence
+               of VInts instead:
+                 </p>
+                <p>
+                  7,4
+                 </p>
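+                <p>The DocDelta rules above can be sketched in a few lines of Python. This is a minimal
+                    illustration, not Lucene code: the function name <code>encode_term_freqs</code> and its
+                    arguments are hypothetical, and it returns the VInt values rather than their byte encoding.
+                </p>

```python
def encode_term_freqs(postings, with_freqs=True):
    """Return the VInt values of one term's TermFreqs.

    postings: (doc_number, freq) pairs, ordered by increasing doc_number.
    """
    values = []
    previous_doc = 0  # the first gap is measured from zero
    for doc, freq in postings:
        gap = doc - previous_doc
        previous_doc = doc
        if not with_freqs:
            values.append(gap)          # plain gap, no frequency stored
        elif freq == 1:
            values.append(gap * 2 + 1)  # odd DocDelta: frequency is one
        else:
            values.append(gap * 2)      # even DocDelta: Freq follows
            values.append(freq)
    return values


print(encode_term_freqs([(7, 1), (11, 3)]))                    # [15, 8, 3]
print(encode_term_freqs([(7, 1), (11, 3)], with_freqs=False))  # [7, 4]
```

+                <p>The two printed sequences match the two examples given above.
+                </p>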
+                <p>DocSkip records the document number before every
+                    SkipInterval
+                    <sup>th</sup>
+                    document in TermFreqs.
+                    If payloads are disabled for the term's field,
+                    then DocSkip represents the difference from the
+                    previous value in the sequence.
+                    If payloads are enabled for the term's field, 
+                    then DocSkip/2 represents the difference from the
+                    previous value in the sequence. If payloads are enabled
+                    and DocSkip is odd,
+                    then PayloadLength is stored indicating the length 
+                    of the last payload before the SkipInterval<sup>th</sup>
+                    document in TermPositions.
+                    FreqSkip and ProxSkip record the position of every
+                    SkipInterval
+                    <sup>th</sup>
+                    entry in FreqFile and
+                    ProxFile, respectively. File positions are
+                    relative to the start of TermFreqs and Positions for the
+                    first SkipDatum, and to the previous SkipDatum in the sequence thereafter.
+                </p>
+                <p>For example, if DocFreq=35 and SkipInterval=16,
+                    then there are two SkipData entries, containing
+                    the 15
+                    <sup>th</sup>
+                    and 31
+                    <sup>st</sup>
+                    document
+                    numbers in TermFreqs. The first FreqSkip names
+                    the number of bytes after the beginning of
+                    TermFreqs that the 16
+                    <sup>th</sup>
+                    SkipDatum
+                    starts, and the second the number of bytes after
+                    that that the 32
+                    <sup>nd</sup>
+                    starts. The first
+                    ProxSkip names the number of bytes after the
+                    beginning of Positions that the 16
+                    <sup>th</sup>
+                    SkipDatum starts, and the second the number of
+                    bytes after that that the 32
+                    <sup>nd</sup>
+                    starts.
+                </p>
+                <p>Each term can have multiple skip levels.
+                   The number of skip levels for a term is NumSkipLevels = Min(MaxSkipLevels, floor(log(DocFreq)/log(SkipInterval))).
+                   The number of SkipData entries for a skip level is DocFreq/(SkipInterval^(Level + 1)), rounded down, where the lowest skip
+                   level is Level=0. <br></br>
+                   Example: SkipInterval = 4, MaxSkipLevels = 2, DocFreq = 35. Then skip level 0 has 8 SkipData entries,
+                   containing the 3<sup>rd</sup>, 7<sup>th</sup>, 11<sup>th</sup>, 15<sup>th</sup>, 19<sup>th</sup>, 23<sup>rd</sup>,
+                   27<sup>th</sup>, and 31<sup>st</sup> document numbers in TermFreqs. Skip level 1 has 2 SkipData entries, containing the
+                   15<sup>th</sup> and 31<sup>st</sup> document numbers in TermFreqs. <br></br>
+                   The SkipData entries on all upper levels &gt; 0 contain a SkipChildLevelPointer referencing the corresponding SkipData
+                   entry on the level below (Level-1). In the example, entry 15 on level 1 has a pointer to entry 15 on level 0, and entry 31 on level 1 has a pointer
+                   to entry 31 on level 0.
+                </p>
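+                <p>The skip-level arithmetic can be checked with a short Python sketch. The function names are
+                    hypothetical, SkipInterval is assumed to be at least 2, and the logarithm is replaced by repeated
+                    integer division so that exact powers of SkipInterval are not affected by floating-point rounding.
+                </p>

```python
def num_skip_levels(doc_freq, skip_interval, max_skip_levels):
    """Integer form of Min(MaxSkipLevels, floor(log(DocFreq)/log(SkipInterval))).

    Assumes skip_interval is 2 or more; otherwise the loop would not end.
    """
    levels = 0
    remaining = doc_freq // skip_interval
    while remaining >= 1:
        levels += 1
        remaining //= skip_interval
    return min(max_skip_levels, levels)


def skip_entries(doc_freq, skip_interval, level):
    """Ordinals (as labelled in the example) of the documents on one level."""
    step = skip_interval ** (level + 1)
    count = doc_freq // step  # DocFreq/(SkipInterval^(Level + 1)), rounded down
    return [k * step - 1 for k in range(1, count + 1)]


print(num_skip_levels(35, 4, 2))  # 2
print(skip_entries(35, 4, 0))     # [3, 7, 11, 15, 19, 23, 27, 31]
print(skip_entries(35, 4, 1))     # [15, 31]
```

+                <p>Every entry of level 1 also appears on level 0, which is what lets a SkipChildLevelPointer
+                    refer to the matching entry one level down.
+                </p>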
+
+            </section>
+            <section id="Positions"><title>Positions</title>
+
+                <p>
+                    The .prx file contains the lists of positions that
+                    each term occurs at within documents.  Note that
+                    fields omitting positional data do not store
+                    anything into this file, and if all fields in the
+                    index omit positional data then the .prx file will not
+                    exist.
+                </p>
+                <p>ProxFile (.prx) --&gt;
+                    &lt;TermPositions&gt;
+                    <sup>TermCount</sup>
+                </p>
+                <p>TermPositions --&gt;
+                    &lt;Positions&gt;
+                    <sup>DocFreq</sup>
+                </p>
+                <p>Positions --&gt;
+                    &lt;PositionDelta,Payload?&gt;
+                    <sup>Freq</sup>
+                </p>
+                <p>Payload --&gt;
+                    &lt;PayloadLength?,PayloadData&gt;
+                </p>
+                <p>PositionDelta --&gt;
+                    VInt
+                </p>
+                <p>PayloadLength --&gt;
+                    VInt
+                </p>
+                <p>PayloadData --&gt;
+                    byte<sup>PayloadLength</sup>
+                </p>
+                <p>TermPositions
+                    are ordered by term (the term is implicit, from the .tis file).
+                </p>
+                <p>Positions
+                    entries are ordered by increasing document number (the document
+                    number is implicit from the .frq file).
+                </p>
+                <p>PositionDelta
+                    is, if payloads are disabled for the term's field, the difference 
+                    between the position of the current occurrence in
+                    the document and the previous occurrence (or zero, if this is the
+                    first occurrence in this document).
+                    If payloads are enabled for the term's field, then PositionDelta/2
+                    is the difference between the current and the previous position. If
+                    payloads are enabled and PositionDelta is odd, then PayloadLength is 
+                    stored, indicating the length of the payload at the current term position.
+                </p>
+                <p>
+                    For example, the TermPositions for a
+                    term which occurs as the fourth term in one document, and as the
+                    fifth and ninth term in a subsequent document, would be the following
+                    sequence of VInts (payloads disabled):
+                </p>
+                <p>4,
+                    5, 4
+                </p>
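+                <p>With payloads disabled, the delta encoding of positions reduces to the following Python
+                    sketch. The name <code>encode_positions</code> is hypothetical, and positions are taken
+                    exactly as numbered in the example above.
+                </p>

```python
def encode_positions(docs_positions):
    """Return the PositionDelta VInt values for one term, payloads disabled.

    docs_positions: one list of increasing positions per document,
    in the same document order as the .frq file.
    """
    values = []
    for positions in docs_positions:
        previous = 0  # deltas restart from zero in every document
        for position in positions:
            values.append(position - previous)
            previous = position
    return values


print(encode_positions([[4], [5, 9]]))  # [4, 5, 4]
```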
+                <p>PayloadData
+                    is metadata associated with the current term position. If PayloadLength
+                    is stored at the current position, then it indicates the length of this 
+                    Payload. If PayloadLength is not stored, then this Payload has the same
+                    length as the Payload at the previous position.
+                </p>
+            </section>
+            <section id="Normalization Factors"><title>Normalization Factors</title>
+
+                                       <p>There's a single .nrm file containing all norms:
+                </p>
+                <p>AllNorms
+                    (.nrm) --&gt; NormsHeader,&lt;Norms&gt;
+                    <sup>NumFieldsWithNorms</sup>
+                </p>
+                <p>Norms
+                    --&gt; &lt;Byte&gt;
+                    <sup>SegSize</sup>
+                </p>
+                <p>NormsHeader
+                    --&gt; 'N','R','M',Version
+                </p>
+                <p>Version
+                    --&gt; Byte
+                </p>
+                <p>NormsHeader
+                    has 4 bytes, the last of which is the format version for this file, currently -1.
+                </p>
+                <p>Each
+                    byte encodes a floating point value. Bits 0-2 contain the 3-bit
+                    mantissa, and bits 3-7 contain the 5-bit exponent.
+                </p>
+                <p>These
+                    are converted to an IEEE single float value as follows:
+                </p>
+                <ol>
+                    <li>
+                        <p>If
+                            the byte is zero, use a zero float.
+                        </p>
+                    </li>
+                    <li>
+                        <p>Otherwise,
+                            set the sign bit of the float to zero;
+                        </p>
+                    </li>
+                    <li>
+                        <p>add
+                            48 to the exponent and use this as the float's exponent;
+                        </p>
+                    </li>
+                    <li>
+                        <p>map
+                            the mantissa to the high-order 3 bits of the float's mantissa; and
+
+                        </p>
+                    </li>
+                    <li>
+                        <p>set
+                            the low-order 21 bits of the float's mantissa to zero.
+                        </p>
+                    </li>
+                </ol>
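+                <p>The conversion can be sketched in Python as follows. The exact bit placement is an
+                    assumption of this sketch rather than something the steps above spell out: the biased
+                    exponent (exponent + 48) is placed at bit 24 of the 32-bit pattern and the 3-bit mantissa
+                    at bit 21, so that all lower bits are zero. The name <code>decode_norm</code> is hypothetical,
+                    and the input is assumed to be an unsigned byte value in the range 0-255.
+                </p>

```python
import struct


def decode_norm(norm_byte):
    """Decode one norm byte (0-255) into a Python float."""
    if norm_byte % 256 == 0:
        return 0.0                      # a zero byte means a zero float
    mantissa = norm_byte % 8            # bits 0-2
    exponent = (norm_byte // 8) % 32    # bits 3-7
    # Assumed layout: sign bit zero, exponent + 48 at bit 24, mantissa at bit 21.
    bits = (exponent + 48) * 2**24 + mantissa * 2**21
    # Reinterpret the 32-bit pattern as an IEEE single float.
    return struct.unpack(">f", struct.pack(">I", bits))[0]


print(decode_norm(0x7C))  # 1.0
print(decode_norm(0x80))  # 2.0
```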
+                <p>A separate norm file is created when the norm values of an existing segment are modified. 
+                                       When field <em>N</em> is modified, a separate norm file <em>.sN</em> 
+                                       is created, to maintain the norm values for that field.
+                </p>
+                <p>Separate norm files are created (when needed) for both compound and non-compound segments.
+                </p>
+
+            </section>
+            <section id="Term Vectors"><title>Term Vectors</title>
+                <p>
+                 Term Vector support is optional on a field-by-field
+                  basis. It consists of 3 files.
+                </p>
+                <ol>
+                    <li><a name="tvx"/>
+                        <p>The Document Index or .tvx file.</p>
+                        <p>For each document, this stores the offset
+                           into the document data (.tvd) and field
+                           data (.tvf) files.
+                        </p>
+                        <p>DocumentIndex (.tvx) --&gt; TVXVersion&lt;DocumentPosition,FieldPosition&gt;
+                            <sup>NumDocs</sup>
+                        </p>
+                        <p>TVXVersion --&gt; Int (TermVectorsReader.FORMAT_CURRENT)</p>
+                        <p>DocumentPosition --&gt; UInt64 (offset in
+                        the .tvd file)</p>
+                        <p>FieldPosition --&gt; UInt64 (offset in the
+                        .tvf file)</p>
+                    </li>
+                    <li><a name="tvd"/>
+                        <p>The Document or .tvd file.</p>
+                        <p>This contains, for each document, the number of fields, a list of the fields with
+                            term vector info and finally a list of pointers to the field information in the .tvf
+                            (Term Vector Fields) file.</p>
+                        <p>
+                            Document (.tvd) --&gt; TVDVersion&lt;NumFields, FieldNums, FieldPositions&gt;
+                            <sup>NumDocs</sup>
+                        </p>
+                        <p>TVDVersion --&gt; Int (TermVectorsReader.FORMAT_CURRENT)</p>
+                        <p>NumFields --&gt; VInt</p>
+                        <p>FieldNums --&gt; &lt;FieldNumDelta&gt;
+                            <sup>NumFields</sup>
+                        </p>
+                        <p>FieldNumDelta --&gt; VInt</p>
+                        <p>FieldPositions --&gt; &lt;FieldPositionDelta&gt;
+                            <sup>NumFields-1</sup>
+                        </p>
+                        <p>FieldPositionDelta --&gt; VLong</p>
+                        <p>The .tvd file is used to map out the fields that have term vectors stored and
+                            where the field information is in the .tvf file.</p>
+                    </li>
+                    <li><a name="tvf"/>
+                        <p>The Field or .tvf file.</p>
+                        <p>This file contains, for each field that has a term vector stored, a list of
+                            the terms, their frequencies and, optionally, position and offset information.</p>
+                        <p>Field (.tvf) --&gt; TVFVersion&lt;NumTerms, Position/Offset, TermFreqs&gt;
+                            <sup>NumFields</sup>
+                        </p>
+                        <p>TVFVersion --&gt; Int (TermVectorsReader.FORMAT_CURRENT)</p>
+                        <p>NumTerms --&gt; VInt</p>
+                        <p>Position/Offset --&gt; Byte</p>
+                        <p>TermFreqs --&gt; &lt;TermText, TermFreq, Positions?, Offsets?&gt;
+                            <sup>NumTerms</sup>
+                        </p>
+                        <p>TermText --&gt; &lt;PrefixLength, Suffix&gt;</p>
+                        <p>PrefixLength --&gt; VInt</p>
+                        <p>Suffix --&gt; String</p>
+                        <p>TermFreq --&gt; VInt</p>
+                        <p>Positions --&gt; &lt;VInt&gt;<sup>TermFreq</sup></p>
+                        <p>Offsets --&gt; &lt;VInt, VInt&gt;<sup>TermFreq</sup></p>
+                        <br/>
+                        <p>Notes:</p>
+                        <ul>
+                            <li>Position/Offset byte stores whether this term vector has position or offset information stored.</li>
+                            <li>Term
+                                text prefixes are shared. The PrefixLength is the number of initial
+                                characters from the previous term which must be pre-pended to a
+                                term's suffix in order to form the term's text. Thus, if the
+                                previous term's text was "bone" and the term is "boy",
+                                the PrefixLength is two and the suffix is "y".
+                            </li>
+                            <li>Positions are stored as delta encoded VInts. This means we only store the difference of the current position from the last position.</li>
+                            <li>Offsets are stored as delta encoded VInts. The first VInt is the startOffset, the second is the endOffset.</li>
+                        </ul>
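+                        <p>The prefix sharing described in the notes can be illustrated with a small Python
+                            sketch. The name <code>shared_prefix</code> is hypothetical, and the sketch compares
+                            characters of in-memory strings only; it does not deal with how Suffix is serialized.
+                        </p>

```python
def shared_prefix(previous_term, term):
    """Return (PrefixLength, Suffix) for term relative to previous_term."""
    prefix_length = 0
    for a, b in zip(previous_term, term):
        if a != b:
            break
        prefix_length += 1
    return prefix_length, term[prefix_length:]


print(shared_prefix("bone", "boy"))  # (2, 'y')
```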
+
+
+                    </li>
+                </ol>
+            </section>
+
+            <section id="Deleted Documents"><title>Deleted Documents</title>
+
+                <p>The .del file is
+                    optional, and only exists when a segment contains deletions.
+                </p>
+
+                <p>Although per-segment, this file is maintained exterior to compound segment files.
+                </p>
+                <p>
+                Deletions
+                    (.del) --&gt; [Format],ByteCount,BitCount, Bits | DGaps (depending on Format)
+                </p>
+
+                <p>Format,ByteCount,BitCount --&gt;
+                    UInt32
+                </p>
+
+                <p>Bits --&gt;
+                    &lt;Byte&gt;
+                    <sup>ByteCount</sup>
+                </p>
+
+                               <p>DGaps --&gt;
+                    &lt;DGap,NonzeroByte&gt;
+                    <sup>NonzeroBytesCount</sup>
+                </p>
+
+                <p>DGap --&gt;
+                    VInt
+                </p>
+
+                <p>NonzeroByte --&gt;
+                    Byte
+                </p>
+                               
+                <p>Format
+                    is optional. A value of -1 indicates DGaps. A non-negative value indicates Bits, in which case Format is omitted.
+                </p>
+
+                <p>ByteCount
+                    indicates the number of bytes in Bits. It is typically
+                    (SegSize/8)+1.
+                </p>
+
+                <p>
+                    BitCount
+                    indicates the number of bits that are currently set in Bits.
+                </p>
+
+                <p>Bits
+                    contains one bit for each document indexed. When the bit
+                    corresponding to a document number is set, that document is marked as
+                    deleted. Bit ordering is from least to most significant. Thus, if
+                    Bits contains two bytes, 0x00 and 0x02, then document 9 is marked as
+                    deleted.
+                </p>
+
+                               <p>DGaps
+                    represents sparse bit-vectors more efficiently than Bits.
+                    It is made of DGaps on indexes of nonzero bytes in Bits,
+                    and the nonzero bytes themselves. The number of nonzero bytes
+                    in Bits (NonzeroBytesCount) is not stored.
+                </p>
+                <p>For example,
+                    if there are 8000 bits and only bits 10,12,32 are set,
+                    DGaps would be used:
+                </p>
+                <p>
+                    (VInt) 1, (Byte) 20, (VInt) 3, (Byte) 1
+                </p>
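+                <p>The DGaps example can be reproduced with the following Python sketch. The name
+                    <code>encode_dgaps</code> is hypothetical; the sketch returns (DGap, NonzeroByte) pairs as
+                    plain integers and does not write the Format, ByteCount or BitCount fields.
+                </p>

```python
def encode_dgaps(set_bits):
    """Return (DGap, NonzeroByte) pairs for the given set bit numbers."""
    nonzero = {}
    for bit in set(set_bits):
        index = bit // 8  # byte that holds this bit
        # Bit ordering is from least to most significant within a byte.
        nonzero[index] = nonzero.get(index, 0) | 2 ** (bit % 8)
    pairs = []
    previous_index = 0  # the first gap is measured from byte index zero
    for index in sorted(nonzero):
        pairs.append((index - previous_index, nonzero[index]))
        previous_index = index
    return pairs


print(encode_dgaps([10, 12, 32]))  # [(1, 20), (3, 1)]
```

+                <p>Bits 10 and 12 both fall into byte 1 (value 20) and bit 32 into byte 4 (value 1), which
+                    gives the gaps 1 and 3 shown above.
+                </p>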
+            </section>
+        </section>
+
+        <section id="Limitations"><title>Limitations</title>
+
+            <p>
+             When referring to term numbers, Lucene's current
+             implementation uses a Java <code>int</code> to hold the
+             term index, which means the maximum number of unique
+             terms in any single index segment is ~2.1 billion times
+             the term index interval (default 128) = ~274 billion.
+             This is technically not a limitation of the index file
+             format, just of Lucene's current implementation.
+           </p>
+           <p>
+             Similarly, Lucene uses a Java <code>int</code> to refer
+             to document numbers, and the index file format uses an
+             <code>Int32</code> on-disk to store document numbers.
+             This is a limitation of both the index file format and
+             the current implementation.  Eventually these should be
+             replaced with either <code>UInt64</code> values, or
+             better yet, <code>VInt</code> values which have no
+             limit.
+            </p>
+
+        </section>
+
+    </body>
+
+</document>