--- /dev/null
+<?xml version="1.0"?>
+
+<document>
+ <header>
+ <title>
+ Apache Lucene - Index File Formats
+ </title>
+ </header>
+
+ <body>
+ <section id="Index File Formats"><title>Index File Formats</title>
+
+ <p>
+ This document defines the index file formats used
+ in this version of Lucene. If you are using a different
+ version of Lucene, please consult the copy of
+ <code>docs/fileformats.html</code>
+ that was distributed
+ with the version you are using.
+ </p>
+
+ <p>
+ Apache Lucene is written in Java, but several
+ efforts are underway to write
+ <a href="http://wiki.apache.org/lucene-java/LuceneImplementations">versions
+ of Lucene in other programming
+ languages</a>. If these versions are to remain compatible with Apache
+ Lucene, then a language-independent definition of the Lucene index
+ format is required. This document thus attempts to provide a
+ complete and independent definition of the Apache Lucene file
+ formats.
+ </p>
+
+ <p>
+ As Lucene evolves, this document should evolve.
+ Versions of Lucene in different programming languages should endeavor
+ to agree on file formats, and generate new versions of this document.
+ </p>
+
+ <p>
+ Compatibility notes are provided in this document,
+ describing how file formats have changed from prior versions.
+ </p>
+
+ <p>
+ In version 2.1, the file format was changed to allow
+ lock-less commits (i.e., no more commit lock). The
+ change is fully backwards compatible: you can open a
+ pre-2.1 index for searching or adding/deleting of
+ docs. When the new segments file is saved
+ (committed), it will be written in the new file format
+ (meaning no specific "upgrade" process is needed).
+ But note that once a commit has occurred, pre-2.1
+ Lucene will not be able to read the index.
+ </p>
+
+ <p>
+ In version 2.3, the file format was changed to allow
+ segments to share a single set of doc store (vectors and
+ stored fields) files. This allows for faster indexing
+ in certain cases. The change is fully backwards
+ compatible (in the same way as the lock-less commits
+ change in 2.1).
+ </p>
+
+ <p>
+ In version 2.4, strings are written as true UTF-8
+ byte sequences, not Java's modified UTF-8. See issue
+ LUCENE-510 for details.
+ </p>
+
+ <p>
+ In version 2.9, an optional opaque Map<String,String>
+ CommitUserData may be passed to IndexWriter's commit
+ methods (and later retrieved), which is recorded in
+ the segments_N file. See issue LUCENE-1382 for
+ details. Also, diagnostics were added to each segment
+ written recording details about why it was written
+ (due to flush, merge; which OS/JRE was used; etc.).
+ See issue LUCENE-1654 for details.
+ </p>
+
+ <p>
+ In version 3.0, compressed fields are no longer
+ written to the index (they can still be read, but on
+ merge the new segment will write them,
+ uncompressed). See issue LUCENE-1960 for details.
+ </p>
+
+ <p>
+ In version 3.1, segments record the code version
+ that created them. See LUCENE-2720 for details.
+
+ Additionally, segments track explicitly whether or
+ not they have term vectors. See LUCENE-2811 for details.
+ </p>
+ <p>
+ In version 3.2, numeric fields are written natively
+ to the stored fields file; previously they were stored in
+ text format only.
+ </p>
+ <p>
+ In version 3.4, fields can omit position data while
+ still indexing term frequencies.
+ </p>
+ </section>
+
+ <section id="Definitions"><title>Definitions</title>
+
+ <p>
+ The fundamental concepts in Lucene are index,
+ document, field and term.
+ </p>
+
+
+ <p>
+ An index contains a sequence of documents.
+ </p>
+
+ <ul>
+ <li>
+ <p>
+ A document is a sequence of fields.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ A field is a named sequence of terms.
+ </p>
+ </li>
+
+ <li>
+ A term is a string.
+ </li>
+ </ul>
+
+ <p>
+ The same string in two different fields is
+ considered a different term. Thus terms are represented as a pair of
+ strings, the first naming the field, and the second naming text
+ within the field.
+ </p>
+
+ <section id="Inverted Indexing"><title>Inverted Indexing</title>
+
+ <p>
+ The index stores statistics about terms in order
+ to make term-based search more efficient. Lucene's
+ index falls into the family of indexes known as <i>inverted
+ indexes</i>. This is because it can list, for a term, the documents that contain
+ it. This is the inverse of the natural relationship, in which
+ documents list terms.
+ </p>
+ </section>
+ <section id="Types of Fields">
+ <title>Types of Fields</title>
+ <p>
+ In Lucene, fields may be <i>stored</i>, in which
+ case their text is stored in the index literally, in a non-inverted
+ manner. Fields that are inverted are called <i>indexed</i>. A field
+ may be both stored and indexed.</p>
+
+ <p>The text of a field may be <i>tokenized</i> into terms to be
+ indexed, or the text of a field may be used literally as a term to be indexed.
+ Most fields are
+ tokenized, but sometimes it is useful for certain identifier fields
+ to be indexed literally.
+ </p>
+ <p>See the <a href="api/core/org/apache/lucene/document/Field.html">Field</a> Javadocs for more information on fields.</p>
+ </section>
+
+ <section id="Segments"><title>Segments</title>
+
+ <p>
+ Lucene indexes may be composed of multiple sub-indexes, or
+ <i>segments</i>. Each segment is a fully independent index, which could be searched
+ separately. Indexes evolve by:
+ </p>
+
+ <ol>
+ <li>
+ <p>Creating new segments for newly added documents.</p>
+ </li>
+ <li>
+ <p>Merging existing segments.</p>
+ </li>
+ </ol>
+
+ <p>
+ Searches may involve multiple segments and/or multiple indexes, each
+ index potentially composed of a set of segments.
+ </p>
+ </section>
+
+ <section id="Document Numbers"><title>Document Numbers</title>
+
+ <p>
+ Internally, Lucene refers to documents by an integer <i>document
+ number</i>. The first document added to an index is numbered zero, and each
+ subsequent document added gets a number one greater than the previous.
+ </p>
+
+ <p>
+ Note that a document's number may change, so caution should be taken
+ when storing these numbers outside of Lucene. In particular, numbers may
+ change in the following situations:
+ </p>
+
+
+ <ul>
+ <li>
+ <p>
+ The
+ numbers stored in each segment are unique only within the segment,
+ and must be converted before they can be used in a larger context.
+ The standard technique is to allocate each segment a range of
+ values, based on the range of numbers used in that segment. To
+ convert a document number from a segment to an external value, the
+ segment's <i>base</i> document
+ number is added. To convert an external value back to a
+ segment-specific value, the segment is identified by the range that
+ the external value is in, and the segment's base value is
+ subtracted. For example, two five-document segments might be
+ combined, so that the first segment has a base value of zero, and
+ the second of five. Document three from the second segment would
+ have an external value of eight.
+ </p>
+ </li>
+ <li>
+ <p>
+ When documents are deleted, gaps are created
+ in the numbering. These are eventually removed as the index evolves
+ through merging. Deleted documents are dropped when segments are
+ merged. A freshly-merged segment thus has no gaps in its numbering.
+ </p>
+ </li>
+ </ul>
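+ <p>
+ The base-offset conversion described in the first item above can be sketched as follows.
+ This is a minimal illustration rather than Lucene's own code: the class
+ <code>DocNumberMapper</code> and its method names are hypothetical, and the sketch
+ assumes segments are simply laid out one after another in the order given.
+ </p>

```java
/** Hypothetical helper: converts between segment-local and external document numbers. */
final class DocNumberMapper {
    private final int[] bases; // bases[i] = external number of document 0 in segment i
    private final int total;   // total number of documents across all segments

    DocNumberMapper(int[] segmentSizes) {
        bases = new int[segmentSizes.length];
        int next = 0;
        for (int i = 0; i < segmentSizes.length; i++) {
            bases[i] = next;          // each segment starts where the previous one ended
            next += segmentSizes[i];
        }
        total = next;
    }

    /** Segment-local number to external value: add the segment's base. */
    int toExternal(int segment, int localDoc) {
        return bases[segment] + localDoc;
    }

    /** Identifies the segment whose range contains the external value. */
    int segmentOf(int externalDoc) {
        if (externalDoc < 0 || externalDoc >= total) {
            throw new IllegalArgumentException("no such document: " + externalDoc);
        }
        for (int i = bases.length - 1; i >= 0; i--) {
            if (externalDoc >= bases[i]) {
                return i;
            }
        }
        throw new AssertionError("unreachable: bases[0] is always 0");
    }

    /** External value to segment-local number: subtract the segment's base. */
    int toLocal(int externalDoc) {
        return externalDoc - bases[segmentOf(externalDoc)];
    }

    public static void main(String[] args) {
        // Two five-document segments, as in the example in the text.
        DocNumberMapper mapper = new DocNumberMapper(new int[] {5, 5});
        System.out.println(mapper.toExternal(1, 3));                       // prints 8
        System.out.println(mapper.segmentOf(8) + " " + mapper.toLocal(8)); // prints 1 3
    }
}
```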
+
+ </section>
+
+ </section>
+
+ <section id="Overview"><title>Overview</title>
+
+ <p>
+ Each segment index maintains the following:
+ </p>
+ <ul>
+ <li>
+ <p>Field names. This
+ contains the set of field names used in the index.
+
+ </p>
+ </li>
+ <li>
+ <p>Stored Field
+ values. This contains, for each document, a list of attribute-value
+ pairs, where the attributes are field names. These are used to
+ store auxiliary information about the document, such as its title,
+ url, or an identifier to access a
+ database. The set of stored fields is what is returned for each hit
+ when searching. This is keyed by document number.
+ </p>
+ </li>
+ <li>
+ <p>Term dictionary.
+ A dictionary containing all of the terms used in all of the indexed
+ fields of all of the documents. The dictionary also contains the
+ number of documents which contain the term, and pointers to the
+ term's frequency and proximity data.
+ </p>
+ </li>
+
+ <li>
+ <p>Term Frequency
+ data. For each term in the dictionary, the numbers of all the
+ documents that contain that term, and the frequency of the term in
+ that document, unless frequencies are omitted (IndexOptions.DOCS_ONLY).
+ </p>
+ </li>
+
+ <li>
+ <p>Term Proximity
+ data. For each term in the dictionary, the positions that the term
+ occurs in each document. Note that this will
+ not exist if all fields in all documents omit position data.
+ </p>
+ </li>
+
+ <li>
+ <p>Normalization
+ factors. For each field in each document, a value is stored that is
+ multiplied into the score for hits on that field.
+ </p>
+ </li>
+ <li>
+ <p>Term Vectors. For each field in each document, the term vector
+ (sometimes called document vector) may be stored. A term vector consists
+ of term text and term frequency. To add term vectors to your index, see the
+ <a href="api/core/org/apache/lucene/document/Field.html">Field</a>
+ constructors.
+ </p>
+ </li>
+ <li>
+ <p>Deleted documents.
+ An optional file indicating which documents are deleted.
+ </p>
+ </li>
+ </ul>
+
+ <p>Details on each of these are provided in subsequent sections.
+ </p>
+ </section>
+
+ <section id="File Naming"><title>File Naming</title>
+
+ <p>
+ All files belonging to a segment have the same name with varying
+ extensions. The extensions correspond to the different file formats
+ described below. When using the Compound File format (default in 1.4 and greater), these files are
+ collapsed into a single .cfs file (see below for details).
+ </p>
+
+ <p>
+ Typically, all segments
+ in an index are stored in a single directory, although this is not
+ required.
+ </p>
+
+ <p>
+ As of version 2.1 (lock-less commits), file names are
+ never re-used (there is one exception, "segments.gen",
+ see below). That is, when any file is saved to the
+ Directory it is given a never before used filename.
+ This is achieved using a simple generations approach.
+ For example, the first segments file is segments_1,
+ then segments_2, etc. The generation is a sequential
+ long integer represented in alpha-numeric (base 36)
+ form.
+ </p>
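+ <p>
+ As an illustration of the generation scheme, the sketch below builds and parses
+ <code>segments_N</code> names using base 36. The class and method names are
+ hypothetical; only the <code>segments_</code> prefix and the base-36 representation
+ come from the description above.
+ </p>

```java
/** Hypothetical helper for generation-based file names such as segments_1, segments_2, ... */
final class GenerationNames {
    private static final String PREFIX = "segments_";

    /** Formats a generation as a file name, writing the number in base 36. */
    static String segmentsFileName(long generation) {
        if (generation < 0) {
            throw new IllegalArgumentException("generation must not be negative");
        }
        return PREFIX + Long.toString(generation, Character.MAX_RADIX); // MAX_RADIX is 36
    }

    /** Recovers the generation from a segments_N file name. */
    static long parseGeneration(String fileName) {
        if (!fileName.startsWith(PREFIX)) {
            throw new IllegalArgumentException("not a segments_N file: " + fileName);
        }
        return Long.parseLong(fileName.substring(PREFIX.length()), Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        System.out.println(segmentsFileName(1));           // prints segments_1
        System.out.println(segmentsFileName(36));          // prints segments_10
        System.out.println(parseGeneration("segments_z")); // prints 35
    }
}
```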
+
+ </section>
+ <section id="file-names"><title>Summary of File Extensions</title>
+ <p>The following table summarizes the names and extensions of the files in Lucene:
+ <table>
+ <tr>
+ <th>Name</th>
+ <th>Extension</th>
+ <th>Brief Description</th>
+ </tr>
+ <tr>
+ <td><a href="#Segments File">Segments File</a></td>
+ <td>segments.gen, segments_N</td>
+ <td>Stores information about segments</td>
+ </tr>
+ <tr>
+ <td><a href="#Lock File">Lock File</a></td>
+ <td>write.lock</td>
+ <td>The Write lock prevents multiple IndexWriters from writing to the same file.</td>
+ </tr>
+ <tr>
+ <td><a href="#Compound Files">Compound File</a></td>
+ <td>.cfs</td>
+ <td>An optional "virtual" file consisting of all the other index files for systems
+ that frequently run out of file handles.</td>
+ </tr>
+ <tr>
+ <td><a href="#Compound Files">Compound File Entry table</a></td>
+ <td>.cfe</td>
+ <td>The "virtual" compound file's entry table holding all entries in the corresponding .cfs file (Since 3.4)</td>
+ </tr>
+ <tr>
+ <td><a href="#Fields">Fields</a></td>
+ <td>.fnm</td>
+ <td>Stores information about the fields</td>
+ </tr>
+ <tr>
+ <td><a href="#field_index">Field Index</a></td>
+ <td>.fdx</td>
+ <td>Contains pointers to field data</td>
+ </tr>
+ <tr>
+ <td><a href="#field_data">Field Data</a></td>
+ <td>.fdt</td>
+ <td>The stored fields for documents</td>
+ </tr>
+ <tr>
+ <td><a href="#tis">Term Infos</a></td>
+ <td>.tis</td>
+ <td>Part of the term dictionary, stores term info</td>
+ </tr>
+ <tr>
+ <td><a href="#tii">Term Info Index</a></td>
+ <td>.tii</td>
+ <td>The index into the Term Infos file</td>
+ </tr>
+ <tr>
+ <td><a href="#Frequencies">Frequencies</a></td>
+ <td>.frq</td>
+ <td>Contains the list of docs which contain each term along with frequency</td>
+ </tr>
+ <tr>
+ <td><a href="#Positions">Positions</a></td>
+ <td>.prx</td>
+ <td>Stores position information about where a term occurs in the index</td>
+ </tr>
+ <tr>
+ <td><a href="#Normalization Factors">Norms</a></td>
+ <td>.nrm</td>
+ <td>Encodes length and boost factors for docs and fields</td>
+ </tr>
+ <tr>
+ <td><a href="#tvx">Term Vector Index</a></td>
+ <td>.tvx</td>
+ <td>Stores offset into the document data file</td>
+ </tr>
+ <tr>
+ <td><a href="#tvd">Term Vector Documents</a></td>
+ <td>.tvd</td>
+ <td>Contains information about each document that has term vectors</td>
+ </tr>
+ <tr>
+ <td><a href="#tvf">Term Vector Fields</a></td>
+ <td>.tvf</td>
+ <td>The field level info about term vectors</td>
+ </tr>
+ <tr>
+ <td><a href="#Deleted Documents">Deleted Documents</a></td>
+ <td>.del</td>
+ <td>Info about which documents are deleted</td>
+ </tr>
+ </table>
+
+ </p>
+ </section>
+
+ <section id="Primitive Types"><title>Primitive Types</title>
+
+ <section id="Byte"><title>Byte</title>
+
+ <p>
+ The most primitive type
+ is an eight-bit byte. Files are accessed as sequences of bytes. All
+ other data types are defined as sequences
+ of bytes, so file formats are byte-order independent.
+ </p>
+
+ </section>
+
+ <section id="UInt32"><title>UInt32</title>
+
+ <p>
+ 32-bit unsigned integers are written as four
+ bytes, high-order bytes first.
+ </p>
+ <p>
+ UInt32 --> <Byte><sup>4</sup>
+ </p>
+
+ </section>
+
+ <section id="Uint64"><title>UInt64</title>
+
+ <p>
+ 64-bit unsigned integers are written as eight
+ bytes, high-order bytes first.
+ </p>
+
+ <p>UInt64 --> <Byte><sup>8</sup>
+ </p>
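+ <p>
+ The fixed-width encodings for UInt32 and UInt64 can be sketched as below. The class and
+ method names are hypothetical. Because Java has no unsigned integer types, this sketch
+ passes a UInt32 in a <code>long</code> and treats the 64 bits of a <code>long</code> as
+ the unsigned UInt64 value; both choices are assumptions of the example.
+ </p>

```java
/** Hypothetical helper: big-endian (high-order byte first) fixed-width integers. */
final class BigEndianInts {

    /** Encodes a value in the range 0..2^32-1 as four bytes, high-order byte first. */
    static byte[] writeUInt32(long value) {
        if (value < 0 || value > 0xFFFFFFFFL) {
            throw new IllegalArgumentException("not a UInt32: " + value);
        }
        return new byte[] {
            (byte) (value >>> 24), (byte) (value >>> 16), (byte) (value >>> 8), (byte) value
        };
    }

    /** Decodes four bytes starting at offset into an unsigned 32-bit value. */
    static long readUInt32(byte[] bytes, int offset) {
        return ((bytes[offset] & 0xFFL) << 24)
             | ((bytes[offset + 1] & 0xFFL) << 16)
             | ((bytes[offset + 2] & 0xFFL) << 8)
             |  (bytes[offset + 3] & 0xFFL);
    }

    /** Encodes the 64 bits of value as eight bytes, high-order byte first. */
    static byte[] writeUInt64(long value) {
        byte[] out = new byte[8];
        for (int i = 0; i < 8; i++) {
            out[i] = (byte) (value >>> (56 - 8 * i));
        }
        return out;
    }

    /** Decodes eight bytes starting at offset; the result holds the raw 64 bits. */
    static long readUInt64(byte[] bytes, int offset) {
        long value = 0;
        for (int i = 0; i < 8; i++) {
            value = (value << 8) | (bytes[offset + i] & 0xFFL);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(writeUInt32(1L)));   // prints [0, 0, 0, 1]
        System.out.println(readUInt64(writeUInt64(258L), 0));             // prints 258
    }
}
```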
+
+ </section>
+
+ <section id="VInt"><title>VInt</title>
+
+ <p>
+ A variable-length format for non-negative integers is
+ defined where the high-order bit of each byte indicates whether more
+ bytes remain to be read. The low-order seven bits are appended as
+ increasingly more significant bits in the resulting integer value.
+ Thus values from zero to 127 may be stored in a single byte, values
+ from 128 to 16,383 may be stored in two bytes, and so on.
+ </p>
+
+ <p>
+ <b>VInt Encoding Example</b>
+ </p>
+
+ <table>
+ <tr>
+ <th>Value</th>
+ <th>First byte</th>
+ <th>Second byte</th>
+ <th>Third byte</th>
+ </tr>
+ <tr>
+ <td>0</td>
+ <td>00000000</td>
+ <td></td>
+ <td></td>
+ </tr>
+ <tr>
+ <td>1</td>
+ <td>00000001</td>
+ <td></td>
+ <td></td>
+ </tr>
+ <tr>
+ <td>2</td>
+ <td>00000010</td>
+ <td></td>
+ <td></td>
+ </tr>
+ <tr>
+ <td>...</td>
+ <td></td>
+ <td></td>
+ <td></td>
+ </tr>
+ <tr>
+ <td>127</td>
+ <td>01111111</td>
+ <td></td>
+ <td></td>
+ </tr>
+ <tr>
+ <td>128</td>
+ <td>10000000</td>
+ <td>00000001</td>
+ <td></td>
+ </tr>
+ <tr>
+ <td>129</td>
+ <td>10000001</td>
+ <td>00000001</td>
+ <td></td>
+ </tr>
+ <tr>
+ <td>130</td>
+ <td>10000010</td>
+ <td>00000001</td>
+ <td></td>
+ </tr>
+ <tr>
+ <td>...</td>
+ <td></td>
+ <td></td>
+ <td></td>
+ </tr>
+ <tr>
+ <td>16,383</td>
+ <td>11111111</td>
+ <td>01111111</td>
+ <td></td>
+ </tr>
+ <tr>
+ <td>16,384</td>
+ <td>10000000</td>
+ <td>10000000</td>
+ <td>00000001</td>
+ </tr>
+ <tr>
+ <td>16,385</td>
+ <td>10000001</td>
+ <td>10000000</td>
+ <td>00000001</td>
+ </tr>
+ <tr>
+ <td>...</td>
+ <td></td>
+ <td></td>
+ <td></td>
+ </tr>
+ </table>
+
+ <p>
+ This provides compression while still being
+ efficient to decode.
+ </p>
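+ <p>
+ The VInt scheme can be sketched in a few lines. The class <code>VIntCodec</code> and
+ its method names are hypothetical and chosen for this example; the sketch follows the
+ rule stated above (seven payload bits per byte, least significant group first, high bit
+ set when more bytes follow) and is not taken from Lucene's source.
+ </p>

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical helper: variable-length integer encoding as described in the text. */
final class VIntCodec {

    /** Encodes value using seven bits per byte, low-order group first. */
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(5);
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit set: more bytes remain
            value >>>= 7;
        }
        out.write(value);                     // final byte has the high bit clear
        return out.toByteArray();
    }

    /** Decodes one VInt starting at offset. */
    static int decode(byte[] bytes, int offset) {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = bytes[offset++];
            value |= (b & 0x7F) << shift;     // each group is more significant than the last
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    public static void main(String[] args) {
        for (int v : new int[] {0, 127, 128, 16383, 16384}) {
            System.out.println(v + " -> " + encode(v).length + " byte(s), decoded "
                + decode(encode(v), 0));
        }
    }
}
```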
+
+ </section>
+
+ <section id="Chars"><title>Chars</title>
+
+ <p>
+ Lucene writes Unicode
+ character sequences as UTF-8 encoded bytes.
+ </p>
+
+
+ </section>
+
+ <section id="String"><title>String</title>
+
+ <p>
+ Lucene writes strings as UTF-8 encoded bytes.
+ First the length, in bytes, is written as a VInt,
+ followed by the bytes.
+ </p>
+
+ <p>
+ String --> VInt, Chars
+ </p>
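+ <p>
+ A String as defined above might be written and read as in the following sketch. The
+ class <code>StringCodec</code> is hypothetical, and the VInt length prefix is inlined so
+ that the example stands on its own.
+ </p>

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: String --> VInt (length in bytes), then the UTF-8 bytes. */
final class StringCodec {

    static byte[] encode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream(utf8.length + 5);
        int length = utf8.length;             // length is counted in bytes, not chars
        while ((length & ~0x7F) != 0) {
            out.write((length & 0x7F) | 0x80);
            length >>>= 7;
        }
        out.write(length);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    /** Decodes a single String that starts at the beginning of bytes. */
    static String decode(byte[] bytes) {
        int pos = 0;
        int length = 0;
        int shift = 0;
        byte b;
        do {                                  // read the VInt length prefix
            b = bytes[pos++];
            length |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return new String(bytes, pos, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] encoded = encode("Lucene");
        System.out.println(encoded.length);   // prints 7 (one length byte plus six UTF-8 bytes)
        System.out.println(decode(encoded));  // prints Lucene
    }
}
```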
+
+ </section>
+ </section>
+
+ <section id="Compound Types"><title>Compound Types</title>
+ <section id="MapStringString"><title>Map<String,String></title>
+
+ <p>
+ In a couple of places, Lucene stores a Map from
+ String to String.
+ </p>
+
+ <p>
+ Map<String,String> --> Count, <String,String><sup>Count</sup>
+ </p>
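+ <p>
+ The sketch below serializes such a map as a count followed by that many key/value
+ String pairs. The type of Count is not given in this section; the example
+ <b>assumes</b> a four-byte, high-order-first integer, and that assumption should be
+ checked against the implementation. The class name <code>StringMapCodec</code> is hypothetical.
+ </p>

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: writes Count, then Count pairs of Strings. */
final class StringMapCodec {

    static byte[] encode(Map<String, String> map) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int count = map.size();
        // ASSUMPTION: Count is written as four bytes, high-order byte first.
        out.write(count >>> 24);
        out.write(count >>> 16);
        out.write(count >>> 8);
        out.write(count);
        for (Map.Entry<String, String> entry : map.entrySet()) {
            writeString(out, entry.getKey());
            writeString(out, entry.getValue());
        }
        return out.toByteArray();
    }

    /** String --> VInt byte length, then UTF-8 bytes. */
    private static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        int length = utf8.length;
        while ((length & ~0x7F) != 0) {
            out.write((length & 0x7F) | 0x80);
            length >>>= 7;
        }
        out.write(length);
        out.write(utf8, 0, utf8.length);
    }

    public static void main(String[] args) {
        Map<String, String> diagnostics = new LinkedHashMap<String, String>();
        diagnostics.put("os", "Linux");
        System.out.println(encode(diagnostics).length + " bytes");
    }
}
```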
+
+ </section>
+
+ </section>
+
+ <section id="Per-Index Files"><title>Per-Index Files</title>
+
+ <p>
+ The files in this section exist one-per-index.
+ </p>
+
+ <section id="Segments File"><title>Segments File</title>
+
+ <p>
+ The active segments in the index are stored in the
+ segment info file,
+ <tt>segments_N</tt>.
+ There may
+ be one or more
+ <tt>segments_N</tt>
+ files in the
+ index; however, the one with the largest
+ generation is the active one (when older
+ segments_N files are present, it is because they
+ temporarily cannot be deleted, a writer is in
+ the process of committing, or a custom
+ <a href="api/core/org/apache/lucene/index/IndexDeletionPolicy.html">IndexDeletionPolicy</a>
+ is in use). This file lists each
+ segment by name, has details about the separate
+ norms and deletion files, and also contains the
+ size of each segment.
+ </p>
+
+ <p>
+ As of 2.1, there is also a file
+ <tt>segments.gen</tt>.
+ This file contains the
+ current generation (the
+ <tt>_N</tt>
+ in
+ <tt>segments_N</tt>)
+ of the index. This is
+ used only as a fallback in case the current
+ generation cannot be accurately determined by
+ directory listing alone (as is the case for some
+ NFS clients with time-based directory cache
+ expiration). This file simply contains an Int32
+ version header (SegmentInfos.FORMAT_LOCKLESS =
+ -2), followed by the generation recorded as Int64,
+ written twice.
+ </p>
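+ <p>
+ The layout of <tt>segments.gen</tt> can be sketched as follows. The class and method
+ names are hypothetical. The sketch assumes Int32 and Int64 are written high-order byte
+ first, and it treats a mismatch between the two copies of the generation as a sign that
+ the file cannot be trusted; returning -1 in that case is a convention of this example only.
+ </p>

```java
import java.nio.ByteBuffer;

/** Hypothetical helper: version header, then the generation written twice. */
final class SegmentsGen {
    static final int FORMAT_LOCKLESS = -2;
    private static final int FILE_LENGTH = 4 + 8 + 8;

    static byte[] write(long generation) {
        // ByteBuffer is big-endian by default, i.e. high-order bytes first.
        return ByteBuffer.allocate(FILE_LENGTH)
            .putInt(FORMAT_LOCKLESS)
            .putLong(generation)
            .putLong(generation)
            .array();
    }

    /** Returns the recorded generation, or -1 if the contents are not usable. */
    static long read(byte[] contents) {
        if (contents.length < FILE_LENGTH) {
            return -1;
        }
        ByteBuffer in = ByteBuffer.wrap(contents);
        if (in.getInt() != FORMAT_LOCKLESS) {
            return -1;
        }
        long first = in.getLong();
        long second = in.getLong();
        return first == second ? first : -1; // the two copies must agree
    }

    public static void main(String[] args) {
        System.out.println(read(write(7L))); // prints 7
    }
}
```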
+ <p>
+ <b>3.1</b>
+ Segments --> Format, Version, NameCounter, SegCount, <SegVersion, SegName, SegSize, DelGen, DocStoreOffset, [DocStoreSegment, DocStoreIsCompoundFile], HasSingleNormFile, NumField,
+ NormGen<sup>NumField</sup>,
+ IsCompoundFile, DeletionCount, HasProx, Diagnostics, HasVectors><sup>SegCount</sup>, CommitUserData, Checksum
+ </p>
+
+ <p>
+ Format, NameCounter, SegCount, SegSize, NumField,
+ DocStoreOffset, DeletionCount --> Int32
+ </p>
+
+ <p>
+ Version, DelGen, NormGen, Checksum --> Int64
+ </p>
+
+ <p>
+ SegVersion, SegName, DocStoreSegment --> String
+ </p>
+
+ <p>
+ Diagnostics --> Map<String,String>
+ </p>
+
+ <p>
+ IsCompoundFile, HasSingleNormFile,
+ DocStoreIsCompoundFile, HasProx, HasVectors --> Int8
+ </p>
+
+ <p>
+ CommitUserData --> Map<String,String>
+ </p>
+
+ <p>
+ Format is -9 (SegmentInfos.FORMAT_DIAGNOSTICS).
+ </p>
+
+ <p>
+ Version counts how often the index has been
+ changed by adding or deleting documents.
+ </p>
+
+ <p>
+ NameCounter is used to generate names for new segment files.
+ </p>
+
+ <p>
+ SegVersion is the code version that created the segment.
+ </p>
+
+ <p>
+ SegName is the name of the segment, and is used as the file name prefix
+ for all of the files that compose the segment's index.
+ </p>
+
+ <p>
+ SegSize is the number of documents contained in the segment index.
+ </p>
+
+ <p>
+ DelGen is the generation count of the separate
+ deletes file. If this is -1, there are no
+ separate deletes. If it is 0, this is a pre-2.1
+ segment and you must check the filesystem for the
+ existence of _X.del. Anything above zero means
+ there are separate deletes (_X_N.del).
+ </p>
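+ <p>
+ The three cases for DelGen can be expressed as a small lookup, sketched below. The
+ class and method names are hypothetical. The example <b>assumes</b> that the
+ generation suffix N is written in base 36, as described for file names in general; for
+ a DelGen of 0 the caller would still need to check whether the returned file exists.
+ </p>

```java
/** Hypothetical helper: maps a segment name and DelGen to a deletes file name. */
final class DeletesFileName {

    /** Returns the candidate deletes file name, or null if there are no separate deletes. */
    static String forSegment(String segName, long delGen) {
        if (delGen == -1) {
            return null;                 // no separate deletes
        }
        if (delGen == 0) {
            return segName + ".del";     // pre-2.1 segment: existence must be checked on disk
        }
        // ASSUMPTION: the generation is rendered in base 36, like other generations.
        return segName + "_" + Long.toString(delGen, Character.MAX_RADIX) + ".del";
    }

    public static void main(String[] args) {
        System.out.println(forSegment("_a", -1)); // prints null
        System.out.println(forSegment("_a", 0));  // prints _a.del
        System.out.println(forSegment("_a", 2));  // prints _a_2.del
    }
}
```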
+
+ <p>
+ NumField is the size of the array for NormGen, or
+ -1 if there are no NormGens stored.
+ </p>
+
+ <p>
+ NormGen records the generation of the separate
+ norms files. If NumField is -1, there are no
+ normGens stored and they are all assumed to be 0
+ when the segment file was written pre-2.1 and all
+ assumed to be -1 when the segments file is 2.1 or
+ above. The generation then has the same meaning
+ as delGen (above).
+ </p>
+
+ <p>
+ IsCompoundFile records whether the segment is
+ written as a compound file or not. If this is -1,
+ the segment is not a compound file. If it is 1,
+ the segment is a compound file. Otherwise it is 0,
+ which means the filesystem must be checked to see whether _X.cfs
+ exists.
+ </p>
+
+ <p>
+ If HasSingleNormFile is 1, then the field norms are
+ written as a single joined file (with extension
+ <tt>.nrm</tt>); if it is 0 then each field's norms
+ are stored as separate <tt>.fN</tt> files. See
+ "Normalization Factors" below for details.
+ </p>
+
+ <p>
+ DocStoreOffset, DocStoreSegment,
+ DocStoreIsCompoundFile: If DocStoreOffset is -1,
+ this segment has its own doc store (stored fields
+ values and term vectors) files and DocStoreSegment
+ and DocStoreIsCompoundFile are not stored. In
+ this case all files for stored field values
+ (<tt>*.fdt</tt> and <tt>*.fdx</tt>) and term
+ vectors (<tt>*.tvf</tt>, <tt>*.tvd</tt> and
+ <tt>*.tvx</tt>) will be stored with this segment.
+ Otherwise, DocStoreSegment is the name of the
+ segment that has the shared doc store files;
+ DocStoreIsCompoundFile is 1 if that segment is
+ stored in compound file format (as a <tt>.cfx</tt>
+ file); and DocStoreOffset is the starting document
+ in the shared doc store files where this segment's
+ documents begin. In this case, this segment does
+ not store its own doc store files but instead
+ shares a single set of these files with other
+ segments.
+ </p>
+
+ <p>
+ Checksum contains the CRC32 checksum of all bytes
+ in the segments_N file up until the checksum.
+ This is used to verify integrity of the file on
+ opening the index.
+ </p>
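+ <p>
+ Verifying such a trailing checksum might look like the sketch below, which uses the
+ JDK's <code>java.util.zip.CRC32</code>. The class and method names are hypothetical,
+ and the example assumes the Int64 checksum occupies the last eight bytes of the file,
+ high-order byte first.
+ </p>

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/** Hypothetical helper: CRC32 of all preceding bytes, stored as a trailing Int64. */
final class SegmentsChecksum {

    /** Returns body followed by the eight-byte checksum of body. */
    static byte[] appendChecksum(byte[] body) {
        CRC32 crc = new CRC32();
        crc.update(body, 0, body.length);
        return ByteBuffer.allocate(body.length + 8)
            .put(body)
            .putLong(crc.getValue())
            .array();
    }

    /** Recomputes the checksum over everything before the last eight bytes and compares. */
    static boolean verify(byte[] file) {
        if (file.length < 8) {
            return false;
        }
        int bodyLength = file.length - 8;
        CRC32 crc = new CRC32();
        crc.update(file, 0, bodyLength);
        long stored = ByteBuffer.wrap(file, bodyLength, 8).getLong();
        return stored == crc.getValue();
    }

    public static void main(String[] args) {
        byte[] file = appendChecksum(new byte[] {1, 2, 3});
        System.out.println(verify(file)); // prints true
        file[0] ^= 1;                     // simulate corruption
        System.out.println(verify(file)); // prints false
    }
}
```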
+
+ <p>
+ DeletionCount records the number of deleted
+ documents in this segment.
+ </p>
+
+ <p>
+ HasProx is 1 if any fields in this segment have
+ position data (IndexOptions.DOCS_AND_FREQS_AND_POSITIONS); else, it's 0.
+ </p>
+
+ <p>
+ CommitUserData stores an optional user-supplied
+ opaque Map<String,String> that was passed to
+ IndexWriter's commit or prepareCommit, or
+ IndexReader's flush methods.
+ </p>
+ <p>
+ The Diagnostics Map is privately written by
+ IndexWriter, as a debugging aid, for each segment
+ it creates. It includes metadata like the current
+ Lucene version, OS, Java version, why the segment
+ was created (merge, flush, addIndexes), etc.
+ </p>
+
+ <p> HasVectors is 1 if this segment stores term vectors,
+ else it's 0.
+ </p>
+
+ </section>
+
+ <section id="Lock File"><title>Lock File</title>
+
+ <p>
+ The write lock, which is stored in the index
+ directory by default, is named "write.lock". If
+ the lock directory is different from the index
+ directory then the write lock will be named
+ "XXXX-write.lock" where XXXX is a unique prefix
+ derived from the full path to the index directory.
+ When this file is present, a writer is currently
+ modifying the index (adding or removing
+ documents). This lock file ensures that only one
+ writer is modifying the index at a time.
+ </p>
+ </section>
+
+ <section id="Deletable File"><title>Deletable File</title>
+
+ <p>
+ No deletable file is written. Instead, a writer dynamically
+ computes which files are deletable.
+ </p>
+
+ </section>
+
+ <section id="Compound Files"><title>Compound Files</title>
+
+ <p>Starting with Lucene 1.4 the compound file format became the default. This
+ is simply a container for all files described in the next section
+ (except for the .del file).</p>
+ <p>Compound Entry Table (.cfe) --> Version, FileCount, <FileName, DataOffset, DataLength>
+ <sup>FileCount</sup>
+ </p>
+
+ <p>Compound (.cfs) --> FileData <sup>FileCount</sup>
+ </p>
+
+ <p>Version --> Int</p>
+
+ <p>FileCount --> VInt</p>
+
+ <p>DataOffset --> Long</p>
+
+ <p>DataLength --> Long</p>
+
+ <p>FileName --> String</p>
+
+ <p>FileData --> raw file data</p>
+ <p>The raw file data is the data from the individual files named above.</p>
+
+ <p>Starting with Lucene 2.3, doc store files (stored
+ field values and term vectors) can be shared in a
+ single set of files for more than one segment. When
+ the compound file format is enabled, these shared files are
+ added into a single compound file (same format as
+ above) but with the extension <tt>.cfx</tt>.
+ </p>
+
+ </section>
+
+ </section>
+
+ <section id="Per-Segment Files"><title>Per-Segment Files</title>
+
+ <p>
+ The remaining files are all per-segment, and are
+ thus defined by suffix.
+ </p>
+ <section id="Fields"><title>Fields</title>
+ <p>
+ <br/>
+ <b>Field Info</b>
+ <br/>
+ </p>
+
+ <p>
+ Field names are
+ stored in the field info file, with suffix .fnm.
+ </p>
+ <p>
+ FieldInfos
+ (.fnm) --> FNMVersion,FieldsCount, <FieldName,
+ FieldBits>
+ <sup>FieldsCount</sup>
+ </p>
+
+ <p>
+ FNMVersion, FieldsCount --> VInt
+ </p>
+
+ <p>
+ FieldName --> String
+ </p>
+
+ <p>
+ FieldBits --> Byte
+ </p>
+
+ <p>
+ <ul>
+ <li>
+ The low-order bit is one for
+ indexed fields, and zero for non-indexed fields.
+ </li>
+ <li>
+ The second lowest-order
+ bit is one for fields that have term vectors stored, and zero for fields
+ without term vectors.
+ </li>
+ <li>If the third lowest-order bit is set (0x04), term positions are stored with the term vectors.</li>
+ <li>If the fourth lowest-order bit is set (0x08), term offsets are stored with the term vectors.</li>
+ <li>If the fifth lowest-order bit is set (0x10), norms are omitted for the indexed field.</li>
+ <li>If the sixth lowest-order bit is set (0x20), payloads are stored for the indexed field.</li>
+ <li>If the seventh lowest-order bit is set (0x40), term frequencies and positions are omitted for the indexed field.</li>
+ <li>If the eighth lowest-order bit is set (0x80), positions are omitted for the indexed field.</li>
+ </ul>
+ </p>
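+ <p>The bit assignments listed above can be sketched as a small Python decoder. The masks come from the list; the flag names and the function name are illustrative only and are not identifiers used by Lucene.</p>

```python
# Flag masks taken from the FieldBits list; the names are illustrative.
FIELD_BITS = {
    "indexed": 0x01,
    "store_term_vector": 0x02,
    "store_positions_with_term_vector": 0x04,
    "store_offsets_with_term_vector": 0x08,
    "omit_norms": 0x10,
    "store_payloads": 0x20,
    "omit_term_freq_and_positions": 0x40,
    "omit_positions": 0x80,
}


def decode_field_bits(field_bits):
    """Return the set of flag names whose bit is set in a FieldBits byte."""
    return {name for name, mask in FIELD_BITS.items() if field_bits & mask}
```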
+
+ <p>
+ FNMVersion (added in 2.9) is -2 for indexes written by Lucene 2.9 through 3.3, and -3 for indexes written by Lucene 3.4 and later.
+ </p>
+
+ <p>
+ Fields are numbered by their order in this file. Thus field zero is
+ the
+ first field in the file, field one the next, and so on. Note that,
+ like document numbers, field numbers are segment relative.
+ </p>
+
+
+
+ <p>
+ <br/>
+ <b>Stored Fields</b>
+ <br/>
+ </p>
+
+ <p>
+ Stored fields are represented by two files:
+ </p>
+
+ <ol>
+ <li><a name="field_index"/>
+ <p>
+ The field index, or .fdx file.
+ </p>
+
+ <p>
+ This contains, for each document, a pointer to
+ its field data, as follows:
+ </p>
+
+ <p>
+ FieldIndex
+ (.fdx) -->
+ <FieldValuesPosition>
+ <sup>SegSize</sup>
+ </p>
+ <p>FieldValuesPosition
+ --> Uint64
+ </p>
+ <p>This
+ is used to find the location within the field data file of the
+ fields of a particular document. Because it contains fixed-length
+ data, this file may be easily randomly accessed. The position of
+ document
+ <i>n</i>'s
+ field data is the Uint64 at byte offset
+ <i>n*8</i>
+ in
+ this file.
+ </p>
+ </li>
+ <li>
+ <p><a name="field_data"/>
+ The field data, or .fdt file.
+
+ </p>
+
+ <p>
+ This contains the stored fields of each document,
+ as follows:
+ </p>
+
+ <p>
+ FieldData (.fdt) -->
+ <DocFieldData>
+ <sup>SegSize</sup>
+ </p>
+ <p>DocFieldData -->
+ FieldCount, <FieldNum, Bits, Value>
+ <sup>FieldCount</sup>
+ </p>
+ <p>FieldCount -->
+ VInt
+ </p>
+ <p>FieldNum -->
+ VInt
+ </p>
+ <p>Bits -->
+ Byte
+ </p>
+ <p>
+ <ul>
+ <li>low order bit is one for tokenized fields</li>
+ <li>second bit is one for fields containing binary data</li>
+ <li>third bit is one for fields with compression option enabled
+ (if compression is enabled, the algorithm used is ZLIB),
+ only available for indexes until Lucene version 2.9.x</li>
+ <li>4th to 6th bit (mask: 0x7<<3) define the type of a
+ numeric field: <ul>
+ <li>all bits in mask are cleared if no numeric field at all</li>
+ <li>1<<3: Value is Int</li>
+ <li>2<<3: Value is Long</li>
+ <li>3<<3: Value is Int as Float (as of Float.intBitsToFloat)</li>
+ <li>4<<3: Value is Long as Double (as of Double.longBitsToDouble)</li>
+ </ul></li>
+ </ul>
+ </p>
+ <p>Value -->
+ String | BinaryValue | Int | Long (depending on Bits)
+ </p>
+ <p>BinaryValue -->
+ ValueSize, <Byte>
+ <sup>ValueSize</sup>
+ </p>
+ <p>ValueSize -->
+ VInt
+ </p>
+
+ </li>
+ </ol>
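+ <p>The two stored-field files can be illustrated with a short Python sketch: one helper looks up a document's FieldValuesPosition in the .fdx contents, the other splits a .fdt Bits byte into its flags. The sketch assumes the .fdx data consists solely of 8-byte entries stored high-order byte first, exactly as laid out above. All names are hypothetical.</p>

```python
import struct


def field_values_position(fdx_bytes, n):
    """Read the FieldValuesPosition of document n from .fdx contents.

    Assumes the data holds nothing but 8-byte entries, stored high-order
    byte first, so that document n's entry starts at byte n * 8.
    """
    (position,) = struct.unpack_from(">Q", fdx_bytes, n * 8)
    return position


# Numeric type codes held in the 4th to 6th bits (mask 0x7 << 3).
# Codes not listed in the text decode to None here as well.
NUMERIC_TYPES = {0: None, 1: "Int", 2: "Long", 3: "Float", 4: "Double"}


def decode_stored_bits(bits):
    """Split a .fdt Bits byte into its documented flags."""
    return {
        "tokenized": bool(bits & 0x01),
        "binary": bool(bits & 0x02),
        "compressed": bool(bits & 0x04),
        "numeric": NUMERIC_TYPES.get((bits >> 3) & 0x7),
    }
```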
+
+ </section>
+ <section id="Term Dictionary"><title>Term Dictionary</title>
+
+ <p>
+ The term dictionary is represented as two files:
+ </p>
+ <ol>
+ <li><a name="tis"/>
+ <p>
+ The term infos, or .tis file.
+ </p>
+
+ <p>
+ TermInfoFile (.tis)-->
+ TIVersion, TermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermInfos
+ </p>
+ <p>TIVersion -->
+ UInt32
+ </p>
+ <p>TermCount -->
+ UInt64
+ </p>
+ <p>IndexInterval -->
+ UInt32
+ </p>
+ <p>SkipInterval -->
+ UInt32
+ </p>
+ <p>MaxSkipLevels -->
+ UInt32
+ </p>
+ <p>TermInfos -->
+ <TermInfo>
+ <sup>TermCount</sup>
+ </p>
+ <p>TermInfo -->
+ <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>
+ </p>
+ <p>Term -->
+ <PrefixLength, Suffix, FieldNum>
+ </p>
+ <p>Suffix -->
+ String
+ </p>
+ <p>PrefixLength,
+ DocFreq, FreqDelta, ProxDelta, SkipDelta
+ <br/>
+ --> VInt
+ </p>
+ <p>
+ This file is sorted by Term. Terms are
+ ordered first lexicographically (by UTF16
+ character code) by the term's field name,
+ and within that lexicographically (by
+ UTF16 character code) by the term's text.
+ </p>
+ <p>TIVersion names the version of the format
+ of this file and is equal to TermInfosWriter.FORMAT_CURRENT.
+ </p>
+ <p>Term
+ text prefixes are shared. The PrefixLength is the number of initial
+ characters from the previous term which must be pre-pended to a
+ term's suffix in order to form the term's text. Thus, if the
+ previous term's text was "bone" and the term is "boy",
+ the PrefixLength is two and the suffix is "y".
+ </p>
+ <p>FieldNum
+ determines the term's field, whose name is stored in the .fnm file.
+ </p>
+ <p>DocFreq
+ is the count of documents which contain the term.
+ </p>
+ <p>FreqDelta
+ determines the position of this term's TermFreqs within the .frq
+ file. In particular, it is the difference between the position of
+ this term's data in that file and the position of the previous
+ term's data (or zero, for the first term in the file).
+ </p>
+ <p>ProxDelta
+ determines the position of this term's TermPositions within the .prx
+ file. In particular, it is the difference between the position of
+ this term's data in that file and the position of the previous
+ term's data (or zero, for the first term in the file). For fields
+ that omit position data, this will be 0 since
+ prox information is not stored.
+ </p>
+ <p>SkipDelta determines the position of this
+ term's SkipData within the .frq file. In
+ particular, it is the number of bytes
+ after TermFreqs that the SkipData starts.
+ In other words, it is the length of the
+ TermFreqs data. SkipDelta is only stored
+ if DocFreq is not smaller than SkipInterval.
+ </p>
+ </li>
+ <li>
+ <p><a name="tii"/>
+ The term info index, or .tii file.
+ </p>
+
+ <p>
+ This contains every IndexInterval
+ <sup>th</sup>
+ entry from the .tis
+ file, along with its location in the .tis file. This is
+ designed to be read entirely into memory and used to provide random
+ access to the .tis file.
+ </p>
+
+ <p>
+ The structure of this file is very similar to the
+ .tis file, with the addition of one item per record, the IndexDelta.
+ </p>
+
+ <p>
+ TermInfoIndex (.tii)-->
+ TIVersion, IndexTermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermIndices
+ </p>
+ <p>TIVersion -->
+ UInt32
+ </p>
+ <p>IndexTermCount -->
+ UInt64
+ </p>
+ <p>IndexInterval -->
+ UInt32
+ </p>
+ <p>SkipInterval -->
+ UInt32
+ </p>
+ <p>TermIndices -->
+ <TermInfo, IndexDelta>
+ <sup>IndexTermCount</sup>
+ </p>
+ <p>IndexDelta -->
+ VLong
+ </p>
+ <p>IndexDelta
+ determines the position of this term's TermInfo within the .tis file. In
+ particular, it is the difference between the position of this term's
+ entry in that file and the position of the previous term's entry.
+ </p>
+ <p>SkipInterval is the fraction of TermDocs stored in skip tables. It is used to accelerate TermDocs.skipTo(int).
+ Larger values result in smaller indexes, greater acceleration, but fewer accelerable cases, while
+ smaller values result in bigger indexes, less acceleration (in case of a small value for MaxSkipLevels) and more
+ accelerable cases.</p>
+ <p>MaxSkipLevels is the maximum number of skip levels stored for each term in the .frq file. A low value results in
+ smaller indexes but less acceleration; a larger value results in slightly larger indexes but greater acceleration.
+ See the format of the .frq file for more information about skip levels.</p>
+ </li>
+ </ol>
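+ <p>Prefix sharing and delta-coded file pointers can be illustrated with the Python sketch below. It works on already-parsed records and keeps only PrefixLength, Suffix and FreqDelta from each TermInfo, a deliberate simplification. It also treats PrefixLength as a count of Python string characters, which matches the description above for simple text such as the "bone"/"boy" example. The function name is hypothetical.</p>

```python
def decode_terms(records):
    """Rebuild term texts and .frq pointers from decoded TermInfo records.

    records: iterable of (prefix_length, suffix, freq_delta) tuples, a
    simplified, already-parsed view of the TermInfo entries.
    Returns a list of (term_text, freq_pointer) pairs.
    """
    terms = []
    previous_text = ""
    freq_pointer = 0
    for prefix_length, suffix, freq_delta in records:
        # The shared prefix comes from the previous term's text.
        text = previous_text[:prefix_length] + suffix
        # Pointers are stored as differences from the previous term.
        freq_pointer += freq_delta
        terms.append((text, freq_pointer))
        previous_text = text
    return terms
```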
+ </section>
+
+ <section id="Frequencies"><title>Frequencies</title>
+
+ <p>
+ The .frq file contains the lists of documents
+ which contain each term, along with the frequency of the term in that
+ document (except when frequencies are omitted: IndexOptions.DOCS_ONLY).
+ </p>
+ <p>FreqFile (.frq) -->
+ <TermFreqs, SkipData>
+ <sup>TermCount</sup>
+ </p>
+ <p>TermFreqs -->
+ <TermFreq>
+ <sup>DocFreq</sup>
+ </p>
+ <p>TermFreq -->
+ DocDelta[, Freq?]
+ </p>
+ <p>SkipData -->
+ <<SkipLevelLength, SkipLevel>
+ <sup>NumSkipLevels-1</sup>, SkipLevel>
+ <SkipDatum>
+ </p>
+ <p>SkipLevel -->
+ <SkipDatum>
+ <sup>DocFreq/(SkipInterval^(Level + 1))</sup>
+ </p>
+ <p>SkipDatum -->
+ DocSkip,PayloadLength?,FreqSkip,ProxSkip,SkipChildLevelPointer?
+ </p>
+ <p>DocDelta,Freq,DocSkip,PayloadLength,FreqSkip,ProxSkip -->
+ VInt
+ </p>
+ <p>SkipChildLevelPointer -->
+ VLong
+ </p>
+ <p>TermFreqs
+ are ordered by term (the term is implicit, from the .tis file).
+ </p>
+ <p>TermFreq
+ entries are ordered by increasing document number.
+ </p>
+ <p>DocDelta: if frequencies are indexed, this determines both
+ the document number and the frequency. In
+ particular, DocDelta/2 is the difference between
+ this document number and the previous document
+ number (or zero when this is the first document in
+ a TermFreqs). When DocDelta is odd, the frequency
+ is one. When DocDelta is even, the frequency is
+ read as another VInt. If frequencies are omitted, DocDelta
+ contains the gap (not multiplied by 2) between
+ document numbers and no frequency information is
+ stored.
+ </p>
+ <p>For example, the TermFreqs for a term which occurs
+ once in document seven and three times in document
+ eleven, with frequencies indexed, would be the following
+ sequence of VInts:
+ </p>
+ <p>15, 8, 3
+ </p>
+ <p> If frequencies were omitted (IndexOptions.DOCS_ONLY) it would be this sequence
+ of VInts instead:
+ </p>
+ <p>
+ 7,4
+ </p>
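+ <p>The DocDelta scheme can be sketched in Python as follows. The code produces and consumes plain lists of integers standing in for the VInt sequence; byte-level VInt encoding is left out. Function names are hypothetical.</p>

```python
def encode_term_freqs(postings, omit_freqs=False):
    """Encode (doc_number, freq) pairs as the list of VInt values.

    postings must be sorted by increasing document number.
    """
    values = []
    previous_doc = 0
    for doc, freq in postings:
        gap = doc - previous_doc
        previous_doc = doc
        if omit_freqs:
            values.append(gap)          # plain gap, no frequency
        elif freq == 1:
            values.append(gap * 2 + 1)  # odd DocDelta: frequency is one
        else:
            values.append(gap * 2)      # even DocDelta: Freq follows
            values.append(freq)
    return values


def decode_term_freqs(values, doc_freq, omit_freqs=False):
    """Inverse of encode_term_freqs for a term found in doc_freq documents."""
    postings = []
    doc = 0
    it = iter(values)
    for _ in range(doc_freq):
        doc_delta = next(it)
        if omit_freqs:
            doc += doc_delta
            postings.append((doc, None))
        else:
            doc += doc_delta // 2
            freq = 1 if doc_delta & 1 else next(it)
            postings.append((doc, freq))
    return postings
```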
+ <p>DocSkip records the document number before every
+ SkipInterval
+ <sup>th</sup>
+ document in TermFreqs.
+ If payloads are disabled for the term's field,
+ then DocSkip represents the difference from the
+ previous value in the sequence.
+ If payloads are enabled for the term's field,
+ then DocSkip/2 represents the difference from the
+ previous value in the sequence. If payloads are enabled
+ and DocSkip is odd,
+ then PayloadLength is stored indicating the length
+ of the last payload before the SkipInterval<sup>th</sup>
+ document in TermPositions.
+ FreqSkip and ProxSkip record the position of every
+ SkipInterval
+ <sup>th</sup>
+ entry in FreqFile and
+ ProxFile, respectively. File positions are
+ relative to the start of TermFreqs and Positions for the first SkipDatum,
+ and to the previous SkipDatum in the sequence thereafter.
+ </p>
+ <p>For example, if DocFreq=35 and SkipInterval=16,
+ then there are two SkipData entries, containing
+ the 15
+ <sup>th</sup>
+ and 31
+ <sup>st</sup>
+ document
+ numbers in TermFreqs. The first FreqSkip names
+ the number of bytes after the beginning of
+ TermFreqs that the 16
+ <sup>th</sup>
+ SkipDatum
+ starts, and the second the number of bytes after
+ that that the 32
+ <sup>nd</sup>
+ starts. The first
+ ProxSkip names the number of bytes after the
+ beginning of Positions that the 16
+ <sup>th</sup>
+ SkipDatum starts, and the second the number of
+ bytes after that that the 32
+ <sup>nd</sup>
+ starts.
+ </p>
+ <p>Each term can have multiple skip levels.
+ The number of skip levels for a term is NumSkipLevels = Min(MaxSkipLevels, floor(log(DocFreq)/log(SkipInterval))).
+ The number of SkipData entries for a skip level is DocFreq/(SkipInterval^(Level + 1)), where the lowest skip
+ level is Level=0. <br></br>
+ Example: SkipInterval = 4, MaxSkipLevels = 2, DocFreq = 35. Then skip level 0 has 8 SkipData entries,
+ containing the 3<sup>rd</sup>, 7<sup>th</sup>, 11<sup>th</sup>, 15<sup>th</sup>, 19<sup>th</sup>, 23<sup>rd</sup>,
+ 27<sup>th</sup>, and 31<sup>st</sup> document numbers in TermFreqs. Skip level 1 has 2 SkipData entries, containing the
+ 15<sup>th</sup> and 31<sup>st</sup> document numbers in TermFreqs. <br></br>
+ The SkipData entries on all upper levels > 0 contain a SkipChildLevelPointer referencing the corresponding SkipData
+ entry on the next lower level (Level - 1). In the example, entry 15 on level 1 has a pointer to entry 15 on level 0, and entry 31 on level 1 has a pointer
+ to entry 31 on level 0.
+ </p>
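+ <p>The skip-level arithmetic can be checked with the Python sketch below. It reads the level count as Min(MaxSkipLevels, floor(log(DocFreq) / log(SkipInterval))), the reading that reproduces the example above, and computes the logarithm with integer arithmetic. It assumes SkipInterval is at least 2. Function names are hypothetical.</p>

```python
def num_skip_levels(doc_freq, skip_interval, max_skip_levels):
    """Min(MaxSkipLevels, floor(log(DocFreq) / log(SkipInterval))).

    Uses repeated multiplication instead of floating-point logarithms
    to avoid rounding at exact powers; requires skip_interval >= 2.
    """
    levels = 0
    power = skip_interval
    while power <= doc_freq:
        levels += 1
        power *= skip_interval
    return min(max_skip_levels, levels)


def skip_entry_documents(doc_freq, skip_interval, max_skip_levels):
    """For each level (lowest first), list which documents in TermFreqs
    its skip entries record, numbered as in the example in the text."""
    levels = []
    for level in range(num_skip_levels(doc_freq, skip_interval, max_skip_levels)):
        step = skip_interval ** (level + 1)
        count = doc_freq // step  # DocFreq / SkipInterval^(Level + 1)
        levels.append([k * step - 1 for k in range(1, count + 1)])
    return levels
```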
+
+ </section>
+ <section id="Positions"><title>Positions</title>
+
+ <p>
+ The .prx file contains the lists of positions that
+ each term occurs at within documents. Note that
+ fields omitting positional data do not store
+ anything into this file, and if all fields in the
+ index omit positional data then the .prx file will not
+ exist.
+ </p>
+ <p>ProxFile (.prx) -->
+ <TermPositions>
+ <sup>TermCount</sup>
+ </p>
+ <p>TermPositions -->
+ <Positions>
+ <sup>DocFreq</sup>
+ </p>
+ <p>Positions -->
+ <PositionDelta,Payload?>
+ <sup>Freq</sup>
+ </p>
+ <p>Payload -->
+ <PayloadLength?,PayloadData>
+ </p>
+ <p>PositionDelta -->
+ VInt
+ </p>
+ <p>PayloadLength -->
+ VInt
+ </p>
+ <p>PayloadData -->
+ byte<sup>PayloadLength</sup>
+ </p>
+ <p>TermPositions
+ are ordered by term (the term is implicit, from the .tis file).
+ </p>
+ <p>Positions
+ entries are ordered by increasing document number (the document
+ number is implicit from the .frq file).
+ </p>
+ <p>PositionDelta
+ is, if payloads are disabled for the term's field, the difference
+ between the position of the current occurrence in
+ the document and the previous occurrence (or zero, if this is the
+ first occurrence in this document).
+ If payloads are enabled for the term's field, then PositionDelta/2
+ is the difference between the current and the previous position. If
+ payloads are enabled and PositionDelta is odd, then PayloadLength is
+ stored, indicating the length of the payload at the current term position.
+ </p>
+ <p>
+ For example, the TermPositions for a
+ term which occurs as the fourth term in one document, and as the
+ fifth and ninth term in a subsequent document, would be the following
+ sequence of VInts (payloads disabled):
+ </p>
+ <p>4,
+ 5, 4
+ </p>
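+ <p>For the payloads-disabled case, the PositionDelta scheme can be sketched in Python as shown here. Lists of integers stand in for the VInt sequence, and the per-document frequencies needed for decoding are passed in explicitly, since they come from the .frq file. Function names are hypothetical.</p>

```python
def encode_term_positions(docs_positions):
    """Encode per-document position lists as PositionDelta values.

    docs_positions: one sorted list of positions per document, in
    increasing document order. Payloads are assumed to be disabled.
    """
    values = []
    for positions in docs_positions:
        previous = 0  # deltas restart from zero for every document
        for position in positions:
            values.append(position - previous)
            previous = position
    return values


def decode_term_positions(values, freqs):
    """Rebuild the position lists; freqs holds each document's Freq."""
    it = iter(values)
    docs_positions = []
    for freq in freqs:
        position = 0
        positions = []
        for _ in range(freq):
            position += next(it)
            positions.append(position)
        docs_positions.append(positions)
    return docs_positions
```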
+ <p>PayloadData
+ is metadata associated with the current term position. If PayloadLength
+ is stored at the current position, then it indicates the length of this
+ Payload. If PayloadLength is not stored, then this Payload has the same
+ length as the Payload at the previous position.
+ </p>
+ </section>
+ <section id="Normalization Factors"><title>Normalization Factors</title>
+
+ <p>There's a single .nrm file containing all norms:
+ </p>
+ <p>AllNorms
+ (.nrm) --> NormsHeader,<Norms>
+ <sup>NumFieldsWithNorms</sup>
+ </p>
+ <p>Norms
+ --> <Byte>
+ <sup>SegSize</sup>
+ </p>
+ <p>NormsHeader
+ --> 'N','R','M',Version
+ </p>
+ <p>Version
+ --> Byte
+ </p>
+ <p>NormsHeader
+ has 4 bytes, last of which is the format version for this file, currently -1.
+ </p>
+ <p>Each
+ byte encodes a floating point value. Bits 0-2 contain the 3-bit
+ mantissa, and bits 3-7 contain the 5-bit exponent.
+ </p>
+ <p>These
+ are converted to an IEEE single float value as follows:
+ </p>
+ <ol>
+ <li>
+ <p>If
+ the byte is zero, use a zero float.
+ </p>
+ </li>
+ <li>
+ <p>Otherwise,
+ set the sign bit of the float to zero;
+ </p>
+ </li>
+ <li>
+ <p>add
+ 48 to the exponent and use this as the float's exponent;
+ </p>
+ </li>
+ <li>
+ <p>map
+ the mantissa to the high-order 3 bits of the float's mantissa; and
+
+ </p>
+ </li>
+ <li>
+ <p>set
+ the low-order 21 bits of the float's mantissa to zero.
+ </p>
+ </li>
+ </ol>
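+ <p>The conversion steps can be sketched in Python as follows. The sketch treats the low three bits of the byte as the mantissa and the upper five bits as the exponent. It makes one labeled assumption about bit placement: the biased exponent is stored directly above the three mantissa bits, so that exactly the low-order 21 bits of the 32-bit pattern stay zero. The function name is hypothetical.</p>

```python
import struct


def decode_norm(b):
    """Convert one norm byte (0..255) to a Python float."""
    if b == 0:
        return 0.0
    mantissa = b & 0x07          # low three bits of the byte
    exponent = (b >> 3) & 0x1F   # upper five bits of the byte
    # Assumption: the biased exponent sits directly above the 3 mantissa
    # bits, leaving the low-order 21 bits zero and the sign bit clear.
    bits = ((exponent + 48) << 24) | (mantissa << 21)
    # Reinterpret the 32-bit pattern as an IEEE single float.
    return struct.unpack(">f", struct.pack(">I", bits))[0]
```

+ <p>Under that assumption a byte value of 124 (exponent 15, mantissa 4) decodes to exactly 1.0.</p>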
+ <p>A separate norm file is created when the norm values of an existing segment are modified.
+ When field <em>N</em> is modified, a separate norm file <em>.sN</em>
+ is created, to maintain the norm values for that field.
+ </p>
+ <p>Separate norm files are created (when needed) for both compound and non-compound segments.
+ </p>
+
+ </section>
+ <section id="Term Vectors"><title>Term Vectors</title>
+ <p>
+ Term Vector support is optional on a field-by-field
+ basis. It consists of three files.
+ </p>
+ <ol>
+ <li><a name="tvx"/>
+ <p>The Document Index or .tvx file.</p>
+ <p>For each document, this stores the offset
+ into the document data (.tvd) and field
+ data (.tvf) files.
+ </p>
+ <p>DocumentIndex (.tvx) --> TVXVersion<DocumentPosition,FieldPosition>
+ <sup>NumDocs</sup>
+ </p>
+ <p>TVXVersion --> Int (TermVectorsReader.FORMAT_CURRENT)</p>
+ <p>DocumentPosition --> UInt64 (offset in
+ the .tvd file)</p>
+ <p>FieldPosition --> UInt64 (offset in the
+ .tvf file)</p>
+ </li>
+ <li><a name="tvd"/>
+ <p>The Document or .tvd file.</p>
+ <p>This contains, for each document, the number of fields, a list of the fields with
+ term vector info and finally a list of pointers to the field information in the .tvf
+ (Term Vector Fields) file.</p>
+ <p>
+ Document (.tvd) --> TVDVersion<NumFields, FieldNums, FieldPositions>
+ <sup>NumDocs</sup>
+ </p>
+ <p>TVDVersion --> Int (TermVectorsReader.FORMAT_CURRENT)</p>
+ <p>NumFields --> VInt</p>
+ <p>FieldNums --> <FieldNumDelta>
+ <sup>NumFields</sup>
+ </p>
+ <p>FieldNumDelta --> VInt</p>
+ <p>FieldPositions --> <FieldPositionDelta>
+ <sup>NumFields-1</sup>
+ </p>
+ <p>FieldPositionDelta --> VLong</p>
+ <p>The .tvd file is used to map out the fields that have term vectors stored and
+ where the field information is in the .tvf file.</p>
+ </li>
+ <li><a name="tvf"/>
+ <p>The Field or .tvf file.</p>
+ <p>This file contains, for each field that has a term vector stored, a list of
+ the terms, their frequencies and, optionally, position and offset information.</p>
+ <p>Field (.tvf) --> TVFVersion<NumTerms, Position/Offset, TermFreqs>
+ <sup>NumFields</sup>
+ </p>
+ <p>TVFVersion --> Int (TermVectorsReader.FORMAT_CURRENT)</p>
+ <p>NumTerms --> VInt</p>
+ <p>Position/Offset --> Byte</p>
+ <p>TermFreqs --> <TermText, TermFreq, Positions?, Offsets?>
+ <sup>NumTerms</sup>
+ </p>
+ <p>TermText --> <PrefixLength, Suffix></p>
+ <p>PrefixLength --> VInt</p>
+ <p>Suffix --> String</p>
+ <p>TermFreq --> VInt</p>
+ <p>Positions --> <VInt><sup>TermFreq</sup></p>
+ <p>Offsets --> <VInt, VInt><sup>TermFreq</sup></p>
+ <br/>
+ <p>Notes:</p>
+ <ul>
+ <li>Position/Offset byte stores whether this term vector has position or offset information stored.</li>
+ <li>Term
+ text prefixes are shared. The PrefixLength is the number of initial
+ characters from the previous term which must be pre-pended to a
+ term's suffix in order to form the term's text. Thus, if the
+ previous term's text was "bone" and the term is "boy",
+ the PrefixLength is two and the suffix is "y".
+ </li>
+ <li>Positions are stored as delta encoded VInts. This means that only the difference between the current position and the previous position is stored.</li>
+ <li>Offsets are stored as delta encoded VInts. The first VInt is the startOffset, the second is the endOffset.</li>
+ </ul>
+
+
+ </li>
+ </ol>
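+ <p>Because the .tvx file holds fixed-length data after its version header, a document's entry can be located by arithmetic alone. The Python sketch below assumes a 4-byte TVXVersion followed by one (DocumentPosition, FieldPosition) pair of 8-byte values per document, all stored high-order byte first. The function name is hypothetical.</p>

```python
import struct


def tvx_entry(tvx_bytes, n):
    """Return (DocumentPosition, FieldPosition) for document n.

    Assumes a 4-byte TVXVersion header followed by one pair of 8-byte
    values per document, all stored high-order byte first.
    """
    return struct.unpack_from(">QQ", tvx_bytes, 4 + n * 16)
```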
+ </section>
+
+ <section id="Deleted Documents"><title>Deleted Documents</title>
+
+ <p>The .del file is
+ optional, and only exists when a segment contains deletions.
+ </p>
+
+ <p>Although per-segment, this file is maintained exterior to compound segment files.
+ </p>
+ <p>
+ Deletions
+ (.del) --> [Format],ByteCount,BitCount, Bits | DGaps (depending on Format)
+ </p>
+
+ <p>Format,ByteCount,BitCount -->
+ Uint32
+ </p>
+
+ <p>Bits -->
+ <Byte>
+ <sup>ByteCount</sup>
+ </p>
+
+ <p>DGaps -->
+ <DGap,NonzeroByte>
+ <sup>NonzeroBytesCount</sup>
+ </p>
+
+ <p>DGap -->
+ VInt
+ </p>
+
+ <p>NonzeroByte -->
+ Byte
+ </p>
+
+ <p>Format
+ is optional. A value of -1 indicates DGaps; a non-negative value indicates Bits, in which case Format is omitted.
+ </p>
+
+ <p>ByteCount
+ indicates the number of bytes in Bits. It is typically
+ (SegSize/8)+1.
+ </p>
+
+ <p>
+ BitCount
+ indicates the number of bits that are currently set in Bits.
+ </p>
+
+ <p>Bits
+ contains one bit for each document indexed. When the bit
+ corresponding to a document number is set, that document is marked as
+ deleted. Bit ordering is from least to most significant. Thus, if
+ Bits contains two bytes, 0x00 and 0x02, then document 9 is marked as
+ deleted.
+ </p>
+
+ <p>DGaps
+ represents sparse bit-vectors more efficiently than Bits.
+ It is made of DGaps on indexes of nonzero bytes in Bits,
+ and the nonzero bytes themselves. The number of nonzero bytes
+ in Bits (NonzeroBytesCount) is not stored.
+ </p>
+ <p>For example,
+ if there are 8000 bits and only bits 10,12,32 are set,
+ DGaps would be used:
+ </p>
+ <p>
+ (VInt) 1 , (Byte) 20 , (VInt) 3 , (Byte) 1
+ </p>
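+ <p>Both representations can be illustrated with a short Python sketch: one helper derives the (DGap, NonzeroByte) pairs from a set of deleted document numbers, the other tests a document's bit in the plain Bits form. Function names are hypothetical, and the header fields (Format, ByteCount, BitCount) are left out.</p>

```python
def encode_dgaps(set_bits):
    """Return [(DGap, NonzeroByte), ...] for the given set bit numbers."""
    nonzero = {}
    for bit in set_bits:
        # Bit ordering is from least to most significant within a byte.
        nonzero[bit // 8] = nonzero.get(bit // 8, 0) | (1 << (bit % 8))
    pairs = []
    previous_index = 0
    for index in sorted(nonzero):
        # DGap is the distance from the previous nonzero byte's index.
        pairs.append((index - previous_index, nonzero[index]))
        previous_index = index
    return pairs


def is_deleted(bits, doc):
    """Test a document's bit in the plain Bits representation."""
    return bool(bits[doc // 8] & (1 << (doc % 8)))
```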
+ </section>
+ </section>
+
+ <section id="Limitations"><title>Limitations</title>
+
+ <p>
+ When referring to term numbers, Lucene's current
+ implementation uses a Java <code>int</code> to hold the
+ term index, which means the maximum number of unique
+ terms in any single index segment is ~2.1 billion times
+ the term index interval (default 128) = ~274 billion.
+ This is technically not a limitation of the index file
+ format, just of Lucene's current implementation.
+ </p>
+ <p>
+ Similarly, Lucene uses a Java <code>int</code> to refer
+ to document numbers, and the index file format uses an
+ <code>Int32</code> on-disk to store document numbers.
+ This is a limitation of both the index file format and
+ the current implementation. Eventually these should be
+ replaced with either <code>UInt64</code> values, or
+ better yet, <code>VInt</code> values which have no
+ limit.
+ </p>
+
+ </section>
+
+ </body>
+
+</document>