+++ /dev/null
-<?xml version="1.0"?>
-
-<document>
- <header>
- <title>
- Apache Lucene - Index File Formats
- </title>
- </header>
-
- <body>
- <section id="Index File Formats"><title>Index File Formats</title>
-
- <p>
- This document defines the index file formats used
- in this version of Lucene. If you are using a different
- version of Lucene, please consult the copy of
- <code>docs/fileformats.html</code>
- that was distributed
- with the version you are using.
- </p>
-
- <p>
- Apache Lucene is written in Java, but several
- efforts are underway to write
- <a href="http://wiki.apache.org/lucene-java/LuceneImplementations">versions
- of Lucene in other programming
- languages</a>. If these versions are to remain compatible with Apache
- Lucene, then a language-independent definition of the Lucene index
- format is required. This document thus attempts to provide a
- complete and independent definition of the Apache Lucene file
- formats.
- </p>
-
- <p>
- As Lucene evolves, this document should evolve.
- Versions of Lucene in different programming languages should endeavor
- to agree on file formats, and generate new versions of this document.
- </p>
-
- <p>
- Compatibility notes are provided in this document,
- describing how file formats have changed from prior versions.
- </p>
-
- <p>
- In version 2.1, the file format was changed to allow
- lock-less commits (i.e., no more commit lock). The
- change is fully backwards compatible: you can open a
- pre-2.1 index for searching or adding/deleting of
- docs. When the new segments file is saved
- (committed), it will be written in the new file format
- (meaning no specific "upgrade" process is needed).
- But note that once a commit has occurred, pre-2.1
- Lucene will not be able to read the index.
- </p>
-
- <p>
- In version 2.3, the file format was changed to allow
- segments to share a single set of doc store (term vectors and
- stored fields) files. This allows for faster indexing
- in certain cases. The change is fully backwards
- compatible (in the same way as the lock-less commits
- change in 2.1).
- </p>
-
- <p>
- In version 2.4, Strings are now written as true UTF-8
- byte sequences, not Java's modified UTF-8. See issue
- LUCENE-510 for details.
- </p>
-
- <p>
- In version 2.9, an optional opaque Map<String,String>
- CommitUserData may be passed to IndexWriter's commit
- methods (and later retrieved), which is recorded in
- the segments_N file. See issue LUCENE-1382 for
- details. Also, diagnostics were added to each segment
- written recording details about why it was written
- (due to flush, merge; which OS/JRE was used; etc.).
- See issue LUCENE-1654 for details.
- </p>
-
- <p>
- In version 3.0, compressed fields are no longer
- written to the index (they can still be read, but on
- merge the new segment will write them,
- uncompressed). See issue LUCENE-1960 for details.
- </p>
-
- <p>
- In version 3.1, segments record the code version
- that created them. See LUCENE-2720 for details.
-
- Additionally, segments explicitly track whether or
- not they have term vectors. See LUCENE-2811 for details.
- </p>
- <p>
- In version 3.2, numeric fields are written natively
- to the stored fields file; previously they were stored in
- text format only.
- </p>
- <p>
- In version 3.4, fields can omit position data while
- still indexing term frequencies.
- </p>
- </section>
-
- <section id="Definitions"><title>Definitions</title>
-
- <p>
- The fundamental concepts in Lucene are index,
- document, field and term.
- </p>
-
-
- <p>
- An index contains a sequence of documents.
- </p>
-
- <ul>
- <li>
- <p>
- A document is a sequence of fields.
- </p>
- </li>
-
- <li>
- <p>
- A field is a named sequence of terms.
- </p>
- </li>
-
- <li>
- A term is a string.
- </li>
- </ul>
-
- <p>
- The same string in two different fields is
- considered a different term. Thus terms are represented as a pair of
- strings, the first naming the field, and the second naming text
- within the field.
- </p>
-
- <section id="Inverted Indexing"><title>Inverted Indexing</title>
-
- <p>
- The index stores statistics about terms in order
- to make term-based search more efficient. Lucene's
- index falls into the family of indexes known as an <i>inverted
- index.</i> This is because it can list, for a term, the documents that contain
- it. This is the inverse of the natural relationship, in which
- documents list terms.
- </p>
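- <p>
- As a rough illustration of the idea (not Lucene's implementation), the sketch
- below inverts a few documents in memory. The class name
- <code>TinyInvertedIndex</code>, its methods, and the whitespace tokenization
- are all assumptions made for this example. Terms are keyed by field first,
- so the same string in two fields is kept as two different terms.
- </p>

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch; Lucene's on-disk structures are described later.
public class TinyInvertedIndex {
    // field name -> term text -> numbers of the documents containing the term
    private final Map<String, Map<String, List<Integer>>> postings = new TreeMap<>();

    // Tokenizes on whitespace (an assumption) and records docNumber for each term.
    // Callers are expected to add documents in increasing docNumber order.
    public void add(int docNumber, String field, String text) {
        Map<String, List<Integer>> terms =
            postings.computeIfAbsent(field, f -> new TreeMap<>());
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            List<Integer> docs = terms.computeIfAbsent(token, t -> new ArrayList<>());
            // Record each document once per term, even if the term repeats.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docNumber) {
                docs.add(docNumber);
            }
        }
    }

    // Lists, for a (field, text) term, the documents that contain it.
    public List<Integer> docs(String field, String term) {
        Map<String, List<Integer>> terms = postings.get(field);
        if (terms == null) {
            return Collections.<Integer>emptyList();
        }
        List<Integer> docs = terms.get(term);
        return docs == null ? Collections.<Integer>emptyList() : docs;
    }
}
```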
- </section>
- <section id="Types of Fields">
- <title>Types of Fields</title>
- <p>
- In Lucene, fields may be <i>stored</i>, in which
- case their text is stored in the index literally, in a non-inverted
- manner. Fields that are inverted are called <i>indexed</i>. A field
- may be both stored and indexed.</p>
-
- <p>The text of a field may be <i>tokenized</i> into terms to be
- indexed, or the text of a field may be used literally as a term to be indexed.
- Most fields are
- tokenized, but sometimes it is useful for certain identifier fields
- to be indexed literally.
- </p>
- <p>See the <a href="api/core/org/apache/lucene/document/Field.html">Field</a> Javadocs for more information on Fields.</p>
- </section>
-
- <section id="Segments"><title>Segments</title>
-
- <p>
- Lucene indexes may be composed of multiple sub-indexes, or
- <i>segments</i>. Each segment is a fully independent index, which could be searched
- separately. Indexes evolve by:
- </p>
-
- <ol>
- <li>
- <p>Creating new segments for newly added documents.</p>
- </li>
- <li>
- <p>Merging existing segments.</p>
- </li>
- </ol>
-
- <p>
- Searches may involve multiple segments and/or multiple indexes, each
- index potentially composed of a set of segments.
- </p>
- </section>
-
- <section id="Document Numbers"><title>Document Numbers</title>
-
- <p>
- Internally, Lucene refers to documents by an integer <i>document
- number</i>. The first document added to an index is numbered zero, and each
- subsequent document added gets a number one greater than the previous.
- </p>
-
-
- <p>
- Note that a document's number may change, so caution should be taken
- when storing these numbers outside of Lucene. In particular, numbers may
- change in the following situations:
- </p>
-
-
- <ul>
- <li>
- <p>
- The
- numbers stored in each segment are unique only within the segment,
- and must be converted before they can be used in a larger context.
- The standard technique is to allocate each segment a range of
- values, based on the range of numbers used in that segment. To
- convert a document number from a segment to an external value, the
- segment's <i>base</i> document
- number is added. To convert an external value back to a
- segment-specific value, the segment is identified by the range that
- the external value is in, and the segment's base value is
- subtracted. For example, two five-document segments might be
- combined, so that the first segment has a base value of zero, and
- the second of five. Document three from the second segment would
- have an external value of eight.
- </p>
- </li>
- <li>
- <p>
- When documents are deleted, gaps are created
- in the numbering. These are eventually removed as the index evolves
- through merging. Deleted documents are dropped when segments are
- merged. A freshly-merged segment thus has no gaps in its numbering.
- </p>
- </li>
- </ul>
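- <p>
- The base-value conversion described above can be sketched as follows. The
- class <code>DocBases</code> and its method names are hypothetical and exist
- only for this example; the sketch assumes segments are numbered in order and
- does no range checking.
- </p>

```java
// Hypothetical helper: maps segment-relative document numbers to external ones and back.
public class DocBases {
    // bases[i] is the external number of the first document in segment i.
    private final int[] bases;

    public DocBases(int[] segmentSizes) {
        bases = new int[segmentSizes.length];
        int next = 0;
        for (int i = 0; i < segmentSizes.length; i++) {
            bases[i] = next;
            next += segmentSizes[i];
        }
    }

    // Segment-relative number -> external number: add the segment's base.
    public int toExternal(int segment, int localDoc) {
        return bases[segment] + localDoc;
    }

    // Finds the segment whose range contains the external number
    // (the last segment whose base is not greater than it).
    public int segmentOf(int externalDoc) {
        for (int i = bases.length - 1; i >= 0; i--) {
            if (bases[i] <= externalDoc) {
                return i;
            }
        }
        throw new IllegalArgumentException("negative document number: " + externalDoc);
    }

    // External number -> segment-relative number: subtract the segment's base.
    public int toLocal(int externalDoc) {
        return externalDoc - bases[segmentOf(externalDoc)];
    }
}
```

- <p>
- With two five-document segments, <code>toExternal(1, 3)</code> yields eight,
- matching the example in the list above.
- </p>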
-
- </section>
-
- </section>
-
- <section id="Overview"><title>Overview</title>
-
- <p>
- Each segment index maintains the following:
- </p>
- <ul>
- <li>
- <p>Field names. This
- contains the set of field names used in the index.
-
- </p>
- </li>
- <li>
- <p>Stored Field
- values. This contains, for each document, a list of attribute-value
- pairs, where the attributes are field names. These are used to
- store auxiliary information about the document, such as its title,
- url, or an identifier to access a
- database. The set of stored fields is what is returned for each hit
- when searching. This is keyed by document number.
- </p>
- </li>
- <li>
- <p>Term dictionary.
- A dictionary containing all of the terms used in all of the indexed
- fields of all of the documents. The dictionary also contains the
- number of documents which contain the term, and pointers to the
- term's frequency and proximity data.
- </p>
- </li>
-
- <li>
- <p>Term Frequency
- data. For each term in the dictionary, the numbers of all the
- documents that contain that term, and the frequency of the term in
- that document, unless frequencies are omitted (IndexOptions.DOCS_ONLY).
- </p>
- </li>
-
- <li>
- <p>Term Proximity
- data. For each term in the dictionary, the positions that the term
- occurs in each document. Note that this will
- not exist if all fields in all documents omit position data.
- </p>
- </li>
-
- <li>
- <p>Normalization
- factors. For each field in each document, a value is stored that is
- multiplied into the score for hits on that field.
- </p>
- </li>
- <li>
- <p>Term Vectors. For each field in each document, the term vector
- (sometimes called document vector) may be stored. A term vector consists
- of term text and term frequency. To add Term Vectors to your index see the
- <a href="api/core/org/apache/lucene/document/Field.html">Field</a>
- constructors.
- </p>
- </li>
- <li>
- <p>Deleted documents.
- An optional file indicating which documents are deleted.
- </p>
- </li>
- </ul>
-
- <p>Details on each of these are provided in subsequent sections.
- </p>
- </section>
-
- <section id="File Naming"><title>File Naming</title>
-
- <p>
- All files belonging to a segment have the same name with varying
- extensions. The extensions correspond to the different file formats
- described below. When using the Compound File format (default in 1.4 and greater) these files are
- collapsed into a single .cfs file (see below for details).
- </p>
-
- <p>
- Typically, all segments
- in an index are stored in a single directory, although this is not
- required.
- </p>
-
- <p>
- As of version 2.1 (lock-less commits), file names are
- never re-used (there is one exception, "segments.gen",
- see below). That is, when any file is saved to the
- Directory it is given a never before used filename.
- This is achieved using a simple generations approach.
- For example, the first segments file is segments_1,
- then segments_2, etc. The generation is a sequential
- long integer represented in alpha-numeric (base 36)
- form.
- </p>
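- <p>
- A minimal sketch of the generation naming scheme, assuming the base 36 form
- uses the digits 0-9 followed by lowercase a-z (which is what Java's
- <code>Long.toString</code> produces). The class and method names are
- hypothetical.
- </p>

```java
// Hypothetical helper illustrating generation-based file names.
public class GenerationNames {
    // Character.MAX_RADIX is 36, so the generation is rendered in base 36.
    public static String segmentsFileName(long generation) {
        return "segments_" + Long.toString(generation, Character.MAX_RADIX);
    }
}
```

- <p>
- Under that assumption, generation 35 maps to <code>segments_z</code> and
- generation 36 to <code>segments_10</code>.
- </p>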
-
- </section>
- <section id="file-names"><title>Summary of File Extensions</title>
- <p>The following table summarizes the names and extensions of the files in Lucene:
- <table>
- <tr>
- <th>Name</th>
- <th>Extension</th>
- <th>Brief Description</th>
- </tr>
- <tr>
- <td><a href="#Segments File">Segments File</a></td>
- <td>segments.gen, segments_N</td>
- <td>Stores information about segments</td>
- </tr>
- <tr>
- <td><a href="#Lock File">Lock File</a></td>
- <td>write.lock</td>
- <td>The Write lock prevents multiple IndexWriters from writing to the same index.</td>
- </tr>
- <tr>
- <td><a href="#Compound Files">Compound File</a></td>
- <td>.cfs</td>
- <td>An optional "virtual" file consisting of all the other index files for systems
- that frequently run out of file handles.</td>
- </tr>
- <tr>
- <td><a href="#Compound Files">Compound File Entry table</a></td>
- <td>.cfe</td>
- <td>The "virtual" compound file's entry table holding all entries in the corresponding .cfs file (Since 3.4)</td>
- </tr>
- <tr>
- <td><a href="#Fields">Fields</a></td>
- <td>.fnm</td>
- <td>Stores information about the fields</td>
- </tr>
- <tr>
- <td><a href="#field_index">Field Index</a></td>
- <td>.fdx</td>
- <td>Contains pointers to field data</td>
- </tr>
- <tr>
- <td><a href="#field_data">Field Data</a></td>
- <td>.fdt</td>
- <td>The stored fields for documents</td>
- </tr>
- <tr>
- <td><a href="#tis">Term Infos</a></td>
- <td>.tis</td>
- <td>Part of the term dictionary, stores term info</td>
- </tr>
- <tr>
- <td><a href="#tii">Term Info Index</a></td>
- <td>.tii</td>
- <td>The index into the Term Infos file</td>
- </tr>
- <tr>
- <td><a href="#Frequencies">Frequencies</a></td>
- <td>.frq</td>
- <td>Contains the list of docs which contain each term along with frequency</td>
- </tr>
- <tr>
- <td><a href="#Positions">Positions</a></td>
- <td>.prx</td>
- <td>Stores position information about where a term occurs in the index</td>
- </tr>
- <tr>
- <td><a href="#Normalization Factors">Norms</a></td>
- <td>.nrm</td>
- <td>Encodes length and boost factors for docs and fields</td>
- </tr>
- <tr>
- <td><a href="#tvx">Term Vector Index</a></td>
- <td>.tvx</td>
- <td>Stores offset into the document data file</td>
- </tr>
- <tr>
- <td><a href="#tvd">Term Vector Documents</a></td>
- <td>.tvd</td>
- <td>Contains information about each document that has term vectors</td>
- </tr>
- <tr>
- <td><a href="#tvf">Term Vector Fields</a></td>
- <td>.tvf</td>
- <td>The field level info about term vectors</td>
- </tr>
- <tr>
- <td><a href="#Deleted Documents">Deleted Documents</a></td>
- <td>.del</td>
- <td>Info about which documents are deleted</td>
- </tr>
- </table>
-
- </p>
- </section>
-
- <section id="Primitive Types"><title>Primitive Types</title>
-
- <section id="Byte"><title>Byte</title>
-
- <p>
- The most primitive type
- is an eight-bit byte. Files are accessed as sequences of bytes. All
- other data types are defined as sequences
- of bytes, so file formats are byte-order independent.
- </p>
-
- </section>
-
- <section id="UInt32"><title>UInt32</title>
-
- <p>
- 32-bit unsigned integers are written as four
- bytes, high-order bytes first.
- </p>
- <p>
- UInt32 --> <Byte><sup>4</sup>
- </p>
-
- </section>
-
- <section id="Uint64"><title>UInt64</title>
-
- <p>
- 64-bit unsigned integers are written as eight
- bytes, high-order bytes first.
- </p>
-
- <p>UInt64 --> <Byte><sup>8</sup>
- </p>
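- <p>
- The two fixed-width encodings can be sketched as below. The class
- <code>BigEndian</code> and its methods are hypothetical; a Java
- <code>long</code> stands in for the unsigned values, so a UInt32 is carried
- in the low 32 bits.
- </p>

```java
// Hypothetical helper: fixed-width integers, high-order bytes first.
public class BigEndian {
    // Writes the low 32 bits of value as four bytes, most significant first.
    public static byte[] writeUInt32(long value) {
        return new byte[] {
            (byte) (value >>> 24),
            (byte) (value >>> 16),
            (byte) (value >>> 8),
            (byte) value
        };
    }

    // Reads four bytes starting at off as an unsigned 32-bit value.
    public static long readUInt32(byte[] b, int off) {
        return ((b[off] & 0xFFL) << 24)
             | ((b[off + 1] & 0xFFL) << 16)
             | ((b[off + 2] & 0xFFL) << 8)
             | (b[off + 3] & 0xFFL);
    }

    // Writes all 64 bits of value as eight bytes, most significant first.
    public static byte[] writeUInt64(long value) {
        byte[] out = new byte[8];
        for (int i = 0; i < 8; i++) {
            out[i] = (byte) (value >>> (56 - 8 * i));
        }
        return out;
    }
}
```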
-
- </section>
-
- <section id="VInt"><title>VInt</title>
-
- <p>
- A variable-length format for non-negative integers is
- defined where the high-order bit of each byte indicates whether more
- bytes remain to be read. The low-order seven bits are appended as
- increasingly more significant bits in the resulting integer value.
- Thus values from zero to 127 may be stored in a single byte, values
- from 128 to 16,383 may be stored in two bytes, and so on.
- </p>
-
- <p>
- <b>VInt Encoding Example</b>
- </p>
-
- <table width="100%" border="0" cellpadding="4" cellspacing="0">
- <col width="64*"/>
- <col width="64*"/>
- <col width="64*"/>
- <col width="64*"/>
- <tr valign="TOP">
- <td width="25%">
- <p align="RIGHT">
- <b>Value</b>
- </p>
- </td>
- <td width="25%">
- <p align="RIGHT">
- <b>First byte</b>
- </p>
- </td>
- <td width="25%">
- <p align="RIGHT">
- <b>Second byte</b>
- </p>
- </td>
- <td width="25%">
- <p align="RIGHT">
- <b>Third byte</b>
- </p>
- </td>
- </tr>
- <tr valign="BOTTOM">
- <td width="25%" sdval="0" sdnum="1033;0;#,##0">
- <p align="RIGHT">0
- </p>
- </td>
- <td width="25%" sdval="0" sdnum="1033;0;00000000">
- <p class="western" align="RIGHT" style="margin-left: 0.11cm;
- margin-right: 0.01cm">
- 00000000
- </p>
- </td>
- <td width="25%" sdnum="1033;0;00000000">
- <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
- 0.01cm">
- <br/>
-
- </p>
- </td>
- <td width="25%" sdnum="1033;0;00000000">
- <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
- 0.01cm">
- <br/>
-
- </p>
- </td>
- </tr>
- <tr valign="BOTTOM">
- <td width="25%" sdval="1" sdnum="1033;0;#,##0">
- <p align="RIGHT">1
- </p>
- </td>
- <td width="25%" sdval="1" sdnum="1033;0;00000000">
- <p class="western" align="RIGHT" style="margin-left: 0.11cm;
- margin-right: 0.01cm">
- 00000001
- </p>
- </td>
- <td width="25%" sdnum="1033;0;00000000">
- <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
- 0.01cm">
- <br/>
-
- </p>
- </td>
- <td width="25%" sdnum="1033;0;00000000">
- <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
- 0.01cm">
- <br/>
-
- </p>
- </td>
- </tr>
- <tr valign="BOTTOM">
- <td width="25%" sdval="2" sdnum="1033;0;#,##0">
- <p align="RIGHT">2
- </p>
- </td>
- <td width="25%" sdval="10" sdnum="1033;0;00000000">
- <p class="western" align="RIGHT" style="margin-left: 0.11cm;
- margin-right: 0.01cm">
- 00000010
- </p>
- </td>
- <td width="25%" sdnum="1033;0;00000000">
- <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
- 0.01cm">
- <br/>
-
- </p>
- </td>
- <td width="25%" sdnum="1033;0;00000000">
- <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
- 0.01cm">
- <br/>
-
- </p>
- </td>
- </tr>
- <tr>
- <td width="25%" valign="TOP">
- <p align="RIGHT">...
- </p>
- </td>
- <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
- <p align="RIGHT" style="margin-left: 0.11cm; margin-right:
- 0.01cm">
- <br/>
-
- </p>
- </td>
- <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
- <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
- 0.01cm">
- <br/>
-
- </p>
- </td>
- <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
- <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
- 0.01cm">
- <br/>
-
- </p>
- </td>
- </tr>
- <tr valign="BOTTOM">
- <td width="25%" sdval="127" sdnum="1033;0;#,##0">
- <p align="RIGHT">127
- </p>
- </td>
- <td width="25%" sdval="1111111" sdnum="1033;0;00000000">
- <p class="western" align="RIGHT" style="margin-left: 0.11cm;
- margin-right: 0.01cm">
- 01111111
- </p>
- </td>
- <td width="25%" sdnum="1033;0;00000000">
- <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
- 0.01cm">
- <br/>
-
- </p>
- </td>
- <td width="25%" sdnum="1033;0;00000000">
- <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
- 0.01cm">
- <br/>
-
- </p>
- </td>
- </tr>
- <tr valign="BOTTOM">
- <td width="25%" sdval="128" sdnum="1033;0;#,##0">
- <p align="RIGHT">128
- </p>
- </td>
- <td width="25%" sdval="10000000" sdnum="1033;0;00000000">
- <p class="western" align="RIGHT" style="margin-left: 0.11cm;
- margin-right: 0.01cm">
- 10000000
- </p>
- </td>
- <td width="25%" sdval="1" sdnum="1033;0;00000000">
- <p class="western" align="RIGHT" style="margin-left: -0.07cm;
- margin-right: 0.01cm">
- 00000001
- </p>
- </td>
- <td width="25%" sdnum="1033;0;00000000">
- <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
- 0.01cm">
- <br/>
-
- </p>
- </td>
- </tr>
- <tr valign="BOTTOM">
- <td width="25%" sdval="129" sdnum="1033;0;#,##0">
- <p align="RIGHT">129
- </p>
- </td>
- <td width="25%" sdval="10000001" sdnum="1033;0;00000000">
- <p class="western" align="RIGHT" style="margin-left: 0.11cm;
- margin-right: 0.01cm">
- 10000001
- </p>
- </td>
- <td width="25%" sdval="1" sdnum="1033;0;00000000">
- <p class="western" align="RIGHT" style="margin-left: -0.07cm;
- margin-right: 0.01cm">
- 00000001
- </p>
- </td>
- <td width="25%" sdnum="1033;0;00000000">
- <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
- 0.01cm">
- <br/>
-
- </p>
- </td>
- </tr>
- <tr valign="BOTTOM">
- <td width="25%" sdval="130" sdnum="1033;0;#,##0">
- <p align="RIGHT">130
- </p>
- </td>
- <td width="25%" sdval="10000010" sdnum="1033;0;00000000">
- <p class="western" align="RIGHT" style="margin-left: 0.11cm;
- margin-right: 0.01cm">
- 10000010
- </p>
- </td>
- <td width="25%" sdval="1" sdnum="1033;0;00000000">
- <p class="western" align="RIGHT" style="margin-left: -0.07cm;
- margin-right: 0.01cm">
- 00000001
- </p>
- </td>
- <td width="25%" sdnum="1033;0;00000000">
- <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
- 0.01cm">
- <br/>
-
- </p>
- </td>
- </tr>
- <tr>
- <td width="25%" valign="TOP">
- <p align="RIGHT">...
- </p>
- </td>
- <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
- <p align="RIGHT" style="margin-left: 0.11cm; margin-right:
- 0.01cm">
- <br/>
-
- </p>
- </td>
- <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
- <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
- 0.01cm">
- <br/>
-
- </p>
- </td>
- <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
- <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
- 0.01cm">
- <br/>
-
- </p>
- </td>
- </tr>
- <tr valign="BOTTOM">
- <td width="25%" sdval="16383" sdnum="1033;0;#,##0">
- <p align="RIGHT">16,383
- </p>
- </td>
- <td width="25%" sdval="11111111" sdnum="1033;0;00000000">
- <p class="western" align="RIGHT" style="margin-left: 0.11cm;
- margin-right: 0.01cm">
- 11111111
- </p>
- </td>
- <td width="25%" sdval="1111111" sdnum="1033;0;00000000">
- <p class="western" align="RIGHT" style="margin-left: -0.07cm;
- margin-right: 0.01cm">
- 01111111
- </p>
- </td>
- <td width="25%" sdnum="1033;0;00000000">
- <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
- 0.01cm">
- <br/>
-
- </p>
- </td>
- </tr>
- <tr valign="BOTTOM">
- <td width="25%" sdval="16384" sdnum="1033;0;#,##0">
- <p align="RIGHT">16,384
- </p>
- </td>
- <td width="25%" sdval="10000000" sdnum="1033;0;00000000">
- <p class="western" align="RIGHT" style="margin-left: 0.11cm;
- margin-right: 0.01cm">
- 10000000
- </p>
- </td>
- <td width="25%" sdval="10000000" sdnum="1033;0;00000000">
- <p class="western" align="RIGHT" style="margin-left: -0.07cm;
- margin-right: 0.01cm">
- 10000000
- </p>
- </td>
- <td width="25%" sdval="1" sdnum="1033;0;00000000">
- <p class="western" align="RIGHT" style="margin-left: -0.47cm;
- margin-right: 0.01cm">
- 00000001
- </p>
- </td>
- </tr>
- <tr valign="BOTTOM">
- <td width="25%" sdval="16385" sdnum="1033;0;#,##0">
- <p align="RIGHT">16,385
- </p>
- </td>
- <td width="25%" sdval="10000001" sdnum="1033;0;00000000">
- <p class="western" align="RIGHT" style="margin-left: 0.11cm;
- margin-right: 0.01cm">
- 10000001
- </p>
- </td>
- <td width="25%" sdval="10000000" sdnum="1033;0;00000000">
- <p class="western" align="RIGHT" style="margin-left: -0.07cm;
- margin-right: 0.01cm">
- 10000000
- </p>
- </td>
- <td width="25%" sdval="1" sdnum="1033;0;00000000">
- <p class="western" align="RIGHT" style="margin-left: -0.47cm;
- margin-right: 0.01cm">
- 00000001
- </p>
- </td>
- </tr>
- <tr>
- <td width="25%" valign="TOP">
- <p align="RIGHT">...
- </p>
- </td>
- <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
- <p class="western" align="RIGHT" style="margin-left: 0.11cm;
- margin-right: 0.01cm">
- <br/>
-
- </p>
- </td>
- <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
- <p class="western" align="RIGHT" style="margin-left: -0.07cm;
- margin-right: 0.01cm">
- <br/>
-
- </p>
- </td>
- <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
- <p class="western" align="RIGHT" style="margin-left: -0.47cm;
- margin-right: 0.01cm">
- <br/>
-
- </p>
- </td>
- </tr>
- </table>
-
- <p>
- This provides compression while still being
- efficient to decode.
- </p>
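- <p>
- The encoding in the table above can be sketched as follows. The class
- <code>VInt</code> and its method names are hypothetical and are not Lucene's
- own API; the decoder assumes well-formed input.
- </p>

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper implementing the variable-length integer format.
public class VInt {
    // Emits seven bits per byte, least significant group first; the high-order
    // bit of a byte is set when more bytes follow.
    public static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Reads bytes starting at offset until one with a clear high-order bit.
    public static int decode(byte[] bytes, int offset) {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = bytes[offset++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }
}
```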
-
- </section>
-
- <section id="Chars"><title>Chars</title>
-
- <p>
- Lucene writes Unicode
- character sequences as UTF-8 encoded bytes.
- </p>
-
-
- </section>
-
- <section id="String"><title>String</title>
-
- <p>
- Lucene writes strings as UTF-8 encoded bytes.
- First the length, in bytes, is written as a VInt,
- followed by the bytes.
- </p>
-
- <p>
- String --> VInt, Chars
- </p>
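- <p>
- A minimal sketch of the String layout: the UTF-8 byte length as a VInt, then
- the bytes themselves. The class <code>LuceneString</code> and its methods are
- hypothetical names chosen for this example.
- </p>

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: String --> VInt (length in bytes), then UTF-8 bytes.
public class LuceneString {
    // Variable-length integer: seven bits per byte, high-order bit means "more follows".
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    public static byte[] encode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, utf8.length);      // length counts bytes, not characters
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }
}
```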
-
- </section>
- </section>
-
- <section id="Compound Types"><title>Compound Types</title>
- <section id="MapStringString"><title>Map<String,String></title>
-
- <p>
- In a couple of places, Lucene stores a Map
- from String to String.
- </p>
-
- <p>
- Map<String,String> --> Count<String,String><sup>Count</sup>
- </p>
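- <p>
- The layout above might be written as in the sketch below. This document does
- not state the type of Count, so the sketch assumes a four-byte, high-order
- first integer; treat that, along with the class name
- <code>StringMapWriter</code>, as an assumption rather than a definition.
- </p>

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical helper: Count, then Count pairs of (String, String).
public class StringMapWriter {
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeVInt(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    public static byte[] encode(Map<String, String> map) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int count = map.size();
        // ASSUMPTION: Count is a four-byte integer, high-order byte first.
        out.write(count >>> 24);
        out.write(count >>> 16);
        out.write(count >>> 8);
        out.write(count);
        for (Map.Entry<String, String> e : map.entrySet()) {
            writeString(out, e.getKey());
            writeString(out, e.getValue());
        }
        return out.toByteArray();
    }
}
```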
-
- </section>
-
- </section>
-
- <section id="Per-Index Files"><title>Per-Index Files</title>
-
- <p>
- The files in this section exist one-per-index.
- </p>
-
- <section id="Segments File"><title>Segments File</title>
-
- <p>
- The active segments in the index are stored in the
- segment info file,
- <tt>segments_N</tt>.
- There may
- be one or more
- <tt>segments_N</tt>
- files in the
- index; however, the one with the largest
- generation is the active one (when older
- segments_N files are present it's because they
- temporarily cannot be deleted, or, a writer is in
- the process of committing, or a custom
- <a href="api/core/org/apache/lucene/index/IndexDeletionPolicy.html">IndexDeletionPolicy</a>
- is in use). This file lists each
- segment by name, has details about the separate
- norms and deletion files, and also contains the
- size of each segment.
- </p>
-
- <p>
- As of 2.1, there is also a file
- <tt>segments.gen</tt>.
- This file contains the
- current generation (the
- <tt>_N</tt>
- in
- <tt>segments_N</tt>)
- of the index. This is
- used only as a fallback in case the current
- generation cannot be accurately determined by
- directory listing alone (as is the case for some
- NFS clients with time-based directory cache
- expiration). This file simply contains an Int32
- version header (SegmentInfos.FORMAT_LOCKLESS =
- -2), followed by the generation recorded as Int64,
- written twice.
- </p>
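- <p>
- A reader for <tt>segments.gen</tt> might look like the sketch below. It
- assumes Int32 and Int64 are stored high-order bytes first, like UInt32 and
- UInt64 above, and it treats a mismatch between the two copies as "unusable",
- presumably the reason the value is written twice. The class
- <code>SegmentsGen</code> and the -1 return convention are choices made for
- this example.
- </p>

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical reader for the contents of segments.gen.
public class SegmentsGen {
    static final int FORMAT_LOCKLESS = -2;

    // Returns the recorded generation, or -1 if the file cannot be trusted.
    public static long readGeneration(byte[] contents) {
        // DataInputStream reads multi-byte integers high-order byte first.
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(contents))) {
            int version = in.readInt();
            if (version != FORMAT_LOCKLESS) {
                return -1;
            }
            long first = in.readLong();
            long second = in.readLong();
            // The generation is written twice; both copies must agree.
            return first == second ? first : -1;
        } catch (IOException e) {
            // Includes EOFException for a truncated file.
            return -1;
        }
    }
}
```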
- <p>
- <b>3.1</b>
- Segments --> Format, Version, NameCounter, SegCount, <SegVersion, SegName, SegSize, DelGen, DocStoreOffset, [DocStoreSegment, DocStoreIsCompoundFile], HasSingleNormFile, NumField,
- NormGen<sup>NumField</sup>,
- IsCompoundFile, DeletionCount, HasProx, Diagnostics, HasVectors><sup>SegCount</sup>, CommitUserData, Checksum
- </p>
-
- <p>
- Format, NameCounter, SegCount, SegSize, NumField,
- DocStoreOffset, DeletionCount --> Int32
- </p>
-
- <p>
- Version, DelGen, NormGen, Checksum --> Int64
- </p>
-
- <p>
- SegVersion, SegName, DocStoreSegment --> String
- </p>
-
- <p>
- Diagnostics --> Map<String,String>
- </p>
-
- <p>
- IsCompoundFile, HasSingleNormFile,
- DocStoreIsCompoundFile, HasProx, HasVectors --> Int8
- </p>
-
- <p>
- CommitUserData --> Map<String,String>
- </p>
-
- <p>
- Format is -9 (SegmentInfos.FORMAT_DIAGNOSTICS).
- </p>
-
- <p>
- Version counts how often the index has been
- changed by adding or deleting documents.
- </p>
-
- <p>
- NameCounter is used to generate names for new segment files.
- </p>
-
- <p>
- SegVersion is the code version that created the segment.
- </p>
-
- <p>
- SegName is the name of the segment, and is used as the file name prefix
- for all of the files that compose the segment's index.
- </p>
-
- <p>
- SegSize is the number of documents contained in the segment index.
- </p>
-
- <p>
- DelGen is the generation count of the separate
- deletes file. If this is -1, there are no
- separate deletes. If it is 0, this is a pre-2.1
- segment and you must check the filesystem for the
- existence of _X.del. Anything above zero means
- there are separate deletes (_X_N.del).
- </p>
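- <p>
- The three DelGen cases could be mapped to a file name as sketched below. The
- sketch assumes that SegName already carries the leading underscore (the
- <tt>_X</tt> above) and that N is rendered in base 36 like the segments
- generation; both are assumptions, as are the class and method names.
- </p>

```java
import java.util.Optional;

// Hypothetical helper: which deletes file, if any, a DelGen value points to.
public class DeletesFileName {
    public static Optional<String> forSegment(String segName, long delGen) {
        if (delGen == -1) {
            // No separate deletes.
            return Optional.empty();
        }
        if (delGen == 0) {
            // Pre-2.1 segment: the caller must still check whether this file exists.
            return Optional.of(segName + ".del");
        }
        if (delGen < 0) {
            throw new IllegalArgumentException("unexpected DelGen: " + delGen);
        }
        // ASSUMPTION: the generation suffix is written in base 36.
        return Optional.of(segName + "_" + Long.toString(delGen, Character.MAX_RADIX) + ".del");
    }
}
```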
-
- <p>
- NumField is the size of the array for NormGen, or
- -1 if there are no NormGens stored.
- </p>
-
- <p>
- NormGen records the generation of the separate
- norms files. If NumField is -1, there are no
- normGens stored and they are all assumed to be 0
- when the segment file was written pre-2.1 and all
- assumed to be -1 when the segments file is 2.1 or
- above. The generation then has the same meaning
- as delGen (above).
- </p>
-
- <p>
- IsCompoundFile records whether the segment is
- written as a compound file or not. If this is -1,
- the segment is not a compound file. If it is 1,
- the segment is a compound file. Else it is 0,
- which means we check the filesystem to see if _X.cfs
- exists.
- </p>
-
- <p>
- If HasSingleNormFile is 1, then the field norms are
- written as a single joined file (with extension
- <tt>.nrm</tt>); if it is 0 then each field's norms
- are stored as separate <tt>.fN</tt> files. See
- "Normalization Factors" below for details.
- </p>
-
- <p>
- DocStoreOffset, DocStoreSegment,
- DocStoreIsCompoundFile: If DocStoreOffset is -1,
- this segment has its own doc store (stored fields
- values and term vectors) files and DocStoreSegment
- and DocStoreIsCompoundFile are not stored. In
- this case all files for stored field values
- (<tt>*.fdt</tt> and <tt>*.fdx</tt>) and term
- vectors (<tt>*.tvf</tt>, <tt>*.tvd</tt> and
- <tt>*.tvx</tt>) will be stored with this segment.
- Otherwise, DocStoreSegment is the name of the
- segment that has the shared doc store files;
- DocStoreIsCompoundFile is 1 if that segment is
- stored in compound file format (as a <tt>.cfx</tt>
- file); and DocStoreOffset is the starting document
- in the shared doc store files where this segment's
- documents begin. In this case, this segment does
- not store its own doc store files but instead
- shares a single set of these files with other
- segments.
- </p>
-
- <p>
- Checksum contains the CRC32 checksum of all bytes
- in the segments_N file up until the checksum.
- This is used to verify integrity of the file on
- opening the index.
- </p>
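- <p>
- An integrity check along these lines might look like the sketch below, which
- uses <code>java.util.zip.CRC32</code>. It assumes the checksum occupies the
- final eight bytes of the file, stored high-order byte first; the class
- <code>SegmentsChecksum</code> is a hypothetical name.
- </p>

```java
import java.util.zip.CRC32;

// Hypothetical verifier for the trailing checksum of a segments_N file.
public class SegmentsChecksum {
    public static boolean verify(byte[] file) {
        if (file.length < 8) {
            return false;
        }
        int bodyLength = file.length - 8;

        // CRC32 of every byte that precedes the checksum.
        CRC32 crc = new CRC32();
        crc.update(file, 0, bodyLength);

        // ASSUMPTION: the stored checksum is the last eight bytes, high-order first.
        long stored = 0;
        for (int i = bodyLength; i < file.length; i++) {
            stored = (stored << 8) | (file[i] & 0xFF);
        }
        return stored == crc.getValue();
    }
}
```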
-
- <p>
- DeletionCount records the number of deleted
- documents in this segment.
- </p>
-
- <p>
- HasProx is 1 if any fields in this segment have
- position data (IndexOptions.DOCS_AND_FREQS_AND_POSITIONS); else, it's 0.
- </p>
-
- <p>
- CommitUserData stores an optional user-supplied
- opaque Map<String,String> that was passed to
- IndexWriter's commit or prepareCommit, or
- IndexReader's flush methods.
- </p>
- <p>
- The Diagnostics Map is privately written by
- IndexWriter, as a debugging aid, for each segment
- it creates. It includes metadata like the current
- Lucene version, OS, Java version, why the segment
- was created (merge, flush, addIndexes), etc.
- </p>
-
- <p> HasVectors is 1 if this segment stores term vectors,
- else it's 0.
- </p>
-
- </section>
-
- <section id="Lock File"><title>Lock File</title>
-
- <p>
- The write lock, which is stored in the index
- directory by default, is named "write.lock". If
- the lock directory is different from the index
- directory then the write lock will be named
- "XXXX-write.lock" where XXXX is a unique prefix
- derived from the full path to the index directory.
- When this file is present, a writer is currently
- modifying the index (adding or removing
- documents). This lock file ensures that only one
- writer is modifying the index at a time.
- </p>
- </section>
-
- <section id="Deletable File"><title>Deletable File</title>
-
- <p>
- No deletable file is written. Instead, a writer
- dynamically computes the files that are
- deletable.
- </p>
-
- </section>
-
- <section id="Compound Files"><title>Compound Files</title>
-
- <p>Starting with Lucene 1.4 the compound file format became the default. This
- is simply a container for all files described in the next section
- (except for the .del file).</p>
- <p>Compound Entry Table (.cfe) --> Version, FileCount, <FileName, DataOffset, DataLength>
- <sup>FileCount</sup>
- </p>
-
- <p>Compound (.cfs) --> FileData <sup>FileCount</sup>
- </p>
-
- <p>Version --> Int</p>
-
- <p>FileCount --> VInt</p>
-
- <p>DataOffset --> Long</p>
-
- <p>DataLength --> Long</p>
-
- <p>FileName --> String</p>
-
- <p>FileData --> raw file data</p>
- <p>The raw file data is the data from the individual files named above.</p>
-
- <p>Starting with Lucene 2.3, doc store files (stored
- field values and term vectors) can be shared in a
- single set of files for more than one segment. When the
- compound file format is enabled, these shared files will be
- added into a single compound file (same format as
- above) but with the extension <tt>.cfx</tt>.
- </p>
-
- </section>
-
- </section>
-
- <section id="Per-Segment Files"><title>Per-Segment Files</title>
-
- <p>
- The remaining files are all per-segment, and are
- thus defined by suffix.
- </p>
- <section id="Fields"><title>Fields</title>
- <p>
- <br/>
- <b>Field Info</b>
- <br/>
- </p>
-
- <p>
- Field names are
- stored in the field info file, with suffix .fnm.
- </p>
- <p>
- FieldInfos
- (.fnm) --> FNMVersion,FieldsCount, <FieldName,
- FieldBits>
- <sup>FieldsCount</sup>
- </p>
-
- <p>
- FNMVersion, FieldsCount --> VInt
- </p>
-
- <p>
- FieldName --> String
- </p>
-
- <p>
- FieldBits --> Byte
- </p>
-
- <p>
- <ul>
- <li>
- The low-order bit is one for
- indexed fields, and zero for non-indexed fields.
- </li>
- <li>
- The second lowest-order
- bit is one for fields that have term vectors stored, and zero for fields
- without term vectors.
- </li>
- <li>If the third lowest-order bit is set (0x04), term positions are stored with the term vectors.</li>
- <li>If the fourth lowest-order bit is set (0x08), term offsets are stored with the term vectors.</li>
- <li>If the fifth lowest-order bit is set (0x10), norms are omitted for the indexed field.</li>
- <li>If the sixth lowest-order bit is set (0x20), payloads are stored for the indexed field.</li>
- <li>If the seventh lowest-order bit is set (0x40), term frequencies and positions are omitted for the indexed field.</li>
- <li>If the eighth lowest-order bit is set (0x80), positions are omitted for the indexed field.</li>
- </ul>
- </p>
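- <p>
- As an illustration only, the FieldBits flags listed above could be decoded
- as sketched below. The masks come from the list; the flag names and the
- function <code>decode_field_bits</code> are hypothetical and not part of Lucene.
- </p>

```python
# Masks are taken from the FieldBits list above.
# The flag names are illustrative labels, not identifiers from Lucene.
FIELD_BITS = {
    0x01: "indexed",
    0x02: "term_vectors",
    0x04: "term_vector_positions",
    0x08: "term_vector_offsets",
    0x10: "omit_norms",
    0x20: "payloads",
    0x40: "omit_term_freq_and_positions",
    0x80: "omit_positions",
}


def decode_field_bits(bits):
    """Return the set of flag names whose bit is set in one FieldBits byte."""
    return {name for mask, name in FIELD_BITS.items() if bits & mask}
```

- <p>
- For instance, a FieldBits value of 0x07 would describe an indexed field
- whose term vectors are stored together with positions.
- </p>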
-
- <p>
- FNMVersion (added in 2.9) is -2 for indexes written by Lucene 2.9 through 3.3, and -3 for indexes written by Lucene 3.4 and later.
- </p>
-
- <p>
- Fields are numbered by their order in this file. Thus field zero is
- the
- first field in the file, field one the next, and so on. Note that,
- like document numbers, field numbers are segment relative.
- </p>
-
-
-
- <p>
- <br/>
- <b>Stored Fields</b>
- <br/>
- </p>
-
- <p>
- Stored fields are represented by two files:
- </p>
-
- <ol>
- <li><a name="field_index"/>
- <p>
- The field index, or .fdx file.
- </p>
-
- <p>
- This contains, for each document, a pointer to
- its field data, as follows:
- </p>
-
- <p>
- FieldIndex
- (.fdx) -->
- <FieldValuesPosition>
- <sup>SegSize</sup>
- </p>
- <p>FieldValuesPosition
- --> Uint64
- </p>
- <p>This
- is used to find the location within the field data file of the
- fields of a particular document. Because it contains fixed-length
- data, this file may be easily randomly accessed. The position of
- document <i>n</i>'s field data is the Uint64 at <i>n*8</i> in
- this file.
- </p>
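- <p>
- A minimal sketch of this lookup follows. It assumes the Uint64 values are
- written high-order byte first (big-endian) and that the file has no header
- before the first entry; both are assumptions of the sketch, and
- <code>field_values_position</code> is a hypothetical helper name.
- </p>

```python
import struct


def field_values_position(fdx_bytes, n):
    """Return the FieldValuesPosition of document n from raw .fdx contents.

    Assumption: each entry is an 8-byte big-endian unsigned integer and
    entry n starts at byte offset n*8, as described in the text above.
    """
    (position,) = struct.unpack_from(">Q", fdx_bytes, n * 8)
    return position
```

- <p>
- Because every entry has the same width, no scan over earlier documents is needed.
- </p>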
- </li>
- <li>
- <p><a name="field_data"/>
- The field data, or .fdt file.
-
- </p>
-
- <p>
- This contains the stored fields of each document,
- as follows:
- </p>
-
- <p>
- FieldData (.fdt) -->
- <DocFieldData>
- <sup>SegSize</sup>
- </p>
- <p>DocFieldData -->
- FieldCount, <FieldNum, Bits, Value>
- <sup>FieldCount</sup>
- </p>
- <p>FieldCount -->
- VInt
- </p>
- <p>FieldNum -->
- VInt
- </p>
- <p>Bits -->
- Byte
- </p>
- <p>
- <ul>
- <li>low order bit is one for tokenized fields</li>
- <li>second bit is one for fields containing binary data</li>
- <li>third bit is one for fields with compression option enabled
- (if compression is enabled, the algorithm used is ZLIB),
- only available for indexes until Lucene version 2.9.x</li>
- <li>4th to 6th bit (mask: 0x7<<3) define the type of a
- numeric field: <ul>
- <li>all bits in the mask are cleared if the field is not numeric</li>
- <li>1<<3: Value is Int</li>
- <li>2<<3: Value is Long</li>
- <li>3<<3: Value is an Int holding a Float (decoded with Float.intBitsToFloat)</li>
- <li>4<<3: Value is a Long holding a Double (decoded with Double.longBitsToDouble)</li>
- </ul></li>
- </ul>
- </p>
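- <p>
- The Bits byte of a stored field could be unpacked roughly as follows. This
- is a sketch based only on the list above; the dictionary keys and the
- function name are hypothetical.
- </p>

```python
# Numeric type codes stored in bits 3-5 (mask 0x7 << 3), per the list above.
NUMERIC_TYPES = {1: "Int", 2: "Long", 3: "Float", 4: "Double"}


def decode_stored_field_bits(bits):
    """Split a stored-field Bits byte into its documented flags."""
    numeric_code = (bits >> 3) & 0x7
    return {
        "tokenized": bool(bits & 0x01),
        "binary": bool(bits & 0x02),
        "compressed": bool(bits & 0x04),
        # None means "not numeric"; codes 5-7 are not defined in the text
        # and are also reported as None by this sketch.
        "numeric": NUMERIC_TYPES.get(numeric_code),
    }
```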
- <p>Value -->
- String | BinaryValue | Int | Long (depending on Bits)
- </p>
- <p>BinaryValue -->
- ValueSize, <Byte>^ValueSize
- </p>
- <p>ValueSize -->
- VInt
- </p>
-
- </li>
- </ol>
-
- </section>
- <section id="Term Dictionary"><title>Term Dictionary</title>
-
- <p>
- The term dictionary is represented as two files:
- </p>
- <ol>
- <li><a name="tis"/>
- <p>
- The term infos, or .tis file.
- </p>
-
- <p>
- TermInfoFile (.tis)-->
- TIVersion, TermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermInfos
- </p>
- <p>TIVersion -->
- UInt32
- </p>
- <p>TermCount -->
- UInt64
- </p>
- <p>IndexInterval -->
- UInt32
- </p>
- <p>SkipInterval -->
- UInt32
- </p>
- <p>MaxSkipLevels -->
- UInt32
- </p>
- <p>TermInfos -->
- <TermInfo>
- <sup>TermCount</sup>
- </p>
- <p>TermInfo -->
- <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>
- </p>
- <p>Term -->
- <PrefixLength, Suffix, FieldNum>
- </p>
- <p>Suffix -->
- String
- </p>
- <p>PrefixLength,
- DocFreq, FreqDelta, ProxDelta, SkipDelta
- <br/>
- --> VInt
- </p>
- <p>
- This file is sorted by Term. Terms are
- ordered first lexicographically (by UTF16
- character code) by the term's field name,
- and within that lexicographically (by
- UTF16 character code) by the term's text.
- </p>
- <p>TIVersion names the version of the format
- of this file and is equal to TermInfosWriter.FORMAT_CURRENT.
- </p>
- <p>Term
- text prefixes are shared. The PrefixLength is the number of initial
- characters from the previous term which must be pre-pended to a
- term's suffix in order to form the term's text. Thus, if the
- previous term's text was "bone" and the term is "boy",
- the PrefixLength is two and the suffix is "y".
- </p>
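- <p>
- The prefix sharing described above can be sketched as follows. The helper
- <code>rebuild_terms</code> is hypothetical, and the sketch assumes that
- PrefixLength counts characters of the previous term's text in the same
- units as Python string indices.
- </p>

```python
def rebuild_terms(entries):
    """Rebuild full term texts from (PrefixLength, Suffix) pairs.

    Each term reuses the first PrefixLength characters of the previous
    term and appends its own suffix.
    """
    terms = []
    previous = ""
    for prefix_length, suffix in entries:
        term = previous[:prefix_length] + suffix
        terms.append(term)
        previous = term
    return terms
```

- <p>
- With the entries (0, "bone") and (2, "y"), the sketch yields "bone" followed by "boy".
- </p>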
- <p>FieldNum
- determines the term's field, whose name is stored in the .fnm file.
- </p>
- <p>DocFreq
- is the count of documents which contain the term.
- </p>
- <p>FreqDelta
- determines the position of this term's TermFreqs within the .frq
- file. In particular, it is the difference between the position of
- this term's data in that file and the position of the previous
- term's data (or zero, for the first term in the file).
- </p>
- <p>ProxDelta
- determines the position of this term's TermPositions within the .prx
- file. In particular, it is the difference between the position of
- this term's data in that file and the position of the previous
- term's data (or zero, for the first term in the file). For fields
- that omit position data, this will be 0 since
- prox information is not stored.
- </p>
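- <p>
- Since FreqDelta and ProxDelta are differences from the previous term's
- position, absolute file positions are recovered with a running sum. A small
- sketch, with a hypothetical function name:
- </p>

```python
from itertools import accumulate


def absolute_positions(deltas):
    """Turn a sequence of FreqDelta (or ProxDelta) values into file positions.

    The previous position is taken as zero for the first term, so the
    first delta is already an absolute position.
    """
    return list(accumulate(deltas))
```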
- <p>SkipDelta determines the position of this
- term's SkipData within the .frq file. In
- particular, it is the number of bytes
- after TermFreqs that the SkipData starts.
- In other words, it is the length of the
- TermFreqs data. SkipDelta is only stored
- if DocFreq is not smaller than SkipInterval.
- </p>
- </li>
- <li>
- <p><a name="tii"/>
- The term info index, or .tii file.
- </p>
-
- <p>
- This contains every IndexInterval
- <sup>th</sup>
- entry from the .tis
- file, along with its location in the .tis file. This is
- designed to be read entirely into memory and used to provide random
- access to the .tis file.
- </p>
-
- <p>
- The structure of this file is very similar to the
- .tis file, with the addition of one item per record, the IndexDelta.
- </p>
-
- <p>
- TermInfoIndex (.tii)-->
- TIVersion, IndexTermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermIndices
- </p>
- <p>TIVersion -->
- UInt32
- </p>
- <p>IndexTermCount -->
- UInt64
- </p>
- <p>IndexInterval -->
- UInt32
- </p>
- <p>SkipInterval -->
- UInt32
- </p>
- <p>TermIndices -->
- <TermInfo, IndexDelta>
- <sup>IndexTermCount</sup>
- </p>
- <p>IndexDelta -->
- VLong
- </p>
- <p>IndexDelta
- determines the position of this term's TermInfo within the .tis file. In
- particular, it is the difference between the position of this term's
- entry in that file and the position of the previous term's entry.
- </p>
- <p>SkipInterval is the fraction of TermDocs stored in skip tables. It is used to accelerate TermDocs.skipTo(int).
- Larger values result in smaller indexes, greater acceleration, but fewer accelerable cases, while
- smaller values result in bigger indexes, less acceleration (in case of a small value for MaxSkipLevels) and more
- accelerable cases.</p>
- <p>MaxSkipLevels is the maximum number of skip levels stored for each term in the .frq file. A low value results in
- smaller indexes but less acceleration; a larger value results in slightly larger indexes but greater acceleration.
- See format of .frq file for more information about skip levels.</p>
- </li>
- </ol>
- </section>
-
- <section id="Frequencies"><title>Frequencies</title>
-
- <p>
- The .frq file contains the lists of documents
- which contain each term, along with the frequency of the term in that
- document (except when frequencies are omitted: IndexOptions.DOCS_ONLY).
- </p>
- <p>FreqFile (.frq) -->
- <TermFreqs, SkipData>
- <sup>TermCount</sup>
- </p>
- <p>TermFreqs -->
- <TermFreq>
- <sup>DocFreq</sup>
- </p>
- <p>TermFreq -->
- DocDelta[, Freq?]
- </p>
- <p>SkipData -->
- <<SkipLevelLength, SkipLevel>
- <sup>NumSkipLevels-1</sup>, SkipLevel>
- <SkipDatum>
- </p>
- <p>SkipLevel -->
- <SkipDatum>
- <sup>DocFreq/(SkipInterval^(Level + 1))</sup>
- </p>
- <p>SkipDatum -->
- DocSkip,PayloadLength?,FreqSkip,ProxSkip,SkipChildLevelPointer?
- </p>
- <p>DocDelta,Freq,DocSkip,PayloadLength,FreqSkip,ProxSkip -->
- VInt
- </p>
- <p>SkipChildLevelPointer -->
- VLong
- </p>
- <p>TermFreqs
- are ordered by term (the term is implicit, from the .tis file).
- </p>
- <p>TermFreq
- entries are ordered by increasing document number.
- </p>
- <p>DocDelta: if frequencies are indexed, this determines both
- the document number and the frequency. In
- particular, DocDelta/2 is the difference between
- this document number and the previous document
- number (or zero when this is the first document in
- a TermFreqs). When DocDelta is odd, the frequency
- is one. When DocDelta is even, the frequency is
- read as another VInt. If frequencies are omitted, DocDelta
- contains the gap (not multiplied by 2) between
- document numbers and no frequency information is
- stored.
- </p>
- <p>For example, the TermFreqs for a term which occurs
- once in document seven and three times in document
- eleven, with frequencies indexed, would be the following
- sequence of VInts:
- </p>
- <p>15, 8, 3
- </p>
- <p> If frequencies were omitted (IndexOptions.DOCS_ONLY) it would be this sequence
- of VInts instead:
- </p>
- <p>
- 7,4
- </p>
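- <p>
- The DocDelta scheme can be sketched as an encoder and decoder over lists of
- already-decoded VInt values. The function names and the
- <code>with_freqs</code> parameter are hypothetical; byte-level VInt
- encoding is left out.
- </p>

```python
def encode_term_freqs(postings, with_freqs=True):
    """Encode (doc_number, freq) pairs, sorted by doc_number, as VInt values."""
    out = []
    previous_doc = 0
    for doc, freq in postings:
        gap = doc - previous_doc
        previous_doc = doc
        if not with_freqs:
            out.append(gap)            # plain gap, no frequency stored
        elif freq == 1:
            out.append(gap * 2 + 1)    # odd DocDelta: frequency is one
        else:
            out.append(gap * 2)        # even DocDelta: frequency follows
            out.append(freq)
    return out


def decode_term_freqs(values, doc_freq, with_freqs=True):
    """Inverse of encode_term_freqs; freq is None when frequencies are omitted."""
    postings = []
    doc = 0
    it = iter(values)
    for _ in range(doc_freq):
        doc_delta = next(it)
        if not with_freqs:
            doc += doc_delta
            postings.append((doc, None))
        else:
            doc += doc_delta >> 1
            freq = 1 if doc_delta & 1 else next(it)
            postings.append((doc, freq))
    return postings
```

- <p>
- Encoding the postings (7, 1) and (11, 3) reproduces both sequences given in the example above.
- </p>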
- <p>DocSkip records the document number before every
- SkipInterval
- <sup>th</sup>
- document in TermFreqs.
- If payloads are disabled for the term's field,
- then DocSkip represents the difference from the
- previous value in the sequence.
- If payloads are enabled for the term's field,
- then DocSkip/2 represents the difference from the
- previous value in the sequence. If payloads are enabled
- and DocSkip is odd,
- then PayloadLength is stored indicating the length
- of the last payload before the SkipInterval<sup>th</sup>
- document in TermPositions.
- FreqSkip and ProxSkip record the position of every
- SkipInterval
- <sup>th</sup>
- entry in FreqFile and
- ProxFile, respectively. File positions are
- relative to the start of TermFreqs and Positions for the first
- SkipDatum, and to the previous SkipDatum in the sequence thereafter.
- </p>
- <p>For example, if DocFreq=35 and SkipInterval=16,
- then there are two SkipData entries, containing
- the 15
- <sup>th</sup>
- and 31
- <sup>st</sup>
- document
- numbers in TermFreqs. The first FreqSkip names
- the number of bytes after the beginning of
- TermFreqs that the 16
- <sup>th</sup>
- SkipDatum
- starts, and the second the number of bytes after
- that that the 32
- <sup>nd</sup>
- starts. The first
- ProxSkip names the number of bytes after the
- beginning of Positions that the 16
- <sup>th</sup>
- SkipDatum starts, and the second the number of
- bytes after that that the 32
- <sup>nd</sup>
- starts.
- </p>
- <p>Each term can have multiple skip levels.
- The number of skip levels for a term is NumSkipLevels = Min(MaxSkipLevels, floor(log(DocFreq)/log(SkipInterval))).
- The number of SkipData entries for a skip level is DocFreq/(SkipInterval^(Level + 1)), where the lowest skip
- level is Level=0. <br></br>
- Example: SkipInterval = 4, MaxSkipLevels = 2, DocFreq = 35. Then skip level 0 has 8 SkipData entries,
- containing the 3<sup>rd</sup>, 7<sup>th</sup>, 11<sup>th</sup>, 15<sup>th</sup>, 19<sup>th</sup>, 23<sup>rd</sup>,
- 27<sup>th</sup>, and 31<sup>st</sup> document numbers in TermFreqs. Skip level 1 has 2 SkipData entries, containing the
- 15<sup>th</sup> and 31<sup>st</sup> document numbers in TermFreqs. <br></br>
- The SkipData entries on all upper levels > 0 contain a SkipChildLevelPointer referencing the corresponding SkipData
- entry in level-1. In the example, entry 15 on level 1 has a pointer to entry 15 on level 0, and entry 31 on level 1 has a pointer
- to entry 31 on level 0.
- </p>
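- <p>
- The skip level arithmetic can be checked with a short sketch. It reads the
- level count as the base-SkipInterval logarithm of DocFreq, rounded down and
- capped at MaxSkipLevels, which is the reading that matches the worked
- example. It uses integer division rather than floating-point logarithms and
- assumes SkipInterval is at least 2. The function name is hypothetical.
- </p>

```python
def skip_level_sizes(doc_freq, skip_interval, max_skip_levels):
    """Return the number of skip entries on each level, lowest level first."""
    # Count how many times doc_freq can be divided by skip_interval;
    # this equals floor(log(doc_freq) / log(skip_interval)).
    levels = 0
    remaining = doc_freq
    while remaining >= skip_interval and levels < max_skip_levels:
        remaining //= skip_interval
        levels += 1
    # Level L holds doc_freq / skip_interval^(L + 1) entries (integer division).
    return [doc_freq // skip_interval ** (level + 1) for level in range(levels)]
```

- <p>
- For SkipInterval = 4, MaxSkipLevels = 2 and DocFreq = 35, this gives 8 entries on level 0 and 2 entries on level 1.
- </p>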
-
- </section>
- <section id="Positions"><title>Positions</title>
-
- <p>
- The .prx file contains the lists of positions that
- each term occurs at within documents. Note that
- fields omitting positional data do not store
- anything into this file, and if all fields in the
- index omit positional data then the .prx file will not
- exist.
- </p>
- <p>ProxFile (.prx) -->
- <TermPositions>
- <sup>TermCount</sup>
- </p>
- <p>TermPositions -->
- <Positions>
- <sup>DocFreq</sup>
- </p>
- <p>Positions -->
- <PositionDelta,Payload?>
- <sup>Freq</sup>
- </p>
- <p>Payload -->
- <PayloadLength?,PayloadData>
- </p>
- <p>PositionDelta -->
- VInt
- </p>
- <p>PayloadLength -->
- VInt
- </p>
- <p>PayloadData -->
- byte<sup>PayloadLength</sup>
- </p>
- <p>TermPositions
- are ordered by term (the term is implicit, from the .tis file).
- </p>
- <p>Positions
- entries are ordered by increasing document number (the document
- number is implicit from the .frq file).
- </p>
- <p>PositionDelta
- is, if payloads are disabled for the term's field, the difference
- between the position of the current occurrence in
- the document and the previous occurrence (or zero, if this is the
- first occurrence in this document).
- If payloads are enabled for the term's field, then PositionDelta/2
- is the difference between the current and the previous position. If
- payloads are enabled and PositionDelta is odd, then PayloadLength is
- stored, indicating the length of the payload at the current term position.
- </p>
- <p>
- For example, the TermPositions for a
- term which occurs as the fourth term in one document, and as the
- fifth and ninth term in a subsequent document, would be the following
- sequence of VInts (payloads disabled):
- </p>
- <p>4,
- 5, 4
- </p>
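- <p>
- A sketch of the PositionDelta encoding with payloads disabled follows. The
- function name is hypothetical; the input is one list of positions per
- document, in document order.
- </p>

```python
def encode_term_positions(docs_positions):
    """Encode per-document position lists as PositionDelta values.

    The delta restarts from zero at the first occurrence in each document.
    Payloads are assumed to be disabled for the field.
    """
    out = []
    for positions in docs_positions:
        previous = 0
        for position in positions:
            out.append(position - previous)
            previous = position
    return out
```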
- <p>PayloadData
- is metadata associated with the current term position. If PayloadLength
- is stored at the current position, then it indicates the length of this
- Payload. If PayloadLength is not stored, then this Payload has the same
- length as the Payload at the previous position.
- </p>
- </section>
- <section id="Normalization Factors"><title>Normalization Factors</title>
-
- <p>There's a single .nrm file containing all norms:
- </p>
- <p>AllNorms
- (.nrm) --> NormsHeader,<Norms>
- <sup>NumFieldsWithNorms</sup>
- </p>
- <p>Norms
- --> <Byte>
- <sup>SegSize</sup>
- </p>
- <p>NormsHeader
- --> 'N','R','M',Version
- </p>
- <p>Version
- --> Byte
- </p>
- <p>NormsHeader
- has 4 bytes, last of which is the format version for this file, currently -1.
- </p>
- <p>Each
- byte encodes a floating point value. Bits 0-2 contain the 3-bit
- mantissa, and bits 3-7 contain the 5-bit exponent.
- </p>
- <p>These
- are converted to an IEEE single float value as follows:
- </p>
- <ol>
- <li>
- <p>If
- the byte is zero, use a zero float.
- </p>
- </li>
- <li>
- <p>Otherwise,
- set the sign bit of the float to zero;
- </p>
- </li>
- <li>
- <p>add
- 48 to the exponent and use this as the float's exponent;
- </p>
- </li>
- <li>
- <p>map
- the mantissa to the high-order 3 bits of the float's mantissa; and
-
- </p>
- </li>
- <li>
- <p>set
- the low-order 21 bits of the float's mantissa to zero.
- </p>
- </li>
- </ol>
- <p>A separate norm file is created when the norm values of an existing segment are modified.
- When field <em>N</em> is modified, a separate norm file <em>.sN</em>
- is created, to maintain the norm values for that field.
- </p>
- <p>Separate norm files are created (when appropriate) for both compound and non-compound segments.
- </p>
-
- </section>
- <section id="Term Vectors"><title>Term Vectors</title>
- <p>
- Term Vector support is optional on a field-by-field
- basis. It consists of 3 files.
- </p>
- <ol>
- <li><a name="tvx"/>
- <p>The Document Index or .tvx file.</p>
- <p>For each document, this stores the offset
- into the document data (.tvd) and field
- data (.tvf) files.
- </p>
- <p>DocumentIndex (.tvx) --> TVXVersion<DocumentPosition,FieldPosition>
- <sup>NumDocs</sup>
- </p>
- <p>TVXVersion --> Int (TermVectorsReader.FORMAT_CURRENT)</p>
- <p>DocumentPosition --> UInt64 (offset in
- the .tvd file)</p>
- <p>FieldPosition --> UInt64 (offset in the
- .tvf file)</p>
- </li>
- <li><a name="tvd"/>
- <p>The Document or .tvd file.</p>
- <p>This contains, for each document, the number of fields, a list of the fields with
- term vector info and finally a list of pointers to the field information in the .tvf
- (Term Vector Fields) file.</p>
- <p>
- Document (.tvd) --> TVDVersion<NumFields, FieldNums, FieldPositions>
- <sup>NumDocs</sup>
- </p>
- <p>TVDVersion --> Int (TermVectorsReader.FORMAT_CURRENT)</p>
- <p>NumFields --> VInt</p>
- <p>FieldNums --> <FieldNumDelta>
- <sup>NumFields</sup>
- </p>
- <p>FieldNumDelta --> VInt</p>
- <p>FieldPositions --> <FieldPositionDelta>
- <sup>NumFields-1</sup>
- </p>
- <p>FieldPositionDelta --> VLong</p>
- <p>The .tvd file is used to map out the fields that have term vectors stored and
- where the field information is in the .tvf file.</p>
- </li>
- <li><a name="tvf"/>
- <p>The Field or .tvf file.</p>
- <p>This file contains, for each field that has a term vector stored, a list of
- the terms, their frequencies and, optionally, position and offset information.</p>
- <p>Field (.tvf) --> TVFVersion<NumTerms, Position/Offset, TermFreqs>
- <sup>NumFields</sup>
- </p>
- <p>TVFVersion --> Int (TermVectorsReader.FORMAT_CURRENT)</p>
- <p>NumTerms --> VInt</p>
- <p>Position/Offset --> Byte</p>
- <p>TermFreqs --> <TermText, TermFreq, Positions?, Offsets?>
- <sup>NumTerms</sup>
- </p>
- <p>TermText --> <PrefixLength, Suffix></p>
- <p>PrefixLength --> VInt</p>
- <p>Suffix --> String</p>
- <p>TermFreq --> VInt</p>
- <p>Positions --> <VInt><sup>TermFreq</sup></p>
- <p>Offsets --> <VInt, VInt><sup>TermFreq</sup></p>
- <br/>
- <p>Notes:</p>
- <ul>
- <li>Position/Offset byte stores whether this term vector has position or offset information stored.</li>
- <li>Term
- text prefixes are shared. The PrefixLength is the number of initial
- characters from the previous term which must be pre-pended to a
- term's suffix in order to form the term's text. Thus, if the
- previous term's text was "bone" and the term is "boy",
- the PrefixLength is two and the suffix is "y".
- </li>
- <li>Positions are stored as delta encoded VInts. This means we only store the difference between the current position and the previous position.</li>
- <li>Offsets are stored as delta encoded VInts. The first VInt is the startOffset, the second is the endOffset.</li>
- </ul>
-
-
- </li>
- </ol>
- </section>
-
- <section id="Deleted Documents"><title>Deleted Documents</title>
-
- <p>The .del file is
- optional, and only exists when a segment contains deletions.
- </p>
-
- <p>Although per-segment, this file is maintained exterior to compound segment files.
- </p>
- <p>
- Deletions
- (.del) --> [Format],ByteCount,BitCount, Bits | DGaps (depending on Format)
- </p>
-
- <p>Format,ByteCount,BitCount -->
- Uint32
- </p>
-
- <p>Bits -->
- <Byte>
- <sup>ByteCount</sup>
- </p>
-
- <p>DGaps -->
- <DGap,NonzeroByte>
- <sup>NonzeroBytesCount</sup>
- </p>
-
- <p>DGap -->
- VInt
- </p>
-
- <p>NonzeroByte -->
- Byte
- </p>
-
- <p>Format
- is optional. A value of -1 indicates DGaps. A non-negative value indicates Bits, and that Format is excluded.
- </p>
-
- <p>ByteCount
- indicates the number of bytes in Bits. It is typically
- (SegSize/8)+1.
- </p>
-
- <p>
- BitCount
- indicates the number of bits that are currently set in Bits.
- </p>
-
- <p>Bits
- contains one bit for each document indexed. When the bit
- corresponding to a document number is set, that document is marked as
- deleted. Bit ordering is from least to most significant. Thus, if
- Bits contains two bytes, 0x00 and 0x02, then document 9 is marked as
- deleted.
- </p>
-
- <p>DGaps
- represents sparse bit-vectors more efficiently than Bits.
- It is made of DGaps on indexes of nonzero bytes in Bits,
- and the nonzero bytes themselves. The number of nonzero bytes
- in Bits (NonzeroBytesCount) is not stored.
- </p>
- <p>For example,
- if there are 8000 bits and only bits 10,12,32 are set,
- DGaps would be used:
- </p>
- <p>
- (VInt) 1, (Byte) 20, (VInt) 3, (Byte) 1
- </p>
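- <p>
- Both representations can be sketched briefly. The helpers below are
- hypothetical: <code>is_deleted</code> tests one bit of Bits using the
- least-significant-first ordering described above, and
- <code>encode_dgaps</code> builds the (DGap, NonzeroByte) pairs from a list
- of set bit numbers.
- </p>

```python
def is_deleted(bits, doc):
    """Test the bit for document number doc in a Bits byte sequence."""
    return bool(bits[doc >> 3] & (1 << (doc & 7)))


def encode_dgaps(set_bits):
    """Return (DGap, NonzeroByte) pairs for the given set bit numbers."""
    # Collect the nonzero bytes, keyed by their index within Bits.
    nonzero = {}
    for bit in set_bits:
        index = bit >> 3
        nonzero[index] = nonzero.get(index, 0) | (1 << (bit & 7))
    # Each DGap is the distance from the previous nonzero byte index
    # (or from zero for the first nonzero byte).
    pairs = []
    previous_index = 0
    for index in sorted(nonzero):
        pairs.append((index - previous_index, nonzero[index]))
        previous_index = index
    return pairs
```

- <p>
- For set bits 10, 12 and 32, the sketch produces the pairs (1, 20) and (3, 1), matching the example above.
- </p>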
- </section>
- </section>
-
- <section id="Limitations"><title>Limitations</title>
-
- <p>
- When referring to term numbers, Lucene's current
- implementation uses a Java <code>int</code> to hold the
- term index, which means the maximum number of unique
- terms in any single index segment is ~2.1 billion times
- the term index interval (default 128) = ~274 billion.
- This is technically not a limitation of the index file
- format, just of Lucene's current implementation.
- </p>
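- <p>
- The figure quoted above can be reproduced with a short calculation,
- assuming the largest Java <code>int</code> value (2^31 - 1) and the default
- term index interval of 128. The variable names are illustrative only.
- </p>

```python
# Largest value a signed 32-bit Java int can hold.
JAVA_INT_MAX = 2**31 - 1

# Default term index interval named in the text above.
DEFAULT_INDEX_INTERVAL = 128

# Upper bound on unique terms per segment under the current implementation.
max_unique_terms = JAVA_INT_MAX * DEFAULT_INDEX_INTERVAL
```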
- <p>
- Similarly, Lucene uses a Java <code>int</code> to refer
- to document numbers, and the index file format uses an
- <code>Int32</code> on-disk to store document numbers.
- This is a limitation of both the index file format and
- the current implementation. Eventually these should be
- replaced with either <code>UInt64</code> values, or
- better yet, <code>VInt</code> values which have no
- limit.
- </p>
-
- </section>
-
- </body>
-
-</document>