X-Git-Url: https://git.mdrn.pl/pylucene.git/blobdiff_plain/a2e61f0c04805cfcb8706176758d1283c7e3a55c..aaeed5504b982cf3545252ab528713250aa33eed:/lucene-java-3.5.0/lucene/src/site/src/documentation/content/xdocs/fileformats.xml
diff --git a/lucene-java-3.5.0/lucene/src/site/src/documentation/content/xdocs/fileformats.xml b/lucene-java-3.5.0/lucene/src/site/src/documentation/content/xdocs/fileformats.xml
new file mode 100644
index 0000000..02a66d2
--- /dev/null
+++ b/lucene-java-3.5.0/lucene/src/site/src/documentation/content/xdocs/fileformats.xml
@@ -0,0 +1,1936 @@
+
+
+
+ This document defines the index file formats used
+ in this version of Lucene. If you are using a different
+ version of Lucene, please consult the copy of
+ docs/fileformats.html that was distributed
+ with the version you are using.
+
+ Apache Lucene is written in Java, but several
+ efforts are underway to write
+ versions
+ of Lucene in other programming
+ languages. If these versions are to remain compatible with Apache
+ Lucene, then a language-independent definition of the Lucene index
+ format is required. This document thus attempts to provide a
+ complete and independent definition of the Apache Lucene file
+ formats.
+
+ As Lucene evolves, this document should evolve.
+ Versions of Lucene in different programming languages should endeavor
+ to agree on file formats, and generate new versions of this document.
+
+ Compatibility notes are provided in this document,
+ describing how file formats have changed from prior versions.
+
+ In version 2.1, the file format was changed to allow
+ lock-less commits (i.e., no more commit lock). The
+ change is fully backwards compatible: you can open a
+ pre-2.1 index for searching or adding/deleting of
+ docs. When the new segments file is saved
+ (committed), it will be written in the new file format
+ (meaning no specific "upgrade" process is needed).
+ But note that once a commit has occurred, pre-2.1
+ Lucene will not be able to read the index.
+
+ In version 2.3, the file format was changed to allow
+ segments to share a single set of doc store (vectors &
+ stored fields) files. This allows for faster indexing
+ in certain cases. The change is fully backwards
+ compatible (in the same way as the lock-less commits
+ change in 2.1).
+
+ In version 2.4, Strings are now written as true UTF-8
+ byte sequences, not Java's modified UTF-8. See issue
+ LUCENE-510 for details.
+
+ In version 2.9, an optional opaque Map<String,String>
+ CommitUserData may be passed to IndexWriter's commit
+ methods (and later retrieved), which is recorded in
+ the segments_N file. See issue LUCENE-1382 for
+ details. Also, diagnostics were added to each segment
+ written recording details about why it was written
+ (due to flush, merge; which OS/JRE was used; etc.).
+ See issue LUCENE-1654 for details.
+
+ In version 3.0, compressed fields are no longer
+ written to the index (they can still be read, but on
+ merge the new segment will write them,
+ uncompressed). See issue LUCENE-1960 for details.
+
+ In version 3.1, segments record the code version
+ that created them. See LUCENE-2720 for details.
+
+ Additionally, segments track explicitly whether or
+ not they have term vectors. See LUCENE-2811 for details.
+
+ In version 3.2, numeric fields are written natively
+ to the stored fields file; previously they were stored in
+ text format only.
+
+ In version 3.4, fields can omit position data while
+ still indexing term frequencies.
+
+ The fundamental concepts in Lucene are index,
+ document, field and term.
+
+ An index contains a sequence of documents.
+
+ A document is a sequence of fields.
+
+ A field is a named sequence of terms.
+
+ The same string in two different fields is
+ considered a different term. Thus terms are represented as a pair of
+ strings, the first naming the field, and the second naming text
+ within the field.
+
+ The index stores statistics about terms in order
+ to make term-based search more efficient. Lucene's
+ index falls into the family of indexes known as an inverted
+ index. This is because it can list, for a term, the documents that contain
+ it. This is the inverse of the natural relationship, in which
+ documents list terms.
+
+ In Lucene, fields may be stored, in which
+ case their text is stored in the index literally, in a non-inverted
+ manner. Fields that are inverted are called indexed. A field
+ may be both stored and indexed. The text of a field may be tokenized into terms to be
+ indexed, or the text of a field may be used literally as a term to be indexed.
+ Most fields are
+ tokenized, but sometimes it is useful for certain identifier fields
+ to be indexed literally.
+ See the Field javadocs for more information on Fields.
+
+ Lucene indexes may be composed of multiple sub-indexes, or
+ segments. Each segment is a fully independent index, which could be searched
+ separately. Indexes evolve by:
+ 1. Creating new segments for newly added documents.
+ 2. Merging existing segments.
+ Searches may involve multiple segments and/or multiple indexes, each
+ index potentially composed of a set of segments.
+
+ Internally, Lucene refers to documents by an integer document
+ number. The first document added to an index is numbered zero, and each
+ subsequent document added gets a number one greater than the previous.
+
+
+ Note that a document's number may change, so caution should be taken
+ when storing these numbers outside of Lucene. In particular, numbers may
+ change in the following situations:
+
+ The
+ numbers stored in each segment are unique only within the segment,
+ and must be converted before they can be used in a larger context.
+ The standard technique is to allocate each segment a range of
+ values, based on the range of numbers used in that segment. To
+ convert a document number from a segment to an external value, the
+ segment's base document
+ number is added. To convert an external value back to a
+ segment-specific value, the segment is identified by the range that
+ the external value is in, and the segment's base value is
+ subtracted. For example, two five-document segments might be
+ combined, so that the first segment has a base value of zero, and
+ the second of five. Document three from the second segment would
+ have an external value of eight.
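+
+ The base-offset technique described above can be sketched as follows.
+ This is a minimal illustration, not Lucene's code: the function names
+ are hypothetical, and segments are assumed to be laid out back to back
+ in the order given.

```python
from bisect import bisect_right


def segment_bases(segment_sizes):
    """Return the base document number of each segment, in order."""
    bases, total = [], 0
    for size in segment_sizes:
        bases.append(total)
        total += size
    return bases


def to_external(bases, segment, local_doc):
    """Segment-relative number -> external number: add the segment's base."""
    return bases[segment] + local_doc


def to_segment(bases, external_doc):
    """External number -> (segment, segment-relative number)."""
    # The owning segment is the last one whose base is <= external_doc.
    segment = bisect_right(bases, external_doc) - 1
    return segment, external_doc - bases[segment]


bases = segment_bases([5, 5])      # two five-document segments
print(to_external(bases, 1, 3))    # 8
print(to_segment(bases, 8))        # (1, 3)
```

+ Because the bases are sorted, a binary search is enough to find the
+ segment that owns an external value.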
+
+ When documents are deleted, gaps are created
+ in the numbering. These are eventually removed as the index evolves
+ through merging. Deleted documents are dropped when segments are
+ merged. A freshly-merged segment thus has no gaps in its numbering.
+
+ Each segment index maintains the following:
+
+ Field names. This
+ contains the set of field names used in the index.
+
+ Stored Field
+ values. This contains, for each document, a list of attribute-value
+ pairs, where the attributes are field names. These are used to
+ store auxiliary information about the document, such as its title,
+ url, or an identifier to access a
+ database. The set of stored fields is what is returned for each hit
+ when searching. This is keyed by document number.
+
+ Term dictionary.
+ A dictionary containing all of the terms used in all of the indexed
+ fields of all of the documents. The dictionary also contains the
+ number of documents which contain the term, and pointers to the
+ term's frequency and proximity data.
+
+ Term Frequency
+ data. For each term in the dictionary, the numbers of all the
+ documents that contain that term, and the frequency of the term in
+ that document, unless frequencies are omitted (IndexOptions.DOCS_ONLY).
+
+ Term Proximity
+ data. For each term in the dictionary, the positions at which the term
+ occurs in each document. Note that this will
+ not exist if all fields in all documents omit position data.
+
+ Normalization
+ factors. For each field in each document, a value is stored that is
+ multiplied into the score for hits on that field.
+
+ Term Vectors. For each field in each document, the term vector
+ (sometimes called document vector) may be stored. A term vector consists
+ of term text and term frequency. To add Term Vectors to your index see the
+ Field constructors.
+
+ Deleted documents.
+ An optional file indicating which documents are deleted.
+
+ Details on each of these are provided in subsequent sections.
+
+ All files belonging to a segment have the same name with varying
+ extensions. The extensions correspond to the different file formats
+ described below. When using the Compound File format (default in 1.4 and greater) these files are
+ collapsed into a single .cfs file (see below for details).
+
+ Typically, all segments
+ in an index are stored in a single directory, although this is not
+ required.
+
+ As of version 2.1 (lock-less commits), file names are
+ never re-used (there is one exception, "segments.gen",
+ see below). That is, when any file is saved to the
+ Directory it is given a never before used filename.
+ This is achieved using a simple generations approach.
+ For example, the first segments file is segments_1,
+ then segments_2, etc. The generation is a sequential
+ long integer represented in alpha-numeric (base 36)
+ form.
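+
+ As a rough sketch of the generation naming scheme, the snippet below
+ renders a generation in base 36 and builds the matching segments file
+ name. The helper names are hypothetical, and the use of lowercase
+ letters for digits above 9 is an assumption not stated in this document.

```python
import string

# Assumed digit alphabet: 0-9 followed by lowercase a-z (36 symbols).
DIGITS = string.digits + string.ascii_lowercase


def to_base36(n):
    """Render a non-negative integer in base 36."""
    if n < 0:
        raise ValueError("generation must be non-negative")
    if n == 0:
        return "0"
    out = []
    while n:
        n, remainder = divmod(n, 36)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


def segments_file_name(generation):
    """Build the segments_N file name for a generation."""
    return "segments_" + to_base36(generation)


def generation_of(file_name):
    """Recover the generation from a segments_N file name."""
    return int(file_name[len("segments_"):], 36)


print(segments_file_name(1))    # segments_1
print(segments_file_name(36))   # segments_10
```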
+ The following table summarizes the names and extensions of the files in Lucene:
+
+ Name | Extension | Brief Description
+ Segments File | segments.gen, segments_N | Stores information about segments
+ Lock File | write.lock | The Write lock prevents multiple IndexWriters from writing to the same file.
+ Compound File | .cfs | An optional "virtual" file consisting of all the other index files for systems that frequently run out of file handles.
+ Compound File Entry table | .cfe | The "virtual" compound file's entry table holding all entries in the corresponding .cfs file (Since 3.4)
+ Fields | .fnm | Stores information about the fields
+ Field Index | .fdx | Contains pointers to field data
+ Field Data | .fdt | The stored fields for documents
+ Term Infos | .tis | Part of the term dictionary, stores term info
+ Term Info Index | .tii | The index into the Term Infos file
+ Frequencies | .frq | Contains the list of docs which contain each term along with frequency
+ Positions | .prx | Stores position information about where a term occurs in the index
+ Norms | .nrm | Encodes length and boost factors for docs and fields
+ Term Vector Index | .tvx | Stores offset into the document data file
+ Term Vector Documents | .tvd | Contains information about each document that has term vectors
+ Term Vector Fields | .tvf | The field level info about term vectors
+ Deleted Documents | .del | Info about what documents are deleted
+
+
+ The most primitive type is an eight-bit byte. Files are accessed
+ as sequences of bytes. All other data types are defined as sequences
+ of bytes, so file formats are byte-order independent.
+
+ 32-bit unsigned integers are written as four
+ bytes, high-order bytes first.
+
+ UInt32 --> <Byte>^4
+
+ 64-bit unsigned integers are written as eight
+ bytes, high-order bytes first.
+
+ UInt64 --> <Byte>^8
+
+ A variable-length format for positive integers is
+ defined where the high-order bit of each byte indicates whether more
+ bytes remain to be read. The low-order seven bits are appended as
+ increasingly more significant bits in the resulting integer value.
+ Thus values from zero to 127 may be stored in a single byte, values
+ from 128 to 16,383 may be stored in two bytes, and so on.
+
+ VInt Encoding Example
+
+ Value | First byte | Second byte | Third byte
+ 0 | 00000000 | |
+ 1 | 00000001 | |
+ 2 | 00000010 | |
+ ... | | |
+ 127 | 01111111 | |
+ 128 | 10000000 | 00000001 |
+ 129 | 10000001 | 00000001 |
+ 130 | 10000010 | 00000001 |
+ ... | | |
+ 16,383 | 11111111 | 01111111 |
+ 16,384 | 10000000 | 10000000 | 00000001
+ 16,385 | 10000001 | 10000000 | 00000001
+ ... | | |
+
+ This provides compression while still being
+ efficient to decode.
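+
+ The VInt rules above can be sketched in a few lines. This is an
+ illustrative encoder and decoder written from the description in this
+ document, not Lucene's own code; the function names are hypothetical.

```python
def encode_vint(value):
    """Encode a non-negative integer as a VInt byte string."""
    if value < 0:
        raise ValueError("VInt is defined for non-negative values only")
    out = bytearray()
    while value > 0x7F:
        # Emit the low seven bits and set the high bit: more bytes follow.
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)  # final byte, high bit clear
    return bytes(out)


def decode_vint(data, pos=0):
    """Decode one VInt starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        # Each byte contributes seven increasingly significant bits.
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


print(encode_vint(128).hex())     # 8001
print(encode_vint(16384).hex())   # 808001
```

+ The printed bytes match the rows for 128 and 16,384 in the example table.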
+ Lucene writes Unicode
+ character sequences as UTF-8 encoded bytes.
+
+ Lucene writes strings as UTF-8 encoded bytes.
+ First the length, in bytes, is written as a VInt,
+ followed by the bytes.
+
+ String --> VInt, Chars
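+
+ A minimal sketch of the String production, assuming the VInt encoding
+ defined earlier in this document. The helper names are hypothetical.

```python
def write_vint(value):
    """Encode a non-negative integer as a VInt (seven bits per byte)."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def write_string(text):
    """Encode text as: VInt byte length, then the UTF-8 bytes."""
    raw = text.encode("utf-8")
    return write_vint(len(raw)) + raw


def read_string(data, pos=0):
    """Decode one String at pos; return (text, next position)."""
    length = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        length |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return data[pos:pos + length].decode("utf-8"), pos + length


print(write_string("boy"))   # b'\x03boy'
```

+ Note that the length prefix counts bytes, not characters, so a
+ non-ASCII character contributes more than one to the length.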
+ In a couple of places Lucene stores a Map
+ String->String.
+
+ Map<String,String> --> Count<String,String>^Count
+ The files in this section exist one-per-index.
+
+ The active segments in the index are stored in the
+ segment info file, segments_N. There may be one or more
+ segments_N files in the index; however, the one with the largest
+ generation is the active one (when older segments_N files are
+ present it's because they temporarily cannot be deleted, or a
+ writer is in the process of committing, or a custom
+ IndexDeletionPolicy is in use). This file lists each
+ segment by name, has details about the separate
+ norms and deletion files, and also contains the
+ size of each segment.
+
+ As of 2.1, there is also a file segments.gen. This file contains the
+ current generation (the _N in segments_N) of the index. This is
+ used only as a fallback in case the current
+ generation cannot be accurately determined by
+ directory listing alone (as is the case for some
+ NFS clients with time-based directory cache
+ expiration). This file simply contains an Int32
+ version header (SegmentInfos.FORMAT_LOCKLESS = -2),
+ followed by the generation recorded as Int64,
+ written twice.
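+
+ The segments.gen layout described above might be read as in the sketch
+ below. It assumes Int32 and Int64 are written high-order byte first,
+ like the unsigned types defined in this document, and it treats a file
+ whose two generation copies disagree as unusable; that policy and the
+ function name are illustrative assumptions.

```python
import struct

FORMAT_LOCKLESS = -2  # version header value given in this document


def read_segments_gen(data):
    """Return the generation stored in segments.gen, or None if unusable."""
    if len(data) < 20:  # 4-byte Int32 header + two 8-byte Int64 values
        return None
    # Assumed big-endian: Int32 version, then the generation written twice.
    version, gen_a, gen_b = struct.unpack_from(">iqq", data, 0)
    if version != FORMAT_LOCKLESS or gen_a != gen_b:
        return None
    return gen_a


sample = struct.pack(">iqq", FORMAT_LOCKLESS, 7, 7)
print(read_segments_gen(sample))   # 7
```

+ Since the file is only a fallback, a reader would still compare this
+ value with what the directory listing shows.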
+ 3.1 Segments --> Format, Version, NameCounter, SegCount, <SegVersion, SegName, SegSize, DelGen, DocStoreOffset, [DocStoreSegment, DocStoreIsCompoundFile], HasSingleNormFile, NumField, NormGen^NumField, IsCompoundFile, DeletionCount, HasProx, Diagnostics, HasVectors>^SegCount, CommitUserData, Checksum
+
+ Format, NameCounter, SegCount, SegSize, NumField,
+ DocStoreOffset, DeletionCount --> Int32
+
+ Version, DelGen, NormGen, Checksum --> Int64
+
+ SegVersion, SegName, DocStoreSegment --> String
+
+ Diagnostics --> Map<String,String>
+
+ IsCompoundFile, HasSingleNormFile,
+ DocStoreIsCompoundFile, HasProx, HasVectors --> Int8
+
+ CommitUserData --> Map<String,String>
+
+ Format is -9 (SegmentInfos.FORMAT_DIAGNOSTICS).
+
+ Version counts how often the index has been
+ changed by adding or deleting documents.
+
+ NameCounter is used to generate names for new segment files.
+
+ SegVersion is the code version that created the segment.
+
+ SegName is the name of the segment, and is used as the file name prefix
+ for all of the files that compose the segment's index.
+
+ SegSize is the number of documents contained in the segment index.
+
+ DelGen is the generation count of the separate
+ deletes file. If this is -1, there are no
+ separate deletes. If it is 0, this is a pre-2.1
+ segment and you must check the filesystem for the
+ existence of _X.del. Anything above zero means
+ there are separate deletes (_X_N.del).
+
+ NumField is the size of the array for NormGen, or
+ -1 if there are no NormGens stored.
+
+ NormGen records the generation of the separate
+ norms files. If NumField is -1, there are no
+ normGens stored and they are all assumed to be 0
+ when the segment file was written pre-2.1 and all
+ assumed to be -1 when the segments file is 2.1 or
+ above. The generation then has the same meaning
+ as delGen (above).
+
+ IsCompoundFile records whether the segment is
+ written as a compound file or not. If this is -1,
+ the segment is not a compound file. If it is 1,
+ the segment is a compound file. Else it is 0,
+ which means we check the filesystem to see if _X.cfs
+ exists.
+
+ If HasSingleNormFile is 1, then the field norms are
+ written as a single joined file (with extension
+ .nrm); if it is 0 then each field's norms
+ are stored as separate .fN files. See
+ "Normalization Factors" below for details.
+
+ DocStoreOffset, DocStoreSegment,
+ DocStoreIsCompoundFile: If DocStoreOffset is -1,
+ this segment has its own doc store (stored fields
+ values and term vectors) files and DocStoreSegment
+ and DocStoreIsCompoundFile are not stored. In
+ this case all files for stored field values
+ (*.fdt and *.fdx) and term
+ vectors (*.tvf, *.tvd and
+ *.tvx) will be stored with this segment.
+ Otherwise, DocStoreSegment is the name of the
+ segment that has the shared doc store files;
+ DocStoreIsCompoundFile is 1 if that segment is
+ stored in compound file format (as a .cfx
+ file); and DocStoreOffset is the starting document
+ in the shared doc store files where this segment's
+ documents begin. In this case, this segment does
+ not store its own doc store files but instead
+ shares a single set of these files with other
+ segments.
+
+ Checksum contains the CRC32 checksum of all bytes
+ in the segments_N file up until the checksum.
+ This is used to verify integrity of the file on
+ opening the index.
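+
+ An integrity check along the lines of the Checksum description could
+ look like the sketch below. It assumes the checksum is the final eight
+ bytes of the file, stored as a big-endian Int64, and that "CRC32" means
+ the common CRC-32 computed by zlib; the function name is hypothetical.

```python
import struct
import zlib


def checksum_matches(segments_bytes):
    """Check the trailing Int64 against the CRC32 of all preceding bytes."""
    if len(segments_bytes) < 8:
        return False
    body, stored = segments_bytes[:-8], segments_bytes[-8:]
    expected = struct.unpack(">q", stored)[0]
    # Mask to 32 bits so the CRC is compared as an unsigned value.
    return (zlib.crc32(body) & 0xFFFFFFFF) == expected


body = b"illustrative segments payload"
good = body + struct.pack(">q", zlib.crc32(body) & 0xFFFFFFFF)
print(checksum_matches(good))                 # True
print(checksum_matches(b"X" + good[1:]))      # False
```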
+ DeletionCount records the number of deleted
+ documents in this segment.
+
+ HasProx is 1 if any fields in this segment have
+ position data (IndexOptions.DOCS_AND_FREQS_AND_POSITIONS); else, it's 0.
+
+ CommitUserData stores an optional user-supplied
+ opaque Map<String,String> that was passed to
+ IndexWriter's commit or prepareCommit, or
+ IndexReader's flush methods.
+
+ The Diagnostics Map is privately written by
+ IndexWriter, as a debugging aid, for each segment
+ it creates. It includes metadata like the current
+ Lucene version, OS, Java version, why the segment
+ was created (merge, flush, addIndexes), etc.
+
+ HasVectors is 1 if this segment stores term vectors,
+ else it's 0.
+ The write lock, which is stored in the index
+ directory by default, is named "write.lock". If
+ the lock directory is different from the index
+ directory then the write lock will be named
+ "XXXX-write.lock" where XXXX is a unique prefix
+ derived from the full path to the index directory.
+ When this file is present, a writer is currently
+ modifying the index (adding or removing
+ documents). This lock file ensures that only one
+ writer is modifying the index at a time.
+
+ A writer dynamically computes
+ the files that are deletable, instead, so no file
+ is written.
+
+ Starting with Lucene 1.4 the compound file format became default. This
+ is simply a container for all files described in the next section
+ (except for the .del file).
+
+ Compound Entry Table (.cfe) --> Version, FileCount, <FileName, DataOffset, DataLength>^FileCount
+
+ Compound (.cfs) --> FileData^FileCount
+
+ Version --> Int
+
+ FileCount --> VInt
+
+ DataOffset --> Long
+
+ DataLength --> Long
+
+ FileName --> String
+
+ FileData --> raw file data
+
+ The raw file data is the data from the individual files named above.
+
+ Starting with Lucene 2.3, doc store files (stored
+ field values and term vectors) can be shared in a
+ single set of files for more than one segment. When
+ compound file is enabled, these shared files will be
+ added into a single compound file (same format as
+ above) but with the extension .cfx.
+
+ The remaining files are all per-segment, and are
+ thus defined by suffix.
+
+
+ Field Info
+
+
+ Field names are
+ stored in the field info file, with suffix .fnm.
+
+ FieldInfos (.fnm) --> FNMVersion,FieldsCount, <FieldName, FieldBits>^FieldsCount
+
+ FNMVersion, FieldsCount --> VInt
+
+ FieldName --> String
+
+ FieldBits --> Byte
+
+ FNMVersion (added in 2.9) is -2 for indexes from 2.9 - 3.3. It is -3 for indexes in Lucene 3.4+.
+
+ Fields are numbered by their order in this file. Thus field zero is
+ the first field in the file, field one the next, and so on. Note that,
+ like document numbers, field numbers are segment relative.
+
+
+ Stored Fields
+
+
+ Stored fields are represented by two files:
+
+ The field index, or .fdx file.
+
+ This contains, for each document, a pointer to
+ its field data, as follows:
+
+ FieldIndex (.fdx) --> <FieldValuesPosition>^SegSize
+
+ FieldValuesPosition --> UInt64
+
+ This is used to find the location within the field data file of the
+ fields of a particular document. Because it contains fixed-length
+ data, this file may be easily randomly accessed. The position of
+ document n's field data is the UInt64 at n*8 in this file.
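+
+ The random-access lookup described above amounts to a single
+ fixed-offset read. The sketch below follows the n*8 rule exactly as
+ stated in this document and assumes no additional header precedes the
+ pointers; the function name is hypothetical.

```python
import struct


def field_values_position(fdx_bytes, doc_number):
    """Return the .fdt offset for a document: the UInt64 at doc_number * 8."""
    # ">Q" is a big-endian unsigned 64-bit integer (high-order bytes first).
    return struct.unpack_from(">Q", fdx_bytes, doc_number * 8)[0]


# A made-up .fdx body for three documents whose field data starts at
# offsets 0, 17 and 42 in the .fdt file.
fdx = struct.pack(">3Q", 0, 17, 42)
print(field_values_position(fdx, 2))   # 42
```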
+ The field data, or .fdt file.
+
+ This contains the stored fields of each document,
+ as follows:
+
+ FieldData (.fdt) --> <DocFieldData>^SegSize
+
+ DocFieldData --> FieldCount, <FieldNum, Bits, Value>^FieldCount
+
+ FieldCount --> VInt
+
+ FieldNum --> VInt
+
+ Bits --> Byte
+
+ Value --> String | BinaryValue | Int | Long (depending on Bits)
+
+ BinaryValue --> ValueSize, <Byte>^ValueSize
+
+ ValueSize --> VInt
+
+ The term dictionary is represented as two files:
+
+ The term infos, or .tis file.
+
+ TermInfoFile (.tis) --> TIVersion, TermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermInfos
+
+ TIVersion --> UInt32
+
+ TermCount --> UInt64
+
+ IndexInterval --> UInt32
+
+ SkipInterval --> UInt32
+
+ MaxSkipLevels --> UInt32
+
+ TermInfos --> <TermInfo>^TermCount
+
+ TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>
+
+ Term --> <PrefixLength, Suffix, FieldNum>
+
+ Suffix --> String
+
+ PrefixLength, DocFreq, FreqDelta, ProxDelta, SkipDelta --> VInt
+
+ This file is sorted by Term. Terms are
+ ordered first lexicographically (by UTF16
+ character code) by the term's field name,
+ and within that lexicographically (by
+ UTF16 character code) by the term's text.
+
+ TIVersion names the version of the format
+ of this file and is equal to TermInfosWriter.FORMAT_CURRENT.
+
+ Term text prefixes are shared. The PrefixLength is the number of initial
+ characters from the previous term which must be pre-pended to a
+ term's suffix in order to form the term's text. Thus, if the
+ previous term's text was "bone" and the term is "boy",
+ the PrefixLength is two and the suffix is "y".
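+
+ Prefix sharing can be illustrated with the short sketch below. It
+ works on Python string characters and ignores the field number, so it
+ shows only the idea and not the exact on-disk unit of PrefixLength;
+ the function names are hypothetical.

```python
def shared_prefix_length(previous, current):
    """Count the leading characters the two terms have in common."""
    count = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        count += 1
    return count


def encode_terms(sorted_terms):
    """Turn sorted term texts into (PrefixLength, Suffix) pairs."""
    pairs, previous = [], ""
    for term in sorted_terms:
        prefix = shared_prefix_length(previous, term)
        pairs.append((prefix, term[prefix:]))
        previous = term
    return pairs


def decode_terms(pairs):
    """Rebuild term texts from (PrefixLength, Suffix) pairs."""
    terms, previous = [], ""
    for prefix, suffix in pairs:
        previous = previous[:prefix] + suffix
        terms.append(previous)
    return terms


print(encode_terms(["bone", "boy"]))   # [(0, 'bone'), (2, 'y')]
```

+ Because terms are sorted, neighbouring entries tend to share long
+ prefixes, which is where the space saving comes from.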
+ FieldNum
+ determines the term's field, whose name is stored in the .fnm file.
+
+ DocFreq
+ is the count of documents which contain the term.
+
+ FreqDelta
+ determines the position of this term's TermFreqs within the .frq
+ file. In particular, it is the difference between the position of
+ this term's data in that file and the position of the previous
+ term's data (or zero, for the first term in the file).
+
+ ProxDelta
+ determines the position of this term's TermPositions within the .prx
+ file. In particular, it is the difference between the position of
+ this term's data in that file and the position of the previous
+ term's data (or zero, for the first term in the file). For fields
+ that omit position data, this will be 0 since
+ prox information is not stored.
+
+ SkipDelta determines the position of this
+ term's SkipData within the .frq file. In
+ particular, it is the number of bytes
+ after TermFreqs that the SkipData starts.
+ In other words, it is the length of the
+ TermFreqs data. SkipDelta is only stored
+ if DocFreq is not smaller than SkipInterval.
+
+ The term info index, or .tii file.
+
+ This contains every IndexInterval-th entry from the .tis
+ file, along with its location in the .tis file. This is
+ designed to be read entirely into memory and used to provide random
+ access to the .tis file.
+
+ The structure of this file is very similar to the
+ .tis file, with the addition of one item per record, the IndexDelta.
+
+ TermInfoIndex (.tii) --> TIVersion, IndexTermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermIndices
+
+ TIVersion --> UInt32
+
+ IndexTermCount --> UInt64
+
+ IndexInterval --> UInt32
+
+ SkipInterval --> UInt32
+
+ TermIndices --> <TermInfo, IndexDelta>^IndexTermCount
+
+ IndexDelta --> VLong
+
+ IndexDelta
+ determines the position of this term's TermInfo within the .tis file. In
+ particular, it is the difference between the position of this term's
+ entry in that file and the position of the previous term's entry.
+
+ SkipInterval is the fraction of TermDocs stored in skip tables. It is used to accelerate TermDocs.skipTo(int).
+ Larger values result in smaller indexes, greater acceleration, but fewer accelerable cases, while
+ smaller values result in bigger indexes, less acceleration (in case of a small value for MaxSkipLevels) and more
+ accelerable cases.
+
+ MaxSkipLevels is the maximum number of skip levels stored for each term in the .frq file. A low value results in
+ smaller indexes but less acceleration; a larger value results in slightly larger indexes but greater acceleration.
+ See format of .frq file for more information about skip levels.
+
+ The .frq file contains the lists of documents
+ which contain each term, along with the frequency of the term in that
+ document (except when frequencies are omitted: IndexOptions.DOCS_ONLY).
+
+ FreqFile (.frq) --> <TermFreqs, SkipData>^TermCount
+
+ TermFreqs --> <TermFreq>^DocFreq
+
+ TermFreq --> DocDelta[, Freq?]
+
+ SkipData --> <<SkipLevelLength, SkipLevel>^(NumSkipLevels-1), SkipLevel> <SkipDatum>
+
+ SkipLevel --> <SkipDatum>^(DocFreq/(SkipInterval^(Level + 1)))
+
+ SkipDatum --> DocSkip,PayloadLength?,FreqSkip,ProxSkip,SkipChildLevelPointer?
+
+ DocDelta,Freq,DocSkip,PayloadLength,FreqSkip,ProxSkip --> VInt
+
+ SkipChildLevelPointer --> VLong
+
+ TermFreqs
+ are ordered by term (the term is implicit, from the .tis file).
+
+ TermFreq
+ entries are ordered by increasing document number.
+
+ DocDelta: if frequencies are indexed, this determines both
+ the document number and the frequency. In
+ particular, DocDelta/2 is the difference between
+ this document number and the previous document
+ number (or zero when this is the first document in
+ a TermFreqs). When DocDelta is odd, the frequency
+ is one. When DocDelta is even, the frequency is
+ read as another VInt. If frequencies are omitted, DocDelta
+ contains the gap (not multiplied by 2) between
+ document numbers and no frequency information is
+ stored.
+
+ For example, the TermFreqs for a term which occurs
+ once in document seven and three times in document
+ eleven, with frequencies indexed, would be the following
+ sequence of VInts:
+
+ 15, 8, 3
+
+ If frequencies were omitted (IndexOptions.DOCS_ONLY) it would be this sequence
+ of VInts instead:
+
+ 7,4
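+
+ The DocDelta rules can be checked against the example with a small
+ sketch. It produces and consumes plain integers, each of which would
+ be written as a VInt; the function names and the use of None for an
+ omitted frequency are illustrative choices.

```python
def encode_term_freqs(postings, with_freqs=True):
    """postings: (doc_number, freq) pairs sorted by doc_number."""
    values, previous_doc = [], 0
    for doc, freq in postings:
        gap = doc - previous_doc
        previous_doc = doc
        if not with_freqs:
            values.append(gap)             # plain gap, no frequency
        elif freq == 1:
            values.append(gap * 2 + 1)     # odd: frequency is one
        else:
            values.extend([gap * 2, freq])  # even: frequency follows
    return values


def decode_term_freqs(values, with_freqs=True):
    """Inverse of encode_term_freqs; freq is None when omitted."""
    postings, doc, i = [], 0, 0
    while i < len(values):
        code = values[i]
        i += 1
        if not with_freqs:
            doc += code
            freq = None
        else:
            doc += code >> 1
            if code & 1:
                freq = 1
            else:
                freq = values[i]
                i += 1
        postings.append((doc, freq))
    return postings


print(encode_term_freqs([(7, 1), (11, 3)]))                    # [15, 8, 3]
print(encode_term_freqs([(7, 1), (11, 3)], with_freqs=False))  # [7, 4]
```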
+ DocSkip records the document number before every
+ SkipInterval-th document in TermFreqs.
+ If payloads are disabled for the term's field,
+ then DocSkip represents the difference from the
+ previous value in the sequence.
+ If payloads are enabled for the term's field,
+ then DocSkip/2 represents the difference from the
+ previous value in the sequence. If payloads are enabled
+ and DocSkip is odd,
+ then PayloadLength is stored indicating the length
+ of the last payload before the SkipInterval-th
+ document in TermPositions.
+ FreqSkip and ProxSkip record the position of every
+ SkipInterval-th entry in FreqFile and
+ ProxFile, respectively. File positions are
+ relative to the start of TermFreqs and Positions for the
+ first SkipDatum, and to the previous SkipDatum in the
+ sequence thereafter.
+
+ For example, if DocFreq=35 and SkipInterval=16,
+ then there are two SkipData entries, containing
+ the 15th and 31st document
+ numbers in TermFreqs. The first FreqSkip names
+ the number of bytes after the beginning of
+ TermFreqs that the 16th SkipDatum
+ starts, and the second the number of bytes after
+ that that the 32nd starts. The first
+ ProxSkip names the number of bytes after the
+ beginning of Positions that the 16th
+ SkipDatum starts, and the second the number of
+ bytes after that that the 32nd starts.
+
+ Each term can have multiple skip levels.
+ The number of skip levels for a term is NumSkipLevels = Min(MaxSkipLevels, floor(log(DocFreq)/log(SkipInterval))).
+ The number of SkipData entries for a skip level is DocFreq/(SkipInterval^(Level + 1)), where the lowest skip
+ level is Level=0.
+ Example: SkipInterval = 4, MaxSkipLevels = 2, DocFreq = 35. Then skip level 0 has 8 SkipData entries,
+ containing the 3rd, 7th, 11th, 15th, 19th, 23rd,
+ 27th, and 31st document numbers in TermFreqs. Skip level 1 has 2 SkipData entries, containing the
+ 15th and 31st document numbers in TermFreqs.
+ The SkipData entries on all upper levels > 0 contain a SkipChildLevelPointer referencing the corresponding SkipData
+ entry in level-1. In the example, entry 15 on level 1 has a pointer to entry 15 on level 0, and entry 31 on level 1
+ has a pointer to entry 31 on level 0.
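+
+ The skip-level arithmetic can be reproduced with the sketch below. It
+ counts levels with integer multiplication instead of floating-point
+ logarithms, which should give the same floor while avoiding rounding
+ concerns; the function names are hypothetical.

```python
def num_skip_levels(doc_freq, skip_interval, max_skip_levels):
    """Min(MaxSkipLevels, floor(log(DocFreq) / log(SkipInterval)))."""
    levels, span = 0, skip_interval
    # Each additional level needs at least skip_interval ** (level + 1) docs.
    while span <= doc_freq and levels < max_skip_levels:
        levels += 1
        span *= skip_interval
    return levels


def skip_entries(doc_freq, skip_interval, level):
    """Ordinals (as used in the example) of the documents kept on a level."""
    step = skip_interval ** (level + 1)
    return [k * step - 1 for k in range(1, doc_freq // step + 1)]


print(num_skip_levels(35, 4, 2))   # 2
print(skip_entries(35, 4, 0))      # [3, 7, 11, 15, 19, 23, 27, 31]
print(skip_entries(35, 4, 1))      # [15, 31]
```

+ Every entry on level 1 also appears on level 0, which is what lets an
+ upper-level entry point at its counterpart one level down.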
+
+ The .prx file contains the lists of positions that
+ each term occurs at within documents. Note that
+ fields omitting positional data do not store
+ anything into this file, and if all fields in the
+ index omit positional data then the .prx file will not
+ exist.
+
+ ProxFile (.prx) --> <TermPositions>^TermCount
+
+ TermPositions --> <Positions>^DocFreq
+
+ Positions --> <PositionDelta,Payload?>^Freq
+
+ Payload --> <PayloadLength?,PayloadData>
+
+ PositionDelta --> VInt
+
+ PayloadLength --> VInt
+
+ PayloadData --> byte^PayloadLength
+
+ TermPositions
+ are ordered by term (the term is implicit, from the .tis file).
+
+ Positions
+ entries are ordered by increasing document number (the document
+ number is implicit from the .frq file).
+
+ PositionDelta
+ is, if payloads are disabled for the term's field, the difference
+ between the position of the current occurrence in
+ the document and the previous occurrence (or zero, if this is the
+ first occurrence in this document).
+ If payloads are enabled for the term's field, then PositionDelta/2
+ is the difference between the current and the previous position. If
+ payloads are enabled and PositionDelta is odd, then PayloadLength is
+ stored, indicating the length of the payload at the current term position.
+
+ For example, the TermPositions for a
+ term which occurs as the fourth term in one document, and as the
+ fifth and ninth term in a subsequent document, would be the following
+ sequence of VInts (payloads disabled):
+
+ 4, 5, 4
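+
+ With payloads disabled, the position deltas restart in every document.
+ The sketch below reproduces the example; the function names are
+ hypothetical, and the per-document frequencies needed for decoding are
+ passed in directly, standing in for the values read from the .frq file.

```python
def encode_positions(positions_per_doc):
    """Flatten per-document position lists into PositionDelta values."""
    deltas = []
    for positions in positions_per_doc:
        previous = 0  # deltas restart at zero for each document
        for position in positions:
            deltas.append(position - previous)
            previous = position
    return deltas


def decode_positions(deltas, freqs):
    """Rebuild per-document positions; freqs gives occurrences per doc."""
    docs, i = [], 0
    for freq in freqs:
        position, positions = 0, []
        for _ in range(freq):
            position += deltas[i]
            i += 1
            positions.append(position)
        docs.append(positions)
    return docs


print(encode_positions([[4], [5, 9]]))      # [4, 5, 4]
print(decode_positions([4, 5, 4], [1, 2]))  # [[4], [5, 9]]
```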
+
+ PayloadData
+ is metadata associated with the current term position. If PayloadLength
+ is stored at the current position, then it indicates the length of this
+ Payload. If PayloadLength is not stored, then this Payload has the same
+ length as the Payload at the previous position.
+
+ There's a single .nrm file containing all norms:
+
+ AllNorms (.nrm) --> NormsHeader,<Norms>^NumFieldsWithNorms
+
+ Norms --> <Byte>^SegSize
+
+ NormsHeader --> 'N','R','M',Version
+
+ Version --> Byte
+
+ NormsHeader
+ has 4 bytes, last of which is the format version for this file, currently -1.
+
+ Each
+ byte encodes a floating point value. Bits 0-2 contain the 3-bit
+ mantissa, and bits 3-7 contain the 5-bit exponent.
+
+ These
+ are converted to an IEEE single float value as follows:
+ 1. If the byte is zero, use a zero float.
+ 2. Otherwise, set the sign bit of the float to zero;
+ 3. add 48 to the exponent and use this as the float's exponent;
+ 4. map the mantissa to the high-order 3 bits of the float's mantissa; and
+ 5. set the low-order 21 bits of the float's mantissa to zero.
+
+ A separate norm file is created when the norm values of an existing segment are modified.
+ When field N is modified, a separate norm file .sN
+ is created, to maintain the norm values for that field.
+
+ Separate norm files are created (when adequate) for both compound and non compound segments.
+
+ Term Vector support is optional on a field-by-field
+ basis. It consists of 3 files.
+
+ The Document Index or .tvx file.
+
+ For each document, this stores the offset
+ into the document data (.tvd) and field
+ data (.tvf) files.
+
+ DocumentIndex (.tvx) --> TVXVersion<DocumentPosition,FieldPosition>^NumDocs
+
+ TVXVersion --> Int (TermVectorsReader.CURRENT)
+
+ DocumentPosition --> UInt64 (offset in the .tvd file)
+
+ FieldPosition --> UInt64 (offset in the .tvf file)
+
+ The Document or .tvd file.
+
+ This contains, for each document, the number of fields, a list of the fields with
+ term vector info and finally a list of pointers to the field information in the .tvf
+ (Term Vector Fields) file.
+
+ Document (.tvd) --> TVDVersion<NumFields, FieldNums, FieldPositions>^NumDocs
+
+ TVDVersion --> Int (TermVectorsReader.FORMAT_CURRENT)
+
+ NumFields --> VInt
+
+ FieldNums --> <FieldNumDelta>^NumFields
+
+ FieldNumDelta --> VInt
+
+ FieldPositions --> <FieldPositionDelta>^(NumFields-1)
+
+ FieldPositionDelta --> VLong
+
+ The .tvd file is used to map out the fields that have term vectors stored and
+ where the field information is in the .tvf file.
+
+ The Field or .tvf file.
+
+ This file contains, for each field that has a term vector stored, a list of
+ the terms, their frequencies and, optionally, position and offset information.
+
+ Field (.tvf) --> TVFVersion<NumTerms, Position/Offset, TermFreqs>^NumFields
+
+ TVFVersion --> Int (TermVectorsReader.FORMAT_CURRENT)
+
+ NumTerms --> VInt
+
+ Position/Offset --> Byte
+
+ TermFreqs --> <TermText, TermFreq, Positions?, Offsets?>^NumTerms
+
+ TermText --> <PrefixLength, Suffix>
+
+ PrefixLength --> VInt
+
+ Suffix --> String
+
+ TermFreq --> VInt
+
+ Positions --> <VInt>^TermFreq
+
+ Offsets --> <VInt, VInt>^TermFreq
+
+Notes:
+ The .del file is
+ optional, and only exists when a segment contains deletions.
+
+ Although per-segment, this file is maintained exterior to compound segment files.
+
+ Deletions (.del) --> [Format],ByteCount,BitCount, Bits | DGaps (depending on Format)
+
+ Format,ByteCount,BitCount --> UInt32
+
+ Bits --> <Byte>^ByteCount
+
+ DGaps --> <DGap,NonzeroByte>^NonzeroBytesCount
+
+ DGap --> VInt
+
+ NonzeroByte --> Byte
+
+ Format
+ is optional. -1 indicates DGaps. A non-negative value indicates Bits, and that Format is excluded.
+
+ ByteCount
+ indicates the number of bytes in Bits. It is typically
+ (SegSize/8)+1.
+
+ BitCount
+ indicates the number of bits that are currently set in Bits.
+
+ Bits
+ contains one bit for each document indexed. When the bit
+ corresponding to a document number is set, that document is marked as
+ deleted. Bit ordering is from least to most significant. Thus, if
+ Bits contains two bytes, 0x00 and 0x02, then document 9 is marked as
+ deleted.
+
+ DGaps
+ represents sparse bit-vectors more efficiently than Bits.
+ It is made of DGaps on indexes of nonzero bytes in Bits,
+ and the nonzero bytes themselves. The number of nonzero bytes
+ in Bits (NonzeroBytesCount) is not stored.
+
+ For example,
+ if there are 8000 bits and only bits 10,12,32 are set,
+ DGaps would be used:
+
+ (VInt) 1 , (byte) 20 , (VInt) 3 , (Byte) 1
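+
+ Both representations can be illustrated with a short sketch. It
+ builds a Bits array with least-significant-bit-first ordering, as
+ described above, and derives the (DGap, NonzeroByte) pairs from it.
+ The function names are hypothetical, and the pairs are returned as
+ plain integers rather than encoded bytes.

```python
def make_bits(set_bits, bit_count):
    """Build a Bits array of (bit_count // 8) + 1 bytes with the given bits set."""
    bits = bytearray(bit_count // 8 + 1)
    for bit in set_bits:
        bits[bit >> 3] |= 1 << (bit & 7)  # least significant bit first
    return bytes(bits)


def is_deleted(bits, doc_number):
    """True when the bit for doc_number is set."""
    return bool(bits[doc_number >> 3] & (1 << (doc_number & 7)))


def encode_dgaps(bits):
    """Return (DGap, NonzeroByte) pairs for every nonzero byte in bits."""
    pairs, last_index = [], 0
    for index, byte in enumerate(bits):
        if byte:
            # DGap is the distance from the previous nonzero byte's index.
            pairs.append((index - last_index, byte))
            last_index = index
    return pairs


print(is_deleted(bytes([0x00, 0x02]), 9))             # True
print(encode_dgaps(make_bits([10, 12, 32], 8000)))    # [(1, 20), (3, 1)]
```

+ For 8000 bits with three set, two pairs replace roughly a thousand
+ mostly-zero bytes.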
+
+ When referring to term numbers, Lucene's current
+ implementation uses a Java int to hold the
+ term index, which means the maximum number of unique
+ terms in any single index segment is ~2.1 billion times
+ the term index interval (default 128) = ~274 billion.
+ This is technically not a limitation of the index file
+ format, just of Lucene's current implementation.
+
+ Similarly, Lucene uses a Java int to refer
+ to document numbers, and the index file format uses an
+ Int32 on-disk to store document numbers.
+ This is a limitation of both the index file format and
+ the current implementation. Eventually these should be
+ replaced with either UInt64 values, or
+ better yet, VInt values which have no
+ limit.
+