
KeyfileIncIndex Class Reference

#include <KeyfileIncIndex.hpp>

Inheritance diagram for KeyfileIncIndex (image not reproduced): KeyfileIncIndex derives from PushIndex and Index.

Public Methods

 KeyfileIncIndex (const string &prefix, int cachesize=128000000, DOCID_T startdocid=1)
 Instantiate with index name without extension.

 KeyfileIncIndex ()
 New empty one for index manager to use.

 ~KeyfileIncIndex ()
 Clean up.

void setName (const string &prefix)
 sets the name for this index

bool beginDoc (const DocumentProps *dp)
 the beginning of a new document

bool addTerm (const Term &t)
 adding a term to the current document

void endDoc (const DocumentProps *dp)
 signify the end of current document

virtual void endDoc (const DocumentProps *dp, const string &mgr)
 signify the end of current document

void endCollection (const CollectionProps *cp)
 signify the end of this collection.

void setDocManager (const string &mgrID)
 set the document manager to use for succeeding documents

void setMesgStream (ostream *lemStream)
 set the mesg stream

void addKnownTerm (TERMID_T termID, LOC_T position)
 update data for an already seen term

TERMID_T addUnknownTerm (const InvFPTerm *term)
 initialize data for a previously unseen term.

TERMID_T addUncachedTerm (const InvFPTerm *term)
 update data for a term that is not cached in the term cache.

Open index
bool open (const string &indexName)
 Open previously created Index with given prefix.

Spelling and index conversion
TERMID_T term (const TERM_T &word) const
 Convert a term spelling to a termID.

const TERM_T term (TERMID_T termID) const
 Convert a termID to its spelling.

DOCID_T document (const EXDOCID_T &docIDStr) const
 Convert a spelling to docID.

const EXDOCID_T document (DOCID_T docID) const
 Convert a docID to its spelling.

const DocumentManager * docManager (DOCID_T docID) const
 The document manager for this document.

Summary counts
COUNT_T docCount () const
 Total count (i.e., number) of documents in collection.

COUNT_T termCountUnique () const
 Total count of unique terms in collection.

COUNT_T termCount (TERMID_T termID) const
 Total counts of a term in collection.

COUNT_T termCount () const
 Total counts of all terms in collection.

float docLengthAvg () const
 Average document length.

COUNT_T docCount (TERMID_T termID) const
 Total count of documents containing a given term.

COUNT_T docLength (DOCID_T docID) const
 Total count of terms in a document, possibly including stop words.

virtual COUNT_T totaldocLength (DOCID_T docID) const
 Total count of terms in a document, always including stop words.

COUNT_T docLengthCounted (DOCID_T docID) const
 Total count of terms in given document, not including stop words.

Index entry access
DocInfoList * docInfoList (TERMID_T termID) const
 doc entries in a term index,
See also:
DocList, InvFPDocList


TermInfoList * termInfoList (DOCID_T docID) const
 word entries in a document index (bag of words),
See also:
TermList


TermInfoList * termInfoListSeq (DOCID_T docID) const
 word entries in a document index (sequence of words),
See also:
TermList



Protected Methods

bool tryOpen ()
 try to open an existing index

void writeTOC ()
 write out the table of contents file.

void writeCache (bool lastRun=false)
 write out the cache

void lastWriteCache ()
 final run write out of cache

void mergeCacheSegments ()
 out-of-tree cache management: combine segments into a single segment

void writeCacheSegment ()
 write out segments

void writeDocMgrIDs ()
 write out document manager ids

int docMgrID (const string &mgr)
 returns the internal id of the given docmgr; if not already registered, mgr will be added

virtual void doendDoc (const DocumentProps *dp, int mgrid)
 handle end of document token.

void openDBs ()
 open the database files

void openSegments ()
 open the segment files

void createDBs ()
 create the database files

void fullToc ()
 read in the entire table of contents

bool docMgrIDs ()
 read in document manager internal and external ids map

record fetchDocumentRecord (DOCID_T key) const
 retrieve a document record.

void addDocumentLookup (DOCID_T documentKey, const char *documentName)
 store a document record

void addTermLookup (TERMID_T termKey, const char *termSpelling)
 store a term record

void addGeneralLookup (Keyfile &numberNameIndex, Keyfile &nameNumberIndex, TERMID_T number, const char *name)
 store a record

InvFPDocList * internalDocInfoList (TERMID_T termID) const
 retrieve and construct the DocInfoList for a term.

void _updateTermlist (InvFPDocList *curlist, LOC_T position)
 add a position to a DocInfoList

int _cacheSize ()
 total memory used by cache

void _computeMemoryBounds (int memorySize)
 cache size limits based on cachesize parameter to constructor

void _resetEstimatePoint ()
 Approximate how many updates to collect before flushing the cache.


Protected Attributes

int listlengths
 how long all the lists are

COUNT_T * counts
 array to hold all the overall count stats of this db

std::vector< std::string > names
 array to hold all the names for files we need for this db

float aveDocLen
 the average document length in this index

vector< std::string > docmgrs
 list of document managers

ostream * msgstream
 Lemur code messages stream.

Keyfile invlookup
 termID -> TermData (term statistics and inverted list segment offsets)

Keyfile dIDs
 documentName -> documentID

Keyfile dSTRs
 documentID -> documentName

Keyfile tIDs
 termName -> termID

Keyfile tSTRs
 termID -> termName

File dtlookup
 document statistics (document length, etc.)

ReadBuffer * dtlookupReadBuffer
 read buffer for dtlookup

File writetlist
 filestream for writing the list of located terms

char termKey [MAX_TERM_LENGTH]
 buffers for term() lookup functions

char docKey [MAX_DOCID_LENGTH]
 buffers for document() lookup functions

int _listsSize
 memory for use by inverted list buffers

int _memorySize
 upper bound for memory use

std::string name
 the prefix name

vector< InvFPDocList * > invertlists
 array of pointers to doclists

vector< LocatedTerm > termlist
 list of terms and their locations in this document

int curdocmgr
 the current docmanager to use

vector< DocumentManager * > docMgrs
 list of document manager objects

TermCache _cache
 cache of term entries

std::vector< File * > _segments
 out-of-tree segments for data

TERMID_T _largestFlushedTermID
 highest term id flushed to disk.

int _estimatePoint
 invertlists point where we should next check on the cache size

bool ignoreDoc
 are we in a bad document state?


Detailed Description

KeyfileIncIndex builds an index by assigning termids and docids, tracking the locations of terms within documents, and tracking the terms within each document. It expects each DocumentProps to carry the total number of terms that were in the document, and it expects all stopping and stemming (if any) to occur before a term is passed in. If used with an existing index, new documents are added incrementally. Records are stored in keyfile b-trees. KeyfileIncIndex also provides the Index API for using the index.
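
The push-style calling sequence described above (begin a document, add its terms, end the document) can be illustrated with a small in-memory stand-in. This is only a sketch under stated assumptions: ToyPushIndex and all of its members are hypothetical, it takes plain strings instead of DocumentProps and Term objects, and it keeps everything in std::map rather than keyfile b-trees. It is not the Lemur implementation.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for an incremental push-style index; not the Lemur class.
class ToyPushIndex {
public:
  // Start a new document identified by its external name.
  bool beginDoc(const std::string& externalId) {
    if (inDoc_) return false;         // previous document was never closed
    docNames_.push_back(externalId);  // docids are assigned sequentially from 1
    docLengths_.push_back(0);
    position_ = 0;
    inDoc_ = true;
    return true;
  }

  // Add one (already stopped and stemmed) term occurrence to the open document.
  bool addTerm(const std::string& spelling) {
    if (!inDoc_) return false;
    int termId = lookupOrAssign(spelling);
    int docId = static_cast<int>(docNames_.size());
    postings_[termId][docId].push_back(position_++);
    ++docLengths_.back();
    return true;
  }

  void endDoc() { inDoc_ = false; }

  int docCount() const { return static_cast<int>(docNames_.size()); }
  int termCountUnique() const { return static_cast<int>(termIds_.size()); }

  int docLength(int docId) const {
    assert(docId >= 1 && docId <= docCount());
    return docLengths_[docId - 1];
  }

  // Positions of a term within one document (empty if the term is absent).
  std::vector<int> positions(const std::string& spelling, int docId) const {
    auto t = termIds_.find(spelling);
    if (t == termIds_.end()) return {};
    auto p = postings_.find(t->second);
    if (p == postings_.end()) return {};
    auto d = p->second.find(docId);
    return d == p->second.end() ? std::vector<int>{} : d->second;
  }

private:
  // Return the id of a known term, or assign the next free id to a new one.
  int lookupOrAssign(const std::string& spelling) {
    auto it = termIds_.find(spelling);
    if (it != termIds_.end()) return it->second;
    int id = static_cast<int>(termIds_.size()) + 1;
    termIds_[spelling] = id;
    return id;
  }

  bool inDoc_ = false;
  int position_ = 0;
  std::vector<std::string> docNames_;
  std::vector<int> docLengths_;
  std::map<std::string, int> termIds_;
  std::map<int, std::map<int, std::vector<int>>> postings_;  // term -> doc -> positions
};
```

In this sketch a caller would invoke beginDoc once, addTerm for each token, and endDoc at the end; addTerm returns false when no document is open.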


Constructor & Destructor Documentation

KeyfileIncIndex::KeyfileIncIndex (const string & prefix, int cachesize = 128000000, DOCID_T startdocid = 1)
 

Instantiate with index name without extension. Optionally pass in cachesize and starting document id number.

KeyfileIncIndex::KeyfileIncIndex ()
 

New empty one for index manager to use.

KeyfileIncIndex::~KeyfileIncIndex ()
 

Clean up.


Member Function Documentation

int KeyfileIncIndex::_cacheSize () [protected]
 

total memory used by cache

void KeyfileIncIndex::_computeMemoryBounds (int memorySize) [protected]
 

cache size limits based on cachesize parameter to constructor

void KeyfileIncIndex::_resetEstimatePoint () [protected]
 

Approximate how many updates to collect before flushing the cache.
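
One plausible reading of the estimate point is that the cache size is measured only occasionally, after a predicted number of updates, rather than on every insertion. The sketch below illustrates that idea only. BoundedPostingBuffer, its byte accounting, and its flush behavior are all hypothetical and are not taken from the Lemur source.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical write buffer; names and policy are illustrative, not Lemur's.
class BoundedPostingBuffer {
public:
  explicit BoundedPostingBuffer(std::size_t memoryBound) : bound_(memoryBound) {
    assert(memoryBound >= sizeof(int));  // room for at least one position
    resetEstimatePoint();
  }

  // Record one term occurrence; measure memory only at the estimate point.
  void add(int termId, int position) {
    pending_[termId].push_back(position);
    ++pendingCount_;
    ++updates_;
    if (updates_ >= estimatePoint_) {
      if (bytesUsed() >= bound_) flush();
      resetEstimatePoint();
    }
  }

  std::size_t flushCount() const { return flushes_; }
  std::size_t pendingUpdates() const { return pendingCount_; }

private:
  // Rough accounting: payload bytes only, ignoring container overhead.
  std::size_t bytesUsed() const { return pendingCount_ * sizeof(int); }

  // Guess how many more updates fit before the bound could be reached.
  void resetEstimatePoint() {
    std::size_t used = bytesUsed();
    std::size_t room = used < bound_ ? (bound_ - used) / sizeof(int) : 0;
    estimatePoint_ = updates_ + (room > 0 ? room : 1);
  }

  // Stand-in for writing a segment to disk: just drop the buffered data.
  void flush() {
    pending_.clear();
    pendingCount_ = 0;
    ++flushes_;
  }

  std::size_t bound_;
  std::size_t updates_ = 0;
  std::size_t estimatePoint_ = 0;
  std::size_t pendingCount_ = 0;
  std::size_t flushes_ = 0;
  std::map<int, std::vector<int>> pending_;  // term -> buffered positions
};
```

The design choice shown here is that the size check is skipped until the predicted point is reached, which keeps the per-update cost low at the price of a slightly imprecise bound.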

void KeyfileIncIndex::_updateTermlist (InvFPDocList * curlist, LOC_T position) [protected]
 

add a position to a DocInfoList

void KeyfileIncIndex::addDocumentLookup (DOCID_T documentKey, const char * documentName) [protected]
 

store a document record

void KeyfileIncIndex::addGeneralLookup (Keyfile & numberNameIndex, Keyfile & nameNumberIndex, TERMID_T number, const char * name) [protected]
 

store a record
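
The signature takes two keyfiles, one keyed by number and one keyed by name, which suggests that each record is stored in both directions (compare the dIDs/dSTRs and tIDs/tSTRs attributes). A minimal sketch of that two-way lookup follows. TwoWayLookup is hypothetical, std::map stands in for the Keyfile b-tree, and the "0 means not found" convention is an assumption made for this example.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a pair of keyfiles; std::map replaces the b-tree.
struct TwoWayLookup {
  std::map<int, std::string> numberToName;
  std::map<std::string, int> nameToNumber;

  // Store one record in both directions so either key can be resolved later.
  void add(int number, const std::string& name) {
    assert(number > 0);  // assumption: 0 is reserved for "not found"
    numberToName[number] = name;
    nameToNumber[name] = number;
  }

  // Returns 0 when the name is unknown (assumed convention).
  int idOf(const std::string& name) const {
    auto it = nameToNumber.find(name);
    return it == nameToNumber.end() ? 0 : it->second;
  }

  // Returns an empty string when the number is unknown (assumed convention).
  std::string nameOf(int number) const {
    auto it = numberToName.find(number);
    return it == numberToName.end() ? std::string() : it->second;
  }
};
```

Keeping both maps in step inside a single add call is what lets lookups such as term(word) and term(termID) answer from either side.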

void KeyfileIncIndex::addKnownTerm (TERMID_T termID, LOC_T position)
 

update data for an already seen term

bool KeyfileIncIndex::addTerm (const Term & t) [virtual]
 

adding a term to the current document

Implements PushIndex.

void KeyfileIncIndex::addTermLookup (TERMID_T termKey, const char * termSpelling) [protected]
 

store a term record

TERMID_T KeyfileIncIndex::addUncachedTerm (const InvFPTerm * term)
 

update data for a term that is not cached in the term cache.

TERMID_T KeyfileIncIndex::addUnknownTerm (const InvFPTerm * term)
 

initialize data for a previously unseen term.

bool KeyfileIncIndex::beginDoc (const DocumentProps * dp) [virtual]
 

the beginning of a new document

Implements PushIndex.

void KeyfileIncIndex::createDBs () [protected]
 

create the database files

COUNT_T KeyfileIncIndex::docCount (TERMID_T termID) const [virtual]


Total count of documents containing a given term.

Implements Index.

COUNT_T KeyfileIncIndex::docCount () const [inline, virtual]
 

Total count (i.e., number) of documents in collection.

Implements Index.

DocInfoList * KeyfileIncIndex::docInfoList (TERMID_T termID) const [virtual]
 

doc entries in a term index,

See also:
DocList, InvFPDocList

Implements Index.

COUNT_T KeyfileIncIndex::docLength (DOCID_T docID) const [virtual]


Total count of terms in a document, possibly including stop words.

Implements Index.

float KeyfileIncIndex::docLengthAvg () [virtual]
 

Average document length.

Implements Index.

COUNT_T KeyfileIncIndex::docLengthCounted (DOCID_T docID) const
 

Total count of terms in given document, not including stop words.

const DocumentManager * KeyfileIncIndex::docManager (DOCID_T docID) const [virtual]
 

The document manager for this document.

Reimplemented from Index.

int KeyfileIncIndex::docMgrID (const string & mgr) [protected]


Returns the internal id of the given docmgr. If not already registered, mgr will be added.
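
The "return the id, registering the name first if needed" behavior can be sketched as a small get-or-register helper. ManagerRegistry is hypothetical, and using the vector position as the internal id is an assumption made for illustration, not a statement about how KeyfileIncIndex numbers its document managers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical registry; the vector index doubles as the internal id.
class ManagerRegistry {
public:
  // Return the id of `mgr`, appending it first if it has not been seen.
  int idFor(const std::string& mgr) {
    for (std::size_t i = 0; i < names_.size(); ++i) {
      if (names_[i] == mgr) return static_cast<int>(i);
    }
    names_.push_back(mgr);
    return static_cast<int>(names_.size()) - 1;
  }

  // Map an internal id back to the registered name.
  const std::string& nameOf(int id) const {
    assert(id >= 0 && id < static_cast<int>(names_.size()));
    return names_[id];
  }

  std::size_t size() const { return names_.size(); }

private:
  std::vector<std::string> names_;
};
```

Calling idFor twice with the same name returns the same id and leaves the registry size unchanged, which is the property the description above relies on.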

bool KeyfileIncIndex::docMgrIDs () [protected]
 

read in document manager internal and external ids map

const EXDOCID_T KeyfileIncIndex::document (DOCID_T docID) const [virtual]
 

Convert a docID to its spelling.

Implements Index.

DOCID_T KeyfileIncIndex::document (const EXDOCID_T & docIDStr) const [virtual]
 

Convert a spelling to docID.

Implements Index.

void KeyfileIncIndex::doendDoc (const DocumentProps * dp, int mgrid) [protected, virtual]
 

handle end of document token.

void KeyfileIncIndex::endCollection (const CollectionProps * cp) [virtual]
 

signify the end of this collection.

Implements PushIndex.

void KeyfileIncIndex::endDoc (const DocumentProps * dp, const string & mgr) [virtual]
 

signify the end of current document

void KeyfileIncIndex::endDoc (const DocumentProps * dp) [virtual]
 

signify the end of current document

Implements PushIndex.

record KeyfileIncIndex::fetchDocumentRecord (DOCID_T key) const [protected]
 

retrieve a document record.

void KeyfileIncIndex::fullToc () [protected]


read in the entire table of contents

InvFPDocList * KeyfileIncIndex::internalDocInfoList (TERMID_T termID) const [protected]
 

retrieve and construct the DocInfoList for a term.

void KeyfileIncIndex::lastWriteCache () [protected]
 

final run write out of cache

void KeyfileIncIndex::mergeCacheSegments () [protected]


out-of-tree cache management: combine segments into a single segment

bool KeyfileIncIndex::open (const string & indexName) [virtual]
 

Open previously created Index with given prefix.

Implements Index.

void KeyfileIncIndex::openDBs () [protected]
 

open the database files

void KeyfileIncIndex::openSegments () [protected]
 

open the segment files

void KeyfileIncIndex::setDocManager (const string & mgrID) [virtual]
 

set the document manager to use for succeeding documents

Implements PushIndex.

void KeyfileIncIndex::setMesgStream (ostream * lemStream)
 

set the mesg stream

void KeyfileIncIndex::setName (const string & prefix)
 

sets the name for this index

const TERM_T KeyfileIncIndex::term (TERMID_T termID) const [virtual]
 

Convert a termID to its spelling.

Implements Index.

TERMID_T KeyfileIncIndex::term (const TERM_T & word) const [virtual]
 

Convert a term spelling to a termID.

Implements Index.

COUNT_T KeyfileIncIndex::termCount () const [inline, virtual]
 

Total counts of all terms in collection.

Implements Index.

COUNT_T KeyfileIncIndex::termCount (TERMID_T termID) const [virtual]
 

Total counts of a term in collection.

Implements Index.

COUNT_T KeyfileIncIndex::termCountUnique () const [inline, virtual]
 

Total count of unique terms in collection.

Implements Index.

TermInfoList * KeyfileIncIndex::termInfoList (DOCID_T docID) const [virtual]
 

word entries in a document index (bag of words),

See also:
TermList

Implements Index.

TermInfoList * KeyfileIncIndex::termInfoListSeq (DOCID_T docID) const [virtual]
 

word entries in a document index (sequence of words),

See also:
TermList

Reimplemented from Index.

COUNT_T KeyfileIncIndex::totaldocLength (DOCID_T docID) const [virtual]


Total count of terms in a document, always including stop words.

bool KeyfileIncIndex::tryOpen () [protected]
 

try to open an existing index

void KeyfileIncIndex::writeCache (bool lastRun = false) [protected]
 

write out the cache

void KeyfileIncIndex::writeCacheSegment () [protected]
 

write out segments

void KeyfileIncIndex::writeDocMgrIDs () [protected]
 

write out document manager ids

void KeyfileIncIndex::writeTOC () [protected]
 

write out the table of contents file.


Member Data Documentation

TermCache KeyfileIncIndex::_cache [protected]
 

cache of term entries

int KeyfileIncIndex::_estimatePoint [protected]
 

invertlists point where we should next check on the cache size

TERMID_T KeyfileIncIndex::_largestFlushedTermID [protected]
 

highest term id flushed to disk.

int KeyfileIncIndex::_listsSize [protected]
 

memory for use by inverted list buffers

int KeyfileIncIndex::_memorySize [protected]
 

upper bound for memory use

std::vector<File*> KeyfileIncIndex::_segments [protected]
 

out-of-tree segments for data

float KeyfileIncIndex::aveDocLen [protected]
 

the average document length in this index

COUNT_T* KeyfileIncIndex::counts [protected]
 

array to hold all the overall count stats of this db

int KeyfileIncIndex::curdocmgr [protected]
 

the current docmanager to use

Keyfile KeyfileIncIndex::dIDs [protected]
 

documentName -> documentID

char KeyfileIncIndex::docKey[MAX_DOCID_LENGTH] [protected]
 

buffers for document() lookup functions

vector<DocumentManager*> KeyfileIncIndex::docMgrs [protected]
 

list of document manager objects

vector<std::string> KeyfileIncIndex::docmgrs [protected]
 

list of document managers

Keyfile KeyfileIncIndex::dSTRs [protected]
 

documentID -> documentName

File KeyfileIncIndex::dtlookup [protected]
 

document statistics (document length, etc.)

ReadBuffer* KeyfileIncIndex::dtlookupReadBuffer [protected]
 

read buffer for dtlookup

bool KeyfileIncIndex::ignoreDoc [protected]
 

are we in a bad document state?

vector<InvFPDocList*> KeyfileIncIndex::invertlists [protected]
 

array of pointers to doclists

Keyfile KeyfileIncIndex::invlookup [protected]
 

termID -> TermData (term statistics and inverted list segment offsets)

int KeyfileIncIndex::listlengths [protected]
 

how long all the lists are

ostream* KeyfileIncIndex::msgstream [protected]
 

Lemur code messages stream.

std::string KeyfileIncIndex::name [protected]
 

the prefix name

std::vector<std::string> KeyfileIncIndex::names [protected]
 

array to hold all the names for files we need for this db

char KeyfileIncIndex::termKey[MAX_TERM_LENGTH] [protected]
 

buffers for term() lookup functions

vector<LocatedTerm> KeyfileIncIndex::termlist [protected]
 

list of terms and their locations in this document

Keyfile KeyfileIncIndex::tIDs [protected]
 

termName -> termID

Keyfile KeyfileIncIndex::tSTRs [protected]
 

termID -> termName

File KeyfileIncIndex::writetlist [protected]
 

filestream for writing the list of located terms; mutable for the index access mode of the Index API (not PushIndex)


The documentation for this class was generated from the following files:
Generated on Wed Nov 3 12:59:43 2004 for Lemur Toolkit by doxygen 1.2.18