VTK
vtkTokenizer Class Reference

Converts a document collection into a term collection.

#include <vtkTokenizer.h>


Public Types

typedef vtkTableAlgorithm Superclass
typedef std::pair< vtkUnicodeString::value_type, vtkUnicodeString::value_type > DelimiterRange
typedef std::vector< DelimiterRange > DelimiterRanges

Public Member Functions

virtual const char * GetClassName ()
virtual int IsA (const char *type)
void PrintSelf (ostream &os, vtkIndent indent)
void AddDroppedDelimiters (vtkUnicodeString::value_type begin, vtkUnicodeString::value_type end)
void AddDroppedDelimiters (const DelimiterRanges &ranges)
void AddKeptDelimiters (vtkUnicodeString::value_type begin, vtkUnicodeString::value_type end)
void ClearDroppedDelimiters ()
void ClearKeptDelimiters ()
void AddKeptDelimiters (const DelimiterRanges &ranges)
void DropPunctuation ()
void DropWhitespace ()
void KeepPunctuation ()
void KeepWhitespace ()
void KeepLogosyllabic ()

Static Public Member Functions

static vtkTokenizer * New ()
static int IsTypeOf (const char *type)
static vtkTokenizer * SafeDownCast (vtkObject *o)
static const DelimiterRanges Punctuation ()
static const DelimiterRanges Whitespace ()
static const DelimiterRanges Logosyllabic ()

Protected Member Functions

 vtkTokenizer ()
 ~vtkTokenizer ()
int FillInputPortInformation (int port, vtkInformation *info)
virtual int RequestData (vtkInformation *request, vtkInformationVector **inputVector, vtkInformationVector *outputVector)

Detailed Description

Converts a document collection into a term collection.

Given an artifact table containing text documents, splits each document into its component tokens, producing a feature table containing the results.

Tokenization is performed by splitting input text into tokens based on character delimiters. Delimiters are divided into two categories: "dropped" and "kept". "Dropped" delimiters are discarded from the output, while "kept" delimiters are retained in the output as individual tokens. Initially, vtkTokenizer has no delimiters defined, so you must set some delimiters before use.

Users can reset and append to the lists of delimiters for each category. Delimiters are specified as half-open ranges of Unicode code points. This makes it easy to tokenize logosyllabic scripts such as Chinese, Korean, and Japanese by specifying an entire range of logograms as "kept" delimiters, so that individual glyphs become tokens.
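The dropped/kept behaviour described above can be illustrated without VTK. The following is a minimal standalone sketch, not the class's actual implementation: Range, Ranges, Token, Contains, and Tokenize are hypothetical names, char32_t stands in for vtkUnicodeString::value_type, and treating a code point that appears in both sets as dropped is an assumption made only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for DelimiterRange / DelimiterRanges:
// each pair is a half-open range [first, second) of code points.
using Range = std::pair<char32_t, char32_t>;
using Ranges = std::vector<Range>;

struct Token
{
  std::size_t begin;   // offset of the first code point
  std::size_t end;     // one past the last code point
  std::u32string text;
};

// True if c falls inside any of the half-open ranges.
inline bool Contains(const Ranges& ranges, char32_t c)
{
  for (const Range& r : ranges)
  {
    assert(r.first <= r.second); // a range must not be inverted
    if (r.first <= c && c < r.second)
      return true;
  }
  return false;
}

// Splits text at delimiters. Dropped delimiters vanish from the output;
// kept delimiters are emitted as single-character tokens.
// Assumption: a code point in both sets is treated as dropped.
inline std::vector<Token> Tokenize(
  const std::u32string& text, const Ranges& dropped, const Ranges& kept)
{
  std::vector<Token> tokens;
  std::size_t start = 0; // start of the pending non-delimiter run

  // Emits the pending run [start, stop) if it is non-empty.
  auto flush = [&](std::size_t stop) {
    if (stop > start)
      tokens.push_back({start, stop, text.substr(start, stop - start)});
  };

  for (std::size_t i = 0; i < text.size(); ++i)
  {
    const bool is_dropped = Contains(dropped, text[i]);
    const bool is_kept = !is_dropped && Contains(kept, text[i]);
    if (!is_dropped && !is_kept)
      continue; // ordinary character: extend the pending run

    flush(i);
    if (is_kept)
      tokens.push_back({i, i + 1, text.substr(i, 1)});
    start = i + 1;
  }
  flush(text.size());
  return tokens;
}
```

With the space character dropped and both the full stop and the CJK Unified Ideographs block [0x4E00, 0xA000) kept, the six-code-point input "ab 中文." yields four tokens: "ab", "中", "文", and ".". Each ideograph becomes its own token because every code point in a kept range is emitted individually.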

Inputs:
Input port 0: (required) A vtkTable containing zero-to-many "documents", with one document per table row, a vtkIdTypeArray column containing document ids, and a vtkUnicodeStringArray column containing the contents of each document.
Input port 1: (optional) A vtkTable containing zero-to-many document ranges to be processed, with one range per table row, a vtkIdTypeArray column containing document ids, a vtkIdTypeArray column containing begin offsets, and a vtkIdTypeArray column containing end offsets. If input port 1 is left unconnected, the filter will automatically process the entire contents of every input document.

Outputs:
Output port 0: A vtkTable containing "document", "begin", "end", "type", and "text" columns.
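
The fallback for an unconnected input port 1 can be pictured with a small standalone sketch. It does not use VTK, and several details are assumptions made for illustration only: DocumentRange and RangesToProcess are hypothetical names, offsets are taken to be half-open and counted in code points, document ids are taken to be row indices, and a null pointer models the unconnected port.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical row of the optional range table: a document id plus
// begin/end offsets (assumed half-open, counted in code points).
struct DocumentRange
{
  long long document;
  std::size_t begin;
  std::size_t end;
};

// Returns the ranges to tokenize. A null `requested` pointer models an
// unconnected port 1, in which case every document is covered in full.
// Assumption: document ids are row indices into `documents`.
inline std::vector<DocumentRange> RangesToProcess(
  const std::vector<std::u32string>& documents,
  const std::vector<DocumentRange>* requested)
{
  std::vector<DocumentRange> result;
  if (requested == nullptr)
  {
    for (std::size_t i = 0; i < documents.size(); ++i)
      result.push_back({static_cast<long long>(i), 0, documents[i].size()});
    return result;
  }
  for (const DocumentRange& r : *requested)
  {
    assert(r.begin <= r.end); // offsets must not be inverted
    result.push_back(r);
  }
  return result;
}
```

Passing an empty range table is deliberately different from passing a null pointer in this sketch: an empty table selects nothing, whereas a missing table selects everything.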

Use SetInputArrayToProcess(0, ...) to specify the input table column that contains document ids (must be a vtkIdTypeArray). Defaults to "document".

Use SetInputArrayToProcess(1, ...) to specify the input table column that contains document contents (must be a vtkUnicodeStringArray). Defaults to "text".

Use SetInputArrayToProcess(2, 1, ...) to specify the input table column that contains range document ids (must be a vtkIdTypeArray). Defaults to "document".

Use SetInputArrayToProcess(3, 1, ...) to specify the input table column that contains range begin offsets (must be a vtkIdTypeArray). Defaults to "begin".

Use SetInputArrayToProcess(4, 1, ...) to specify the input table column that contains range end offsets (must be a vtkIdTypeArray). Defaults to "end".

Thanks:
Developed by Timothy M. Shead (tshead@sandia.gov) at Sandia National Laboratories.
Events:
vtkCommand::ProgressEvent
Tests:
vtkTokenizer (Tests)

Definition at line 87 of file vtkTokenizer.h.


Member Typedef Documentation

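typedef vtkTableAlgorithm vtkTokenizer::Superclass

Reimplemented from vtkTableAlgorithm.

Definition at line 92 of file vtkTokenizer.h.

typedef std::pair<vtkUnicodeString::value_type, vtkUnicodeString::value_type> vtkTokenizer::DelimiterRange

Defines storage for a half-open range of Unicode characters [begin, end).

Definition at line 98 of file vtkTokenizer.h.

typedef std::vector<DelimiterRange> vtkTokenizer::DelimiterRanges

Defines storage for a collection of half-open ranges of Unicode characters.

Definition at line 101 of file vtkTokenizer.h.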


Constructor & Destructor Documentation

vtkTokenizer::vtkTokenizer ( ) [protected]
vtkTokenizer::~vtkTokenizer ( ) [protected]

Member Function Documentation

static vtkTokenizer* vtkTokenizer::New ( ) [static]

Create an object with Debug turned off, modified time initialized to zero, and reference counting on.

Reimplemented from vtkTableAlgorithm.

virtual const char* vtkTokenizer::GetClassName ( ) [virtual]

Reimplemented from vtkTableAlgorithm.

static int vtkTokenizer::IsTypeOf ( const char *  name) [static]

Return 1 if this class type is the same type of (or a subclass of) the named class. Returns 0 otherwise. This method works in combination with vtkTypeMacro found in vtkSetGet.h.

Reimplemented from vtkTableAlgorithm.

virtual int vtkTokenizer::IsA ( const char *  name) [virtual]

Return 1 if this class is the same type of (or a subclass of) the named class. Returns 0 otherwise. This method works in combination with vtkTypeMacro found in vtkSetGet.h.

Reimplemented from vtkTableAlgorithm.

static vtkTokenizer* vtkTokenizer::SafeDownCast ( vtkObject *  o) [static]

Reimplemented from vtkTableAlgorithm.

void vtkTokenizer::PrintSelf ( ostream &  os,
vtkIndent  indent 
) [virtual]

Methods invoked by print to print information about the object including superclasses. Typically not called by the user (use Print() instead) but used in the hierarchical print process to combine the output of several classes.

Reimplemented from vtkTableAlgorithm.

static const DelimiterRanges vtkTokenizer::Punctuation ( ) [static]

Returns a set of delimiter ranges that match Unicode punctuation codepoints.

static const DelimiterRanges vtkTokenizer::Whitespace ( ) [static]

Returns a set of delimiter ranges that match Unicode whitespace codepoints.
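
To show what such a set of ranges can look like, here is a standalone sketch that lists the code points carrying the Unicode White_Space property as half-open ranges. It does not call VTK, UnicodeWhitespace and IsWhitespace are hypothetical names, and whether Whitespace() returns exactly this set is an assumption.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-ins for DelimiterRange / DelimiterRanges.
using Range = std::pair<char32_t, char32_t>;
using Ranges = std::vector<Range>;

// Code points with the Unicode White_Space property, written as
// half-open ranges [begin, end). That vtkTokenizer::Whitespace()
// matches this list exactly is an assumption.
inline Ranges UnicodeWhitespace()
{
  return {
    {0x0009, 0x000E}, // TAB, LF, VT, FF, CR
    {0x0020, 0x0021}, // SPACE
    {0x0085, 0x0086}, // NEXT LINE
    {0x00A0, 0x00A1}, // NO-BREAK SPACE
    {0x1680, 0x1681}, // OGHAM SPACE MARK
    {0x2000, 0x200B}, // EN QUAD through HAIR SPACE
    {0x2028, 0x202A}, // LINE SEPARATOR, PARAGRAPH SEPARATOR
    {0x202F, 0x2030}, // NARROW NO-BREAK SPACE
    {0x205F, 0x2060}, // MEDIUM MATHEMATICAL SPACE
    {0x3000, 0x3001}, // IDEOGRAPHIC SPACE
  };
}

// True if c lies in any of the whitespace ranges.
inline bool IsWhitespace(char32_t c)
{
  for (const Range& r : UnicodeWhitespace())
  {
    assert(r.first < r.second); // every listed range is non-empty
    if (r.first <= c && c < r.second)
      return true;
  }
  return false;
}
```

Because the ranges are half-open, a single code point such as U+0020 is written as [0x0020, 0x0021), and the end value of each range is itself excluded.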

static const DelimiterRanges vtkTokenizer::Logosyllabic ( ) [static]

Returns a set of delimiter ranges that match logosyllabic scripts, in which characters represent words instead of sounds, such as Chinese, Japanese, and Korean.

void vtkTokenizer::AddDroppedDelimiters ( vtkUnicodeString::value_type  begin,
vtkUnicodeString::value_type  end 
)

Adds the half-open range of Unicode characters [begin, end) to the set of "dropped" delimiters.

void vtkTokenizer::AddDroppedDelimiters ( const DelimiterRanges &  ranges)

Adds a collection of delimiter ranges to the set of "dropped" delimiters.

void vtkTokenizer::AddKeptDelimiters ( vtkUnicodeString::value_type  begin,
vtkUnicodeString::value_type  end 
)

Adds the half-open range of Unicode characters [begin, end) to the set of "kept" delimiters.

void vtkTokenizer::AddKeptDelimiters ( const DelimiterRanges &  ranges)

Adds a collection of delimiter ranges to the set of "kept" delimiters.

void vtkTokenizer::DropPunctuation ( )

Convenience functions to specify delimiters, mainly intended for use from Python and the ParaView server manager. C++ developers are strongly encouraged to use AddDroppedDelimiters(...) and AddKeptDelimiters(...) instead.

void vtkTokenizer::DropWhitespace ( )

Convenience functions to specify delimiters, mainly intended for use from Python and the ParaView server manager. C++ developers are strongly encouraged to use AddDroppedDelimiters(...) and AddKeptDelimiters(...) instead.

void vtkTokenizer::KeepPunctuation ( )

Convenience functions to specify delimiters, mainly intended for use from Python and the ParaView server manager. C++ developers are strongly encouraged to use AddDroppedDelimiters(...) and AddKeptDelimiters(...) instead.

void vtkTokenizer::KeepWhitespace ( )

Convenience functions to specify delimiters, mainly intended for use from Python and the ParaView server manager. C++ developers are strongly encouraged to use AddDroppedDelimiters(...) and AddKeptDelimiters(...) instead.

void vtkTokenizer::KeepLogosyllabic ( )

Convenience functions to specify delimiters, mainly intended for use from Python and the ParaView server manager. C++ developers are strongly encouraged to use AddDroppedDelimiters(...) and AddKeptDelimiters(...) instead.

void vtkTokenizer::ClearDroppedDelimiters ( )

Clears the set of "dropped" delimiters.

void vtkTokenizer::ClearKeptDelimiters ( )

Clears the set of "kept" delimiters.

int vtkTokenizer::FillInputPortInformation ( int  port,
vtkInformation *  info 
) [protected, virtual]

Fill the input port information objects for this algorithm. This is invoked by the first call to GetInputPortInformation for each port so subclasses can specify what they can handle.

Reimplemented from vtkTableAlgorithm.

virtual int vtkTokenizer::RequestData ( vtkInformation *  request,
vtkInformationVector **  inputVector,
vtkInformationVector *  outputVector 
) [protected, virtual]

This is called by the superclass. This is the method you should override.

Reimplemented from vtkTableAlgorithm.

