Converts a document collection into a term collection.

#include <vtkTokenizer.h>

Public Types
typedef vtkTableAlgorithm	Superclass
typedef std::pair < vtkUnicodeString::value_type, vtkUnicodeString::value_type >	DelimiterRange
typedef std::vector < DelimiterRange >	DelimiterRanges
Public Member Functions
virtual const char *	GetClassName ()
virtual int	IsA (const char *type)
void	PrintSelf (ostream &os, vtkIndent indent)
void	AddDroppedDelimiters (vtkUnicodeString::value_type begin, vtkUnicodeString::value_type end)
void	AddDroppedDelimiters (const DelimiterRanges &ranges)
void	AddKeptDelimiters (vtkUnicodeString::value_type begin, vtkUnicodeString::value_type end)
void	ClearDroppedDelimiters ()
void	ClearKeptDelimiters ()

void	AddKeptDelimiters (const DelimiterRanges &ranges)

void	DropPunctuation ()
void	DropWhitespace ()
void	KeepPunctuation ()
void	KeepWhitespace ()
void	KeepLogosyllabic ()
Static Public Member Functions
static vtkTokenizer *	New ()
static int	IsTypeOf (const char *type)
static vtkTokenizer *	SafeDownCast (vtkObject *o)
static const DelimiterRanges	Punctuation ()
static const DelimiterRanges	Whitespace ()
static const DelimiterRanges	Logosyllabic ()
Protected Member Functions
	vtkTokenizer ()
	~vtkTokenizer ()
int	FillInputPortInformation (int port, vtkInformation *info)
virtual int	RequestData (vtkInformation *request, vtkInformationVector **inputVector, vtkInformationVector *outputVector)

Detailed Description

Converts a document collection into a term collection.

Given an artifact table containing text documents, splits each document into its component tokens, producing a feature table containing the results.

Tokenization is performed by splitting input text into tokens based on character delimiters. Delimiters are divided into two categories: "dropped" and "kept". "Dropped" delimiters are discarded from the output, while "kept" delimiters are retained in the output as individual tokens. Initially, vtkTokenizer has no delimiters defined, so you must set some delimiters before use.
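
The splitting rule described above can be sketched in standalone C++. This is a minimal illustration that does not use VTK: `Range`, `Token`, `Matches`, and `Tokenize` are hypothetical names, `char32_t` stands in for vtkUnicodeString::value_type, and letting a "kept" match take precedence over a "dropped" match is an assumption rather than documented vtkTokenizer behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; not the vtkTokenizer implementation.
using Range = std::pair<char32_t, char32_t>;  // half-open [first, second)

struct Token {
  std::size_t begin;    // offset of the first code point in the token
  std::size_t end;      // one past the last code point in the token
  std::u32string text;  // the token contents
};

// True if cp falls inside any half-open range in the list.
inline bool Matches(const std::vector<Range>& ranges, char32_t cp) {
  for (const Range& r : ranges) {
    assert(r.first <= r.second);
    if (cp >= r.first && cp < r.second) return true;
  }
  return false;
}

// Dropped delimiters vanish, kept delimiters become one-character tokens,
// and every other code point accumulates into a run that forms a token.
inline std::vector<Token> Tokenize(const std::u32string& text,
                                   const std::vector<Range>& dropped,
                                   const std::vector<Range>& kept) {
  std::vector<Token> tokens;
  std::size_t start = 0;  // start of the current non-delimiter run
  auto flush = [&](std::size_t stop) {
    if (stop > start) {
      tokens.push_back({start, stop, text.substr(start, stop - start)});
    }
  };
  for (std::size_t i = 0; i < text.size(); ++i) {
    // Assumption: a code point listed as both kept and dropped is kept.
    const bool isKept = Matches(kept, text[i]);
    if (!isKept && !Matches(dropped, text[i])) continue;
    flush(i);  // emit the run that ends at this delimiter
    if (isKept) tokens.push_back({i, i + 1, text.substr(i, 1)});
    start = i + 1;
  }
  flush(text.size());  // emit any trailing run
  return tokens;
}
```

With the space character dropped and the comma kept, tokenizing "Hello, world" in this sketch yields three tokens: "Hello" at [0, 5), "," at [5, 6), and "world" at [7, 12).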

Users can reset and append to the lists of delimiters for each category. Delimiters are specified as half-open ranges of Unicode code points. This makes it easy to tokenize logosyllabic scripts such as Chinese, Korean, and Japanese by specifying an entire range of logograms as "kept" delimiters, so that individual glyphs become tokens.
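
To make the half-open convention concrete, the following sketch models a single delimiter list that can be appended to and reset. It is standalone C++ rather than VTK code: `DelimiterList` and its methods are hypothetical, and the CJK Unified Ideographs block (U+4E00 through U+9FFF) is used only as an example range, since the ranges returned by Logosyllabic() are not listed in this documentation.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper for illustration; not part of the vtkTokenizer API.
class DelimiterList {
public:
  // Append the half-open range [begin, end) of Unicode code points.
  void Add(char32_t begin, char32_t end) {
    assert(begin <= end);
    ranges_.emplace_back(begin, end);
  }

  // Reset the list so that no code point is treated as a delimiter.
  void Clear() { ranges_.clear(); }

  // True if cp lies in any stored range; each range's end is excluded.
  bool Contains(char32_t cp) const {
    for (const auto& r : ranges_) {
      if (cp >= r.first && cp < r.second) return true;
    }
    return false;
  }

private:
  std::vector<std::pair<char32_t, char32_t>> ranges_;
};
```

For instance, after `Add(0x4E00, 0xA000)`, `Contains(0x9FFF)` is true and `Contains(0xA000)` is false, because the end of the range is excluded. Covering a whole block with one range is what lets every ideograph act as its own "kept" delimiter.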

Inputs: Input port 0: (required) A vtkTable containing zero or more "documents", with one document per table row, a vtkIdTypeArray column containing document ids, and a vtkUnicodeStringArray column containing the contents of each document.

Input port 1: (optional) A vtkTable containing zero or more document ranges to be processed, with one range per table row, a vtkIdTypeArray column containing document ids, a vtkIdTypeArray column containing begin offsets, and a vtkIdTypeArray column containing end offsets. If input port 1 is left unconnected, the filter processes the entire contents of every input document.
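
The fallback for an unconnected port 1 can be pictured as follows. This is a hedged, standalone sketch, not the filter's code: `Document`, `DocumentRange`, and `ResolveRanges` are hypothetical names, `long long` stands in for vtkIdType, a null pointer models the unconnected port, and treating the end offset as exclusive is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical row types for illustration; the real filter reads vtkTable columns.
struct Document {
  long long id;         // document id column
  std::u32string text;  // document contents column
};

struct DocumentRange {
  long long document;  // id of the document the range refers to
  std::size_t begin;   // first offset to process
  std::size_t end;     // assumed to be one past the last offset to process
};

// If no explicit ranges are supplied, cover every document in full;
// otherwise return the supplied ranges unchanged.
inline std::vector<DocumentRange> ResolveRanges(
    const std::vector<Document>& documents,
    const std::vector<DocumentRange>* explicitRanges) {
  if (explicitRanges != nullptr) {
    for (const DocumentRange& r : *explicitRanges) assert(r.begin <= r.end);
    return *explicitRanges;
  }
  std::vector<DocumentRange> ranges;
  for (const Document& d : documents) {
    ranges.push_back({d.id, 0, d.text.size()});
  }
  return ranges;
}
```

Passing a null pointer produces one full-length range per document, which mirrors the documented default of processing each document in its entirety.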

Outputs: Output port 0: A vtkTable containing "document", "begin", "end", "type", and "text" columns.

Use SetInputArrayToProcess(0, ...) to specify the input table column that contains document ids (must be a vtkIdTypeArray). Defaults to "document".

Use SetInputArrayToProcess(1, ...) to specify the input table column that contains document contents (must be a vtkUnicodeStringArray). Defaults to "text".

Use SetInputArrayToProcess(2, 1, ...) to specify the input table column that contains range document ids (must be a vtkIdTypeArray). Defaults to "document".

Use SetInputArrayToProcess(3, 1, ...) to specify the input table column that contains range begin offsets (must be a vtkIdTypeArray). Defaults to "begin".

Use SetInputArrayToProcess(4, 1, ...) to specify the input table column that contains range end offsets (must be a vtkIdTypeArray). Defaults to "end".

Thanks: Developed by Timothy M. Shead (tshead@sandia.gov) at Sandia National Laboratories.

Events: vtkCommand::ProgressEvent

Tests: vtkTokenizer (Tests)

Definition at line 87 of file vtkTokenizer.h.

Member Typedef Documentation

typedef vtkTableAlgorithm vtkTokenizer::Superclass

Reimplemented from vtkTableAlgorithm.

Definition at line 92 of file vtkTokenizer.h.

typedef std::pair<vtkUnicodeString::value_type, vtkUnicodeString::value_type> vtkTokenizer::DelimiterRange

Defines storage for a half-open range of Unicode characters [begin, end).

Definition at line 98 of file vtkTokenizer.h.

typedef std::vector<DelimiterRange> vtkTokenizer::DelimiterRanges

Defines storage for a collection of half-open ranges of Unicode characters.

Definition at line 101 of file vtkTokenizer.h.

Constructor & Destructor Documentation

vtkTokenizer::vtkTokenizer ( ) [protected]

vtkTokenizer::~vtkTokenizer ( ) [protected]

Member Function Documentation

static vtkTokenizer* vtkTokenizer::New ( ) [static]

Create an object with Debug turned off, modified time initialized to zero, and reference counting on.

Reimplemented from vtkTableAlgorithm.

virtual const char* vtkTokenizer::GetClassName ( ) [virtual]

Reimplemented from vtkTableAlgorithm.

static int vtkTokenizer::IsTypeOf ( const char * name ) [static]

Return 1 if this class type is the same type of (or a subclass of) the named class. Returns 0 otherwise. This method works in combination with vtkTypeMacro found in vtkSetGet.h.

Reimplemented from vtkTableAlgorithm.

virtual int vtkTokenizer::IsA ( const char * name ) [virtual]

Return 1 if this class is the same type of (or a subclass of) the named class. Returns 0 otherwise. This method works in combination with vtkTypeMacro found in vtkSetGet.h.

Reimplemented from vtkTableAlgorithm.

static vtkTokenizer* vtkTokenizer::SafeDownCast ( vtkObject * o ) [static]

Reimplemented from vtkTableAlgorithm.

void vtkTokenizer::PrintSelf ( ostream & os, vtkIndent indent ) [virtual]

Methods invoked by print to print information about the object including superclasses. Typically not called by the user (use Print() instead) but used in the hierarchical print process to combine the output of several classes.

Reimplemented from vtkTableAlgorithm.

static const DelimiterRanges vtkTokenizer::Punctuation ( ) [static]

Returns a set of delimiter ranges that match Unicode punctuation codepoints.

static const DelimiterRanges vtkTokenizer::Whitespace ( ) [static]

Returns a set of delimiter ranges that match Unicode whitespace codepoints.

static const DelimiterRanges vtkTokenizer::Logosyllabic ( ) [static]

Returns a set of delimiter ranges that match the characters of logosyllabic scripts, in which characters represent words rather than sounds, such as Chinese, Japanese, and Korean.

void vtkTokenizer::AddDroppedDelimiters ( vtkUnicodeString::value_type begin, vtkUnicodeString::value_type end )

Adds the half-open range of Unicode characters [begin, end) to the set of "dropped" delimiters.

void vtkTokenizer::AddDroppedDelimiters ( const DelimiterRanges & ranges )

Adds a collection of delimiter ranges to the set of "dropped" delimiters.

void vtkTokenizer::AddKeptDelimiters ( vtkUnicodeString::value_type begin, vtkUnicodeString::value_type end )

Adds the half-open range of Unicode characters [begin, end) to the set of "kept" delimiters.

void vtkTokenizer::AddKeptDelimiters ( const DelimiterRanges & ranges )

Adds a collection of delimiter ranges to the set of "kept" delimiters.

void vtkTokenizer::DropPunctuation ( )