vtkTokenizer Class Reference

#include <vtkTokenizer.h>

Detailed Description

Converts a document collection into a term collection.

Given an artifact table containing text documents, splits each document into its component tokens, producing a feature table containing the results.

Tokenization is performed by splitting input text into tokens based on character delimiters. Delimiters are divided into two categories: "dropped" and "kept". "Dropped" delimiters are discarded from the output, while "kept" delimiters are retained in the output as individual tokens. Initially, vtkTokenizer has no delimiters defined, so you must set some delimiters before use.

Users can reset and append to the lists of delimiters for each category. Delimiters are specified as half-open ranges of Unicode code points. This makes it easy to tokenize logosyllabic scripts such as Chinese, Korean, and Japanese by specifying an entire range of logograms as "kept" delimiters, so that individual glyphs become tokens.
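The dropped/kept behavior described above can be illustrated with a small standalone sketch. It does not use VTK and is not the vtkTokenizer implementation: `DelimiterRange`, `InRanges`, and `Tokenize` are hypothetical names, `DelimiterRanges` is a stand-in for the class's own type, and the rule that a code point in both categories is treated as dropped is an assumption.

```cpp
#include <string>
#include <utility>
#include <vector>

// Stand-in types (assumption): a half-open range [first, second) of
// Unicode code points, and a list of such ranges.
using DelimiterRange = std::pair<char32_t, char32_t>;
using DelimiterRanges = std::vector<DelimiterRange>;

// Returns true if code point c lies inside any of the half-open ranges.
bool InRanges(char32_t c, const DelimiterRanges& ranges)
{
  for (const DelimiterRange& r : ranges)
  {
    if (r.first <= c && c < r.second)
    {
      return true;
    }
  }
  return false;
}

// Splits text into tokens. A dropped delimiter ends the current token and
// is discarded. A kept delimiter ends the current token and is emitted as
// its own single-code-point token. Assumption: dropped wins over kept.
std::vector<std::u32string> Tokenize(const std::u32string& text,
  const DelimiterRanges& dropped, const DelimiterRanges& kept)
{
  std::vector<std::u32string> tokens;
  std::u32string current;
  for (char32_t c : text)
  {
    const bool isDropped = InRanges(c, dropped);
    const bool isKept = !isDropped && InRanges(c, kept);
    if (!isDropped && !isKept)
    {
      current.push_back(c); // ordinary character: extend the current token
      continue;
    }
    if (!current.empty())
    {
      tokens.push_back(current);
      current.clear();
    }
    if (isKept)
    {
      tokens.push_back(std::u32string(1, c));
    }
  }
  if (!current.empty())
  {
    tokens.push_back(current);
  }
  return tokens;
}
```

With the space character as a dropped range and the comma as a kept range, `Tokenize(U"a b,c", ...)` yields the tokens `a`, `b`, `,`, `c`. Passing the range 0x4E00 to 0xA000 (the CJK Unified Ideographs block, U+4E00 through U+9FFF) as a kept range makes every ideograph its own token.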

Inputs:
Input port 0: (required) A vtkTable containing zero-to-many "documents", with one document per table row, a vtkIdTypeArray column containing document ids, and a vtkUnicodeStringArray column containing the contents of each document.
Input port 1: (optional) A vtkTable containing zero-to-many document ranges to be processed, with one range per table row, a vtkIdTypeArray column containing document ids, a vtkIdTypeArray column containing begin offsets, and a vtkIdTypeArray column containing end offsets. If input port 1 is left unconnected, the filter processes the entire contents of every input document.

Outputs:
Output port 0: A vtkTable containing "document", "begin", "end", "type", and "text" columns.
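The relationship between the optional range input and the output rows can be sketched without VTK as follows. This is an illustration under stated assumptions, not the filter's implementation: `Document`, `DocumentRange`, `TokenRow`, `IsDelimiter`, and `TokenizeRanges` are hypothetical names, offsets are assumed to be code point offsets forming a half-open range [begin, end), only dropped delimiters are handled, an empty range list stands in for an unconnected input port 1, and the "type" column is omitted because its contents are not described here.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using DelimiterRanges = std::vector<std::pair<char32_t, char32_t>>;

// Hypothetical row types mirroring the table columns described above.
struct Document { long long id; std::u32string text; };
struct DocumentRange { long long document; std::size_t begin; std::size_t end; };
struct TokenRow { long long document; std::size_t begin; std::size_t end; std::u32string text; };

// Returns true if c lies inside any half-open range of code points.
bool IsDelimiter(char32_t c, const DelimiterRanges& ranges)
{
  for (const auto& r : ranges)
  {
    if (r.first <= c && c < r.second)
    {
      return true;
    }
  }
  return false;
}

// Tokenizes only the requested ranges, recording where each token was found.
// Assumption: an empty range list means "process every document in full".
std::vector<TokenRow> TokenizeRanges(const std::vector<Document>& documents,
  std::vector<DocumentRange> ranges, const DelimiterRanges& dropped)
{
  if (ranges.empty())
  {
    for (const Document& d : documents)
    {
      ranges.push_back({ d.id, 0, d.text.size() });
    }
  }
  std::vector<TokenRow> rows;
  for (const DocumentRange& r : ranges)
  {
    for (const Document& d : documents)
    {
      if (d.id != r.document)
      {
        continue;
      }
      // Clamp the range end to the document length.
      const std::size_t end = std::min(r.end, d.text.size());
      std::size_t tokenBegin = r.begin;
      for (std::size_t i = r.begin; i <= end; ++i)
      {
        // A delimiter or the end of the range closes the current token.
        if (i == end || IsDelimiter(d.text[i], dropped))
        {
          if (i > tokenBegin)
          {
            rows.push_back({ d.id, tokenBegin, i, d.text.substr(tokenBegin, i - tokenBegin) });
          }
          tokenBegin = i + 1;
        }
      }
    }
  }
  return rows;
}
```

For a document with id 7 and text "the quick brown fox", a range of (7, 4, 15) with the space character dropped produces two rows: "quick" at offsets 4 to 9 and "brown" at offsets 10 to 15.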

Use SetInputArrayToProcess(0, 0, ...) to specify the input table column that contains document ids (must be a vtkIdTypeArray). Defaults to "document".

Use SetInputArrayToProcess(1, 0, ...) to specify the input table column that contains document contents (must be a vtkUnicodeStringArray). Defaults to "text".

Use SetInputArrayToProcess(2, 1, ...) to specify the input table column that contains range document ids (must be a vtkIdTypeArray). Defaults to "document".

Use SetInputArrayToProcess(3, 1, ...) to specify the input table column that contains range begin offsets (must be a vtkIdTypeArray). Defaults to "begin".

Use SetInputArrayToProcess(4, 1, ...) to specify the input table column that contains range end offsets (must be a vtkIdTypeArray). Defaults to "end".

Thanks: Developed by Timothy M. Shead (tshead@sandia.gov) at Sandia National Laboratories.

void AddKeptDelimiters(const DelimiterRanges& ranges);

Internals* const Implementation;

Events: vtkCommand::ProgressEvent

Tests: vtkTokenizer (Tests)

The documentation for this class was generated from the following file:

dox/TextAnalysis/vtkTokenizer.h