VTK
Converts a document collection into a term collection.
#include <vtkTokenizer.h>
Given an artifact table containing text documents, splits each document into its component tokens, producing a feature table containing the results.
Tokenization is performed by splitting input text into tokens based on character delimiters. Delimiters are divided into two categories: "dropped" and "kept". "Dropped" delimiters are discarded from the output, while "kept" delimiters are retained in the output as individual tokens. Initially, vtkTokenizer has no delimiters defined, so you must set some delimiters before use.
Users can reset and append to the lists of delimiters for each category. Delimiters are specified as half-open ranges of Unicode code points. This makes it easy to tokenize logosyllabic scripts such as Chinese, Korean, and Japanese by specifying an entire range of logograms as "kept" delimiters, so that individual glyphs become tokens.
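The dropped/kept behavior described above can be illustrated with a small standalone sketch. This is not vtkTokenizer's implementation and does not use VTK at all: `Range`, `Ranges`, `Matches`, and `Tokenize` are hypothetical names, `char32_t` stands in for `vtkUnicodeString::value_type`, and checking "dropped" before "kept" when a code point appears in both sets is an assumption made only for this example.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the DelimiterRange / DelimiterRanges typedefs.
using Range = std::pair<char32_t, char32_t>; // half-open [begin, end)
using Ranges = std::vector<Range>;

// True when code point c falls inside any of the half-open ranges.
static bool Matches(const Ranges& ranges, char32_t c)
{
  for (const Range& r : ranges)
  {
    assert(r.first <= r.second); // a range must not be reversed
    if (c >= r.first && c < r.second)
      return true;
  }
  return false;
}

// Splits text on delimiters. Dropped delimiters end the current token and
// are discarded; kept delimiters end the current token and are emitted as
// one-character tokens; every other code point extends the current token.
std::vector<std::u32string> Tokenize(const std::u32string& text,
                                     const Ranges& dropped,
                                     const Ranges& kept)
{
  std::vector<std::u32string> tokens;
  std::u32string current;
  auto flush = [&]() {
    if (!current.empty())
    {
      tokens.push_back(current);
      current.clear();
    }
  };
  for (char32_t c : text)
  {
    if (Matches(dropped, c)) // assumption: "dropped" wins over "kept"
    {
      flush();
    }
    else if (Matches(kept, c))
    {
      flush();
      tokens.push_back(std::u32string(1, c));
    }
    else
    {
      current.push_back(c);
    }
  }
  flush();
  return tokens;
}
```

In this sketch, dropping the space `[0x20, 0x21)` and keeping the comma `[0x2C, 0x2D)` splits `U"Hello, world"` into the three tokens `Hello`, `,`, and `world`. Passing the CJK Unified Ideographs block `[0x4E00, 0xA000)` as a "kept" range turns each ideograph into its own token.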
Inputs:
Input port 0: (required) A vtkTable containing zero-to-many "documents", with one document per table row, a vtkIdTypeArray column containing document ids, and a vtkUnicodeStringArray column containing the contents of each document.
Input port 1: (optional) A vtkTable containing zero-to-many document ranges to be processed, with one range per table row, a vtkIdTypeArray column containing document ids, a vtkIdTypeArray column containing begin offsets, and a vtkIdTypeArray column containing end offsets. If input port 1 is left unconnected, the filter processes the entire contents of every input document.
Outputs:
Output port 0: A vtkTable containing "document", "begin", "end", "type", and "text" columns.
Use SetInputArrayToProcess(0, ...) to specify the input table column that contains document ids (must be a vtkIdTypeArray). Defaults to "document".
Use SetInputArrayToProcess(1, ...) to specify the input table column that contains document contents (must be a vtkUnicodeStringArray). Defaults to "text".
Use SetInputArrayToProcess(2, 1, ...) to specify the input table column that contains range document ids (must be a vtkIdTypeArray). Defaults to "document".
Use SetInputArrayToProcess(3, 1, ...) to specify the input table column that contains range begin offsets (must be a vtkIdTypeArray). Defaults to "begin".
Use SetInputArrayToProcess(4, 1, ...) to specify the input table column that contains range end offsets (must be a vtkIdTypeArray). Defaults to "end".
Definition at line 87 of file vtkTokenizer.h.
Reimplemented from vtkTableAlgorithm.
Definition at line 92 of file vtkTokenizer.h.
typedef std::pair<vtkUnicodeString::value_type, vtkUnicodeString::value_type> vtkTokenizer::DelimiterRange
Defines storage for a half-open range of Unicode characters [begin, end).
Definition at line 98 of file vtkTokenizer.h.
typedef std::vector<DelimiterRange> vtkTokenizer::DelimiterRanges
Defines storage for a collection of half-open ranges of Unicode characters.
Definition at line 101 of file vtkTokenizer.h.
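Because every range is half-open, a single delimiter character `c` is written as `[c, c + 1)`, and an inclusive block `[first, last]` is written as `[first, last + 1)`. The following standalone sketch makes that conversion explicit. It does not use VTK: `Range`, `SingleCodePoint`, `InclusiveBlock`, and `Contains` are hypothetical helpers, and `char32_t` is assumed here as a stand-in for `vtkUnicodeString::value_type`.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in mirroring the DelimiterRange typedef.
using Range = std::pair<char32_t, char32_t>; // half-open [begin, end)

// Half-open range covering exactly one code point.
Range SingleCodePoint(char32_t c)
{
  return Range(c, static_cast<char32_t>(c + 1));
}

// Half-open range covering the inclusive block [first, last].
Range InclusiveBlock(char32_t first, char32_t last)
{
  assert(first <= last);
  return Range(first, static_cast<char32_t>(last + 1));
}

// True when c lies in [begin, end): begin is included, end is excluded.
bool Contains(const Range& r, char32_t c)
{
  return c >= r.first && c < r.second;
}
```

For example, `InclusiveBlock(U'0', U'9')` yields `[0x30, 0x3A)`, which contains `'9'` but not `':'`.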
vtkTokenizer::vtkTokenizer() [protected]
vtkTokenizer::~vtkTokenizer() [protected]
static vtkTokenizer* vtkTokenizer::New() [static]
Create an object with Debug turned off, modified time initialized to zero, and reference counting on.
Reimplemented from vtkTableAlgorithm.
virtual const char* vtkTokenizer::GetClassName() [virtual]
Reimplemented from vtkTableAlgorithm.
static int vtkTokenizer::IsTypeOf(const char* name) [static]
Return 1 if this class type is the same type of (or a subclass of) the named class. Returns 0 otherwise. This method works in combination with vtkTypeMacro found in vtkSetGet.h.
Reimplemented from vtkTableAlgorithm.
virtual int vtkTokenizer::IsA(const char* name) [virtual]
Return 1 if this class is the same type of (or a subclass of) the named class. Returns 0 otherwise. This method works in combination with vtkTypeMacro found in vtkSetGet.h.
Reimplemented from vtkTableAlgorithm.
static vtkTokenizer* vtkTokenizer::SafeDownCast(vtkObject* o) [static]
Reimplemented from vtkTableAlgorithm.
void vtkTokenizer::PrintSelf(ostream& os, vtkIndent indent) [virtual]
Methods invoked by print to print information about the object including superclasses. Typically not called by the user (use Print() instead) but used in the hierarchical print process to combine the output of several classes.
Reimplemented from vtkTableAlgorithm.
static const DelimiterRanges vtkTokenizer::Punctuation() [static]
Returns a set of delimiter ranges that match Unicode punctuation codepoints.
static const DelimiterRanges vtkTokenizer::Whitespace() [static]
Returns a set of delimiter ranges that match Unicode whitespace codepoints.
static const DelimiterRanges vtkTokenizer::Logosyllabic() [static]
Returns a set of delimiter ranges that match logosyllabic languages where characters represent words instead of sounds, such as Chinese, Japanese, and Korean.
void vtkTokenizer::AddDroppedDelimiters(vtkUnicodeString::value_type begin, vtkUnicodeString::value_type end)
Adds the half-open range of Unicode characters [begin, end) to the set of "dropped" delimiters.
void vtkTokenizer::AddDroppedDelimiters(const DelimiterRanges& ranges)
Adds a collection of delimiter ranges to the set of "dropped" delimiters.
void vtkTokenizer::AddKeptDelimiters(vtkUnicodeString::value_type begin, vtkUnicodeString::value_type end)
Adds the half-open range of Unicode characters [begin, end) to the set of "kept" delimiters.
void vtkTokenizer::AddKeptDelimiters(const DelimiterRanges& ranges)
Adds a collection of delimiter ranges to the set of "kept" delimiters.
void vtkTokenizer::DropPunctuation()
Convenience functions to specify delimiters, mainly intended for use from Python and the ParaView server manager. C++ developers are strongly encouraged to use AddDroppedDelimiters(...) and AddKeptDelimiters(...) instead.
void vtkTokenizer::DropWhitespace()
Convenience functions to specify delimiters, mainly intended for use from Python and the ParaView server manager. C++ developers are strongly encouraged to use AddDroppedDelimiters(...) and AddKeptDelimiters(...) instead.
void vtkTokenizer::KeepPunctuation()
Convenience functions to specify delimiters, mainly intended for use from Python and the ParaView server manager. C++ developers are strongly encouraged to use AddDroppedDelimiters(...) and AddKeptDelimiters(...) instead.
void vtkTokenizer::KeepWhitespace()
Convenience functions to specify delimiters, mainly intended for use from Python and the ParaView server manager. C++ developers are strongly encouraged to use AddDroppedDelimiters(...) and AddKeptDelimiters(...) instead.
void vtkTokenizer::KeepLogosyllabic()
Convenience functions to specify delimiters, mainly intended for use from Python and the ParaView server manager. C++ developers are strongly encouraged to use AddDroppedDelimiters(...) and AddKeptDelimiters(...) instead.
void vtkTokenizer::ClearDroppedDelimiters()
Clears the set of "dropped" delimiters.
void vtkTokenizer::ClearKeptDelimiters()
Clears the set of "kept" delimiters.
int vtkTokenizer::FillInputPortInformation(int port, vtkInformation* info) [protected, virtual]
Fill the input port information objects for this algorithm. This is invoked by the first call to GetInputPortInformation for each port so subclasses can specify what they can handle.
Reimplemented from vtkTableAlgorithm.
virtual int vtkTokenizer::RequestData(vtkInformation* request, vtkInformationVector** inputVector, vtkInformationVector* outputVector) [protected, virtual]
This is called by the superclass. This is the method you should override.
Reimplemented from vtkTableAlgorithm.