vtkTextExtraction Class Reference

#include <vtkTextExtraction.h>

Detailed Description

Extracts text from documents based on their MIME type.

Given a table containing document ids, URIs, MIME types, and document contents, extracts plain text from each document and generates a list of 'tags' that delineate ranges of that text. The actual work of extracting text and generating tags is performed by an ordered list of vtkTextExtractionStrategy objects.

By default, vtkTextExtraction has a single strategy, which extracts text from plain text documents. Callers will almost certainly want to supplement or replace the default with their own strategies.
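The ordered-strategy design described above can be sketched in standalone C++. This is only an illustrative model, not the VTK implementation: the `Tag`, `Strategy`, `PlainTextStrategy`, and `ExtractDocument` names are hypothetical, as are the "first strategy that accepts the document wins" rule and the "TEXT" tag type. Consult vtkTextExtractionStrategy for the real interface.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for one row of the tag output table.
struct Tag {
  long long document;  // id of the document the tag belongs to
  std::string uri;     // URI of that document
  std::size_t begin;   // offset of the first tagged character
  std::size_t end;     // offset one past the last tagged character
  std::string type;    // tag type label
};

// Hypothetical stand-in for vtkTextExtractionStrategy.
struct Strategy {
  virtual ~Strategy() = default;
  // Returns true if this strategy handled the document.
  virtual bool Extract(long long document, const std::string& uri,
                       const std::string& mimeType, const std::string& content,
                       std::string& text, std::vector<Tag>& tags) const = 0;
};

// Accepts any MIME type beginning with "text/plain" and passes the
// content through unchanged, tagging the whole range (assumed behavior).
struct PlainTextStrategy : Strategy {
  bool Extract(long long document, const std::string& uri,
               const std::string& mimeType, const std::string& content,
               std::string& text, std::vector<Tag>& tags) const override {
    if (mimeType.compare(0, 10, "text/plain") != 0) {
      return false;
    }
    text = content;
    tags.push_back({document, uri, 0, text.size(), "TEXT"});
    return true;
  }
};

// Tries each strategy in list order; the first one that accepts the
// document produces the text and tags. Leaves text empty if none does.
inline bool ExtractDocument(
    const std::vector<std::unique_ptr<Strategy>>& strategies,
    long long document, const std::string& uri, const std::string& mimeType,
    const std::string& content, std::string& text, std::vector<Tag>& tags) {
  text.clear();
  for (const auto& strategy : strategies) {
    if (strategy->Extract(document, uri, mimeType, content, text, tags)) {
      return true;
    }
  }
  return false;
}
```

In this model, supplementing the default means appending further `Strategy` subclasses to the vector, and replacing it means building the vector without `PlainTextStrategy`. List order decides which strategy sees a document first.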

Inputs:
  Input port 0: (required) A vtkTable containing document ids, URIs, MIME types, and document contents (which could be binary).

Outputs:
  Output port 0: The same table with an additional "text" column that contains the text extracted from each document.
  Output port 1: A table of document tags that includes "document", "uri", "begin", "end", and "type" columns.

Use SetInputArrayToProcess(0, ...) to specify the input table column that contains document ids (must be a vtkIdTypeArray). Default: "document".

Use SetInputArrayToProcess(1, ...) to specify the input table column that contains URIs (must be a vtkStringArray). Default: "uri".

Use SetInputArrayToProcess(2, ...) to specify the input table column that contains MIME types (must be a vtkStringArray). Default: "mime_type".

Use SetInputArrayToProcess(3, ...) to specify the input table column that contains document contents (must be a vtkStringArray). Default: "content".
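The four array indices and their default column names can be summarized with a small standalone model. The `ColumnSelection` class below is hypothetical and exists only to illustrate the index-to-column mapping. In VTK itself the override would presumably go through the general vtkAlgorithm overload, for example SetInputArrayToProcess(2, 0, 0, vtkDataObject::FIELD_ASSOCIATION_ROWS, "mime"), where the column name "mime" is an assumed example.

```cpp
#include <array>
#include <cassert>
#include <string>

// Hypothetical model of the four input-array slots and their defaults.
class ColumnSelection {
public:
  // Slot values mirror the index passed to SetInputArrayToProcess.
  enum Slot { Document = 0, Uri = 1, MimeType = 2, Content = 3 };

  // Overrides the column name used for one slot.
  void Set(Slot slot, const std::string& column) { names_[slot] = column; }

  // Returns the column name currently selected for a slot.
  const std::string& Get(Slot slot) const { return names_[slot]; }

private:
  // Defaults as listed in the class documentation.
  std::array<std::string, 4> names_{{"document", "uri", "mime_type", "content"}};
};
```

Only the slots that differ from the defaults need to be set. A table that already uses the default column names requires no calls at all.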

Warning: The input document contents array must be a string array, even though the individual document contents may be binary data.

See also: vtkTextExtractionStrategy, vtkPlainTextExtractionStrategy

Thanks: Developed by Timothy M. Shead (tshead@sandia.gov) at Sandia National Laboratories.

Events: vtkCommand::ProgressEvent

Tests: vtkTextExtraction (Tests)

Definition at line 26 of file vtkTextExtraction.h.

The documentation for this class was generated from the following file:

dox/TextAnalysis/vtkTextExtraction.h