📖 Parse API documentation

TextualParse class

class tonic_textual.parse_api.TextualParse(
base_url: str = 'https://textual.tonic.ai',
api_key: str | None = None,
verify: bool = True,
)

Wrapper class for invoking the Tonic Textual API.

Parameters:
  • base_url (Optional[str]) – The URL of your Tonic Textual instance. Do not include a trailing slash. The default value is https://textual.tonic.ai.

  • api_key (Optional[str]) – Optional. Your API token. Instead of providing the API token here, we recommend that you set the API key in your environment as the value of TEXTUAL_API_KEY.

  • verify (bool) – Whether to verify SSL certificates. By default, this is enabled.

Examples

>>> from tonic_textual.parse_api import TextualParse
>>> textual = TextualParse("https://textual.tonic.ai")

parse_file(
file: IOBase,
file_name: str,
timeout: int | None = None,
) FileParseResult

Parse a given file. To open binary files, use the ‘rb’ option.

Parameters:
  • file (io.IOBase) – The opened file, available for reading, to parse.

  • file_name (str) – The name of the file.

  • timeout (Optional[int]) – Optional timeout in seconds. Indicates to stop waiting for the parsed result after the specified time.

Returns:

The parsed document.

Return type:

FileParseResult
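
As a minimal sketch of the call described above, the helper below opens a file in binary mode and passes it to parse_file. The helper name parse_local_file and the example path are hypothetical and are not part of tonic_textual. The client argument is assumed to be a TextualParse instance.

```python
import os


def parse_local_file(client, path, timeout=None):
    """Open `path` in binary mode and pass it to client.parse_file.

    Hypothetical helper: `client` is assumed to be a TextualParse instance.
    The file name sent to the API is the base name of `path`.
    """
    with open(path, "rb") as handle:
        return client.parse_file(handle, os.path.basename(path), timeout=timeout)


# Typical use (requires the tonic_textual package and an API key):
#   from tonic_textual.parse_api import TextualParse
#   textual = TextualParse("https://textual.tonic.ai")
#   parsed = parse_local_file(textual, "reports/invoice.pdf", timeout=120)
```

The file is read inside the with block, so the handle is closed as soon as parse_file returns.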

parse_s3_file(
bucket: str,
key: str,
timeout: int | None = None,
) FileParseResult

Parse a given file found in Amazon S3. Uses boto3 to fetch files from Amazon S3.

Parameters:
  • bucket (str) – The bucket that contains the file to parse.

  • key (str) – The key of the file to parse.

  • timeout (Optional[int]) – Optional timeout in seconds. Indicates to stop waiting for the parsed result after the specified time.

Returns:

The parsed document.

Return type:

FileParseResult
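
If your object locations are stored as s3:// URIs, a sketch like the following splits a URI into the bucket and key that parse_s3_file expects. Both helper names are hypothetical. The sketch also assumes that AWS credentials are already available to boto3 in your environment.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key). Hypothetical helper."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not an S3 URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI must include a bucket and a key: {uri!r}")
    return bucket, key


def parse_s3_uri(client, uri, timeout=None):
    """Parse the object at `uri`; `client` is assumed to be a TextualParse instance."""
    bucket, key = split_s3_uri(uri)
    return client.parse_s3_file(bucket, key, timeout=timeout)


# Typical use (requires tonic_textual, boto3, and AWS credentials):
#   parsed = parse_s3_uri(textual, "s3://my-bucket/contracts/2024/nda.pdf")
```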

File parse results

class tonic_textual.classes.parse_api_responses.file_parse_result.FileParseResult(
response: Dict,
client: HttpClient,
document: Dict = None,
)

A class that represents the result of a parsed file.

Parameters:
  • response (Dict) – The response from the API.

  • client (HttpClient) – The HTTP client to use.

describe() str

Returns the parsed file path.

get_all_entities() List[SingleDetectionResult]

Returns a list of all of the detected entities in the file.

Returns:

A list of detected entities in the file.

Return type:

List[SingleDetectionResult]

get_chunks(
max_chars=15000,
generator_config: Dict[str, PiiState] = {},
generator_default: PiiState = PiiState.Off,
metadata_entities: List[str] = [],
include_metadata=True,
) List

Returns a list of chunks of text from the document. The chunks are filtered by the generator_config and generator_default configuration.

Parameters:
  • max_chars (int = 15_000) – The maximum number of characters in each chunk.

  • generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entity types. For each entity type, indicates whether to redact, synthesize, or ignore the detected entities. Values must be one of “Redaction”, “Synthesis”, or “Off”.

  • generator_default (PiiState = PiiState.Off) – The default redaction to use for all entity types that are not specified in generator_config. Value must be one of “Redaction”, “Synthesis”, or “Off”.

  • include_metadata (bool = True) – If True, the metadata is included in the chunk.

Returns:

A list of strings that contain the chunks of text.

Return type:

List[str]
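
The following sketch shows one way to request text-only chunks from a FileParseResult. The helper name text_chunks is hypothetical. In the usage comment, the PiiState import path and the NAME_GIVEN entity type are assumptions for illustration, so check them against your installed version.

```python
def text_chunks(parsed, generator_config, generator_default, max_chars=15000):
    """Return chunks from a FileParseResult without the metadata.

    Hypothetical helper: `parsed` is assumed to be a FileParseResult, and the
    generator arguments are passed through to get_chunks unchanged.
    """
    return parsed.get_chunks(
        max_chars=max_chars,
        generator_config=generator_config,
        generator_default=generator_default,
        include_metadata=False,
    )


# Typical use (import path and entity type name are assumptions):
#   from tonic_textual.enums.pii_state import PiiState
#   chunks = text_chunks(
#       parsed,
#       generator_config={"NAME_GIVEN": PiiState.Synthesis},
#       generator_default=PiiState.Redaction,
#       max_chars=2000,
#   )
```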

get_entities(
generator_config: Dict[str, PiiState] = {},
generator_default: PiiState = PiiState.Redaction,
allow_overlap: bool = False,
) List[SingleDetectionResult]

Returns a list of entities in the document. The entities are filtered by the generator_config and generator_default configuration.

Parameters:
  • generator_default (PiiState) – The default redaction to use for all entity types that are not specified in generator_config. Value must be one of “Redaction”, “Synthesis”, or “Off”.

  • generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entity types. For each entity type, indicates whether to redact, synthesize, or ignore the detected entities. Values must be one of “Redaction”, “Synthesis”, or “Off”.

Returns:

A list of the detected entities. Each item in the list contains the entity type, source start index, source end index, the entity text, and replacement text.

Return type:

List[SingleDetectionResult]
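
A quick way to summarize the returned list is to count entities per type. The sketch below assumes that each result exposes its entity type through an attribute named label. That attribute name is an assumption, so it is a parameter that you can change. The same helper works on the list returned by get_all_entities.

```python
from collections import Counter


def count_entities_by_type(entities, type_attr="label"):
    """Count detected entities per entity type.

    Hypothetical helper: `type_attr` is the attribute that holds the entity
    type on each result. The default "label" is an assumption.
    """
    return Counter(getattr(entity, type_attr) for entity in entities)


# Typical use:
#   counts = count_entities_by_type(parsed.get_entities())
#   for entity_type, total in counts.most_common():
#       print(entity_type, total)
```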

get_json() Dict

Returns the raw JSON generated by Tonic Textual.

Returns:

The raw JSON that Textual generates when it parses the file, in the form of a dictionary.

Return type:

Dict
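
Because get_json returns a dictionary, it can be written to disk with the standard json module. The helper name save_parse_json below is hypothetical.

```python
import json


def save_parse_json(parsed, out_path):
    """Write the dictionary from parsed.get_json() to `out_path` as UTF-8 JSON.

    Hypothetical helper: `parsed` is assumed to be a FileParseResult.
    """
    with open(out_path, "w", encoding="utf-8") as handle:
        json.dump(parsed.get_json(), handle, indent=2, ensure_ascii=False)
    return out_path


# Typical use:
#   save_parse_json(parsed, "invoice.parsed.json")
```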

get_markdown(
generator_config: Dict[str, PiiState] = {},
generator_default: PiiState = PiiState.Off,
random_seed: int | None = None,
) str

Returns the file in Markdown format. In the file, the entities are redacted or synthesized based on the specified configuration.

Parameters:
  • generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entity types. For each entity type, indicates whether to redact, synthesize, or ignore the detected entities. Values must be one of “Redaction”, “Synthesis”, or “Off”.

  • generator_default (PiiState = PiiState.Off) – The default redaction to use for all entity types that are not specified in generator_config. Value must be one of “Redaction”, “Synthesis”, or “Off”.

  • random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.

Returns:

The file in Markdown format. In the file, the entities are redacted or synthesized based on generator_config and generator_default.

Return type:

str
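
To check how random_seed behaves in your environment, a sketch like the following requests the Markdown twice with the same seed and compares the results. The helper name is hypothetical. It reports what it observes and does not guarantee any particular outcome.

```python
def markdown_is_reproducible(parsed, random_seed, **generator_options):
    """Return True if two get_markdown calls with the same seed match.

    Hypothetical helper: `parsed` is assumed to be a FileParseResult.
    Extra keyword arguments (for example generator_config and
    generator_default) are passed through to get_markdown unchanged.
    """
    first = parsed.get_markdown(random_seed=random_seed, **generator_options)
    second = parsed.get_markdown(random_seed=random_seed, **generator_options)
    return first == second


# Typical use:
#   same = markdown_is_reproducible(parsed, 42, generator_default=PiiState.Synthesis)
```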

get_tables() List[Table]

Returns a list of tables found in the document. Applies to CSV, XLSX, PDF, and image files.

Returns:

The list of tables found in the document.

Return type:

List[Table]

is_sensitive(
sensitive_entity_types: List[str],
start: int = 0,
end: int = -1,
) bool

Returns True if the element contains sensitive data. Otherwise returns False.

Parameters:
  • sensitive_entity_types (List[str]) – A list of sensitive entity types to check for.

  • start (int = 0) – The start index to check for sensitive data.

  • end (int = -1) – The end index to check for sensitive data.

Returns:

True if the element contains sensitive data. Otherwise False.

Return type:

bool
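
One use of is_sensitive is to route parsed files based on whether they contain particular entity types. The sketch below is a hypothetical helper that separates a batch of results into sensitive and non-sensitive groups. The entity type names in the usage comment are illustrative assumptions.

```python
def split_by_sensitivity(parsed_files, sensitive_entity_types):
    """Partition parsed files into (sensitive, not_sensitive) lists.

    Hypothetical helper: each item is assumed to be a FileParseResult, and
    is_sensitive is called with its default start and end values.
    """
    sensitive, not_sensitive = [], []
    for parsed in parsed_files:
        if parsed.is_sensitive(sensitive_entity_types):
            sensitive.append(parsed)
        else:
            not_sensitive.append(parsed)
    return sensitive, not_sensitive


# Typical use (entity type names are assumptions):
#   flagged, clean = split_by_sensitivity(results, ["NAME_GIVEN", "EMAIL_ADDRESS"])
```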