📖 Parse API documentation

TextualParse class

class tonic_textual.parse_api.TextualParse( base_url: str = 'https://textual.tonic.ai', api_key: str | None = None, verify: bool = True, )

Wrapper class for invoking Tonic Textual API

Parameters:

base_url (Optional[str]) – The URL of your Tonic Textual instance. Do not include a trailing slash. The default value is https://textual.tonic.ai.
api_key (Optional[str]) – Optional. Your API token. Instead of providing the API token here, we recommend that you set the API key in your environment as the value of TEXTUAL_API_KEY.
verify (bool) – Whether to verify SSL certificates. By default, verification is enabled.

Examples

>>> from tonic_textual.parse_api import TextualParse
>>> textual = TextualParse("https://textual.tonic.ai")
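
The constructor parameters above could be assembled along the following lines. This is only a sketch: the helper names client_settings and make_client are hypothetical and not part of the SDK, and the example assumes that the client reads TEXTUAL_API_KEY from the environment when api_key is omitted.

```python
def client_settings(base_url: str = "https://textual.tonic.ai", verify: bool = True) -> dict:
    """Hypothetical helper: build keyword arguments for TextualParse."""
    # Strip any trailing slash so the URL matches the format the docs ask for.
    # api_key is left out on purpose; it is assumed to come from the
    # TEXTUAL_API_KEY environment variable.
    return {"base_url": base_url.rstrip("/"), "verify": verify}


def make_client(**overrides):
    """Hypothetical helper: create the client (requires the SDK to be installed)."""
    # Imported lazily so client_settings can be used without the SDK.
    from tonic_textual.parse_api import TextualParse

    return TextualParse(**client_settings(**overrides))
```

For a self-hosted instance, make_client(base_url="https://textual.example.com/") would connect to the URL without the trailing slash; the host name here is a placeholder.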

parse_file( file: IOBase, file_name: str, timeout: int | None = None, ) → FileParseResult

Parse a given file. To open binary files, use the ‘rb’ option.

Parameters:

file (io.IOBase) – The opened file, available for reading, to parse.
file_name (str) – The name of the file.
timeout (Optional[int]) – Optional timeout in seconds. The call stops waiting for the parsed result after the specified time.

Returns:

The parsed document.

Return type:

FileParseResult
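
A call to parse_file might be wrapped as in the sketch below. The helper name parse_local_file is hypothetical; client stands for a TextualParse instance created elsewhere.

```python
import os


def parse_local_file(client, path: str, timeout=None):
    """Hypothetical helper: parse a file on disk with client.parse_file."""
    # Open in binary mode ('rb') so that PDFs, images, and other binary
    # formats are read unchanged.
    with open(path, "rb") as handle:
        # The file name sent to Textual is the last component of the path.
        return client.parse_file(handle, os.path.basename(path), timeout=timeout)
```

The file is read inside the with block, so the handle is closed as soon as parse_file returns or raises.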

parse_s3_file( bucket: str, key: str, timeout: int | None = None, ) → FileParseResult

Parse a given file found in Amazon S3. Uses boto3 to fetch files from Amazon S3.

Parameters:

bucket (str) – The bucket that contains the file to parse.
key (str) – The key of the file to parse.
timeout (Optional[int]) – Optional timeout in seconds. The call stops waiting for the parsed result after the specified time.

Returns:

The parsed document.

Return type:

FileParseResult
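
Because parse_s3_file takes the bucket and key separately, a location written as a single s3:// URI has to be split first. The sketch below shows one way to do that; split_s3_uri and parse_from_uri are hypothetical helpers, not SDK functions, and client stands for a TextualParse instance.

```python
def split_s3_uri(uri: str) -> tuple:
    """Hypothetical helper: split 's3://bucket/key' into (bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # Everything up to the first '/' is the bucket; the rest is the key.
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI needs both a bucket and a key: {uri!r}")
    return bucket, key


def parse_from_uri(client, uri: str, timeout=None):
    """Hypothetical helper: parse the object that an s3:// URI points to."""
    bucket, key = split_s3_uri(uri)
    return client.parse_s3_file(bucket, key, timeout=timeout)
```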

File parse results

class tonic_textual.classes.parse_api_responses.file_parse_result.FileParseResult( response: Dict, client: HttpClient, document: Dict = None, )

A class that represents the result of a parsed file.

Parameters:

response (Dict) – The response from the API.
client (HttpClient) – The HTTP client to use.

describe() → str

Returns the parsed file path.

get_all_entities() → List[SingleDetectionResult]

Returns a list of all of the detected entities in the file.

Returns:

A list of detected entities in the file.

Return type:

List[SingleDetectionResult]

get_chunks( max_chars=15000, generator_config: Dict[str, PiiState] = {}, generator_default: PiiState = PiiState.Off, metadata_entities: List[str] = [], include_metadata=True, ) → List

Returns a list of chunks of text from the document. The chunks are filtered by the generator_config and generator_default configuration.

Parameters:

max_chars (int = 15_000) – The maximum number of characters in each chunk.
generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entity types. For each entity type, indicates whether to redact, synthesize, or ignore the detected entities. Values must be one of “Redaction”, “Synthesis”, or “Off”.
generator_default (PiiState = PiiState.Off) – The default redaction to use for all entity types that are not specified in generator_config. Value must be one of “Redaction”, “Synthesis”, or “Off”.

include_metadata (bool = True) – If True, the metadata is included in the chunk.

Returns:

A list of strings that contain the chunks of text.

Return type:

List[str]
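
A generator_config dictionary can be built from lists of entity types, as in the sketch below. The helper names are hypothetical, and the entity type names in the usage note are illustrative only. The example also assumes that the plain strings “Redaction”, “Synthesis”, and “Off” are accepted as values; if your SDK version requires enum members, substitute PiiState.Redaction, PiiState.Synthesis, and PiiState.Off.

```python
def build_generator_config(redact=(), synthesize=(), ignore=()) -> dict:
    """Hypothetical helper: map each entity type to a handling option."""
    config = {}
    for entity_type in ignore:
        config[entity_type] = "Off"
    for entity_type in synthesize:
        config[entity_type] = "Synthesis"
    for entity_type in redact:
        # Applied last, so redaction wins if a type is listed more than once.
        config[entity_type] = "Redaction"
    return config


def chunks_with_config(parsed, config, default="Off", max_chars=2000):
    """Hypothetical helper: `parsed` is a FileParseResult."""
    # Entity types missing from config fall back to generator_default.
    return parsed.get_chunks(
        max_chars=max_chars,
        generator_config=config,
        generator_default=default,
    )
```

For example, build_generator_config(redact=["NAME_GIVEN"], synthesize=["EMAIL_ADDRESS"]) produces a dictionary that redacts one type and synthesizes the other, and every other type uses the default.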

get_entities( generator_config: Dict[str, PiiState] = {}, generator_default: PiiState = PiiState.Redaction, allow_overlap: bool = False, ) → List[SingleDetectionResult]

Returns a list of entities in the document. The entities are filtered by the generator_config and generator_default configuration.

Parameters:

generator_default (PiiState) – The default redaction to use for all entity types that are not specified in generator_config. Value must be one of “Redaction”, “Synthesis”, or “Off”.
generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entity types. For each entity type, indicates whether to redact, synthesize, or ignore the detected entities. Values must be one of “Redaction”, “Synthesis”, or “Off”.

Returns:

A list of the detected entities. Each item in the list contains the entity type, source start index, source end index, entity text, and replacement text.

Return type:

List[SingleDetectionResult]
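
The returned list can be summarized per entity type, as in the sketch below. The helper name is hypothetical, and the attribute name label is an assumption about how SingleDetectionResult exposes the entity type; adjust it to match your SDK version.

```python
from collections import Counter


def count_entities_by_type(entities) -> Counter:
    """Hypothetical helper: tally detected entities per entity type."""
    # Assumes each result exposes the entity type as a `label` attribute.
    return Counter(entity.label for entity in entities)
```

Calling count_entities_by_type(parsed.get_entities()) on a FileParseResult named parsed would give a quick overview of which entity types were detected and how often.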

get_json() → Dict

Returns the raw JSON generated by Tonic Textual.

Returns:

The raw JSON that Textual generates when it parses the file, in the form of a dictionary.

Return type:

Dict

get_markdown( generator_config: Dict[str, PiiState] = {}, generator_default: PiiState = PiiState.Off, random_seed: int | None = None, ) → str

Returns the file in Markdown format. In the file, the entities are redacted or synthesized based on the specified configuration.

Parameters:

generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entity types. For each entity type, indicates whether to redact, synthesize, or ignore the detected entities. Values must be one of “Redaction”, “Synthesis”, or “Off”.
generator_default (PiiState = PiiState.Off) – The default redaction to use for all entity types that are not specified in generator_config. Value must be one of “Redaction”, “Synthesis”, or “Off”.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.

Returns:

The file in Markdown format. In the file, the entities are redacted or synthesized based on generator_config and generator_default.

Return type:

str
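
The Markdown output can be written to disk with a fixed seed, as in the sketch below. The helper name save_markdown is hypothetical; parsed stands for a FileParseResult, and any extra keyword arguments (generator_config, generator_default) are passed through unchanged.

```python
from pathlib import Path


def save_markdown(parsed, out_path, seed=None, **generator_options):
    """Hypothetical helper: write parsed.get_markdown() output to a file."""
    # Passing the same seed on every call keeps the random seeding
    # consistent across calls; None leaves Textual's default in place.
    markdown = parsed.get_markdown(random_seed=seed, **generator_options)
    path = Path(out_path)
    path.write_text(markdown, encoding="utf-8")
    return path
```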

get_tables() → List[Table]

Returns a list of tables found in the document. Applies to CSV, XLSX, PDF, and image files.

Returns:

The list of tables found in the document.

Return type:

List[Table]

is_sensitive( sensitive_entity_types: List[str], start: int = 0, end: int = -1, ) → bool

Returns True if the element contains sensitive data. Otherwise returns False.

Parameters:

sensitive_entity_types (List[str]) – A list of sensitive entity types to check for.
start (int = 0) – The start index to check for sensitive data.
end (int = -1) – The end index to check for sensitive data.

Returns:

Returns True if the element contains sensitive data. Otherwise returns False.

Return type:

bool
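
One possible use of is_sensitive is to redact only those files that contain the entity types you care about. The sketch below is illustrative: markdown_for_sharing is a hypothetical helper, parsed stands for a FileParseResult, and the plain string “Redaction” is assumed to be accepted as a generator_config value (otherwise use PiiState.Redaction).

```python
def markdown_for_sharing(parsed, sensitive_entity_types):
    """Hypothetical helper: redact only when the file has sensitive data."""
    if parsed.is_sensitive(sensitive_entity_types):
        # Redact the listed types; all other types keep the default handling.
        config = {entity_type: "Redaction" for entity_type in sensitive_entity_types}
        return parsed.get_markdown(generator_config=config)
    # Nothing sensitive was found, so return the Markdown with default settings.
    return parsed.get_markdown()
```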