📖 Parse API documentation
TextualParse class
- class tonic_textual.parse_api.TextualParse(
- base_url: str = 'https://textual.tonic.ai',
- api_key: str | None = None,
- verify: bool = True,
Wrapper class for invoking the Tonic Textual API.
- Parameters:
base_url (Optional[str]) – The URL to your Tonic Textual instance. Do not include trailing slashes. The default value is https://textual.tonic.ai.
api_key (Optional[str]) – Optional. Your API token. Instead of providing the API token here, we recommend that you set the API key in your environment as the value of TEXTUAL_API_KEY.
verify (bool) – Whether to verify the SSL certificate. By default, this is enabled.
Examples
>>> from tonic_textual.parse_api import TextualParse
>>> textual = TextualParse("https://textual.tonic.ai")
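As a minimal sketch of the setup described above, the helper below strips any trailing slash from the URL and leaves api_key unset, so that the key can come from the TEXTUAL_API_KEY environment variable. The names normalize_base_url and make_client are hypothetical and are not part of the SDK. The SDK import is deferred so that the snippet loads even where tonic_textual is not installed.

```python
def normalize_base_url(base_url: str) -> str:
    """Drop trailing slashes, which the base_url parameter should not include."""
    return base_url.rstrip("/")


def make_client(base_url: str = "https://textual.tonic.ai", verify: bool = True):
    """Build a TextualParse client (hypothetical helper, not part of the SDK)."""
    # Deferred import: only needed when a client is actually created.
    from tonic_textual.parse_api import TextualParse

    # api_key is omitted on purpose; per the parameter description, the key
    # can be supplied through the TEXTUAL_API_KEY environment variable.
    return TextualParse(base_url=normalize_base_url(base_url), verify=verify)
```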
- parse_file(
- file: IOBase,
- file_name: str,
- timeout: int | None = None,
Parse a given file. To open binary files, use the ‘rb’ option.
- Parameters:
file (io.IOBase) – The opened file, available for reading, to parse.
file_name (str) – The name of the file.
timeout (Optional[int]) – Optional timeout in seconds. Indicates to stop waiting for the parsed result after the specified time.
- Returns:
The parsed document.
- Return type:
FileParseResult
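The following sketch shows one way to call parse_file on a file on disk, opening it with the 'rb' option as recommended. It assumes that client is a TextualParse instance. parse_local_file is a hypothetical convenience wrapper, not an SDK function.

```python
from pathlib import Path


def parse_local_file(client, path, timeout=None):
    """Open a file in binary mode and hand it to client.parse_file.

    client is assumed to be a TextualParse instance; parse_local_file itself
    is a hypothetical wrapper.
    """
    path = Path(path)
    # 'rb' keeps binary formats such as PDF intact while they are read.
    with path.open("rb") as handle:
        # The file name is passed separately, as parse_file requires.
        return client.parse_file(handle, path.name, timeout=timeout)
```

The call is made inside the with block, so the file is still open for reading while the request runs.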
- parse_s3_file(
- bucket: str,
- key: str,
- timeout: int | None = None,
Parse a given file found in Amazon S3. Uses boto3 to fetch files from Amazon S3.
- Parameters:
bucket (str) – The bucket that contains the file to parse.
key (str) – The key of the file to parse.
timeout (Optional[int]) – Optional timeout in seconds. Indicates to stop waiting for the parsed result after the specified time.
- Returns:
The parsed document.
- Return type:
FileParseResult
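As an illustration, the sketch below parses several objects from the same bucket by calling parse_s3_file once per key. It assumes that client is a TextualParse instance and that boto3 can find AWS credentials in the environment where the code runs. parse_s3_keys is a hypothetical helper name.

```python
def parse_s3_keys(client, bucket, keys, timeout=None):
    """Parse several S3 objects from one bucket and return {key: result}.

    client is assumed to be a TextualParse instance; this helper is a
    hypothetical wrapper around parse_s3_file.
    """
    results = {}
    for key in keys:
        # Each call fetches one object from the bucket and parses it.
        results[key] = client.parse_s3_file(bucket, key, timeout=timeout)
    return results
```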
File parse results
- class tonic_textual.classes.parse_api_responses.file_parse_result.FileParseResult(
- response: Dict,
- client: HttpClient,
- document: Dict = None,
A class that represents the result of a parsed file.
- Parameters:
response (Dict) – The response from the API.
client (HttpClient) – The HTTP client to use.
- describe() str
Returns the parsed file path.
- get_all_entities() List[SingleDetectionResult]
Returns a list of all of the detected entities in the file.
- Returns:
A list of detected entities in the file.
- Return type:
List[SingleDetectionResult]
- get_chunks(
- max_chars=15000,
- generator_config: Dict[str, PiiState] = {},
- generator_default: PiiState = PiiState.Off,
- metadata_entities: List[str] = [],
- include_metadata=True,
Returns a list of chunks of text from the document. The chunks are filtered based on the generator_config and generator_default configuration.
- Parameters:
max_chars (int = 15_000) – The maximum number of characters in each chunk.
generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entity types. For each entity type, indicates whether to redact, synthesize, or ignore the detected entities. Values must be one of “Redaction”, “Synthesis”, or “Off”.
generator_default (PiiState = PiiState.Off) – The default redaction to use for all entity types that are not specified in generator_config. Value must be one of “Redaction”, “Synthesis”, or “Off”.
include_metadata (bool = True) – If True, the metadata is included in the chunk.
- Returns:
A list of strings that contain the chunks of text.
- Return type:
List[str]
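A minimal sketch of a get_chunks call is shown below. It assumes that result is a FileParseResult and that the caller supplies PiiState values for generator_config and generator_default. get_small_chunks is a hypothetical helper, and the 2000-character limit is an arbitrary illustration.

```python
def get_small_chunks(result, generator_config, generator_default, max_chars=2000):
    """Request chunks of at most max_chars characters, without metadata.

    result is assumed to be a FileParseResult. generator_config and
    generator_default are PiiState values supplied by the caller.
    """
    chunks = result.get_chunks(
        max_chars=max_chars,
        generator_config=generator_config,
        generator_default=generator_default,
        include_metadata=False,
    )
    # Drop whitespace-only chunks so downstream consumers see only real text.
    return [chunk for chunk in chunks if chunk.strip()]
```

A call might look like get_small_chunks(parsed, {"NAME_GIVEN": PiiState.Synthesis}, PiiState.Redaction). The entity type name NAME_GIVEN and the import path tonic_textual.enums.pii_state for PiiState are assumptions that this page does not confirm.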
- get_entities(
- generator_config: Dict[str, PiiState] = {},
- generator_default: PiiState = PiiState.Redaction,
- allow_overlap: bool = False,
Returns a list of entities in the document. The entities are filtered based on the generator_config and generator_default configuration.
- Parameters:
generator_default (PiiState) – The default redaction to use for all entity types that are not specified in generator_config. Value must be one of “Redaction”, “Synthesis”, or “Off”.
generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entity types. For each entity type, indicates whether to redact, synthesize, or ignore the detected entities. Values must be one of “Redaction”, “Synthesis”, or “Off”.
- Returns:
A list of the detected entities. Each item in the list contains the entity type, source start index, source end index, the entity text, and replacement text.
- Return type:
List[SingleDetectionResult]
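One common use of the returned list is to summarize which entity types were detected. The sketch below counts entities per type. This page does not give the field names of SingleDetectionResult, so the default accessor, which reads a "label" key, is an assumption. Pass a different type_of callable if the items expose the entity type another way. count_entity_types is a hypothetical helper.

```python
from collections import Counter


def count_entity_types(entities, type_of=lambda entity: entity["label"]):
    """Count detected entities per entity type.

    The default accessor assumes each item exposes its type under a "label"
    key; that key name is an assumption, so pass another type_of if needed.
    """
    return Counter(type_of(entity) for entity in entities)
```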
- get_json() Dict
Returns the raw JSON generated by Tonic Textual.
- Returns:
The raw JSON that Textual generates when it parses the file, in the form of a dictionary.
- Return type:
Dict
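Because get_json returns a plain dictionary, it can be serialized with the standard json module, for example to inspect or store the raw output. This sketch assumes that result is a FileParseResult and that the dictionary holds only JSON-serializable values. raw_json_text is a hypothetical helper.

```python
import json


def raw_json_text(result, indent=2):
    """Serialize result.get_json() to an indented JSON string.

    result is assumed to be a FileParseResult; this helper is hypothetical.
    """
    raw = result.get_json()
    # ensure_ascii=False keeps non-ASCII text from the document readable.
    return json.dumps(raw, indent=indent, ensure_ascii=False, sort_keys=True)
```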
- get_markdown(
- generator_config: Dict[str, PiiState] = {},
- generator_default: PiiState = PiiState.Off,
- random_seed: int | None = None,
Returns the file in Markdown format. In the file, the entities are redacted or synthesized based on the specified configuration.
- Parameters:
generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entity types. For each entity type, indicates whether to redact, synthesize, or ignore the detected entities. Values must be one of “Redaction”, “Synthesis”, or “Off”.
generator_default (PiiState = PiiState.Off) – The default redaction to use for all entity types that are not specified in generator_config. Value must be one of “Redaction”, “Synthesis”, or “Off”.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.
- Returns:
The file in Markdown format. In the file, the entities are redacted or synthesized based on generator_config and generator_default.
- Return type:
str
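To illustrate the random_seed parameter, the sketch below renders the same document once per seed, which makes it easy to compare the output of different seeds or to repeat a run with the same one. It assumes that result is a FileParseResult and that the caller supplies PiiState values. markdown_for_seeds is a hypothetical helper.

```python
def markdown_for_seeds(result, generator_config, generator_default, seeds):
    """Render the document once per seed and return {seed: markdown}.

    result is assumed to be a FileParseResult; this helper is hypothetical.
    """
    return {
        seed: result.get_markdown(
            generator_config=generator_config,
            generator_default=generator_default,
            random_seed=seed,
        )
        for seed in seeds
    }
```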
- get_tables() List[Table]
Returns a list of tables found in the document. Applies to CSV, XLSX, PDF, and image files.
- Returns:
Returns the list of tables found in the document.
- Return type:
List[Table]
- is_sensitive(
- sensitive_entity_types: List[str],
- start: int = 0,
- end: int = -1,
Returns True if the element contains sensitive data. Otherwise returns False.
- Parameters:
sensitive_entity_types (List[str]) – A list of sensitive entity types to check for.
start (int = 0) – The start index to check for sensitive data.
end (int = -1) – The end index to check for sensitive data.
- Returns:
Returns True if the element contains sensitive data. Otherwise returns False.
- Return type:
bool
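As a sketch of how is_sensitive might be used, the function below labels a parsed document based on whether it contains any of the listed entity types. It assumes that result is a FileParseResult. The function name and the labels "restricted" and "open" are hypothetical.

```python
def classify_document(result, sensitive_entity_types):
    """Label a parsed document as 'restricted' or 'open' (hypothetical labels).

    result is assumed to be a FileParseResult.
    """
    # start and end are left at their documented defaults (0 and -1).
    if result.is_sensitive(sensitive_entity_types):
        return "restricted"
    return "open"
```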