Dataset API documentation

Dataset class

class tonic_textual.classes.dataset.Dataset(
client: HttpClient,
id: str,
name: str,
files: List[Dict[str, Any]],
custom_pii_entity_ids: List[str],
generator_config: Dict[str, PiiState] | None = None,
generator_metadata: Dict[str, BaseMetadata] | None = None,
label_block_lists: Dict[str, List[str]] | None = None,
label_allow_lists: Dict[str, List[str]] | None = None,
docx_image_policy_name: docx_image_policy | None = docx_image_policy.redact,
docx_comment_policy_name: docx_comment_policy | None = docx_comment_policy.remove,
docx_table_policy_name: docx_table_policy | None = docx_table_policy.remove,
pdf_signature_policy_name: pdf_signature_policy | None = pdf_signature_policy.redact,
pdf_synth_mode_policy: pdf_synth_mode_policy | None = pdf_synth_mode_policy.V1,
)
add_file(
file_path: str | None = None,
file_name: str | None = None,
file: IOBase | None = None,
) → DatasetFile | None

Uploads a file to the dataset.

Parameters:
  • file_path (Optional[str]) – The absolute path of the file to upload. If specified, you cannot also provide the ‘file’ argument.

  • file_name (Optional[str]) – The name of the file to save to Tonic Textual. Optional if you use file_path to upload the file. Required if you use the ‘file’ argument.

  • file (Optional[io.IOBase]) – The bytes of a file to upload. If specified, you must also provide the ‘file_name’ argument. You cannot use the ‘file_path’ argument in the same call.

Raises:

DatasetFileMatchesExistingFile – Returned if the file content matches an existing file.
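The two mutually exclusive upload paths above can be sketched as follows. This is a minimal example, assuming `dataset` is an existing Dataset instance; the file path and contents are illustrative.

```python
from io import BytesIO

def upload_examples(dataset):
    # Option 1: upload by path; the file name is taken from the path.
    dataset.add_file(file_path="/data/patients.csv")

    # Option 2: upload a file-like object of bytes; file_name is then required.
    buf = BytesIO(b"name,phone\nJane Doe,555-0123\n")
    dataset.add_file(file=buf, file_name="patients.csv")
```

Passing both `file_path` and `file` in the same call is not allowed.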

delete_file(
file_id: str,
)

Deletes the given file from the dataset.

Parameters:

file_id (str) – The identifier of the dataset file to delete.

describe() → str

Returns a string of the dataset name, identifier, and the list of files.

Examples

>>> dataset.describe()
Dataset: your_dataset_name [dataset_id]
Number of Files: 2
Number of Rows: 1000
edit(
name: str | None = None,
generator_config: Dict[str, PiiState] | None = None,
generator_metadata: Dict[str, BaseMetadata] | None = None,
label_block_lists: Dict[str, List[str]] | None = None,
label_allow_lists: Dict[str, List[str]] | None = None,
docx_image_policy_name: docx_image_policy | None = None,
docx_comment_policy_name: docx_comment_policy | None = None,
docx_table_policy_name: docx_table_policy | None = None,
pdf_signature_policy_name: pdf_signature_policy | None = None,
pdf_synth_mode_policy_name: pdf_synth_mode_policy | None = None,
should_rescan=True,
copy_from_dataset: Dataset | None = None,
)

Edits the dataset. Only changes the fields that are provided as function arguments. Currently, you can edit the dataset name and the generator setup, which indicates how to handle each entity type.

Parameters:
  • name (Optional[str]) – The new name of the dataset. Returns an error if the new name conflicts with an existing dataset name.

  • generator_config (Optional[Dict[str, PiiState]]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it.

  • generator_metadata (Optional[Dict[str, BaseMetadata]]) – A dictionary of sensitive data entities. For each entity, provides the generator configuration to use when synthesis is selected. Values must be of a metadata type that is appropriate to the PII type.

  • label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored entities). When an entity of the specified type matches a regular expression in the list, the value is ignored and not redacted or synthesized.

  • label_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, included entities). When a piece of text matches a regular expression in the list, the text is marked as the entity type and is included in the redaction or synthesis.

  • docx_image_policy_name (Optional[docx_image_policy] = None) – The policy for handling images in DOCX files. Options are ‘redact’, ‘ignore’, and ‘remove’.

  • docx_comment_policy_name (Optional[docx_comment_policy] = None) – The policy for handling comments in DOCX files. Options are ‘remove’ and ‘ignore’.

  • docx_table_policy_name (Optional[docx_table_policy] = None) – The policy for handling tables in DOCX files. Options are ‘redact’ and ‘remove’.

  • pdf_signature_policy_name (Optional[pdf_signature_policy] = None) – The policy for handling signatures in PDF files. Options are ‘redact’ and ‘ignore’.

  • pdf_synth_mode_policy_name (Optional[pdf_synth_mode_policy] = None) – The policy for which version of PDF synthesis to use. Options are V1 and V2.

  • copy_from_dataset (Optional[Dataset]) – Another dataset object to copy settings from. This parameter is mutually exclusive with the other parameters.

Raises:
  • DatasetNameAlreadyExists – Raised if a dataset with the same name already exists.

  • BadArgumentsException – Raised if copy_from_dataset is provided together with any other parameter.
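A minimal sketch of a partial edit, assuming `dataset` is an existing Dataset instance. The entity type names, the new dataset name, and the regular expression are illustrative; generator_config values are shown as the PiiState string names (‘Redaction’, ‘Synthesis’, ‘Off’).

```python
# Per-entity handling: synthesize given names, redact phone numbers,
# and leave date/time values untouched.
generator_config = {
    "NAME_GIVEN": "Synthesis",
    "PHONE_NUMBER": "Redaction",
    "DATE_TIME": "Off",
}

# Treat any text that matches these patterns as the given entity type.
label_allow_lists = {
    "PHONE_NUMBER": [r"\d{3}-\d{4}"],
}

def reconfigure(dataset):
    # Only the provided fields change; all other settings keep their values.
    dataset.edit(
        name="customer-notes-v2",  # hypothetical new name
        generator_config=generator_config,
        label_allow_lists=label_allow_lists,
    )
```

Because edit only touches the supplied arguments, repeated calls with different fields compose rather than overwrite each other.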

fetch_all_df()

Fetches all of the data in the dataset as a pandas DataFrame.

Returns:

Dataset data in a pandas DataFrame.

Return type:

pd.DataFrame

fetch_all_json() → str

Fetches all of the data in the dataset as JSON.

Returns:

Dataset data in JSON format.

Return type:

str
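The two fetch methods above can be combined into a small export helper. This is a sketch, assuming `dataset` is an existing Dataset instance; the output path is illustrative.

```python
def export_dataset(dataset, json_path):
    # Fetch the dataset as a DataFrame for in-memory analysis...
    df = dataset.fetch_all_df()
    # ...and also write the raw JSON string for archival.
    with open(json_path, "w") as fh:
        fh.write(dataset.fetch_all_json())
    return df
```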

files: List[DatasetFile]

Class to represent and provide access to a Tonic Textual dataset.

Parameters:
  • id (str) – Dataset identifier.

  • name (str) – Dataset name.

  • files (List[Dict[str, Any]]) – Serialized DatasetFile objects that represent the files in the dataset.

  • client (HttpClient) – The HTTP client to use.

get_entity_mappings() → DatasetEntityMappingsResponse

Gets the entities detected in the dataset, grouped by file, together with the redacted, synthetic, and final output values produced by the current dataset configuration.

Returns:

Entity mappings grouped by file. Files with no applicable entities are returned with an empty entity list.

Return type:

DatasetEntityMappingsResponse

get_failed_files(
refetch: bool | None = True,
) → List[DatasetFile]

Gets all of the dataset files that encountered an error when they were processed. These files are effectively ignored.

Parameters:

refetch (Optional[bool]) – Default True. Whether to make an API call first to ensure that the retrieved list of files is up to date.

Returns:

The list of files that had processing errors.

Return type:

List[DatasetFile]

get_processed_files(
refetch: bool | None = True,
) → List[DatasetFile]

Gets all of the dataset files for which processing is complete. The data in these files is returned when data is requested.

Parameters:

refetch (Optional[bool]) – Default True. Whether to make an API call first to ensure that the retrieved list of files is up to date.

Returns:

The list of processed dataset files.

Return type:

List[DatasetFile]

get_queued_files(
refetch: bool | None = True,
) → List[DatasetFile]

Gets all of the dataset files that are waiting to be processed.

Parameters:

refetch (Optional[bool]) – Default True. Whether to make an API call first to ensure that the retrieved list of files is up to date.

Returns:

The list of dataset files that await processing.

Return type:

List[DatasetFile]

get_running_files(
refetch: bool | None = True,
) → List[DatasetFile]

Gets all of the dataset files that are currently being processed.

Parameters:

refetch (Optional[bool]) – Default True. Whether to make an API call first to ensure that the retrieved list of files is up to date.

Returns:

The list of files that are being processed.

Return type:

List[DatasetFile]
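The four status getters above (queued, running, processed, failed) support a simple polling loop after upload. This is a sketch, assuming `dataset` is an existing Dataset instance; the poll interval and error handling are illustrative.

```python
import time

def wait_for_processing(dataset, poll_seconds=10):
    # Poll until no files remain queued or running. Each getter refetches
    # the file list from the API by default (refetch=True).
    while dataset.get_queued_files() or dataset.get_running_files():
        time.sleep(poll_seconds)

    # Failed files are effectively ignored by the dataset, so surface them.
    failed = dataset.get_failed_files()
    if failed:
        raise RuntimeError(f"{len(failed)} file(s) failed processing")

    return dataset.get_processed_files()
```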

property pii_info

DatasetFile class

class tonic_textual.classes.datasetfile.DatasetFile(
client: HttpClient,
id: str,
dataset_id: str,
name: str,
num_rows: int | None,
num_columns: int,
processing_status: str,
processing_error: str | None,
label_allow_lists: Dict[str, LabelCustomList] | None = None,
docx_image_policy_name: docx_image_policy | None = docx_image_policy.redact,
docx_comment_policy_name: docx_comment_policy | None = docx_comment_policy.remove,
docx_table_policy_name: docx_table_policy | None = docx_table_policy.redact,
pdf_signature_policy_name: pdf_signature_policy | None = pdf_signature_policy.redact,
pdf_synth_mode_policy: pdf_synth_mode_policy | None = pdf_synth_mode_policy.V1,
)

Class to store the metadata for a dataset file.

Parameters:
  • id (str) – The identifier of the dataset file.

  • name (str) – The file name of the dataset file.

  • num_rows (Optional[int]) – The number of rows in the dataset file.

  • num_columns (int) – The number of columns in the dataset file.

  • processing_status (str) – The status of the dataset file in the processing pipeline. Possible values are ‘Completed’, ‘Failed’, ‘Cancelled’, ‘Running’, and ‘Queued’.

  • processing_error (Optional[str]) – If the dataset file processing failed, a description of the issue that caused the failure.

  • label_allow_lists (Dict[str, LabelCustomList]) – A dictionary of custom entity detection regular expressions for the dataset file. Each key is an entity type to detect, and each value is a LabelCustomList object whose regular expressions identify text to recognize as the specified entity type.

  • docx_image_policy_name (Optional[docx_image_policy] = None) – The policy for handling images in DOCX files. Options are ‘redact’, ‘ignore’, and ‘remove’.

  • docx_comment_policy_name (Optional[docx_comment_policy] = None) – The policy for handling comments in DOCX files. Options are ‘remove’ and ‘ignore’.

  • docx_table_policy_name (Optional[docx_table_policy] = None) – The policy for handling tables in DOCX files. Options are ‘redact’ and ‘remove’.

  • pdf_signature_policy_name (Optional[pdf_signature_policy] = None) – The policy for handling signatures in PDF files. Options are ‘redact’ and ‘ignore’.

  • pdf_synth_mode_policy (Optional[pdf_synth_mode_policy] = None) – The policy for which version of PDF synthesis to use. Options are V1 and V2.

describe() → str

Returns the dataset file metadata as a string. Includes the identifier, file name, number of rows, and number of columns.

download(
random_seed: int | None = None,
num_retries: int = 6,
wait_between_retries: int = 10,
) → bytes

Downloads a redacted version of the file.

Parameters:
  • random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.

  • num_retries (int = 6) – The number of times to attempt to download the file. If the file is not yet ready for download, waits wait_between_retries seconds before it tries again. The default value is 6.

  • wait_between_retries (int = 10) – The number of seconds to wait between retry attempts.

Returns:

The redacted file as a byte array.

Return type:

bytes
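A minimal sketch of saving the downloaded bytes to disk, assuming `dataset_file` is a DatasetFile whose processing is complete; the output path is illustrative.

```python
def save_redacted(dataset_file, out_path):
    # download() retries up to num_retries times, pausing
    # wait_between_retries seconds between attempts while the
    # redacted file is being generated.
    data = dataset_file.download(num_retries=6, wait_between_retries=10)
    with open(out_path, "wb") as fh:
        fh.write(data)
    return len(data)
```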

get_entities(
pii_types: List[PiiType | str] | None = None,
) → Dict[PiiType, List[NerRedactionApiModel]]