NER API documentation¶

TextualNer class¶

class tonic_textual.redact_api.TextualNer( base_url: str = 'https://textual.tonic.ai', api_key: str | None = None, verify: bool = True, )¶

Wrapper class to invoke the Tonic Textual API

Parameters:

base_url (str) – The URL to your Tonic Textual instance. Do not include a trailing slash. The default value is https://textual.tonic.ai.
api_key (str) – Optional. Your API token. Instead of providing the API token here, we recommend that you set the API key in your environment as the value of TONIC_TEXTUAL_API_KEY.
verify (bool) – Whether to verify SSL certificates. By default, this is enabled.

Examples

>>> from tonic_textual.redact_api import TextualNer
>>> textual = TextualNer()
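
The two constructor notes above (no trailing slash on base_url, API key taken from the environment) can be checked on the caller's side before the client is created. The following is a minimal sketch; `textual_ner_kwargs` is a hypothetical helper that is not part of tonic_textual and only prepares keyword arguments.

```python
import os


def textual_ner_kwargs(base_url="https://textual.tonic.ai", verify=True):
    """Prepare keyword arguments for TextualNer (hypothetical helper, not part of the SDK)."""
    # Fail early if the recommended environment variable is missing.
    if not os.environ.get("TONIC_TEXTUAL_API_KEY"):
        raise RuntimeError("Set TONIC_TEXTUAL_API_KEY before creating the client")
    # Remove any trailing slashes, as the base_url description asks.
    return {"base_url": base_url.rstrip("/"), "verify": verify}


# Usage, with the package installed:
#   from tonic_textual.redact_api import TextualNer
#   textual = TextualNer(**textual_ner_kwargs("https://textual.example.com/"))
```

The example URL is a placeholder. Because api_key is left out of the returned dictionary, the client is expected to pick the key up from the environment, as the parameter description recommends.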

create_dataset( dataset_name: str, )¶

Creates a dataset. A dataset is a collection of 1 or more files for Tonic Textual to scan and redact.

Parameters:: dataset_name (str) – The name of the dataset. Dataset names must be unique.
Returns:: The newly created dataset.
Return type:: Dataset
Raises:: DatasetNameAlreadyExists – Raised if a dataset with the same name already exists.

create_model_entity( name: str, guidelines: str, display_name: str | None = None, ) → ModelEntity¶

Create a new model-based custom entity.

Model-based entities use ML models trained on your data to detect custom entity types. The workflow is:

1. Create the entity with initial guidelines.
2. Upload test data with ground truth annotations.
3. Iterate on the guidelines based on LLM suggestions.
4. Train a model on the annotated data.
5. Activate the model for use in datasets.

Parameters:

name (str) – Internal name for the entity (used as identifier)
guidelines (str) – Initial annotation guidelines for the LLM annotator
display_name (str, optional) – Display name for the UI

Returns:

The newly created model entity

Return type:

ModelEntity

Examples

>>> entity = textual.create_model_entity(
...     name="PRODUCT_CODE",
...     guidelines="Identify product codes in format ABC-1234"
... )
>>> entity.upload_test_data([
...     {"text": "Order ABC-1234", "spans": [{"start": 6, "end": 14}]}
... ])
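
Span offsets are easy to get wrong by hand. The sketch below computes them from the text, assuming that start is inclusive and end is exclusive, which is what the offsets 6 and 14 in the example above imply. `annotated_example` is a hypothetical helper, not part of the SDK.

```python
def annotated_example(text, entity_text):
    """Build one test-data item with a single span (hypothetical helper)."""
    # str.index raises ValueError if entity_text does not occur in text.
    start = text.index(entity_text)
    return {"text": text, "spans": [{"start": start, "end": start + len(entity_text)}]}


example = annotated_example("Order ABC-1234", "ABC-1234")
print(example)  # {'text': 'Order ABC-1234', 'spans': [{'start': 6, 'end': 14}]}
```

The resulting dictionary has the same shape as the item passed to upload_test_data in the example above.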

delete_dataset( dataset_name: str, )¶

Deletes a dataset by name.

Parameters:: dataset_name (str) – The name of the dataset to delete.

delete_model_entity( entity_id: str, ) → None¶

Delete a model-based entity.

Parameters:: entity_id (str) – The entity’s unique identifier

download_redacted_file( job_id: str, generator_default: PiiState | str = PiiState.Redaction, generator_config: Dict[str, PiiState | str] = {}, generator_metadata: Dict[str, BaseMetadata] = {}, random_seed: int | None = None, label_block_lists: Dict[str, List[str]] | None = None, num_retries: int = 6, wait_between_retries: int = 10, custom_entities: List[str] | None = None, ) → bytes¶

Download a redacted file

Parameters:

job_id (str) – The identifier of the redaction job.
generator_default (Union[PiiState, str] = PiiState.Redaction) – The default redaction used for types that are not specified in generator_config. Value must be one of “Redaction”, “Synthesis”, or “Off”.
generator_config (Dict[str, Union[PiiState, str]]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.
generator_metadata (Dict[str, BaseMetadata]) – A dictionary of sensitive data entities. For each entity, indicates generator configuration in case synthesis is selected. Values must be of types appropriate to the PII type.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.
label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored values). When a value for the entity type matches a listed regular expression, the value is ignored and is not redacted or synthesized.
num_retries (int = 6) – The number of times to attempt to download the file. If the file is not yet ready for download, Textual waits for wait_between_retries seconds before it tries again. (The default value is 6)
wait_between_retries (int = 10) – The number of seconds to wait between retry attempts. (The default value is 10)
custom_entities (Optional[List[str]]) – A list of custom entity type identifiers to include. Each custom entity type included here may also be included in the generator config. Custom entity types will respect generator defaults if they are not specified in the generator config.

Returns:

The redacted file as a byte array.

Return type:

bytes
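
To illustrate how num_retries and wait_between_retries interact, here is a minimal sketch of a polling loop. It is not the SDK's implementation. `fetch` is a hypothetical callable that returns the file bytes, or None while the file is not ready.

```python
import time


def download_with_retries(fetch, num_retries=6, wait_between_retries=10, sleep=time.sleep):
    """Call fetch() until it returns data or the attempts are exhausted (sketch)."""
    for attempt in range(num_retries):
        data = fetch()
        if data is not None:
            return data
        # Wait only if another attempt will follow.
        if attempt < num_retries - 1:
            sleep(wait_between_retries)
    raise TimeoutError(f"File not ready after {num_retries} attempts")
```

In this sketch, the default values allow six attempts separated by five waits of 10 seconds each.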

get_all_datasets() → List[Dataset]¶

Gets all of the user’s datasets

Returns:: The list of all datasets
Return type:: List[Dataset]

Examples

>>> datasets = textual.get_all_datasets()

get_dataset( dataset_name: str, ) → Dataset¶

Gets the dataset for the specified dataset name.

Parameters:: dataset_name (str) – The name of the dataset.
Return type:: Dataset

Examples

>>> dataset = textual.get_dataset("llama_2_chatbot_finetune_v5")

get_files( dataset_id: str, ) → List[DatasetFile]¶

Gets all of the files in the dataset.

Returns:: A list of all of the files in the dataset.
Return type:: List[DatasetFile]

get_model_entity( entity_id: str, ) → ModelEntity¶

Get a model-based entity by ID.

Parameters:: entity_id (str) – The entity’s unique identifier
Returns:: The model entity object
Return type:: ModelEntity

list_model_entities() → List[ModelEntity]¶

List all model-based entities.

Returns:: All model entities accessible to the user
Return type:: List[ModelEntity]

redact( string: str, generator_default: PiiState = PiiState.Redaction, generator_config: Dict[str, PiiState | str] = {}, generator_metadata: Dict[str, BaseMetadata] = {}, random_seed: int | None = None, label_block_lists: Dict[str, List[str]] | None = None, label_allow_lists: Dict[str, List[str]] | None = None, record_options: RecordApiRequestOptions = {'record': False, 'retention_time_in_hours': 0, 'tags': []}, custom_entities: List[str] | None = None, ) → RedactionResponse¶

Redacts a string. Depending on the configured handling for each sensitive data type, values are either redacted, synthesized, or ignored.

Parameters:

string (str) – The string to redact.
generator_default (Union[PiiState, str] = PiiState.Redaction) – The default redaction used for types that are not specified in generator_config. Value must be one of “Redaction”, “Synthesis”, or “Off”.
generator_config (Dict[str, Union[PiiState, str]]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.
generator_metadata (Dict[str, BaseMetadata]) – A dictionary of sensitive data entities. For each entity, indicates generator configuration in case synthesis is selected. Values must be of types appropriate to the PII type.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.
label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored values). When a value for an entity type matches a listed regular expression, the value is ignored and is not redacted or synthesized.
label_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, additional values). When a piece of text matches a listed regular expression, the text is marked as the entity type and is included in the redaction or synthesis.
record_options (RecordApiRequestOptions) – Options that control whether to record the API request and results for analysis in the Textual application. By default, the API request is not recorded. When recording is enabled, the retention time must be between 1 and 720 hours (inclusive).
custom_entities (Optional[List[str]]) – A list of custom entity type identifiers to include. Each custom entity type included here may also be included in the generator config. Custom entity types will respect generator defaults if they are not specified in the generator config.

Returns:

The redacted string along with ancillary information.

Return type:

RedactionResponse

Examples

>>> textual.redact(
...     "John Smith is a person",
...     # Redacts NAME_GIVEN and synthesizes CUSTOM_COGNITIVE_ACCESS_KEY; all other types are off
...     generator_config={"NAME_GIVEN": "Redaction", "CUSTOM_COGNITIVE_ACCESS_KEY": "Synthesis"},
...     generator_default="Off",
...     # Occurrences of "There" are treated as NAME_GIVEN entities
...     label_allow_lists={"NAME_GIVEN": ["There"]},
...     # Text matching the regex ` ([a-z]{2}) ` is not treated as an occurrence of NAME_FAMILY
...     label_block_lists={"NAME_FAMILY": [" ([a-z]{2}) "]},
...     # The custom entities passed here will be included in the redaction and may be included in generator_config
...     custom_entities=["CUSTOM_COGNITIVE_ACCESS_KEY", "CUSTOM_PERSONAL_GRAVITY_INDEX"],
... )
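
To see what the block-list pattern from the example matches, it can be tried locally with Python's re module. This only shows the regular expression itself. How Textual applies the pattern to detected values is not specified in this section.

```python
import re

# The pattern from label_block_lists: a space, two lowercase letters, a space.
block_pattern = r" ([a-z]{2}) "
text = "John Smith is a person"

# findall returns the captured group for each match.
print(re.findall(block_pattern, text))  # ['is']
```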

redact_audio_file( audio_file_path: str, output_file_path: str, generator_default: PiiState = PiiState.Redaction, generator_config: Dict[str, PiiState] = {}, label_block_lists: Dict[str, List[str]] | None = None, label_allow_lists: Dict[str, List[str]] | None = None, custom_entities: List[str] | None = None, before_beep_buffer: float = 250.0, after_beep_buffer: float = 250.0, )¶

Generates a redacted audio file by identifying and removing sensitive audio segments.

Parameters:

audio_file_path (str) – The path to the input audio file. Supported file types are wav, mp3, ogg, flv, wma, aac, and others. See https://github.com/jiaaro/pydub for complete information on file types supported.
output_file_path (str) – The path to save the redacted output file. The extension of the output file path determines the audio file type that the output is written as. Supported file types are wav, mp3, ogg, flv, wma, and aac. See https://github.com/jiaaro/pydub for complete information on file types supported.
generator_default (PiiState = PiiState.Redaction) – The default redaction used for types that are not specified in generator_config. Value must be one of “Redaction”, “Synthesis”, or “Off”.
generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.
label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored values). When a value for an entity type matches a listed regular expression, the value is ignored and is not redacted or synthesized.
label_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, additional values). When a piece of text matches a listed regular expression, the text is marked as the entity type and is included in the redaction or synthesis.
custom_entities (Optional[List[str]]) – A list of custom entity type identifiers to include. Each custom entity type included here may also be included in the generator config. Custom entity types will respect generator defaults if they are not specified in the generator config.
before_beep_buffer (float, optional) – Buffer time (in milliseconds) to include before redaction interval (default is 250.0).
after_beep_buffer (float, optional) – Buffer time (in milliseconds) to include after redaction interval (default is 250.0).

Returns:

The path to the redacted output audio file.

Return type:

str
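
One plausible reading of before_beep_buffer and after_beep_buffer is that each redaction interval is widened by the two buffers. The sketch below illustrates that reading only. It is an assumption, not the SDK's implementation, and `pad_interval` is a hypothetical helper.

```python
def pad_interval(start_ms, end_ms, before_beep_buffer=250.0, after_beep_buffer=250.0, duration_ms=None):
    """Widen [start_ms, end_ms] by the two buffers, clamped to the clip (assumed semantics)."""
    padded_start = max(0.0, start_ms - before_beep_buffer)
    padded_end = end_ms + after_beep_buffer
    # Clamp to the end of the clip when its duration is known.
    if duration_ms is not None:
        padded_end = min(float(duration_ms), padded_end)
    return padded_start, padded_end


print(pad_interval(1000.0, 1800.0))  # (750.0, 2050.0)
```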

redact_bulk( strings: List[str], generator_default: PiiState | str = PiiState.Redaction, generator_config: Dict[str, PiiState | str] = {}, generator_metadata: Dict[str, BaseMetadata] = {}, random_seed: int | None = None, label_block_lists: Dict[str, List[str]] | None = None, label_allow_lists: Dict[str, List[str]] | None = None, custom_entities: List[str] | None = None, ) → BulkRedactionResponse¶

Redacts a list of strings. Depending on the configured handling for each sensitive data type, values are either redacted, synthesized, or ignored.

Parameters:

strings (List[str]) – The array of strings to redact.
generator_default (Union[PiiState, str] = PiiState.Redaction) – The default redaction used for types that are not specified in generator_config. Value must be one of “Redaction”, “Synthesis”, or “Off”.
generator_config (Dict[str, Union[PiiState, str]]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.
generator_metadata (Dict[str, BaseMetadata]) – A dictionary of sensitive data entities. For each entity, indicates generator configuration in case synthesis is selected. Values must be of types appropriate to the PII type.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.
label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored values). When a value for an entity type matches a listed regular expression, the value is ignored and is not redacted or synthesized.
label_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, additional values). When a piece of text matches a listed regular expression, the text is marked as the entity type and is included in the redaction or synthesis.
custom_entities (Optional[List[str]]) – A list of custom entity type identifiers to include. Each custom entity type included here may also be included in the generator config. Custom entity types will respect generator defaults if they are not specified in the generator config.

Returns:

The redacted strings along with ancillary information.

Return type:

BulkRedactionResponse

Examples

>>> textual.redact_bulk(
...     ["John Smith is a person", "I live in Atlanta"],
...     # Redacts NAME_GIVEN and synthesizes CUSTOM_COGNITIVE_ACCESS_KEY; all other types are off
...     generator_config={"NAME_GIVEN": "Redaction", "CUSTOM_COGNITIVE_ACCESS_KEY": "Synthesis"},
...     generator_default="Off",
...     # Occurrences of "There" are treated as NAME_GIVEN entities
...     label_allow_lists={"NAME_GIVEN": ["There"]},
...     # Text matching the regex ` ([a-z]{2}) ` is not treated as an occurrence of NAME_FAMILY
...     label_block_lists={"NAME_FAMILY": [" ([a-z]{2}) "]},
...     # The custom entities passed here will be included in the redaction and may be included in generator_config
...     custom_entities=["CUSTOM_COGNITIVE_ACCESS_KEY", "CUSTOM_PERSONAL_GRAVITY_INDEX"],
... )

redact_html( html_data: str, generator_default: PiiState | str = PiiState.Redaction, generator_config: Dict[str, PiiState | str] = {}, generator_metadata: Dict[str, BaseMetadata] = {}, random_seed: int | None = None, label_block_lists: Dict[str, List[str]] | None = None, label_allow_lists: Dict[str, List[str]] | None = None, custom_entities: List[str] | None = None, record_options: RecordApiRequestOptions = {'record': False, 'retention_time_in_hours': 0, 'tags': []}, ) → RedactionResponse¶

Redacts the values in an HTML blob. Depending on the configured handling for each entity type, values are either redacted, synthesized, or ignored.

Parameters:

html_data (str) – The HTML for which to redact values.
generator_default (Union[PiiState, str] = PiiState.Redaction) – The default redaction used for types that are not specified in generator_config. Value must be one of “Redaction”, “Synthesis”, or “Off”.
generator_config (Dict[str, Union[PiiState, str]]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.
generator_metadata (Dict[str, BaseMetadata]) – A dictionary of sensitive data entities. For each entity, indicates generator configuration in case synthesis is selected. Values must be of types appropriate to the PII type.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.
label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored values). The ignored values are regular expressions. When a value for the entity type matches a listed regular expression, the value is ignored and is not redacted or synthesized.
label_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, additional values). The additional values are regular expressions. When a piece of text matches a listed regular expression, the text is marked as the entity type and is included in the redaction or synthesis.
custom_entities (Optional[List[str]]) – A list of custom entity type identifiers to include. Each custom entity type included here may also be included in the generator config. Custom entity types will respect generator defaults if they are not specified in the generator config.
record_options (RecordApiRequestOptions) – Options that control whether to record the API request and results for analysis in the Textual application. By default, the API request is not recorded. When recording is enabled, the retention time must be between 1 and 720 hours (inclusive).

Returns:

The redacted string plus additional information.

Return type:

RedactionResponse

redact_json( json_data: str | dict, generator_default: PiiState | str = PiiState.Redaction, generator_config: Dict[str, PiiState | str] = {}, generator_metadata: Dict[str, BaseMetadata] = {}, random_seed: int | None = None, label_block_lists: Dict[str, List[str]] | None = None, label_allow_lists: Dict[str, List[str]] | None = None, jsonpath_allow_lists: Dict[str, List[str]] | None = None, json_path_ignore_paths: List[str] | None = None, custom_entities: List[str] | None = None, ) → RedactionResponse¶

Redacts the values in a JSON blob. Depending on the configured handling for each sensitive data type, values are either redacted, synthesized, or ignored.

Parameters:

json_data (Union[str, dict]) – The JSON for which to redact values. This can be either a JSON string or a Python dictionary.
generator_default (Union[PiiState, str] = PiiState.Redaction) – The default redaction used for types that are not specified in generator_config. Value must be one of “Redaction”, “Synthesis”, or “Off”.
generator_config (Dict[str, Union[PiiState, str]]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.
generator_metadata (Dict[str, BaseMetadata]) – A dictionary of sensitive data entities. For each entity, indicates generator configuration in case synthesis is selected. Values must be of types appropriate to the PII type.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.
label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored values). When a value for the entity type matches a listed regular expression, the value is ignored and is not redacted or synthesized.
label_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, additional values). When a piece of text matches a listed regular expression, the text is marked as the entity type and is included in the redaction or synthesis.
jsonpath_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, path expression). When an element in the JSON document matches the JSON path expression, the entire text value is treated as the specified entity type. Only supported for path expressions that point to JSON primitive values. This setting overrides any results found by the NER model or in label allow and block lists. If multiple path expressions point to the same JSON node, but specify different entity types, then the value is redacted as one of those types. However, the chosen type is selected at random - it could use any of the types.
json_path_ignore_paths (Optional[List[str]]) – Optional list of JSONPath expressions for values that should not be redacted. Any JSON element matching these paths will be left unchanged in the output.
custom_entities (Optional[List[str]]) – A list of custom entity type identifiers to include. Each custom entity type included here may also be included in the generator config. Custom entity types will respect generator defaults if they are not specified in the generator config.

Returns:

The redacted string along with ancillary information.

Return type:

RedactionResponse
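
The path-based parameters can be combined in one call. The sketch below wraps redact_json in a small function so that the argument shapes are visible. `redact_customer_record`, the JSON layout, and the two path expressions are illustrative assumptions, and `client` stands for a TextualNer instance.

```python
def redact_customer_record(client, record):
    """Redact a JSON-like dict through client.redact_json (sketch)."""
    return client.redact_json(
        record,
        generator_default="Off",
        generator_config={"NAME_GIVEN": "Redaction"},
        # Treat the entire value at this path as NAME_GIVEN.
        jsonpath_allow_lists={"NAME_GIVEN": ["$.customer.first_name"]},
        # Leave the value at this path unchanged.
        json_path_ignore_paths=["$.customer.id"],
    )


# Usage, with a configured client:
#   response = redact_customer_record(textual, {"customer": {"id": "42", "first_name": "Adam"}})
```

As the parameter description notes, jsonpath_allow_lists is only supported for paths that point to JSON primitive values.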

redact_structured( values: List[str], pii_type: str, generator_metadata: BaseMetadata | None = None, random_seed: int | None = None, ) → List[str]¶

Synthesizes a column of structured values for a given entity type. Unlike redact/redact_bulk, this does not perform PII detection — every value is treated as the specified entity type and replaced accordingly.

Parameters:

values (List[str]) – The column of values to synthesize.
pii_type (str) – The entity type label to apply to every value (e.g. “EMAIL_ADDRESS”, “PHONE_NUMBER”, “NAME_GIVEN”).
generator_metadata (Optional[BaseMetadata] = None) – Optional generator metadata to control synthesis behavior for the given entity type. For example, an EmailGeneratorMetadata with preserve_domain=True. If not provided, the server default is used.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.

Returns:

The synthesized values, in the same order as the input.

Return type:

List[str]

Examples

>>> from tonic_textual.classes.generator_metadata.email_generator_metadata import EmailGeneratorMetadata
>>> textual.redact_structured(
...     ["john@example.com", "jane@company.org"],
...     pii_type="EMAIL_ADDRESS",
...     generator_metadata=EmailGeneratorMetadata(preserve_domain=True),
... )

redact_xml( xml_data: str, generator_default: PiiState | str = PiiState.Redaction, generator_config: Dict[str, PiiState | str] = {}, generator_metadata: Dict[str, BaseMetadata] = {}, random_seed: int | None = None, label_block_lists: Dict[str, List[str]] | None = None, label_allow_lists: Dict[str, List[str]] | None = None, custom_entities: List[str] | None = None, ) → RedactionResponse¶

Redacts the values in an XML blob. Depending on the configured handling for each entity type, values are either redacted, synthesized, or ignored.

Parameters:

xml_data (str) – The XML for which to redact values.
generator_default (Union[PiiState, str] = PiiState.Redaction) – The default redaction used for types that are not specified in generator_config. Value must be one of “Redaction”, “Synthesis”, or “Off”.
generator_config (Dict[str, Union[PiiState, str]]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.
generator_metadata (Dict[str, BaseMetadata]) – A dictionary of sensitive data entities. For each entity, indicates generator configuration in case synthesis is selected. Values must be of types appropriate to the PII type.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.
label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored values). When a value for the entity type matches a listed regular expression, the value is ignored and is not redacted or synthesized.
label_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, additional values). When a piece of text matches a listed regular expression, the text is marked as the entity type and is included in the redaction or synthesis.
custom_entities (Optional[List[str]]) – A list of custom entity type identifiers to include. Each custom entity type included here may also be included in the generator config. Custom entity types will respect generator defaults if they are not specified in the generator config.

Returns:

The redacted string plus additional information.

Return type:

RedactionResponse

send_redact_bulk_request( endpoint: str, payload: Dict, random_seed: int | None = None, ) → BulkRedactionResponse¶: Helper function to send redact requests, handle responses, and catch errors.

send_redact_request( endpoint: str, payload: Dict, random_seed: int | None = None, ) → RedactionResponse¶: Helper function to send redact requests, handle responses, and catch errors.

start_file_redaction( file: IOBase, file_name: str, custom_entities: List[str] | None = None, ) → str¶

Uploads a provided file and starts a redaction job for it.

Parameters:

file (io.IOBase) – The opened file, available for reading, to upload and redact.
file_name (str) – The name of the file.
custom_entities (Optional[List[str]]) – A list of custom entity type identifiers to include. Each custom entity type included here may also be included in the generator config. Custom entity types will respect generator defaults if they are not specified in the generator config.

Returns:

The job identifier, which can be used to download the redacted file when it is ready.

Return type:

str
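
The returned job identifier is the input to download_redacted_file. A minimal sketch of the two calls together follows. `redact_file` is a hypothetical helper, `client` stands for a TextualNer instance, and the paths are placeholders.

```python
import os


def redact_file(client, input_path, output_path):
    """Upload a file for redaction, then save the redacted bytes (sketch)."""
    # Open in binary mode so that any file type can be uploaded.
    with open(input_path, "rb") as source:
        job_id = client.start_file_redaction(source, os.path.basename(input_path))
    # download_redacted_file retries while the job is still running.
    redacted = client.download_redacted_file(job_id)
    with open(output_path, "wb") as target:
        target.write(redacted)
    return job_id


# Usage, with a configured client:
#   redact_file(textual, "notes.pdf", "notes_redacted.pdf")
```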

unredact( redacted_string: str, random_seed: int | None = None, ) → str¶

Removes the redaction from a provided string. Returns the string with the original values.

Parameters:

redacted_string (str) – The redacted string from which to remove the redaction.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.

Returns:

The string with the redaction removed.

Return type:

str

unredact_bulk( redacted_strings: List[str], random_seed: int | None = None, ) → List[str]¶

Removes redaction from a list of strings. Returns the strings with the original values.

Parameters:

redacted_strings (List[str]) – The list of redacted strings from which to remove the redaction.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.

Returns:

The list of strings with the redaction removed.

Return type:

List[str]

class tonic_textual.classes.record_api_request_options.RecordApiRequestOptions( record: bool, retention_time_in_hours: int, tags: List[str] = [], )¶

Class to denote whether to record an API request.

Parameters:

record (bool) – Whether to record the request.
retention_time_in_hours (int) – The number of hours to store the request. The request is then purged automatically.
tags (List[str]) – A list of tags to assign to the request. Used to help search for the request on the API Explorer page. The default is the empty list [], which corresponds to assigning no tags to the request.

Redaction response¶

class tonic_textual.classes.redact_api_responses.redaction_response.RedactionResponse( original_text: str, redacted_text: str, usage: int, de_identify_results: List[Replacement], )¶

Redaction response object

Variables:

original_text (str) – The original text.
redacted_text (str) – The redacted and synthesized text.
usage (int) – The number of words used
de_identify_results (List[Replacement]) – The list of named entities that were found in original_text.

class tonic_textual.classes.common_api_responses.replacement.Replacement( start: int, end: int, new_start: int, new_end: int, label: str, text: str, score: float, language: str, new_text: str | None = None, example_redaction: str | None = None, json_path: str | None = None, xml_path: str | None = None, )¶

A span of text that was detected as a named entity.

Variables:

start (int) – The start index of the entity in the original text.
end (int) – The end index of the entity in the original text. The end index is exclusive.
new_start (int) – The start index of the entity in the redacted/synthesized text.
new_end (int) – The end index of the entity in the redacted/synthesized text. The end index is exclusive.
python_start (Optional[int]) – The start index in Python (if different from start).
python_end (Optional[int]) – The end index in Python (if different from end).
label (str) – The label of the entity.
text (str) – The substring of the original text that was detected as an entity.
new_text (Optional[str]) – The new text to replace the original entity.
score (float) – The confidence score of the detection.
language (str) – The language of the entity.
example_redaction (Optional[str]) – An example redaction for the entity.
json_path (Optional[str]) – The JSON path of the entity in the original JSON document. This is only present if the input text was a JSON document.
xml_path (Optional[str]) – The xpath of the entity in the original XML document. This is only present if the input text was an XML document. NOTE: Arrays in xpath are 1-based.
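
Because end indexes are exclusive, a Replacement maps directly onto a Python slice. The sketch below prefers python_start and python_end when they are set and otherwise falls back to start and end. `original_span` is a hypothetical helper, and the SimpleNamespace object is a stand-in for a real Replacement.

```python
from types import SimpleNamespace


def original_span(original_text, replacement):
    """Return the slice of original_text that the replacement covers (sketch)."""
    start = getattr(replacement, "python_start", None)
    end = getattr(replacement, "python_end", None)
    # Fall back to start/end when no Python-specific index is set.
    if start is None:
        start = replacement.start
    if end is None:
        end = replacement.end
    return original_text[start:end]


# Stand-in for a Replacement returned in de_identify_results.
sample = SimpleNamespace(start=5, end=10, label="NAME_FAMILY", text="Smith")
print(original_span("John Smith is a person", sample))  # Smith
```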

Dataset entity mappings response¶

class tonic_textual.classes.common_api_responses.dataset_entity_mappings_response.DatasetEntityMappingsResponse( files: List[DatasetFileEntityMappings], )¶

Entity mappings for a dataset, grouped by file.

Variables:: files (List[DatasetFileEntityMappings]) – The entity mappings for the dataset, grouped by file.

class tonic_textual.classes.common_api_responses.dataset_file_entity_mappings.DatasetFileEntityMappings( file_id: str, file_name: str, entities: List[EntityMapping], )¶

The entity mappings detected for a single dataset file.

Variables:

file_id (str) – The identifier of the dataset file.
file_name (str) – The file name shown in the dataset.
entities (List[EntityMapping]) – The entity mappings detected for this file after the dataset generator configuration is applied.

class tonic_textual.classes.common_api_responses.entity_mapping.EntityMapping( label: str, text: str, redacted_text: str | None = None, synthetic_text: str | None = None, applied_generator_state: str | None = None, output_text: str | None = None, row_number: int | None = None, column_index: int | None = None, score: float | None = None, )¶

An entity detected in a dataset file and the values it maps to in output.

Variables:

label (str) – The entity label detected in the dataset file.
text (str) – The original text value that was detected as the entity.
redacted_text (Optional[str]) – The redacted token that would replace the original value when the entity is configured for redaction.
synthetic_text (Optional[str]) – The synthetic value that would replace the original value when the entity is configured for synthesis.
applied_generator_state (Optional[str]) – The dataset generator state currently applied to this entity type, such as redaction or synthesis.
output_text (Optional[str]) – The final value that would appear in generated dataset output after the current generator configuration is applied.
row_number (Optional[int]) – The 1-based row number for tabular files, when available.
column_index (Optional[int]) – The 0-based column index for tabular files, when available.
score (Optional[float]) – The confidence score for the detected entity, when available.
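
Because row_number is 1-based and column_index is 0-based, the two need different handling when they are used to index Python lists. The sketch below assumes that row 1 corresponds to the first element of `rows`. Whether a header row is counted is not stated in this section. `cell_for_mapping` is a hypothetical helper.

```python
from types import SimpleNamespace


def cell_for_mapping(rows, mapping):
    """Look up the table cell that an EntityMapping points to (sketch)."""
    # Both fields are optional and may be absent for non-tabular files.
    if mapping.row_number is None or mapping.column_index is None:
        return None
    # row_number is 1-based; column_index is already 0-based.
    return rows[mapping.row_number - 1][mapping.column_index]


rows = [["Adam", "adam@example.com"], ["Jane", "jane@example.com"]]
# Stand-in for an EntityMapping.
mapping = SimpleNamespace(label="EMAIL_ADDRESS", row_number=2, column_index=1)
print(cell_for_mapping(rows, mapping))  # jane@example.com
```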

Helper classes¶

class tonic_textual.helpers.replace_text_helper.ReplaceTextHelper¶: A helper class for modifying synthetic values returned from redaction calls

class tonic_textual.helpers.json_conversation_helper.JsonConversationHelper¶

A helper class for processing generic chat data and transcribed audio, where the conversation is broken down into pieces and represented in JSON.

For example:

{
    "conversations": [
        {"role":"customer", "text": "Hi, this is Adam"},
        {"role":"agent", "text": "Hi Adam, nice to meet you this is Jane."}
    ]
}

redact( conversation: dict, items_getter: Callable[[dict], list], text_getter: Callable[[Any], list], redact_func: Callable[[str], RedactionResponse], join_char: str | None = '\n', ) → List[RedactionResponse]¶

Redacts a conversation.

Parameters:

conversation (dict) – The Python dictionary, loaded from JSON, that contains the text parts of the conversation.
items_getter (Callable[[dict], list]) –

A function that retrieves the array of conversation items. For example, if the conversation is represented in JSON as:
```
{
    "conversations": [
        {"role":"customer", "text": "Hi, this is Adam"},
        {"role":"agent", "text": "Hi Adam, nice to meet you this is Jane."}
    ]
}
```
Then items_getter would be defined as lambda x: x["conversations"]
text_getter (Callable[[dict], str]) –

A function to retrieve the text from a given item returned by the items_getter. For example, if the items_getter returns a list of objects such as:
```
{"role":"customer", "text": "Hi, this is Adam"}
```
Then the text_getter would be defined as lambda x: x["text"]
redact_func (Callable[[str], RedactionResponse]) – The function that makes the Textual redaction call. This should wrap a call to TextualNer.redact, such as lambda x: ner.redact(x).
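
As a rough illustration of how the two getters fit together, the sketch below applies them to the sample conversation without calling Textual. The collect_text_parts function is a hypothetical name used only for illustration, and joining the parts with join_char before redaction is an assumption about how the helper uses that parameter.

```python
from typing import Any, Callable, List

conversation = {
    "conversations": [
        {"role": "customer", "text": "Hi, this is Adam"},
        {"role": "agent", "text": "Hi Adam, nice to meet you this is Jane."},
    ]
}

# Returns the list of conversation items.
items_getter: Callable[[dict], list] = lambda x: x["conversations"]
# Returns the text of a single conversation item.
text_getter: Callable[[Any], str] = lambda x: x["text"]


def collect_text_parts(conv: dict) -> List[str]:
    """Apply text_getter to every item returned by items_getter."""
    return [text_getter(item) for item in items_getter(conv)]


text_parts = collect_text_parts(conversation)
# Assumption: the parts are joined with join_char (default "\n") before redaction.
joined = "\n".join(text_parts)
```

With a real client, the same getters would be passed to the helper, for example JsonConversationHelper().redact(conversation, items_getter, text_getter, lambda x: ner.redact(x)), where ner is a TextualNer instance.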

Generator metadata¶

class tonic_textual.classes.generator_metadata.base_metadata.BaseMetadata( custom_generator: GeneratorType | None = None, generator_version: GeneratorVersion = GeneratorVersion.V1, swaps: Dict[str, str] | None = {}, constant_value: str | None = None, )¶

Base class for all generator metadata configurations.

Provides common parameters shared by all metadata types. You typically do not instantiate this class directly. Instead, use a specific metadata subclass such as NameGeneratorMetadata or EmailGeneratorMetadata.

Parameters:

custom_generator (GeneratorType, optional) – The generator type. Set automatically by subclasses.
generator_version (GeneratorVersion) – The generator version to use. Default is V1.
swaps (dict of str to str, optional) – A dictionary of explicit replacement mappings. When a detected value matches a key in the dictionary, the corresponding value is used as the synthesized replacement instead of a generated one.
constant_value (str, optional) – A constant string to use as the replacement. It applies only when it is not None and the detected value has no matching key in swaps.
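
The precedence between swaps and constant_value described above can be sketched in plain Python. This is not the library's implementation: choose_replacement and fake_generate are hypothetical names, and exact string matching against the swaps keys is an assumption.

```python
from typing import Callable, Dict, Optional


def choose_replacement(
    detected: str,
    swaps: Optional[Dict[str, str]],
    constant_value: Optional[str],
    generate: Callable[[str], str],
) -> str:
    """Pick a replacement: swaps first, then constant_value, then a generated value."""
    # Assumption: a swap applies only on an exact match with a dictionary key.
    if swaps and detected in swaps:
        return swaps[detected]
    if constant_value is not None:
        return constant_value
    return generate(detected)


# Hypothetical stand-in for a generator; it does not reflect Textual's output.
fake_generate = lambda value: "<generated>"

from_swap = choose_replacement("Adam", {"Adam": "Alan"}, "PERSON", fake_generate)
from_constant = choose_replacement("Jane", {"Adam": "Alan"}, "PERSON", fake_generate)
from_generator = choose_replacement("Jane", {"Adam": "Alan"}, None, fake_generate)
```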

class tonic_textual.classes.generator_metadata.base_date_time_generator_metadata.BaseDateTimeGeneratorMetadata( custom_generator: GeneratorType | None = None, generator_version: GeneratorVersion = GeneratorVersion.V1, scramble_unrecognized_dates: bool = True, swaps: Dict[str, str] | None = {}, constant_value: str | None = None, )¶

Base class for date and time related generator metadata.

Extends BaseMetadata with a common date/time parameter. You typically do not instantiate this class directly. Instead, use DateTimeGeneratorMetadata or PersonAgeGeneratorMetadata.

Parameters:: scramble_unrecognized_dates (bool) – When True, dates that Textual cannot parse into a standard format are scrambled. When False, unrecognized dates are left unchanged. Default is True.

class tonic_textual.classes.generator_metadata.name_generator_metadata.NameGeneratorMetadata( generator_version: GeneratorVersion = GeneratorVersion.V1, is_consistency_case_sensitive: bool = False, preserve_gender: bool = False, swaps: Dict[str, str] | None = {}, constant_value: str | None = None, )¶

Metadata configuration for name synthesis.

Controls how synthesized names are generated for entity types such as NAME_GIVEN and NAME_FAMILY.

Parameters:

is_consistency_case_sensitive (bool) – When True, name consistency is case-sensitive. For example, "john" and "John" are treated as different names and might receive different replacements. Default is False.
preserve_gender (bool) – When True, the synthesized name preserves the gender of the original name. Male names are replaced with male names, and female names are replaced with female names. Default is False.
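
One way to picture is_consistency_case_sensitive is as a choice of lookup key for consistent replacements. The sketch below is an illustration under that assumption, not Textual's actual mechanism, and consistency_key is a hypothetical helper.

```python
def consistency_key(name: str, is_consistency_case_sensitive: bool) -> str:
    """Return the key under which a name's replacement would be remembered."""
    # When case-insensitive, "john" and "John" collapse to the same key.
    return name if is_consistency_case_sensitive else name.casefold()


same_when_insensitive = consistency_key("john", False) == consistency_key("John", False)
same_when_sensitive = consistency_key("john", True) == consistency_key("John", True)
```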

class tonic_textual.classes.generator_metadata.email_generator_metadata.EmailGeneratorMetadata( generator_version: GeneratorVersion = GeneratorVersion.V1, preserve_domain: bool = False, swaps: Dict[str, str] | None = {}, constant_value: str | None = None, )¶

Metadata configuration for email address synthesis.

Controls how synthesized email addresses are generated for the EMAIL_ADDRESS entity type.

Parameters:: preserve_domain (bool) – When True, the domain portion of the email address is kept intact. For example, "john@example.com" might become "alan@example.com". Default is False.
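
The effect of preserve_domain can be sketched with ordinary string handling. The swap_local_part function is hypothetical, and the new local part is supplied by hand here instead of being generated.

```python
def swap_local_part(email: str, new_local_part: str) -> str:
    """Replace the part before the last '@' and keep the domain unchanged."""
    _, _, domain = email.rpartition("@")
    return f"{new_local_part}@{domain}"


preserved = swap_local_part("john@example.com", "alan")
```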

class tonic_textual.classes.generator_metadata.phone_number_generator_metadata.PhoneNumberGeneratorMetadata( generator_version: GeneratorVersion = GeneratorVersion.V1, use_us_phone_number_generator: bool = False, replace_invalid_numbers: bool = True, preserve_us_area_code: bool = False, swaps: Dict[str, str] | None = {}, constant_value: str | None = None, )¶

Metadata configuration for phone number synthesis.

Controls how synthesized telephone numbers are generated for the PHONE_NUMBER entity type.

Parameters:

use_us_phone_number_generator (bool) – When True, generated telephone numbers use a US phone number format. Default is False.
replace_invalid_numbers (bool) – When True, phone numbers that are detected but are not valid phone numbers are replaced with synthesized values. Default is True.
preserve_us_area_code (bool) – When True and use_us_phone_number_generator is also True, the area code of the original phone number is preserved in the synthesized value. Default is False.

class tonic_textual.classes.generator_metadata.date_time_generator_metadata.DateTimeGeneratorMetadata( generator_version: GeneratorVersion = GeneratorVersion.V1, scramble_unrecognized_dates: bool = True, additional_date_formats: List[str] = [], apply_constant_shift_to_document: bool = False, use_clear_date_and_passthrough_or_group_year_generator: bool = False, metadata: TimestampShiftMetadata = None, swaps: Dict[str, str] | None = {}, constant_value: str | None = None, )¶

Metadata configuration for date and time synthesis.

Controls how synthesized date and time values are generated for the DATE_TIME entity type. Dates are shifted by a random number of days within a configurable range.

Parameters:

scramble_unrecognized_dates (bool) – When True, dates that Textual cannot parse into a standard format are scrambled. Default is True.
additional_date_formats (list of str) – A list of additional date format patterns that Textual should recognize. Use Python strftime/strptime format codes. Default is an empty list.
apply_constant_shift_to_document (bool) – When True, all dates within the same document are shifted by the same random offset. This preserves relative time differences between dates. Default is False.
use_clear_date_and_passthrough_or_group_year_generator (bool) – When True, sets the date to January 1. If the year is less than 90 years ago, the year is passed through. Otherwise, the year is set to the current year minus 90. When False, this option has no effect. Default is False.
metadata (TimestampShiftMetadata) – Configuration for the date shift range. By default dates shift by -7 to +7 days.
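
The rule behind use_clear_date_and_passthrough_or_group_year_generator can be sketched as follows. This is a minimal illustration of the description above, not the generator itself. The clear_date_group_year function is hypothetical, and today is passed in explicitly so that the result is reproducible.

```python
from datetime import date


def clear_date_group_year(original: date, today: date) -> date:
    """Reset the date to January 1 and cap how far back the year can go."""
    oldest_year = today.year - 90
    # Years more recent than the cutoff pass through; older years are grouped.
    year = original.year if original.year > oldest_year else oldest_year
    return date(year, 1, 1)


today = date(2025, 6, 15)
recent = clear_date_group_year(date(1980, 8, 23), today)
old = clear_date_group_year(date(1920, 2, 2), today)
```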

class tonic_textual.classes.generator_metadata.timestamp_shift_metadata.TimestampShiftMetadata( left_shift_in_days: int | None = -7, right_shift_in_days: int | None = 7, time_stamp_shift_in_days: int | None = None, swaps: Dict[str, str] | None = {}, constant_value: str | None = None, )¶

Configuration for the date shift range used by DateTimeGeneratorMetadata.

Defines the range of days by which dates can be shifted. The actual shift for each date is randomly chosen within the specified range.

Parameters:

left_shift_in_days (int, optional) – The minimum (leftmost) shift in days. Use a negative value to shift dates into the past. Default is -7.
right_shift_in_days (int, optional) – The maximum (rightmost) shift in days. Use a positive value to shift dates into the future. Default is 7.
time_stamp_shift_in_days (int, optional) – Deprecated. Use left_shift_in_days and right_shift_in_days instead.
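
The shift range, together with the apply_constant_shift_to_document option of DateTimeGeneratorMetadata, can be sketched in plain Python. The shift_dates function is hypothetical, and the inclusive bounds and uniform choice of offset are assumptions.

```python
import random
from datetime import date, timedelta
from typing import List


def shift_dates(
    dates: List[date],
    rng: random.Random,
    left_shift_in_days: int = -7,
    right_shift_in_days: int = 7,
    constant_shift: bool = False,
) -> List[date]:
    """Shift dates by random offsets drawn from the day range."""
    if constant_shift:
        # One offset for the whole document keeps the gaps between dates intact.
        offset = rng.randint(left_shift_in_days, right_shift_in_days)
        return [d + timedelta(days=offset) for d in dates]
    # Otherwise each date gets its own independently chosen offset.
    return [
        d + timedelta(days=rng.randint(left_shift_in_days, right_shift_in_days))
        for d in dates
    ]


admitted, discharged = date(2024, 3, 1), date(2024, 3, 11)
shifted = shift_dates([admitted, discharged], random.Random(0), constant_shift=True)
gap_in_days = (shifted[1] - shifted[0]).days
first_offset = (shifted[0] - admitted).days
```

With a constant shift, the ten-day gap between the two dates survives, which is the property that apply_constant_shift_to_document is described as preserving.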

class tonic_textual.classes.generator_metadata.person_age_generator_metadata.PersonAgeGeneratorMetadata( generator_version: GeneratorVersion = GeneratorVersion.V1, scramble_unrecognized_dates: bool = True, metadata: AgeShiftMetadata = None, swaps: Dict[str, str] | None = {}, constant_value: str | None = None, use_passthrough_or_group_age_generator: bool = False, )¶

Metadata configuration for person age synthesis.

Controls how synthesized ages are generated for the PERSON_AGE entity type. Ages are shifted by a configurable number of years.

Parameters:

scramble_unrecognized_dates (bool) – When True, dates that Textual cannot parse into a standard format are scrambled. Default is True.
metadata (AgeShiftMetadata) – Configuration for the age shift amount. By default, ages shift by 7 years.
use_passthrough_or_group_age_generator (bool) – When True, ages of 89 or under are passed through, and all other ages are changed to "90+". Default is False.
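
The grouping rule for use_passthrough_or_group_age_generator amounts to a single threshold check. A minimal sketch, with passthrough_or_group_age as a hypothetical name and integer ages assumed:

```python
def passthrough_or_group_age(age: int) -> str:
    """Pass through ages of 89 or under and group everything else as '90+'."""
    return str(age) if age <= 89 else "90+"


grouped = [passthrough_or_group_age(age) for age in (42, 89, 90, 103)]
```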

class tonic_textual.classes.generator_metadata.age_shift_metadata.AgeShiftMetadata( age_shift_in_years: int = 7, )¶

Configuration for the age shift amount used by PersonAgeGeneratorMetadata.

Defines how many years to shift detected ages by.

Parameters:: age_shift_in_years (int) – The number of years to shift the age. Default is 7.

class tonic_textual.classes.generator_metadata.hipaa_address_generator_metadata.HipaaAddressGeneratorMetadata( generator_version: GeneratorVersion = GeneratorVersion.V1, use_non_hipaa_address_generator: bool = False, replace_truncated_zeros_in_zip_code: bool = True, realistic_synthetic_values: bool = True, swaps: Dict[str, str] | None = {}, constant_value: str | None = None, use_three_digit_zips: bool = False, replace_foreign_zip_codes_with_zeros: bool = False, )¶

Metadata configuration for HIPAA-compliant address synthesis.

Controls how synthesized addresses are generated for location entity types such as LOCATION_ADDRESS and LOCATION_ZIP. By default, address synthesis follows HIPAA Safe Harbor de-identification rules.

Parameters:

use_non_hipaa_address_generator (bool) – When True, uses a non-HIPAA-compliant address generator that may produce more realistic addresses, but does not guarantee HIPAA Safe Harbor compliance. Default is False.
replace_truncated_zeros_in_zip_code (bool) – When True, for ZIP codes that are truncated to three digits (per HIPAA Safe Harbor), the removed digits are replaced with zeros. Default is True.
realistic_synthetic_values (bool) – When True, generates realistic-looking synthetic address values. Default is True.
use_three_digit_zips (bool) – When True, ZIP codes are always truncated to three digits. Default is False.
replace_foreign_zip_codes_with_zeros (bool) – When True, foreign ZIP codes are replaced with all zeros. Default is False.
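
The interaction between three-digit truncation and replace_truncated_zeros_in_zip_code can be sketched as follows. The truncate_zip function is hypothetical. Returning only the three-digit prefix when the flag is False is an assumption, and the sketch does not decide which ZIP codes Safe Harbor requires to be truncated.

```python
def truncate_zip(zip_code: str, replace_truncated_zeros: bool = True) -> str:
    """Keep the first three digits of a ZIP code."""
    prefix = zip_code[:3]
    if replace_truncated_zeros:
        # Pad back to the original length with zeros.
        return prefix + "0" * (len(zip_code) - 3)
    # Assumption: without padding, only the three-digit prefix remains.
    return prefix


padded = truncate_zip("30305")
bare = truncate_zip("30305", replace_truncated_zeros=False)
```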

class tonic_textual.classes.generator_metadata.numeric_value_generator_metadata.NumericValueGeneratorMetadata( generator_version: GeneratorVersion = GeneratorVersion.V1, use_oracle_integer_pk_generator: bool = False, swaps: Dict[str, str] | None = {}, constant_value: str | None = None, )¶

Metadata configuration for numeric value synthesis.

Controls how synthesized numeric values are generated for the NUMERIC_VALUE entity type.

Parameters:: use_oracle_integer_pk_generator (bool) – When True, uses a generator designed for Oracle integer primary keys. Default is False.