NER API documentation
TextualNer class
- class tonic_textual.redact_api.TextualNer(
- base_url: str = 'https://textual.tonic.ai',
- api_key: str | None = None,
- verify: bool = True,
Wrapper class to invoke the Tonic Textual API
- Parameters:
base_url (str) – The URL to your Tonic Textual instance. Do not include a trailing slash. The default value is https://textual.tonic.ai.
api_key (str) – Optional. Your API key. Instead of providing the API key here, we recommend that you set it in your environment as the value of TONIC_TEXTUAL_API_KEY.
verify (bool) – Whether to verify SSL certificates. By default, this is enabled.
Examples
>>> from tonic_textual.redact_api import TextualNer
>>> textual = TextualNer()
- create_dataset(
- dataset_name: str,
Creates a dataset. A dataset is a collection of 1 or more files for Tonic Textual to scan and redact.
- Parameters:
dataset_name (str) – The name of the dataset. Dataset names must be unique.
- Returns:
The newly created dataset.
- Return type:
Dataset
- Raises:
DatasetNameAlreadyExists – Raised if a dataset with the same name already exists.
- create_model_entity(
- name: str,
- guidelines: str,
- display_name: str | None = None,
Create a new model-based custom entity.
Model-based entities use ML models trained on your data to detect custom entity types. The workflow is:
1. Create the entity with initial guidelines.
2. Upload test data with ground truth annotations.
3. Iterate on the guidelines based on LLM suggestions.
4. Train a model on the annotated data.
5. Activate the model for use in datasets.
- Parameters:
name (str) – Internal name for the entity (used as identifier)
guidelines (str) – Initial annotation guidelines for the LLM annotator
display_name (str, optional) – Display name for the UI
- Returns:
The newly created model entity
- Return type:
ModelEntity
Examples
>>> entity = textual.create_model_entity(
...     name="PRODUCT_CODE",
...     guidelines="Identify product codes in format ABC-1234"
... )
>>> entity.upload_test_data([
...     {"text": "Order ABC-1234", "spans": [{"start": 6, "end": 14}]}
... ])
- delete_dataset(
- dataset_name: str,
Deletes a dataset by name.
- Parameters:
dataset_name (str) – The name of the dataset to delete.
- delete_model_entity(
- entity_id: str,
Delete a model-based entity.
- Parameters:
entity_id (str) – The entity’s unique identifier
- download_redacted_file(
- job_id: str,
- generator_default: PiiState | str = PiiState.Redaction,
- generator_config: Dict[str, PiiState | str] = {},
- generator_metadata: Dict[str, BaseMetadata] = {},
- random_seed: int | None = None,
- label_block_lists: Dict[str, List[str]] | None = None,
- num_retries: int = 6,
- wait_between_retries: int = 10,
- custom_entities: List[str] | None = None,
Downloads a redacted file.
- Parameters:
job_id (str) – The identifier of the redaction job.
generator_default (Union[PiiState, str] = PiiState.Redaction) – The default redaction used for types that are not specified in generator_config. Value must be one of “Redaction”, “Synthesis”, or “Off”.
generator_config (Dict[str, Union[PiiState, str]]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.
generator_metadata (Dict[str, BaseMetadata]) – A dictionary of sensitive data entities. For each entity, indicates generator configuration in case synthesis is selected. Values must be of types appropriate to the PII type.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.
label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored values). When a value for the entity type matches a listed regular expression, the value is ignored and is not redacted or synthesized.
num_retries (int = 6) – An optional value to specify the number of times to attempt to download the file. If a file is not yet ready for download, Textual pauses for wait_between_retries seconds (10 by default) before it retries. The default value is 6.
wait_between_retries (int = 10) – The number of seconds to wait between retry attempts. (The default value is 10)
custom_entities (Optional[List[str]]) – A list of custom entity type identifiers to include. Each custom entity type included here may also be included in the generator config. Custom entity types will respect generator defaults if they are not specified in the generator config.
- Returns:
The redacted file as a byte array.
- Return type:
bytes
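The retry behavior described by num_retries and wait_between_retries can be pictured with a small local model. This is only a sketch of the documented semantics, not the library's implementation: download_with_retries, try_download, and the injectable sleep argument are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Optional


def download_with_retries(
    try_download: Callable[[], Optional[bytes]],
    num_retries: int = 6,
    wait_between_retries: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call try_download until it returns data or the attempts run out.

    try_download is a hypothetical callable that returns None while the
    file is not yet ready and the file bytes once it is.
    """
    for attempt in range(num_retries):
        data = try_download()
        if data is not None:
            return data
        # Pause only when another attempt is still to come.
        if attempt < num_retries - 1:
            sleep(wait_between_retries)
    raise TimeoutError(f"File not ready after {num_retries} attempts")
```

With the defaults, a file that never becomes ready is attempted 6 times with five 10-second pauses in this model, so larger files may call for a higher num_retries.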
- get_all_datasets() → List[Dataset]
Gets all of the user’s datasets.
- Returns:
The list of all datasets
- Return type:
List[Dataset]
Examples
>>> datasets = textual.get_all_datasets()
- get_dataset(
- dataset_name: str,
Gets the dataset for the specified dataset name.
- Parameters:
dataset_name (str) – The name of the dataset.
- Return type:
Dataset
Examples
>>> dataset = textual.get_dataset("llama_2_chatbot_finetune_v5")
- get_files(
- dataset_id: str,
Gets all of the files in the dataset.
- Returns:
A list of all of the files in the dataset.
- Return type:
List[DatasetFile]
- get_model_entity(
- entity_id: str,
Get a model-based entity by ID.
- Parameters:
entity_id (str) – The entity’s unique identifier
- Returns:
The model entity object
- Return type:
ModelEntity
- list_model_entities() → List[ModelEntity]
List all model-based entities.
- Returns:
All model entities accessible to the user
- Return type:
List[ModelEntity]
- redact(
- string: str,
- generator_default: PiiState = PiiState.Redaction,
- generator_config: Dict[str, PiiState | str] = {},
- generator_metadata: Dict[str, BaseMetadata] = {},
- random_seed: int | None = None,
- label_block_lists: Dict[str, List[str]] | None = None,
- label_allow_lists: Dict[str, List[str]] | None = None,
- record_options: RecordApiRequestOptions = {'record': False, 'retention_time_in_hours': 0, 'tags': []},
- custom_entities: List[str] | None = None,
Redacts a string. Depending on the configured handling for each sensitive data type, values are either redacted, synthesized, or ignored.
- Parameters:
string (str) – The string to redact.
generator_default (Union[PiiState, str] = PiiState.Redaction) – The default redaction used for types that are not specified in generator_config. Value must be one of “Redaction”, “Synthesis”, or “Off”.
generator_config (Dict[str, Union[PiiState, str]]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.
generator_metadata (Dict[str, BaseMetadata]) – A dictionary of sensitive data entities. For each entity, indicates generator configuration in case synthesis is selected. Values must be of types appropriate to the PII type.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.
label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored values). When a value for an entity type matches a listed regular expression, the value is ignored and is not redacted or synthesized.
label_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, additional values). When a piece of text matches a listed regular expression, the text is marked as the entity type and is included in the redaction or synthesis.
record_options (RecordApiRequestOptions) – Whether to record the API request and results for analysis in the Textual application. By default, the API request is not recorded. When recording is enabled, retention_time_in_hours must be between 1 and 720 (inclusive).
custom_entities (Optional[List[str]]) – A list of custom entity type identifiers to include. Each custom entity type included here may also be included in the generator config. Custom entity types will respect generator defaults if they are not specified in the generator config.
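The relationship between generator_config and generator_default described in the parameters above can be modeled locally. The following is a minimal sketch of the documented fallback rule only. effective_state is a hypothetical helper, not part of the tonic_textual package, and the actual resolution happens on the Textual side.

```python
from typing import Dict

# The three handling options named in the parameter descriptions.
VALID_STATES = {"Redaction", "Synthesis", "Off"}


def effective_state(
    label: str,
    generator_config: Dict[str, str],
    generator_default: str = "Redaction",
) -> str:
    """Return the handling for an entity type.

    Types listed in generator_config use their own setting; every other
    type falls back to generator_default.
    """
    state = generator_config.get(label, generator_default)
    if state not in VALID_STATES:
        raise ValueError(f"Unsupported handling option: {state!r}")
    return state


config = {"NAME_GIVEN": "Redaction", "CUSTOM_COGNITIVE_ACCESS_KEY": "Synthesis"}
for label in ("NAME_GIVEN", "CUSTOM_COGNITIVE_ACCESS_KEY", "NAME_FAMILY"):
    print(label, effective_state(label, config, generator_default="Off"))
```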
- Returns:
The redacted string along with ancillary information.
- Return type:
RedactionResponse
Examples
>>> textual.redact(
...     "John Smith is a person",
...     # Redacts NAME_GIVEN and synthesizes CUSTOM_COGNITIVE_ACCESS_KEY; all other types are off
...     generator_config={"NAME_GIVEN": "Redaction", "CUSTOM_COGNITIVE_ACCESS_KEY": "Synthesis"},
...     generator_default="Off",
...     # Occurrences of "There" are treated as NAME_GIVEN entities
...     label_allow_lists={"NAME_GIVEN": ["There"]},
...     # Text matching the regex ` ([a-z]{2}) ` is not treated as an occurrence of NAME_FAMILY
...     label_block_lists={"NAME_FAMILY": [" ([a-z]{2}) "]},
...     # The custom entities passed here will be included in the redaction and may be included in generator_config
...     custom_entities=["CUSTOM_COGNITIVE_ACCESS_KEY", "CUSTOM_PERSONAL_GRAVITY_INDEX"],
... )
- redact_audio_file(
- audio_file_path: str,
- output_file_path: str,
- generator_default: PiiState = PiiState.Redaction,
- generator_config: Dict[str, PiiState] = {},
- label_block_lists: Dict[str, List[str]] | None = None,
- label_allow_lists: Dict[str, List[str]] | None = None,
- custom_entities: List[str] | None = None,
- before_beep_buffer: float = 250.0,
- after_beep_buffer: float = 250.0,
Generates a redacted audio file by identifying and removing sensitive audio segments.
- Parameters:
audio_file_path (str) – The path to the input audio file. Supported file types are wav, mp3, ogg, flv, wma, aac, and others. See https://github.com/jiaaro/pydub for complete information on file types supported.
output_file_path (str) – The path to save the redacted output file. The file extension of the output path determines the audio file type that the output is written as. Supported file types are wav, mp3, ogg, flv, wma, and aac. See https://github.com/jiaaro/pydub for complete information on file types supported.
generator_default (PiiState = PiiState.Redaction) – The default redaction used for types that are not specified in generator_config. Value must be one of “Redaction”, “Synthesis”, or “Off”.
generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.
label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored values). When a value for an entity type matches a listed regular expression, the value is ignored and is not redacted or synthesized.
label_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, additional values). When a piece of text matches a listed regular expression, the text is marked as the entity type and is included in the redaction or synthesis.
custom_entities (Optional[List[str]]) – A list of custom entity type identifiers to include. Each custom entity type included here may also be included in the generator config. Custom entity types will respect generator defaults if they are not specified in the generator config.
before_beep_buffer (float, optional) – Buffer time (in milliseconds) to include before redaction interval (default is 250.0).
after_beep_buffer (float, optional) – Buffer time (in milliseconds) to include after redaction interval (default is 250.0).
- Returns:
The path to the redacted output audio file.
- Return type:
str
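To make the two buffer parameters concrete, here is a small sketch of how a redaction interval might be widened by before_beep_buffer and after_beep_buffer. It is an illustration under stated assumptions, not the library's implementation: padded_interval is a hypothetical helper, and clamping to the bounds of the recording is an assumption that the documentation does not confirm.

```python
from typing import Tuple


def padded_interval(
    start_ms: float,
    end_ms: float,
    duration_ms: float,
    before_beep_buffer: float = 250.0,
    after_beep_buffer: float = 250.0,
) -> Tuple[float, float]:
    """Widen a redaction interval by the buffers, all in milliseconds.

    Assumption: the padded interval is clamped so that it never extends
    past the start or the end of the recording.
    """
    padded_start = max(0.0, start_ms - before_beep_buffer)
    padded_end = min(duration_ms, end_ms + after_beep_buffer)
    return padded_start, padded_end


# A name spoken between 1.0 s and 2.0 s of a 10 s recording.
print(padded_interval(1000.0, 2000.0, 10000.0))
```

Larger buffers reduce the chance that the start or end of a sensitive word remains audible, at the cost of masking more of the surrounding speech.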
- redact_bulk(
- strings: List[str],
- generator_default: PiiState | str = PiiState.Redaction,
- generator_config: Dict[str, PiiState | str] = {},
- generator_metadata: Dict[str, BaseMetadata] = {},
- random_seed: int | None = None,
- label_block_lists: Dict[str, List[str]] | None = None,
- label_allow_lists: Dict[str, List[str]] | None = None,
- custom_entities: List[str] | None = None,
Redacts a list of strings. Depending on the configured handling for each sensitive data type, values are either redacted, synthesized, or ignored.
- Parameters:
strings (List[str]) – The array of strings to redact.
generator_default (Union[PiiState, str] = PiiState.Redaction) – The default redaction used for types that are not specified in generator_config. Value must be one of “Redaction”, “Synthesis”, or “Off”.
generator_config (Dict[str, Union[PiiState, str]]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.
generator_metadata (Dict[str, BaseMetadata]) – A dictionary of sensitive data entities. For each entity, indicates generator configuration in case synthesis is selected. Values must be of types appropriate to the PII type.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.
label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored values). When a value for an entity type matches a listed regular expression, the value is ignored and is not redacted or synthesized.
label_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, additional values). When a piece of text matches a listed regular expression, the text is marked as the entity type and is included in the redaction or synthesis.
custom_entities (Optional[List[str]]) – A list of custom entity type identifiers to include. Each custom entity type included here may also be included in the generator config. Custom entity types will respect generator defaults if they are not specified in the generator config.
- Returns:
The redacted strings along with ancillary information.
- Return type:
Examples
>>> textual.redact_bulk(
...     ["John Smith is a person", "I live in Atlanta"],
...     # Redacts NAME_GIVEN and synthesizes CUSTOM_COGNITIVE_ACCESS_KEY; all other types are off
...     generator_config={"NAME_GIVEN": "Redaction", "CUSTOM_COGNITIVE_ACCESS_KEY": "Synthesis"},
...     generator_default="Off",
...     # Occurrences of "There" are treated as NAME_GIVEN entities
...     label_allow_lists={"NAME_GIVEN": ["There"]},
...     # Text matching the regex ` ([a-z]{2}) ` is not treated as an occurrence of NAME_FAMILY
...     label_block_lists={"NAME_FAMILY": [" ([a-z]{2}) "]},
...     # The custom entities passed here will be included in the redaction and may be included in generator_config
...     custom_entities=["CUSTOM_COGNITIVE_ACCESS_KEY", "CUSTOM_PERSONAL_GRAVITY_INDEX"],
... )
- redact_html(
- html_data: str,
- generator_default: PiiState | str = PiiState.Redaction,
- generator_config: Dict[str, PiiState | str] = {},
- generator_metadata: Dict[str, BaseMetadata] = {},
- random_seed: int | None = None,
- label_block_lists: Dict[str, List[str]] | None = None,
- label_allow_lists: Dict[str, List[str]] | None = None,
- custom_entities: List[str] | None = None,
- record_options: RecordApiRequestOptions = {'record': False, 'retention_time_in_hours': 0, 'tags': []},
Redacts the values in an HTML blob. Depending on the configured handling for each entity type, values are either redacted, synthesized, or ignored.
- Parameters:
html_data (str) – The HTML for which to redact values.
generator_default (Union[PiiState, str] = PiiState.Redaction) – The default redaction used for types that are not specified in generator_config. Value must be one of “Redaction”, “Synthesis”, or “Off”.
generator_config (Dict[str, Union[PiiState, str]]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.
generator_metadata (Dict[str, BaseMetadata]) – A dictionary of sensitive data entities. For each entity, indicates generator configuration in case synthesis is selected. Values must be of types appropriate to the PII type.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.
label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored values). The ignored values are regular expressions. When a value for the entity type matches a listed regular expression, the value is ignored and is not redacted or synthesized.
label_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, additional values). The additional values are regular expressions. When a piece of text matches a listed regular expression, the text is marked as the entity type and is included in the redaction or synthesis.
custom_entities (Optional[List[str]]) – A list of custom entity type identifiers to include. Each custom entity type included here may also be included in the generator config. Custom entity types will respect generator defaults if they are not specified in the generator config.
record_options (RecordApiRequestOptions) – Whether to record the API request and results for analysis in the Textual application. By default, the API request is not recorded. When recording is enabled, retention_time_in_hours must be between 1 and 720 (inclusive).
- Returns:
The redacted string plus additional information.
- Return type:
- redact_json(
- json_data: str | dict,
- generator_default: PiiState | str = PiiState.Redaction,
- generator_config: Dict[str, PiiState | str] = {},
- generator_metadata: Dict[str, BaseMetadata] = {},
- random_seed: int | None = None,
- label_block_lists: Dict[str, List[str]] | None = None,
- label_allow_lists: Dict[str, List[str]] | None = None,
- jsonpath_allow_lists: Dict[str, List[str]] | None = None,
- json_path_ignore_paths: List[str] | None = None,
- custom_entities: List[str] | None = None,
Redacts the values in a JSON blob. Depending on the configured handling for each sensitive data type, values are either redacted, synthesized, or ignored.
- Parameters:
json_data (Union[str, dict]) – The JSON for which to redact values. This can be either a JSON string or a Python dictionary.
generator_default (Union[PiiState, str] = PiiState.Redaction) – The default redaction used for types that are not specified in generator_config. Value must be one of “Redaction”, “Synthesis”, or “Off”.
generator_config (Dict[str, Union[PiiState, str]]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.
generator_metadata (Dict[str, BaseMetadata]) – A dictionary of sensitive data entities. For each entity, indicates generator configuration in case synthesis is selected. Values must be of types appropriate to the PII type.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.
label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored values). When a value for the entity type matches a listed regular expression, the value is ignored and is not redacted or synthesized.
label_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, additional values). When a piece of text matches a listed regular expression, the text is marked as the entity type and is included in the redaction or synthesis.
jsonpath_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, path expressions). When an element in the JSON document matches a JSON path expression, the entire text value is treated as the specified entity type. Only supported for path expressions that point to JSON primitive values. This setting overrides any results found by the NER model or in label allow and block lists. If multiple path expressions point to the same JSON node but specify different entity types, the value is redacted as one of those types, chosen at random.
json_path_ignore_paths (Optional[List[str]]) – Optional list of JSONPath expressions for values that should not be redacted. Any JSON element matching these paths will be left unchanged in the output.
custom_entities (Optional[List[str]]) – A list of custom entity type identifiers to include. Each custom entity type included here may also be included in the generator config. Custom entity types will respect generator defaults if they are not specified in the generator config.
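The two JSONPath parameters above take differently shaped arguments: jsonpath_allow_lists is a dictionary keyed by entity type, while json_path_ignore_paths is a flat list. A minimal sketch of a call that uses both follows. The redact_profile wrapper and the first_name and account_id fields are hypothetical, and client is expected to be a TextualNer instance.

```python
import json
from typing import Any


def redact_profile(client: Any, profile: dict) -> Any:
    """Redact a profile document with path-based overrides.

    client is expected to be a TextualNer instance. The field names used
    in the path expressions are illustrative only.
    """
    return client.redact_json(
        json.dumps(profile),
        # Treat the whole value at $.first_name as a given name.
        jsonpath_allow_lists={"NAME_GIVEN": ["$.first_name"]},
        # Leave the account identifier unchanged in the output.
        json_path_ignore_paths=["$.account_id"],
    )
```

Because json_data also accepts a Python dictionary, the json.dumps call is optional. It is shown here only to make the string form explicit.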
- Returns:
The redacted string along with ancillary information.
- Return type:
- redact_structured(
- values: List[str],
- pii_type: str,
- generator_metadata: BaseMetadata | None = None,
- random_seed: int | None = None,
Synthesizes a column of structured values for a given entity type. Unlike redact/redact_bulk, this does not perform PII detection — every value is treated as the specified entity type and replaced accordingly.
- Parameters:
values (List[str]) – The column of values to synthesize.
pii_type (str) – The entity type label to apply to every value (e.g. “EMAIL_ADDRESS”, “PHONE_NUMBER”, “NAME_GIVEN”).
generator_metadata (Optional[BaseMetadata] = None) – Optional generator metadata to control synthesis behavior for the given entity type. For example, an EmailGeneratorMetadata with preserve_domain=True. If not provided, the server default is used.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.
- Returns:
The synthesized values, in the same order as the input.
- Return type:
List[str]
Examples
>>> from tonic_textual.classes.generator_metadata.email_generator_metadata import EmailGeneratorMetadata
>>> textual.redact_structured(
...     ["john@example.com", "jane@company.org"],
...     pii_type="EMAIL_ADDRESS",
...     generator_metadata=EmailGeneratorMetadata(preserve_domain=True),
... )
- redact_xml(
- xml_data: str,
- generator_default: PiiState | str = PiiState.Redaction,
- generator_config: Dict[str, PiiState | str] = {},
- generator_metadata: Dict[str, BaseMetadata] = {},
- random_seed: int | None = None,
- label_block_lists: Dict[str, List[str]] | None = None,
- label_allow_lists: Dict[str, List[str]] | None = None,
- custom_entities: List[str] | None = None,
Redacts the values in an XML blob. Depending on the configured handling for each entity type, values are either redacted, synthesized, or ignored.
- Parameters:
xml_data (str) – The XML for which to redact values.
generator_default (Union[PiiState, str] = PiiState.Redaction) – The default redaction used for types that are not specified in generator_config. Value must be one of “Redaction”, “Synthesis”, or “Off”.
generator_config (Dict[str, Union[PiiState, str]]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.
generator_metadata (Dict[str, BaseMetadata]) – A dictionary of sensitive data entities. For each entity, indicates generator configuration in case synthesis is selected. Values must be of types appropriate to the PII type.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.
label_block_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, ignored values). When a value for the entity type matches a listed regular expression, the value is ignored and is not redacted or synthesized.
label_allow_lists (Optional[Dict[str, List[str]]]) – A dictionary of (entity type, additional values). When a piece of text matches a listed regular expression, the text is marked as the entity type and is included in the redaction or synthesis.
custom_entities (Optional[List[str]]) – A list of custom entity type identifiers to include. Each custom entity type included here may also be included in the generator config. Custom entity types will respect generator defaults if they are not specified in the generator config.
- Returns:
The redacted string plus additional information.
- Return type:
- send_redact_bulk_request(
- endpoint: str,
- payload: Dict,
- random_seed: int | None = None,
Helper function to send redact requests, handle responses, and catch errors.
- send_redact_request(
- endpoint: str,
- payload: Dict,
- random_seed: int | None = None,
Helper function to send redact requests, handle responses, and catch errors.
- start_file_redaction(
- file: IOBase,
- file_name: str,
- custom_entities: List[str] | None = None,
Redacts a provided file.
- Parameters:
file (io.IOBase) – The opened file, available for reading, to upload and redact.
file_name (str) – The name of the file.
custom_entities (Optional[List[str]]) – A list of custom entity type identifiers to include. Each custom entity type included here may also be included in the generator config. Custom entity types will respect generator defaults if they are not specified in the generator config.
- Returns:
The job identifier, which can be used to download the redacted file when it is ready.
- Return type:
str
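The job identifier returned here is the value that download_redacted_file expects, so the two calls are typically used together. The following sketch shows one way to combine them. redact_file is a hypothetical wrapper, client is expected to be a TextualNer instance, and opening the file in binary mode is an assumption.

```python
import os
from typing import Any


def redact_file(client: Any, input_path: str, output_path: str) -> str:
    """Upload a file for redaction, then save the redacted result.

    client is expected to be a TextualNer instance. Returns the job
    identifier so that the download can be repeated later if needed.
    """
    with open(input_path, "rb") as source:
        job_id = client.start_file_redaction(source, os.path.basename(input_path))

    # download_redacted_file retries while the job is still processing.
    redacted = client.download_redacted_file(
        job_id, num_retries=6, wait_between_retries=10
    )

    with open(output_path, "wb") as target:
        target.write(redacted)
    return job_id
```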
- unredact(
- redacted_string: str,
- random_seed: int | None = None,
Removes the redaction from a provided string. Returns the string with the original values.
- Parameters:
redacted_string (str) – The redacted string from which to remove the redaction.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.
- Returns:
The string with the redaction removed.
- Return type:
str
- unredact_bulk(
- redacted_strings: List[str],
- random_seed: int | None = None,
Removes redaction from a list of strings. Returns the strings with the original values.
- Parameters:
redacted_strings (List[str]) – The list of redacted strings from which to remove the redaction.
random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.
- Returns:
The list of strings with the redaction removed.
- Return type:
List[str]
- class tonic_textual.classes.record_api_request_options.RecordApiRequestOptions(
- record: bool,
- retention_time_in_hours: int,
- tags: List[str] = [],
Class to denote whether to record an API request.
- Parameters:
record (bool) – Whether to record the request.
retention_time_in_hours (int) – The number of hours to store the request. The request is then purged automatically.
tags (List[str]) – A list of tags to assign to the request. Used to help search for the request on the API Explorer page. The default is the empty list [], which corresponds to assigning no tags to the request.
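A request can be checked locally before it is sent. The sketch below mirrors the three fields of RecordApiRequestOptions in a plain dictionary and applies the 1 to 720 hour retention range that the record_options parameter describes. check_record_options is a hypothetical helper, and applying the range only when record is True is an assumption based on the default value of 0 hours.

```python
from typing import Any, Dict, List, Optional


def check_record_options(
    record: bool,
    retention_time_in_hours: int,
    tags: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Validate recording options and return them as a dictionary.

    Assumption: the 1 to 720 hour range applies only when recording is
    enabled, since the default combines record=False with 0 hours.
    """
    if record and not 1 <= retention_time_in_hours <= 720:
        raise ValueError("retention_time_in_hours must be between 1 and 720")
    return {
        "record": record,
        "retention_time_in_hours": retention_time_in_hours,
        # Copy the list so that the caller's list is never shared.
        "tags": list(tags or []),
    }


print(check_record_options(True, 24, ["support-chat"]))
```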
Redaction response
- class tonic_textual.classes.redact_api_responses.redaction_response.RedactionResponse(
- original_text: str,
- redacted_text: str,
- usage: int,
- de_identify_results: List[Replacement],
Redaction response object
- Variables:
original_text (str) – The original text.
redacted_text (str) – The redacted and synthesized text.
usage (int) – The number of words used
de_identify_results (List[Replacement]) – The list of named entities that were found in original_text.
- class tonic_textual.classes.common_api_responses.replacement.Replacement(
- start: int,
- end: int,
- new_start: int,
- new_end: int,
- label: str,
- text: str,
- score: float,
- language: str,
- new_text: str | None = None,
- example_redaction: str | None = None,
- json_path: str | None = None,
- xml_path: str | None = None,
A span of text that was detected as a named entity.
- Variables:
start (int) – The start index of the entity in the original text.
end (int) – The end index of the entity in the original text. The end index is exclusive.
new_start (int) – The start index of the entity in the redacted/synthesized text.
new_end (int) – The end index of the entity in the redacted/synthesized text. The end index is exclusive.
python_start (Optional[int]) – The start index in Python (if different from start).
python_end (Optional[int]) – The end index in Python (if different from end).
label (str) – The label of the entity.
text (str) – The substring of the original text that was detected as an entity.
new_text (Optional[str]) – The new text to replace the original entity.
score (float) – The confidence score of the detection.
language (str) – The language of the entity.
example_redaction (Optional[str]) – An example redaction for the entity.
json_path (Optional[str]) – The JSON path of the entity in the original JSON document. This is only present if the input text was a JSON document.
xml_path (Optional[str]) – The xpath of the entity in the original XML document. This is only present if the input text was an XML document. NOTE: Arrays in xpath are 1-based.
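Because end is exclusive, start and end can be used directly as Python slice bounds. The sketch below rebuilds a redacted string from the original text and a list of spans. Span is a hypothetical stand-in that carries only three of the Replacement fields, the bracketed replacement tokens are illustrative, and the code assumes that the spans do not overlap and that start and end are valid Python string indices (the python_start and python_end fields suggest that this is not always the case).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """Hypothetical stand-in for a subset of the Replacement fields."""

    start: int     # inclusive start index in the original text
    end: int       # exclusive end index in the original text
    new_text: str  # replacement value


def apply_spans(original_text: str, spans: List[Span]) -> str:
    """Replace each span in original_text with its new_text."""
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        # Copy the untouched text that precedes this entity.
        pieces.append(original_text[cursor:span.start])
        pieces.append(span.new_text)
        cursor = span.end
    pieces.append(original_text[cursor:])
    return "".join(pieces)


text = "John Smith is a person"
spans = [Span(0, 4, "[NAME_GIVEN]"), Span(5, 10, "[NAME_FAMILY]")]
print(apply_spans(text, spans))
```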
Dataset entity mappings response
- class tonic_textual.classes.common_api_responses.dataset_entity_mappings_response.DatasetEntityMappingsResponse(
- files: List[DatasetFileEntityMappings],
Entity mappings for a dataset, grouped by file.
- Variables:
files (List[DatasetFileEntityMappings]) – The entity mappings for the dataset, grouped by file.
- class tonic_textual.classes.common_api_responses.dataset_file_entity_mappings.DatasetFileEntityMappings(
- file_id: str,
- file_name: str,
- entities: List[EntityMapping],
The entity mappings detected for a single dataset file.
- Variables:
file_id (str) – The identifier of the dataset file.
file_name (str) – The file name shown in the dataset.
entities (List[EntityMapping]) – The entity mappings detected for this file after the dataset generator configuration is applied.
- class tonic_textual.classes.common_api_responses.entity_mapping.EntityMapping(
- label: str,
- text: str,
- redacted_text: str | None = None,
- synthetic_text: str | None = None,
- applied_generator_state: str | None = None,
- output_text: str | None = None,
- row_number: int | None = None,
- column_index: int | None = None,
- score: float | None = None,
An entity detected in a dataset file and the values it maps to in output.
- Variables:
label (str) – The entity label detected in the dataset file.
text (str) – The original text value that was detected as the entity.
redacted_text (Optional[str]) – The redacted token that would replace the original value when the entity is configured for redaction.
synthetic_text (Optional[str]) – The synthetic value that would replace the original value when the entity is configured for synthesis.
applied_generator_state (Optional[str]) – The dataset generator state currently applied to this entity type, such as redaction or synthesis.
output_text (Optional[str]) – The final value that would appear in generated dataset output after the current generator configuration is applied.
row_number (Optional[int]) – The 1-based row number for tabular files, when available.
column_index (Optional[int]) – The 0-based column index for tabular files, when available.
score (Optional[float]) – The confidence score for the detected entity, when available.
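Note that row_number is 1-based while column_index is 0-based, which is easy to get wrong when looking a mapping up in tabular data. A minimal sketch of the conversion follows. lookup_cell is a hypothetical helper, and it assumes that row 1 corresponds to the first element of rows. Whether a header row is counted is not stated in this reference.

```python
from typing import List, Optional


def lookup_cell(
    rows: List[List[str]],
    row_number: Optional[int],
    column_index: Optional[int],
) -> Optional[str]:
    """Return the cell that an entity mapping points to, if it has one.

    Assumption: row_number 1 refers to rows[0].
    """
    if row_number is None or column_index is None:
        # Non-tabular files carry no position information.
        return None
    # row_number is 1-based, column_index is already 0-based.
    return rows[row_number - 1][column_index]


rows = [
    ["Adam", "adam@example.com"],
    ["Jane", "jane@example.com"],
]
print(lookup_cell(rows, row_number=2, column_index=0))
```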
Helper classes
- class tonic_textual.helpers.replace_text_helper.ReplaceTextHelper
A helper class for modifying synthetic values returned from redaction calls
- class tonic_textual.helpers.json_conversation_helper.JsonConversationHelper
A helper class for processing generic chat data and transcribed audio where the conversation is broken down into pieces and represented in JSON.
For example:
{
  "conversations": [
    {"role":"customer", "text": "Hi, this is Adam"},
    {"role":"agent", "text": "Hi Adam, nice to meet you this is Jane."}
  ]
}
- redact(
- conversation: dict,
- items_getter: Callable[[dict], list],
- text_getter: Callable[[Any], list],
- redact_func: Callable[[str], RedactionResponse],
- join_char: str | None = '\n',
Redacts a conversation.
- Parameters:
conversation (dict) – The Python dictionary, loaded from JSON, that contains the text parts of the conversation.
items_getter (Callable[[dict], list]) –
A function that retrieves the array of conversation items. For example, if the conversation is represented in JSON as:
{ "conversations": [ {"role":"customer", "text": "Hi, this is Adam"}, {"role":"agent", "text": "Hi Adam, nice to meet you this is Jane."} ] }
Then items_getter would be defined as
lambda x: x["conversations"]
text_getter (Callable[[dict], str]) –
A function that retrieves the text from a given item returned by items_getter. For example, if items_getter returns a list of objects such as:
{"role":"customer", "text": "Hi, this is Adam"}
Then text_getter would be defined as
lambda x: x["text"]
redact_func (Callable[[str], RedactionResponse]) – The function you use to make the Textual redaction call. This should be an invocation of TextualNer.redact, such as lambda x: ner.redact(x).
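The two getters can be tried out without a Textual instance. The following is a minimal sketch and not the helper's actual implementation: fake_redact is a hypothetical stand-in for lambda x: ner.redact(x) that masks two hard-coded names and returns a plain string instead of a RedactionResponse. Joining the texts with "\n" mirrors the default join_char value, but how the helper uses join_char internally is an assumption here.

```python
# Conversation in the shape shown above.
conversation = {
    "conversations": [
        {"role": "customer", "text": "Hi, this is Adam"},
        {"role": "agent", "text": "Hi Adam, nice to meet you this is Jane."},
    ]
}

# The two getters, as described in the parameter list.
items_getter = lambda x: x["conversations"]
text_getter = lambda x: x["text"]


def fake_redact(text: str) -> str:
    """Hypothetical stand-in for TextualNer.redact; masks two known names."""
    for name in ("Adam", "Jane"):
        text = text.replace(name, "[NAME_GIVEN]")
    return text


# Collect the text of every item, join it, redact once, then split it back.
texts = [text_getter(item) for item in items_getter(conversation)]
joined = "\n".join(texts)  # assumption: mirrors the default join_char
redacted_parts = fake_redact(joined).split("\n")
```

In real use, redact_func would call TextualNer.redact, and the helper's redact method would handle the joining and splitting.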
Generator metadata¶
- class tonic_textual.classes.generator_metadata.base_metadata.BaseMetadata(
- custom_generator: GeneratorType | None = None,
- generator_version: GeneratorVersion = GeneratorVersion.V1,
- swaps: Dict[str, str] | None = {},
- constant_value: str | None = None,
Base class for all generator metadata configurations.
Provides common parameters shared by all metadata types. You typically do not instantiate this class directly. Instead, use a specific metadata subclass such as NameGeneratorMetadata or EmailGeneratorMetadata.
- Parameters:
custom_generator (GeneratorType, optional) – The generator type. Set automatically by subclasses.
generator_version (GeneratorVersion) – The generator version to use. Default is V1.
swaps (dict of str to str, optional) – A dictionary of explicit replacement mappings. When a detected value matches a key in the dictionary, the corresponding value is used as the synthesized replacement instead of a generated one.
constant_value (str, optional) – A string value that will be used as the replacement, when not None and there is no matching source value in swaps.
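The precedence between swaps and constant_value can be summarized in a few lines. This is an illustrative sketch of the order described above, not the library's code: resolve_replacement and fake_generator are hypothetical names, and exact string matching against the swaps keys is an assumption.

```python
from typing import Callable, Dict, Optional


def resolve_replacement(
    original: str,
    swaps: Optional[Dict[str, str]],
    constant_value: Optional[str],
    generate: Callable[[str], str],
) -> str:
    """Illustrative precedence: swaps, then constant_value, then the generator."""
    if swaps and original in swaps:
        return swaps[original]  # an explicit mapping wins
    if constant_value is not None:
        return constant_value  # used when no swap matched
    return generate(original)  # otherwise fall back to a generated value


# Hypothetical generator that always returns the same marker.
fake_generator = lambda value: "<generated>"
```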
- class tonic_textual.classes.generator_metadata.base_date_time_generator_metadata.BaseDateTimeGeneratorMetadata(
- custom_generator: GeneratorType | None = None,
- generator_version: GeneratorVersion = GeneratorVersion.V1,
- scramble_unrecognized_dates: bool = True,
- swaps: Dict[str, str] | None = {},
- constant_value: str | None = None,
Base class for date and time related generator metadata.
Extends BaseMetadata with a common date/time parameter. You typically do not instantiate this class directly. Instead, use DateTimeGeneratorMetadata or PersonAgeGeneratorMetadata.
- Parameters:
scramble_unrecognized_dates (bool) – When True, dates that Textual cannot parse into a standard format are scrambled. When False, unrecognized dates are left unchanged. Default is True.
- class tonic_textual.classes.generator_metadata.name_generator_metadata.NameGeneratorMetadata(
- generator_version: GeneratorVersion = GeneratorVersion.V1,
- is_consistency_case_sensitive: bool = False,
- preserve_gender: bool = False,
- swaps: Dict[str, str] | None = {},
- constant_value: str | None = None,
Metadata configuration for name synthesis.
Controls how synthesized names are generated for entity types such as NAME_GIVEN and NAME_FAMILY.
- Parameters:
is_consistency_case_sensitive (bool) – When True, name consistency is case-sensitive. For example, "john" and "John" are treated as different names and might receive different replacements. Default is False.
preserve_gender (bool) – When True, the synthesized name preserves the gender of the original name. Male names are replaced with male names, and female names are replaced with female names. Default is False.
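One way to picture is_consistency_case_sensitive is as a choice of lookup key for a deterministic mapping. The sketch below is an assumption made for illustration only. Textual's actual consistency mechanism is not described here, and NAMES and consistent_name are hypothetical.

```python
import hashlib

NAMES = ["Alan", "Brian", "Carol", "Dana"]  # hypothetical replacement pool


def consistent_name(original: str, is_consistency_case_sensitive: bool = False) -> str:
    """Map a name to a replacement deterministically (illustrative only)."""
    # When case-insensitive, "john" and "John" share the same lookup key.
    key = original if is_consistency_case_sensitive else original.lower()
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return NAMES[digest[0] % len(NAMES)]
```

With the default setting, "john" and "John" always map to the same replacement. With the flag set to True, they are hashed separately and might differ.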
- class tonic_textual.classes.generator_metadata.email_generator_metadata.EmailGeneratorMetadata(
- generator_version: GeneratorVersion = GeneratorVersion.V1,
- preserve_domain: bool = False,
- swaps: Dict[str, str] | None = {},
- constant_value: str | None = None,
Metadata configuration for email address synthesis.
Controls how synthesized email addresses are generated for the EMAIL_ADDRESS entity type.
- Parameters:
preserve_domain (bool) – When True, the domain portion of the email address is kept intact. For example, "john@example.com" might become "alan@example.com". Default is False.
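The effect of preserve_domain can be sketched as follows. This is a simplified illustration, not the generator itself: synthesize_email is a hypothetical function, and the fixed local part "alan" and the domain "synthetic.invalid" are placeholders for values that a real generator would produce.

```python
def synthesize_email(original: str, preserve_domain: bool = False) -> str:
    """Replace the local part, and the domain unless it is preserved."""
    # Split at the last "@" so the domain is everything after it.
    _local, _, domain = original.rpartition("@")
    new_local = "alan"  # placeholder for a generated local part
    new_domain = domain if preserve_domain else "synthetic.invalid"  # placeholder
    return f"{new_local}@{new_domain}"
```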
- class tonic_textual.classes.generator_metadata.phone_number_generator_metadata.PhoneNumberGeneratorMetadata(
- generator_version: GeneratorVersion = GeneratorVersion.V1,
- use_us_phone_number_generator: bool = False,
- replace_invalid_numbers: bool = True,
- preserve_us_area_code: bool = False,
- swaps: Dict[str, str] | None = {},
- constant_value: str | None = None,
Metadata configuration for phone number synthesis.
Controls how synthesized telephone numbers are generated for the PHONE_NUMBER entity type.
- Parameters:
use_us_phone_number_generator (bool) – When True, generated telephone numbers use a US phone number format. Default is False.
replace_invalid_numbers (bool) – When True, phone numbers that are detected but are not valid phone numbers are replaced with synthesized values. Default is True.
preserve_us_area_code (bool) – When True and use_us_phone_number_generator is also True, the area code of the original phone number is preserved in the synthesized value. Default is False.
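The dependency between preserve_us_area_code and use_us_phone_number_generator can be sketched like this. It is an assumption-laden illustration rather than the real generator: synthesize_us_phone is hypothetical, the output is always written in a US format for simplicity, and "555" and "555-0100" are fixed placeholder digits.

```python
import re


def synthesize_us_phone(
    original: str,
    use_us_phone_number_generator: bool = False,
    preserve_us_area_code: bool = False,
) -> str:
    """Keep the area code only when both flags are True (illustrative only)."""
    # Keep the last 10 digits, which drops a leading country code if present.
    digits = re.sub(r"\D", "", original)[-10:]
    keep_area = (
        use_us_phone_number_generator
        and preserve_us_area_code
        and len(digits) == 10
    )
    area = digits[:3] if keep_area else "555"  # placeholder area code
    return f"({area}) 555-0100"  # placeholder subscriber number
```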
- class tonic_textual.classes.generator_metadata.date_time_generator_metadata.DateTimeGeneratorMetadata(
- generator_version: GeneratorVersion = GeneratorVersion.V1,
- scramble_unrecognized_dates: bool = True,
- additional_date_formats: List[str] = [],
- apply_constant_shift_to_document: bool = False,
- use_clear_date_and_passthrough_or_group_year_generator: bool = False,
- metadata: TimestampShiftMetadata = None,
- swaps: Dict[str, str] | None = {},
- constant_value: str | None = None,
Metadata configuration for date and time synthesis.
Controls how synthesized date and time values are generated for the DATE_TIME entity type. Dates are shifted by a random number of days within a configurable range.
- Parameters:
scramble_unrecognized_dates (bool) – When True, dates that Textual cannot parse into a standard format are scrambled. Default is True.
additional_date_formats (list of str) – A list of additional date format patterns that Textual should recognize. Use Python strftime/strptime format codes. Default is an empty list.
apply_constant_shift_to_document (bool) – When True, all dates within the same document are shifted by the same random offset. This preserves relative time differences between dates. Default is False.
use_clear_date_and_passthrough_or_group_year_generator (bool) – When True, sets the date to January 1st. If the year is less than 90 years ago, the year is passed through. Otherwise, the year is set to the current year minus 90. When False, this setting has no effect. Default is False.
metadata (TimestampShiftMetadata) – Configuration for the date shift range. By default, dates shift by -7 to +7 days.
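The rule behind use_clear_date_and_passthrough_or_group_year_generator is easier to follow as code. The function below is a hypothetical sketch of the rule as stated above, using only the standard library. It takes today as an explicit argument so that the result is reproducible.

```python
from datetime import date


def clear_date_and_group_year(value: date, today: date) -> date:
    """Set the date to January 1st and cap how far back the year can go."""
    cutoff_year = today.year - 90
    # A year less than 90 years ago passes through; older years are grouped.
    year = value.year if value.year > cutoff_year else cutoff_year
    return date(year, 1, 1)


today = date(2024, 6, 1)
recent = clear_date_and_group_year(date(1980, 5, 17), today)  # year is kept
old = clear_date_and_group_year(date(1920, 3, 2), today)      # year is grouped
```

With today in 2024, a 1980 date becomes January 1, 1980, and a 1920 date becomes January 1, 1934.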
- class tonic_textual.classes.generator_metadata.timestamp_shift_metadata.TimestampShiftMetadata(
- left_shift_in_days: int | None = -7,
- right_shift_in_days: int | None = 7,
- time_stamp_shift_in_days: int | None = None,
- swaps: Dict[str, str] | None = {},
- constant_value: str | None = None,
Configuration for the date shift range used by DateTimeGeneratorMetadata.
Defines the range of days by which dates can be shifted. The actual shift for each date is randomly chosen within the specified range.
- Parameters:
left_shift_in_days (int, optional) – The minimum (leftmost) shift in days. Use a negative value to shift dates into the past. Default is -7.
right_shift_in_days (int, optional) – The maximum (rightmost) shift in days. Use a positive value to shift dates into the future. Default is 7.
time_stamp_shift_in_days (int, optional) – Deprecated. Use left_shift_in_days and right_shift_in_days instead.
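To see how the shift range interacts with apply_constant_shift_to_document on DateTimeGeneratorMetadata, consider the following sketch. It is not Textual's implementation: shift_dates is a hypothetical function, and treating both ends of the range as inclusive is an assumption.

```python
import random
from datetime import date, timedelta
from typing import List, Optional


def shift_dates(
    dates: List[date],
    left_shift_in_days: int = -7,
    right_shift_in_days: int = 7,
    apply_constant_shift_to_document: bool = False,
    rng: Optional[random.Random] = None,
) -> List[date]:
    """Shift dates by random offsets drawn from the configured range."""
    rng = rng or random.Random()

    def pick_offset() -> timedelta:
        # Assumption: both ends of the range are inclusive.
        return timedelta(days=rng.randint(left_shift_in_days, right_shift_in_days))

    if apply_constant_shift_to_document:
        offset = pick_offset()  # one offset for the whole document
        return [d + offset for d in dates]
    return [d + pick_offset() for d in dates]  # independent offset per date


admitted, discharged = date(2024, 3, 1), date(2024, 3, 11)
shifted = shift_dates([admitted, discharged], apply_constant_shift_to_document=True)
```

With a constant shift, the ten-day gap between the two dates is preserved. With independent shifts, the gap can change by up to the width of the range.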
- class tonic_textual.classes.generator_metadata.person_age_generator_metadata.PersonAgeGeneratorMetadata(
- generator_version: GeneratorVersion = GeneratorVersion.V1,
- scramble_unrecognized_dates: bool = True,
- metadata: AgeShiftMetadata = None,
- swaps: Dict[str, str] | None = {},
- constant_value: str | None = None,
- use_passthrough_or_group_age_generator: bool = False,
Metadata configuration for person age synthesis.
Controls how synthesized ages are generated for the PERSON_AGE entity type. Ages are shifted by a configurable number of years.
- Parameters:
scramble_unrecognized_dates (bool) – When True, dates that Textual cannot parse into a standard format are scrambled. Default is True.
metadata (AgeShiftMetadata) – Configuration for the age shift amount. By default, ages shift by 7 years.
use_passthrough_or_group_age_generator (bool) – When True, passes through ages 89 or under and changes all other ages to "90+". Default is False.
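The grouping rule for use_passthrough_or_group_age_generator is small enough to state as code. The function name is hypothetical, and the sketch only restates the rule above.

```python
def passthrough_or_group_age(age: int) -> str:
    """Pass through ages 89 or under; group everything else as "90+"."""
    return str(age) if age <= 89 else "90+"
```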
- class tonic_textual.classes.generator_metadata.age_shift_metadata.AgeShiftMetadata(
- age_shift_in_years: int = 7,
Configuration for the age shift amount used by PersonAgeGeneratorMetadata.
Defines how many years to shift detected ages by.
- Parameters:
age_shift_in_years (int) – The number of years to shift the age. Default is 7.
- class tonic_textual.classes.generator_metadata.hipaa_address_generator_metadata.HipaaAddressGeneratorMetadata(
- generator_version: GeneratorVersion = GeneratorVersion.V1,
- use_non_hipaa_address_generator: bool = False,
- replace_truncated_zeros_in_zip_code: bool = True,
- realistic_synthetic_values: bool = True,
- swaps: Dict[str, str] | None = {},
- constant_value: str | None = None,
- use_three_digit_zips: bool = False,
- replace_foreign_zip_codes_with_zeros: bool = False,
Metadata configuration for HIPAA-compliant address synthesis.
Controls how synthesized addresses are generated for location entity types such as LOCATION_ADDRESS and LOCATION_ZIP. By default, address synthesis follows HIPAA Safe Harbor de-identification rules.
- Parameters:
use_non_hipaa_address_generator (bool) – When True, uses a non-HIPAA-compliant address generator that may produce more realistic addresses, but does not guarantee HIPAA Safe Harbor compliance. Default is False.
replace_truncated_zeros_in_zip_code (bool) – When True, for ZIP codes that are truncated to three digits (per HIPAA Safe Harbor), the removed digits are replaced with zeros. Default is True.
realistic_synthetic_values (bool) – When True, generates realistic-looking synthetic address values. Default is True.
use_three_digit_zips (bool) – When True, ZIP codes are always truncated to three digits. Default is False.
replace_foreign_zip_codes_with_zeros (bool) – When True, foreign ZIP codes become all zeros. Default is False.
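The ZIP code truncation described for replace_truncated_zeros_in_zip_code can be illustrated as follows. This is a sketch of the stated behavior only. truncate_zip is a hypothetical function, and it does not model when Textual decides to truncate or any other Safe Harbor rule.

```python
def truncate_zip(zip_code: str, replace_truncated_zeros_in_zip_code: bool = True) -> str:
    """Keep the first three digits; optionally pad the removed digits with zeros."""
    prefix = zip_code[:3]
    if replace_truncated_zeros_in_zip_code:
        # Replace each removed digit with a zero so the length is unchanged.
        return prefix + "0" * (len(zip_code) - 3)
    return prefix
```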
- class tonic_textual.classes.generator_metadata.numeric_value_generator_metadata.NumericValueGeneratorMetadata(
- generator_version: GeneratorVersion = GeneratorVersion.V1,
- use_oracle_integer_pk_generator: bool = False,
- swaps: Dict[str, str] | None = {},
- constant_value: str | None = None,
Metadata configuration for numeric value synthesis.
Controls how synthesized numeric values are generated for the NUMERIC_VALUE entity type.
- Parameters:
use_oracle_integer_pk_generator (bool) – When True, uses a generator designed for Oracle integer primary keys. Default is False.