Redact raw text

To redact sensitive information from a text string, pass the string to the redact method:

from tonic_textual.redact_api import TextualNer

textual = TextualNer()

raw_redaction = textual.redact("My name is John, and today I am demoing Textual, a software product created by Tonic")
print(raw_redaction.describe())

This produces the following output:

My name is [NAME_GIVEN_HI1h7], and [DATE_TIME_4hKfrH] I am demoing Textual, a software product created by [ORGANIZATION_P5XLAH]
{
    "start": 11,
    "end": 15,
    "new_start": 11,
    "new_end": 29,
    "label": "NAME_GIVEN",
    "text": "John",
    "score": 0.9,
    "language": "en",
    "new_text": "[NAME_GIVEN_HI1h7]"
}
{
    "start": 21,
    "end": 26,
    "new_start": 35,
    "new_end": 53,
    "label": "DATE_TIME",
    "text": "today",
    "score": 0.9,
    "language": "en",
    "new_text": "[DATE_TIME_4hKfrH]"
}
{
    "start": 79,
    "end": 84,
    "new_start": 106,
    "new_end": 127,
    "label": "ORGANIZATION",
    "text": "Tonic",
    "score": 0.9,
    "language": "en",
    "new_text": "[ORGANIZATION_P5XLAH]"
}
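
In the output above, start and end index into the original string, while new_start and new_end index into the redacted string. As a minimal sketch of that relationship, the following plain-Python snippet rebuilds the redacted text from the original string and the replacement values. The apply_replacements helper is hypothetical and is not part of the Textual SDK, and the entity values are copied by hand from the output above.

```python
def apply_replacements(text, entities):
    """Rebuild redacted text from the original text and its entity spans."""
    parts = []
    new_spans = []
    cursor = 0      # position in the original text
    new_length = 0  # length of the rebuilt text so far
    for entity in sorted(entities, key=lambda e: e["start"]):
        # Copy the untouched text that precedes this entity
        unchanged = text[cursor:entity["start"]]
        parts.append(unchanged)
        new_length += len(unchanged)
        # Substitute the replacement value and note where it lands
        new_start = new_length
        parts.append(entity["new_text"])
        new_length += len(entity["new_text"])
        new_spans.append((new_start, new_length))
        cursor = entity["end"]
    # Copy whatever follows the last entity
    parts.append(text[cursor:])
    return "".join(parts), new_spans


original = "My name is John, and today I am demoing Textual, a software product created by Tonic"
entities = [
    {"start": 11, "end": 15, "new_text": "[NAME_GIVEN_HI1h7]"},
    {"start": 21, "end": 26, "new_text": "[DATE_TIME_4hKfrH]"},
    {"start": 79, "end": 84, "new_text": "[ORGANIZATION_P5XLAH]"},
]

rebuilt, new_spans = apply_replacements(original, entities)
print(rebuilt)
print(new_spans)
```

The computed spans match the new_start and new_end values reported in the output above.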

You can also record redact calls, so that you can view and analyze results in the Textual application. To learn more, read Recording API requests.

Bulk redact raw text

In the same way that you use the redact method to redact strings, you can use the redact_bulk method to redact many strings at the same time.

Each string is redacted individually. The model processes each string independently, so one string cannot affect the results for another.

To redact sensitive information from a list of text strings, pass the list to the redact_bulk method:

from tonic_textual.redact_api import TextualNer

textual = TextualNer()

raw_redaction = textual.redact_bulk(["Tonic was founded in 2018", "John Smith is a person"])
print(raw_redaction.describe())

This produces the following output:

[ORGANIZATION_5Ve7OH] was founded in [DATE_TIME_DnuC1]
{
    "start": 0,
    "end": 5,
    "new_start": 0,
    "new_end": 21,
    "label": "ORGANIZATION",
    "text": "Tonic",
    "score": 0.9,
    "language": "en",
    "new_text": "[ORGANIZATION_5Ve7OH]"
}
{
    "start": 21,
    "end": 25,
    "new_start": 37,
    "new_end": 54,
    "label": "DATE_TIME",
    "text": "2018",
    "score": 0.9,
    "language": "en",
    "new_text": "[DATE_TIME_DnuC1]"
}
[NAME_GIVEN_dySb5] [NAME_FAMILY_7w4Db3] is a person
{
    "start": 0,
    "end": 4,
    "new_start": 0,
    "new_end": 18,
    "label": "NAME_GIVEN",
    "text": "John",
    "score": 0.9,
    "language": "en",
    "new_text": "[NAME_GIVEN_dySb5]"
}
{
    "start": 5,
    "end": 10,
    "new_start": 19,
    "new_end": 39,
    "label": "NAME_FAMILY",
    "text": "Smith",
    "score": 0.9,
    "language": "en",
    "new_text": "[NAME_FAMILY_7w4Db3]"
}
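
If you have more strings than you want to send in a single call, one option is to split the list into smaller batches and call redact_bulk once per batch. The sketch below shows only the batching step in plain Python. The chunked helper is hypothetical, and the batch size of 2 is an arbitrary value chosen for illustration, not a limit documented by Textual.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


strings = [
    "Tonic was founded in 2018",
    "John Smith is a person",
    "Cynthia lives in Memphis",
]
batches = list(chunked(strings, batch_size=2))

for batch in batches:
    # Each batch could be passed to textual.redact_bulk(batch)
    print(batch)
```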

Redact JSON data

To redact sensitive information from a JSON string or Python dict, pass the object to the redact_json method:

from tonic_textual.redact_api import TextualNer
import json

textual = TextualNer()

d=dict()
d['person']={'first':'John','last':'OReilly'}
d['address']={'city': 'Memphis', 'state':'TN', 'street': '847 Rocky Top', 'zip':1234}
d['description'] = 'John is a man that lives in Memphis.  He is 37 years old and is married to Cynthia'

json_redaction = textual.redact_json(d, {"LOCATION_ZIP":"Synthesis"})

print(json.dumps(json.loads(json_redaction.redacted_text), indent=2))

This produces the following output:

{
  "person": {
    "first": "[NAME_GIVEN_WpFV4]",
    "last": "[NAME_FAMILY_orTxwj3I]"
  },
  "address": {
    "city": "[LOCATION_CITY_UtpIl2tL]",
    "state": "[LOCATION_STATE_n24]",
    "street": "[LOCATION_ADDRESS_KwZ3MdDLSrzNhwB]",
    "zip": 0
  },
  "description": "[NAME_GIVEN_WpFV4] is a man that lives in [LOCATION_CITY_UtpIl2tL].  He is [DATE_TIME_LLr6L3gpNcOcl3] and is married to [NAME_GIVEN_yWfthDa6]"
}
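
To check which fields redact_json changed, you can compare the original dict with the parsed redacted output. The following is a minimal sketch in plain Python that handles nested dicts only. The changed_paths helper is hypothetical and not part of the Textual SDK, and the redacted dict is copied by hand from the output above, with the description field omitted for brevity.

```python
def changed_paths(original, redacted, prefix=""):
    """Return dotted paths of leaf values that differ between two nested dicts."""
    paths = []
    for key, value in original.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects
            paths.extend(changed_paths(value, redacted[key], path))
        elif value != redacted[key]:
            paths.append(path)
    return paths


original = {
    "person": {"first": "John", "last": "OReilly"},
    "address": {"city": "Memphis", "state": "TN", "street": "847 Rocky Top", "zip": 1234},
}
redacted = {
    "person": {"first": "[NAME_GIVEN_WpFV4]", "last": "[NAME_FAMILY_orTxwj3I]"},
    "address": {
        "city": "[LOCATION_CITY_UtpIl2tL]",
        "state": "[LOCATION_STATE_n24]",
        "street": "[LOCATION_ADDRESS_KwZ3MdDLSrzNhwB]",
        "zip": 0,
    },
}

paths = changed_paths(original, redacted)
print(paths)
```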

Redact XML data

To redact sensitive information from XML, pass the XML document string to the redact_xml method:

from tonic_textual.redact_api import TextualNer

textual = TextualNer()

xml_string = '''<?xml version="1.0" encoding="UTF-8"?>
<!-- This XML document contains sample PII with namespaces and attributes -->
<PersonInfo xmlns="http://www.example.com/default" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:contact="http://www.example.com/contact">
    <!-- Personal Information with an attribute containing PII -->
    <Name preferred="true" contact:userID="john.doe123">
        <FirstName>John</FirstName>
        <LastName>Doe</LastName>He was born in 1980.</Name>

    <contact:Details>
        <!-- Email stored in an attribute for demonstration -->
        <contact:Email address="[email protected]"/>
        <contact:Phone type="mobile" number="555-6789"/>
    </contact:Details>

    <!-- SSN stored as an attribute -->
    <SSN value="987-65-4321" xsi:nil="false"/>
    <data>his name was John Doe</data>
</PersonInfo>'''

xml_redaction = textual.redact_xml(xml_string)

The response includes entity-level information, including the XPath at which each sensitive entity is found. The start and end positions are relative to the beginning of the XPath location where the entity is found.
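
To illustrate what offsets relative to an XPath location mean, the sketch below uses Python's standard xml.etree.ElementTree module on a simplified document with no namespaces. The entity dictionary and its xpath, start, end, and label keys are assumptions made for illustration. Check the actual response object for the real field names and XPath format.

```python
import xml.etree.ElementTree as ET

xml_string = "<PersonInfo><data>his name was John Doe</data></PersonInfo>"
root = ET.fromstring(xml_string)

# Hypothetical entity record; the real response may use different field names
entity = {"xpath": "./data", "start": 13, "end": 17, "label": "NAME_GIVEN"}

# Offsets count from the start of the text at the XPath location,
# not from the start of the whole XML document
node_text = root.find(entity["xpath"]).text
entity_text = node_text[entity["start"]:entity["end"]]
print(entity_text)
```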

Redact HTML data

To redact sensitive information from HTML, pass the HTML document string to the redact_html method:

from tonic_textual.redact_api import TextualNer

textual = TextualNer()

html_content = """
<!DOCTYPE html>
<html>
    <head>
        <title>John Doe</title>
    </head>
    <body>
        <h1>John Doe</h1>
        <p>John Doe is a person who lives in New York City.</p>
        <p>John Doe's phone number is 555-555-5555.</p>
    </body>
</html>
"""

html_redaction = textual.redact_html(html_content)

The response includes entity-level information, including the XPath at which each sensitive entity is found. The start and end positions are relative to the beginning of the XPath location where the entity is found.

Choosing tokenization or synthesis for raw text

You can choose whether to synthesize or tokenize a given entity. By default, all entities are tokenized.

To specify the entities to synthesize or tokenize, use the generator_config parameter. This works the same way for all of the redact functions.

The following example passes a string to the redact method, but sets some entity types to Synthesis, which replaces them with realistic values:

from tonic_textual.redact_api import TextualNer

textual = TextualNer()
generator_config = {"NAME_GIVEN":"Synthesis", "ORGANIZATION":"Synthesis"}
raw_synthesis = textual.redact(
    "My name is John, and today I am demoing Textual, a software product created by Tonic",
    generator_config=generator_config)
print(raw_synthesis.describe())

This produces the following output:

My name is Alfonzo, and today I am demoing Textual, a software product created by New Ignition Worldwide
{
    "start": 11,
    "end": 15,
    "new_start": 11,
    "new_end": 18,
    "label": "NAME_GIVEN",
    "text": "John",
    "score": 0.9,
    "language": "en",
    "new_text": "Alfonzo"
}
{
    "start": 79,
    "end": 84,
    "new_start": 82,
    "new_end": 104,
    "label": "ORGANIZATION",
    "text": "Tonic",
    "score": 0.9,
    "language": "en",
    "new_text": "New Ignition Worldwide"
}
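
Because generator_config is a plain dict that maps entity labels to a handling option, you can build it once and pass it to any of the redact functions. A minimal sketch follows. The synthesize_labels helper is hypothetical, and it uses only the Synthesis option shown above.

```python
def synthesize_labels(labels):
    """Build a generator_config dict that sets each given entity label to Synthesis."""
    return {label: "Synthesis" for label in labels}


generator_config = synthesize_labels(["NAME_GIVEN", "ORGANIZATION"])
print(generator_config)
```

Entity labels that are left out of the dict keep the default behavior, which is tokenization.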

Using LLM synthesis

The following example passes the string to the llm_synthesis method:

from tonic_textual.redact_api import TextualNer

textual = TextualNer()

raw_synthesis = textual.llm_synthesis("My name is John, and today I am demoing Textual, a software product created by Tonic")
print(raw_synthesis.describe())

This produces the following output:

My name is Matthew, and today I am demoing Textual, a software product created by Google.
{
    "start": 11,
    "end": 15,
    "label": "NAME_GIVEN",
    "text": "John",
    "score": 0.9
}
{
    "start": 79,
    "end": 84,
    "label": "ORGANIZATION",
    "text": "Tonic",
    "score": 0.9
}

Note that LLM synthesis is non-deterministic. You will likely get different results each time you run it.

Recording API requests

When you use the redact method to redact text, you can optionally record these requests to view and analyze later in the Textual application. The redact method takes an optional record_options (RecordApiRequestOptions) argument. To record an API request:

from tonic_textual.redact_api import TextualNer
from tonic_textual.classes.record_api_request_options import RecordApiRequestOptions

ner = TextualNer()

ner.redact("My name is John Doe", record_options=RecordApiRequestOptions(
    record=True,
    retention_time_in_hours=1,
    tags=["my_first_request"])
)

The above code runs the redaction in the same way as any other redaction request, and then records the API request and its results. The request itself is automatically purged after 1 hour. You can view the results from the API Explorer page in Textual. The retention time is specified in hours and can be set to a value between 1 and 720.
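
If the retention time comes from user input or configuration, you might want to check it against the documented range of 1 to 720 hours before you build the record options. The following is a minimal sketch. The check_retention_hours helper is hypothetical and is not part of the Textual SDK.

```python
MIN_RETENTION_HOURS = 1
MAX_RETENTION_HOURS = 720  # 720 hours is 30 days


def check_retention_hours(hours):
    """Return hours unchanged if it falls within the documented 1 to 720 range."""
    if not MIN_RETENTION_HOURS <= hours <= MAX_RETENTION_HOURS:
        raise ValueError(
            f"retention time must be between {MIN_RETENTION_HOURS} "
            f"and {MAX_RETENTION_HOURS} hours, got {hours}"
        )
    return hours


# One week, expressed in hours
one_week = check_retention_hours(7 * 24)
print(one_week)
```

The validated value could then be passed as retention_time_in_hours.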

Working with DataFrames

The redact function can be called as a user-defined function (UDF) on a DataFrame column. As an example, let's read a CSV file, redact a given column, and write the CSV back to disk. Make sure to first install pandas.

pip install pandas

from tonic_textual.redact_api import TextualNer
import pandas as pd

ner = TextualNer()

df = pd.read_csv('file.csv')

# Let's say there is a notes column in the CSV containing unstructured text
df['notes'] = df['notes'].apply(lambda x: ner.redact(x).redacted_text if not pd.isnull(x) else None)

df.to_csv('file_redacted.csv')

Working with large data sets

For most use cases, the redact and redact_bulk functions are sufficient. However, sometimes you need to process a lot of data quickly. Typically, this means making multiple redact requests concurrently instead of sequentially.

As a first example, here is some sample code that processes a large number of files through concurrent requests using asyncio. The asyncio module is part of the Python standard library, so it does not need to be installed separately.

from tonic_textual.redact_api import TextualNer
import asyncio

ner = TextualNer()

file_names = ['...'] # The list of files to be processed asynchronously


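def redact_file(file_name):
    # Read the file and close it before sending its contents to be redacted
    with open(file_name, 'r') as f:
        return ner.redact(f.read())

async def redact_files():
    # Run the blocking redact calls on the default thread pool
    loop = asyncio.get_running_loop()
    tasks = [loop.run_in_executor(None, redact_file, file_name) for file_name in file_names]
    return await asyncio.gather(*tasks)

results = asyncio.run(redact_files())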
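
To cap how many requests run at the same time, one alternative to asyncio is a thread pool from the standard library's concurrent.futures module. The sketch below uses fake_redact, a hypothetical stand-in for ner.redact, so that it runs without calling the API. The max_workers value of 2 is an arbitrary choice for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_redact(text):
    # Hypothetical stand-in for ner.redact(text).redacted_text
    return text.replace("John", "[NAME_GIVEN]")


def redact_all(texts, redact, max_workers=2):
    """Redact texts concurrently, with at most max_workers calls in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as the inputs
        return list(executor.map(redact, texts))


texts = ["My name is John", "No names here", "John met John"]
results = redact_all(texts, fake_redact)
print(results)
```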

In another case, you might have a large DataFrame and want to redact its rows in parallel. For this, you can use Dask, a framework that sits on top of pandas for concurrent execution. Make sure to first install dask[dataframe] and pandas.

pip install pandas
pip install dask[dataframe]

from tonic_textual.redact_api import TextualNer
import pandas as pd
import dask.dataframe as dd

ner = TextualNer()

# Load your DataFrame from disk, a live DB connection, etc.
# get_dataframe is a placeholder for your own loading code.
df = get_dataframe()

col = 'notes' # Placeholder: the name of the column to redact.
npartitions = 25 # Sets the number of requests to make concurrently.
df[col] = dd.from_pandas(df[col], npartitions=npartitions).apply(
    lambda x: ner.redact(x).redacted_text if not pd.isnull(x) else x,
    meta=pd.Series(dtype='str', name=col)).compute()