Viewing the PII information for a dataset

You can also retrieve a list of entities found in the files of a dataset. You can retrieve all entities found or just specific entity types. The below will retrieve information on ALL entities.

ds = ner.get_dataset('<dataset name>')

#retrieve all processed files in the dataset
files = ds.get_processed_files(refetch=True)

for file in files:
    entities = file.get_entities()

It will return a response a dictionary whose key is the type of PII and whose value is a list of found entities. The returned entity includes the original text value of the entity as well as the few words preceding and following the entity, e.g.

{
    "NAME_GIVEN": [
        {
            "head": "Last Name: First Name: Smith ",
            "tail": " Date of Birth: Phone Number: 07/16/1988 (929) 555",
            "entity": "John"
        },
        {
            "head": "The value for key 'Beneficiary First Name:' is: ",
            "tail": "",
            "entity": "John"
        }
    ],
    "NAME_FAMILY": [
        {
            "head": "Last Name: First Name: ",
            "tail": " John Date of Birth: Phone Number: 07/16/1988 (929",
            "entity": "Smith"
        },
        {
            "head": "le cell with no column or row header: First Name: ",
            "tail": "",
            "entity": "Richardson"
        },
        {
            "head": "ber Information Last Name: First Name: Richardson ",
            "tail": " Prescriber NPI Number: Prescriber Specialty: 2982",
            "entity": "Scott"
        },
        {
            "head": "Beneficiary First Name: ",
            "tail": "",
            "entity": "Smith"
        }
    ]
}

The call to get_entities() can also take an optional list of entities. For example, you could pass in a hard coded list as:

file.get_entities(['NAME_GIVEN','NAME_FAMILY'])

Or do the same using the PiiType enum

from tonic_textual.enums.pii_type import PiiType
file.get_entities([PiiType.NAME_GIVEN, PiiType.NAME_FAMILY])

Or you could even just pass in the current set of entities enabled by the dataset configuration, e.g.

from tonic_textual.enums.pii_state import PiiState

#Get list of all enabled entities for the dataset
entities = [k for k in ds.generator_config.keys() if ds.generator_config[k]!=PiiState.Off]
entities = file.get_entities(entities)

file.get_entities(entities)

Viewing redaction and synthesis mappings for a dataset

You can retrieve the original, redacted, synthetic, and final output values for entities in a dataset after the current generator configuration is applied. The response is grouped by file.

ds = ner.get_dataset('<dataset name>')
mappings = ds.get_entity_mappings()

for file in mappings.files:
    for entity in file.entities:
        print(file.file_name, entity.text, entity.output_text)