Customizing synthesis with generator metadata
When you use generator_config to set an entity type to Synthesis, Textual uses default synthesis settings. The generator_metadata parameter allows you to fine-tune how each entity type’s synthesizer behaves.
generator_metadata is a dictionary that maps entity type names (such as "NAME_GIVEN" or "EMAIL_ADDRESS") to metadata instances that control synthesis behavior for that type.
from tonic_textual.redact_api import TextualNer
from tonic_textual.classes.generator_metadata.name_generator_metadata import NameGeneratorMetadata
from tonic_textual.classes.generator_metadata.email_generator_metadata import EmailGeneratorMetadata

textual = TextualNer()

generator_metadata = {
    "NAME_GIVEN": NameGeneratorMetadata(preserve_gender=True),
    "NAME_FAMILY": NameGeneratorMetadata(is_consistency_case_sensitive=True),
    "EMAIL_ADDRESS": EmailGeneratorMetadata(preserve_domain=True),
}

result = textual.redact(
    "Contact John Smith at john.smith@example.com",
    generator_default="Synthesis",
    generator_metadata=generator_metadata,
)
Note
The redact_structured method takes a single Optional[BaseMetadata] instead of a dictionary, because it operates on a single entity type at a time.
Common parameters
All metadata classes inherit from BaseMetadata and share the following parameter:
swaps (dict of str to str, default {}) – A dictionary of explicit replacement mappings. When a detected value matches a key, the corresponding value is used as the synthesized replacement instead of a generated one.
from tonic_textual.classes.generator_metadata.name_generator_metadata import NameGeneratorMetadata
# Always replace "Acme" with "Globex" instead of generating a random name
metadata = NameGeneratorMetadata(swaps={"Acme": "Globex"})
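The precedence that swaps describes (explicit mapping first, generated value otherwise) can be sketched in plain Python. This is an illustration only, not Textual's implementation: the synthesize and fake_generator functions are hypothetical, and the sketch assumes that a detected value must match a key exactly.

```python
from typing import Callable, Dict


def synthesize(value: str, swaps: Dict[str, str], generate: Callable[[str], str]) -> str:
    """Return the explicit swap when one exists, otherwise a generated value."""
    # Assumption: keys are compared with an exact string match.
    if value in swaps:
        return swaps[value]
    return generate(value)


def fake_generator(value: str) -> str:
    # Hypothetical stand-in for a real synthesizer: masks every character.
    return "X" * len(value)


swaps = {"Acme": "Globex"}
print(synthesize("Acme", swaps, fake_generator))     # Globex
print(synthesize("Initech", swaps, fake_generator))  # XXXXXXX
```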
Name synthesis
NameGeneratorMetadata controls how synthesized names are generated. Use it with the NAME_GIVEN and NAME_FAMILY entity types.
is_consistency_case_sensitive (bool, default False) – When True, name consistency is case-sensitive. "john" and "John" are treated as different names and might receive different replacements.
preserve_gender (bool, default False) – When True, the synthesized name preserves the gender of the original. Male names are replaced with male names, and female names with female names.
from tonic_textual.classes.generator_metadata.name_generator_metadata import NameGeneratorMetadata

generator_metadata = {
    "NAME_GIVEN": NameGeneratorMetadata(preserve_gender=True),
}

result = textual.redact(
    "John told Mary about the project.",
    generator_default="Synthesis",
    generator_metadata=generator_metadata,
)
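To make the effect of is_consistency_case_sensitive concrete, here is a minimal sketch of a consistent replacement table. It is not how Textual generates names: the ConsistentReplacer class and its fixed pool of replacement names are hypothetical, and only the choice of lookup key mirrors the parameter described above.

```python
from typing import Dict, List


class ConsistentReplacer:
    """Hypothetical helper that maps each distinct name to one replacement."""

    def __init__(self, pool: List[str], case_sensitive: bool = False) -> None:
        self.pool = pool
        self.case_sensitive = case_sensitive
        self.seen: Dict[str, str] = {}

    def replace(self, name: str) -> str:
        # The lookup key decides whether "john" and "John" count as the same name.
        key = name if self.case_sensitive else name.lower()
        if key not in self.seen:
            # Hand out the next replacement from the pool, cycling if needed.
            self.seen[key] = self.pool[len(self.seen) % len(self.pool)]
        return self.seen[key]


insensitive = ConsistentReplacer(["Alan", "Brian"])
print(insensitive.replace("john"), insensitive.replace("John"))  # Alan Alan

sensitive = ConsistentReplacer(["Alan", "Brian"], case_sensitive=True)
print(sensitive.replace("john"), sensitive.replace("John"))      # Alan Brian
```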
Email synthesis
EmailGeneratorMetadata controls how synthesized email addresses are generated. Use it with the EMAIL_ADDRESS entity type.
preserve_domain (bool, default False) – When True, the domain portion of the email address is preserved. For example, "john@example.com" might become "alan@example.com".
from tonic_textual.classes.generator_metadata.email_generator_metadata import EmailGeneratorMetadata

generator_metadata = {
    "EMAIL_ADDRESS": EmailGeneratorMetadata(preserve_domain=True),
}

result = textual.redact(
    "Reach me at john@example.com",
    generator_default="Synthesis",
    generator_metadata=generator_metadata,
)
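The difference that preserve_domain makes can be sketched without calling the API. The synthesize_email function below is hypothetical and takes the replacement local part and domain as arguments rather than generating them. It only shows which half of the address is kept.

```python
def synthesize_email(address: str, new_local: str, new_domain: str, preserve_domain: bool) -> str:
    """Replace the local part, and keep the original domain only when asked to."""
    # rpartition splits on the last "@", so the domain is everything after it.
    _local, _sep, domain = address.rpartition("@")
    return f"{new_local}@{domain if preserve_domain else new_domain}"


print(synthesize_email("john@example.com", "alan", "mail.test", preserve_domain=True))
# alan@example.com
print(synthesize_email("john@example.com", "alan", "mail.test", preserve_domain=False))
# alan@mail.test
```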
Phone number synthesis
PhoneNumberGeneratorMetadata controls how synthesized telephone numbers are generated. Use it with the PHONE_NUMBER entity type.
use_us_phone_number_generator (bool, default False) – When True, generated telephone numbers use a US phone number format.
replace_invalid_numbers (bool, default True) – When True, detected telephone numbers that are not valid are still replaced with synthesized values.
from tonic_textual.classes.generator_metadata.phone_number_generator_metadata import PhoneNumberGeneratorMetadata

generator_metadata = {
    "PHONE_NUMBER": PhoneNumberGeneratorMetadata(
        use_us_phone_number_generator=True,
        replace_invalid_numbers=True,
    ),
}

result = textual.redact(
    "Call me at 555-0123.",
    generator_default="Synthesis",
    generator_metadata=generator_metadata,
)
Date and time synthesis
DateTimeGeneratorMetadata controls how synthesized dates and times are generated. Use it with the DATE_TIME entity type. Dates are shifted by a random number of days within a configurable range.
scramble_unrecognized_dates (bool, default True) – When True, dates that Textual cannot parse into a standard format are scrambled.
additional_date_formats (list of str, default []) – Additional date format patterns that Textual should recognize. Uses Python strftime/strptime format codes.
apply_constant_shift_to_document (bool, default False) – When True, all dates within the same document are shifted by the same random offset. This preserves the relative time differences between dates.
metadata (TimestampShiftMetadata) – Controls the date shift range. By default, dates shift by -7 to +7 days.
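Because additional_date_formats uses Python strftime/strptime format codes, a candidate pattern can be checked locally with datetime.strptime before it is passed to Textual. The try_parse helper below is hypothetical, and the two patterns are only examples. Whether Textual already recognizes them by default is not stated here.

```python
from datetime import datetime
from typing import List, Optional

# Example patterns of the kind that could be supplied as additional_date_formats.
additional_date_formats = ["%d %b %Y", "%Y.%m.%d"]


def try_parse(text: str, formats: List[str]) -> Optional[datetime]:
    """Return the first successful parse, or None when no pattern matches."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


print(try_parse("15 Jan 2024", additional_date_formats))   # 2024-01-15 00:00:00
print(try_parse("2024.02.01", additional_date_formats))    # 2024-02-01 00:00:00
print(try_parse("next Tuesday", additional_date_formats))  # None
```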
TimestampShiftMetadata
TimestampShiftMetadata configures the range of days by which dates can be shifted.
left_shift_in_days (int, default -7) – The minimum shift in days. Use a negative value to shift dates into the past.
right_shift_in_days (int, default 7) – The maximum shift in days. Use a positive value to shift dates into the future.
from tonic_textual.classes.generator_metadata.date_time_generator_metadata import DateTimeGeneratorMetadata
from tonic_textual.classes.generator_metadata.timestamp_shift_metadata import TimestampShiftMetadata

generator_metadata = {
    "DATE_TIME": DateTimeGeneratorMetadata(
        apply_constant_shift_to_document=True,
        metadata=TimestampShiftMetadata(
            left_shift_in_days=-30,
            right_shift_in_days=30,
        ),
    ),
}

result = textual.redact(
    "The meeting is on 2024-01-15 and the deadline is 2024-02-01.",
    generator_default="Synthesis",
    generator_metadata=generator_metadata,
)
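Why a constant shift preserves the interval between dates can be shown with a small standalone sketch. The shift_dates function is hypothetical and is not Textual's algorithm. It assumes that both ends of the shift range are inclusive, which the description above does not confirm.

```python
import random
from datetime import date, timedelta
from typing import List


def shift_dates(dates: List[date], left: int, right: int, constant: bool,
                rng: random.Random) -> List[date]:
    """Shift each date by a random number of days in [left, right]."""
    if constant:
        # One offset for the whole document keeps the gaps between dates intact.
        offset = rng.randint(left, right)
        return [d + timedelta(days=offset) for d in dates]
    # Otherwise every date draws its own offset, so the gaps can change.
    return [d + timedelta(days=rng.randint(left, right)) for d in dates]


original = [date(2024, 1, 15), date(2024, 2, 1)]
shifted = shift_dates(original, -30, 30, constant=True, rng=random.Random(0))

print((original[1] - original[0]).days)  # 17
print((shifted[1] - shifted[0]).days)    # 17
```

With constant=False the second number is no longer guaranteed to match the first, which is the trade-off that apply_constant_shift_to_document controls.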
Person age synthesis
PersonAgeGeneratorMetadata controls how synthesized ages are generated. Use it with the PERSON_AGE entity type.
scramble_unrecognized_dates (bool, default True) – When True, dates that Textual cannot parse are scrambled.
metadata (AgeShiftMetadata) – Controls the age shift amount. By default, ages shift by 7 years.
AgeShiftMetadata
AgeShiftMetadata configures the number of years to shift detected ages.
age_shift_in_years (int, default 7) – The number of years to shift the age.
from tonic_textual.classes.generator_metadata.person_age_generator_metadata import PersonAgeGeneratorMetadata
from tonic_textual.classes.generator_metadata.age_shift_metadata import AgeShiftMetadata

generator_metadata = {
    "PERSON_AGE": PersonAgeGeneratorMetadata(
        metadata=AgeShiftMetadata(age_shift_in_years=3),
    ),
}

result = textual.redact(
    "The patient is 45 years old.",
    generator_default="Synthesis",
    generator_metadata=generator_metadata,
)
Address synthesis (HIPAA)
HipaaAddressGeneratorMetadata controls how synthesized addresses are generated for location entity types such as LOCATION_ADDRESS and LOCATION_ZIP. By default, address synthesis follows HIPAA Safe Harbor de-identification rules.
use_non_hipaa_address_generator (bool, default False) – When True, uses a non-HIPAA-compliant address generator that might produce more realistic addresses, but does not guarantee HIPAA Safe Harbor compliance.
replace_truncated_zeros_in_zip_code (bool, default True) – When True, for ZIP codes that are truncated to three digits (per HIPAA Safe Harbor), the removed digits are replaced with zeros.
realistic_synthetic_values (bool, default True) – When True, generates realistic-looking synthetic address values.
from tonic_textual.classes.generator_metadata.hipaa_address_generator_metadata import HipaaAddressGeneratorMetadata

generator_metadata = {
    "LOCATION_ADDRESS": HipaaAddressGeneratorMetadata(
        realistic_synthetic_values=True,
        replace_truncated_zeros_in_zip_code=True,
    ),
}

result = textual.redact(
    "She lives at 123 Main St, Springfield, IL 62704.",
    generator_default="Synthesis",
    generator_metadata=generator_metadata,
)
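The ZIP code behavior described for replace_truncated_zeros_in_zip_code can be sketched as follows. The truncate_zip function is hypothetical and handles only five-digit ZIP codes. It also leaves out the Safe Harbor rule that three-digit prefixes covering 20,000 or fewer people are changed to 000, so treat it as an illustration of the padding step rather than a compliant implementation.

```python
def truncate_zip(zip_code: str, replace_truncated_zeros: bool = True) -> str:
    """Keep the first three digits of a five-digit ZIP code."""
    five = zip_code[:5]
    prefix = five[:3]
    if replace_truncated_zeros:
        # Pad the removed positions with zeros so the value keeps its length.
        return prefix + "0" * (len(five) - len(prefix))
    return prefix


print(truncate_zip("62704"))                                 # 62700
print(truncate_zip("62704", replace_truncated_zeros=False))  # 627
```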
Numeric value synthesis
NumericValueGeneratorMetadata controls how synthesized numeric values are generated. Use it with the NUMERIC_VALUE entity type.
use_oracle_integer_pk_generator (bool, default False) – When True, uses a generator designed for Oracle integer primary keys.
from tonic_textual.classes.generator_metadata.numeric_value_generator_metadata import NumericValueGeneratorMetadata

generator_metadata = {
    "NUMERIC_VALUE": NumericValueGeneratorMetadata(
        use_oracle_integer_pk_generator=True,
    ),
}
Combining multiple metadata configurations
You can combine multiple metadata configurations in a single call. This example configures synthesis for names, emails, and dates:
from tonic_textual.redact_api import TextualNer
from tonic_textual.classes.generator_metadata.name_generator_metadata import NameGeneratorMetadata
from tonic_textual.classes.generator_metadata.email_generator_metadata import EmailGeneratorMetadata
from tonic_textual.classes.generator_metadata.date_time_generator_metadata import DateTimeGeneratorMetadata
from tonic_textual.classes.generator_metadata.timestamp_shift_metadata import TimestampShiftMetadata

textual = TextualNer()

result = textual.redact(
    "John Smith (john@acme.com) joined on 2024-01-15.",
    generator_default="Off",
    generator_config={
        "NAME_GIVEN": "Synthesis",
        "NAME_FAMILY": "Synthesis",
        "EMAIL_ADDRESS": "Synthesis",
        "DATE_TIME": "Synthesis",
    },
    generator_metadata={
        "NAME_GIVEN": NameGeneratorMetadata(preserve_gender=True),
        "EMAIL_ADDRESS": EmailGeneratorMetadata(preserve_domain=True),
        "DATE_TIME": DateTimeGeneratorMetadata(
            apply_constant_shift_to_document=True,
            metadata=TimestampShiftMetadata(
                left_shift_in_days=-14,
                right_shift_in_days=14,
            ),
        ),
    },
)
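When generator_default is "Off", each entity type that has metadata also needs a "Synthesis" entry in generator_config, and the two dictionaries can drift apart. One way to keep them aligned is to derive the config from the metadata keys. The synthesis_config_for helper below is hypothetical and uses placeholder objects in place of real metadata instances. It rests on the assumption that metadata only matters for entity types that are actually set to Synthesis.

```python
from typing import Any, Dict, Iterable


def synthesis_config_for(generator_metadata: Dict[str, Any],
                         extra_types: Iterable[str] = ()) -> Dict[str, str]:
    """Build a generator_config that enables Synthesis for every configured type."""
    config = {entity_type: "Synthesis" for entity_type in generator_metadata}
    # Types that should be synthesized with default settings (no metadata).
    for entity_type in extra_types:
        config[entity_type] = "Synthesis"
    return config


# Placeholder values stand in for NameGeneratorMetadata and similar instances.
metadata = {"NAME_GIVEN": object(), "EMAIL_ADDRESS": object(), "DATE_TIME": object()}

print(synthesis_config_for(metadata, extra_types=["NAME_FAMILY"]))
# {'NAME_GIVEN': 'Synthesis', 'EMAIL_ADDRESS': 'Synthesis', 'DATE_TIME': 'Synthesis', 'NAME_FAMILY': 'Synthesis'}
```

The resulting dictionary could then be passed as generator_config alongside the same generator_metadata.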