Finding duplicate contact names with special characters

This post is build on the excellent work of @jperon on using unicodedata and @dmitry-grist on finding duplicates in grist tables.

Here’s a synthesis including some additional learning’s to ensure you can find duplicates which

  • include or
  • not include

characters with diacritical marks used in various Latin-based alphabets (excluding characters from non-Latin scripts like Chinese, Arabic, etc.). A list of some commonly used diacritics and the characters is provided at the end of this post.

In addition and prior to identifying duplicates the script cleans the texts from spaces, hyphens, points, underscores, and apostrophes and finally put all characters in upper case.

The process works in 2 steps.

Generate a “clean” unicode name

  1. Create a new column, for example, called unicode_name.
  2. copy below python code into your column and ensure that the variables Contact_name_last and Contact_name_first are connected with the corresponding columns in table.
import unicodedata

def clean_name(contact_name_last, contact_name_first):
    # Combine last and first names, normalize, encode, and decode to remove accents
    full_name = f"{contact_name_last} {contact_name_first}"
    clean_name = unicodedata.normalize('NFKD', full_name).encode('ascii', 'ignore').decode('ascii')

    # Remove unwanted characters in one go using `str.translate`
    unwanted_chars = dict.fromkeys(map(ord, " -._'"), None)
    clean_name = clean_name.translate(unwanted_chars)

    # Strip leading/trailing whitespace and convert to upper case
    clean_name = clean_name.strip().upper()
    
    return clean_name

# Usage
Contact_name_last = $Contact_name_last
Contact_name_first = $Contact_name_first
CleanName = clean_name(Contact_name_last, Contact_name_first)
return CleanName
  1. The column should show the last and first name all in unicode ascii and in upper case and without and seperation or special characters.

Find the duplicates

Here we largely follow @dmitry-grist recommendations

  1. Create a new column, for example, called DUP
  2. Copy below python code into your column. Ensure that ‘unicode_name’ should be replaced with the column name generated in step 1. Idem ditto replace the table name contacts with the table name you are working with.
"DUP" if len(contacts.lookupRecords(unicode_name=$uni_codename)) >1 else ""
  1. Filter for “DUP” in this column to identify all duplicates

Appendices

Accents and Diacritics

  1. Acute Accent (´)

    • Examples: á, é, í, ó, ú, ý
  2. Grave Accent (`)

    • Examples: à, è, ì, ò, ù
  3. Circumflex (^)

    • Examples: â, ê, î, ô, û
  4. Tilde (~)

    • Examples: ã, ñ, õ
  5. Umlaut or Diaeresis (¨)

    • Examples: ä, ë, ï, ö, ü, ÿ
  6. Cedilla (¸)

    • Examples: ç
  7. Caron or Háček (ˇ)

  • Examples: č, ď, ě, ň, ř, š, ž
  1. Ring (˚)
  • Examples: å
  1. Ogonek (˛)
  • Examples: ą, ę
  1. Breve (˘)

    • Examples: ă
  2. Macron (¯)

    • Examples: ā, ē, ī, ō, ū
  3. Dot Above (˙)

    • Examples: ż
  4. Double Acute (˝)

    • Examples: ő, ű
  5. Ligatures

    • Examples: æ (ae), œ (oe)

Additional Characters:

  • Certain characters with diacritics that are unique to specific languages, such as the Icelandic þ (thorn) or the Danish/Norwegian ø.
  • Letters that are considered separate characters in some languages, like the Spanish “ñ” or the French “ç”.

This list is not exhaustive, as there are many more characters with diacritics used in languages that utilize the Latin script. The usage of these characters varies by language, and some languages may use certain diacritics very infrequently or only in specific contexts.

If you need a comprehensive list, you might consider looking into Unicode tables for Latin characters with diacritics, as Unicode aims to encompass all the characters used in writing systems worldwide, including all special characters and diacritics used in Latin-based languages.

2 Likes

This is a great reference, thank you!

For anyone else looking at unifying text that may have accents or diacritics, the clean_name function in the post does a great job of stripping all the accents and diacritics listed in the appendices, and converting additional characters to nearby Latin ones (like çc and æae). For characters without a Latin equivalent (e.g. þ, ø, or other alphabets like Cyrillic), it will strip them out entirely.

This is hard to do well, thanks for sharing @Peter_P_Breithaupt !

1 Like