How to transliterate text input?

enthus1ast · November 17, 2022, 3:49pm

Is there a built-in way to transliterate a text input?

So for example:
“Đikić” to “Dikic”
“Müller” to “Muller” or “Mueller”
?

Since I (we) self-host, I could install unidecode and do it like this:

import unidecode
return unidecode.unidecode(value)

but I wonder if there is a built-in way already.

jperon · November 17, 2022, 8:09pm

This one doesn’t give the expected result with “Đikić”, but it works well with accented characters; perhaps it could serve as a starting point:

import unicodedata
unicodedata.normalize('NFD', "Müller").encode('ascii', 'ignore').decode('ascii')
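
For reference, here is the same one-liner wrapped in a small function, as a minimal sketch (the name `strip_accents` is made up for illustration). It shows both the case that works and the “Đikić” case that does not:

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose accented characters, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_accents("Müller"))  # Muller
print(strip_accents("Đikić"))   # ikic  (the Đ is dropped, not turned into D)
```

The accents on “ü” and “ć” are split off as combining marks and then discarded, while “Đ” is not decomposed at all, so 'ignore' removes it entirely.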

Riccardo_Polignieri · November 18, 2022, 10:00am

There is no meaningful way to do this in the general case (i.e., covering the entire Unicode range) - just think of emoji and all the other graphic symbols around.
The proposed solution (1) may work in some cases, but it is far from ideal. It relies on the standard Unicode decomposition mapping, hoping that a decomposition is provided and that the base character happens to be ASCII (otherwise… 'ignore' silently drops it): unfortunately, tons of “characters” have no such easy path to ASCII - CJK, Greek, Arabic, Cyrillic… but also many “reasonable”, Latin-looking characters such as your “d-with-stroke” example.
For some reason that is beyond me, Unicode doesn’t acknowledge that “D” is the base character for “D-with-stroke”… in fact, no decomposition mapping is given for “Đ”, so you have no standard way to convert Đ → D with Unicode tools alone (i.e., in Python, with the unicodedata module). But this is just one of many, many possible examples.
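
The missing mapping can be checked directly with the standard library. A small sketch (the helper name `describe` is just for illustration) that prints the decomposition of each character:

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point, Unicode name and decomposition mapping of one character."""
    mapping = unicodedata.decomposition(ch) or "(none)"
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}: {mapping}"

for ch in "Đćü":
    print(describe(ch))
# U+0110 LATIN CAPITAL LETTER D WITH STROKE: (none)
# U+0107 LATIN SMALL LETTER C WITH ACUTE: 0063 0301
# U+00FC LATIN SMALL LETTER U WITH DIAERESIS: 0075 0308
```

“ć” and “ü” decompose into an ASCII letter plus a combining mark, whereas “Đ” has an empty mapping, which is why normalization leaves it untouched.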

Now, you could use your own custom mapping of course… in fact, there are many libraries in many languages that do just that. In Python, that would be this package. You could give it a try… the result may be more to your liking, depending on the data you are working on.
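
If a full library is overkill, a hand-rolled mapping could look roughly like this. It is only a sketch: `CUSTOM_MAP` and `transliterate` are hypothetical names, and the table is deliberately tiny, covering a few characters that have no Unicode decomposition. Everything else still goes through normalization:

```python
import unicodedata

# Hypothetical, deliberately small table: extend it for the data you actually have.
CUSTOM_MAP = str.maketrans({
    "Đ": "D", "đ": "d",
    "Ł": "L", "ł": "l",
    "Ø": "O", "ø": "o",
    "ß": "ss",
    "Œ": "OE", "œ": "oe",
})

def transliterate(text: str) -> str:
    """Apply the custom table first, then strip whatever accents remain."""
    # NFC first, so precomposed table keys match even if the input arrives decomposed.
    replaced = unicodedata.normalize("NFC", text).translate(CUSTOM_MAP)
    decomposed = unicodedata.normalize("NFKD", replaced)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(transliterate("Đikić"))   # Dikic
print(transliterate("Müller"))  # Muller
```

To get “Mueller” instead of “Muller”, add entries such as "ü": "ue" to the table; they take effect because the table is applied before the accents are stripped. Characters that are neither in the table nor decomposable to ASCII are still dropped silently.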

Of course, you will have to host Grist yourself in order to use an external package, afaik.

(1) btw, I would probably use the 'NFKD' normal form instead of 'NFD'… it behaves marginally better with certain ligatures (try “ĳ” or “ﬁ”, for example; “œ”, on the other hand, has no decomposition mapping at all and is still dropped)…
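
A quick way to compare the two normal forms side by side, as a sketch (the helper `to_ascii` is a made-up name):

```python
import unicodedata

def to_ascii(text: str, form: str) -> str:
    """Normalize with the given form ('NFD' or 'NFKD'), then drop non-ASCII."""
    return unicodedata.normalize(form, text).encode("ascii", "ignore").decode("ascii")

for sample in ["ĳ", "ﬁ", "œ"]:
    print(repr(sample), repr(to_ascii(sample, "NFD")), repr(to_ascii(sample, "NFKD")))
# 'ĳ' '' 'ij'
# 'ﬁ' '' 'fi'
# 'œ' '' ''
```

NFKD also applies compatibility decompositions, so “ĳ” and “ﬁ” survive as plain letters. “œ” has no decomposition of either kind, so it is lost with both forms and needs a custom mapping.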