
Removing Diacritical Marks from Strings

Many Latin alphabets (like that of my native Czech) contain characters with diacritical marks (also called accent marks). For some computer applications (like searching, or file names that must be compatible across systems) we would like to remove the diacritics and convert the text to a string containing only ASCII characters. A common approach is to use Unicode character decomposition.

This relies on the fact that Unicode has two ways to represent characters with diacritics. For instance, the character á (LATIN SMALL LETTER A WITH ACUTE) normally has code 225, but it can be decomposed into two Unicode characters: code 97 (LATIN SMALL LETTER A) and code 769 (COMBINING ACUTE ACCENT). Dropping the combining characters after decomposition leaves only the base letters. This process works for the majority of common ‘special’ Latin characters; however, there are still a few for which Unicode does not define a decomposition. These include characters like ø (LATIN SMALL LETTER O WITH STROKE), used in Norwegian, or ł (LATIN SMALL LETTER L WITH STROKE), used in Polish. Special handling is needed for these characters: basically a transcription table that maps them to basic Latin characters (it can be a one-to-many mapping; for instance, æ (LATIN SMALL LETTER AE) should map to ‘ae’).
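To make the two representations concrete, here is a small sketch using Python's standard `unicodedata` module. It assumes NFD (canonical decomposition) as the normalization form; the variable names are only illustrative.

```python
import unicodedata

composed = "\u00e1"  # á, LATIN SMALL LETTER A WITH ACUTE
# NFD splits a precomposed character into base letter + combining marks.
decomposed = unicodedata.normalize("NFD", composed)

composed_codes = [ord(c) for c in composed]
decomposed_codes = [ord(c) for c in decomposed]
print(composed_codes)    # [225]
print(decomposed_codes)  # [97, 769]
print([unicodedata.name(c) for c in decomposed])

# ø, ł and æ have no decomposition defined, so the lookup
# returns an empty string and NFD leaves them unchanged.
undecomposable = "\u00f8\u0142\u00e6"
for ch in undecomposable:
    print(unicodedata.name(ch), repr(unicodedata.decomposition(ch)))
```

The last loop shows why a transcription table is still needed for these characters: normalization alone never turns them into ASCII.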

Character decomposition is defined in the Unicode standard, and all common programming languages have libraries that contain the Unicode definitions and can decompose characters. Below I show how this can be done in Python.
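As a rough illustration of the approach described above (a minimal sketch, not necessarily the implementation from the full post), the function below decomposes the text, drops combining marks, and falls back to a transcription table. The name `remove_diacritics` and the contents of `TRANSCRIPTION` are assumptions made for this example, and the table is deliberately incomplete.

```python
import unicodedata

# Hypothetical transcription table for characters without a Unicode
# decomposition; a real one would need many more entries.
TRANSCRIPTION = {
    "\u00f8": "o",   # ø
    "\u00d8": "O",   # Ø
    "\u0142": "l",   # ł
    "\u0141": "L",   # Ł
    "\u00e6": "ae",  # æ (one-to-many mapping)
    "\u00c6": "AE",  # Æ
}

def remove_diacritics(text):
    """Return text with diacritics stripped where a mapping is known."""
    result = []
    # NFD turns e.g. á into 'a' followed by COMBINING ACUTE ACCENT.
    for ch in unicodedata.normalize("NFD", text):
        # combining() is non-zero for combining marks such as accents.
        if unicodedata.combining(ch):
            continue
        # Characters not in the table pass through unchanged.
        result.append(TRANSCRIPTION.get(ch, ch))
    return "".join(result)

print(remove_diacritics("\u0141\u00f3d\u017a"))  # Łódź
```

Characters that have neither a decomposition nor a table entry are left as they are, so the result is only guaranteed to be ASCII if the table covers every such character in the input.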