Removing Diacritical Marks from Strings

Many Latin alphabets (like my native Czech) contain characters with diacritical marks (also called accent marks). For some computer applications (such as searching, or file names compatible across systems) we would like to remove the diacritics and translate the text into a string containing only ASCII characters. A common approach is to use Unicode character decomposition.

It utilizes the fact that Unicode has two ways to represent characters with diacritics – for instance, the character á (LATIN SMALL LETTER A WITH ACUTE) normally has code 225, but it can be decomposed into two Unicode characters: code 97 (LATIN SMALL LETTER A) and code 769 (COMBINING ACUTE ACCENT). This process works for the majority of common ‘special’ Latin characters, but there are still a few for which Unicode does not define a decomposition – these include characters like ø (LATIN SMALL LETTER O WITH STROKE) used in Norwegian or ł (LATIN SMALL LETTER L WITH STROKE) used in Polish. These characters need special handling – basically a transcription table that maps them to some basic Latin characters (it can be a one-to-many mapping – for instance, æ (LATIN SMALL LETTER AE) should map to ‘ae’).
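To make this concrete, here is a quick check (jumping ahead to the Python unicodedata module used later in this post):

```python
import unicodedata

# á (code 225) decomposes into 'a' (97) followed by a combining acute accent (769)
print([ord(c) for c in unicodedata.normalize("NFD", "\u00e1")])   # [97, 769]

# ø has no decomposition defined in Unicode, so normalization leaves it unchanged
print(unicodedata.decomposition("\u00f8"))                        # '' (empty)
print([ord(c) for c in unicodedata.normalize("NFD", "\u00f8")])   # [248]
```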

Character decomposition is defined in the Unicode standard, and all common programming languages have libraries that contain the Unicode definitions and can decompose characters. Below I show how this can be done in Python.

Python comes with the unicodedata module, which holds all the key information from the Unicode standard. It can be used to get metadata about individual characters – like their name or category – to decompose a character, or to normalize a whole Unicode string.
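A few calls show what the module gives us (the character č and the word below are just my examples):

```python
import unicodedata

ch = "\u010d"                            # the Czech letter č
print(unicodedata.name(ch))              # LATIN SMALL LETTER C WITH CARON
print(unicodedata.category(ch))          # Ll  (Letter, lowercase)
print(unicodedata.decomposition(ch))     # 0063 030C  (c + COMBINING CARON)

# normalizing a whole string decomposes every character at once
print(len(unicodedata.normalize("NFKD", "k\u016f\u0148")))
# 5 – 'kůň' becomes k, u, combining ring, n, combining caron
```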

For our solution we first need to define a mapping (transcription) of non-decomposable characters to basic Latin characters (ASCII characters with code < 128):
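A minimal sketch of such a map (the characters included and their transcriptions are just my choices – extend it with whatever your data needs):

```python
# Transcription of characters that NFKD cannot decompose – glyph-based mapping
CHAR_REPLACEMENT = {
    "\u00e6": "ae",  # æ LATIN SMALL LETTER AE
    "\u00c6": "AE",  # Æ LATIN CAPITAL LETTER AE
    "\u00f8": "o",   # ø LATIN SMALL LETTER O WITH STROKE
    "\u00d8": "O",   # Ø LATIN CAPITAL LETTER O WITH STROKE
    "\u0142": "l",   # ł LATIN SMALL LETTER L WITH STROKE
    "\u0141": "L",   # Ł LATIN CAPITAL LETTER L WITH STROKE
    "\u00f0": "d",   # ð LATIN SMALL LETTER ETH
    "\u00d0": "D",   # Ð LATIN CAPITAL LETTER ETH
    "\u00fe": "th",  # þ LATIN SMALL LETTER THORN
    "\u00de": "Th",  # Þ LATIN CAPITAL LETTER THORN
    "\u00df": "ss",  # ß LATIN SMALL LETTER SHARP S
    "\u0111": "d",   # đ LATIN SMALL LETTER D WITH STROKE
    "\u0110": "D",   # Đ LATIN CAPITAL LETTER D WITH STROKE
}
```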

This map contains only the most common characters – there are many more similar characters in the Unicode standard. The key question is how to map them – I prefer a mapping based on the look of the character (its glyph), which I think is more natural for people who do not know the language, rather than the more official transcriptions, which are based on the actual pronunciation of the character (ø -> oe, for example, but people outside Norway will probably never figure that out).

With this mapping in place we can write a function to remove diacritics:
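Here is a sketch of such a function (the name remove_diacritics is my choice; it uses the CHAR_REPLACEMENT map defined above):

```python
import unicodedata

def remove_diacritics(text):
    """Return an ASCII-only approximation of the given Unicode string."""
    result = []
    # NFKD normalization splits each character into base character + combining marks
    for ch in unicodedata.normalize("NFKD", text):
        if unicodedata.category(ch) == "Mn":
            # 'Mark, Nonspacing' – the combining diacritical marks; drop them
            continue
        if ord(ch) < 128:
            result.append(ch)
        elif ch in CHAR_REPLACEMENT:
            # non-decomposable characters – use our transcription table
            result.append(CHAR_REPLACEMENT[ch])
        else:
            # anything else that is still non-ASCII becomes a space
            result.append(" ")
    return "".join(result)
```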

The key trick in this function is normalizing the string to NFKD form, which decomposes every character that has a decomposition defined. We then iterate over the normalized string and leave out all characters with category Mn – ‘Mark, Nonspacing’, the category where all combining diacritical marks reside. For the non-decomposable characters we check whether we have a special mapping for them, and finally we replace any remaining non-ASCII characters with a space.
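A quick test of the sketch above:

```python
print(remove_diacritics("P\u0159\u00edli\u0161 \u017elu\u0165ou\u010dk\u00fd k\u016f\u0148"))
# Prilis zlutoucky kun
print(remove_diacritics("S\u00f8ren, \u0141ukasz, \u00c6gir"))
# Soren, Lukasz, AEgir
```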
