Using the ICU4C Transliterator

Recently, while developing a small application, I needed to strip all accents from strings, regardless of the strings’ encoding. I found multiple hints on the web, but none were really useful (one suggested a search-and-replace over every possible accented character), until I found an obscure post that suggested using ICU (International Components for Unicode) for this task.

It seemed that the ICU “Transliterator” could do this automatically, given the proper transliteration rule. However, the use of ICU4C is not particularly well documented on the web (at least, not the use of the Transliterator C++ class). After a few hours of trying, I finally got it to work. So here are my findings.

First of all, I am developing on Mac OS X, where the ICU library is already installed by default (version 3.8.1 in my case). You still need to include the proper headers, which on my system are located in /usr/local/include/ (add the -I/usr/local/include/ switch to GCC). For the Transliterator class, only one header file needs to be included:

#include <unicode/translit.h>

The rule described on the website I found was “NFD; [:M:] Remove; NFC;”, which, I believe (correct me if I’m wrong), means: decompose the string (NFD), remove every character in the Unicode Mark category ([:M:]), then recompose the string (NFC). Once a string is decomposed, its accents are separate combining marks, so this is perfect to strip accents.
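To make the three steps concrete, here is a toy sketch in plain C++11 that does not use ICU at all. The decomposition table and the names kToyDecompositions, isToyCombiningMark and toyStripAccents are made up for this illustration; the table only knows four characters and the mark test only covers U+0300 to U+036F, whereas ICU works from the full Unicode data.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy decomposition table, for illustration only.
// ICU's NFD uses the complete Unicode decomposition data instead.
static const std::map<char32_t, std::u32string> kToyDecompositions = {
   { U'\u00E9', U"e\u0301" },  // e with acute      -> e + combining acute
   { U'\u00E8', U"e\u0300" },  // e with grave      -> e + combining grave
   { U'\u00E7', U"c\u0327" },  // c with cedilla    -> c + combining cedilla
   { U'\u00F4', U"o\u0302" },  // o with circumflex -> o + combining circumflex
};

// Assumption: only the Combining Diacritical Marks block is treated as a mark.
// The real [:M:] set is much larger.
static bool isToyCombiningMark(char32_t c)
{
   return c >= 0x0300 && c <= 0x036F;
}

std::u32string toyStripAccents(const std::u32string& input)
{
   std::u32string result;

   for (char32_t c : input)
   {
      // Step 1 ("NFD"): decompose the character if the table knows it.
      const auto it = kToyDecompositions.find(c);
      const std::u32string decomposed =
         (it != kToyDecompositions.end()) ? it->second : std::u32string(1, c);

      // Step 2 ("[:M:] Remove"): keep everything except combining marks.
      for (char32_t d : decomposed)
      {
         if (!isToyCombiningMark(d))
         {
            result.push_back(d);
         }
      }
   }

   // Step 3 ("NFC") would recompose whatever sequences are left.
   // In this toy, nothing recomposable remains after step 2.
   return result;
}
```

With this table, toyStripAccents(U"café crème") gives U"cafe creme". A character that is missing from the table, such as ñ, passes through unchanged, which is exactly why the real thing relies on ICU's data rather than a hand-written list.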

At first, I thought I had to create a rule-based Transliterator (with the createFromRules function) and set this string as its rule. Wrong! This “rule” is actually the Transliterator’s ID (a compound ID, made of several IDs separated by semicolons). I was unable to initialize the Transliterator until I passed this string as the ID instead of the dummy ID I was using. In other words, it is not necessary to create a rule-based Transliterator to perform this task (in fact, passing this string as a rule did not work for me); one only has to use this string as the ID, which is much easier than creating a rule-based Transliterator anyway.

It is then possible to declare a pointer to a Transliterator object, and initialize it:

UParseError          parseError      = { 0 };
UErrorCode           lStatus         = U_ZERO_ERROR;
icu::Transliterator* pTransliterator = 0;

pTransliterator = icu::Transliterator::createInstance(
     "NFD; [:M:] Remove; NFC;"
   , UTRANS_FORWARD
   , parseError
   , lStatus
   );

if((pTransliterator == 0) || U_FAILURE(lStatus))
{
   /* error handling */
}
else
{
   icu::Transliterator::registerInstance(pTransliterator);
   /* success, we can use the pTransliterator object */
}

The registerInstance trick is particularly neat: ICU adopts the object and takes ownership of it, so ICU takes care of the Transliterator cleanup for you (you don’t have to free the Transliterator’s memory, and you must not delete it yourself once it is registered).

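Once the object has been created without error, you can start transliterating. However, the Transliterator only operates on UnicodeString objects. This can be overcome easily if you want to work with char arrays (note that the conversion below goes through ICU’s default converter, so the char arrays are read and written in the platform’s default codepage):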

UnicodeString      str;
UErrorCode         lStatus            = U_ZERO_ERROR;
const unsigned int kBufferSize        = 512;
char               szBuf[kBufferSize] = {0};

/* ... */

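/* stringToConvert is assumed to be a null-terminated char array in the default codepage */
str.setTo(stringToConvert);
/* the string is transliterated in place */
pTransliterator->transliterate(str);
/* convert back to a char array; NULL selects the default converter */
str.extract(szBuf, kBufferSize, NULL, lStatus);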

if(U_FAILURE(lStatus))
{
   /* error handling */
}
else
{
   /* string successfully transliterated and extracted */
}

Once this is done, the rule has been applied to the string; based on my testing, it works like a charm. It is probably suboptimal, but it was more than enough for my needs (and who needs high performance when stripping accents from a string?).

Of course, it’s important to link against the ICU library to use its functions. I’m using GCC, so I added these arguments to the command line:

g++ [...] -licudata -licui18n -licuio -liculx -licuuc [...]

Some of these were probably not necessary but I didn’t want to figure out which 🙂 It links and runs, so this is perfect for my small tool.

So I guess this is it! Hopefully it will be of use to some who were lost trying to figure out how to use the ICU Transliterator (or other ICU classes, for that matter). Let me know if you have questions. Your best bet is to have a look at the ICU4C API reference, which can be quite useful.

Tags: C++, GCC, ICU, ICU4C, Strip Accents, Transliterator

This entry was posted on Thursday, August 28th, 2008 at 17:04 and is filed under C++, Development.


Mark C
