Boost.Locale
Why do we need a localization library, when standard C++ facets (should) provide most of the required functionality:
- The std::ctype facet handles case conversion.
- The std::collate facet handles collation and has nice integration with std::locale.
- std::num_put, std::num_get, std::money_put, std::money_get, std::time_put and std::time_get handle formatting and parsing of numbers, time, and currency.
- The std::messages class supports localized message formatting.
So why do we need such a library if all of this functionality is already in the standard library?
Almost every(!) facet has design flaws:
- std::collate supports only one level of collation, not allowing you to choose whether case- or accent-sensitive comparisons should be performed.
- std::ctype, which is responsible for case conversion, assumes that all conversions can be done on a per-character basis. This is probably correct for many languages but it isn't correct in general.
  - Case conversion may change the length of a string (for example, the German letter "ß" becomes "SS" in upper case), but the toupper function works on a single-character basis.
  - A character is not always a single code unit: in the UTF-8 and UTF-16 encodings a code point may occupy up to four chars, or two wchar_ts on the Windows platform. This makes std::ctype totally useless with these encodings.
- std::numpunct and std::moneypunct do not specify the code points for digit representation at all, so they cannot format numbers with the digits used under Arabic locales. For example, the number "103" is expected to be displayed as "١٠٣" in the ar_EG locale.
- std::numpunct and std::moneypunct assume that the thousands separator is a single character. This is untrue for the UTF-8 encoding, where only the Unicode range 0-0x7F can be represented as a single char. As a result, localized numbers can't be represented correctly under locales that use the Unicode "EN SPACE" character for the thousands separator, such as Russian.
- std::time_put and std::time_get have several flaws:
  - They use std::tm for time representation, ignoring the fact that in many countries dates may be displayed using different calendars.
  - They give no way to specify a time zone; std::tm doesn't even include a timezone field.
  - std::time_get is not symmetric with std::time_put, so you cannot parse dates and times created with std::time_put. (This issue is addressed in C++0x and some STL implementations like the Apache standard C++ library.)
- std::messages does not provide support for plural forms, making it impossible to correctly localize such simple strings as "There are X files in the directory".
Also, many features are not really supported by std::locale at all: timezones (as mentioned above), text boundary analysis, number spelling, and many others. So it is clear that the standard C++ locales are problematic for real-world applications.
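The per-character limitation of std::ctype and the single-character thousands separator of std::numpunct can both be seen in a short sketch. It assumes a UTF-8 encoded narrow string and deliberately uses the classic "C" locale, so the behaviour does not depend on which locales are installed. The helper name upper_per_char is made up for this illustration.

```cpp
#include <locale>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical helper: upper-case a string the only way std::ctype allows,
// one char at a time. The classic "C" locale is used so that the result
// does not depend on the locales installed on the machine.
std::string upper_per_char(std::string text)
{
    const std::ctype<char>& ct =
        std::use_facet<std::ctype<char> >(std::locale::classic());
    for (char& c : text)
        c = ct.toupper(c);   // one char in, exactly one char out
    return text;
}

// numpunct hands back the thousands separator as a single char_type, so a
// separator that needs several UTF-8 bytes cannot be returned at all.
static_assert(
    std::is_same<
        decltype(std::declval<const std::numpunct<char>&>().thousands_sep()),
        char>::value,
    "numpunct<char>::thousands_sep() returns one char");
```

Applied to the UTF-8 bytes of "straße", upper_per_char returns a string of exactly the same length with the two bytes of "ß" left untouched, because the interface has no way to produce the expected "STRASSE".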
ICU is a very good localization library, but it has several serious flaws:
Boost.Locale, for example, provides direct integration with iostream, allowing a more natural way of formatting data.
ICU is one of the best localization/Unicode libraries available. It consists of about half a million lines of well-tested, production-proven source code that today provides state-of-the-art localization tools.
Reimplementing even a small part of ICU's abilities is an infeasible project that would require many man-years. So the question is not whether we need to reimplement the Unicode and localization algorithms from scratch, but "Do we need a good localization library in Boost?"
Thus Boost.Locale wraps ICU with a modern C++ interface, allowing future reimplementation of parts with better alternatives, but bringing localization support to Boost today and not in the not-so-near-if-at-all future.
Yes, the entire ICU API is hidden behind opaque pointers and users have no access to it. This is done for several reasons:
There are many available localization formats. The most popular so far are OASIS XLIFF, GNU gettext po/mo files, POSIX catalogs, Qt ts/tm files, Java properties, and Windows resources. However, the last three are useful only in their specific areas, and POSIX catalogs are too simple and limited, so there are only two reasonable options:
The first one generally seems like a more correct localization solution, but it requires XML parsing for loading documents, it is a very complicated format, and even ICU requires preliminary compilation of it into ICU resource bundles.
On the other hand:
So, even though the GNU Gettext mo catalog format is not an officially approved file format:
There are several reasons:
- ptime definitely could be used, but it has several problems:
  - time() gives a representation that is independent of time zones (usually GMT time), and only later should it be represented in a time zone that the user requests.
  - ptime already defines operator<< and operator>> for time formatting and parsing.
  - The facets for ptime formatting and parsing were not designed in a way that the user can override. The major formatting and parsing functions are not virtual. This makes it impossible to reimplement the formatting and parsing functions of ptime unless the developers of the Boost.DateTime library decide to change them.
  - The facets of ptime are not "correctly" designed in terms of the division between formatting information and locale information. Formatting information should be stored within std::ios_base, and information about locale-specific formatting should be stored in the facet itself.
Thus, at this point, ptime is not supported for formatting localized dates and times.
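The division described above, with per-stream formatting state in std::ios_base and locale-specific data in a facet, can be sketched with the standard library alone. This is only an illustration of the principle, not how Boost.Locale or Boost.DateTime are implemented. Every name in it (date_punct, date_style_index, short_date, long_date, simple_date, format_date) is hypothetical.

```cpp
#include <cstddef>
#include <ios>
#include <locale>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical facet: holds only locale-specific data (a date separator).
class date_punct : public std::locale::facet {
public:
    static std::locale::id id;   // lets std::locale index this facet type
    explicit date_punct(char sep, std::size_t refs = 0)
        : std::locale::facet(refs), sep_(sep) {}
    char separator() const { return sep_; }
private:
    char sep_;
};
std::locale::id date_punct::id;

// Per-stream formatting state lives in std::ios_base, in a slot from xalloc().
inline int date_style_index()
{
    static const int index = std::ios_base::xalloc();
    return index;
}

// Manipulators that change only the stream state, never the locale.
inline std::ios_base& short_date(std::ios_base& ios)
{
    ios.iword(date_style_index()) = 0;
    return ios;
}
inline std::ios_base& long_date(std::ios_base& ios)
{
    ios.iword(date_style_index()) = 1;
    return ios;
}

struct simple_date { int year, month, day; };

// The inserter combines both sources: style from the stream, separator
// from the facet installed in the stream's locale ('-' if none is installed).
std::ostream& operator<<(std::ostream& out, const simple_date& d)
{
    const std::locale loc = out.getloc();
    const char sep = std::has_facet<date_punct>(loc)
                         ? std::use_facet<date_punct>(loc).separator()
                         : '-';
    std::string text = std::to_string(d.day) + sep + std::to_string(d.month);
    if (out.iword(date_style_index()) == 1)
        text += sep + std::to_string(d.year);
    return out << text;
}

// Convenience wrapper: format one date with a given separator and style.
std::string format_date(const simple_date& d, char sep, bool long_form)
{
    std::ostringstream out;
    out.imbue(std::locale(std::locale::classic(), new date_punct(sep)));
    out << (long_form ? long_date : short_date) << d;
    return out.str();
}
```

Because the style is a property of the stream and the separator a property of the locale, the same stream can switch styles with a manipulator, and the same date can be printed with a different separator simply by imbuing another locale.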
There are several reasons:
There are two reasons:
The std::codecvt API works on streams of any size without problems.
There are several major reasons:
- This is how the C++ std::locale class is built. Each feature is represented using a subclass of std::locale::facet that provides an abstract API for the specific operations it works on; see Introduction to C++ Standard Library localization support.
There are several reasons:
- C++0x defines char16_t and char32_t as distinct types, so substituting them with something like uint16_t or uint32_t would not work; for example, writing a uint16_t to a uint32_t stream would write a number to the stream.
- Facets such as std::num_put must be installed into the existing instance of std::locale; however, in many standard C++ libraries these facets are specialized for each specific character type that the standard library supports, so an attempt to create a new facet would fail as it is not specialized.
These are exactly the reasons why Boost.Locale fails with the current limited C++0x character support on GCC-4.5 (the second reason) and MSVC-2010 (the first reason).
So basically it is impossible to use non-C++ characters with the C++ locale framework.
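The difference between a distinct character type and an integer typedef can be checked directly. The sketch below uses an ordinary char stream rather than a uint32_t stream, and the helper name stream_as_text is made up for the illustration. It shows that a std::uint16_t value is treated as a number by stream insertion, so it cannot stand in for char16_t.

```cpp
#include <cstdint>
#include <locale>
#include <sstream>
#include <string>
#include <type_traits>

// char16_t and char32_t are types of their own, not aliases of integer types.
static_assert(!std::is_same<char16_t, std::uint_least16_t>::value,
              "char16_t is a distinct type");
static_assert(!std::is_same<char32_t, std::uint_least32_t>::value,
              "char32_t is a distinct type");

// Hypothetical helper: insert a value into a char stream and return the text.
// The classic locale is imbued so that no digit grouping is applied.
template <typename T>
std::string stream_as_text(T value)
{
    std::ostringstream out;
    out.imbue(std::locale::classic());
    out << value;
    return out.str();
}
```

With this helper, a std::uint16_t holding 65 comes out as the text "65", while the char 'A' (also 65 in ASCII) comes out as "A": the stream selects the arithmetic overload for the integer typedef.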
The best and most portable solution is to use the C++ char type and UTF-8 encodings.