Unicode Strings in Ada 2012
Ada has a fun history of character support. Its designers were reasonably quick to jump on the Unicode train, but despite that, there is almost no material on the web on how to deal with anything other than Latin-1 (iso-8859-1) characters in Ada. So here’s what I know:
General Ground Rules
These rules apply no matter which language you are using. In other words, they are not specific to Ada, and you should already know them, but they bear repeating.
Internally in your application, you should not have to care about encodings. At all.
In your application, you should have The String Type that represents text of any flavour, be it English, Navajo or Chinese. It’s really simple. What encoding does The String Type use? Who cares. Well, I care, of course, but it doesn’t matter for my code.
Ideally, The String Type should be an abstraction that supports various operations you might expect to perform on written text, such as
- concatenation (smushing two strings together)
- search (and replace),
- converting it to uppercase (trickier than you think!),
- truncating it after x characters, and
- splitting it up into lines.
I say “ideally” because these operations aren’t strictly required: you only need The String Type to support the ones you actually want to use.
A word of caution: in many languages, The String Type is not actually the type called `String`. Examples where this confusion occurs include Python 2, Haskell and Ada. In these languages, the type called `String` is not The String Type.
So when do encodings matter? When you want text to exit your application. Maybe you write it to a file, or you send it over the internet, or you print it to the user. This is when encodings matter, because the receiver will expect a certain pattern of bits, so you need to put out the right pattern of bits.
To put text data out of your application, you take a value of The String Type, you specify an encoding (which is essentially something that tells you how to convert a text value to bits) and you write out the result of running the text value through the encoding.
To get text data inside your application, you do the opposite: take some data source, take an encoding, and decode the data into a value of The String Type.
Sound familiar? This is essentially how you deal with every data type ever.
A Python `dict` is never “encoded” as long as it stays inside the Python application. It’s just a `dict`. Only when you want to put it out on the internet do you encode it to e.g. JSON, which is a bit pattern representing a `dict`, but it is not a `dict` itself.
Same thing with text: utf-8 data is a bit pattern representing text, but it is not text itself.
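If you want to see what that pattern looks like in code right away, here’s a quick sketch in Ada terms (the types involved are explained further down, and remember to tell your compiler the file’s encoding; more on that below). It uses `Ada.Strings.UTF_Encoding`, a standard package since Ada 2012 that exists precisely for this boundary conversion; I don’t cover it in detail in this post, so consult the reference manual for the full set of overloads.

```ada
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
use  Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Boundary is
   --  Inside the application: just text, no encoding in sight.
   Text : constant Wide_Wide_String := "你好,世界";

   --  At the boundary: a utf-8 bit pattern *representing* the text.
   --  UTF_8_String is a plain String holding the encoded bytes.
   Bytes : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Encode (Text);

   --  And on the way back in: decode the bytes into text again.
   Roundtrip : constant Wide_Wide_String := Decode (Bytes);
begin
   null;  --  Text, Bytes and Roundtrip would go off to real code here.
end Boundary;
```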
Ada Specifics
Update: The Library Path
I have been notified by a reader that if one is not constrained to using only the standard libraries, there is a very feature-rich implementation of The String Type for Ada in the League library, which is part of the Matreshka framework. Here’s an excerpt of their email:
Matreshka provides a tagged type `Universal_String` which is an abstraction of Unicode string. The type implements many handy string operations like `Index`, `Slice`, `Split` to vector of strings, case conversions, etc. The library also provides `Text_Codecs` to encode/decode string using multiple encodings. Recently we have released new version of the library - 18.1. It’s open source (BSD license).
Reader email.
The type they refer to is described in their wiki page about Localisation, Internationalisation and Globalisation. They go on to explain in their email that
Having custom strings isn’t enough to be useful in application developing, so Matreshka provides advanced extension that use strings: RegExp, xml, json, sql access, calendar. Then we went further and added web development framework, some soap stuff and Ada modeling framework. Most of these are implemented as separate libraries.
Reader email.
I haven’t had a chance to use this – or audit the code – but from what I can tell, it looks a lot like things you’d expect from a modern standard library. I will try to use it when I feel like I can take the time to!
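Since I haven’t compiled anything against League myself, take this sketch with a grain of salt. The operation names (`To_Universal_String`, `To_Uppercase`) are taken from the reader’s email and the Matreshka documentation, so double-check the exact signatures against the League sources before leaning on them.

```ada
with League.Strings; use League.Strings;

procedure League_Demo is
   --  To_Universal_String turns a Wide_Wide_String literal into
   --  Matreshka's Universal_String (their take on The String Type).
   Hello : constant Universal_String := To_Universal_String ("你好,世界");

   --  Universal_String is tagged, so operations can use prefix
   --  notation. To_Uppercase stands in for the "case conversions"
   --  the email mentions; verify the name against the sources.
   Upper : constant Universal_String := Hello.To_Uppercase;
begin
   null;  --  Hello and Upper would feed into real code here.
end League_Demo;
```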
In Ada 2012 and Ada 2005
As I said, the history of character support in Ada has been a funny one. But! If we’re working in Ada 2005 or newer, we don’t need to worry about the history of it. To achieve modern character support in your Ada programs, you should know the following:
- Ada source files support full Unicode in both identifiers (variable names) and string literals. However, you may need to tell your compiler which encoding the source code files are in. To tell gnat your files are encoded using utf-8, you pass the `-gnatW8` flag.
- You want to store characters in variables of type `Wide_Wide_Character`. This is part of the language, so there is nothing special to import. `Wide_Wide_Character` has full Unicode support and as such can store any character you might want. This comes with a caveat that applies to all the following types, though: technically, it doesn’t actually store a character. It stores a Unicode code point, which may or may not be a character. This is to be expected, though, because it’s pretty much the only reasonable thing to do.
- If you have an array of characters, the type for that is `Wide_Wide_String`. It is also part of the language, so no imports required. However, note that this is still a low-level fixed-size array, which means it cannot reliably support operations such as “convert to upper case”, which may change the length of the string. (It does support such operations, but their results may not necessarily be what you expect for some languages.) It also carries over the caveat from `Wide_Wide_Character`: a single index in this string may not actually be a character; it can be a combining mark that is meant to be used with the character that comes before or after it. String literals in Ada are also automatically converted to this type, so you can write `Hello : Wide_Wide_String := "你好,世界";`.
- If you want a dynamic string, you’ll have to import `Ada.Strings.Wide_Wide_Unbounded`, which has a type `Unbounded_Wide_Wide_String`. This is the closest you’ll get to The String Type in standard Ada 2005 and Ada 2012. The `Ada.Strings.Wide_Wide_Unbounded` library is pretty much a copy of the `Ada.Strings.Unbounded` library, except it deals with `Wide_Wide_Character`s instead.
- While `Unbounded_Wide_Wide_String` will store any Unicode character you throw at it, and it does support some basic string operations, it does not support all operations you may want it to. For example, converting a string to uppercase is done on a codepoint-by-codepoint basis, which is even more wrong than if it were done character by character. However, I can’t fault Ada for this, because almost every language gets this wrong anyway. It is a hard problem.
- For input/output, `Ada.Wide_Wide_Text_IO` looks pretty much like `Ada.Text_IO`, except it reads and writes values of type `Wide_Wide_String`. There is also `Ada.Wide_Wide_Text_IO.Wide_Wide_Unbounded_IO`, which does input/output directly with unbounded strings.
- However, if you want other people to read or write the text data you’re outputting, you may want to specify an encoding to be used outside your application. Since input/output is sort of platform-dependent, how to do this is not strictly mandated by the Ada standard. The `Open` procedure is required to take a `Form` string which specifies platform-specific instructions; what the string looks like depends on the platform. For writing utf-8 using gnat, you specify the string `"ENCODING=UTF8,WCEM=8"`. (A sketch putting these pieces together follows right after this list.)
If you’re used to converting values to strings with the `Image` attribute, you might want to know that there is a `Wide_Wide_Image` attribute that does the same thing, except it can handle Unicode values.

So that’s the deal in standard Ada 2005 and standard Ada 2012. It’s not as sunny in earlier versions, but I’ll quickly go through the important details anyway.
In Ada 95
For a long time, it was believed that Unicode could get by with 16 bits to represent the characters for all languages of the world. Originally, “Unicode” was defined as “16 bit characters”. History showed this was a bad idea, but it was believed to be true for long enough that many systems are stuck with 16 bit characters; both Java and Windows, for example, deal in 16 bit characters.
Aaand so does Ada 95. The types are `Wide_Character`, `Wide_String` and `Unbounded_Wide_String` (from `Ada.Strings.Wide_Unbounded`). Read the advice for Ada 2012, and replace every occurrence of “`Wide_Wide`” with simply “`Wide`” and you’ll be good. Except, of course, that you’re limited to 16-bit code points, with surrogate pairs and everything else that comes with them.
Another difference you’ll find is that while Ada 95 allows 16-bit code points in strings, it does not in identifiers. So variable and function names and such are still limited to the 8-bit iso-8859-1 (“latin-1”) repertoire.
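For completeness, here’s the Ada 95 version of the earlier sketch. The packages named here (`Ada.Strings.Wide_Unbounded` and `Ada.Wide_Text_IO`) are standard Ada 95; everything else is just the Ada 2012 example with “Wide_Wide” shortened to “Wide”.

```ada
with Ada.Strings.Wide_Unbounded; use Ada.Strings.Wide_Unbounded;
with Ada.Wide_Text_IO;           use Ada.Wide_Text_IO;

procedure Hello_95 is
   --  16-bit code points only: fine for the Basic Multilingual
   --  Plane, but anything beyond it needs surrogate pairs.
   Fixed   : constant Wide_String := "Grüße";
   Dynamic : Unbounded_Wide_String := To_Unbounded_Wide_String (Fixed);
begin
   Append (Dynamic, "!");
   Put_Line (To_Wide_String (Dynamic));
end Hello_95;
```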
…In Ada 83
Originally, Ada 83 only supported ascii, i.e. 7 bit codepoints. This is what you should expect if you’re using Ada 83. It’s also worth knowing that Ada 83 does not have the raii-style controlled types that were introduced in Ada 95, so you cannot have “unbounded” strings in Ada 83. Only fixed-size strings are available.
Oh, and many compilers sneakily switched over to an 8-bit encoding during the lifetime of Ada 83, so if you desperately need it, check if this is the case with yours. If so, you’re probably dealing with iso-8859-1, also known as “latin-1”.