A brief guide to perl character encoding
Credits
I originally wrote this at work, after my team spent far too many days yelling at the computer because of Mojibake. Thanks to my employer for allowing me to publish it, and the several colleagues who provided helpful feedback. Any errors are, naturally, not their fault.
Table of Contents
- 12:45. Restate my assumptions
- The Royal Road
- The Encode module
- Debugging
- The many ways of writing a character
12:45. Restate my assumptions
We will normally want to read and write UTF-8 encoded data. Therefore you should make sure that your terminal can handle it. While we will occasionally have to deal with other encodings, and will often want to look at the byte sequences that we are reading and writing and not just the characters they represent, your life will still be much easier if you have a UTF-8 capable terminal. You can test your terminal thus:
$ perl -E 'binmode(STDOUT, ":encoding(UTF-8)"); say "\N{GREEK SMALL LETTER LAMDA}"'
That should print λ, a letter that looks a bit like a lower-case y mirrored through the horizontal axis. And if you pipe the output from that into hexdump -C you should see the byte sequence 0xce 0xbb 0x0a.
The Royal Road
Ideally, your code will only have to care about any of this at the edges - that is, where data enters and leaves the application. That could be when reading or writing a file, sending/receiving data across the network, making system calls, or talking to a database. And in many of these cases - especially talking to a database - you will be using a library which already handles everything for you. In a brand new code-base which doesn’t have to deal with any legacy baggage you should, in theory, only have to read this first section of this document.
Alas, most real programming is a habitation of devils, who will beset you from all around and make you have to care about the rest of it.
Characters, representations, and strings
Perl can work with strings containing any character in Unicode. Characters are written in source code either as a literal character such as "m" or in several other ways. These are all equivalent:
"m"
chr(0x6d) # or chr(109), of course
"\x{6d}"
"\N{U+6d}"
"\N{LATIN SMALL LETTER M}"
As are these:
chr(0x3bb)
"\x{3bb}"
"\N{U+3bb}"
"\N{GREEK SMALL LETTER LAMDA}"
Non-ASCII characters can also appear as literals in your code, for example "λ", but this is not recommended - see the discussion of the utf8 pragma below. You can also use octal - "\154" - but this too is not recommended, as hexadecimal encodings are marginally more familiar and easier to read.
Internally, characters have a representation, a sequence of bytes that is unique for a particular combination of character and encoding. Most modern languages default to using UTF-8 for that representation, but perl is old enough to pre-date UTF-8 - and indeed to pre-date any concern for most character sets. For backward-compatibility reasons, and for compatibility with the many C libraries for which perl bindings exist, it was decided when perl sprouted its Unicode tentacle that the default representation should be ISO-Latin-1. This is a single-byte character set that covers most characters used in most modern Western European languages, and is a strict superset of ASCII.
Any string consisting solely of characters in ISO-Latin-1 will by default be represented internally in ISO-Latin-1. Consider these strings:
Release the raccoon! - consists solely of ASCII characters. ASCII is a subset of ISO-Latin-1, so the string’s internal representation is an ISO-Latin-1-encoded string of bytes.
Libérez le raton laveur! - consists solely of characters that exist in ISO-Latin-1, so the string’s internal representation is an ISO-Latin-1-encoded string of bytes. The "é" character has code point 0xe9 and is represented as the byte 0xe9 internally.
Rhyddhewch y racŵn! - the "ŵ" does not exist in ISO-Latin-1. But it does exist in Unicode, with code point 0x175. As soon as perl sees a non-ISO-Latin-1 character in a string, it switches to using something UTF-8-ish, so code point 0x175 is represented by byte sequence 0xc5 0xb5. Note that while valid characters’ internal representations are valid UTF-8 byte sequences, this can also encode invalid characters.
Libérez le raton laveur! Rhyddhewch y racŵn! - this contains both an "é" (which is in ISO-Latin-1) and a "ŵ" (which is not), so the whole string is UTF-8 encoded. The "ŵ" is as before encoded as byte sequence 0xc5 0xb5, but the "é" must also be UTF-8 encoded instead of ISO-Latin-1-encoded, so becomes byte sequence 0xc3 0xa9.
But notice that ISO-Latin-1 not only contains ASCII and characters like "é" (at code point 0xe9, remember), it also contains the characters "Ã" (capital A with a tilde, code point 0xc3) and "©" (copyright symbol, code point 0xa9). So how do we tell the difference between the ISO-Latin-1 byte sequence 0xc3 0xa9 representing "Ã©" and the UTF-8 byte sequence 0xc3 0xa9 representing "é"? Remember that a representation is "a sequence of bytes that is unique for a particular combination of character and encoding". So perl stores the encoding as well as the byte sequence. It is stored as a single bit flag: if the flag is unset then the sequence is ISO-Latin-1; if it is set then it is UTF-8.
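You can watch the flag doing its job. Here is a minimal sketch (it leans on Encode::decode, which is covered properly below): the same two bytes make two different strings, depending on whether perl has been told they are UTF-8.
use Encode;
use feature 'say';

my $latin1 = chr(0xc3).chr(0xa9);               # flag off: the two characters "Ã©"
my $utf8   = Encode::decode("UTF-8", $latin1);  # flag on: the one character "é"
say length($latin1);    # 2
say length($utf8);      # 1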
Source code encoding, the utf8 pragma, and why you shouldn’t use it
It is possible to put non-ASCII characters into your source code. For example, consider this file:
my $string = "é";
print "$string contains ".length($string)." characters\n";
from which some problems arise. First, if the file is encoded in UTF-8, how can perl tell, when it comes across the byte sequence 0xc3 0xa9, what encoding it is in? Is it ISO-Latin-1? Well, it could be. Is it UTF-8? Again, it could be. In general, it isn't possible to tell from a sequence of bytes what encoding is in use. For backward-compatibility reasons, perl assumes ISO-Latin-1.
If you save that file encoded in UTF-8, and have a UTF-8-savvy terminal, that code will output:
é contains 2 characters
which is quite clearly wrong. It interpreted the 0xc3 0xa9 as two characters, but then when it spat those two characters out your terminal treated them as one.
We can tell perl that the file contains UTF-8-encoded source code by adding a use utf8. We also need to fix the output encoding - use utf8 doesn't do that for you, it only asserts that the source file is UTF-8 encoded:
use utf8;
binmode(STDOUT, ":encoding(UTF-8)");
my $string = "é";
print "$string contains ".length($string)." character\n";
(For more on output encoding see the next section)
And now we get this:
é contains 1 character
Hurrah!
At this point a second problem arises. Some editors aren't very clever about encodings, and even if they correctly read a file that is encoded in UTF-8, they will save it in ISO-Latin-1. VSCode, for example, is known to do this at least some of the time. If that happens, you're still asserting via use utf8 that the file is UTF-8, but the "é" in the sample file will be encoded as byte 0xe9, and the following double-quote and semicolon as 0x22 0x3b. This results in a fatal error:
Malformed UTF-8 character: \xe9\x22\x3b (unexpected non-continuation byte 0x22,
immediately after start byte 0xe9; need 3 bytes, got 1) at ...
So given that you’re basically screwed if you have non-ASCII source code no matter whether you use utf8 or not, I recommend that you just don’t do it. If you need a non-ASCII character in your code, use any of the many other ways of specifying it, and if necessary put a comment nearby so that whoever next has to fiddle with the code knows what it is:
chr(0xe9); # e-acute
Input and output
Strings aren’t the only things that have encodings. File handles do too. Just like how perl defaults to assuming that your source code is encoded in ISO-Latin-1, it assumes unless told otherwise that file handles are similarly ISO-Latin-1, and so if you try to print "é" to a handle, what actually gets written is the byte 0xe9.
Even if your source code has the use utf8 pragma, and your code contains the byte sequence 0xc3 0xa9, which will internally be decoded as the character "é", your handles are still ISO-Latin-1 and you'll get a single byte for that character. For how this happens see "PerlIO layers" below.
Things get a bit more interesting if you try to send a non-ISO-Latin-1 character to an ISO-Latin-1 handle. Perl does the best it can and sends the internal representation - which is UTF-8, remember - to the handle and emits a warning "Wide character in print". Pay attention to the warnings!
This behaviour is another common source of bugs. If you send the two strings "Libérez le raton laveur!" followed by "Rhyddhewch y racŵn!" to an ISO-Latin-1 handle, then the first one will sail through, correctly encoded, but the second will go through as UTF-8 (with that warning). You’ve now got two different character encodings in your output stream, and no matter what encoding is expected at the other end you’ll get mojibake.
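Here is a minimal sketch of that failure mode, using \x{...} escapes rather than literal characters for the reasons given earlier:
# STDOUT is an ISO-Latin-1 handle by default
print "Lib\x{e9}rez le raton laveur!\n";   # é is emitted as the single byte 0xe9
print "Rhyddhewch y rac\x{175}n!\n";       # warns "Wide character in print";
                                           # ŵ is emitted as the UTF-8 bytes 0xc5 0xb5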
PerlIO layers
We’ve seen how by default input and output is assumed to be in ISO-Latin-1. But that can be changed. Perl has supported different encodings for I/O since the dawn of time - since at least perl 3.016. That’s when it started to automatically convert "\n" into "\r\n" and vice versa on MSDOS, and the binmode() function was introduced in case you wanted to open a file on DOS without any translation.
These days this is implemented via PerlIO layers, which allows you to open a file with all kinds of translation layers, including those which you write yourself or grab from the CPAN (see for example File::BOM). You can also add and remove layers from an already open handle.
In general these days, you always want to read/write UTF-8 or raw binary, so will open files something like this:
:encoding">
open(my $fh, ">:encoding(UTF-8)", "some.log") or die $!;
open(my $fh, "<:raw", "image.jpg") or die $!;
or to change the encoding of an already open handle:
binmode(STDOUT, ":encoding(UTF-8)")
(NB that encodings applied to bare-word file handles such as STDOUT have global effect!)
Provided that we don’t have to worry about Windows, we will generally only ever have one layer doing anything significant on a handle (on Windows the :crlf layer is useful in addition to any others, to cope with Windows’s endearing backward-compatibility with CP/M), but it's possible to have more. In general, when a handle is opened for reading, encodings are applied to the data in the order that they are specified in the open() function call, from left to right. When writing, they are applied from right to left.
If you ever think you need more than one layer, or want a layer other than those in the examples above, see PerlIO.
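For illustration, here's a sketch of that ordering rule, assuming you have the CPAN module PerlIO::gzip installed (the :gzip layer is not part of core perl):
use PerlIO::gzip;

# Reading: layers apply left to right - :gzip decompresses the raw bytes
# first, then :encoding(UTF-8) decodes the decompressed bytes into characters.
open(my $fh, "<:gzip:encoding(UTF-8)", "some.log.gz") or die $!;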
The Encode module
The above explains the "royal road", where you are in complete control of how data gets into and out of your code. In that situation, you should never need to re-encode data, as it will always be Just A Bunch Of Characters whose underlying representation you don’t care about. That is, however, often not the case in the real world where we are beset by demons. We sometimes have to deal with libraries that do their own encoding/decoding and expect us to supply them with a byte stream (XML::LibXML, for example), or which have had incorrect or partial bug fixes applied for any of the problems mentioned above and for which we can’t easily provide a proper fix because of other code now relying on the buggy behaviour (by for example having work-arounds to correct badly-encoded data).
Encode::encode
The Encode::encode() function takes a string of characters and returns a string of bytes that represents that string in your desired encoding. For example:
my $string = "Libérez le raton laveur!";
encode("UTF-8", $string, Encode::FB_CROAK|Encode::LEAVE_SRC);
will return a string where the character "é" has been replaced by the two bytes 0xc3 0xa9. If the original string was encoded in UTF-8 then the underlying representation of the input and output strings will be the same, but their encodings will be different, and the output will be reported as being one character longer by the length() function.
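A small sketch of that length difference:
use Encode qw(encode);
use feature 'say';

my $string = "Lib\x{e9}rez le raton laveur!";
my $bytes  = encode("UTF-8", $string, Encode::FB_CROAK|Encode::LEAVE_SRC);
say length($string);   # 24 - the é is a single character
say length($bytes);    # 25 - the é became the two bytes 0xc3 0xa9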
Encode::encode can sometimes, for Complicated Internals Optimisation Reasons, modify its input. To avoid this, set the Encode::LEAVE_SRC bit in its third argument.
If you are encoding to anything other than UTF-8, or your string may contain characters outside of Unicode, then you should consider telling encode() to be strict about characters that it can't encode, such as if you try to encode "ŵ" into an ISO-Latin-1 byte sequence. That's what the Encode::FB_CROAK bit is about in the example - in real code the encode should be in a try/catch block to deal with the exception that may arise. Encode's documentation has a whole section on handling malformed data.
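For example, here's a sketch using a plain eval block (core perl's try/catch will do just as well on versions that have it):
use Encode qw(encode);

my $latin1 = eval {
    encode("ISO-8859-1", "Rhyddhewch y rac\x{175}n!",
           Encode::FB_CROAK|Encode::LEAVE_SRC);
};
if (!defined $latin1) {
    # ŵ has no ISO-Latin-1 representation, so encode() croaked
    warn "could not encode: $@";
}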
Encode::decode
It is quite common for us to receive data, either from a network connection or from a library, which is a UTF-8-encoded byte stream. Naively treating this as ISO-Latin-1 characters will lead to doom and disaster, as the byte sequence 0xc3 0xa9 will, as already explained, be interpreted as the characters "Ã" and "©". Encode::decode() takes a bunch of bytes and turns them into characters, assuming that they are in a specified encoding. For example, this will return an "é" character:
decode("UTF-8", chr(0xc3).chr(0xa9), Encode::FB_CROAK)
You should consider how to handle a byte stream that turns out not to be valid in your desired encoding, and again I recommend use of Encode::FB_CROAK.
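A sketch of defensive decoding, again with eval standing in for your preferred try/catch:
use Encode qw(decode);

my $bytes = chr(0xc3);   # a lone UTF-8 start byte with no continuation byte
my $chars = eval { decode("UTF-8", $bytes, Encode::FB_CROAK) };
warn "malformed UTF-8 input: $@" unless defined $chars;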
Encode:: everything else
The "Encode" module provides some other functions that, on the surface, look useful. They are, mostly, not.
Remember how waaaay back I briefly mentioned that perl’s internal representation for non-ISO-Latin-1 characters was UTF-8-ish, and how it could contain invalid characters? That’s why you shouldn’t use encode_utf8 or decode_utf8. You may be tempted to use Encode::is_utf8() to check a string's encoding. Don't, for the same reason.
You will generally not be calling encode() with a string literal as its input, but with a variable. However, any errors like "Modification of a read-only value attempted" are your fault: you should have told it to Encode::LEAVE_SRC.
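A sketch of that failure mode, assuming the check bits behave as described above:
use Encode qw(encode);

encode("UTF-8", "foo", Encode::FB_CROAK);                   # dies: tries to modify the read-only literal
encode("UTF-8", "foo", Encode::FB_CROAK|Encode::LEAVE_SRC); # fine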
Don't even think about using the _utf8_on and _utf8_off functions. They are only useful for deliberately breaking things at a lower level than you should care about.
Debugging
The UTF8 flag
The UTF8 flag is a reliable indicator that the underlying representation uses multiple bytes per non-ASCII character, but that’s about it. It is not a reliable indicator whether a string’s underlying representation is valid UTF-8 or that the string is valid Unicode.
The result of this:
Encode::encode("UTF-8", chr(0xe9), 8)
is a string whose underlying representation is valid UTF-8 but the flag is off.
This, on the other hand, has the flag on, but the underlying representation is not valid UTF-8 because the character is out of range:
chr(2097153)
This is an invalid character in Unicode, but perl encodes it (it has to encode it so it can store it) and turns the UTF8 flag on (so that it knows how the underlying representation is encoded):
chr(0xfff8)
And finally, this variable that someone else’s broken code might pass to you contains an invalid encoding of a valid character:
my $str = chr(0xf0).chr(0x82).chr(0x82).chr(0x1c);
Encode::_utf8_on($str);
Devel::Peek
This is a very useful module for looking at the internals of perl variables, in particular for looking at what perl thinks the characters are and what their underlying representation is. It exports a Dump() function, which prints details about its argument’s internal structure to STDERR. For example:
$ perl -MDevel::Peek -E 'Dump(chr(0xe9))'
SV = PV(0x7fa98980b690) at 0x7fa98a00bf90
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,PROTECT,pPOK)
PV = 0x7fa989408170 "\351"\0
CUR = 1
LEN = 10
For the purposes of debugging character encoding issues, the two important things to look at are the lines beginning with FLAGS = and PV =. Note that there is no UTF8 flag set, indicating that the string uses the single-byte ISO-Latin-1 encoding. And the string’s underlying representation is shown (in octal, annoyingly) as "\351".
And here’s what it looks like when the string contains code points outside ISO-Latin-1, or has been decoded from a byte stream into UTF-8:
$ perl -MDevel::Peek -E 'Dump(chr(0x3bb))'
SV = PV(0x7ff37e80b090) at 0x7ff388012390
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,PROTECT,pPOK,UTF8)
PV = 0x7ff37f907350 "\316\273"\0 [UTF8 "\x{3bb}"]
CUR = 2
LEN = 10
Notice that the UTF8 flag has appeared, and that we are shown both the underlying representation as two octal bytes "\316\273" and the characters (in hexadecimal if necessary - mmm, consistency) that those bytes represent.
hexdump
For debugging input and output I recommend the external hexdump utility. Feed it a file and it will show you the bytes therein, avoiding any clever UTF-8 decoding that your terminal might do if you were to simply cat the file:
$ cat greek
αβγ
$ hexdump -C greek
00000000 ce b1 ce b2 ce b3 0a |.......|
00000007
It can of course also read from STDIN.
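For example, you can pipe the terminal test from the top of this document straight into it:
$ perl -E 'binmode(STDOUT, ":encoding(UTF-8)"); say chr(0x3bb)' | hexdump -C
00000000 ce bb 0a |...|
00000003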
PerlIO::get_layers
Once you’re sure that your code isn’t doing anything perverse, but your data is still getting screwed up on input/output, you can see what encoding layers are in use on a handle with the PerlIO::get_layers function. PerlIO is a special built-in namespace; you don’t need to use it. Indeed, if you do try to use it you will fail, as it doesn’t exist as a module. Layers are returned in an array, in the order that you would tell open() about them.
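For example (the exact output varies by platform and perl build, but on a typical Linux perl it looks something like this):
$ perl -E 'binmode(STDOUT, ":encoding(UTF-8)"); say for PerlIO::get_layers(*STDOUT)'
unix
perlio
encoding(utf-8-strict)
utf8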
Layers can apply to any handle, not just file handles. If you’re dealing with a socket then remember that sockets have both an input side and an output side, which may have different layers - see the PerlIO manpage for details. And also see the doco if you care about the difference between :utf8 and :encoding(UTF-8) - although if you diligently follow the sage advice in this document you won’t care, because you won’t use :utf8.
The many ways of writing a character
There are numerous different ways of representing a character in your code.
String literals
"m"
For the reasons outlined above please only use this for ASCII characters.
The chr function
This function takes a number as its argument and returns the character with the corresponding codepoint. For example, chr(0x3bb) returns λ.
Octal
You can use up to three octal digits ("\155") for ISO-Latin-1 characters only, but please don’t. Octal is a less familiar encoding than hexadecimal, so hex is marginally easier to read, and octal also suffers from the “how long is this number” problem described below.
Hexadecimal
"\x{e9}"
You can put any number of hexadecimal digits between the braces. There is also a version of this which doesn’t use braces: "\xe9". It can only take one or two hexadecimal digits and so is only valid for ISO-Latin-1 characters. The lack of delimiters can lead to confusion and error. Consider "\xa9": brace-less \x can take one or two hex digits, so is that \xa (a line-feed character) followed by the digit 9, or is it \xa9, the copyright symbol? Brace-less \x is greedy, so if it looks like there are two hex digits it will assume that there are. Only if the first digit is followed by the end-of-string or by a non-hex-digit will it assume that you meant to use the single-digit form. This means that in \xap, for example, the \x grabs a single hex digit, so it is equivalent to \x{0a}p - a new line followed by the letter p. I think you will agree that use of braces makes things much clearer, so treat the brace-less variant as deprecated.
By codepoint name
"\N{GREEK SMALL LETTER LAMDA}"
This may sometimes be preferable to providing the (hexa)decimal codepoint with an associated comment, but it gets awful wordy awful fast. By default the name must correspond exactly to that in the Unicode standard. Shorter aliases are available if you ask for them, via the charnames pragma. The documentation only mentions this for the Greek and Cyrillic scripts, but aliases are available for all scripts which have letters. For example, these are equivalent:
"\x{5d0}"
"\N{HEBREW LETTER ALEF}"
use charnames qw(hebrew);
"\N{ALEF}" # א
Be careful if you ask for character-set-specific aliases as there may be name clashes. Both Arabic and Hebrew have a letter called "alef", for example:
use charnames qw(arabic);
"\N{ALEF}" # ا
use charnames qw(arabic hebrew);
"\N{ALEF}" # Always Hebrew, no matter the order of the imports!
A happy medium is to ask for :short aliases:
use charnames qw(:short);
"\N{ALEF}" # error
"\N{hebrew:alef} \N{arabic:alef}" # does what it says on the tin
Other hexadecimal
"\N{U+3bb}"
This notation looks a little bit more like the U-ish hexadecimal notations used in other languages, while also being a bit like the \N{...} notation for codepoint names. Unless you want to mix hexadecimal with codepoint names you should probably not use this; prefer \x{...}, which is more familiar to perl programmers.
In regular expressions
You can use any of the \x and \N{...} variants in regular expressions. You may also see \p, \P, and \X as well - see perlunicode and perlrebackslash. You should consider use of the /a modifier, as that does things like force \d to only match ASCII and not, say, ৪, which looks like 8 but is actually BENGALI DIGIT FOUR.
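A sketch of the difference:
use feature 'say';

my $bengali_four = "\x{9ea}";   # ৪, BENGALI DIGIT FOUR
say $bengali_four =~ /\d/  ? "match" : "no match";   # match
say $bengali_four =~ /\d/a ? "match" : "no match";   # no match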
ASCII-encoded JSON strings in your code
You may need to embed JSON strings in your code, especially in tests. I recommend that JSON should always be ASCII-encoded as this minimises the chances of it getting mangled anywhere. This introduces yet another annoying way of embedding a bunch of hex digits into text. This example:
use JSON;
to_json(chr(0x3c0), { ascii => 1 });
will produce the string "\u03c0". That’s the sequence of eight characters ", \, u, 0, 3, c, 0, ". The double quotes are how JSON says “this is a string”, and the two characters \ and u are how JSON says “here comes a hexadecimal code point”. If you want to put ASCII-encoded JSON in your code then you need to be careful about quoting and escaping.
Perl will treat the character sequence \u as a real back-slash followed by the letter when it is single-quoted, but in general it is always good practice to escape a back-slash that you want to be a real back-slash, to avoid confusing a reader who may not have been paying attention to whether you’re single- or double-quoting, or in case you later change the code to use double-quotes and interpolate some variable:
my $json = '"I like \\u03c0, especially Greek pie"';
# or double-quoted with interpolation
my $json = qq{"I like \\u03c0, especially $nationality pie"};
Accented character vs character + combining accent
For many characters there are two different valid ways of representing them. chr(0xe9) is LATIN SMALL LETTER E WITH ACUTE. The same character can be obtained with the two codepoints "e".chr(0x301) - that is, LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT.
Whether those should sort the same, compare the same, or one should be converted to t’other will vary depending on your application, so the best I can do is point you at Unicode::Normalize.
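As a small taster, here is a sketch using Unicode::Normalize's NFC() (compose) and NFD() (decompose) functions:
use Unicode::Normalize qw(NFC NFD);
use feature 'say';

my $composed   = chr(0xe9);        # é as a single codepoint
my $decomposed = "e".chr(0x301);   # e followed by a combining acute accent

say $composed eq $decomposed      ? "eq" : "ne";   # ne - different codepoint sequences
say NFC($decomposed) eq $composed ? "eq" : "ne";   # eq
say NFD($composed) eq $decomposed ? "eq" : "ne";   # eq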