Charset Encoding Issues
Table of Contents
Unicode
- ideal output for web is utf8
- UTF-8 is a decent character set for mapping characters to integers
- data can be encoded into a variety of character sets
- Unicode represents all characters (in all languages) with integers.
- ASCII represents every character as a binary digit between 32 and 127
- in the ASCII character set, each binary value between 0 and 127 is given a specific character
- character data is stored one character per byte, but the ASCII system only required 7 bits
- of the 256 possible values in an 8 bit byte, there were 128 slots left free that people decided to play with, randomly assigning different values
- unicode is a single character set that represents every writing system known
- in unicode, each letter is represented by a code point, ie, a U for unicode and an hexadecimal number: U+FEC9
- 'Hello' corresponds to the following code points: U+0048 U+0065 U+006C U+006C U+006F
- encodings are used to store the code points in memory and in web documents
- at the beginning of some unicode strings is an indication of the Byte Order Mark (BOM), which specifies if the string is stored in high-endian or low-endian mode,
- UTF-8 is an encoding that stores unicode code points in 8 bit bytes
- English text looks exactly the same in UTF-8 as it does in ASCII, only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes
- 2 ways of encoding unicode:
- The traditional store-it-in-two-byte methods are called UCS-2
- And there's the popular new UTF-8 standard
- It does not make sense to have a string without knowing what encoding it uses
Character Encoding Negotiation
- how the browser decides which encoding to use
Forms Data Set Encoding
- can be encoded in application/x-www-form-urlencoded
- in this mode, changing charsets changes encoding
- modern browsers send x-www-form-urlencoded data to the server in the CHARSET that was determined to be that of the *form*, however that determination was made
- drawbacks
- document charset must be correctly identified (often is not)
- fails with multiple encodings handled by a single CGI
- fails with transcoding proxies
- it is recommended to use UTF-8 in both directions
- and use multipart/form-data encoding
Encoding According to Target Type
- you must encode your characters according to whether the target is html, xml or a URI, eg:
character html xml url ~~~~~~~~- ~~~~ ~~~ ~~~ € = € € %E2%82%AC
Various Transformations
- between one charset and another: I18N::UnicodeString?
- determining string position: I18N::UnicodeString?
- unicode_to_entities_preserving_ascii()
Converting from IS0-8859-1 to UTF-8
- All characters in the range of 0-127 (hex 00 through 7F), are represented identically in both encodings. This covers the entire range of the original ASCII characters
- All iso-8859-1 characters in the range of 128-191 (hex 80 through BF) need to be preceeded by a byte with the value of 194 (hex C2) in utf-8, but otherwise are left intact
- All iso-8859-1 characters in the range of 192-255 (hex C0 through FF) not only need to be preceeded by a byte with the value of 195 (hex C3) in utf-8, but also need to have 64 (hex 40) subtracted from the iso-8859-1 character value. For example, a ñ (decimal 241, hex F1) becomes a 195 followed by a 177 (hex C3 B1)
- here's some code that will do it: http://miscoranda.com/96
Good Encoding Practices (T. Braye)
- Embrace Unicode, don't fight it; it's probably the right thing to do, and if it weren't you'd probably have to anyhow
- Inside your software, store text as UTF-8 or UTF-16; that is to say, pick one of the two and stick with it
- Interchange data with the outside world using XML whenever possible; this makes a whole bunch of potential problems go away
- Try to make your application browser-based rather than write your own client; the browsers are getting really quite good at dealing with the texts of the world
- If you're using someone else's library code (and of course you are), assume its Unicode handling is broken until proved to be correct
BOM
If you use UTF-8 as charset, check every file you save with UTF-8 compliant editor and remove UTF-8 BOM from beginning of file if it's present, otherwise this will break display in IE, however Mozilla browsers seem to be immune to the problem.
UTF-8 BOM characters written in HEX are EF BB BF
BOM is the abbreviation for Byte Order Mark.
Resources
- http://blog.joshuaeichorn.com/archives/2005/09/30/html_ajax-021-released/#comment-5133
- http://www.gravitonic.com/do_download.php?download_file=talks/oscon2005/php_unicode_oscon2005.pdf
- http://www.acko.net/blog/unicode-in-php
- http://www.w3.org/International/
- http://www.xencraft.com/training/webstandards.html
- http://www.joelonsoftware.com/articles/Unicode.html
- http://intertwingly.net/stories/2004/04/14/i18n.html
- http://www.onlamp.com/pub/a/php/2002/11/28/php_i18n.html
- http://www.randomchaos.com/document.php?source=php_and_unicode
- http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
- http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html
- http://www.webreference.com/dlab/books/html/39-0.html
- http://pear.php.net/package/I18N_UnicodeString
- http://keithdevens.com/weblog/archive/2004/Jun/29/UTF-8.regex
- http://blog.ajohnstone.com/index.php?p=7
- http://dev.mysql.com/tech-resources/articles/4.1/unicode.html (good ideas for testing charset of data posted from a from)
- http://www.w3.org/International/questions/qa-forms-utf-8.html
- http://www.nathan-syntronics.de/midgard/midcom/utf8.html
- http://derickrethans.nl/files/wereldveroverend-ffm2004.pdf
- http://www.oracle.com/technology/tech/opensource/php/globalizing_oracle_php_applications.pdf
- http://minutillo.com/steve/weblog/2004/6/17/php-xml-and-character-encodings-a-tale-of-sadness-rage-and-data-loss
- http://blog.bitflux.ch/archive/how-to-get-rid-of-invalid-utf-8-characters.html
- http://wact.sourceforge.net/docs/doku.php?id=php:i18n
- http://www.sitepoint.com/blog-post-view.php?id=254984
- http://wact.sourceforge.net/docs/doku.php?id=php:i18n:charsets
- http://farm.tucows.com/blog/_archives/2003/10/16/4630.html
- http://laughingmeme.org/archives/2004_03_21.html
Tests
1. form submission
- encoding utf-8
- lang french
- Content-Type: application/x-www-form-urlencoded
- data:
action=send&token=02186cc2b41b4c4d64fcde3d611884a4&contact%5Bfirst_name%5D=&contact%5Blast_name%5D=User&contact%5Bemail%5D=demian%40phpkitchen.com&contact%5Btype%5D=General+enquiry&contact%5Bcomment%5D=Veuillez+remplir+les+champs+indiqu%C3%A9s+et+recommencer&submitted=Envoyer
- encoding iso-8859-1
- lang french
- Content-Type: application/x-www-form-urlencoded
- data:
action=send&token=02186cc2b41b4c4d64fcde3d611884a4&contact%5Bfirst_name%5D=&contact%5Blast_name%5D=User&contact%5Bemail%5D=demian%40phpkitchen.com&contact%5Btype%5D=General+enquiry&contact%5Bcomment%5D=Veuillez+remplir+les+champs+indiqu%C3%A9s+et+recommencer&submitted=Envoyer
- encoding utf-8
- lang french
- Content-Type: multipart/form-data;
- data: Veuillez remplir les champs indiqués et recommencer
- for db storage: utf-8
- for data manip by PHP: unicode
- for display in x/html: utf-8
input / output encoding with GET/POST
process:
- client browser
- http post
- server response
- client browser
input GET (iso-8859-1)(application/x-www-form-urlencode):
param: input= <html> is a tag param: contentType = text/plain
output:
input=%3Chtml%3E+is+a+tag&contentType=text%2Fplain&submit=
input POST (iso-8859-1)(application/x-www-form-urlencode):
param: input= <html> is a tag param: contentType = text/plain
output:
input=%3Chtml%3E+is+a+tag&contentType=text%2Fplain&submit=
input POST (iso-8859-1)(multipart/form-data):
param: input= <html> is a tag param: contentType = text/plain
output:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~-12141224864143487581745862222 Content-Disposition: form-data; name="input"
<html> is a tag
~~~~~~~~~~~~~~~~~~~~~~~~~~~~-12141224864143487581745862222 Content-Disposition: form-data; name="contentType"
text/plain
Character encoding with PHP
$str = '<html> is a tag'; print urlencode($str); // %3Chtml%3E+is+a+tag $str = '<html> is a tag'; print rawurlencode($str); // %3Chtml%3E%20is%20a%20tag $str = '<html> is a tag'; print utf8_encode($str); // <html> is a tag $str = '<html> is a tag'; print base64_encode($str); // PGh0bWw+IGlzIGEgdGFn
Notes
- UTF-8 is a multi-byte encoded character set
- good description of detecting Byte Order Mark in .NET : http://www.devhood.com/tutorials/tutorial_details.aspx?tutorial_id=469
- Linux/Unix does not use any BOMs and signatures
- On POSIX systems, the selected locale identifies already the encoding expected in all input and output files of a process
- you can detect charset encoding in PHP with the multibyte functions: http://uk.php.net/manual/en/ref.mbstring.php
- good Unicode FAQ with lots of info on the difference between utf-8 and ascii: http://pipin.tmd.ns.ac.yu/unicode/unicode-faq.html
- check out the How do I have to modify my software? section from the above link
- UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
- All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set.
- All possible 231 UCS (Universal Character Set) codes can be encoded
- UTF-8 encoded characters may theoretically be up to six bytes long
- if you want to display two charsets on a single page, impossible without utf-8 encoding, ie, comments in french and japanese
- how data in web forms gets encoded in POST/GET : http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html
