blog:fall2015:mmalik1:journal [2015/11/17 04:48] – week11 opus mmalik1 | blog:fall2015:mmalik1:journal [2015/11/24 01:16] (current) – week12 mmalik1 | ||
---|---|---|---|
Line 70: | Line 70: | ||
Effectively, | Effectively, | ||
+ | |||
+ | ===November 23, 2015=== | ||
+ | |||
+ | Ever wonder about that mysterious Content-Type tag? You know, the one you're supposed to put in HTML and you never quite know what it should be? | ||
+ | |||
+ | Did you ever get an email from your friends in Bulgaria with the subject line "???? ?????? ??? ????"? | ||
+ | |||
+ | I've been dismayed to discover just how many software developers aren't really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff. A couple of years ago, a beta tester for FogBUGZ was wondering whether it could handle incoming email in Japanese. Japanese? They have email in Japanese? I had no idea. When I looked closely at the commercial ActiveX control we were using to parse MIME email messages, we discovered it was doing exactly the wrong thing with character sets, so we actually had to write heroic code to undo the wrong conversion it had done and redo it correctly. When I looked into another commercial library, it, too, had a completely broken character code implementation. I corresponded with the developer of that package and he sort of thought they " | ||
+ | |||
+ | But it won't. When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, | ||
+ | |||
+ | So I have an announcement to make: if you are a programmer working in 2003 and you don't know the basics of characters, character sets, encodings, and Unicode, and I catch you, I'm going to punish you by making you peel onions for 6 months in a submarine. I swear I will. | ||
+ | |||
+ | And one more thing: | ||
+ | |||
+ | IT'S NOT THAT HARD. | ||
+ | |||
+ | In this article I'll fill you in on exactly what every working programmer should know. All that stuff about "plain text = ascii = characters are 8 bits" is not only wrong, it's hopelessly wrong, and if you're still programming that way, you're not much better than a medical doctor who doesn' | ||
+ | |||
+ | Before I get started, I should warn you that if you are one of those rare people who knows about internationalization, | ||
+ | |||
+ | ==A Historical Perspective== | ||
+ | |||
+ | The easiest way to understand this stuff is to go chronologically. | ||
+ | |||
+ | You probably think I'm going to talk about very old character sets like EBCDIC here. Well, I won't. EBCDIC is not relevant to your life. We don't have to go that far back in time. | ||
+ | |||
+ | Back in the semi-olden days, when Unix was being invented and K&R were writing The C Programming Language, everything was very simple. EBCDIC was on its way out. The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. Space was 32, the letter " | ||
+ | |||
+ | And all was good, assuming you were an English speaker. | ||
+ | |||
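To make the ASCII numbering concrete, here is a minimal sketch in Python (my choice of language for illustration; the text above names none). It uses only the built-in `ord`, and the variable names are made up for this example.

```python
# Look up the numeric code of a few characters with the built-in ord().
space_code = ord(" ")                      # the text above says space is 32
codes = {ch: ord(ch) for ch in "Az~"}      # a few printable ASCII characters

# Every character of a plain, unaccented English string stays below 128,
# so it fits in 7 bits.
text = "Hello, world"
all_seven_bit = all(ord(ch) < 128 for ch in text)

print(space_code)
print(codes)
print(all_seven_bit)
```

Anything with an accent or a non-Latin letter would fail the `< 128` check, which is where the trouble described next begins.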
+ | Because bytes have room for up to eight bits, lots of people got to thinking, "gosh, we can use the codes 128-255 for our own purposes." | ||
+ | |||
+ | Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages. So for example in Israel DOS used a code page called 862, while Greek users used 737. They were the same below 128 but different from 128 up, where all the funny letters resided. The national versions of MS-DOS had dozens of these code pages, handling everything from English to Icelandic and they even had a few " | ||
+ | |||
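The "same below 128, different above" behaviour can be checked directly. This is a small sketch, assuming Python's standard codecs `cp862` and `cp737` correspond to the two DOS code pages mentioned above; the byte value 0x80 is just one arbitrary pick from the upper range.

```python
# One byte value, interpreted under two different DOS code pages.
high_byte = bytes([0x80])                 # first value in the 128-255 range
hebrew = high_byte.decode("cp862")        # code page 862
greek = high_byte.decode("cp737")         # code page 737

# Below 128 the two code pages agree with each other.
low_bytes = b"Hello"
same_below_128 = low_bytes.decode("cp862") == low_bytes.decode("cp737")

print(repr(hebrew), repr(greek))
print(same_below_128)
```

The bytes on disk are identical in both cases; only the code page used to read them changes which letter you see.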
+ | Meanwhile, in Asia, even more crazy things were going on to take into account the fact that Asian alphabets have thousands of letters, which were never going to fit into 8 bits. This was usually solved by the messy system called DBCS, the " | ||
+ | |||
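That a character does not always fit in one byte is easy to see with any multi-byte legacy encoding. The sketch below uses Shift-JIS purely as an assumption for illustration; the text above does not name a specific encoding.

```python
# Shift-JIS is only an example of a multi-byte encoding used in Asia;
# the surrounding text does not single it out.
latin = "A".encode("shift_jis")           # a plain English letter
kana = "\u3042".encode("shift_jis")       # HIRAGANA LETTER A

mixed = "A\u3042"                         # two characters
mixed_bytes = mixed.encode("shift_jis")

print(len(latin), len(kana))
print(len(mixed), len(mixed_bytes))
```

Two characters, three bytes: counting bytes no longer tells you how many characters a string holds.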
+ | But still, most people just pretended that a byte was a character and a character was 8 bits and as long as you never moved a string from one computer to another, or spoke more than one language, it would sort of always work. But of course, as soon as the Internet happened, it became quite commonplace to move strings from one computer to another, and the whole mess came tumbling down. Luckily, Unicode had been invented. | ||
+ | |||
+ | ==Unicode== | ||
+ | |||
+ | Unicode was a brave effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too. Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don't feel bad. | ||
+ | |||
+ | In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense. | ||
+ | |||
+ | Until now, we've assumed that a letter maps to some bits which you can store on disk or in memory: | ||
+ | |||
+ | A -> 0100 0001 | ||
+ | |||
+ | In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole nuther story. | ||
+ | |||
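The split between a code point and its stored bytes can be shown in a few lines of Python. This is only a sketch: UTF-8, UTF-16 and UTF-32 are encodings I am bringing in as examples, not ones this passage has introduced.

```python
letter = "A"

# The code point is just a number attached to the abstract letter.
code_point = ord(letter)

# The same code point stored three different ways
# (little-endian variants, so no byte-order mark is added).
as_utf8 = letter.encode("utf-8")
as_utf16 = letter.encode("utf-16-le")
as_utf32 = letter.encode("utf-32-le")

print(code_point)
print(as_utf8.hex(), as_utf16.hex(), as_utf32.hex())
```

The letter and its number stay the same in all three cases; only the bytes written to memory or disk differ.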
+ | In Unicode, the letter A is a platonic ideal. It's just floating in heaven: | ||
+ | |||
+ | A | ||
+ | |||
+ | This platonic A is different than B, and different from a, but the same as A and A and A. The idea that A in a Times New Roman font is the same character as the A in a Helvetica font, but different from " | ||
+ | |||
+ | Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. | ||
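The U+ notation is simply the code point written in hexadecimal. A minimal sketch, assuming Python's standard `unicodedata` module; the helper `u_notation` is a hypothetical name made up for this example.

```python
import unicodedata


def u_notation(ch: str) -> str:
    """Format a single character's code point in U+XXXX style."""
    return f"U+{ord(ch):04X}"


ain = chr(0x0639)                 # the character behind U+0639
name = unicodedata.name(ain)      # its official Unicode name

print(u_notation("A"))
print(u_notation(ain), name)
```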
====Data Structures==== | ====Data Structures==== |