Ohio Representative and Speaker of the House John Boehner tweeted this a few days ago. Note that this is not a political blog post.
After embarking on a record spending binge that’s left us deeper in debt, where are the jobs? #AskObama
During the #AskObama live Twitter event, the tweets came up on a big plasma screen. This tweet came up "garbled" and said:
After embarking on a record spending binge that’s left us deeper in debt, where are the jobs? #AskObama
And a million programmers, regardless of political party, groaned in unison. First, because someone screwed up their UTF-8 decoding, by not doing it, and second, because our President doesn't recognize a text encoding bug when he sees one! Well, maybe that second one was just me, but still. Tragic. The President then teased the Speaker for his typing while newspapers and news organizations struggled to get their minds around this "garbled tweet."
Well, Boehner could have tweeted "that's left us deeper..." but he tweeted "that’s." Note the "smart" apostrophe. He used Tweetdeck to tweet it, and it was likely on a Mac. It's also possible that he wrote the tweet in Microsoft Word and then copy-pasted it, as Word loves to change straight quotes and apostrophes like ' into smart quotes and smart apostrophes with direction, like this ’.
I can get John Boehner's User ID (not his twitter name, but the number that represents John) with this online tool http://www.idfromuser.com. I see that it's 5357812, so I can get his timeline as RSS (Really Simple Syndication)/XML like this: http://twitter.com/statuses/user_timeline/5357812.rss or JSON (JavaScript Object Notation) like this http://twitter.com/statuses/user_timeline/5357812.json
When I ask for this timeline, the HTTP Headers say it's encoded as "UTF-8", see?
Content-Type: application/json; charset=utf-8
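If you want to honor that header in code, here's a minimal sketch in Python using only the standard library. The helper name `charset_from_content_type` is my own invention for illustration, and falling back to UTF-8 when no charset is given is an assumption that suits JSON, not a universal rule.

```python
from email.message import Message

def charset_from_content_type(header_value, default="utf-8"):
    """Pull the charset parameter out of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset lowercased,
    # or the fallback we pass in when the header has no charset parameter.
    return msg.get_content_charset(default)

charset = charset_from_content_type("application/json; charset=utf-8")
print(charset)  # utf-8
```

Whatever comes back is the name you'd hand to `bytes.decode()` before doing anything else with the response body.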
I blogged about the "Importance of being UTF-8" about five years ago. If you look at the JSON and find the tweet with the ID 88618213008621568, you can see the raw text encoded in JSON:
"text":"After embarking on a record spending binge that\u2019s left us deeper in debt, where are the jobs?"
See that \u2019? In Windows (you have this program even if you aren't a developer) go to the Start Menu and run "Charmap." Look around and you can see U+2019 is Right Single Quotation Mark. Note that it's WAY down in the list of all the characters. It's not a basic character like A to Z or a to z. It's one of those special things that looks nice, but causes trouble later.
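If you don't have Charmap handy, the same lookup can be sketched in Python. The `raw_json` string below is a trimmed-down stand-in for the real timeline response (just the one field), so treat it as an assumption, not the actual payload.

```python
import json
import unicodedata

# A cut-down stand-in for the JSON timeline: only the "text" field.
# The r prefix keeps \u2019 as a literal six-character escape, as it is on the wire.
raw_json = r'{"text": "After embarking on a record spending binge that\u2019s left us deeper in debt, where are the jobs?"}'

# The JSON parser turns the \u2019 escape into a single real character.
text = json.loads(raw_json)["text"]
apostrophe = text[text.index("that") + len("that")]

code_point = f"U+{ord(apostrophe):04X}"
name = unicodedata.name(apostrophe)
print(code_point, name)  # U+2019 RIGHT SINGLE QUOTATION MARK

# For comparison, the plain keyboard apostrophe:
print(f"U+{ord(chr(0x27)):04X}", unicodedata.name("'"))  # U+0027 APOSTROPHE
```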
If I make a text file in Notepad that looks like this and name it text.txt, for example, and Save As, making sure to use UTF-8 as the encoding...
After embarking on a record spending binge that’s left us deeper in debt, where are the jobs?
...then load it into any free HEX editor (or even an online one!) I see this:
Note that the part where the ’ was is actually three full bytes! E2 80 99.
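No hex editor around? Here's a rough Python equivalent of what I did with Notepad, assuming you just want to see the bytes and don't care about the file itself. The variable names are mine.

```python
tweet = "After embarking on a record spending binge that\u2019s left us deeper in debt, where are the jobs?"

# "Save As UTF-8", minus the file: turn the characters into bytes.
encoded = tweet.encode("utf-8")

# One character in the text, three bytes on disk.
apostrophe_bytes = "\u2019".encode("utf-8")
print(" ".join(f"{b:02X}" for b in apostrophe_bytes))  # E2 80 99

# So the byte count is two more than the character count.
print(len(encoded) - len(tweet))  # 2
```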
Well, UTF-8 is an encoding whose goal was to not only support a bajillion different characters but also to be backwards compatible with ASCII, the American Standard Code for Information Interchange. If it wasn't, we wouldn't be able to see MOST of the characters in this tweet! In this case, just the ’ is goofy.
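To see that backwards compatibility for yourself, here's a small Python sketch. It splits the tweet into the characters ASCII knows about and the ones it doesn't; the names `non_ascii` and `ascii_only` are just mine.

```python
tweet = "After embarking on a record spending binge that\u2019s left us deeper in debt, where are the jobs?"

# Anything above 0x7F is outside ASCII.
non_ascii = [ch for ch in tweet if ord(ch) > 0x7F]
print(non_ascii)  # ['’']

# Everything else encodes to exactly the same bytes in ASCII and UTF-8,
# which is why the rest of the tweet survived the wrong decoder.
ascii_only = "".join(ch for ch in tweet if ord(ch) <= 0x7F)
same_bytes = ascii_only.encode("utf-8") == ascii_only.encode("ascii")
print(same_bytes)  # True
```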
The code point was U+2019, which is 0010 0000 0001 1001, says Windows Calculator in Programmer Mode. You have this too, Dear Reader. There's some variable width encoding going on, that you can read about on Wikipedia.
This value of U+2019 expands to 0010 0000 0001 1001, as I said, which then expands according to these rules:
zzzzyyyy yyxxxxxx ->
1110zzzz
10yyyyyy
10xxxxxx
Which gives us this
11100010 -> E2
10000000 -> 80
10011001 -> 99
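Those bit-shuffling rules can be sketched in a few lines of Python. This is a hand-rolled illustration of the three-byte case only (the function name `encode_three_byte` is made up, and it deliberately ignores the one-, two-, and four-byte forms and the surrogate range); in real code you'd just call `str.encode("utf-8")`.

```python
def encode_three_byte(code_point):
    """Encode a code point in U+0800..U+FFFF using the 1110zzzz 10yyyyyy 10xxxxxx pattern."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles the three-byte range")
    byte1 = 0b11100000 | (code_point >> 12)            # 1110zzzz: top 4 bits
    byte2 = 0b10000000 | ((code_point >> 6) & 0b111111)  # 10yyyyyy: middle 6 bits
    byte3 = 0b10000000 | (code_point & 0b111111)         # 10xxxxxx: low 6 bits
    return bytes([byte1, byte2, byte3])

result = encode_three_byte(0x2019)
print(" ".join(f"{b:02X}" for b in result))  # E2 80 99

# Sanity check against Python's built-in encoder.
print(result == "\u2019".encode("utf-8"))  # True
```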
hence, "that’s" is encoded as
74 68 61 74 E2 80 99 73
The E2 80 99 in the middle is the ’. Read those bytes back in, this time as Extended ASCII (the ANSI Windows-1252 code page), and each of the three bytes becomes its own character, so the ’ gets expanded:
that’s
Made it this far? Why didn't I just say "The software read in a UTF-8 encoded JSON stream of tweets and displayed it with an ANSI Windows Code Page 1252." Because that wouldn't be nearly as fun.
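Here's that one-sentence summary as a Python sketch: the same bytes decoded the wrong way and the right way. I obviously don't know what the vendor's code looked like, so this only reproduces the symptom, not their implementation.

```python
# What came down the wire: UTF-8 bytes.
utf8_bytes = "that\u2019s".encode("utf-8")

# The bug: treating those bytes as Windows-1252.
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # that’s

# The fix: decode with the encoding the server said it used.
correct = utf8_bytes.decode("utf-8")
print(correct)  # that’s
```

One apostrophe in, three characters out: E2 becomes â, 80 becomes € and 99 becomes ™.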
Either way, the company that did this for the White House definitely goofed up and should have tested this. This is SUCH a classic sloppy programmer mistake that I'm disappointed to see it showcased so blatantly. I hope they (the vendor) feel a little bad. The company appears to be called "Mass Relevance" and here are some news articles about Mass Relevance and their "Tweet Curation."
Testing, testing, testing, my friends. And not only testing, but KNOW this stuff. They don't always teach it in schools and no one will learn until they see their bug on national TV in front of the President of the United States. ;)
Text encoding is fun for all ages. Enjoy!
* Like this post? Put me on TV, folks. This is the kind of stuff that a real technology journalist *Pogue* would love to share with the people! ABC News? I'm available and I have Skype. Call my people. ;)
© 2011 Scott Hanselman. All rights reserved.