Thursday, March 15, 2012

Unicode

OK. I'm an idiot.

So here's today's BIG Unicode lesson; understand this and, maybe, half your troubles will evaporate.

Unicode is NOT a "code".

No. Unicode is a kind of platonic ideal, an abstract set of characters (each with its own code point), of which everything else is an "encoding".

ASCII is an encoding. UTF-8 is an encoding. That weird character set you got with Portuguese accented letters is an encoding.

Hence the verb "encode" means to turn a Unicode string into a byte string.

And "decode" means to turn a byte string (say, one imported from another application) back into pure Unicode.
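
To make the two verbs concrete, here is a minimal sketch in Python 3, where `str` is the Unicode string and `bytes` is the byte string. The sample word and the choice of UTF-8 and Latin-1 are just assumptions picked for illustration.

```python
# A Unicode string: Portuguese "acao" with cedilla and tilde, written with escapes.
text = "a\u00e7\u00e3o"

# encode: Unicode string -> byte string. The bytes depend on the encoding chosen.
utf8_bytes = text.encode("utf-8")      # two bytes for each accented letter
latin1_bytes = text.encode("latin-1")  # one byte for each accented letter

# decode: byte string -> Unicode string. Use the encoding the bytes were written in.
from_utf8 = utf8_bytes.decode("utf-8")
from_latin1 = latin1_bytes.decode("latin-1")

print(utf8_bytes)
print(latin1_bytes)
print(from_utf8 == from_latin1 == text)
```

Same Unicode string, two different byte strings. Decoding each one with the encoding it was written in gives back the identical Unicode.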

I repeat. You DO NOT encode byte strings into Unicode strings. You decode them into Unicode. And then you re-encode them when you want to export them (as, say, XML or JSON).

read --> decode --> do stuff in your app --> encode --> write
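
As a rough sketch of that pipeline in Python 3: the function name `process`, the default encodings, and the "do stuff" step (upper-casing) are all made up for illustration, and the hard-coded bytes stand in for whatever you actually read from a file or socket.

```python
def process(raw: bytes, source_encoding: str = "latin-1",
            target_encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: decode at the edge, work in Unicode, encode on the way out."""
    text = raw.decode(source_encoding)   # decode: bytes -> Unicode
    text = text.upper()                  # do stuff in your app, on Unicode only
    return text.encode(target_encoding)  # encode: Unicode -> bytes


# "read": pretend these Latin-1 bytes came from another application.
incoming = b"a\xe7\xe3o"

# "write": these UTF-8 bytes are what you would send back out.
outgoing = process(incoming)
print(outgoing)
```

The point of the shape is that bytes only exist at the edges of the program; everything in the middle sees Unicode.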

Thanks ... that's all.

