Character Set Special Characters


5

  • Is iso-8859-1 a proper subset of utf-8?
  • What about iso-8859-n?
  • What about windows-1252?

If the answer is no to any of the above, what are the disjoint characters? I'm testing some logic that detects charsets and want to write tests to verify the detection is working properly.

2012-04-05 01:42
by Sean Jezewski


8

Is iso-8859-1 a proper subset of utf-8?

The character repertoire of ISO-8859-1 (the first 256 characters of Unicode) is a proper subset of that of UTF-8 (every Unicode character).

However, the characters U+0080 to U+00FF are encoded differently in the two encodings.

  • ISO-8859-1 assigns each of these characters a single byte from 80 to FF.
  • UTF-8 encodes the same characters as two-byte sequences C2 80 to C3 BF.
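
As a rough illustration of the byte-level difference described above, here is a minimal Python sketch using only the standard codecs. The helper names `encode_both` and `is_valid_utf8` are made up for this example; they are not part of any library.

```python
def encode_both(code_point):
    """Return (ISO-8859-1 bytes, UTF-8 bytes) for one code point up to U+00FF."""
    ch = chr(code_point)
    return ch.encode("iso-8859-1"), ch.encode("utf-8")


def is_valid_utf8(data):
    """True if the byte string decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+0041 is ASCII and encodes identically; the others differ between encodings.
for cp in (0x41, 0x80, 0xE9, 0xFF):
    latin1, utf8 = encode_both(cp)
    print(f"U+{cp:04X}  iso-8859-1: {latin1.hex()}  utf-8: {utf8.hex()}")

# "é" as ISO-8859-1 is the lone byte E9, which is not valid UTF-8 on its own.
print(is_valid_utf8("é".encode("iso-8859-1")))
print(is_valid_utf8("é".encode("utf-8")))
```

For detection tests this suggests that any ISO-8859-1 text containing a lone byte in the 80 to FF range makes a useful "must not be detected as UTF-8" case, while pure ASCII input cannot distinguish the two encodings at all.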

What about iso-8859-n?

These are 15 different encodings that contain a total of 614 distinct characters. Some of these characters occur in multiple "parts" of ISO 8859, and some don't. You'll have to be more specific.

I see that your question is tagged ISO-8859-2. The characters that are in -2 that aren't in -1 are:

ĂăĄąĆćČčĎďĐđĘęĚěĹĺĽľŁłŃńŇňŐőŔŕŘřŚśŞşŠšŢţŤťŮůŰűŹźŻżŽžˇ˘˙˛˝
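
A list like this can be regenerated rather than copied by hand. The sketch below assumes Python's built-in `iso-8859-1` and `iso-8859-2` codecs and decodes every single byte to collect each encoding's repertoire; `repertoire` is a hypothetical helper name.

```python
def repertoire(encoding):
    """Return the set of characters that the single bytes 0x00..0xFF decode to."""
    chars = set()
    for byte in range(256):
        try:
            chars.add(bytes([byte]).decode(encoding))
        except UnicodeDecodeError:
            pass  # this byte is unassigned in the given encoding
    return chars


# Characters present in ISO-8859-2 but absent from ISO-8859-1.
only_in_latin2 = repertoire("iso-8859-2") - repertoire("iso-8859-1")
print(len(only_in_latin2))
print("".join(sorted(only_in_latin2)))
```

The same set difference works for any other pair of single-byte encodings, so it can feed test fixtures for whichever ISO-8859-n parts the detector needs to tell apart.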

What about windows-1252?

Windows-1252 is just like ISO-8859-1 except that it replaces most of the rarely used control characters in the 0x80-0x9F range with printable characters (27 of the 32 positions are assigned). The characters that are in windows-1252 but not in ISO-8859-1 are:

ŒœŠšŸŽžƒˆ˜–—‘’‚“”„†‡•…‰‹›€™
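
To see where these characters sit, the following sketch walks the 0x80-0x9F range with Python's `cp1252` codec. It relies on that codec's behaviour of rejecting the five unassigned bytes; other implementations (web browsers, for instance) map those bytes to control characters instead, so treat the "undefined" list as specific to this codec.

```python
printable = {}   # byte value -> character assigned by windows-1252
undefined = []   # byte values that Python's cp1252 codec refuses to decode

for byte in range(0x80, 0xA0):
    try:
        printable[byte] = bytes([byte]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(byte)

print("".join(printable.values()))
print([hex(b) for b in undefined])

# In ISO-8859-1 the same byte decodes to a C1 control character instead.
print(repr(b"\x80".decode("iso-8859-1")), repr(b"\x80".decode("cp1252")))
```

For detection tests, a byte such as 0x80 or 0x93 surrounded by otherwise ordinary text is a strong hint of windows-1252 rather than ISO-8859-1, since genuine ISO-8859-1 text rarely contains C1 controls.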

2012-04-05 02:33
by dan04
So you're saying that the repertoire of iso-8859-1 is a proper subset of the repertoire of utf-8? I believe that. What I'm not sure about is that the repertoire of utf-8 is equal to the repertoire of unicode. I thought the purpose of utf-16 / utf-32 was to be able to encode more/all of the unicode characters respectively - Sean Jezewski 2012-04-05 19:27
Ahh .. I looked it up. Since UTF-8 can represent characters as multiple bytes it can express all of the unicode repertoire. This makes sense now - Sean Jezewski 2012-04-05 19:48


0

Unicode is a superset of all these character sets, and of pretty much all established character sets out there. You can find a list of mappings of all these character sets to Unicode code points here: http://unicode.org/Public/MAPPINGS/.

2012-04-05 02:11
by deceze