Kshitij Munankami: Character sets in XML

Review the correct use of character sets in XML, discussing the advantages and disadvantages of different sets, with examples. How would you select your character sets to present Chinese characters?

A character set determines which characters are allowed within an XML document. Character sets range from restrictive to broad: a restrictive character set might allow only uppercase letters, whereas a broad character set allows many characters.

Before an XML parser can read a document, it must know which character set and encoding the document uses. The character sets most relevant to XML are ASCII and Unicode.

ASCII is a widely used character set in which each character is represented by a numeric encoding value. Pure ASCII is a 7-bit encoding scheme, allowing 128 different values. 8-bit extensions of ASCII, often loosely called "ANSI" (ISO-8859-1 is one example), use the full range of 256 values available in a byte. The drawback of ASCII is that its 128 characters are not enough for text that needs accented letters or other special characters.
The character set specified for XML documents is Unicode, which contains characters from writing systems all around the world. The Universal Character Set (UCS) is an ISO standard (ISO/IEC 10646) that covers most of the world's writing systems. Because it uses multi-octet characters that are not compatible with many existing applications and protocols, the UCS Transformation Formats (UTF) were developed to overcome the compatibility issue. The two most widely used encoding schemes for Unicode are UTF-8 and UTF-16. UTF-8 uses 8-bit code units, one to four per character, and is compatible with 7-bit ASCII. UTF-16 uses 16-bit code units: a single unit can represent 65,536 possible values, and characters beyond that range take a pair of units. To state which encoding was used when the document was created, an XML document has an encoding declaration. http://www.developerfusion.com/article/3802/extensible-markup-language-xml-tutorial/5/
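As a rough illustration of the 7-bit limit, the sketch below checks whether a piece of text fits into ASCII. The helper name `fits_in_ascii` and the sample strings are made up for this example.

```python
def fits_in_ascii(text):
    """Return True if every character has a code point below 128 (7-bit ASCII)."""
    return all(ord(ch) < 128 for ch in text)

print(fits_in_ascii("Hello, XML"))  # plain English text
print(fits_in_ascii("café"))        # 'é' lies outside the 7-bit range

# ISO-8859-1 is one 8-bit extension of ASCII that does contain 'é',
# stored as a single byte.
print("café".encode("iso-8859-1"))
```

The first check succeeds and the second fails, which is the situation where an 8-bit extension or Unicode becomes necessary.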

For example:

<?xml version="1.0" encoding="US-ASCII"?>

<?xml version="1.0" encoding="ISO-8859-1"?>
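To see why the declaration matters, here is a small sketch using Python's standard `xml.etree.ElementTree` parser. The `<word>` element, the sample text and the helper `parse_word` are assumptions made for illustration only. The parser reads the declared encoding from the bytes it is given, so the same bytes parse correctly when labelled ISO-8859-1 but are rejected when mislabelled as UTF-8.

```python
import xml.etree.ElementTree as ET

def parse_word(data):
    """Parse a tiny XML document from raw bytes.

    Returns the root element's text, or None if the parser
    reports that the document is not well-formed.
    """
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None

# The bytes really are ISO-8859-1, and the declaration says so.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><word>café</word>'
).encode("iso-8859-1")
print(parse_word(latin1_doc))

# Same bytes, but the declaration now wrongly claims UTF-8:
# the lone 0xE9 byte is not valid UTF-8, so parsing fails.
mislabeled = latin1_doc.replace(b"ISO-8859-1", b"UTF-8")
print(parse_word(mislabeled))
```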

The advantages of the UTF-8 and UTF-16 encodings are as follows:

Both UTF-8 and UTF-16 cover the whole Unicode character set, which includes the characters from nearly all major national and international character sets.
Every XML processor is required to support these two encodings, so they are more widely supported than any others.
Both encodings can represent text in any of these languages without loss.
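The "without loss" point can be checked with a simple round trip: encode some text, decode it again, and compare. The sketch below does this for a few sample strings chosen only for illustration, and includes ISO-8859-1 as a contrast; `round_trips` is a hypothetical helper name.

```python
samples = ["English", "Ελληνικά", "नमस्ते", "中文"]

def round_trips(text, encoding):
    """True if text survives encoding and decoding unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        # The encoding has no way to represent some character.
        return False

for text in samples:
    results = {enc: round_trips(text, enc)
               for enc in ("utf-8", "utf-16", "iso-8859-1")}
    print(text, results)
```

Under these assumptions, every sample survives UTF-8 and UTF-16, while ISO-8859-1 only handles the English one.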

The main drawback is that they are not the native text file format on many systems. In simple terms, some text editors and viewers cannot work with such files directly.

Comparing Unicode and ASCII:

ASCII text is small in byte size because it uses only 1 byte per character. However, it limits the text you can use to the standard Latin alphabet, digits, punctuation and some control characters. This means that you cannot represent text in languages that need anything beyond the unaccented Latin alphabet. Mostly it just represents English.
Unicode allows you to represent text in pretty much any language. However, many characters need more than 1 byte, which increases the byte size. In UTF-16 every character takes at least 2 bytes, so a file of ASCII text is about half the size of the same text in UTF-16. In UTF-8, ASCII characters still take 1 byte each, so the size grows only for characters outside the ASCII range.
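The size trade-off can be measured directly. The following sketch compares the byte size of a short English string and a short Chinese string in UTF-8 and UTF-16; the sample strings and the helper `byte_size` are assumptions for illustration, and `utf-16-le` is used so that no byte-order mark is counted.

```python
def byte_size(text, encoding):
    """Number of bytes needed to store text in the given encoding."""
    return len(text.encode(encoding))

english = "Hello"
chinese = "你好世界"  # four Chinese characters

for label, text in (("english", english), ("chinese", chinese)):
    sizes = {enc: byte_size(text, enc) for enc in ("utf-8", "utf-16-le")}
    print(label, len(text), "characters:", sizes)

# ASCII-only text produces identical bytes in ASCII and UTF-8.
print(english.encode("ascii") == english.encode("utf-8"))
```

For these samples the English text is smaller in UTF-8 and the Chinese text is smaller in UTF-16. That suggests, as a rule of thumb rather than a firm recommendation, that either Unicode encoding can present Chinese characters and the choice between them mostly affects file size.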

Kshitij Munankami

Wednesday, 1 December 2010
