usefor-article-04 April 2001
4.4.1. Character Sets within Article Headers
Within article headers, characters are represented as octets
according to the UTF-8 encoding scheme [ISO 10646] or [RFC 2279] and
hence all the characters in the Universal Multiple-Octet Coded
Character Set (UCS) [ISO 10646] (which is essentially a superset of
Unicode [UNICODE] and expected to remain so) are potentially
available. However, interpreting the octets directly as US-ASCII
characters should ensure correct behaviour in most situations.
NOTE: UTF-8 is an encoding for 16-bit (and even 32-bit) character
sets with the property that any octet less than 128 immediately
represents the corresponding US-ASCII character, thus ensuring
upwards compatibility with previous practice. Non-ASCII
characters from UCS are represented by sequences of octets
satisfying the syntax of a UTF8-xtra-char (2.4). Only those
octet sequences explicitly permitted by [RFC 2279] shall be
used. UCS includes all characters from the ISO-8859 series of
character sets [ISO 8859] (which includes all Greek and Arabic
characters) as well as the more elaborate characters used in
Japan and China. See the following section for the appropriate
treatment of UCS characters by reading agents.
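The property described in the NOTE can be illustrated with a minimal Python sketch. The function names octets_per_character and is_permitted_utf8 are hypothetical and appear nowhere in this standard. Python's strict "utf-8" decoder is assumed here as a stand-in for the set of permitted sequences; it follows a later and somewhat stricter definition of UTF-8 than the one cited above, so it is an approximation rather than the rule itself.

```python
def octets_per_character(text: str) -> list[tuple[str, list[int]]]:
    """Pair each character with the UTF-8 octets that represent it."""
    return [(ch, list(ch.encode("utf-8"))) for ch in text]


def is_permitted_utf8(octets: bytes) -> bool:
    """Report whether the octets form a well-formed UTF-8 sequence.

    Assumption: Python's strict decoder approximates the permitted set.
    """
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# US-ASCII characters occupy one octet below 128; others use several
# octets, each of which is 128 or above.
print(octets_per_character("Zoë"))
print(is_permitted_utf8(b"Zo\xc3\xab"))  # well-formed two-octet sequence
print(is_permitted_utf8(b"\xc0\xaf"))    # overlong encoding of "/"
```

In this sketch "Z" and "o" each map to a single octet equal to their US-ASCII code, while the non-ASCII character maps to two octets, and the overlong sequence is rejected.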
Notwithstanding the great flexibility permitted by UTF-8, there is
need for restraint in its use in order that the essential components
of headers may be discerned using reading agents that cannot present
the full UCS range. In particular, header-names and tokens MUST be in
US-ASCII, and certain other components of headers, as defined
elsewhere in this standard - notably msg-ids, date-times, dot-atoms,
domains and path-identities - MUST be in US-ASCII. Comments, phrases
(as in addresses) and unstructureds (as in Subject headers) MAY use
the full range of UTF-8 characters. For newsgroup-names see 5.5.
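The restrictions above might be checked along the following lines. This is a coarse sketch under stated assumptions, not a conforming parser: check_header and ASCII_ONLY_HEADERS are hypothetical names, the input is assumed to be a single unfolded header line, and the whole content of Message-ID and Path is treated as US-ASCII only, which ignores the finer grammar defined elsewhere in this standard.

```python
# Hypothetical simplification: headers whose content is treated as
# consisting solely of components that MUST be in US-ASCII.
ASCII_ONLY_HEADERS = {b"message-id", b"path"}


def check_header(raw: bytes) -> list[str]:
    """Return the problems found in one unfolded header line."""
    name, sep, content = raw.partition(b":")
    if not sep:
        return ["no colon separating header-name from content"]

    problems = []
    # Header-names MUST be US-ASCII (printable, no spaces, in this sketch).
    if not name or not all(33 <= octet <= 126 for octet in name):
        problems.append("header-name is not printable US-ASCII")
    # Content that MAY use the full range must still be valid UTF-8.
    try:
        content.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("content is not valid UTF-8")
    # Components such as msg-ids and path-identities MUST be US-ASCII.
    if name.lower() in ASCII_ONLY_HEADERS and any(o >= 128 for o in content):
        problems.append("content of this header must be US-ASCII")
    return problems


print(check_header("Subject: Grüße".encode("utf-8")))
print(check_header("Message-ID: <é@example.invalid>".encode("utf-8")))
```

A Subject header carrying UTF-8 text passes, whereas a Message-ID containing a non-ASCII character is reported.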
Where the use of non-ASCII characters, encoded in UTF-8, is permitted
as above, they MAY also be encoded using the MIME mechanism defined
in [RFC 2047], but this usage is deprecated within news articles
(even though it is required in mail messages) since it is less
legible in older reading agents which support neither it nor UTF-8.
Nevertheless, reading agents SHOULD support this usage, but only in
those contexts explicitly mentioned in [RFC 2047].
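As an illustration of how a reading agent might support the [RFC 2047] usage, the sketch below relies on Python's standard email.header module; nothing in this standard prescribes that choice. The helper name decode_encoded_words is hypothetical, and the sketch does not attempt to restrict decoding to the contexts in which [RFC 2047] permits encoded-words.

```python
from email.header import decode_header, make_header


def decode_encoded_words(value: str) -> str:
    """Decode any RFC 2047 encoded-words in a header value to text."""
    # decode_header splits the value into (octets, charset) parts;
    # make_header reassembles them into a single Unicode string.
    return str(make_header(decode_header(value)))


# The same word, once as an ISO-8859-1 "Q" encoded-word and once as a
# UTF-8 "B" encoded-word.
print(decode_encoded_words("=?ISO-8859-1?Q?Gr=FC=DFe?="))
print(decode_encoded_words("=?UTF-8?B?R3LDvMOfZQ==?="))
```

Both calls yield the same text, which shows why an agent that supports neither this mechanism nor UTF-8 would present only the raw encoded form to the reader.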
#Diff to first older
--- ../usefor-article-03/Character_Sets_within_Article_Headers.out February 2000
+++ ../usefor-article-04/Character_Sets_within_Article_Headers.out April 2001
@@ -1,11 +1,12 @@
4.4.1. Character Sets within Article Headers
- Within article headers, the CES is UTF-8 [ISO 10646] or [RFC 2279]
- and hence the CCS is the Universal Multiple-Octet Coded Character Set
- (UCS) [ISO 10646] (which is essentially a superset of Unicode
- [UNICODE] and expected to remain so). However, interpreting the
- octets directly as US-ASCII characters should ensure correct
- behaviour in most situations.
+ Within article headers, characters are represented as octets
+ according to the UTF-8 encoding scheme [ISO 10646] or [RFC 2279] and
+ hence all the characters in the Universal Multiple-Octet Coded
+ Character Set (UCS) [ISO 10646] (which is essentially a superset of
+ Unicode [UNICODE] and expected to remain so) are potentially
+ available. However, interpreting the octets directly as US-ASCII
+ characters should ensure correct behaviour in most situations.
NOTE: UTF-8 is an encoding for 16bit (and even 32bit) character
sets with the property that any octet less than 128 immediately