usefor-article-06 November 2001
4.4.1. Character Sets within Article Headers
Within article headers, characters are represented as octets
according to the UTF-8 encoding scheme [RFC 2279] or [ISO/IEC 10646],
and hence all the characters in Unicode [UNICODE 3.1] or in the
Universal Multiple-Octet Coded Character Set (UCS) [ISO/IEC 10646]
(which is essentially a superset of Unicode and expected to remain
so) are potentially available. However, processing all octets in the
same manner as US-ASCII characters should ensure correct behaviour in
most situations.
NOTE: UTF-8 is an encoding for 16-bit (and even 32-bit) character
sets with the property that any octet less than 128 immediately
represents the corresponding US-ASCII character, thus ensuring
upwards compatibility with previous practice. Non-ASCII
characters from Unicode are represented by sequences of octets
satisfying the syntax of a UTF8-xtra-char (2.4), which excludes
certain octet sequences not explicitly permitted by [RFC 2279].
Unicode includes all characters from the ISO-8859 series of
character sets [ISO 8859] (which cover the Cyrillic, Greek and
Arabic alphabets, among others) together with the more elaborate
characters used in Asian countries. See the following section
for the appropriate treatment of Unicode characters by reading
agents.
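The compatibility property described in the NOTE above can be illustrated with a short Python sketch. It is not part of this standard; the function names utf8_octets and ascii_only_view are hypothetical, and the sample header is invented for illustration. It shows that a US-ASCII character occupies a single octet below 128, that a non-ASCII character becomes a sequence of octets all of which are 128 or above, and what an agent that only understands US-ASCII could safely display.

```python
def utf8_octets(ch: str) -> list:
    """Return the UTF-8 octets of a single character as integers."""
    return list(ch.encode("utf-8"))


def ascii_only_view(octets: bytes) -> str:
    """Render header octets as an ASCII-only agent might (hypothetical helper).

    Octets below 128 are shown as the US-ASCII character they denote;
    every other octet is replaced by '?'.
    """
    return "".join(chr(o) if o < 128 else "?" for o in octets)


# Invented sample header; U+00E9 is e-acute.
raw = "Subject: caf\u00e9".encode("utf-8")

print(utf8_octets("e"))        # one octet, below 128
print(utf8_octets("\u00e9"))   # two octets, both 128 or above
print(ascii_only_view(raw))    # the ASCII part survives unchanged
```

Because no octet of a multi-octet sequence is below 128, an agent that scans for ASCII delimiters such as ":" cannot mistake part of a non-ASCII character for one.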
Notwithstanding the great flexibility permitted by UTF-8, there is
need for restraint in its use in order that the essential components
of headers may be discerned using reading agents that cannot present
the full Unicode range. In particular, header-names and tokens MUST
be in US-ASCII, and certain other components of headers, as defined
elsewhere in this standard - notably msg-ids, date-times, dot-atoms,
domains and path-identities - MUST be in US-ASCII. Comments, phrases
(as in addresses) and unstructureds (as in Subject headers) MAY use
the full range of UTF-8 characters, but SHOULD nevertheless be
invariant under Unicode normalization NFC [UNICODE 3.1].
NOTE: Unicode allows for composite characters made up of a
starter character - which can be a letter, number, punctuation
mark, or symbol - plus zero or more combining marks (such as
accents, diacritics, and similar). The requirement that a
composite be invariant under normalization NFC means that, where
it could be written in more than one way, only one particular
one is allowed (for example, the single character E-acute is
preferred over E followed by a non-spacing acute accent, and A-
ring is preferred over the Angstrom symbol). At least for the
main European languages, for which all the needed composites are
already available as single characters, it is unlikely that
posting agents will need to take any special steps to ensure
normalization.
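As a rough sketch of how a posting agent might test these restrictions, the following Python uses the standard unicodedata module. The names is_nfc and check_header are hypothetical, and the checks are deliberately simplified assumptions: the header line is taken to be already unfolded, only the header-name is tested for US-ASCII (not msg-ids, date-times and the other components listed above), Python's strict UTF-8 decoder stands in for the UTF8-xtra-char syntax, and Python's Unicode tables are newer than [UNICODE 3.1].

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """True if text is unchanged by Unicode normalization form NFC."""
    return unicodedata.normalize("NFC", text) == text


def check_header(line: bytes) -> list:
    """Return a list of problems found in one unfolded header line.

    Simplified, hypothetical check: not a full parser for this standard.
    """
    name, colon, body = line.partition(b":")
    if not colon:
        return ["no colon separating header-name from content"]
    problems = []
    if any(octet >= 128 for octet in name):
        problems.append("header-name is not US-ASCII")
    try:
        text = body.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("content is not valid UTF-8")
    else:
        if not is_nfc(text):
            problems.append("content is not invariant under NFC")
    return problems


# E-acute as one character passes; E plus a combining acute does not.
print(check_header("Subject: caf\u00e9".encode("utf-8")))
print(check_header("Subject: cafe\u0301".encode("utf-8")))
# The Angstrom sign (U+212B) normalizes to A-ring (U+00C5).
print(unicodedata.normalize("NFC", "\u212b") == "\u00c5")
```

An agent could equally apply unicodedata.normalize("NFC", ...) to the content before posting rather than merely reporting the problem; which behaviour is appropriate is not settled by this sketch.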
In the particular case of newsgroup-names (see 5.5) there are more
stringent requirements regarding the use of UTF-8 and Unicode.
Where the use of non-ASCII characters, encoded in UTF-8, is permitted
as above, they MAY also be encoded using the MIME mechanism defined
in [RFC 2047], but this usage is deprecated within news articles
(even though it is required in mail messages) since it is less
legible in older reading agents which support neither it nor UTF-8.
Nevertheless, reading agents SHOULD support this usage, but only in
those contexts explicitly mentioned in [RFC 2047].
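For reading agents that choose to support the [RFC 2047] form, a minimal Python sketch using the standard email.header module is given below. The function name decode_encoded_words and the sample values are hypothetical. The sketch decodes encoded-words wherever they occur in the string, so it does not by itself enforce the restriction to the contexts that [RFC 2047] explicitly permits; an agent would need to apply it only to those parts of a header.

```python
from email.header import decode_header, make_header


def decode_encoded_words(value: str) -> str:
    """Decode any RFC 2047 encoded-words in an unfolded header value.

    Hypothetical helper: the caller is assumed to pass only text from a
    context in which encoded-words are permitted.
    """
    return str(make_header(decode_header(value)))


# The same word, once in Q encoding (ISO-8859-1) and once in B encoding (UTF-8).
print(decode_encoded_words("=?ISO-8859-1?Q?caf=E9?="))
print(decode_encoded_words("=?UTF-8?B?Y2Fmw6k=?="))
# A value without encoded-words is returned unchanged.
print(decode_encoded_words("plain subject"))
```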
#Diff to first older
--- ../usefor-article-05/Character_Sets_within_Article_Headers.out July 2001
+++ ../usefor-article-06/Character_Sets_within_Article_Headers.out November 2001
@@ -33,16 +33,18 @@
the full range of UTF-8 characters, but SHOULD nevertheless be
invariant under Unicode normalization NFC [UNICODE 3.1].
- NOTE: The effect of normalization NFC is to place composite
- characters (made by overlaying one character with another) into
- a canonical form (usually represented by a single character
- where one is available - thus E-acute is preferred over E
- followed by a non-spacing acute accent), and to make a
- consistent choice among equivalent forms (e.g. the Angstrom sign
- is replaced by A-ring). At least for the main European
- languages, for which all the needed composites are already
- available as single characters, it is unlikely that posting
- agents will need to take any special steps to ensure
+ NOTE: Unicode allows for composite characters made up of a
+ starter character - which can be a letter, number, punctuation
+ mark, or symbol - plus zero or more combining marks (such as
+ accents, diacritics, and similar). The requirement that a
+ composite be invariant under normalization NFC means that, where
+ it could be written in more than one way, only one particular
+ one is allowed (for example, the single character E-acute is
+ preferred over E followed by a non-spacing acute accent, and A-
+ ring is preferred over the Angstrom symbol). At least for the
+ main European languages, for which all the needed composites are
+ already available as single characters, it is unlikely that
+ posting agents will need to take any special steps to ensure
normalization.
In the particular case of newsgroup-names (see 5.5) there are more