usefor-article-08 August 2002
4.4.1. Character Sets within Article Headers
Within article headers, characters are represented as octets
according to the UTF-8 encoding scheme [RFC 2279] or [ISO/IEC 10646],
and hence all the characters in Unicode [UNICODE 3.2] or in the
Universal Multiple-Octet Coded Character Set (UCS) [ISO/IEC 10646]
(which is essentially a superset of Unicode and expected to remain
so) are potentially available. However, processing all octets in the
same manner as US-ASCII characters should ensure correct behaviour in
most situations.
NOTE: UTF-8 is an encoding for the [ISO/IEC 10646] character set
(in both its 16 and 32 bit forms) with the property that any
octet less than 128 immediately represents the corresponding
US-ASCII character, thus ensuring upwards compatibility with
previous practice. Non-ASCII characters from Unicode are
represented by sequences of octets satisfying the syntax of a
UTF8-xtra-char (2.4.2), which excludes certain octet sequences
not explicitly permitted by [RFC 2279]. Unicode includes all
characters from the ISO-8859 series of character sets [ISO
8859] (which includes all Cyrillic, Greek and Arabic characters)
together with the more elaborate characters used in Asian
countries. See the NOTEs in the following section for the
appropriate treatment of Unicode characters by reading agents.
[The sentence mentioning [RFC 2279] could be simplified if [RFC 2279bis]
has been accepted by the time this standard is published.]
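As an illustration of the compatibility property described in the NOTE above, the following Python sketch lists the octets that UTF-8 produces for a few characters. The function name describe_octets is hypothetical and is not defined by this standard.

```python
def describe_octets(text: str) -> list[tuple[str, list[int]]]:
    """Pair each character with the octets of its UTF-8 encoding."""
    return [(ch, list(ch.encode("utf-8"))) for ch in text]


# 'A' (US-ASCII), e-acute (U+00E9) and the euro sign (U+20AC).
for ch, octets in describe_octets("A\u00e9\u20ac"):
    kind = "US-ASCII" if len(octets) == 1 and octets[0] < 128 else "multi-octet"
    print(repr(ch), octets, kind)
```

The US-ASCII character yields a single octet below 128, whereas every octet of a non-ASCII character has its 8th bit set, so software that scans headers for US-ASCII delimiters is not misled by such sequences.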
Notwithstanding the great flexibility permitted by UTF-8, there is
need for restraint in its use in order that the essential components
of headers may be discerned using reading agents that cannot present
the full Unicode range. In particular, header-names and tokens MUST
be in US-ASCII, and certain other components of headers, as defined
elsewhere in this standard - notably msg-ids, date-times, dot-atoms,
domains and path-identities - MUST be in US-ASCII. Comments, phrases
(as in mailboxes) and unstructured headers (such as the Subject-,
Organization- and Summary-headers) MAY use the full range of UTF-8
characters, but SHOULD nevertheless be invariant under Unicode
normalization NFC [UNICODE 3.2].
NOTE: Unicode allows for composite characters made up of a
starter character - which can be a letter, number, punctuation
mark, or symbol - plus zero or more combining marks (such as
accents, diacritics, and similar). The requirement that a
composite be invariant under normalization NFC means that, where
it could be written in more than one way, only one particular
one of those ways is allowed (for example, the single character
E-acute is preferred over E followed by a non-spacing acute
accent, and A-ring is preferred over the Angstrom symbol). At
least for the main European languages, for which all the needed
composites are already available as single characters, it is
unlikely that posting agents will need to take any special steps
to ensure normalization.
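A posting agent could test for invariance under NFC roughly as follows. This is a minimal sketch using the unicodedata module of the Python standard library. The helper name is_nfc is hypothetical, and the Unicode version applied is the one built into the Python installation, not necessarily [UNICODE 3.2].

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if text is unchanged by Unicode normalization NFC."""
    return unicodedata.normalize("NFC", text) == text


decomposed = "E\u0301"   # E followed by a non-spacing acute accent
composed = "\u00c9"      # the single character E-acute
angstrom = "\u212b"      # the Angstrom symbol
a_ring = "\u00c5"        # the single character A-ring

print(is_nfc(decomposed), is_nfc(composed))   # False True
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFC", angstrom) == a_ring)       # True
```

An agent that finds is_nfc to be false could either normalize the text before posting or warn the poster; which of these is appropriate is a matter for the implementor.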
In the particular case of newsgroup-names (see 5.5) there are more
stringent requirements regarding the normalization and other usages
of Unicode.
Where the use of non-ASCII characters is permitted as above, they MAY
be encoded in UTF-8 and they MAY be encoded using the MIME mechanisms
defined in [RFC 2047] and [RFC 2231], but only in those contexts
explicitly mentioned in those documents (unstructured headers,
phrases and comments in the one, quoted-strings within parameters in
the other).
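The two MIME mechanisms might be exercised as in the following sketch, which relies on the email package of the Python standard library. The helper names encode_2047, decode_2047 and decode_2231 are hypothetical, and the sample subject and filename are invented for illustration only.

```python
from email.header import Header, decode_header, make_header
from email.utils import encode_rfc2231
from urllib.parse import unquote


def encode_2047(text: str) -> str:
    """Encode text as [RFC 2047] encoded-words, labelled as UTF-8."""
    return Header(text, charset="utf-8").encode()


def decode_2047(value: str) -> str:
    """Decode any [RFC 2047] encoded-words found in a header value."""
    return str(make_header(decode_header(value)))


def decode_2231(value: str) -> str:
    """Decode an [RFC 2231] value of the form charset'language'text."""
    charset, _language, encoded = value.split("'", 2)
    return unquote(encoded, encoding=charset)


subject = "Gr\u00fc\u00dfe aus K\u00f6ln"
encoded_subject = encode_2047(subject)   # consists of US-ASCII only
parameter = encode_rfc2231("na\u00efve.txt", charset="utf-8")

print(encoded_subject)
print(parameter)
print(decode_2047(encoded_subject) == subject)
print(decode_2231(parameter))
```

Both encoded forms contain only US-ASCII octets, which is why they survive transports and agents that cannot handle UTF-8 directly.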
Encoding by other means is not compliant with this standard.
Nevertheless, encoding using other character sets (with no indication
of which one beyond the user's ability to guess based upon other
clues in the article, or custom within the newsgroup) has been in use
in some hierarchies, and such usage may be expected to continue for
some period after the introduction of this standard. Reading agents
MUST support the use of UTF-8, [RFC 2047] and [RFC 2231] in headers
and they MAY, when it is detected that none of these has been used,
attempt to interpret the header according to whatever other character
set can be deduced, or has been configured as a default by the reader.
NOTE: It is possible to determine, with a high degree of
accuracy, when a given text containing octets with the 8th bit
set was not encoded using UTF-8, and using this test to recover
such non-compliant texts is therefore commended where no other
harm could arise.
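The test mentioned in the NOTE above can be approximated by attempting a strict UTF-8 decode and falling back only when it fails. The following Python sketch assumes a hypothetical function name, decode_header_octets, and an arbitrary default fallback of ISO-8859-1; a real reading agent would take the fallback from its configuration or from other clues in the article.

```python
def decode_header_octets(raw: bytes, fallback: str = "iso-8859-1") -> tuple[str, str]:
    """Decode raw header octets, returning the text and the charset used.

    A strict UTF-8 decode rejects malformed sequences, so a failure is
    strong evidence that some other character set was in use.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Non-compliant header: interpret it using the configured default.
        return raw.decode(fallback, errors="replace"), fallback


print(decode_header_octets(b"caf\xc3\xa9"))   # valid UTF-8
print(decode_header_octets(b"caf\xe9"))       # a lone 0xE9 is not valid UTF-8
```

Text in an 8-bit character set rarely happens to form valid UTF-8 sequences, which is the basis for the high degree of accuracy claimed; the test cannot, however, say which other character set was intended.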
Exceptionally, Newsgroups-headers (5.5) MUST use UTF-8 in order to
ensure that they appear in their canonical form (in any case, a
Newsgroups-header is not one of the acceptable contexts of [RFC
2047]). Certain exceptions to this rule are provided (8.7 and 8.8.1)
for use when mailing to moderators and other gatewaying applications.
NOTE: The choice between UTF-8 and [RFC 2047] when posting
depends on various factors. Some reading agents do not recognize
[RFC 2047], and some are incapable of decoding UTF-8 (though
there is an increasing tendency for modern reading agents to
understand, or to be configurable to understand, both). Since
headers encoded in UTF-8 are currently prohibited in Email,
special consideration needs to be given to articles that are
both posted and mailed (6.9) or which are mailed to moderators
(see 8.2.2). Posters and implementors of posting agents need to
take account of all these factors when deciding which method to
use.
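One possible policy for a posting agent, given purely as an assumption and not mandated by this standard, is to leave US-ASCII content untouched, to use [RFC 2047] when the article is also to be mailed, and to use UTF-8 otherwise. The function name encode_unstructured and the also_mailed flag in this Python sketch are hypothetical.

```python
from email.header import Header


def encode_unstructured(value: str, also_mailed: bool) -> bytes:
    """Choose the octets for the content of an unstructured header.

    Assumed policy: US-ASCII is sent as is; [RFC 2047] is used when the
    article is also mailed (UTF-8 headers being prohibited in Email);
    raw UTF-8 is used when the article is only posted.
    """
    if value.isascii():
        return value.encode("ascii")
    if also_mailed:
        return Header(value, charset="utf-8").encode().encode("ascii")
    return value.encode("utf-8")


print(encode_unstructured("caf\u00e9", also_mailed=False))
print(encode_unstructured("caf\u00e9", also_mailed=True))
```

Other policies are equally defensible, for example preferring [RFC 2047] in a newsgroup whose readers are known to use agents that cannot decode UTF-8.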
#Diff to first older
--- ../usefor-article-07/Character_Sets_within_Article_Headers.out May 2002
+++ ../usefor-article-08/Character_Sets_within_Article_Headers.out August 2002
@@ -2,26 +2,29 @@
Within article headers, characters are represented as octets
according to the UTF-8 encoding scheme [RFC 2279] or [ISO/IEC 10646],
- and hence all the characters in Unicode [UNICODE 3.1] or in the
+ and hence all the characters in Unicode [UNICODE 3.2] or in the
Universal Multiple-Octet Coded Character Set (UCS) [ISO/IEC 10646]
(which is essentially a superset of Unicode and expected to remain
so) are potentially available. However, processing all octets in the
same manner as US-ASCII characters should ensure correct behaviour in
most situations.
- NOTE: UTF-8 is an encoding for 16bit (and even 32bit) character
- sets with the property that any octet less than 128 immediately
- represents the corresponding US-ASCII character, thus ensuring
- upwards compatibility with previous practice. Non-ASCII
- characters from Unicode are represented by sequences of octets
- satisfying the syntax of a UTF8-xtra-char (2.4.2), which
- excludes certain octet sequences not explicitly permitted by
- [RFC 2279]. Unicode includes all characters from the ISO-8859
- series of characters sets [ISO 8859] (which includes all
- Cyrillic, Greek and Arabic characters) together with the more
- elaborate characters used in Asian countries. See the following
- section for the appropriate treatment of Unicode characters by
- reading agents.
+ NOTE: UTF-8 is an encoding for the [ISO/IEC 10646] character set
+ (in both its 16 and 32 bit forms) with the property that any
+ octet less than 128 immediately represents the corresponding
+ US-ASCII character, thus ensuring upwards compatibility with
+ previous practice. Non-ASCII characters from Unicode are
+ represented by sequences of octets satisfying the syntax of a
+ UTF8-xtra-char (2.4.2), which excludes certain octet sequences
+ not explicitly permitted by [RFC 2279]. Unicode includes all
+ characters from the ISO-8859 series of characters sets [ISO
+ 8859] (which includes all Cyrillic, Greek and Arabic characters)
+ together with the more elaborate characters used in Asian
+ countries. See the NOTEs in the following section for the
+ appropriate treatment of Unicode characters by reading agents.
+[The sentence mentioning [RFC 2279] could be simplified if [RFC 2279bis]
+has been accepted by the time this standard is published.]
+
Notwithstanding the great flexibility permitted by UTF-8, there is
need for restraint in its use in order that the essential components
of headers may be discerned using reading agents that cannot present
@@ -29,10 +32,10 @@
be in US-ASCII, and certain other components of headers, as defined
elsewhere in this standard - notably msg-ids, date-times, dot-atoms,
domains and path-identities - MUST be in US-ASCII. Comments, phrases
- (as in addresses) and unstructured headers (such as the Subject-,
+ (as in mailboxes) and unstructured headers (such as the Subject-,
Organization- and Summary-headers) MAY use the full range of UTF-8
characters, but SHOULD nevertheless be invariant under Unicode
- normalization NFC [UNICODE 3.1].
+ normalization NFC [UNICODE 3.2].
NOTE: Unicode allows for composite characters made up of a
starter character - which can be a letter, number, punctuation
@@ -40,31 +43,57 @@
accents, diacritics, and similar). The requirement that a
composite be invariant under normalization NFC means that, where
it could be written in more than one way, only one particular
- one is allowed (for example, the single character E-acute is
- preferred over E followed by a non-spacing acute accent, and A-
- ring is preferred over the Angstrom symbol). At least for the
- main European languages, for which all the needed composites are
- already available as single characters, it is unlikely that
- posting agents will need to take any special steps to ensure
- normalization.
+ one of those ways is allowed (for example, the single character
+ E-acute is preferred over E followed by a non-spacing acute
+ accent, and A-ring is preferred over the Angstrom symbol). At
+ least for the main European languages, for which all the needed
+ composites are already available as single characters, it is
+ unlikely that posting agents will need to take any special steps
+ to ensure normalization.
In the particular case of newsgroup-names (see 5.5) there are more
- stringent requirements regarding the use of UTF-8 and Unicode.
+ stringent requirements regarding the normalization and other usages
+ of Unicode.
+
+ Where the use of non-ASCII characters is permitted as above, they MAY
+ be encoded in UTF-8 and they MAY be encoded using the MIME mechanisms
+ defined in [RFC 2047] and [RFC 2231], but only in those contexts
+ explicitly mentioned in those documents (unstructured headers,
+ phrases and comments in the one, quoted-strings within parameters in
+ the other).
+
+ Encoding by other means is not compliant with this standard.
+ Nevertheless, encoding using other character sets (with no indication
+ of which one beyond the user's ability to guess based upon other
+ clues in the article, or custom within the newsgroup) has been in use
+ in some hierarchies, and such usage may be expected to continue for
+ some period after the introduction of this standard. Reading agents
+ MUST support the use of UTF-8, [RFC 2047] and [RFC 2231] in headers
+ and they MAY, when it is detected that none of these has been used,
+ attempt to interpet the header according to whatever other character
+ set can be deduced, or has been configued as a default by the reader.
+
+ NOTE: It is possible to determine, with a high degree of
+ accuracy, when a given text containing octets with the 8th bit
+ set was not encoded using UTF-8, and using this test to recover
+ such non-compliant texts is therefore commended where no other
+ harm could arise.
+
+ Exceptionally, Newsgroups-headers (5.5) MUST use UTF-8 in order to
+ ensure that they appear in their canonical form (in any case, a
+ Newsgroups-header is not one of the acceptable contexts of [RFC
+ 2047]). Certain exceptions to this rule are provided (8.7 and 8.8.1)
+ for use when mailing to moderators and other gatewaying applications.
- Where the use of non-ASCII characters, encoded in UTF-8, is permitted
- as above, they MAY also be encoded using the MIME mechanism defined
- in [RFC 2047], but this usage is deprecated within news articles
- (even though it is required in email messages) since it is less
- legible in older reading agents which support neither it nor UTF-8.
- Nevertheless, reading agents SHOULD support this usage, but only in
- those contexts explicitly mentioned in [RFC 2047].
-
- Similar considerations apply to non-ASCII characters within the
- values of parameters (which, according to the syntax, MUST be in the
- form of quoted-strings in order for UTF8-xtra-chars to be
- accomodated). Such values MAY be encoded using the MIME mechanism
- defined in [RFC 2231], but this usage is deprecated within news
- articles (even though it is required in email messages) since it is
- less legible in older reading agents which support neither it nor
- UTF-8. Nevertheless, reading agents SHOULD support this usage.
+ NOTE: The choice between UTF-8 and [RFC 2047] when posting
+ depends on various factors. Some reading agents do not recogize
+ [RFC 2047], and some are incapable of decoding UTF-8 (though
+ there in an increasing tendency for modern reading agents to
+ understand, or to be configurable to understand, both). Since
+ headers encoded in UTF-8 are currently prohibited in Email,
+ special consideration needs to be given to articles that are
+ both posted and mailed (6.9) or which are mailed to moderators
+ (see 8.2.2). Posters and implementors of posting agents need to
+ take account of all these factors when deciding which method to
+ use.