usefor-article-09 February 2003
4.4.1. Character Sets within Article Headers
Within article headers, characters are represented as octets
according to the UTF-8 encoding scheme [RFC 2279] or [ISO/IEC 10646],
and hence all the characters in Unicode [UNICODE 3.2] or in the
Universal Multiple-Octet Coded Character Set (UCS) [ISO/IEC 10646]
(which is essentially identical to Unicode and expected to remain so)
are potentially available. Although it will usually be unnecessary
to use language tagging within headers, the tagging facilities
provided in [UNICODE 3.2] (code points U+E0000 through U+E007F) MAY
be used for that purpose.
NOTE: UTF-8 is an encoding for the [ISO/IEC 10646] character set
(in both its 16 and 32 bit forms) with the property that any
octet less than 128 immediately represents the corresponding
US-ASCII character, thus ensuring upwards compatibility with
previous practice. Non-ASCII characters from Unicode are
represented by sequences of octets satisfying the syntax of a
UTF8-xtra-char (2.4.2), which excludes certain octet sequences
not explicitly permitted by [RFC 2279]. Unicode includes all
characters from the ISO-8859 series of character sets [ISO
8859] (which includes all Cyrillic, Greek and Arabic characters)
together with the more elaborate characters used in Asian
countries. See the NOTEs in the following section for the
appropriate treatment of Unicode characters by reading agents.
[The sentence mentioning [RFC 2279] could be simplified if [RFC 2279bis]
has been accepted by the time this standard is published.]
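
As an informal illustration of the compatibility property described in the NOTE above, the following Python sketch lists the octets that UTF-8 assigns to a character. It is not part of this standard, and the function name utf8_octets is a hypothetical helper chosen for the example.

```python
def utf8_octets(text):
    """Return the UTF-8 octets of text as a list of integers."""
    return list(text.encode("utf-8"))

# A US-ASCII character maps to a single octet equal to its ASCII code.
print(utf8_octets("A"))        # [65]

# A non-ASCII character becomes a sequence of octets, each 128 or greater.
print(utf8_octets("\u00fc"))   # [195, 188]  (u with diaeresis)
```

Because every octet below 128 stands for itself, software that only inspects US-ASCII delimiters (such as the colon after a header-name) is not confused by the multi-octet sequences.
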
Notwithstanding the great flexibility permitted by UTF-8, there is
need for restraint in its use in order that the essential components
of headers may be discerned using reading agents that cannot present
the full Unicode range. In particular, header-names and tokens MUST
be in US-ASCII, and certain other components of headers, as defined
elsewhere in this standard - notably msg-ids, date-times, dot-atoms,
domains and path-identities - MUST be in US-ASCII. Comments, phrases
(as in mailboxes) and unstructured headers (such as the Subject-,
Organization- and Summary-headers) MAY use the full range of UTF-8
characters, but SHOULD nevertheless be invariant under Unicode
normalization NFC [UNICODE 3.2].
NOTE: Unicode allows for composite characters made up of a
starter character - which can be a letter, number, punctuation
mark, or symbol - plus zero or more combining marks (such as
accents, diacritics, and similar). The requirement that a
composite be invariant under normalization NFC means that, where
it could be written in more than one way, only one of those ways
is allowed (for example, the single character
E-acute is preferred over E followed by a non-spacing acute
accent, and A-ring is preferred over the Angstrom symbol). At
least for the main European languages, for which all the needed
composites are already available as single characters, it is
unlikely that posting agents will need to take any special steps
to ensure normalization.
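
The invariance requirement in the NOTE above can be checked mechanically. The following is a minimal Python sketch, assuming the standard unicodedata module; the helper names is_nfc and to_nfc are hypothetical. Note that the module's default data follows a later Unicode version than [UNICODE 3.2]; for the characters shown here the results are the same.

```python
import unicodedata

def to_nfc(text):
    """Return text converted to Unicode normalization form NFC."""
    return unicodedata.normalize("NFC", text)

def is_nfc(text):
    """True if text is invariant under normalization NFC."""
    return to_nfc(text) == text

precomposed = "\u00c9"   # E-acute as a single character
decomposed = "E\u0301"   # E followed by a combining acute accent
angstrom = "\u212b"      # the Angstrom symbol

print(is_nfc(precomposed))                 # True
print(is_nfc(decomposed))                  # False
print(to_nfc(decomposed) == precomposed)   # True
print(to_nfc(angstrom) == "\u00c5")        # True (A-ring)
```

A posting agent could apply a check such as is_nfc to comments, phrases and unstructured headers before injection, and normalize with to_nfc where the check fails.
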
In the particular case of newsgroup-names (see 5.5) there are more
stringent requirements regarding the normalization and other usages
of Unicode.
Where the use of non-ASCII characters is permitted as above, they MAY
be encoded in UTF-8 or they MAY be encoded using the MIME mechanisms
defined in [RFC 2047] and [RFC 2231]. For this purpose, all headers
defined in this standard are to be considered as "extension message
header fields" for the purpose of section 5 of [RFC 2047] (insofar as
they are not already covered under the existing Email standards). The
effect of this is to permit the use of [RFC 2047] encodings within
any unstructured header, or within any comment or phrase permitted
within any structured header. Additionally, [RFC 2047] is
considered to incorporate the extension to allow language tags within
encoded-words described in [RFC 2231]. Likewise, the syntax for
parameter (see 4.1 above) is to be considered as replaced by the
revised syntax given in [RFC 2231], the effect of which is to allow
the use of parameter value continuations, character sets and language
information within the MIME-style parameters introduced in this
standard (4.2.2).
[We could go further and include that syntax explicitly in this
document.]
Exceptionally, where some other protocol, for example the
authentication protocol based on OpenPGP defined in [RFC 3156],
restricts some header to 7-bit data, the [RFC 2047] and [RFC 2231]
encodings MUST be used in preference to UTF-8 (see also the similar
restriction in 6.21.3).
[This presupposes that the extension to permit UTF-8 in body part
headers in 6.21.1 survives.]
Examples:
Organization: Technische =?iso-8859-1?Q?Universit=E4t_M=FCnchen?=
Approved: =?iso-8859-1?Q?Fran=E7ois_Faur=E9?= <ff@modsite.example>
    (=?iso-8859-1*fr?Q?Mod=E9rateur_autoris=E9?=)
Archive: yes; filename*=iso-8859-1'es'ma%F1ana.txt
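
To show how a reading agent might interpret such headers, here is a minimal Python sketch, not a complete implementation. It assumes the [RFC 2231] conventions that a language tag follows the charset of an encoded-word after an asterisk and that extended parameter values use percent-encoding. The names ENCODED_WORD, decode_encoded_words and decode_extended_value are hypothetical. The sketch does not remove white space between adjacent encoded-words and does not handle parameter value continuations.

```python
import base64
import binascii
import re
from urllib.parse import unquote_to_bytes

# An encoded-word as in RFC 2047, with the optional "*language" suffix
# that RFC 2231 permits after the charset.
ENCODED_WORD = re.compile(
    r"=\?([^?*\s]+)(?:\*([^?\s]+))?\?([QqBb])\?([^?\s]*)\?="
)

def decode_encoded_words(value):
    """Replace each encoded-word in value by the text it represents."""
    def replace(match):
        charset, _language, encoding, payload = match.groups()
        if encoding.upper() == "Q":
            # header=True makes "_" decode to a space, as the Q encoding requires.
            octets = binascii.a2b_qp(payload.encode("ascii"), header=True)
        else:
            octets = base64.b64decode(payload)
        return octets.decode(charset)
    return ENCODED_WORD.sub(replace, value)

def decode_extended_value(value):
    """Decode an RFC 2231 extended value of the form charset'language'text."""
    charset, language, encoded = value.split("'", 2)
    text = unquote_to_bytes(encoded).decode(charset or "us-ascii")
    return text, language

print(decode_encoded_words(
    "Technische =?iso-8859-1?Q?Universit=E4t_M=FCnchen?="))
print(decode_extended_value("iso-8859-1'es'ma%F1ana.txt"))
```

The first call yields the organization name with its umlauts restored; the second yields the filename together with its language tag "es".
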
Reading agents MUST support the use of UTF-8, [RFC 2047] and [RFC
2231] in all those headers defined in this standard and in the Email
standards, at least to the extent of their ability to display the
characters presented to them. Moreover, since Netnews articles are
regularly emailed as well as posted, and the current Email standards
do not admit the use of full UTF-8 in headers, posting
agents MUST ensure that [RFC 2047] and [RFC 2231] are used in
preference to UTF-8 in those cases, at least within the emailed
version (see also 6.9 and 8.8.1.1).
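
As a rough sketch of what a posting agent might do for the emailed version, the following Python fragment converts non-ASCII header text into a single [RFC 2047] encoded-word using the Q encoding. The names SAFE, q_encode and header_text_for_email are hypothetical, and the sketch makes simplifying assumptions: it encodes the whole text as one encoded-word and does not enforce the 75-character limit that [RFC 2047] places on each encoded-word.

```python
import string

# Characters left unencoded: the conservative set that RFC 2047 allows
# inside an encoded-word appearing within a phrase.
SAFE = set(string.ascii_letters + string.digits + "!*+-/")

def q_encode(text, charset="utf-8"):
    """Return text as one Q-encoded encoded-word in the given charset."""
    pieces = []
    for octet in text.encode(charset):
        char = chr(octet)
        if char in SAFE:
            pieces.append(char)
        elif char == " ":
            pieces.append("_")               # space is written as underscore
        else:
            pieces.append("=%02X" % octet)   # any other octet as =XX
    return "=?%s?Q?%s?=" % (charset, "".join(pieces))

def header_text_for_email(text, charset="utf-8"):
    """Return text unchanged if it is pure US-ASCII, else as an encoded-word."""
    if all(ord(c) < 128 for c in text):
        return text
    return q_encode(text, charset)

print(q_encode("Universit\u00e4t M\u00fcnchen", "iso-8859-1"))
# =?iso-8859-1?Q?Universit=E4t_M=FCnchen?=
```

Text that is already US-ASCII passes through untouched, so the posted and emailed versions differ only where the restriction actually applies.
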
Encoding by other means is not compliant with this standard.
Nevertheless, encoding using other character sets (with no indication
of which one beyond the user's ability to guess based upon other
clues in the article, or custom within the newsgroup) has been in use
in some hierarchies, and such usage may be expected to continue for
some period after the introduction of this standard. Reading agents
MAY, when such usage is detected, attempt to interpret the header
according to whatever other character set can be deduced, or has been
configured as a default by the reader.
NOTE: It is possible to determine, with a high degree of
accuracy, when a given text containing octets with the 8th bit
set was not encoded using UTF-8, and using this test to recover
such non-compliant texts is therefore commended where no other
harm could arise.
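
One way to apply the test mentioned in the NOTE above is to attempt a strict UTF-8 decoding and fall back to a configured character set only when that fails. The following Python sketch assumes a hypothetical function name, decode_header_octets, and uses iso-8859-1 purely as an example default; a real reading agent would take the default from the reader's configuration.

```python
def decode_header_octets(octets, fallback="iso-8859-1"):
    """Decode raw header octets, preferring UTF-8.

    Returns a pair (text, charset_used). Strict UTF-8 decoding rejects
    malformed sequences, which is what makes the test reliable for text
    containing octets with the 8th bit set.
    """
    try:
        return octets.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: treat as the configured legacy character set.
        return octets.decode(fallback, errors="replace"), fallback

print(decode_header_octets("Universit\u00e4t".encode("utf-8")))
print(decode_header_octets(b"Universit\xe4t"))
```

The first call succeeds as UTF-8. In the second, the octet 0xE4 is not followed by the continuation octets UTF-8 requires, so the fallback character set is used instead.
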
The [RFC 2047] encoding is not available within headers which contain
a newsgroup-name, notably Newsgroups-headers and Followup-To-headers,
because a newsgroup-name is neither a phrase nor a comment. Moreover
such headers MUST in any case use UTF-8 in order to ensure that
newsgroup-names appear in their canonical form. A special encoding
for newsgroup-names is provided in section 5.5.2 for use when mailing
to moderators and other gatewaying applications (8.7 and 8.8.1.1).
NOTE: The choice between UTF-8 and [RFC 2047] when posting
depends on various factors. Some reading agents do not recognize
[RFC 2047], and some are incapable of decoding UTF-8 (though
there is an increasing tendency for modern reading agents to
understand, or to be configurable to understand, both). Since
headers encoded in UTF-8 are currently prohibited in Email,
special consideration needs to be given to articles that are
both posted and mailed (6.9) or which are mailed to moderators
(see 8.2.2). Posters and implementors of posting agents need to
take account of all these factors when deciding which method to
use.
#Diff to first older
--- ../usefor-article-08/Character_Sets_within_Article_Headers.out August 2002
+++ ../usefor-article-09/Character_Sets_within_Article_Headers.out February 2003
@@ -4,10 +4,11 @@
according to the UTF-8 encoding scheme [RFC 2279] or [ISO/IEC 10646],
and hence all the characters in Unicode [UNICODE 3.2] or in the
Universal Multiple-Octet Coded Character Set (UCS) [ISO/IEC 10646]
- (which is essentially a superset of Unicode and expected to remain
- so) are potentially available. However, processing all octets in the
- same manner as US-ASCII characters should ensure correct behaviour in
- most situations.
+ (which is essentially identical to Unicode and expected to remain so)
+ are potentially available. Although it will usually be unnecessary
+ to use language tagging within headers, the tagging facilities
+ provided in [UNICODE 3.2] (code points U+E0000 through U+E007F) MAY
+ be used for that purpose.
NOTE: UTF-8 is an encoding for the [ISO/IEC 10646] character set
(in both its 16 and 32 bit forms) with the property that any
@@ -56,11 +57,47 @@
of Unicode.
Where the use of non-ASCII characters is permitted as above, they MAY
- be encoded in UTF-8 and they MAY be encoded using the MIME mechanisms
- defined in [RFC 2047] and [RFC 2231], but only in those contexts
- explicitly mentioned in those documents (unstructured headers,
- phrases and comments in the one, quoted-strings within parameters in
- the other).
+ be encoded in UTF-8 or they MAY be encoded using the MIME mechanisms
+ defined in [RFC 2047] and [RFC 2231]. For this purpose, all headers
+ defined in this standard are to be considered as "extension message
+ header fields" for the purpose of section 5 of [RFC 2047] (insofar as
+ they are not already covered under the existing Email standards). The
+ effect of this is to permit the use of [RFC 2047] encodings within
+ any unstructured header, or within any comment or phrase permitted
+ within any structured header. Additionally, [RFC 2047] is
+ considered to incorporate the extension to allow language tags within
+ encoded-words described in [RFC 2231]. Likewise, the syntax for
+ parameter (see 4.1 above) is to be considered as replaced by the
+ revised syntax given in [RFC 2231], the effect of which is to allow
+ the use of parameter value continuations, character sets and language
+ information within the MIME-style parameters introduced in this
+ standard (4.2.2).
+[We could go further and include that syntax explicitly in this
+document.]
+
+ Exceptionally, where some other protocol, for example the
+ authentication protocol based on OpenPGP defined in [RFC 3156],
+ restricts some header to 7-bit data, the [RFC 2047] and [RFC 2231]
+ encodings MUST be used in preference to UTF-8 (see also the similar
+ restriction in 6.21.3).
+[This presupposes that the extension to permit UTF-8 in body part
+headers in 6.21.1 survives.]
+
+ Examples:
+ Organization: Technische =?iso-8859-1?Q?Universit=E4t_M=FCnchen?=
+ Approved: =?iso-8859-1?Q?Fran=E7ois_Faur=E9?= <ff@modsite.example>
+ (=?iso-8859-1?Q*fr?Mod=E9rateur_autoris=E9?=)
+ Archive: yes; filename*=iso-8859-1'es'ma=F1ana.txt
+
+ Reading agents MUST support the use of UTF-8, [RFC 2047] and [RFC
+ 2231] in all those headers defined in this standard and in the Email
+ standards, at least to the extent of their ability to display the
+ characters presented to them. Moreover, since Netnews articles are
+ regularly emailed as well as posted, and the current Email standards
+ do not currently admit the use of full UTF-8 in headers, posting
+ agents MUST ensure that [RFC 2047] and [RFC 2231] are used in
+ preference to UTF-8 in those cases, at least within the emailed
+ version (see also 6.9 and 8.8.1.1).
Encoding by other means is not compliant with this standard.
Nevertheless, encoding using other character sets (with no indication
@@ -68,22 +105,22 @@
clues in the article, or custom within the newsgroup) has been in use
in some hierarchies, and such usage may be expected to continue for
some period after the introduction of this standard. Reading agents
- MUST support the use of UTF-8, [RFC 2047] and [RFC 2231] in headers
- and they MAY, when it is detected that none of these has been used,
- attempt to interpet the header according to whatever other character
- set can be deduced, or has been configued as a default by the reader.
+ MAY, when such usage is detected, attempt to interpet the header
+ according to whatever other character set can be deduced, or has been
+ configued as a default by the reader.
NOTE: It is possible to determine, with a high degree of
accuracy, when a given text containing octets with the 8th bit
set was not encoded using UTF-8, and using this test to recover
such non-compliant texts is therefore commended where no other
harm could arise.
-
- Exceptionally, Newsgroups-headers (5.5) MUST use UTF-8 in order to
- ensure that they appear in their canonical form (in any case, a
- Newsgroups-header is not one of the acceptable contexts of [RFC
- 2047]). Certain exceptions to this rule are provided (8.7 and 8.8.1)
- for use when mailing to moderators and other gatewaying applications.
+ The [RFC 2047] encoding is not available within headers which contain
+ a newsgroup-name, notably Newsgroups-headers and Followup-To-headers,
+ because a newsgroup-name is neither a phrase nor a comment. Moreover
+ such headers MUST in any case use UTF-8 in order to ensure that
+ newsgroup-names appear in their canonical form. A special encoding
+ for newsgroup-names is provided in section 5.5.2 for use when mailing
+ to moderators and other gatewaying applications (8.7 and 8.8.1.1).
NOTE: The choice between UTF-8 and [RFC 2047] when posting
depends on various factors. Some reading agents do not recogize