usefor-article-09 February 2003

4.4.1.  Character Sets within Article Headers

   Within article headers, characters are represented as octets
   according to the UTF-8 encoding scheme [RFC 2279] or [ISO/IEC 10646],
   and hence all the characters in Unicode [UNICODE 3.2] or in the
   Universal Multiple-Octet Coded Character Set (UCS) [ISO/IEC 10646]
   (which is essentially identical to Unicode and expected to remain so)
   are potentially available.  Although it will usually be unnecessary
   to use language tagging within headers, the tagging facilities
   provided in [UNICODE 3.2] (code points U+E0000 through U+E007F) MAY
   be used for that purpose.

        NOTE: UTF-8 is an encoding for the [ISO/IEC 10646] character set
        (in both its 16 and 32 bit forms) with the property that any
        octet less than 128 immediately represents the corresponding
        US-ASCII character, thus ensuring upwards compatibility with
        previous practice.  Non-ASCII characters from Unicode are
        represented by sequences of octets satisfying the syntax of a
        UTF8-xtra-char (2.4.2), which excludes certain octet sequences
        not explicitly permitted by [RFC 2279].  Unicode includes all
        characters from the ISO-8859 series of character sets [ISO
        8859] (which includes all Cyrillic, Greek and Arabic characters)
        together with the more elaborate characters used in Asian
        countries. See the NOTEs in the following section for the
        appropriate treatment of Unicode characters by reading agents.
[The sentence mentioning [RFC 2279] could be simplified if [RFC 2279bis]
has been accepted by the time this standard is published.]
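
   As a rough illustration of the NOTE above, the following Python
   sketch shows that US-ASCII characters map to single octets below 128,
   while other characters become multi-octet sequences in which every
   octet has the 8th bit set.  The helper name utf8_octets is
   hypothetical and is not part of this standard.

```python
def utf8_octets(text):
    """Return the UTF-8 octets of text as a list of integers."""
    return list(text.encode("utf-8"))


# Each US-ASCII character is represented by its own code (below 128).
ascii_octets = utf8_octets("Subject")

# Non-ASCII characters become sequences of octets that are all >= 128.
e_acute_octets = utf8_octets("\u00e9")  # LATIN SMALL LETTER E WITH ACUTE
euro_octets = utf8_octets("\u20ac")     # EURO SIGN
```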

   Notwithstanding the great flexibility permitted by UTF-8, there is
   need for restraint in its use in order that the essential components
   of headers may be discerned using reading agents that cannot present
   the full Unicode range. In particular, header-names and tokens MUST
   be in US-ASCII, as MUST certain other components of headers defined
   elsewhere in this standard, notably msg-ids, date-times, dot-atoms,
   domains and path-identities.  Comments, phrases
   (as in mailboxes) and unstructured headers (such as the Subject-,
   Organization- and Summary-headers) MAY use the full range of UTF-8
   characters, but SHOULD nevertheless be invariant under Unicode
   normalization NFC [UNICODE 3.2].

        NOTE: Unicode allows for composite characters made up of a
        starter character - which can be a letter, number, punctuation
        mark, or symbol - plus zero or more combining marks (such as
        accents, diacritics, and similar). The requirement that a
        composite be invariant under normalization NFC means that, where
        it could be written in more than one way, only one particular
        one of those ways is allowed (for example, the single character
        E-acute is preferred over E followed by a non-spacing acute
        accent, and A-ring is preferred over the Angstrom symbol). At
        least for the main European languages, for which all the needed
        composites are already available as single characters, it is
        unlikely that posting agents will need to take any special steps
        to ensure normalization.
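
   A posting agent that wishes to check this could do so along the
   following lines.  This is a minimal Python sketch using the standard
   unicodedata module; the function name is_nfc is hypothetical, and the
   Unicode version applied is whatever the Python runtime ships, which
   may be later than [UNICODE 3.2].

```python
import unicodedata


def is_nfc(text):
    """Return True if text is unchanged by normalization form NFC."""
    return unicodedata.normalize("NFC", text) == text


# E followed by a non-spacing acute accent composes to E-acute.
decomposed = "E\u0301"
composed = unicodedata.normalize("NFC", decomposed)

# The Angstrom sign (U+212B) normalizes to A-ring (U+00C5).
angstrom = unicodedata.normalize("NFC", "\u212b")
```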

   In the particular case of newsgroup-names (see 5.5) there are more
   stringent requirements regarding the normalization and other usages
   of Unicode.

   Where the use of non-ASCII characters is permitted as above, they MAY
   be encoded in UTF-8 or they MAY be encoded using the MIME mechanisms
   defined in [RFC 2047] and [RFC 2231].  To that end, all headers
   defined in this standard are to be considered as "extension message
   header fields" in the sense of section 5 of [RFC 2047] (insofar as
   they are not already covered under the existing Email standards). The
   effect of this is to permit the use of [RFC 2047] encodings within
   any unstructured header, or within any comment or phrase permitted
   within any structured header.  Additionally, [RFC 2047] is
   considered to incorporate the extension to allow language tags within
   encoded-words described in [RFC 2231].  Likewise, the syntax for
   parameter (see 4.1 above) is to be considered as replaced by the
   revised syntax given in [RFC 2231], the effect of which is to allow
   the use of parameter value continuations, character sets and language
   information within the MIME-style parameters introduced in this
   standard (4.2.2).
[We could go further and include that syntax explicitly in this
document.]

   Exceptionally, where some other protocol, for example the
   authentication protocol based on OpenPGP defined in [RFC 3156],
   restricts some header to 7-bit data, the [RFC 2047] and [RFC 2231]
   encodings MUST be used in preference to UTF-8 (see also the similar
   restriction in 6.21.3).
[This presupposes that the extension to permit UTF-8 in body part
headers in 6.21.1 survives.]

   Examples:
      Organization: Technische =?iso-8859-1?Q?Universit=E4t_M=FCnchen?=
      Approved: =?iso-8859-1?Q?Fran=E7ois_Faur=E9?= <ff@modsite.example>
         (=?iso-8859-1*fr?Q?Mod=E9rateur_autoris=E9?=)
      Archive: yes; filename*=iso-8859-1'es'ma%F1ana.txt
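
   A reading agent might decode such values roughly as follows.  This is
   a Python sketch, not a complete parser: the function names are
   hypothetical, text outside encoded-words is assumed to be US-ASCII,
   and the extended parameter value is assumed to have already been
   isolated from its header.  It relies on the [RFC 2231] forms
   charset*language for encoded-words and percent-encoding for
   parameter values.

```python
from email.header import decode_header
from urllib.parse import unquote


def decode_rfc2047(value):
    """Decode any RFC 2047 encoded-words in value and return a str."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Drop an RFC 2231 language tag ("charset*language") if present.
            name = (charset or "us-ascii").split("*")[0]
            chunk = chunk.decode(name)
        parts.append(chunk)
    return "".join(parts)


def decode_rfc2231_value(ext_value):
    """Split charset'language'text and return (decoded text, language)."""
    charset, language, encoded = ext_value.split("'", 2)
    return unquote(encoded, encoding=charset or "us-ascii"), language


organization = decode_rfc2047(
    "Technische =?iso-8859-1?Q?Universit=E4t_M=FCnchen?=")
comment = decode_rfc2047("=?iso-8859-1*fr?Q?Mod=E9rateur_autoris=E9?=")
filename, language = decode_rfc2231_value("iso-8859-1'es'ma%F1ana.txt")
```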

   Reading agents MUST support the use of UTF-8, [RFC 2047] and [RFC
   2231] in all those headers defined in this standard and in the Email
   standards, at least to the extent of their ability to display the
   characters presented to them. Moreover, since Netnews articles are
   regularly emailed as well as posted, and the Email standards do not
   currently admit the use of full UTF-8 in headers, posting
   agents MUST ensure that [RFC 2047] and [RFC 2231] are used in
   preference to UTF-8 in those cases, at least within the emailed
   version (see also 6.9 and 8.8.1.1).
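
   For an unstructured header, the emailed version could be prepared
   along these lines.  This is a sketch only, using Python's standard
   email.header module; the function name header_value_for_email and the
   choice of UTF-8 as the charset named in the encoded-word are
   assumptions.  Structured headers need more care, since only their
   phrases and comments may carry encoded-words.

```python
from email.header import Header, decode_header, make_header


def header_value_for_email(value, charset="utf-8"):
    """Return value unchanged if pure US-ASCII, else RFC 2047 encoded."""
    if value.isascii():
        return value
    return Header(value, charset).encode()


original = "Technische Universit\u00e4t M\u00fcnchen"
encoded = header_value_for_email(original)

# Decoding the 7-bit form should recover the original text.
round_trip = str(make_header(decode_header(encoded)))
```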

   Encoding by other means is not compliant with this standard.
   Nevertheless, encoding using other character sets (with no indication
   of which one beyond the user's ability to guess based upon other
   clues in the article, or custom within the newsgroup) has been in use
   in some hierarchies, and such usage may be expected to continue for
   some period after the introduction of this standard.  Reading agents
   MAY, when such usage is detected, attempt to interpret the header
   according to whatever other character set can be deduced, or has been
   configured as a default by the reader.

        NOTE: It is possible to determine, with a high degree of
        accuracy, when a given text containing octets with the 8th bit
        set was not encoded using UTF-8, and using this test to recover
        such non-compliant texts is therefore commended where no other
        harm could arise.
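
   One way to apply the test described in this NOTE is to attempt a
   strict UTF-8 decode and fall back only when it fails.  The Python
   sketch below assumes a reader-configured default character set; the
   function name and the choice of iso-8859-1 as that default are
   hypothetical.

```python
def decode_header_octets(raw, fallback="iso-8859-1"):
    """Decode raw header octets and report the charset that was used.

    Strict UTF-8 decoding rejects most text written in other 8-bit
    character sets, so a failure is taken as a sign of legacy usage.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace"), fallback


compliant = decode_header_octets("Universit\u00e4t".encode("utf-8"))
legacy = decode_header_octets(b"Universit\xe4t")
```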

   The [RFC 2047] encoding is not available within headers which contain
   a newsgroup-name, notably Newsgroups-headers and Followup-To-headers,
   because a newsgroup-name is neither a phrase nor a comment. Moreover
   such headers MUST in any case use UTF-8 in order to ensure that
   newsgroup-names appear in their canonical form.  A special encoding
   for newsgroup-names is provided in section 5.5.2 for use when mailing
   to moderators and other gatewaying applications (8.7 and 8.8.1.1).

        NOTE: The choice between UTF-8 and [RFC 2047] when posting
        depends on various factors. Some reading agents do not recognize
        [RFC 2047], and some are incapable of decoding UTF-8 (though
        there is an increasing tendency for modern reading agents to
        understand, or to be configurable to understand, both). Since
        headers encoded in UTF-8 are currently prohibited in Email,
        special consideration needs to be given to articles that are
        both posted and mailed (6.9) or which are mailed to moderators
        (see 8.2.2).  Posters and implementors of posting agents need to
        take account of all these factors when deciding which method to
        use.
#Diff to first older

--- ../usefor-article-08/Character_Sets_within_Article_Headers.out          August 2002
+++ ../usefor-article-09/Character_Sets_within_Article_Headers.out          February 2003
@@ -4,10 +4,11 @@
    according to the UTF-8 encoding scheme [RFC 2279] or [ISO/IEC 10646],
    and hence all the characters in Unicode [UNICODE 3.2] or in the
    Universal Multiple-Octet Coded Character Set (UCS) [ISO/IEC 10646]
-   (which is essentially a superset of Unicode and expected to remain
-   so) are potentially available. However, processing all octets in the
-   same manner as US-ASCII characters should ensure correct behaviour in
-   most situations.
+   (which is essentially identical to Unicode and expected to remain so)
+   are potentially available.  Although it will usually be unnecessary
+   to use language tagging within headers, the tagging facilities
+   provided in [UNICODE 3.2] (code points U+E0000 through U+E007F) MAY
+   be used for that purpose.
 
         NOTE: UTF-8 is an encoding for the [ISO/IEC 10646] character set
         (in both its 16 and 32 bit forms) with the property that any
@@ -56,11 +57,47 @@
    of Unicode.
 
    Where the use of non-ASCII characters is permitted as above, they MAY
-   be encoded in UTF-8 and they MAY be encoded using the MIME mechanisms
-   defined in [RFC 2047] and [RFC 2231], but only in those contexts
-   explicitly mentioned in those documents (unstructured headers,
-   phrases and comments in the one, quoted-strings within parameters in
-   the other).
+   be encoded in UTF-8 or they MAY be encoded using the MIME mechanisms
+   defined in [RFC 2047] and [RFC 2231].  For this purpose, all headers
+   defined in this standard are to be considered as "extension message
+   header fields" for the purpose of section 5 of [RFC 2047] (insofar as
+   they are not already covered under the existing Email standards). The
+   effect of this is to permit the use of [RFC 2047] encodings within
+   any unstructured header, or within any comment or phrase permitted
+   within any structured header.  Additionally,  [RFC 2047] is
+   considered to incorporate the extension to allow language tags within
+   encoded-words described in [RFC 2231].  Likewise, the syntax for
+   parameter (see 4.1 above) is to be considered as replaced by the
+   revised syntax given in [RFC 2231], the effect of which is to allow
+   the use of parameter value continuations, character sets and language
+   information within the MIME-style parameters introduced in this
+   standard (4.2.2).
+[We could go further and include that syntax explicitly in this
+document.]
+
+   Exceptionally, where some other protocol, for example the
+   authentication protocol based on OpenPGP defined in [RFC 3156],
+   restricts some header to 7-bit data, the [RFC 2047] and [RFC 2231]
+   encodings MUST be used in preference to UTF-8 (see also the similar
+   restriction in 6.21.3).
+[This presupposes that the extension to permit UTF-8 in body part
+headers in 6.21.1 survives.]
+
+   Examples:
+      Organization: Technische =?iso-8859-1?Q?Universit=E4t_M=FCnchen?=
+      Approved: =?iso-8859-1?Q?Fran=E7ois_Faur=E9?= <ff@modsite.example>
+         (=?iso-8859-1?Q*fr?Mod=E9rateur_autoris=E9?=)
+      Archive: yes; filename*=iso-8859-1'es'ma=F1ana.txt
+
+   Reading agents MUST support the use of UTF-8, [RFC 2047] and [RFC
+   2231] in all those headers defined in this standard and in the Email
+   standards, at least to the extent of their ability to display the
+   characters presented to them. Moreover, since Netnews articles are
+   regularly emailed as well as posted, and the current Email standards
+   do not currently admit the use of full UTF-8 in headers, posting
+   agents MUST ensure that [RFC 2047] and [RFC 2231] are used in
+   preference to UTF-8 in those cases, at least within the emailed
+   version (see also 6.9 and 8.8.1.1).
 
    Encoding by other means is not compliant with this standard.
    Nevertheless, encoding using other character sets (with no indication
@@ -68,22 +105,22 @@
    clues in the article, or custom within the newsgroup) has been in use
    in some hierarchies, and such usage may be expected to continue for
    some period after the introduction of this standard.  Reading agents
-   MUST support the use of UTF-8, [RFC 2047] and [RFC 2231] in headers
-   and they MAY, when it is detected that none of these has been used,
-   attempt to interpet the header according to whatever other character
-   set can be deduced, or has been configued as a default by the reader.
+   MAY, when such usage is detected, attempt to interpet the header
+   according to whatever other character set can be deduced, or has been
+   configued as a default by the reader.
 
         NOTE: It is possible to determine, with a high degree of
         accuracy, when a given text containing octets with the 8th bit
         set was not encoded using UTF-8, and using this test to recover
         such non-compliant texts is therefore commended where no other
         harm could arise.
-
-   Exceptionally, Newsgroups-headers (5.5) MUST use UTF-8 in order to
-   ensure that they appear in their canonical form (in any case, a
-   Newsgroups-header is not one of the acceptable contexts of [RFC
-   2047]).  Certain exceptions to this rule are provided (8.7 and 8.8.1)
-   for use when mailing to moderators and other gatewaying applications.
+   The [RFC 2047] encoding is not available within headers which contain
+   a newsgroup-name, notably Newsgroups-headers and Followup-To-headers,
+   because a newsgroup-name is neither a phrase nor a comment. Moreover
+   such headers MUST in any case use UTF-8 in order to ensure that
+   newsgroup-names appear in their canonical form.  A special encoding
+   for newsgroup-names is provided in section 5.5.2 for use when mailing
+   to moderators and other gatewaying applications (8.7 and 8.8.1.1).
 
         NOTE: The choice between UTF-8 and [RFC 2047] when posting
         depends on various factors. Some reading agents do not recogize


Documents were processed to this format by Forrest J. Cavalier III