usefor-article-08 August 2002

4.4.1.  Character Sets within Article Headers

   Within article headers, characters are represented as octets
   according to the UTF-8 encoding scheme [RFC 2279] or [ISO/IEC 10646],
   and hence all the characters in Unicode [UNICODE 3.2] or in the
   Universal Multiple-Octet Coded Character Set (UCS) [ISO/IEC 10646]
   (which is essentially a superset of Unicode and expected to remain
   so) are potentially available. However, processing all octets in the
   same manner as US-ASCII characters should ensure correct behaviour in
   most situations.

        NOTE: UTF-8 is an encoding for the [ISO/IEC 10646] character set
        (in both its 16 and 32 bit forms) with the property that any
        octet less than 128 immediately represents the corresponding
        US-ASCII character, thus ensuring upwards compatibility with
        previous practice.  Non-ASCII characters from Unicode are
        represented by sequences of octets satisfying the syntax of a
        UTF8-xtra-char (2.4.2), which excludes certain octet sequences
        not explicitly permitted by [RFC 2279].  Unicode includes all
        characters from the ISO-8859 series of character sets [ISO
        8859] (which includes all Cyrillic, Greek and Arabic characters)
        together with the more elaborate characters used in Asian
        countries. See the NOTEs in the following section for the
        appropriate treatment of Unicode characters by reading agents.
[The sentence mentioning [RFC 2279] could be simplified if [RFC 2279bis]
has been accepted by the time this standard is published.]
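
   The property described in the NOTE above can be illustrated with a
   short sketch. It is not part of this standard; the helper name
   utf8_octets is hypothetical and Python's built-in UTF-8 codec is
   assumed.

```python
from typing import List


def utf8_octets(text: str) -> List[int]:
    """Return the octets that represent `text` in UTF-8 (illustrative helper)."""
    return list(text.encode("utf-8"))


# A US-ASCII character is represented by a single octet below 128.
print(utf8_octets("A"))        # [65]

# Non-ASCII characters become sequences in which every octet is >= 128,
# so they can never be mistaken for US-ASCII characters.
print(utf8_octets("\u00e9"))   # [195, 169]       (e-acute)
print(utf8_octets("\u20ac"))   # [226, 130, 172]  (euro sign)
```

   Because no octet of a multi-octet sequence falls below 128, software
   that scans headers for US-ASCII delimiters such as ":" or "," can do
   so without decoding the UTF-8 first.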

   Notwithstanding the great flexibility permitted by UTF-8, there is
   need for restraint in its use in order that the essential components
   of headers may be discerned using reading agents that cannot present
   the full Unicode range. In particular, header-names and tokens MUST
   be in US-ASCII, and certain other components of headers, as defined
   elsewhere in this standard - notably msg-ids, date-times, dot-atoms,
   domains and path-identities - MUST be in US-ASCII.  Comments, phrases
   (as in mailboxes) and unstructured headers (such as the Subject-,
   Organization- and Summary-headers) MAY use the full range of UTF-8
   characters, but SHOULD nevertheless be invariant under Unicode
   normalization NFC [UNICODE 3.2].

        NOTE: Unicode allows for composite characters made up of a
        starter character - which can be a letter, number, punctuation
        mark, or symbol - plus zero or more combining marks (such as
        accents, diacritics, and similar). The requirement that a
        composite be invariant under normalization NFC means that, where
        it could be written in more than one way, only one particular
        one of those ways is allowed (for example, the single character
        E-acute is preferred over E followed by a non-spacing acute
        accent, and A-ring is preferred over the Angstrom symbol). At
        least for the main European languages, for which all the needed
        composites are already available as single characters, it is
        unlikely that posting agents will need to take any special steps
        to ensure normalization.
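
   As an illustration only, the two examples in the NOTE above can be
   checked with Python's standard unicodedata module. The function name
   is_nfc is hypothetical, and the normalization data is that of
   whichever Unicode version the Python build carries, which may be
   later than [UNICODE 3.2].

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """True if `text` is unchanged by normalization NFC (illustrative helper)."""
    return unicodedata.normalize("NFC", text) == text


# E followed by a combining acute accent is not in NFC form;
# normalization yields the single character E-acute (U+00C9).
decomposed = "E\u0301"
print(is_nfc(decomposed))                                   # False
print(unicodedata.normalize("NFC", decomposed) == "\u00c9")  # True

# The Angstrom symbol (U+212B) normalizes to A-ring (U+00C5).
print(unicodedata.normalize("NFC", "\u212b") == "\u00c5")    # True
```

   A posting agent could apply such a check to comments, phrases and
   unstructured headers before sending, and normalize the text when the
   check fails.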

   In the particular case of newsgroup-names (see 5.5) there are more
   stringent requirements regarding the normalization and other usages
   of Unicode.

   Where the use of non-ASCII characters is permitted as above, they MAY
   be encoded in UTF-8 and they MAY be encoded using the MIME mechanisms
   defined in [RFC 2047] and [RFC 2231], but only in those contexts
   explicitly mentioned in those documents (unstructured headers,
   phrases and comments in the one, quoted-strings within parameters in
   the other).
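
   The [RFC 2047] alternative can be sketched with Python's standard
   email.header module. This is an illustration under stated
   assumptions, not a prescribed implementation: the helper names
   encode_2047 and decode_2047 are hypothetical, UTF-8 is assumed as the
   charset of the encoded-word, and [RFC 2231] parameter encoding is not
   shown.

```python
from email.header import Header, decode_header, make_header


def encode_2047(text: str) -> str:
    """Encode `text` as RFC 2047 encoded-word(s), using UTF-8 as the charset."""
    return Header(text, charset="utf-8").encode()


def decode_2047(value: str) -> str:
    """Decode any RFC 2047 encoded-words found in `value`."""
    return str(make_header(decode_header(value)))


subject = "Gr\u00fc\u00dfe"          # contains u-umlaut and sharp s
encoded = encode_2047(subject)

print(encoded.isascii())               # True: only US-ASCII octets remain
print(decode_2047(encoded) == subject)  # True: the text survives a round trip
```

   The encoded form is suitable only for the contexts that [RFC 2047]
   itself permits (unstructured headers, phrases and comments).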

   Encoding by other means is not compliant with this standard.
   Nevertheless, encoding using other character sets (with no indication
   of which one beyond the user's ability to guess based upon other
   clues in the article, or custom within the newsgroup) has been in use
   in some hierarchies, and such usage may be expected to continue for
   some period after the introduction of this standard.  Reading agents
   MUST support the use of UTF-8, [RFC 2047] and [RFC 2231] in headers
   and they MAY, when it is detected that none of these has been used,
   attempt to interpret the header according to whatever other character
   set can be deduced, or has been configured as a default by the reader.

        NOTE: It is possible to determine, with a high degree of
        accuracy, when a given text containing octets with the 8th bit
        set was not encoded using UTF-8, and using this test to recover
        such non-compliant texts is therefore commended where no other
        harm could arise.
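
   One way a reading agent might apply this test is sketched below. The
   function name decode_header_octets and the ISO-8859-1 default are
   assumptions made for illustration. Python's strict UTF-8 decoder
   follows the later, tighter definition of UTF-8, so it rejects some
   sequences that [RFC 2279] permits and only approximates the syntax of
   2.4.2.

```python
from typing import Tuple


def decode_header_octets(raw: bytes, fallback: str = "iso-8859-1") -> Tuple[str, str]:
    """Decode header octets, returning (text, charset actually used).

    Valid UTF-8 (which includes pure US-ASCII) is taken as UTF-8.
    Anything else is treated as non-compliant legacy text and decoded
    with `fallback`, a default assumed to be configured by the reader.
    """
    try:
        return raw.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        # Octets with the 8th bit set that do not form valid UTF-8
        # sequences: fall back to the configured character set.
        return raw.decode(fallback, errors="replace"), fallback


print(decode_header_octets(b"Gr\xc3\xbc\xc3\x9fe"))  # valid UTF-8
print(decode_header_octets(b"Gr\xfc\xdfe"))          # legacy ISO-8859-1 octets
```

   The test is reliable because text in a legacy 8-bit character set
   rarely happens to form well-formed UTF-8 sequences, although short
   strings can occasionally be misclassified.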

   Exceptionally, Newsgroups-headers (5.5) MUST use UTF-8 in order to
   ensure that they appear in their canonical form (in any case, a
   Newsgroups-header is not one of the acceptable contexts of [RFC
   2047]).  Certain exceptions to this rule are provided (8.7 and 8.8.1)
   for use when mailing to moderators and other gatewaying applications.

        NOTE: The choice between UTF-8 and [RFC 2047] when posting
        depends on various factors. Some reading agents do not recognize
        [RFC 2047], and some are incapable of decoding UTF-8 (though
        there is an increasing tendency for modern reading agents to
        understand, or to be configurable to understand, both). Since
        headers encoded in UTF-8 are currently prohibited in Email,
        special consideration needs to be given to articles that are
        both posted and mailed (6.9) or which are mailed to moderators
        (see 8.2.2).  Posters and implementors of posting agents need to
        take account of all these factors when deciding which method to
        use.

#Diff to first older


--- ../usefor-article-07/Character_Sets_within_Article_Headers.out          May 2002
+++ ../usefor-article-08/Character_Sets_within_Article_Headers.out          August 2002
@@ -2,26 +2,29 @@
 
    Within article headers, characters are represented as octets
    according to the UTF-8 encoding scheme [RFC 2279] or [ISO/IEC 10646],
-   and hence all the characters in Unicode [UNICODE 3.1] or in the
+   and hence all the characters in Unicode [UNICODE 3.2] or in the
    Universal Multiple-Octet Coded Character Set (UCS) [ISO/IEC 10646]
    (which is essentially a superset of Unicode and expected to remain
    so) are potentially available. However, processing all octets in the
    same manner as US-ASCII characters should ensure correct behaviour in
    most situations.
 
-        NOTE: UTF-8 is an encoding for 16bit (and even 32bit) character
-        sets with the property that any octet less than 128 immediately
-        represents the corresponding US-ASCII character, thus ensuring
-        upwards compatibility with previous practice.  Non-ASCII
-        characters from Unicode are represented by sequences of octets
-        satisfying the syntax of a UTF8-xtra-char (2.4.2), which
-        excludes certain octet sequences not explicitly permitted by
-        [RFC 2279].  Unicode includes all characters from the ISO-8859
-        series of characters sets [ISO 8859] (which includes all
-        Cyrillic, Greek and Arabic characters) together with the more
-        elaborate characters used in Asian countries. See the following
-        section for the appropriate treatment of Unicode characters by
-        reading agents.
+        NOTE: UTF-8 is an encoding for the [ISO/IEC 10646] character set
+        (in both its 16 and 32 bit forms) with the property that any
+        octet less than 128 immediately represents the corresponding
+        US-ASCII character, thus ensuring upwards compatibility with
+        previous practice.  Non-ASCII characters from Unicode are
+        represented by sequences of octets satisfying the syntax of a
+        UTF8-xtra-char (2.4.2), which excludes certain octet sequences
+        not explicitly permitted by [RFC 2279].  Unicode includes all
+        characters from the ISO-8859 series of characters sets [ISO
+        8859] (which includes all Cyrillic, Greek and Arabic characters)
+        together with the more elaborate characters used in Asian
+        countries. See the NOTEs in the following section for the
+        appropriate treatment of Unicode characters by reading agents.
+[The sentence mentioning [RFC 2279] could be simplified if [RFC 2279bis]
+has been accepted by the time this standard is published.]
+
    Notwithstanding the great flexibility permitted by UTF-8, there is
    need for restraint in its use in order that the essential components
    of headers may be discerned using reading agents that cannot present
@@ -29,10 +32,10 @@
    be in US-ASCII, and certain other components of headers, as defined
    elsewhere in this standard - notably msg-ids, date-times, dot-atoms,
    domains and path-identities - MUST be in US-ASCII.  Comments, phrases
-   (as in addresses) and unstructured headers (such as the Subject-,
+   (as in mailboxes) and unstructured headers (such as the Subject-,
    Organization- and Summary-headers) MAY use the full range of UTF-8
    characters, but SHOULD nevertheless be invariant under Unicode
-   normalization NFC [UNICODE 3.1].
+   normalization NFC [UNICODE 3.2].
 
         NOTE: Unicode allows for composite characters made up of a
         starter character - which can be a letter, number, punctuation
@@ -40,31 +43,57 @@
         accents, diacritics, and similar). The requirement that a
         composite be invariant under normalization NFC means that, where
         it could be written in more than one way, only one particular
-        one is allowed (for example, the single character E-acute is
-        preferred over E followed by a non-spacing acute accent, and A-
-        ring is preferred over the Angstrom symbol). At least for the
-        main European languages, for which all the needed composites are
-        already available as single characters, it is unlikely that
-        posting agents will need to take any special steps to ensure
-        normalization.
+        one of those ways is allowed (for example, the single character
+        E-acute is preferred over E followed by a non-spacing acute
+        accent, and A-ring is preferred over the Angstrom symbol). At
+        least for the main European languages, for which all the needed
+        composites are already available as single characters, it is
+        unlikely that posting agents will need to take any special steps
+        to ensure normalization.
 
    In the particular case of newsgroup-names (see 5.5) there are more
-   stringent requirements regarding the use of UTF-8 and Unicode.
+   stringent requirements regarding the normalization and other usages
+   of Unicode.
+
+   Where the use of non-ASCII characters is permitted as above, they MAY
+   be encoded in UTF-8 and they MAY be encoded using the MIME mechanisms
+   defined in [RFC 2047] and [RFC 2231], but only in those contexts
+   explicitly mentioned in those documents (unstructured headers,
+   phrases and comments in the one, quoted-strings within parameters in
+   the other).
+
+   Encoding by other means is not compliant with this standard.
+   Nevertheless, encoding using other character sets (with no indication
+   of which one beyond the user's ability to guess based upon other
+   clues in the article, or custom within the newsgroup) has been in use
+   in some hierarchies, and such usage may be expected to continue for
+   some period after the introduction of this standard.  Reading agents
+   MUST support the use of UTF-8, [RFC 2047] and [RFC 2231] in headers
+   and they MAY, when it is detected that none of these has been used,
+   attempt to interpet the header according to whatever other character
+   set can be deduced, or has been configued as a default by the reader.
+
+        NOTE: It is possible to determine, with a high degree of
+        accuracy, when a given text containing octets with the 8th bit
+        set was not encoded using UTF-8, and using this test to recover
+        such non-compliant texts is therefore commended where no other
+        harm could arise.
+
+   Exceptionally, Newsgroups-headers (5.5) MUST use UTF-8 in order to
+   ensure that they appear in their canonical form (in any case, a
+   Newsgroups-header is not one of the acceptable contexts of [RFC
+   2047]).  Certain exceptions to this rule are provided (8.7 and 8.8.1)
+   for use when mailing to moderators and other gatewaying applications.
 
-   Where the use of non-ASCII characters, encoded in UTF-8, is permitted
-   as above, they MAY also be encoded using the MIME mechanism defined
-   in [RFC 2047], but this usage is deprecated within news articles
-   (even though it is required in email messages) since it is less
-   legible in older reading agents which support neither it nor UTF-8.
-   Nevertheless, reading agents SHOULD support this usage, but only in
-   those contexts explicitly mentioned in [RFC 2047].
-
-   Similar considerations apply to non-ASCII characters within the
-   values of parameters (which, according to the syntax, MUST be in the
-   form of quoted-strings in order for UTF8-xtra-chars to be
-   accomodated). Such values MAY be encoded using the MIME mechanism
-   defined in [RFC 2231], but this usage is deprecated within news
-   articles (even though it is required in email messages) since it is
-   less legible in older reading agents which support neither it nor
-   UTF-8. Nevertheless, reading agents SHOULD support this usage.
+        NOTE: The choice between UTF-8 and [RFC 2047] when posting
+        depends on various factors. Some reading agents do not recogize
+        [RFC 2047], and some are incapable of decoding UTF-8 (though
+        there in an increasing tendency for modern reading agents to
+        understand, or to be configurable to understand, both). Since
+        headers encoded in UTF-8 are currently prohibited in Email,
+        special consideration needs to be given to articles that are
+        both posted and mailed (6.9) or which are mailed to moderators
+        (see 8.2.2).  Posters and implementors of posting agents need to
+        take account of all these factors when deciding which method to
+        use.

Documents were processed to this format by Forrest J. Cavalier III