usefor-article-07 May 2002

4.4.1.  Character Sets within Article Headers

   Within article headers, characters are represented as octets
   according to the UTF-8 encoding scheme [RFC 2279] or [ISO/IEC 10646],
   and hence all the characters in Unicode [UNICODE 3.1] or in the
   Universal Multiple-Octet Coded Character Set (UCS) [ISO/IEC 10646]
   (which is essentially a superset of Unicode and expected to remain
   so) are potentially available. However, processing all octets in the
   same manner as US-ASCII characters should ensure correct behaviour in
   most situations.

        NOTE: UTF-8 is an encoding for 16-bit (and even 32-bit) character
        sets with the property that any octet less than 128 immediately
        represents the corresponding US-ASCII character, thus ensuring
        upwards compatibility with previous practice.  Non-ASCII
        characters from Unicode are represented by sequences of octets
        satisfying the syntax of a UTF8-xtra-char (2.4.2), which
        excludes certain octet sequences not explicitly permitted by
        [RFC 2279].  Unicode includes all characters from the ISO-8859
        series of character sets [ISO 8859] (which includes all
        Cyrillic, Greek and Arabic characters) together with the more
        elaborate characters used in Asian countries. See the following
        section for the appropriate treatment of Unicode characters by
        reading agents.
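
   As an illustrative sketch only (Python is used for convenience, and
   the helper name utf8_octets is hypothetical, not part of this
   standard), the octet-level property described in the NOTE can be
   observed directly: US-ASCII characters occupy one octet below 128,
   while every octet of a non-ASCII sequence is 128 or above.

```python
def utf8_octets(text):
    """Return the UTF-8 octets of text as a list of integers."""
    return list(text.encode("utf-8"))


# US-ASCII characters encode as a single octet below 128 each.
ascii_octets = utf8_octets("Subject")

# A non-ASCII character becomes a sequence in which every octet is
# 128 or above, so no octet of it can be mistaken for US-ASCII.
e_acute = utf8_octets("\u00e9")  # LATIN SMALL LETTER E WITH ACUTE
euro = utf8_octets("\u20ac")     # EURO SIGN

print(ascii_octets, e_acute, euro)
```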

   Notwithstanding the great flexibility permitted by UTF-8, there is
   need for restraint in its use in order that the essential components
   of headers may be discerned using reading agents that cannot present
   the full Unicode range. In particular, header-names and tokens MUST
   be in US-ASCII, and certain other components of headers, as defined
   elsewhere in this standard - notably msg-ids, date-times, dot-atoms,
   domains and path-identities - MUST be in US-ASCII.  Comments, phrases
   (as in addresses) and unstructured headers (such as the Subject-,
   Organization- and Summary-headers) MAY use the full range of UTF-8
   characters, but SHOULD nevertheless be invariant under Unicode
   normalization NFC [UNICODE 3.1].
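
   A minimal sketch of how an agent might apply the first part of these
   restrictions to one unfolded header line is given below. The
   function name check_header_line and its messages are assumptions
   made for illustration. It checks only that the header-name is
   US-ASCII and that the body is well-formed UTF-8; it does not parse
   structured components such as msg-ids or date-times. Note also that
   Python's strict UTF-8 decoder follows the later, narrower definition
   of UTF-8 and so rejects some sequences that [RFC 2279] would allow.

```python
def check_header_line(raw):
    """Return a list of problems found in one unfolded header line.

    raw is a bytes object such as b"Subject: hello".
    """
    name, sep, body = raw.partition(b":")
    if not sep:
        return ["no colon found"]

    problems = []
    # Header-names MUST be in US-ASCII: no octet may be 128 or above.
    if any(octet >= 128 for octet in name):
        problems.append("header-name is not US-ASCII")
    # The body of an unstructured header may use the full UTF-8 range,
    # but it still has to be a well-formed UTF-8 octet sequence.
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("header body is not valid UTF-8")
    return problems
```

   For example, check_header_line("Subject: caf\u00e9".encode("utf-8"))
   reports no problems, whereas the same text encoded in ISO-8859-1 is
   reported as invalid UTF-8.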

        NOTE: Unicode allows for composite characters made up of a
        starter character - which can be a letter, number, punctuation
        mark, or symbol - plus zero or more combining marks (such as
        accents, diacritics, and similar). The requirement that a
        composite be invariant under normalization NFC means that, where
        it could be written in more than one way, only one particular
        one is allowed (for example, the single character E-acute is
        preferred over E followed by a non-spacing acute accent, and A-
        ring is preferred over the Angstrom symbol). At least for the
        main European languages, for which all the needed composites are
        already available as single characters, it is unlikely that
        posting agents will need to take any special steps to ensure
        normalization.
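
   The two examples in the NOTE above can be checked with a short
   sketch using Python's standard unicodedata module. The helper name
   is_nfc_invariant is hypothetical; a posting agent could use a test
   of this kind before emitting an unstructured header, though this
   standard does not prescribe any particular implementation.

```python
import unicodedata


def is_nfc_invariant(text):
    """Return True if text is unchanged by Unicode normalization NFC."""
    return unicodedata.normalize("NFC", text) == text


precomposed = "\u00c9"    # E-acute as a single character
decomposed = "E\u0301"    # E followed by COMBINING ACUTE ACCENT
angstrom = "\u212b"       # ANGSTROM SIGN
a_ring = "\u00c5"         # LATIN CAPITAL LETTER A WITH RING ABOVE

print(is_nfc_invariant(precomposed))  # the preferred form
print(is_nfc_invariant(decomposed))   # not invariant under NFC
print(unicodedata.normalize("NFC", angstrom) == a_ring)
```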

   In the particular case of newsgroup-names (see 5.5) there are more
   stringent requirements regarding the use of UTF-8 and Unicode.

   Where the use of non-ASCII characters, encoded in UTF-8, is permitted
   as above, they MAY also be encoded using the MIME mechanism defined
   in [RFC 2047], but this usage is deprecated within news articles
   (even though it is required in email messages) since it is less
   legible in older reading agents which support neither it nor UTF-8.
   Nevertheless, reading agents SHOULD support this usage, but only in
   those contexts explicitly mentioned in [RFC 2047].
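
   As a hedged illustration of how a reading agent might support this
   usage, the sketch below decodes [RFC 2047] encoded-words with
   Python's standard email.header module. The wrapper name
   decode_encoded_words is an assumption, and the sketch makes no
   attempt to restrict decoding to the contexts that [RFC 2047]
   permits; a real agent would apply it only to those components.

```python
from email.header import decode_header, make_header


def decode_encoded_words(value):
    """Decode any RFC 2047 encoded-words in value into a plain string."""
    return str(make_header(decode_header(value)))


# Q-encoded ISO-8859-1 and B-encoded UTF-8 forms of non-ASCII text.
print(decode_encoded_words("=?iso-8859-1?q?p=F6stal?="))
print(decode_encoded_words("=?UTF-8?B?w6k=?="))
```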

   Similar considerations apply to non-ASCII characters within the
   values of parameters (which, according to the syntax, MUST be in the
   form of quoted-strings in order for UTF8-xtra-chars to be
   accommodated). Such values MAY be encoded using the MIME mechanism
   defined in [RFC 2231], but this usage is deprecated within news
   articles (even though it is required in email messages) since it is
   less legible in older reading agents which support neither it nor
   UTF-8. Nevertheless, reading agents SHOULD support this usage.
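
   A rough sketch of decoding a single [RFC 2231] extended parameter
   value, of the form charset'language'percent-encoded-octets, is shown
   below. The function name decode_extended_value is hypothetical, the
   fallback to US-ASCII for an empty charset is an assumption, and the
   sketch ignores parameter continuations (the *0, *1 forms), which a
   complete reading agent would also need to reassemble.

```python
from urllib.parse import unquote


def decode_extended_value(value):
    """Decode one RFC 2231 extended value: charset'language'text."""
    charset, _language, encoded = value.split("'", 2)
    # Percent-decoding yields octets, interpreted in the named charset.
    # Falling back to US-ASCII for an empty charset is an assumption.
    return unquote(encoded, encoding=charset or "us-ascii",
                   errors="replace")


print(decode_extended_value("utf-8''%E2%82%AC%20rates"))
```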

#Diff to first older



--- ../usefor-article-06/Character_Sets_within_Article_Headers.out          November 2001
+++ ../usefor-article-07/Character_Sets_within_Article_Headers.out          May 2002
@@ -8,20 +8,20 @@
    so) are potentially available. However, processing all octets in the
    same manner as US-ASCII characters should ensure correct behaviour in
    most situations.
+
         NOTE: UTF-8 is an encoding for 16bit (and even 32bit) character
         sets with the property that any octet less than 128 immediately
         represents the corresponding US-ASCII character, thus ensuring
         upwards compatibility with previous practice.  Non-ASCII
         characters from Unicode are represented by sequences of octets
-        satisfying the syntax of a UTF8-xtra-char (2.4), which excludes
-        certain octet sequences not explicitly permitted by [RFC 2279].
-        Unicode includes all characters from the ISO-8859 series of
-        characters sets [ISO 8859] (which includes all Cyrillic, Greek
-        and Arabic characters) together with the more elaborate
-        characters used in Asian countries. See the following section
-        for the appropriate treatment of Unicode characters by reading
-        agents.
-
+        satisfying the syntax of a UTF8-xtra-char (2.4.2), which
+        excludes certain octet sequences not explicitly permitted by
+        [RFC 2279].  Unicode includes all characters from the ISO-8859
+        series of characters sets [ISO 8859] (which includes all
+        Cyrillic, Greek and Arabic characters) together with the more
+        elaborate characters used in Asian countries. See the following
+        section for the appropriate treatment of Unicode characters by
+        reading agents.
    Notwithstanding the great flexibility permitted by UTF-8, there is
    need for restraint in its use in order that the essential components
    of headers may be discerned using reading agents that cannot present
@@ -29,9 +29,10 @@
    be in US-ASCII, and certain other components of headers, as defined
    elsewhere in this standard - notably msg-ids, date-times, dot-atoms,
    domains and path-identities - MUST be in US-ASCII.  Comments, phrases
-   (as in addresses) and unstructureds (as in Subject headers) MAY use
-   the full range of UTF-8 characters, but SHOULD nevertheless be
-   invariant under Unicode normalization NFC [UNICODE 3.1].
+   (as in addresses) and unstructured headers (such as the Subject-,
+   Organization- and Summary-headers) MAY use the full range of UTF-8
+   characters, but SHOULD nevertheless be invariant under Unicode
+   normalization NFC [UNICODE 3.1].
 
         NOTE: Unicode allows for composite characters made up of a
         starter character - which can be a letter, number, punctuation
@@ -53,8 +54,17 @@
    Where the use of non-ASCII characters, encoded in UTF-8, is permitted
    as above, they MAY also be encoded using the MIME mechanism defined
    in [RFC 2047], but this usage is deprecated within news articles
-   (even though it is required in mail messages) since it is less
+   (even though it is required in email messages) since it is less
    legible in older reading agents which support neither it nor UTF-8.
    Nevertheless, reading agents SHOULD support this usage, but only in
    those contexts explicitly mentioned in [RFC 2047].
+
+   Similar considerations apply to non-ASCII characters within the
+   values of parameters (which, according to the syntax, MUST be in the
+   form of quoted-strings in order for UTF8-xtra-chars to be
+   accomodated). Such values MAY be encoded using the MIME mechanism
+   defined in [RFC 2231], but this usage is deprecated within news
+   articles (even though it is required in email messages) since it is
+   less legible in older reading agents which support neither it nor
+   UTF-8. Nevertheless, reading agents SHOULD support this usage.

Documents were processed to this format by Forrest J. Cavalier III