usefor-article-10 April 2003

4.4.1.  Character Sets within Article Headers

   Where the use of non-ASCII characters is required, they MUST be
   encoded using the MIME mechanisms defined in [RFC 2047] and [RFC
   2231].

   Examples:
      Organization: Technische =?iso-8859-1?Q?Universit=E4t_M=FCnchen?=
      Approved: =?iso-8859-1?Q?Fran=E7ois_Faur=E9?= <ff@modsite.example>
         (=?iso-8859-1*fr?Q?Mod=E9rateur_autoris=E9?=)
      Archive: yes; filename*=iso-8859-1'es'ma%F1ana.txt
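
   The following Python sketch illustrates how an agent might decode and
   produce such headers. It is an illustration under stated assumptions,
   not part of this standard: the function names are hypothetical, the
   RFC 2047 handling relies on Python's standard email.header module, and
   the extended parameter value is written with the percent-encoded
   octets that [RFC 2231] defines. Language tags inside encoded-words
   are not handled here.

```python
from email.header import Header, decode_header, make_header
from urllib.parse import unquote


def decode_encoded_words(value: str) -> str:
    """Decode RFC 2047 encoded-words in an unstructured header value."""
    return str(make_header(decode_header(value)))


def decode_extended_value(ext_value: str) -> tuple:
    """Split an RFC 2231 extended value (charset'language'text) and decode it.

    Returns (decoded_text, language). Hypothetical helper for illustration.
    """
    charset, language, text = ext_value.split("'", 2)
    return unquote(text, encoding=charset), language


def encode_for_header(text: str, charset: str = "iso-8859-1") -> str:
    """Encode text as RFC 2047 encoded-words so only US-ASCII is emitted."""
    return Header(text, charset=charset).encode()


organization = "Technische =?iso-8859-1?Q?Universit=E4t_M=FCnchen?="
print(decode_encoded_words(organization))
# Technische Universität München

print(decode_extended_value("iso-8859-1'es'ma%F1ana.txt"))
# ('mañana.txt', 'es')

# Encoding direction: the result contains no octets beyond US-ASCII
# and decodes back to the original text.
wire = encode_for_header("Universität München")
assert wire.isascii()
assert decode_encoded_words(wire) == "Universität München"
```

   A posting agent would apply an encoder of this kind only to the
   parts of a header where encoded-words are allowed, such as
   unstructured text, comments and phrases.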

        NOTE: The raw use of non-ASCII character sets or of encodings
        other than those described above is not compliant with this
        standard, even though such usage has been seen in some
        hierarchies (with no indication of which character set has been
        used beyond the user's ability to guess based upon other clues
        in the article, or custom within the newsgroup). Future
        extensions to this standard may make provision for other
        character sets, hence the requirement that octets beyond the
        US-ASCII range be transported without error.
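
   One way a posting or injecting agent could detect the non-compliant
   raw usage described in this NOTE is to scan each header line for
   octets above 127 before the article is sent. The Python sketch below
   assumes the header is available as raw bytes; the function name is
   hypothetical and the check itself is not specified by this standard.

```python
def non_ascii_offsets(header_line: bytes) -> list:
    """Return the offsets of octets outside the US-ASCII range (> 127)."""
    return [i for i, octet in enumerate(header_line) if octet > 0x7F]


# Raw 8-bit ISO-8859-1 octets in a header: not compliant.
raw = "Organization: Technische Universität München".encode("iso-8859-1")

# The same header using RFC 2047 encoded-words: pure US-ASCII.
encoded = b"Organization: Technische =?iso-8859-1?Q?Universit=E4t_M=FCnchen?="

print(non_ascii_offsets(raw))      # non-empty: raw 8-bit octets present
print(non_ascii_offsets(encoded))  # []
```

   Relaying agents, by contrast, are expected to pass such octets
   through unchanged, as the NOTE requires, rather than reject or
   rewrite them.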

--- ../usefor-article-09/Character_Sets_within_Article_Headers.out          February 2003
+++ ../usefor-article-10/Character_Sets_within_Article_Headers.out          April 2003
@@ -1,87 +1,8 @@
 4.4.1.  Character Sets within Article Headers
 
-   Within article headers, characters are represented as octets
-   according to the UTF-8 encoding scheme [RFC 2279] or [ISO/IEC 10646],
-   and hence all the characters in Unicode [UNICODE 3.2] or in the
-   Universal Multiple-Octet Coded Character Set (UCS) [ISO/IEC 10646]
-   (which is essentially identical to Unicode and expected to remain so)
-   are potentially available.  Although it will usually be unnecessary
-   to use language tagging within headers, the tagging facilities
-   provided in [UNICODE 3.2] (code points U+E0000 through U+E007F) MAY
-   be used for that purpose.
-
-        NOTE: UTF-8 is an encoding for the [ISO/IEC 10646] character set
-        (in both its 16 and 32 bit forms) with the property that any
-        octet less than 128 immediately represents the corresponding
-        US-ASCII character, thus ensuring upwards compatibility with
-        previous practice.  Non-ASCII characters from Unicode are
-        represented by sequences of octets satisfying the syntax of a
-        UTF8-xtra-char (2.4.2), which excludes certain octet sequences
-        not explicitly permitted by [RFC 2279].  Unicode includes all
-        characters from the ISO-8859 series of character sets [ISO
-        8859] (which includes all Cyrillic, Greek and Arabic characters)
-        together with the more elaborate characters used in Asian
-        countries. See the NOTEs in the following section for the
-        appropriate treatment of Unicode characters by reading agents.
-[The sentence mentioning [RFC 2279] could be simplified if [RFC 2279bis]
-has been accepted by the time this standard is published.]
-
-   Notwithstanding the great flexibility permitted by UTF-8, there is
-   need for restraint in its use in order that the essential components
-   of headers may be discerned using reading agents that cannot present
-   the full Unicode range. In particular, header-names and tokens MUST
-   be in US-ASCII, and certain other components of headers, as defined
-   elsewhere in this standard - notably msg-ids, date-times, dot-atoms,
-   domains and path-identities - MUST be in US-ASCII.  Comments, phrases
-   (as in mailboxes) and unstructured headers (such as the Subject-,
-   Organization- and Summary-headers) MAY use the full range of UTF-8
-   characters, but SHOULD nevertheless be invariant under Unicode
-   normalization NFC [UNICODE 3.2].
-
-        NOTE: Unicode allows for composite characters made up of a
-        starter character - which can be a letter, number, punctuation
-        mark, or symbol - plus zero or more combining marks (such as
-        accents, diacritics, and similar). The requirement that a
-        composite be invariant under normalization NFC means that, where
-        it could be written in more than one way, only one particular
-        one of those ways is allowed (for example, the single character
-        E-acute is preferred over E followed by a non-spacing acute
-        accent, and A-ring is preferred over the Angstrom symbol). At
-        least for the main European languages, for which all the needed
-        composites are already available as single characters, it is
-        unlikely that posting agents will need to take any special steps
-        to ensure normalization.
-
-   In the particular case of newsgroup-names (see 5.5) there are more
-   stringent requirements regarding the normalization and other usages
-   of Unicode.
-
-   Where the use of non-ASCII characters is permitted as above, they MAY
-   be encoded in UTF-8 or they MAY be encoded using the MIME mechanisms
-   defined in [RFC 2047] and [RFC 2231].  For this purpose, all headers
-   defined in this standard are to be considered as "extension message
-   header fields" for the purpose of section 5 of [RFC 2047] (insofar as
-   they are not already covered under the existing Email standards). The
-   effect of this is to permit the use of [RFC 2047] encodings within
-   any unstructured header, or within any comment or phrase permitted
-   within any structured header.  Additionally,  [RFC 2047] is
-   considered to incorporate the extension to allow language tags within
-   encoded-words described in [RFC 2231].  Likewise, the syntax for
-   parameter (see 4.1 above) is to be considered as replaced by the
-   revised syntax given in [RFC 2231], the effect of which is to allow
-   the use of parameter value continuations, character sets and language
-   information within the MIME-style parameters introduced in this
-   standard (4.2.2).
-[We could go further and include that syntax explicitly in this
-document.]
-
-   Exceptionally, where some other protocol, for example the
-   authentication protocol based on OpenPGP defined in [RFC 3156],
-   restricts some header to 7-bit data, the [RFC 2047] and [RFC 2231]
-   encodings MUST be used in preference to UTF-8 (see also the similar
-   restriction in 6.21.3).
-[This presupposes that the extension to permit UTF-8 in body part
-headers in 6.21.1 survives.]
+   Where the use of non-ASCII characters is required, they MUST be
+   encoded using the MIME mechanisms defined in [RFC 2047] and [RFC
+   2231].
 
    Examples:
       Organization: Technische =?iso-8859-1?Q?Universit=E4t_M=FCnchen?=
@@ -89,48 +10,13 @@
          (=?iso-8859-1*fr?Q?Mod=E9rateur_autoris=E9?=)
       Archive: yes; filename*=iso-8859-1'es'ma%F1ana.txt
 
-   Reading agents MUST support the use of UTF-8, [RFC 2047] and [RFC
-   2231] in all those headers defined in this standard and in the Email
-   standards, at least to the extent of their ability to display the
-   characters presented to them. Moreover, since Netnews articles are
-   regularly emailed as well as posted, and the current Email standards
-   do not currently admit the use of full UTF-8 in headers, posting
-   agents MUST ensure that [RFC 2047] and [RFC 2231] are used in
-   preference to UTF-8 in those cases, at least within the emailed
-   version (see also 6.9 and 8.8.1.1).
-
-   Encoding by other means is not compliant with this standard.
-   Nevertheless, encoding using other character sets (with no indication
-   of which one beyond the user's ability to guess based upon other
-   clues in the article, or custom within the newsgroup) has been in use
-   in some hierarchies, and such usage may be expected to continue for
-   some period after the introduction of this standard.  Reading agents
-   MAY, when such usage is detected, attempt to interpret the header
-   according to whatever other character set can be deduced, or has been
-   configured as a default by the reader.
-
-        NOTE: It is possible to determine, with a high degree of
-        accuracy, when a given text containing octets with the 8th bit
-        set was not encoded using UTF-8, and using this test to recover
-        such non-compliant texts is therefore commended where no other
-        harm could arise.
-   The [RFC 2047] encoding is not available within headers which contain
-   a newsgroup-name, notably Newsgroups-headers and Followup-To-headers,
-   because a newsgroup-name is neither a phrase nor a comment. Moreover
-   such headers MUST in any case use UTF-8 in order to ensure that
-   newsgroup-names appear in their canonical form.  A special encoding
-   for newsgroup-names is provided in section 5.5.2 for use when mailing
-   to moderators and other gatewaying applications (8.7 and 8.8.1.1).
-
-        NOTE: The choice between UTF-8 and [RFC 2047] when posting
-        depends on various factors. Some reading agents do not recognize
-        [RFC 2047], and some are incapable of decoding UTF-8 (though
-        there is an increasing tendency for modern reading agents to
-        understand, or to be configurable to understand, both). Since
-        headers encoded in UTF-8 are currently prohibited in Email,
-        special consideration needs to be given to articles that are
-        both posted and mailed (6.9) or which are mailed to moderators
-        (see 8.2.2).  Posters and implementors of posting agents need to
-        take account of all these factors when deciding which method to
-        use.
+        NOTE: The raw use of non-ASCII character sets or of encodings
+        other than those described above is not compliant with this
+        standard, even though such usage has been seen in some
+        hierarchies (with no indication of which character set has been
+        used beyond the user's ability to guess based upon other clues
+        in the article, or custom within the newsgroup). Future
+        extensions to this standard may make provision for other
+        character sets, hence the requirement that octets beyond the
+        US-ASCII range be transported without error.
 

Documents were processed to this format by Forrest J. Cavalier III