usefor-article-03 February 2000
4.4. Characters and Character Sets
Transmission paths for news articles MUST treat news articles as
uninterpreted sequences of octets, excluding the values 0 (ASCII NUL)
and 13 and 10 (ASCII CR and LF, which MUST ONLY appear in the
combination CRLF which denotes a line separator).
NOTE: This corresponds to the range of octets permitted for
MIME "8bit data" [RFC 2045]. Thus raw binary data cannot be
transmitted in an article body except by the use of a Content-
Transfer-Encoding such as base64.
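
The octet restriction above can be checked mechanically. The following
is a minimal sketch in Python and is not part of this draft; the
function names octets_are_transmissible and to_base64_body are
hypothetical. It rejects NUL and any CR or LF outside a CRLF pair, and
shows how base64 brings arbitrary binary data within the permitted
range.

```python
import base64


def octets_are_transmissible(article: bytes) -> bool:
    """Return True if the article has no NUL and CR/LF occur only as CRLF."""
    if b"\x00" in article:
        return False
    # Drop every well-formed CRLF pair; any CR or LF still present
    # afterwards must have been a bare one.
    remainder = article.replace(b"\r\n", b"")
    return b"\r" not in remainder and b"\n" not in remainder


def to_base64_body(raw: bytes) -> bytes:
    """Encode raw binary as base64 text with CRLF line separators."""
    # base64.encodebytes ends each output line with a bare LF, so the
    # separators are rewritten as CRLF here.
    return base64.encodebytes(raw).replace(b"\n", b"\r\n")
```

With these definitions, octets_are_transmissible(bytes(range(256))) is
false, while the same data passed through to_base64_body passes the
check and is recovered unchanged by base64.decodebytes.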
An octet, or a sequence of octets, may represent a character in some
Coded Character Set (CCS) as determined by some Character Encoding
Scheme (CES) [RFC 2130].
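
As an illustration only, the same octets can stand for different
characters depending on which character set is assumed. The sketch
below uses Python's built-in "utf-8" and "latin-1" codecs; the choice
of these two is an assumption made for the example and is not something
this draft prescribes.

```python
# Two octets whose meaning depends on the charset applied to them.
octets = b"\xc3\xa9"

# Read as UTF-8, the pair encodes the single character U+00E9.
as_utf8 = octets.decode("utf-8")

# Read as ISO-8859-1 (Latin-1), each octet is a character of its own:
# U+00C3 followed by U+00A9.
as_latin1 = octets.decode("latin-1")
```

A receiver that is not told which charset applies cannot choose between
these readings, which is why the charset has to be declared alongside
the octets.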
If it comes to a relaying agent's attention that it is being asked to
pass an article using the Content-Transfer-Encoding "8bit" to a
relaying agent that does not support it, it SHOULD report this error
to its administrator. It MUST refuse to pass the article and MUST NOT
re-encode it with different MIME encodings.
NOTE: This strategy will do little harm. The target relaying
agent is unlikely to be able to make use of the article on its
own servers, and the usual flooding algorithm will likely find
some alternative route to get the article to destinations where
it is needed.
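
The relaying rule above can be sketched as follows. This is a
hypothetical illustration, not an interface defined by this draft: the
Peer class, its supports_8bit flag, and the send and report callbacks
are all assumptions, and how a relaying agent learns a peer's
capabilities is outside the scope of the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Peer:
    name: str
    supports_8bit: bool  # assumed to be known from local configuration


def uses_8bit_cte(headers: Dict[str, str]) -> bool:
    """True if the article declares Content-Transfer-Encoding: 8bit."""
    # Assumes header names were normalized to this spelling on input.
    value = headers.get("Content-Transfer-Encoding", "")
    return value.strip().lower() == "8bit"


def relay(headers: Dict[str, str], body: bytes, peer: Peer,
          send: Callable[[Peer, Dict[str, str], bytes], None],
          report: Callable[[str], None]) -> bool:
    """Pass the article to peer unchanged, or refuse; never re-encode."""
    if uses_8bit_cte(headers) and not peer.supports_8bit:
        # SHOULD: tell the administrator. MUST: refuse to pass it on.
        report("refused 8bit article for peer " + peer.name)
        return False
    # The body is forwarded exactly as received.
    send(peer, headers, body)
    return True
```

The function has no re-encoding branch at all, so the MUST NOT in the
rule is satisfied by construction rather than by a runtime check.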
#Diff to first older
--- ../s-o-1036/Characters_And_Character_Sets.out June 1994
+++ ../usefor-article-03/Characters_And_Character_Sets.out February 2000
@@ -1,206 +1,28 @@
4.4. Characters And Character Sets
-Header and body lines MAY contain any ASCII characters other
-than CR (ASCII 13), LF (ASCII 10), and NUL (ASCII 0).
-
- NOTE: CR and LF are excluded because they clash
- with common EOL conventions. NUL is excluded
- because it clashes with the C end-of-string con-
- vention, which is significant to most existing
- news software. These three characters are
- unlikely to be transmitted successfully.
-
-However, posters SHOULD avoid using ASCII control characters
-except for tab (ASCII 9), formfeed (ASCII 12), and backspace
-(ASCII 8). Tab signifies sufficient horizontal white space
-to reach the next of a set of fixed positions; posters are
-warned that there is no standard set of positions, so tabs
-should be avoided if precise spacing is essential. Formfeed
-signifies a point at which a reading agent SHOULD pause and
-await reader interaction before displaying further text.
-Backspace SHOULD be used only for underlining, done by a
-sequence of underscores (ASCII 95) followed by an equal num-
-ber of backspaces, signifying that the same number of text
-characters following are to be underlined. Posters are
-warned that underlining is not available on all output
-devices and is best not relied on for essential meaning.
-Reading agents SHOULD recognize underlining and translate it
-to the appropriate commands for devices that support it.
-
- NOTE: Interpretation of almost all control charac-
- ters is device-specific to some degree, and
- devices differ. Tabs and underlining are sup-
- ported, to some extent, by most modern devices and
- reading agents, hence the cautious exemptions for
-
-INTERNET DRAFT to be NEWS sec. 4.4
-
-
- them. The underlining method is specified because
- the inverse method, text and then underscores, is
- tempting to the naive... but if sent unaltered to
- a device that shows only the most recent of sev-
- eral overstruck characters rather than a compos-
- ite, the result can be utterly unreadable.
-
- NOTE: A common interpretation of tab is that it is
- a request to space forward to the next position
- whose number is one more than a multiple of 8,
- with positions numbered sequentially starting at
- 1. (So tab positions are 9, 17, 25, ...) Reading
- agents not constrained by existing system conven-
- tions might wish to use this interpretation.
-
- NOTE: It will typically be necessary for a reading
- agent to catch and interpret formfeed, not just
- send it to the output device. The actions per-
- formed by typical output devices on receiving a
- formfeed are neither adequate for nor appropriate
- to the pause-for-interaction meaning.
-
-Cooperating subnets which wish to employ non-ASCII character
-sets by using escape sequences (employing, e.g., ESC (ASCII
-27), SO (ASCII 14), and SI (ASCII 15)) to alter the meaning
-of superficially-ASCII characters MAY do so, but MUST use
-MIME headers to alert reading agents to the particular char-
-acter set(s) and escape sequences in use. A reading agent
-SHOULD not pass such an escape sequence through, unaltered,
-to the output device unless the agent confirms that the
-sequence is one used to affect character sets and has reason
-to believe that the device is capable of interpreting that
-particular sequence properly.
-
- NOTE: Cooperating-subnet organizers are warned
- that some very old relayers strip certain control
- characters out of articles they pass along. ESC
- is known to be among the affected characters.
-
- NOTE: There are now standard Internet encodings
- for Japanese [rrr] and Vietnamese [rrr] in partic-
- ular.
-
-Articles MUST not contain any octet with value exceeding
-127, i.e. any octet that is not an ASCII character.
-
- NOTE: This rule, like others, may be relaxed by
- unanimous consent of the members of a cooperating
- subnet, provided suitable precautions are taken to
- ensure that rule-violating articles do not leak
- out of the subnet. (This has already been done in
- many areas where ASCII is not adequate for the
- local language(s).) Beware that articles contain-
- ing non-ASCII octets in headers are a violation of
- the MAIL specifications and are not valid MAIL
- messages. MIME offers a way to encode non-ASCII
- characters in ASCII for use in headers; see sec-
- tion 4.5.
-
- NOTE: While there is great interest in using 8-bit
- character sets, not all software can yet handle
- them correctly. Hence the restriction to cooper-
- ating subnets. MIME encodings can be used to
- transmit such characters while remaining within
- the octet restriction.
-
-In anticipation of the day when it is possible to use non-
-ASCII characters safely anywhere, and to provide for the
-(substantial) cooperating subnets that are already using
-them, transmission paths SHOULD treat news articles as unin-
-terpreted sequences of octets (except perhaps for transfor-
-mations between EOL representations) and relayers SHOULD
-treat non-ASCII characters in articles as ordinary charac-
-ters.
-
- NOTE: 8-bit enthusiasts are warned that not all
- software conforms to these recommendations yet.
- In particular, standard NNTP [rrr] is a 7-bit pro-
- tocol, and there may be implementations which
- enforce this rule. Be warned, also, that it will
- never be safe to send raw binary data in the body
- of news articles, because changes of EOL represen-
- tation may (will!) corrupt it.
-
-Except where cooperating subnets permit more direct
-approaches, MIME [rrr] headers and encodings SHOULD be used
-to transmit non-ASCII content using ASCII characters; see
-section 4.5, appendix B, and the MIME RFCs for details. If
-article content can be expressed in ASCII, it SHOULD be.
-Failing that, the order of preference for character sets is
-that described in MIME [rrr].
-
- NOTE: Using the MIME facilities, it is possible to
- transmit ANY character set, and ANY form of binary
- data, using only ASCII characters. Equally impor-
- tant, such articles are self-describing and the
- reading agent can tell which octet-to-symbol map-
- ping is intended! Designation of some preferred
- character sets is intended to minimize the number
- of character sets that a reading agent must under-
- stand in order to display most articles properly.
-
-Articles containing non-ASCII characters, articles using
-ASCII characters (values 0 through 127) to refer to non-
-ASCII symbols, and articles using escape sequences to shift
-character sets SHOULD include MIME headers indicating which
-character set(s) and conventions are being used, and MUST do
-so unless such articles are strictly confined to a
-cooperating subnet which has its own pre-agreed conventions.
-MIME encodings are preferred over all these techniques. If
-it comes to a relayer's attention that it is being asked to
-pass an article using such techniques outward across what it
-knows to be the boundary of such a cooperating subnet, it
-MUST report this error to its administrator, and MAY refuse
-to pass the article beyond the subnet boundary. If it does
-pass the article, it MUST re-encode it with MIME encodings
-to make it conform to this Draft.
-
- NOTE: Such re-encoding is a non-trivial task, due
- to MIME rules such as the prohibition of nested
- encodings. It's not just a matter of pouring the
- body through a simple filter.
-
-Reading agents SHOULD note MIME headers and attempt to show
-the reader the closest possible approximation to the
-intended content. They SHOULD not just send the octets of
-the article to the output device unaltered, unless there is
-reason to believe that the output device will indeed inter-
-pret them correctly. Reading agents MUST not pass ASCII
-control characters or escape sequences, other than as dis-
-cussed above, unaltered to the output device; only by chance
-would the result be the desired one, and there is serious
-potential for harmful side effects, either accidental or
-malicious.
-
- NOTE: Exactly what to do with unwanted control
- characters/sequences depends on the philosophy of
- the reading agent, but passing them straight to
- the output device is almost always wrong. If the
- reading agent wants to mark the presence of such a
- character/sequence in circumstances where only
- ASCII printable characters are available, trans-
- lating it to "#" might be a suitable method; "#"
- is a conspicuous character seldom used in normal
- text.
-
- NOTE: Reading agents should be aware that many old
- output devices (or the transmission paths to them)
- zero out the top bit of octets sent to them. This
- can transform non-ASCII characters into ASCII con-
- trol characters.
-
-Followup agents MUST be careful to apply appropriate trans-
-formations of representation to the outbound followup as
-well as the inbound precursor. A followup to an article
-containing non-ASCII material is very likely to contain non-
-ASCII material itself.
-
-INTERNET DRAFT to be NEWS sec. 4.5
+ Transmission paths for news articles MUST treat news articles as
+ uninterpreted sequences of octets, excluding the values 0 (ASCII NUL)
+ and 13 and 10 (ASCII CR and LF, which MUST ONLY appear in the
+ combination CRLF which denotes a line separator).
+
+ NOTE: This corresponds to the range of octets permitted for
+ MIME "8bit data" [RFC 2045]. Thus raw binary data cannot be
+ transmitted in an article body except by the use of a Content-
+ Transfer-Encoding such as base64.
+
+ An octet, or a sequence of octets, may represent a character in some
+ Coded Character Set (CCS) as determined by some Character Encoding
+ Scheme (CES) [RFC 2130].
+
+ If it comes to a relaying agent's attention that it is being asked to
+ pass an article using the Content-Transfer-Encoding "8bit" to a
+ relaying agent that does not support it, it SHOULD report this error
+ to its administrator. It MUST refuse to pass the article and MUST NOT
+ re-encode it with different MIME encodings.
+
+ NOTE: This strategy will do little harm. The target relaying
+ agent is unlikely to be able to make use of the article on its
+ own servers, and the usual flooding algorithm will likely find
+ some alternative route to get the article to destinations where
+ it is needed.