s-o-1036 June 1994
4.4. Characters And Character Sets
Header and body lines MAY contain any ASCII characters other
than CR (ASCII 13), LF (ASCII 10), and NUL (ASCII 0).
NOTE: CR and LF are excluded because they clash
with common EOL conventions. NUL is excluded
because it clashes with the C end-of-string con-
vention, which is significant to most existing
news software. These three characters are
unlikely to be transmitted successfully.
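NOTE: A minimal sketch of the rule above, assuming each
line has already been split off from its EOL. The name
line_is_acceptable is hypothetical, not taken from any
existing news software.

```python
# Hypothetical helper: checks one header or body line, EOL already removed.
FORBIDDEN = {0, 10, 13}  # NUL, LF, CR


def line_is_acceptable(line: bytes) -> bool:
    """Return True if every octet is ASCII and none is NUL, LF, or CR."""
    return all(octet < 128 and octet not in FORBIDDEN for octet in line)
```

A posting agent could run such a check on every line before
handing the article to a relayer.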
However, posters SHOULD avoid using ASCII control characters
except for tab (ASCII 9), formfeed (ASCII 12), and backspace
(ASCII 8). Tab signifies sufficient horizontal white space
to reach the next of a set of fixed positions; posters are
warned that there is no standard set of positions, so tabs
should be avoided if precise spacing is essential. Formfeed
signifies a point at which a reading agent SHOULD pause and
await reader interaction before displaying further text.
Backspace SHOULD be used only for underlining, done by a
sequence of underscores (ASCII 95) followed by an equal num-
ber of backspaces, signifying that the same number of text
characters following are to be underlined. Posters are
warned that underlining is not available on all output
devices and is best not relied on for essential meaning.
Reading agents SHOULD recognize underlining and translate it
to the appropriate commands for devices that support it.
NOTE: Interpretation of almost all control charac-
ters is device-specific to some degree, and
devices differ. Tabs and underlining are sup-
ported, to some extent, by most modern devices and
reading agents, hence the cautious exemptions for
INTERNET DRAFT to be NEWS sec. 4.4
them. The underlining method is specified because
the inverse method, text and then underscores, is
tempting to the naive... but if sent unaltered to
a device that shows only the most recent of sev-
eral overstruck characters rather than a compos-
ite, the result can be utterly unreadable.
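NOTE: One way a reading agent might recognize the
underlining convention is sketched below. The function
name and the (character, flag) output form are assumptions
made for illustration; translating the flags into device
commands is left out.

```python
import re

# A run of underscores immediately followed by a run of backspaces.
_MARK = re.compile(r"(_+)(\x08+)")


def decode_underlining(text: str) -> list[tuple[str, bool]]:
    """Return (character, is_underlined) pairs for one line of text."""
    out = []
    i = 0
    while i < len(text):
        m = _MARK.match(text, i)
        if m and len(m.group(1)) == len(m.group(2)):
            # N underscores + N backspaces: the next N characters are underlined.
            n = len(m.group(1))
            start = m.end()
            for ch in text[start:start + n]:
                out.append((ch, True))
            i = start + n
        else:
            # Anything else is kept as an ordinary character in this sketch;
            # stray backspaces would still need sanitizing before display.
            out.append((text[i], False))
            i += 1
    return out
```

Because unequal runs fall through one character at a time,
extra leading underscores come out as literal underscores
and the remaining equal runs are still recognized.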
NOTE: A common interpretation of tab is that it is
a request to space forward to the next position
whose number is one more than a multiple of 8,
with positions numbered sequentially starting at
1. (So tab positions are 9, 17, 25, ...) Reading
agents not constrained by existing system conven-
tions might wish to use this interpretation.
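NOTE: The common interpretation just described can be
sketched as follows. The function name expand_tabs and the
width parameter are illustrative assumptions; positions are
numbered from 1 as in the text.

```python
def expand_tabs(line: str, width: int = 8) -> str:
    """Replace each tab with spaces up to the next tab position."""
    out = []
    column = 1  # position of the next character to be written
    for ch in line:
        if ch == "\t":
            # Next position that is one more than a multiple of `width`.
            target = ((column - 1) // width + 1) * width + 1
            out.append(" " * (target - column))
            column = target
        else:
            out.append(ch)
            column += 1
    return "".join(out)
```

For lines without other control characters this gives the
same result as Python's built-in str.expandtabs(8).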
NOTE: It will typically be necessary for a reading
agent to catch and interpret formfeed, not just
send it to the output device. The actions per-
formed by typical output devices on receiving a
formfeed are neither adequate for nor appropriate
to the pause-for-interaction meaning.
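NOTE: A minimal sketch of catching formfeed in the reading
agent rather than passing it to the device. The function
name show, the "--More--" prompt, and the write/wait
callables are all assumptions chosen for illustration.

```python
import sys


def show(body: str, write=sys.stdout.write, wait=input) -> None:
    """Display body text, pausing for the reader at each formfeed."""
    chunks = body.split("\f")
    for n, chunk in enumerate(chunks):
        if n > 0:
            # A formfeed was here: await reader interaction first.
            wait("--More--")
        write(chunk)
```

Passing the output and pause actions in as parameters lets
the same logic drive a terminal, a window system, or a test
harness.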
Cooperating subnets which wish to employ non-ASCII character
sets by using escape sequences (employing, e.g., ESC (ASCII
27), SO (ASCII 14), and SI (ASCII 15)) to alter the meaning
of superficially-ASCII characters MAY do so, but MUST use
MIME headers to alert reading agents to the particular char-
acter set(s) and escape sequences in use. A reading agent
SHOULD NOT pass such an escape sequence through, unaltered,
to the output device unless the agent confirms that the
sequence is one used to affect character sets and has reason
to believe that the device is capable of interpreting that
particular sequence properly.
NOTE: Cooperating-subnet organizers are warned
that some very old relayers strip certain control
characters out of articles they pass along. ESC
is known to be among the affected characters.
NOTE: There are now standard Internet encodings
for Japanese [rrr] and Vietnamese [rrr] in partic-
ular.
Articles MUST NOT contain any octet with value exceeding
127, i.e. any octet that is not an ASCII character.
NOTE: This rule, like others, may be relaxed by
unanimous consent of the members of a cooperating
subnet, provided suitable precautions are taken to
ensure that rule-violating articles do not leak
out of the subnet. (This has already been done in
many areas where ASCII is not adequate for the
local language(s).) Beware that articles contain-
ing non-ASCII octets in headers are a violation of
the MAIL specifications and are not valid MAIL
messages. MIME offers a way to encode non-ASCII
characters in ASCII for use in headers; see sec-
tion 4.5.
NOTE: While there is great interest in using 8-bit
character sets, not all software can yet handle
them correctly. Hence the restriction to cooper-
ating subnets. MIME encodings can be used to
transmit such characters while remaining within
the octet restriction.
In anticipation of the day when it is possible to use non-
ASCII characters safely anywhere, and to provide for the
(substantial) cooperating subnets that are already using
them, transmission paths SHOULD treat news articles as unin-
terpreted sequences of octets (except perhaps for transfor-
mations between EOL representations) and relayers SHOULD
treat non-ASCII characters in articles as ordinary charac-
ters.
NOTE: 8-bit enthusiasts are warned that not all
software conforms to these recommendations yet.
In particular, standard NNTP [rrr] is a 7-bit pro-
tocol, and there may be implementations which
enforce this rule. Be warned, also, that it will
never be safe to send raw binary data in the body
of news articles, because changes of EOL represen-
tation may (will!) corrupt it.
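NOTE: The EOL hazard can be illustrated with a sketch that
assumes a host storing LF-terminated lines receives data
from a CRLF path and later forwards it. The two conversion
functions are hypothetical and deliberately naive.

```python
import base64


def crlf_to_lf(data: bytes) -> bytes:
    """Naive conversion from CRLF line endings to LF (assumed local form)."""
    return data.replace(b"\r\n", b"\n")


def lf_to_crlf(data: bytes) -> bytes:
    """Naive conversion from LF line endings back to CRLF."""
    return data.replace(b"\n", b"\r\n")


# Raw binary data that happens to contain both a CRLF pair and a bare LF.
raw = b"\x01\r\n\x02\n"
round_trip = lf_to_crlf(crlf_to_lf(raw))
corrupted = round_trip != raw  # the bare LF came back as CRLF

# The same data in base64 contains no CR or LF, so it survives the trip.
encoded = base64.b64encode(raw)
survived = lf_to_crlf(crlf_to_lf(encoded))
recovered = base64.b64decode(survived)
```

After the round trip the two originally distinct line
endings can no longer be told apart, which is why an
ASCII-only encoding of binary data is needed.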
Except where cooperating subnets permit more direct
approaches, MIME [rrr] headers and encodings SHOULD be used
to transmit non-ASCII content using ASCII characters; see
section 4.5, appendix B, and the MIME RFCs for details. If
article content can be expressed in ASCII, it SHOULD be.
Failing that, the order of preference for character sets is
that described in MIME [rrr].
NOTE: Using the MIME facilities, it is possible to
transmit ANY character set, and ANY form of binary
data, using only ASCII characters. Equally impor-
tant, such articles are self-describing and the
reading agent can tell which octet-to-symbol map-
ping is intended! Designation of some preferred
character sets is intended to minimize the number
of character sets that a reading agent must under-
stand in order to display most articles properly.
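NOTE: As a hedged illustration, the sketch below uses
Python's standard email and quopri modules to put
ISO 8859-1 text into a header and a body using only ASCII
octets. The sample text and the choice of charset are
assumptions; this Draft does not mandate either.

```python
import quopri
from email.header import Header, decode_header

# Sample text with non-ASCII letters, written with escapes so that this
# source itself stays ASCII.
text = "Gr\xfc\xdfe aus K\xf6ln"
charset = "iso-8859-1"  # assumed charset for this example

# Header value as a MIME encoded-word (ASCII only).
subject = Header(text, charset).encode()

# Body in quoted-printable, plus the headers that describe it.
body = quopri.encodestring(text.encode(charset))
mime_headers = [
    "MIME-Version: 1.0",
    "Content-Type: text/plain; charset=" + charset,
    "Content-Transfer-Encoding: quoted-printable",
]

# A reading agent can recover the original text from the labels alone.
raw_subject, subject_charset = decode_header(subject)[0]
decoded_subject = raw_subject.decode(subject_charset)
decoded_body = quopri.decodestring(body).decode(charset)
```

The charset label travels with the article, so the reading
agent does not have to guess the octet-to-symbol mapping.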
Articles containing non-ASCII characters, articles using
ASCII characters (values 0 through 127) to refer to non-
ASCII symbols, and articles using escape sequences to shift
character sets SHOULD include MIME headers indicating which
character set(s) and conventions are being used, and MUST do
so unless such articles are strictly confined to a
cooperating subnet which has its own pre-agreed conventions.
MIME encodings are preferred over all these techniques. If
it comes to a relayer's attention that it is being asked to
pass an article using such techniques outward across what it
knows to be the boundary of such a cooperating subnet, it
MUST report this error to its administrator, and MAY refuse
to pass the article beyond the subnet boundary. If it does
pass the article, it MUST re-encode it with MIME encodings
to make it conform to this Draft.
NOTE: Such re-encoding is a non-trivial task, due
to MIME rules such as the prohibition of nested
encodings. It's not just a matter of pouring the
body through a simple filter.
Reading agents SHOULD note MIME headers and attempt to show
the reader the closest possible approximation to the
intended content. They SHOULD NOT just send the octets of
the article to the output device unaltered, unless there is
reason to believe that the output device will indeed interpret
them correctly. Reading agents MUST NOT pass ASCII
control characters or escape sequences, other than as dis-
cussed above, unaltered to the output device; only by chance
would the result be the desired one, and there is serious
potential for harmful side effects, either accidental or
malicious.
NOTE: Exactly what to do with unwanted control
characters/sequences depends on the philosophy of
the reading agent, but passing them straight to
the output device is almost always wrong. If the
reading agent wants to mark the presence of such a
character/sequence in circumstances where only
ASCII printable characters are available, trans-
lating it to "#" might be a suitable method; "#"
is a conspicuous character seldom used in normal
text.
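NOTE: The "#" translation might be sketched as below. The
function name sanitize is hypothetical. The sketch assumes
that tab, formfeed, and backspace are interpreted by other
parts of the reading agent, that newline is the local line
separator, and that DEL (ASCII 127) counts as a control
character; non-ASCII characters are left alone here.

```python
# Characters assumed to be handled elsewhere in the reading agent.
PASS_THROUGH = {"\t", "\f", "\b", "\n"}


def sanitize(text: str, marker: str = "#") -> str:
    """Replace unwanted ASCII control characters with a visible marker."""
    out = []
    for ch in text:
        code = ord(ch)
        is_control = code < 32 or code == 127
        if is_control and ch not in PASS_THROUGH:
            out.append(marker)
        else:
            out.append(ch)
    return "".join(out)
```

With this filter an embedded ESC (ASCII 27) reaches the
device as "#", so the rest of an escape sequence is shown
as harmless printable text.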
NOTE: Reading agents should be aware that many old
output devices (or the transmission paths to them)
zero out the top bit of octets sent to them. This
can transform non-ASCII characters into ASCII con-
trol characters.
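NOTE: The effect of losing the top bit is simple to
demonstrate. Both function names below are hypothetical,
and DEL (ASCII 127) is assumed to count as a control
character.

```python
def strip_top_bit(data: bytes) -> bytes:
    """Simulate a device or path that zeroes the top bit of every octet."""
    return bytes(octet & 0x7F for octet in data)


def becomes_control(octet: int) -> bool:
    """True if a non-ASCII octet turns into an ASCII control character."""
    if octet < 128:
        return False  # already ASCII; nothing changes
    low = octet & 0x7F
    return low < 32 or low == 127
```

For example, octet 155 (hex 9B) becomes ESC (ASCII 27), so
a reading agent that filters control characters before any
such path would not catch it.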
Followup agents MUST be careful to apply appropriate trans-
formations of representation to the outbound followup as
well as the inbound precursor. A followup to an article
containing non-ASCII material is very likely to contain non-
ASCII material itself.