s-o-1036 June 1994

4.4. Characters And Character Sets

Header and body lines MAY contain any ASCII characters other
than CR (ASCII 13), LF (ASCII 10), and NUL (ASCII 0).

     NOTE:  CR  and  LF are excluded because they clash
     with common  EOL  conventions.   NUL  is  excluded
     because  it  clashes with the C end-of-string con-
     vention, which is  significant  to  most  existing
     news   software.    These   three  characters  are
     unlikely to be transmitted successfully.
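
The restriction above can be checked mechanically.  The
following is a minimal sketch in Python, not part of this
Draft; the name find_forbidden_octets is hypothetical.  It
assumes each line has already had its EOL marker removed, so
any remaining CR or LF is a violation.

```python
# Octets that may not appear inside a header or body line.
FORBIDDEN = {0: "NUL", 10: "LF", 13: "CR"}

def find_forbidden_octets(line: bytes):
    """Return (offset, name) for each forbidden octet in one line.

    `line` is assumed to be a single header or body line whose
    end-of-line marker has already been stripped.
    """
    return [(i, FORBIDDEN[b]) for i, b in enumerate(line) if b in FORBIDDEN]
```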

However, posters SHOULD avoid using ASCII control characters
except for tab (ASCII 9), formfeed (ASCII 12), and backspace
(ASCII 8).  Tab signifies sufficient horizontal white  space
to  reach  the next of a set of fixed positions; posters are
warned that there is no standard set of positions,  so  tabs
should be avoided if precise spacing is essential.  Formfeed
signifies a point at which a reading agent SHOULD pause  and
await  reader  interaction  before  displaying further text.
Backspace SHOULD be used only for  underlining,  done  by  a
sequence of underscores (ASCII 95) followed by an equal num-
ber of backspaces, signifying that the same number  of  text
characters  following  are  to  be  underlined.  Posters are
warned that underlining  is  not  available  on  all  output
devices  and  is  best  not relied on for essential meaning.
Reading agents SHOULD recognize underlining and translate it
to the appropriate commands for devices that support it.

     NOTE: Interpretation of almost all control charac-
     ters  is  device-specific  to  some  degree,   and
     devices  differ.   Tabs  and  underlining are sup-
     ported, to some extent, by most modern devices and
     reading  agents, hence the cautious exemptions for

INTERNET DRAFT to be        NEWS                    sec. 4.4


     them.  The underlining method is specified because
     the  inverse method, text and then underscores, is
     tempting to the naive... but if sent unaltered  to
     a  device  that shows only the most recent of sev-
     eral overstruck characters rather than  a  compos-
     ite, the result can be utterly unreadable.
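
One way a reading agent might recognize the underlining
convention is sketched below in Python.  This is an
illustration only, not part of this Draft; the name
split_underlined is hypothetical.  It assumes that a run of
N underscores followed by exactly N backspaces marks the
next N characters as underlined, and it leaves any other
combination untouched for later handling.

```python
import re

# A run of underscores followed by a run of backspaces (ASCII 8).
_UNDERLINE = re.compile(r"(_+)(\x08+)")

def split_underlined(text: str):
    """Split text into (segment, is_underlined) pairs."""
    out = []
    pos = 0
    for m in _UNDERLINE.finditer(text):
        n = len(m.group(1))
        if len(m.group(2)) != n:
            continue  # counts differ: not the specified form, leave as-is
        if m.start() < pos:
            continue  # falls inside text already marked as underlined
        if m.start() > pos:
            out.append((text[pos:m.start()], False))
        # The N characters after the backspaces are the underlined text.
        out.append((text[m.end():m.end() + n], True))
        pos = m.end() + n
    if pos < len(text):
        out.append((text[pos:], False))
    return out
```

The agent can then render each underlined segment with
whatever mechanism the output device offers, or show it
plain if the device has none.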

     NOTE: A common interpretation of tab is that it is
     a request to space forward to  the  next  position
     whose  number  is  one  more than a multiple of 8,
     with positions numbered sequentially  starting  at
     1.  (So tab positions are 9, 17, 25, ...)  Reading
     agents not constrained by existing system  conven-
     tions might wish to use this interpretation.
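
The common interpretation described in the preceding NOTE
can be sketched in Python as follows.  This is illustrative
only; the name expand_tabs is hypothetical, and the sketch
assumes every non-tab character occupies exactly one
position.

```python
def expand_tabs(line: str, width: int = 8) -> str:
    """Replace each tab with spaces up to the next tab position.

    Positions are numbered from 1, and tab positions are those
    that are one more than a multiple of `width` (9, 17, 25, ...).
    """
    out = []
    col = 1  # position of the next character to be written
    for ch in line:
        if ch == "\t":
            nxt = ((col - 1) // width + 1) * width + 1
            out.append(" " * (nxt - col))
            col = nxt
        else:
            out.append(ch)
            col += 1
    return "".join(out)
```

For lines without other control characters this gives the
same result as Python's built-in str.expandtabs(8).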

     NOTE: It will typically be necessary for a reading
     agent to catch and interpret  formfeed,  not  just
     send  it  to  the output device.  The actions per-
     formed by typical output devices  on  receiving  a
     formfeed  are neither adequate for nor appropriate
     to the pause-for-interaction meaning.
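
A reading agent might catch formfeed along the following
lines.  This Python sketch is an illustration, not part of
this Draft; show_with_pauses and its emit and wait
parameters are hypothetical names standing for the agent's
own display and "wait for the reader" routines.

```python
from typing import Callable

def show_with_pauses(body: str,
                     emit: Callable[[str], None],
                     wait: Callable[[], None]) -> None:
    """Display `body`, pausing for the reader at each formfeed.

    The formfeed itself (ASCII 12) is consumed by the agent and
    never reaches the output device.
    """
    for i, chunk in enumerate(body.split("\f")):
        if i > 0:
            wait()  # reader interaction before showing further text
        emit(chunk)
```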

Cooperating subnets which wish to employ non-ASCII character
sets  by using escape sequences (employing, e.g., ESC (ASCII
27), SO (ASCII 14), and SI (ASCII 15)) to alter the  meaning
of  superficially-ASCII  characters  MAY do so, but MUST use
MIME headers to alert reading agents to the particular char-
acter  set(s)  and escape sequences in use.  A reading agent
SHOULD NOT pass such an escape sequence through,  unaltered,
to  the  output  device  unless  the agent confirms that the
sequence is one used to affect character sets and has reason
to  believe  that the device is capable of interpreting that
particular sequence properly.

     NOTE:  Cooperating-subnet  organizers  are  warned
     that  some very old relayers strip certain control
     characters out of articles they pass  along.   ESC
     is known to be among the affected characters.

     NOTE:  There  are  now standard Internet encodings
     for Japanese [rrr] and Vietnamese [rrr] in partic-
     ular.

Articles  MUST  NOT  contain  any octet with value exceeding
127, i.e. any octet that is not an ASCII character.

     NOTE: This rule, like others, may  be  relaxed  by
     unanimous  consent of the members of a cooperating
     subnet, provided suitable precautions are taken to
     ensure  that  rule-violating  articles do not leak
     out of the subnet.  (This has already been done in
     many  areas  where  ASCII  is not adequate for the
     local language(s).)  Beware that articles contain-
     ing non-ASCII octets in headers are a violation of
     the MAIL specifications and  are  not  valid  MAIL
     messages.   MIME  offers a way to encode non-ASCII
     characters in ASCII for use in headers;  see  sec-
     tion 4.5.

     NOTE: While there is great interest in using 8-bit
     character sets, not all software  can  yet  handle
     them  correctly.  Hence the restriction to cooper-
     ating subnets.  MIME  encodings  can  be  used  to
     transmit  such  characters  while remaining within
     the octet restriction.
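
As an illustration of the octet rule and of how an encoding
can stay within it, the Python sketch below tests a body for
octets above 127 and applies quoted-printable, one of the
MIME encodings, using the standard quopri module.  The name
is_seven_bit and the ISO 8859-1 sample text are assumptions
made for the example; a real article would also need the
corresponding MIME headers.

```python
import quopri

def is_seven_bit(data: bytes) -> bool:
    """True if no octet in `data` has a value exceeding 127."""
    return all(b <= 127 for b in data)

# Sample 8-bit text: "cafe" with an e-acute (octet 0xE9 in ISO 8859-1).
raw = b"caf\xe9"
encoded = quopri.encodestring(raw)      # ASCII-only representation
decoded = quopri.decodestring(encoded)  # recovers the original octets
```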

In anticipation of the day when it is possible to  use  non-
ASCII  characters  safely  anywhere,  and to provide for the
(substantial) cooperating subnets  that  are  already  using
them, transmission paths SHOULD treat news articles as unin-
terpreted sequences of octets (except perhaps for  transfor-
mations  between  EOL  representations)  and relayers SHOULD
treat non-ASCII characters in articles as  ordinary  charac-
ters.

     NOTE:  8-bit  enthusiasts  are warned that not all
     software conforms to  these  recommendations  yet.
     In particular, standard NNTP [rrr] is a 7-bit pro-
     tocol, and  there  may  be  implementations  which
     enforce  this rule.  Be warned, also, that it will
     never be safe to send raw binary data in the  body
     of news articles, because changes of EOL represen-
     tation may (will!) corrupt it.
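
The hazard to raw binary data can be seen in a small Python
sketch.  The functions to_lf and to_crlf are hypothetical
stand-ins for the EOL transformations a transmission path
might apply; the sample octets are arbitrary.

```python
def to_lf(data: bytes) -> bytes:
    """Convert CR LF line endings to bare LF."""
    return data.replace(b"\r\n", b"\n")

def to_crlf(data: bytes) -> bytes:
    """Convert bare LF line endings to CR LF."""
    return data.replace(b"\n", b"\r\n")

# Binary data that happens to contain both a bare LF and a CR LF pair.
binary = b"\x01\n\x02\r\n\x03"
round_trip = to_crlf(to_lf(binary))
# The bare LF has become CR LF, so the data no longer matches.
```

Text survives such a round trip because its line breaks are
all of one kind; arbitrary octets carry no such guarantee.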

Except  where  cooperating  subnets   permit   more   direct
approaches,  MIME [rrr] headers and encodings SHOULD be used
to transmit non-ASCII content using  ASCII  characters;  see
section  4.5, appendix B, and the MIME RFCs for details.  If
article content can be expressed in  ASCII,  it  SHOULD  be.
Failing  that, the order of preference for character sets is
that described in MIME [rrr].

     NOTE: Using the MIME facilities, it is possible to
     transmit ANY character set, and ANY form of binary
     data, using only ASCII characters.  Equally impor-
     tant,  such  articles  are self-describing and the
     reading agent can tell which octet-to-symbol  map-
     ping  is  intended!  Designation of some preferred
     character sets is intended to minimize the  number
     of character sets that a reading agent must under-
     stand in order to display most articles  properly.

Articles  containing  non-ASCII  characters,  articles using
ASCII characters (values 0 through 127)  to  refer  to  non-
ASCII  symbols, and articles using escape sequences to shift
character sets SHOULD include MIME headers indicating  which
character set(s) and conventions are being used, and MUST do
so  unless  such  articles  are  strictly  confined   to   a
cooperating subnet which has its own pre-agreed conventions.
MIME encodings are preferred over all these techniques.   If
it  comes to a relayer's attention that it is being asked to
pass an article using such techniques outward across what it
knows  to  be  the boundary of such a cooperating subnet, it
MUST report this error to its administrator, and MAY  refuse
to  pass the article beyond the subnet boundary.  If it does
pass the article, it MUST re-encode it with  MIME  encodings
to make it conform to this Draft.

     NOTE:  Such re-encoding is a non-trivial task, due
     to MIME rules such as the  prohibition  of  nested
     encodings.   It's not just a matter of pouring the
     body through a simple filter.

Reading agents SHOULD note MIME headers and attempt to  show
the   reader  the  closest  possible  approximation  to  the
intended content.  They SHOULD NOT just send the octets of
the article to the output device unaltered, unless there is
reason to believe that the output device will indeed
interpret them correctly.  Reading agents MUST NOT pass ASCII
control characters or escape sequences, other than  as  dis-
cussed above, unaltered to the output device; only by chance
would the result be the desired one, and  there  is  serious
potential  for  harmful  side  effects, either accidental or
malicious.

     NOTE: Exactly what to  do  with  unwanted  control
     characters/sequences  depends on the philosophy of
     the reading agent, but passing  them  straight  to
     the  output device is almost always wrong.  If the
     reading agent wants to mark the presence of such a
     character/sequence  in  circumstances  where  only
     ASCII printable characters are  available,  trans-
     lating  it  to "#" might be a suitable method; "#"
     is a conspicuous character seldom used  in  normal
     text.
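
The "#" translation suggested above might look like the
following Python sketch.  It is one possible approach, not
a requirement of this Draft; the name mark_controls is
hypothetical.  It assumes that underlining and formfeeds
have already been interpreted, so that only tab and the
line separator remain acceptable.

```python
# Control characters the agent still passes through at this stage.
KEEP = {"\t", "\n"}

def mark_controls(text: str, marker: str = "#") -> str:
    """Replace each unwanted ASCII control character with `marker`.

    Covers ASCII 0 through 31 and DEL (ASCII 127), except for the
    characters listed in KEEP.
    """
    return "".join(
        marker if (ord(ch) < 32 or ord(ch) == 127) and ch not in KEEP else ch
        for ch in text
    )
```

Note that only the control character itself is replaced; the
printable remainder of an escape sequence is left visible.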

     NOTE: Reading agents should be aware that many old
     output devices (or the transmission paths to them)
     zero out the top bit of octets sent to them.  This
     can transform non-ASCII characters into ASCII con-
     trol characters.
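
The effect described in the preceding NOTE is easy to
reproduce.  In this Python sketch, strip_top_bit is a
hypothetical model of such a device or path; it shows the
octet 0x9B turning into ESC (ASCII 27).

```python
def strip_top_bit(data: bytes) -> bytes:
    """Model a 7-bit path: clear the top bit of every octet."""
    return bytes(b & 0x7F for b in data)

# 0x9B is not an ASCII character, but 0x9B & 0x7F is 0x1B, i.e. ESC.
result = strip_top_bit(b"\x9b")
```

A reading agent that filters control characters only before
such a path, and not with it in mind, can therefore be
bypassed.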

Followup  agents MUST be careful to apply appropriate trans-
formations of representation to  the  outbound  followup  as
well  as  the  inbound  precursor.  A followup to an article
containing non-ASCII material is very likely to contain non-
ASCII material itself.


Documents were processed to this format by Forrest J. Cavalier III