usefor-article-08 August 2002
5.5. Newsgroups
The Newsgroups-header's content specifies the newsgroup(s) in which
the article is intended to appear. It is an inheritable header
(4.2.5.2) which then becomes the default Newsgroups-header of any
followup, unless a Followup-To-header is present to prescribe
otherwise. Articles MUST NOT be passed between relaying agents or to
serving agents unless the sending agent has been configured to supply
and the receiving agent to receive at least one of the newsgroup-
names in the Newsgroups-header.
In order to allow newsgroup-names containing non-ASCII characters,
this section relies heavily on the provisions of the Unicode
Standard. All references to "Unicode" mean [UNICODE 3.2] or any
standard that supersedes it. That document contains guarantees of
strict future upwards compatibility (e.g. no character will be
removed or change classification). Implementors should be aware that
currently unassigned code points (Unicode category Cn) may become
valid characters in future versions of Unicode. Since the poster of
an article might have access to a newer version of that standard,
relaying and serving agents MUST accept such characters, but posting
agents (and indeed all agents) MUST NOT generate them (though they
might well follow up to newsgroup-names containing them).
   header              =/ Newsgroups-header
   Newsgroups-header   = "Newsgroups" ":" SP Newsgroups-content
                            *( ";" other-parameter )
   Newsgroups-content  = [FWS] newsgroup-name
                            *( [FWS] ng-delim [FWS] newsgroup-name )
                            [FWS]
   newsgroup-name      = component *( "." component )
   component           = 1*component-grapheme
   ng-delim            = ","
   component-grapheme  = combiner-base *combiner-mark
   combiner-base       = combiner-ASCII / combiner-extended
   combiner-ASCII      = DIGIT / ALPHA / "+" / "-" / "_"
   combiner-extended   = <any character with a Unicode code value
                            of 0080 or greater but excluding any
                            character in Unicode categories
                            Cc, Cf, Cs, M* and Z*>
   combiner-mark       = <any character with a Unicode code value of
                            0080 or greater and in Unicode category M*>
NOTE: the excluded characters in a combiner-extended are control
characters (Cc), format control characters (Cf), surrogates
(Cs), marks (M*) and separators (Z*). In particular, this
excludes all whitespace characters. To all intents and
purposes, a component-grapheme is what a user might regard as a
single "character" as displayed on his screen, though it might
be transmitted as several actual characters (e.g. q-circumflex
is two characters). Note also that, in some writing schemes,
several component-graphemes will merge into one visible object
of variable size.
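As an illustration only, the syntax above can be sketched in Python
using the standard unicodedata module. The function names here are
hypothetical and not part of this standard. Note that unicodedata
reflects whichever Unicode version ships with the interpreter (the
module also offers unicodedata.ucd_3_2_0 for the 3.2 tables), so
results for recently assigned characters may differ between
installations.

```python
import unicodedata

# combiner-ASCII = DIGIT / ALPHA / "+" / "-" / "_"
ASCII_BASE = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789+-_"
)

def is_combiner_mark(ch):
    # combiner-mark: code value 0080 or greater and category M*
    return ord(ch) >= 0x80 and unicodedata.category(ch).startswith("M")

def is_combiner_base(ch):
    if ord(ch) < 0x80:
        return ch in ASCII_BASE
    cat = unicodedata.category(ch)
    # combiner-extended: excludes Cc, Cf, Cs, M* and Z*
    return not (cat in ("Cc", "Cf", "Cs") or cat[0] in "MZ")

def split_graphemes(component):
    """Return the component-graphemes of one component, or None if the
    component does not match the syntax (hypothetical helper)."""
    graphemes = []
    for ch in component:
        if is_combiner_base(ch):
            graphemes.append(ch)
        elif is_combiner_mark(ch) and graphemes:
            graphemes[-1] += ch      # a mark attaches to the preceding base
        else:
            return None
    return graphemes or None         # an empty component is invalid

def parse_newsgroup_name(name):
    """Return a list of grapheme lists, one per component, or None."""
    parts = [split_graphemes(c) for c in name.split(".")]
    return None if any(p is None for p in parts) else parts
```

With this sketch, a "q" followed by U+0302 (combining circumflex) is
reported as one component-grapheme made of two characters, whereas a
component that starts with a mark, is empty, or contains whitespace is
rejected.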
Each component MUST be invariant under Unicode normalization NFKC
(cf. the weaker normalization requirement for other headers in
section 4.4.1 which specified no more than normalization NFC, and see
also the explanatory NOTE in that section).
NOTE: As a result of this restriction, a name has only one
valid form. Implementations can assume that a straight (case
sensitive) comparison of characters or octets is sufficient to
compare two newsgroup-names.
The requirement that names be invariant under NFKC, rather than
NFC, means that all characters with a "compatibility
decomposition" are forbidden (Unicode provides the property
"NFKC_NO" to make this test easier). The effect is to exclude
variant forms of characters, such as superscripts and
subscripts, wide and narrow forms, font variants, encircled
forms, ligatures, and so on, as their use could cause confusion.
There is insufficient experience in this area to determine
whether this is the right long-term solution. Implementors
should therefore be aware that a future version of this standard
might reduce the requirement in the direction of NFC as opposed
to NFKC.
NOTE: An implementation is not required to apply NFKC, or any
other normalization, to newsgroup-names. Only agencies that
create new groups need to be careful to obey this restriction
(7.2.1). However, if a posting agent neglects to normalize a
newsgroup-name entered manually, this may lead to the user
posting to a non-existent group without understanding why.
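A minimal sketch of the NFKC invariance test and of the straight
comparison it permits, assuming Python's unicodedata module (whose
tables may be newer than Unicode 3.2). The function names are
illustrative assumptions, not part of this standard.

```python
import unicodedata

def non_nfkc_components(name):
    """Return the components of name that change under NFKC, i.e. those
    that are not invariant under that normalization."""
    return [c for c in name.split(".")
            if unicodedata.normalize("NFKC", c) != c]

def same_newsgroup(a, b):
    # Straight, case-sensitive comparison of the UTF-8 octets.
    return a.encode("utf-8") == b.encode("utf-8")
```

For example, a component containing U+00B2 (superscript two) or a
decomposed "e" followed by U+0301 is reported as non-invariant, while
the precomposed U+00E9 is not. A posting agent could use such a test
to warn a user who typed an unnormalized name before the article is
sent to a group that does not exist.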
Newsgroup-names containing non-ASCII characters MUST be encoded in
UTF-8 and not according to [RFC 2047].
Components beginning with underline ("_") are reserved for use by
future versions of this standard and MUST NOT occur in newsgroup-
names (whether in Newsgroups-headers or in newgroup control messages
(7.2.1)). However, such names MUST be accepted.
Components beginning with "+" or "-" are reserved for use by
implementations and MUST NOT occur in newsgroup-names (whether in
Newsgroups-headers or in newgroup control messages). Implementors may
assume that this rule will not change in any future version of this
standard.
NOTE: For example, implementors may safely use leading "+" and
"-" to "escape" other entities within something that looks like
a newsgroup-name.
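The two reservations above might be detected as follows. This is a
sketch only; the function name and the wording of the reasons are
assumptions made for illustration.

```python
def reserved_components(name):
    """Return (component, reason) pairs for components whose first
    character is reserved."""
    found = []
    for comp in name.split("."):
        if comp.startswith("_"):
            found.append(
                (comp, "reserved for future versions of the standard"))
        elif comp.startswith(("+", "-")):
            found.append((comp, "reserved for implementations"))
    return found
```

Only the first character of a component matters here, so a name such
as comp.lang.c++ is unaffected.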
Agencies responsible for the administration of particular hierarchies
Ought to place additional restrictions on the characters they allow
in newsgroup-names within those hierarchies (such as to accord with
the languages commonly used within those hierarchies, or to avoid
perceived ambiguities pertinent to those languages). Where there is
no such specific policy, the following restrictions SHOULD be applied
to newsgroup-names.
NOTE: These restrictions are intended to reflect existing
practice, with some additions to accommodate foreseeable
enhancements, and are intended both to avoid certain technical
difficulties and to avoid unnecessary confusion. It may well be
that experience will allow future extensions to this standard to
relax some or all of these restrictions.
The specific restrictions (to be applied in the absence of
established policies to the contrary) are:
1. The following characters are forbidden, subject to the comments
and notes at the end of the list:
characters in category Cn (Other, Not assigned) [1]
characters in category Co (Other, Private Use) [2]
characters in category Lt (Letter, Titlecase) [3]
characters in category Lu (Letter, Uppercase) [3]
characters in category Me (Mark, Enclosing) [4]
characters in category Pd (Punctuation, Dash) [4][5]
characters in category Pe (Punctuation, Close) [4]
characters in category Pf (Punctuation, Final quote) [4]
characters in category Pi (Punctuation, Initial quote) [4]
characters in category Po (Punctuation, Other) [4]
characters in category Ps (Punctuation, Open) [4]
characters in category Sc (Symbol, Currency) [4]
characters in category Sk (Symbol, Modifier) [4]
characters in category Sm (Symbol, Math) [4][5]
characters in category So (Symbol, Other) [4]
[1] As new characters are added to Unicode, their code points move
from category Cn to some other category. As stated above,
implementors should be prepared for this.
[2] Specific private use characters can be used within a hierarchy
or co-operating subnet that has agreed meanings for them.
[3] Traditionally, newsgroup-names have been written in lowercase.
Posting agents Ought Not to convert uppercase or titlecase
characters to the corresponding lowercase forms except under
the explicit instructions of the poster.
[4] Traditionally newsgroup-names have only used letters, digits,
and the three special characters "+", "-" and "_". These
categories correspond to characters outside that set.
[5] Although the characters "+" and "-" are within categories Pd
and Sm, they are not forbidden.
2. A component name is forbidden to consist entirely of digits.
NOTE: This requirement was in [RFC 1036] but nevertheless
several such groups have appeared in practice and implementors
should be prepared for them. A common implementation technique
uses each component as the name of a directory and uses numeric
filenames for each article within a group. Such an
implementation needs to be careful when this could cause a clash
(e.g. between article 123 of group xxx.yyy and the directory for
group xxx.yyy.123).
3. A component is limited to 30 component-graphemes and a newsgroup-
name to 71 component-graphemes (counting also the '.'s separating
the components). Whilst there is no longer any technical reason to
limit the length of a component (formerly, it was limited to 14
octets) nor of a newsgroup-name, it should be noted that these
names are also used in the newsgroups-line (7.2.1.2) where an
overall policy limit applies and, moreover, excessively long names
can be exceedingly inconvenient in practical use.
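The three default policy restrictions could be checked along the
following lines. This is a hedged sketch: the function names and
message texts are hypothetical, "digits" is assumed to mean the ASCII
digits 0-9, the name is assumed to be syntactically valid already, and
the category data comes from the interpreter's own Unicode tables
rather than from any particular version of the Unicode Standard.

```python
import unicodedata

FORBIDDEN_CATEGORIES = {
    "Cn", "Co", "Lt", "Lu", "Me", "Pd", "Pe", "Pf",
    "Pi", "Po", "Ps", "Sc", "Sk", "Sm", "So",
}
# "+" (Sm) and "-" (Pd) are explicitly not forbidden; "." (Po) is the
# component separator rather than part of a component.
EXEMPT = set("+-.")

def grapheme_count(text):
    # Each component-grapheme has exactly one base character, so counting
    # the characters that are not marks (M*) counts the graphemes.
    return sum(1 for ch in text
               if not unicodedata.category(ch).startswith("M"))

def policy_problems(name):
    """Return a list of human-readable policy violations for name."""
    problems = []
    for ch in name:
        cat = unicodedata.category(ch)
        if ch not in EXEMPT and cat in FORBIDDEN_CATEGORIES:
            problems.append(f"forbidden character {ch!r} ({cat})")
    for comp in name.split("."):
        if comp and all(ch in "0123456789" for ch in comp):
            problems.append(f"all-digit component {comp!r}")
        if grapheme_count(comp) > 30:
            problems.append(f"component {comp!r} exceeds 30 graphemes")
    # The overall limit counts the "." separators as well.
    if grapheme_count(name) > 71:
        problems.append("name exceeds 71 graphemes")
    return problems
```

A hierarchy with its own policy would substitute its own tables for
FORBIDDEN_CATEGORIES and its own limits for 30 and 71.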
Serving and relaying agents MUST accept any newsgroup-name that meets
the above requirements, even if it violates one or more of the
policy restrictions. Posting and injecting agents MAY reject articles
containing newsgroup-names that do not meet these restrictions, and
posting agents MAY attempt to correct them (but only with the
explicit agreement of the poster for anything more than NFC or NFKC
normalization). However, because of the large and changing tables
required to do these checks and corrections throughout the whole of
Unicode, this standard does not require them to do so. Rather, the
onus is placed on those who create new newsgroups (7.2.1) to check
the mandatory requirements, to consider the effects of relaxing the
other restrictions, and to consider how all this may affect
propagation of the group.
Since future extensions to this standard and the Unicode standard,
including a possible relaxation of the NFKC normalization, plus any
relaxations of the default restrictions introduced by specific
hierarchies might invalidate some such checks, warnings, and
adjustments, implementations MUST incorporate means to disable them.
NOTE: The newsgroup-name as encoded in UTF-8 should be regarded as
the canonical form. Reading agents may convert it to whatever
character set they are able to display and serving agents may
possibly need to convert it to some form more suitable as a
filename. Simple algorithms for both kinds of conversion are
readily available. Observe that the syntax does not allow
comments within the Newsgroups-header; this is to simplify
processing by relaying and serving agents which have a requirement
to process this header extremely rapidly.
The inclusion of folding white space within a Newsgroups-content is a
newly introduced feature in this standard. It MUST be accepted by all
conforming implementations (relaying agents, serving agents and
reading agents). Posting agents should be aware that such postings
may be rejected by overly-critical old-style relaying agents. When a
sufficient number of relaying agents are in conformance, posting
agents SHOULD generate such whitespace in the form of <CRLF WSP> so
as to keep the length of lines in the relevant headers (notably
Newsgroups and Followup-To) to no more than 79 characters (or
other agreed policy limit - see 4.5). Before such critical mass
occurs, injecting agents MAY reformat such headers by removing
whitespace inserted by the posting agent, but relaying agents MUST
NOT do so.
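By way of illustration, a posting agent might fold a long
Newsgroups-header as sketched below. The function name and the
default limit parameter are assumptions. The sketch breaks only
after an ng-delim, where the syntax permits FWS, and it measures
line length in characters, which is one possible reading of the
79-character limit.

```python
def fold_newsgroups(names, limit=79):
    """Build a Newsgroups-header from a list of newsgroup-names,
    folding with CRLF followed by a single space after a comma
    whenever the next name would not fit on the current line."""
    lines = ["Newsgroups: "]
    for i, name in enumerate(names):
        piece = name if i == len(names) - 1 else name + ","
        # The first name always stays on the header line; a name that
        # is longer than the limit cannot be folded and is left as is.
        if i > 0 and len(lines[-1]) + len(piece) > limit:
            lines.append(" ")
        lines[-1] += piece
    return "\r\n".join(lines)
```

Removing each inserted CRLF and the space that follows it yields the
unfolded header again, which is what an injecting agent would do if it
chose to reformat the header for the benefit of old-style relaying
agents.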
Posters SHOULD use only the names of existing newsgroups in the
Newsgroups-header. However, it is legitimate to cross-post to
newsgroups which do not exist on the posting agent's host, provided
that at least one of the newsgroups DOES exist there, and followup
agents SHOULD accept this (posting agents MAY accept it, but Ought at
least to alert the poster to the situation and request confirmation).
Relaying agents MUST NOT rewrite Newsgroups-headers in any way, even
if some or all of the newsgroups do not exist on the relaying agent's
host. Serving agents MUST NOT create new newsgroups simply because an
unrecognized newsgroup-name occurs in a Newsgroups-header (see 7.2.1
for the correct method of newsgroup creation).
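The cross-posting rule for a posting or followup agent could be
sketched as follows, assuming the agent has a set of the newsgroups
known on its host (the function and parameter names are
hypothetical).

```python
def check_crosspost(newsgroups, local_groups):
    """Return (acceptable, unknown): acceptable is True when at least
    one listed newsgroup exists locally, and unknown lists the names
    that do not, so that the poster can be alerted to them."""
    unknown = [g for g in newsgroups if g not in local_groups]
    acceptable = len(unknown) < len(newsgroups)
    return acceptable, unknown
```

A posting agent that accepts such an article would typically show the
unknown names to the poster and ask for confirmation. The header
itself is passed on unchanged.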
The Newsgroups-header is intended for use in Netnews articles rather
than in email messages. It MAY be used in an email message to
indicate that it is a copy also posted to the listed newsgroups, in
which case the inclusion of a Posted-And-Mailed header (6.9) would
also be appropriate. However, it SHOULD NOT be used in an email-only
reply to a Netnews article (thus the "inheritable" property of this
header applies only to followups to a newsgroup, and not to followups
to the poster). Moreover, if a newsgroup-name contains any non-ASCII
character, it may need to be encoded using the mechanism defined in
section 5.5.2. See also the further discussion in section 8.8.1.
#Diff to first older
--- ../usefor-article-07/Newsgroups.out May 2002
+++ ../usefor-article-08/Newsgroups.out August 2002
@@ -9,17 +9,18 @@
and the receiving agent to receive at least one of the newsgroup-
names in the Newsgroups-header.
- References to "Unicode" or "the latest version of the Unicode
- Standard" mean [UNICODE 3.1] or any standard that supersedes it. That
- document contains guarantees of strict future upwards compatibility
- (e.g. no character will be removed or change classification).
- Implementors should be aware that currently unassigned code points
- (Unicode category Cn) may become valid characters in future versions
- of Unicode. Since the poster of an article might have access to a
- newer version of that standard, relaying and serving agents MUST
- accept such characters, but posting agents (and indeed all agents)
- MUST NOT generate them (though they might well follow up to
- newsgroup-names containing them).
+ In order to allow newsgroup-names containing Non-ASCII characters,
+ this section relies heavily on the provisions of the Unicode
+ Standard. All references to "Unicode" mean [UNICODE 3.2] or any
+ standard that supersedes it. That document contains guarantees of
+ strict future upwards compatibility (e.g. no character will be
+ removed or change classification). Implementors should be aware that
+ currently unassigned code points (Unicode category Cn) may become
+ valid characters in future versions of Unicode. Since the poster of
+ an article might have access to a newer version of that standard,
+ relaying and serving agents MUST accept such characters, but posting
+ agents (and indeed all agents) MUST NOT generate them (though they
+ might well follow up to newsgroup-names containing them).
header =/ Newsgroups-header
Newsgroups-header = "Newsgroups" ":" SP Newsgroups-content
@@ -28,28 +29,28 @@
*( [FWS] ng-delim [FWS] newsgroup-name )
[FWS]
newsgroup-name = component *( "." component )
- component = 1*component-glyph
+ component = 1*component-grapheme
ng-delim = ","
- component-glyph = combiner-base *combiner-mark
+ component-grapheme = combiner-base *combiner-mark
combiner-base = combiner-ASCII / combiner-extended
combiner-ASCII = DIGIT / ALPHA / "+" / "-" / "_"
- combiner-extended = <any character with a Unicode code value of
- 0080 or greater and a combining class of 0,
- but excluding any character in Unicode
- categories Cc, Cf, Cs, Zs, Zl, and Zp>
+ combiner-extended = <any character with a Unicode code value
+ of 0080 or greater but excluding any
+ character in Unicode categories
+ Cc, Cf, Cs, M* and Z*>
combiner-mark = <any character with a Unicode code value of
- 0080 or greater and a combining class other
- than 0>
+ 0080 or greater and in Unicode category M*>
- NOTE: the excluded characters are control characters (Cc),
- format control characters (Cf), surrogates (Cs), and separators
- (Zs, Zl, Zp). In particular, this excludes all whitespace
- characters. To all intents and purposes, a component-glyph is
- what a user might regard as a single "character" as displayed on
- his screen, though it might be transmitted as several actual
- characters (e.g. q-circumflex is two characters). Note also
- that, in some writing schemes, several component-glyphs will
- merge into one visible object of variable size.
+ NOTE: the excluded characters in a combiner-extended are control
+ characters (Cc), format control characters (Cf), surrogates
+ (Cs), marks (M*) and separators (Z*). In particular, this
+ excludes all whitespace characters. To all intents and
+ purposes, a component-grapheme is what a user might regard as a
+ single "character" as displayed on his screen, though it might
+ be transmitted as several actual characters (e.g. q-circumflex
+ is two characters). Note also that, in some writing schemes,
+ several component-graphemes will merge into one visible object
+ of variable size.
Each component MUST be invariant under Unicode normalization NFKC
(cf. the weaker normalization requirement for other headers in
@@ -57,9 +58,9 @@
also the explanatory NOTE in that section).
NOTE: As a result of of this restriction, a name has only one
- valid form. Implementations can assume that a straight
- comparison of characters or octets is sufficient to compare two
- newsgroup-names.
+ valid form. Implementations can assume that a straight (case
+ sensitive) comparison of characters or octets is sufficient to
+ compare two newsgroup-names.
The requirement that names be invariant under NFKC, rather than
NFC, means that all characters with a "compatibility
@@ -76,7 +77,7 @@
to NFKC.
NOTE: An implementation is not required to apply NFKC, or any
- other normalization, to newsgroup names. Only agencies that
+ other normalization, to newsgroup-names. Only agencies that
create new groups need to be careful to obey this restriction
(7.2.1). However, if a posting agent neglects to normalize a
newsgroup-name entered manually, this may lead to the user
@@ -84,13 +85,14 @@
Newsgroup-names containing non-ASCII characters MUST be encoded in
UTF-8 and not according to [RFC 2047].
+
Components beginning with underline ("_") are reserved for use by
- future versions of this standard and MUST NOT occur in newsgroup
+ future versions of this standard and MUST NOT occur in newsgroup-
names (whether in Newsgroups-headers or in newgroup control messages
(7.2.1)). However, such names MUST be accepted.
Components beginning with "+" or "-" are reserved for use by
- implementations and MUST NOT occur in newsgroup names (whether in
+ implementations and MUST NOT occur in newsgroup-names (whether in
Newsgroups-headers or in newgroup control messages). Implementors may
assume that this rule will not change in any future version of this
standard.
@@ -98,14 +100,13 @@
NOTE: For example, implementors may safely use leading "+" and
"-" to "escape" other entities within something that looks like
a newsgroup-name.
-
Agencies responsible for the administration of particular hierarchies
Ought to place additional restrictions on the characters they allow
in newsgroup-names within those hierarchies (such as to accord with
the languages commonly used within those hierarchies, or to avoid
perceived ambiguities pertinent to those languages). Where there is
no such specific policy, the following restrictions SHOULD be applied
- to newsgroup names.
+ to newsgroup-names.
NOTE: These restrictions are intended to reflect existing
practice, with some additions to accommodate foreseeable
@@ -135,6 +136,7 @@
characters in category Sk (Symbol, Modifier) [4]
characters in category Sm (Symbol, Math) [4][5]
characters in category So (Symbol, Other) [4]
+
[1] As new characters are added to Unicode, the code point moves
from category Cn to some other category. As stated above,
implementors should be prepared for this.
@@ -147,10 +149,9 @@
characters to the corresponding lowercase forms except under
the explicit instructions of the poster.
- [4] Traditionally newsgroup names have only used letters, digits,
+ [4] Traditionally newsgroup-names have only used letters, digits,
and the three special characters "+", "-" and "_". These
categories correspond to characters outside that set.
-
[5] Although the characters "+" and "-" are within categories Pd
and Sm, they are not forbidden.
@@ -165,13 +166,14 @@
(e.g. between article 123 of group xxx.yyy and the directory for
group xxx.yyy.123).
- 3. A component is limited to 30 component-glyphs and a newsgroup-name
- to 71 component-glyphs. Whilst there is no longer any technical
- reason to limit the length of a component (formerly, it was
- limited to 14 octets) nor of a newsgroup-name, it should be noted
- that these names are also used in the newsgroups line (7.2.1.2)
- where an overall policy limit applies and, moreover, excessively
- long names can be exceedingly inconvenient in practical use.
+ 3. A component is limited to 30 component-graphemes and a newsgroup-
+ name to 71 component-graphemes (counting also the '.'s separating
+ the components). Whilst there is no longer any technical reason to
+ limit the length of a component (formerly, it was limited to 14
+ octets) nor of a newsgroup-name, it should be noted that these
+ names are also used in the newsgroups-line (7.2.1.2) where an
+ overall policy limit applies and, moreover, excessively long names
+ can be exceedingly inconvenient in practical use.
Serving and relaying agents MUST accept any newsgroup-name that meets
the above requirements, even if they violate one or more of the
@@ -186,6 +188,7 @@
the mandatory requirements, to consider the effects of relaxing the
other restrictions, and to consider how all this may affect
propagation of the group.
+
Since future extensions to this standard and the Unicode standard,
including a possible relaxation of the NFKC normalization, plus any
relaxations of the default restrictions introduced by specific
@@ -201,7 +204,6 @@
comments within the Newsgroups-header; this is to simplify
processing by relaying and serving agents which have a requirement
to process this header extremely rapidly.
-
The inclusion of folding white space within a Newsgroups-content is a
newly introduced feature in this standard. It MUST be accepted by all
conforming implementations (relaying agents, serving agents and
@@ -217,8 +219,8 @@
NOT do so.
Posters SHOULD use only the names of existing newsgroups in the
- Newsgroups-header. However, it is legitimate to cross-post to a
- newsgroup(s) which do not exist on the posting agent's host, provided
+ Newsgroups-header. However, it is legitimate to cross-post to
+ newsgroups which do not exist on the posting agent's host, provided
that at least one of the newsgroups DOES exist there, and followup
agents SHOULD accept this (posting agents MAY accept it, but Ought at
least to alert the poster to the situation and request confirmation).
@@ -236,9 +238,6 @@
reply to a Netnews article (thus the "inheritable" property of this
header applies only to followups to a newsgroup, and not to followups
to the poster). Moreover, if a newsgroup-name contains any non-ASCII
- character, it MAY be encoded using the mechanism defined in [RFC
- 2047] when sent by email (for which purpose the newsgroup-name SHOULD
- be treated as an encoded-word) but, if it is subsequently returned to
- the Netnews environment, it MUST then be re-encoded into UTF-8. See
- also the further discussion in section 8.8.1.
+ character, it may need to be encoded using the mechanism defined in
+ section 5.5.2. See also the further discussion in section 8.8.1.