Austin Group Defect Tracker

ID 0000795
Category [1003.1(2013)/Issue7+TC1] Base Definitions and Headers
Severity Objection
Type Enhancement Request
Date Submitted 2013-11-15 22:46
Last Update 2015-07-02 20:24
Reporter steffen
View Status public
Assigned To ajosey
Priority normal
Resolution Open
Status Under Review
Name steffen
Organization
User Reference
Section XBD section 3 (Definitions) and section 7 (Locale)
Page Number 93, 140-146, 167
Line Number 2586, 4047, 4065, 4072, 4081, 4090, 4093, 4156, 4174, 4177, 4215, 4271, 4278, 4295-4297, 4330, 4332, 4360, 4362, 5302
Interp Status ---
Final Accepted Text
Summary 0000795: Addition of a new «symbol» character class
Description With the possible support of the Universal Character Set via a POSIX.UTF-8 locale, thousands of characters need to be classified using the restricted set of POSIX character classes.

This almost inevitably results in problematic classifications.

Unicode introduces additional major categories, and further subdivides them, to get around these limitations.
One major category that is entirely missing from POSIX is "Symbol", which this issue tries to introduce.

Historically (for the portable character set) POSIX classifies symbols as punctuation characters, and current implementations of a UTF-8 locale extend this classification to the entire UCS range.

Since portability concerns rule out changing all those implementations, the symbol character class will add support for a small subdivision.
Desired Action - Vol 1: Base Definitions, Chapter 3, «Definitions».

  On page 93, line 2586
  add (insert before)

    3.374 Symbol
      One of the characters included in the symbol
      character classification of the LC_CTYPE
      category in the current locale.
      Note: The LC_CTYPE category is defined in
      detail in Section 7.3.1 (on page 139).

- Vol 1: Base Definitions, Chapter 7, «Locale»,
  7.3.1, «LC_CTYPE».

  On page 140, line 4047 ff. (alpha)
  change

    In a locale definition file, no character
    specified for the keywords cntrl, digit,
    punct, or space shall be specified.

  to

    In a locale definition file, no character
    specified for the keywords cntrl, digit,
    punct, symbol or space shall be specified.

  On page 140, line 4065 ff. (space)
  change

    In a locale definition file, no character
    specified for the keywords upper, lower,
    alpha, digit, graph, or xdigit shall be
    specified.

  to

    In a locale definition file, no character
    specified for the keywords upper, lower,
    alpha, digit, graph, symbol or xdigit shall be
    specified.

  On page 140, line 4072 ff. (cntrl)
  change

    In a locale definition file, no character
    specified for the keywords upper, lower,
    alpha, digit, punct, graph, print, or xdigit
    shall be specified.

  to

    In a locale definition file, no character
    specified for the keywords upper, lower,
    alpha, digit, punct, graph, print, symbol or
    xdigit shall be specified.

  On page 140, line 4081 ff. (graph)
  change

    In the POSIX locale, all characters in classes
    alpha, digit, and punct shall be included; no
    characters in class cntrl shall be included.
    In a locale definition file, characters
    specified for the keywords upper, lower,
    alpha, digit, xdigit, and punct are
    automatically included in this class. No
    character specified for the keyword cntrl
    shall be specified.

  to

    In the POSIX locale, all characters in classes
    alpha, digit, punct and symbol shall be
    included; no characters in class cntrl shall
    be included.
    In a locale definition file, characters
    specified for the keywords upper, lower,
    alpha, digit, xdigit, punct and symbol are
    automatically included in this class. No
    character specified for the keyword cntrl
    shall be specified.

  On page 141, line 4090 ff. (print)
  change

    In a locale definition file, characters
    specified for the keywords upper, lower,
    alpha, digit, xdigit, punct, graph, and the
    <space> are automatically included in this
    class. No character specified for the keyword
    cntrl shall be specified.

  to

    In a locale definition file, characters
    specified for the keywords upper, lower,
    alpha, digit, xdigit, punct, symbol, graph, and
    the <space> are automatically included in this
    class. No character specified for the keyword
    cntrl shall be specified.

  On page 141, line 4093
  add (insert before)


    symbol
      Define characters to be classified as
      characters representing symbols.
      In the POSIX locale, only:
        <dollar-sign>, <plus-sign>,
        <less-than-sign>, <equals-sign>,
        <greater-than-sign>, <circumflex>,
        <grave-accent>, <vertical-line> and
        <tilde>
      shall be included.
      In a locale definition file, no character
      specified for the keywords upper, lower,
      alpha, digit, cntrl, xdigit, punct or as the
      <space> shall be specified.
      The <dollar-sign>, <plus-sign>,
      <less-than-sign>, <equals-sign>,
      <greater-than-sign>, <circumflex>,
      <grave-accent>, <vertical-line> and <tilde>
      of the portable character set may be
      included in this class even if they belong
      to the punct character class.

  On page 142, line 4156 ff., change the table to

                          Table 7-1 Valid Character Class Combinations
                                        Can Also Belong To
     In Class  upper  lower  alpha  digit  space  cntrl  punct  symbol  graph  print  xdigit  blank
     upper            —      A      x      x      x      x      x       A      A      —       x
     lower     —             A      x      x      x      x      x       A      A      —       x
     alpha     —      —             x      x      x      x      x       A      A      —       x
     digit     x      x      x             x      x      x      x       A      A      A       x
     space     x      x      x      x             —      *      x       *      *      x       —
     cntrl     x      x      x      x      —             x      x       x      x      x       —
     punct     x      x      x      x      —      x             *       A      A      x       —
     symbol    x      x      x      x      x      x      *              A      A      x       x
     graph     —      —      —      —      —      x      —      —              A      —       —
     print     —      —      —      —      —      x      —      —       —             —       —
     xdigit    —      —      —      —      x      x      x      x       A      A              x
     blank     x      x      x      x      A      —      *      x       *      *      x


  On page 143, line 4174 ff.
  change

    2. The <space>, which is part of the space
          and blank classes, cannot belong to
          punct or graph, but shall automatically
          belong to the print class. Other space
          or blank characters can be classified as
          any of punct, graph, or print.

  to

    2. The <space>, which is part of the space
          and blank classes, cannot belong to
          punct, symbol or graph, but shall
          automatically belong to the print class.
          Other space or blank characters can be
          classified as any of punct, graph, or
          print.

  On page 143, line 4177
  add (insert before)

    3. Historically the <dollar-sign>,
          <plus-sign>, <less-than-sign>,
          <equals-sign>, <greater-than-sign>,
          <circumflex>, <grave-accent>,
          <vertical-line> and <tilde> symbol
          characters were included in the punct
          character class.

  On page 144, line 4215
  add (insert before)

    #
    symbol <dollar-sign>;<plus-sign>;<less-than-sign>;\
           <equals-sign>;<greater-than-sign>;<circumflex>;\
           <grave-accent>;<vertical-line>;<tilde>

  On page 145, line 4271
  change

    <dollar-sign> punct, print, graph


  to

    <dollar-sign> punct, symbol, print, graph

  On page 145, line 4278
  change

    <plus-sign> punct, print, graph

  to

    <plus-sign> punct, symbol, print, graph

  On page 145, lines 4295-4297
  change

    <less-than-sign> punct, print, graph
    <equals-sign> punct, print, graph
    <greater-than-sign> punct, print, graph

  to

    <less-than-sign> punct, symbol, print, graph
    <equals-sign> punct, symbol, print, graph
    <greater-than-sign> punct, symbol, print, graph

  On page 146, line 4330
  change

    <circumflex> punct, print, graph

  to

    <circumflex> punct, symbol, print, graph

  On page 146, line 4332
  change

    <grave-accent> punct, print, graph

  to

    <grave-accent> punct, symbol, print, graph

  On page 146, line 4360
  change
    <vertical-line> punct, print, graph

  to

    <vertical-line> punct, symbol, print, graph

  On page 146, line 4362
  change

    <tilde> punct, print, graph

  to

    <tilde> punct, symbol, print, graph

- Vol 1: Base Definitions, Chapter 7, «Locale»,
  7.4.2, «Locale Grammar».

  On page 167, line 5302
  add (insert before)

    | 'symbol'
Tags UTF-8_Locale
Attached Files

- Relationships
related to 0000797 New Addition of a isw?symbol(_l)?() function family
related to 0000798 New Addition of a [:symbol:] bracket expression character class expression

-  Notes
(0001987)
Don Cragun (manager)
2013-11-16 07:15

This bug was originally filed against the Rationale section of the 2008-TC1 project and the page and line number for the 1st suggested change did not match any version of the standard (the page number was off by one from the 2013 edition). That mistaken page number has been corrected and this bug has been moved to Category Base Definitions and Headers in the Project 1003.1(2013)/Issue7+TC1.

If these changes are made to the POSIX Locale, don't we also need to add issymbol() and iswsymbol() functions to XSH?

If we add a new character class to the POSIX Locale, is it still appropriate to require that the C Locale and the POSIX Locale be synonyms for the same locale?
(0001988)
steffen (reporter)
2013-11-16 15:51

Addition:
XBD section 7.3.1, page 139, lines 4023-4025
change

  The character classes digit, xdigit, lower, upper,
  and space have a set of automatically included characters.

to

  The character classes digit, xdigit, lower, upper,
  symbol, blank and space have a set of automatically included
  characters.

[Note that "blank" doesn't belong to this issue, but oh dear]
(0001989)
steffen (reporter)
2013-11-16 16:27

Addition:
XRAT A.7.3.1 «LC_CTYPE», page 3488, line 117720
change

  The character classes digit, xdigit, lower,
  upper, and space have a set of automatically
  included

to

  The character classes digit, xdigit, lower,
  upper, symbol, blank and space have a set of
  automatically included
    
[Again, "blank" doesn't belong to this issue]
(0001990)
steffen (reporter)
2013-11-16 21:08

Note: 0001987:
  If these changes are made to the POSIX Locale, don't we also need to add issymbol() and iswsymbol() functions to XSH?

I've opened 0000797.

  If we add a new character class to the POSIX Locale, is it still appropriate to require that the C Locale and the POSIX Locale be synonyms for the same locale?

It seems to me that POSIX already defines itself (the POSIX locale) to be a (compatible) superset of what ISO C requires for the "C" locale.
I've opened 0000796.
The character class itself however would be an extension (at the time of this writing).
(0001991)
shware_systems (reporter)
2013-11-16 23:32

RE: #0001987

I'm more into pestering the 'C' standard folks to add issymbol() and iswsymbol() first, as something that should have been added to C11. What else may be considered 'missing' is a separate gripe. This in particular has been a known issue since before C89: some common code sets consist mostly of symbols, in the linguistic sense, that fit none of the other ctype keywords. Until that happens I think extra ctype keywords like this, and even locale categories, should be a testable option an implementation supports. That's all I see that can be done while deferring to 'C' as the base behavior.

As it is, most of the proposed changes here would need to be shaded CX until 'C' incorporates them. They could be shaded LCX for a Locale Extensions Option. Such an Option could specify additional standardized locales that would incorporate this and other changes without burdening embedded systems that only need the base POSIX locale as it is now, whether or not localedef is also supported. This would be consistent with how other extensions have been introduced to the standard that are now part of the base.

Another possibility is 'symbol' becomes a reserved charclass-name for the charclass keyword, with defined behavior if it is specified as a charclass argument, similar to the elective Environment Variables of XBD Section 8. That extension mechanism is already in place and supporting it there might make this a TC2 candidate, as it does apply to other code sets besides Unicode. Making it a non-charclass optional keyword could be left to Issue 8 as part of a full Unicode proposal, with modifying the 'C' locale left to Issue 9.
(0001992)
geoffclare (manager)
2013-11-18 10:23

There is a major discrepancy between the description and the desired
action, in that the description says symbols are currently classified
as punctuation characters and portability issues will forbid changing
that, but the desired action does exactly that.

In any case, this appears to be invention and therefore cannot be
standardized until it has been implemented in at least one widely used
system. (The related 0000797 adding new is[w]symbol[_l]()
interfaces would also need a sponsor.)
(0001993)
steffen (reporter)
2013-11-18 13:30

Reply to Note: 0001992.

> There is a major discrepancy [.]
> the desired action does exactly that.

Oh, a Freudian slip!
Correction:

  On page 141, line 4093
  add (insert before)

  symbol
      Define characters to be classified as
      characters representing symbols.
      In the POSIX locale, only:
        <dollar-sign>, <plus-sign>,
        <less-than-sign>, <equals-sign>,
        <greater-than-sign>, <circumflex>,
        <grave-accent>, <vertical-line> and
        <tilde>
      shall be included.
      In a locale definition file, no character
      specified for the keywords upper, lower,
      alpha, digit, cntrl, xdigit, punct or as the
      <space> shall be specified.
      Historically the <dollar-sign>, <plus-sign>,
      <less-than-sign>, <equals-sign>,
      <greater-than-sign>, <circumflex>,
      <grave-accent>, <vertical-line> and <tilde>
      of the portable character set are
      also included in the punct class.
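
As a rough illustration only (not part of the proposed text), the nine characters named in the correction above can be checked against the existing <ctype.h> classes. The sketch assumes the default "C" locale; the names proposed_symbols, matches_proposed_listing and count_matching are made up for this example.

```c
#include <ctype.h>

/* Assumption: the nine characters the proposal lists for the POSIX locale. */
static const char proposed_symbols[] = "$+<=>^`|~";

/* Returns 1 if c is punct, graph and print today and belongs to none of
 * upper, lower, alpha, digit, cntrl, xdigit or space; otherwise 0. */
static int matches_proposed_listing(int c)
{
    return ispunct(c) && isgraph(c) && isprint(c) &&
           !isupper(c) && !islower(c) && !isalpha(c) && !isdigit(c) &&
           !iscntrl(c) && !isxdigit(c) && !isspace(c);
}

/* Counts how many of the nine proposed characters pass the check. */
static int count_matching(void)
{
    int n = 0;
    for (const char *p = proposed_symbols; *p != '\0'; p++)
        n += matches_proposed_listing((unsigned char)*p);
    return n;
}
```

In the "C" locale all nine characters pass, which matches the "punct, symbol, print, graph" entries in the desired action.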

> this appears to be invention

POSIX already allows implementations to add additional [:class:]es.
IBM ICU / Unicode make extensive use of that and standardize extensions ([1]) which are covered via \p{PROPERTY} in engines like the one from perl(1) (and compatible), e.g., [:script=greek:].

(Note they also extend the syntax to something that is direly missing: class intersections and subtractions; e.g., [[:punct:]--[:symbol:]] (which means the set of all punctuation characters that are not also symbols).)
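
For the portable character set, the effect of such a subtraction can be approximated in plain C. This is only a sketch under stated assumptions: it runs in the default "C" locale, it hard-codes the nine characters proposed in this issue, and punct_minus_symbol is a hypothetical helper, not an existing interface.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Assumption: the nine characters proposed for "symbol" in the POSIX locale. */
static const char proposed_symbols[] = "$+<=>^`|~";

/* Writes every 7-bit character that is punct but not a proposed symbol
 * into out (NUL-terminated) and returns how many were written.
 * out must have room for at least 33 bytes. */
static size_t punct_minus_symbol(char *out)
{
    size_t n = 0;
    for (int c = 1; c < 128; c++) {
        if (ispunct(c) && strchr(proposed_symbols, c) == NULL)
            out[n++] = (char)c;
    }
    out[n] = '\0';
    return n;
}
```

In the "C" locale this leaves 23 characters: the 32 ASCII punctuation characters minus the nine proposed symbols.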

So yes, it is absent (to the best of my knowledge) from any POSIX implementation, but it is standardized, supported and even further extended in syntax by the leading industry standard, Unicode.

There is an online tool ([2]) that can be used to test character classes such as [:ASCII:], [:symbol:] and [:mark:]. The last of these, mark, is the second major category that is entirely missing from the POSIX standard. (Testing it via [2] yields a result set of no fewer than 1,645 code points.)

 [1] http://www.unicode.org/reports/tr18/
 [2] http://unicode.org/cldr/utility/list-unicodeset.jsp


> adding new is[w]symbol[_l]() interfaces would also need a sponsor

I'm afraid so; as a last resort these issues could be left open until they come in via the C standard?

And what about «mark», isw?mark(_l)?() and [:mark:]?
(0001994)
shware_systems (reporter)
2013-11-18 14:37

They get added as charclass keyword names and the generic iswctype() is used directly or in wrapper functions by REs and tr. This is already required behavior. These additions can overlap other classes, as xdigit combines digits and some upper and lower, so with 'symbol' defined that way a code point can be both punct and symbol for current code sets. Restricting it to non-alpha, non-xdigit, non-NUL should suffice to keep backwards compatibility. For some code sets allowing a code point to be both symbol and blank, cntrl, or punct is warranted.
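
The mechanism described in this note can be sketched with the existing wctype() and iswctype() interfaces. This is a minimal sketch, not a proposed API: in_named_class and iswsymbol_sketch are hypothetical names, and "symbol" is the class name proposed in this issue, which a locale would have to define (for example via a charclass keyword) before the lookup could succeed.

```c
#include <wchar.h>
#include <wctype.h>

/* Returns 1 if wc is in the named class of the current LC_CTYPE locale,
 * 0 if it is not, and -1 if the locale does not define that class. */
static int in_named_class(wint_t wc, const char *class_name)
{
    wctype_t desc = wctype(class_name);
    if (desc == (wctype_t)0)
        return -1;                  /* class unknown in this locale */
    return iswctype(wc, desc) != 0;
}

/* Hypothetical wrapper: true only if the current locale defines a
 * "symbol" class and wc belongs to it. */
static int iswsymbol_sketch(wint_t wc)
{
    return in_named_class(wc, "symbol") == 1;
}
```

In a locale that does not define "symbol", wctype() returns zero and the wrapper reports false for every character, so callers can still distinguish "not a symbol" from "class not available" through in_named_class().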

- Issue History
Date Modified Username Field Change
2013-11-15 22:46 steffen New Issue
2013-11-15 22:46 steffen Status New => Under Review
2013-11-15 22:46 steffen Assigned To => ajosey
2013-11-15 22:46 steffen Name => steffen
2013-11-15 22:46 steffen Section => Vol 1. 3., Vol 1. 7.,
2013-11-15 22:46 steffen Page Number => 92, 140-146, 167
2013-11-15 22:46 steffen Line Number => 2586, 4047, 4065, 4072, 4081, 4090, 4093, 4156, 4174, 4177, 4215, 4271, 4278, 4295-4297, 4330, 4332, 4360, 4362, 5302
2013-11-16 07:04 Don Cragun Project 2008-TC1 => 1003.1(2013)/Issue7+TC1
2013-11-16 07:15 Don Cragun Section Vol 1. 3., Vol 1. 7., => XBD section 3 (Definitions) and section 7 (Locale)
2013-11-16 07:15 Don Cragun Page Number 92, 140-146, 167 => 93, 140-146, 167
2013-11-16 07:15 Don Cragun Interp Status => ---
2013-11-16 07:15 Don Cragun Note Added: 0001987
2013-11-16 07:15 Don Cragun Severity Editorial => Objection
2013-11-16 07:15 Don Cragun Category Rationale => Base Definitions and Headers
2013-11-16 07:15 Don Cragun Desired Action Updated
2013-11-16 15:51 steffen Note Added: 0001988
2013-11-16 16:27 steffen Note Added: 0001989
2013-11-16 21:08 steffen Note Added: 0001990
2013-11-16 23:32 shware_systems Note Added: 0001991
2013-11-18 10:23 geoffclare Note Added: 0001992
2013-11-18 10:24 geoffclare Relationship added related to 0000797
2013-11-18 10:24 geoffclare Relationship added related to 0000798
2013-11-18 13:30 steffen Note Added: 0001993
2013-11-18 14:37 shware_systems Note Added: 0001994
2014-01-16 16:45 Don Cragun Tag Attached: UTF-8_Locale

