Austin Group Defect Tracker

Aardvark Mark IV

ID 0001077
Category [1003.1(2013)/Issue7+TC1] System Interfaces
Severity Editorial
Type Enhancement Request
Date Submitted 2016-09-11 17:47
Last Update 2016-09-11 23:54
Reporter deadpixi
View Status public
Assigned To
Priority normal
Resolution Open
Status New
Name Rob King
User Reference
Section regcomp
Page Number 1783-1789
Line Number 57399-57703
Interp Status ---
Final Accepted Text
Summary 0001077: Recommend support for wide-character regcomp and regexec and/or specify multi-byte behavior
Description The existing mandated regular expression interfaces are specified to work solely on regular expressions and inputs that are specified using single-byte character strings. The standard is silent on what the regular expression functions should do if either the regular expression or the input contains multi-byte encoded characters in the current (or any) multi-byte encoding.

This makes it impossible to rely on the regular expression interfaces in a portable manner, as their behavior is unspecified with multi-byte characters.

For example, in UTF-8 encoding, a regular expression containing a single logical code point might encode that code point as four individual bytes. If this encoding were used in, e.g., a regular expression character class, a naive implementation that expected each character to take up a single byte would treat each individual byte as a separate character to be matched in the character class, and not as a single character.
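
One way to probe this on a given implementation is sketched below. It is only an illustration under stated assumptions: match_in_locale is a hypothetical helper (not part of any standard), the locale name "C.UTF-8" used in the usage note is an assumption that may not exist on every system, and the helper changes the process-wide locale without restoring it.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper, not a POSIX interface.
 * Selects the named locale, compiles `pattern` as an ERE, and runs it
 * against `text`.
 * Returns 1 on a match, 0 on no match, and -1 if the locale cannot be
 * selected, the pattern does not compile, or regexec reports an error.
 * Note: the global locale is changed and is NOT restored. */
int match_in_locale(const char *locale_name, const char *pattern,
                    const char *text)
{
    assert(locale_name != NULL && pattern != NULL && text != NULL);

    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;                      /* locale not available here */

    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;                      /* pattern rejected */

    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);

    if (rc == 0)
        return 1;
    if (rc == REG_NOMATCH)
        return 0;
    return -1;
}
```

Calling match_in_locale("C.UTF-8", "^[\xc3\xa9]$", "\xc3\xa9") asks whether a bracket expression holding the two-byte UTF-8 encoding of U+00E9 matches that character as a single unit. An implementation that interprets the pattern and input according to the selected locale would be expected to report a match; one that treated every byte as a separate character would not, because the anchored bracket expression could then match only a single byte.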
Desired Action The Standard should specify behavior of the regular expression interfaces when the expression to be compiled or the input to be matched contains multi-byte characters in the current character encoding.

Alternatively (and perhaps preferably), the Standard should specify additional wide-character regular expression interfaces (perhaps named regwcomp and regwexec) to perform regular expression compilation and matching on expressions and inputs specified using wide characters; this would avoid any issue with multi-byte encoding.

The specification of the additional wide-character interfaces would likely not be too burdensome: the GNU C library (when compiled with RE_ENABLE_I18N defined) already represents regular expressions and the input as wide characters internally. The Mac OS X standard library supports the "regwcomp" and "regwexec" interfaces. A free, portable, well-licensed, and popular implementation (libtre) exists and has been incorporated into several popular C libraries (musl, Darwin's libc, etc).
Tags No tags attached.
Attached Files

- Relationships

-  Notes
deadpixi (reporter)
2016-09-11 17:53

As an additional note, wide-character regular expressions are described in standard C++ as well, as of the C++11 language.
Don Cragun (manager)
2016-09-11 21:03

I see nothing in the description of regular expressions in the standard nor in the description of the regcomp() and regexec() functions that restricts their use to single-byte character strings. I believe the requirements are perfectly clear and that they apply to multi-byte character strings (such as UTF-8) just as much as they do to single-byte character strings (such as ASCII, EBCDIC, and ISO 8859-*).
deadpixi (reporter)
2016-09-11 23:54

I appreciate the rapid response, thank you.

I agree now that the multi-byte concerns are likely unfounded. I still think that there is some significant merit to specifying the wide-character interfaces. The only way to portably do many things on strings is to first convert them to wide strings. For example, it is impossible to portably move a character pointer backwards to point at the previous character of a string, since in shifted encodings one would first have to scan from the beginning of the string.
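
To illustrate the forward scan described above, here is a minimal sketch. prev_char_offset is a hypothetical helper name, not an existing interface; it relies only on the standard mbrlen() function and interprets the string in whatever locale the caller has selected with setlocale().

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not a standard interface.
 * Returns the byte offset of the character immediately before byte
 * offset `pos` in the multi-byte string `s`, decoding from the start of
 * the string in the current locale.
 * Returns (size_t)-1 if `pos` is 0, if `pos` does not fall on a
 * character boundary, or if an invalid or incomplete sequence is found. */
size_t prev_char_offset(const char *s, size_t pos)
{
    size_t len = strlen(s);
    assert(pos <= len);

    mbstate_t st;
    memset(&st, 0, sizeof st);          /* start in the initial shift state */

    size_t cur = 0;                     /* start of the character being decoded */
    size_t prev = (size_t)-1;           /* start of the last fully decoded character */

    while (cur < pos) {
        size_t n = mbrlen(s + cur, len - cur, &st);
        if (n == (size_t)-1 || n == (size_t)-2 || n == 0)
            return (size_t)-1;          /* invalid, incomplete, or unexpected NUL */
        prev = cur;
        cur += n;
    }

    /* If the scan overshot `pos`, then `pos` was inside a character. */
    return cur == pos ? prev : (size_t)-1;
}
```

The cost is linear in the offset for every backward step, which is the inconvenience the note above points at; after a one-time conversion to a wide string, the same step is a simple pointer decrement.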

I appreciate everyone's time in considering this.

- Issue History
Date Modified Username Field Change
2016-09-11 17:47 deadpixi New Issue
2016-09-11 17:47 deadpixi Name => Rob King
2016-09-11 17:47 deadpixi Section => regcomp
2016-09-11 17:47 deadpixi Page Number => -
2016-09-11 17:47 deadpixi Line Number => -
2016-09-11 17:53 deadpixi Note Added: 0003376
2016-09-11 17:53 deadpixi Issue Monitored: deadpixi
2016-09-11 21:03 Don Cragun Note Added: 0003377
2016-09-11 21:08 Don Cragun Page Number - => 1783-1789
2016-09-11 21:08 Don Cragun Line Number - => 57399-57703
2016-09-11 21:08 Don Cragun Interp Status => ---
2016-09-11 23:54 deadpixi Note Added: 0003378
