Austin Group Defect Tracker

Aardvark Mark IV


ID: 0001827
Category: [Issue 8 drafts] Shell and Utilities
Severity: Objection
Type: Enhancement Request
Date Submitted: 2024-04-15 03:23
Last Update: 2024-04-17 18:01
Reporter: nrk
View Status: public
Assigned To:
Priority: normal
Resolution: Open
Status: New
Product Version:
Name: NRK
Organization:
User Reference:
Section: compress(1) utility
Page Number: n/a
Line Number: n/a
Final Accepted Text:
Summary: 0001827: Standardize gzip(1) cli interface instead of adding it to compress(1)
Description: As a resolution to https://www.austingroupbugs.net/view.php?id=1041 the gzip/deflate algorithm was added to the compress(1) tool.

I believe standardizing the gzip(1) interface would've been the correct decision, and adding the gzip format to compress(1) will create a bunch of practical issues, which I'll outline below.

Hurdles for script authors
--------------------------

gzip(1) is ubiquitous and is shipped with almost every distribution, whereas compress(1) is often not installed by default.

If gzip(1) is standardized, most existing scripts that use gzip will automatically become compliant with the next POSIX edition. On the other hand, adding gzip to compress(1) means such scripts will need to be edited to use compress(1) in order to avoid a non-POSIX dependency. But doing so will result in such scripts no longer working on older systems where compress(1) is either not available or doesn't support the gzip format.

Standardizing gzip(1) avoids such issues entirely.

Confusing cli options
---------------------

Because gzip and (ancient) compress have different file formats and algorithms, many cli options which are applicable to one will not be applicable to the other. The prime example is the `-b` flag, which has been repurposed to mean different things (max bits or compression level) depending on the algorithm used. This also deviates from the *majority* of compression tool cli interfaces, where `-[0..9]` to indicate compression level has been the de facto standard, leading to unnecessary confusion for users.
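
To make the overloading concrete, here is a minimal Python sketch of how one `-b` value changes meaning with the selected algorithm. The function name `interpret_b`, the algorithm labels, and the accepted ranges (9 to 16 code bits as in ncompress, levels 1 to 9 as in gzip) are assumptions for illustration, not text quoted from the draft.

```python
def interpret_b(algo, value):
    """Return (meaning, value) for a hypothetical `-b value` under `algo`."""
    if algo == "lzw":
        # Assumed range: maximum LZW code width in bits, as in ncompress.
        if not 9 <= value <= 16:
            raise ValueError("lzw: -b must be a code width between 9 and 16")
        return ("max_code_bits", value)
    if algo in ("deflate", "gzip"):
        # Assumed range: compression level, what gzip spells -1 .. -9.
        if not 1 <= value <= 9:
            raise ValueError(algo + ": -b must be a level between 1 and 9")
        return ("compression_level", value)
    raise ValueError("unsupported algorithm: " + algo)


# The same number is a code width for one algorithm and a level for the other.
print(interpret_b("lzw", 9))
print(interpret_b("gzip", 9))
```

Under these assumed ranges, `-b 9` is the smallest code width for LZW but the highest level for DEFLATE, and `-b 12` is valid for one algorithm and an error for the other.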

Implementation complexity
-------------------------

Currently one can implement the compress(1) utility or the gzip(1) utility as stand-alone software without having to worry about the other. But if compress(1) also needs to support the gzip format, then the author now needs to implement both LZW and DEFLATE in order to be compliant. This increases the implementation difficulty; it's also against the unix philosophy of doing one thing.

But even more worrying is the passage below:

    Other implementation-defined algorithms may be supported.

If even more formats are added in the future, then not only will that increase implementation difficulty further, it's also going to create an even more awkward cli option situation where parameters for one algorithm don't make sense for another.

Ultimately this will lead to the compress(1) cli either having lots of options being confusingly multiplexed, or being limited to the subset of parameters which make sense for most/all algorithms, leading to a tool that does more than one thing, and does them poorly.
Desired Action:
1. Revert the changes to compress(1)
2. Standardize a subset of gzip(1) with cli options that are known to be supported by most/all major implementations.

Notes
(0006752)
dannyniu (reporter)
2024-04-16 05:57
edited on: 2024-04-16 05:58

I totally agree that `gzip` should've been standardized, however, I have a few observations on this:

1. Some Linux distros don't ship `pax` by default, but instead `cpio` and `tar`.

2. The current draft 4 is past the stage where we can make normative changes, so we might have to settle for an application usage or rationale section.

3. I once asked in a note if avoiding the `gzip` name had anything to do with potential trademark infringement. Luckily one of the maintainers said it doesn't; it's just that one implementation had already implemented gzip compression in the `compress` utility in the current form.

4. the `compress` utility can (if I didn't interpret it wrong) report "unsupported" for some combinations of the options.

(0006753)
steffen (reporter)
2024-04-16 18:39

To be a stinker, i do not.
In my opinion, if anything, a new compression format should be added.

For example zstd is published as RFC 8878, and has the capability to either be incredibly fast, or to offer compression almost as good as xz. Decompression is always incredibly fast.
It uses fast XXH64 (xxHash is BSD 2-clause licensed) checksums which "survive smhasher testing".
zstd is (optionally) BSD 3-clause licensed, and has gained more and more momentum over the years: you find it everywhere, and where you do not find it today, you surely will by the time the next POSIX standard appears.

There is also plzip, which is a tenth of the size of zstd (static linkage easily doable), and with at least the library also being BSD 2-clause licensed.
It compresses even better than zstd in best mode, or xz, but is pretty slow.
It uses a CRC-32 checksum, which seems to be a fixed decision.
It has a very easy programming interface (i am working on getting either memory hooking or buffer passing accepted by its author, but will likely fail).
The author has been trying for years to get the format standardized as an RFC at the IETF: nothing but silence at first (iirc), but last time, not too many weeks ago, he got some responses which he was thankful for, being able to improve the documents' formalism.
plzip is even older than zstd (at least fifteen years), has been stable for many years, and is used by notable packages like the IANA TZ distribution.

Regarding checksums it seems the ship is sailing towards xxhash (with siphash having traction) on the one hand, and Blake2 (RFC 7693) on the other (very noticeable in the Argon2 "monster" (RFC 9106) that is gaining traction almost everywhere as the new password hashing scheme of choice).
Blake2 is also part of OpenSSL.
(0006754)
nrk (reporter)
2024-04-16 22:51

> and it's just one implementation had already implemented gzip compression in the `compress` utility in the current form

AFAIK, that implementation is OpenBSD: https://man.openbsd.org/compress.1
But OpenBSD also provides gzip(1), with the familiar -0..9 cli flag for compression level: https://man.openbsd.org/gzip.1

And so IMO, deciding to add gzip support to compress(1), which is followed by a single implementation, instead of standardizing gzip(1), which is followed by nearly all implementations, is a very poor decision and is leading to many of the concerns outlined above.

> the `compress` utility can (if I didn't interpret it wrong) report "unsupported" for some combinations of the options.

It may, but my point was about poor and confusing interface. A flag doing one thing in one case and something completely different in another is usually not good design.

Furthermore, consider the hassle when trying to look through the manpage. Currently I can do `man gzip` and only see gzip(1) specific options or `man compress` and only see compress(1) specific options. Merging gzip and lzw into a single utility means the documentation will be cluttered with both even when I'm only looking for one.

> the current draft 4 has passed a stage where we can make normative changes to, so we might have to settle with an application usage or rationale section

That's unfortunate. If gzip(1) cannot be added, can the changes to compress(1) be reverted at least?

> some linux distros don't ship `pax` by default, but instead `cpio` and `tar`.

This is one more reason why I believe the compress(1) change should be reverted. gzip(1) is so ubiquitous that it is HIGHLY unlikely that people will start using compress(1) for it, especially when you consider the "Hurdles for script authors" issues I raised above. As such, I would not be surprised at all if distributions continue to not ship compress(1), or if compress(1) implementations simply do not add gzip support (aside from OpenBSD, which already has it).
And so a very likely outcome is a more complicated compress(1) specification which benefits no one.
(0006755)
nrk (reporter)
2024-04-16 23:03

> a new compression format should be added.

zstd is nice and I do like it. But IMO this is off-topic for this issue which is more regarding compress(1) and gzip(1).

> Regarding checksums [...] Blake2

Are we talking about checksum utilities or the checksum present in the compressed archive?

I view both as off-topic for this issue, but if it's the latter then I'd like to point out that using a cryptographic hash as checksum in compressed archive has *no security benefits*. If an attacker can modify the compressed data stream then he can also modify the hash which is stored in the same file.

Lzip's choice of CRC32 is a perfectly fine decision for detecting non-malicious data corruption. If you want to detect malicious modification then the hash needs to be stored elsewhere, outside the attacker's control.
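
A minimal Python sketch of this point, using only the standard library. The payload strings and the idea of keeping a SHA-256 digest in a separate, trusted place are illustrative assumptions. It shows that the CRC-32 in a gzip trailer catches accidental damage, while an attacker who rewrites the file simply ends up with a fresh, matching CRC-32.

```python
import gzip
import hashlib
import struct
import zlib


def trailer_crc(blob):
    """CRC-32 stored in the 8-byte trailer of a single-member gzip stream."""
    crc, _isize = struct.unpack("<II", blob[-8:])
    return crc


payload = b"pay alice 10\n"
archive = gzip.compress(payload)
# Digest kept out of band, somewhere the attacker cannot write.
trusted_digest = hashlib.sha256(archive).hexdigest()

# Accidental corruption: one flipped bit in the stored CRC is detected.
damaged = bytearray(archive)
damaged[-8] ^= 0x01
try:
    gzip.decompress(bytes(damaged))
    corruption_detected = False
except OSError:  # gzip.BadGzipFile is a subclass of OSError
    corruption_detected = True

# Malicious modification: the attacker recompresses a new payload, so the
# embedded checksum is recomputed and the archive verifies cleanly.
forged_payload = b"pay mallory 9999\n"
forged = gzip.compress(forged_payload)
forged_verifies = gzip.decompress(forged) == forged_payload
crc_matches = trailer_crc(forged) == zlib.crc32(forged_payload)

# Only the externally stored digest exposes the swap.
swap_detected = hashlib.sha256(forged).hexdigest() != trusted_digest

print(corruption_detected, forged_verifies, crc_matches, swap_detected)
```

Replacing CRC-32 with a cryptographic hash inside the same file would not change the second half of this sketch: the forged archive would carry a correct hash of the forged payload.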
(0006756)
oguzismailuysal (reporter)
2024-04-16 23:32

Standardizing a single frontend for various compression formats makes more sense than standardizing a separate utility for each format. If anything, POSIX should encourage supporting other popular algorithms such as lzip, bzip2, and xz.
(0006757)
steffen (reporter)
2024-04-17 00:10

Then the issue should be closed, because why standardize yet another compression format that will see less and less usage because of its properties?
It is *much* worse for text files, and it is *much* worse for tarballs as produced by pax(1) (or via tar(1)).
In my opinion, of course.
What is plain is only that compress(1) is not even installed anymore on a lot of boxes.
*And* that i agree that placing new algorithms under a compress(1) interface does not seem to make any sense.
And that the xxhash64 of BTRFS or zstd (but even xxhash32) is still better than CRC-32 if you take the mathematical approach. People are moving away from CRC32, and standards are too, even though i know that SHA-1 and MD5 are still used for checksumming even if they are broken cryptographically.
(0006758)
nrk (reporter)
2024-04-17 00:32

> Standardizing a single frontend for various compression formats makes more sense than standardizing a separate utility for each format.

I fail to see how it "makes more sense", especially in the face of all the issues I've raised (none of which has been countered).

But let's imagine that a single monolithic tool becomes standard. Now let's imagine that there's a script/program that needs to decompress both X and Y files. Something like:

    uncompress -m X ...
    uncompress -m Y ...

Now if my system's uncompress(1) only supports X but doesn't support Y, this will not work. And if I try to replace my system's uncompress(1) with a separate implementation which supports Y but doesn't support Z (which may be needed elsewhere), then what's the solution? Continually keep symlinking uncompress(1) to whatever implementation happens to work for the moment?

Having separate compression utilities that deal with separate formats avoids such issues. I can have gzip(1), bzip2(1), zstd(1) etc, all installed at once. I can even use pigz(1) and symlink it to gzip(1) (which I do on my system) since it's a compatible alternative which also has multi-threaded compression.

A standalone tool like pigz(1) would not be able to exist, and be easily used system wide by simply symlinking it to gzip(1), if the convention was a single monolithic tool for all algorithms.

And what happens if someone comes up with a new algorithm? The convention of a single monolithic tool forces him to either add his algorithm to this one monolithic tool or to fork it. The current convention of separate tools allows him to just release his own tool without worrying about getting it into the monolithic compress(1) implementation. Moreover, people who use that tool do not have to worry about whether their system's monolithic compress(1) supports it or not, because each tool can co-exist separately.

And so I don't think a monolithic tool "makes sense", it is a direct downgrade that:

* Goes against existing convention and practices.
* Makes it impossible for a user to pick and choose which utility and/or specific implementation they want to use.
* Makes it difficult for implementations to provide standalone tools focusing on a single format.
* Makes a mess out of the cli options where different algorithms have different parameters.
* Makes the manpage and documentation either a cluttered mess or a small subset which doesn't offer any way to tune algorithm specific params.
(0006759)
larryv (reporter)
2024-04-17 01:36

Re: Note: 0006754:
> If gzip(1) cannot be added, can the changes to compress(1) be reverted at least?

Almost certainly not; draft 4.1 was submitted to the IEEE Standards Review Committee over four weeks ago.
(0006760)
oguzismailuysal (reporter)
2024-04-17 06:16
edited on: 2024-04-17 06:19

Re: Note: 0006758

> I fail to see how it "makes more sense",

It makes more sense in that an implementation can choose to distribute separate utilities for each format and implement compress as a wrapper around them that invokes the respective utility under the hood, or embed a bare bones implementation of each algorithm in compress and ship it as the main compression utility.

Standardizing a fairly complex utility that provides non-essential functionality would burden the developer with hours of work or force his hand to ship foreign code with all its defects and shortcomings. Extending an existing utility with basic support for an additional input format is less demanding and more likely to achieve results while staying neutral.

> But let's imagine that a single monolithic tool becomes standard.

It already has.

> Now if my system's uncompress(1) only support X but doesn't support Y this will not work. And if I try to replace my system's uncompress(1) with a separate implementation which supports Y but doesn't support Z (which may be needed elsewhere) then what's the solution to this?

If X, Y, and Z are all recognized by POSIX uncompress and your system provides a conforming uncompress implementation, it will work. Otherwise you'll have to rely on an extension either way.

> Having separate compression utility that deal with separate formats avoids such issues. I can have gzip(1), bzip2(1), zstd(1) etc, all installed at once.

There is nothing in POSIX that stops you from doing that.

> I can even use pigz(1) and symlink it to gzip(1) (which I do in my system) since it's a compatible alternative which also has multi-threaded compression.

This is not a good place to advertise our toy programs.

The rest of your message reads as if POSIX mandates the exclusion of individual compression utilities, which is not true. The upcoming issue of POSIX offers compress as a simple interface for lzw and deflate algorithms and doesn't say anything about alternative interfaces and additional formats.

(0006761)
nrk (reporter)
2024-04-17 07:47

> Extending an existing utility with basic support for an additional input format is less demanding

Both the gzip file-format and the underlying algorithm (DEFLATE) are *entirely different* to LZW. LZW builds up a dictionary on the go whereas DEFLATE is a combination of Huffman code + LZ77. Adding DEFLATE support to existing LZW implementation would be just as much work as writing a DEFLATE implementation from scratch because they share almost *nothing in common*.
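
To illustrate how little the two share, here is a textbook LZW encoder and decoder in Python next to DEFLATE via the standard `zlib` module. This is a sketch under assumptions: the LZW code below emits a plain list of integer codes and is not the `.Z` file format written by compress(1), and `zlib.compress` wraps the DEFLATE stream in a zlib header rather than a gzip one.

```python
import zlib


def lzw_encode(data):
    """Textbook LZW: grow a phrase table while scanning, emit integer codes."""
    table = {bytes([i]): i for i in range(256)}
    codes = []
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate          # keep extending the known phrase
        else:
            codes.append(table[current])
            table[candidate] = len(table)  # learn the new phrase
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


def lzw_decode(codes):
    """Rebuild the same table on the fly; no dictionary is transmitted."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # Code refers to the phrase that is being defined right now.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[len(table)] = previous + entry[:1]
        previous = entry
    return b"".join(out)


sample = b"abababab" * 64
codes = lzw_encode(sample)
assert lzw_decode(codes) == sample

# DEFLATE (LZ77 matching plus Huffman coding) comes from a separate code path.
deflated = zlib.compress(sample, 9)
assert zlib.decompress(deflated) == sample
print(len(sample), len(codes), len(deflated))
```

The LZW side is nothing more than a growing phrase table; none of it carries over to the sliding-window matching and Huffman coding that a DEFLATE implementation needs.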

Most other compression algorithms (bzip2, zstandard etc) in use are in a similar boat, where they are all vastly different from one another. The only exceptional case where a monolithic tool would save work is for xz, lzip and 7z, all of which use a variant of LZMA.

So in cases where an implementation wants to support both algorithms, the current state of compress(1) is not "less demanding" and saves them no work. Not to mention, any implementation that wants to be useful in practice will need to ship gzip(1) anyways. So in practice, it's *more* work, not less.

(This is all ignoring the glaring fact that gzip has multiple implementations and is already shipped nearly everywhere. So in practice, standardizing gzip(1) would've required *no work* - both for developers and users/script-authors).

> as if POSIX mandates the exclusion of individual compression utilities, which is not true

It doesn't exclude it but it doesn't standardize it either, despite it being the de facto standard. And it's hostile towards developers who want to provide standalone tools. Currently if I want to provide a posix-compliant compress(1) tool, I only need to worry about LZW. Or if gzip(1) was standardized, and I wanted to provide a posix-compliant gzip(1), I'd only need to worry about DEFLATE. But in the current draft, implementing a posix-compliant compress(1) requires me to implement both LZW and DEFLATE.

> This is not a good place to advertise our toy programs.

I'm not sure why you think pigz(1) is a toy program. It's written and maintained by Mark Adler (one of the two original authors of the gzip format), is available in repos of many distros and some distros also offer using pigz(1) as a drop-in replacement for gzip(1) (such as gentoo via app-alternatives/gzip). This is the type of customizability and control the user would lose if instead everything was crammed into a single binary.

---

But in any case, putting all other arguments aside, what is unarguable is that standards that are not well adopted are next to useless. And by going for a monolithic compress(1) tool, POSIX is heading in a direction that is opposite of what the existing practice is.

And in doing so, it's asking people (both system developers and users/script-authors) to put in work in order to be compliant with the spec rather than making the spec compliant with existing practices; which is not a good direction to be heading in, if wide adoption is to be desired.
(0006762)
oguzismailuysal (reporter)
2024-04-17 11:03

Re: Note: 0006761

> Both the gzip file-format and the underlying algorithm (DEFLATE) are *entirely different* to LZW.
>
> Most other compression algorithms (bzip2, zstandard etc) in use are in similar boat where they are all vastly different to one another.

Doesn't matter as long as they can be used for compressing arbitrary data. They all take a parameter specifying the compression level which can be relayed through the option -b if compress were to be implemented as a wrapper for standalone utilities. POSIX defines the interface, how it is implemented is outside the standard's scope.
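
A minimal Python sketch of that wrapper idea, under stated assumptions: the algorithm names, the helper `build_command`, and the mapping table are hypothetical, and the standalone tools are assumed to accept the widely implemented `-c` (write to standard output) and `-1` to `-9` level options. It only builds the argument vector, so it shows the relaying of `-b` without requiring any of the tools to be installed.

```python
# Hypothetical mapping from a `-m algo` name to a standalone utility.
BACKENDS = {
    "gzip": "gzip",
    "bzip2": "bzip2",
    "xz": "xz",
}


def build_command(algo, level, path):
    """Translate `compress -m algo -b level -c path` into a backend argv."""
    if algo not in BACKENDS:
        raise ValueError("unsupported algorithm: " + algo)
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    # "--" ends option parsing so a file name starting with "-" stays a file.
    return [BACKENDS[algo], "-%d" % level, "-c", "--", path]


print(build_command("gzip", 6, "notes.txt"))
```

A real wrapper would hand this list to `subprocess.run`. Anything beyond a single numeric level would need further options on the wrapper.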

> So in cases where an implementation wants to support both algorithms, the current state of compress(1) is not "less demanding" and saves them no work.

It does, though. The interface is already there, implementing only the algorithm is less work than implementing both the algorithm and a separate interface. Documenting a single option argument is also less work than documenting a whole new utility.

> Not to mention, any implementation that wants to be useful in practice will need to ship gzip(1) anyways.

That's an opinion.

> Currently if I want to provide a posix-compliant compress(1) tool, I only need to worry about LZW. Or if gzip(1) was standardized, and I wanted to provide posix-compliant gzip(1) I'd only need to worry about DEFLATE. But in the current draft, implementating a posix-compliant compress(1) requires me to both implement LZW and DEFLATE.

I don't see a problem here. You can choose to comply with an older version of POSIX or not comply with POSIX at all.

You keep mentioning "distros", I assume you mean Linux distributions. No Linux distribution is certified to be POSIX-conformant and I doubt any of them try to be so. So, what "distros" do is irrelevant here.
(0006763)
steffen (reporter)
2024-04-17 18:01

Sorry that i am here again, but #0006760

> > But let's imagine that a single monolithic tool becomes standard.
>
> It already has.

In fact there are many, and they are installed like this, and maybe even used.
The zstd(1) tool can now be compiled "to do many" (.gz, .xz, .lzma, maybe more).
(You do not want my opinion on that. It may be one reason why it is so much larger than plzip or gzip (almost a factor of thirteen), even though the library is also so much larger (almost a factor of ten). And then i wonder why a plaintext file with holes compresses much worse than it does with plzip by default; base64-encoded you can see the repetition with your own eyes. It is still 350+ percent better than gzip, however.)

And then there are interfaces where this makes sense (imho). libarchive can do this, even though its tar has a broken way to deal with its $TAR_WRITER_OPTIONS: for example, with =zstd:threads=${JOBS},zstd:compression-level=19 *OR* =xz:threads=${JOBS} you had better specify the module it will be, or you are out. And soon the zstd module will also be able to dig threads=0, yay. (Though wrongly, due to some namespace related pitfalls on Linux that no one wants to fix.)
(And note it is not pax.)

But i see i am off-topic again.
One thing is plain, writing such a tool is non-trivial, for example the FreeBSD manual of compress says:

     The program does not handle links well and has no link-handling options.

     Some of these might be considered otherwise-undocumented features.

and then it goes.
Talking about "toys".

Maybe effort should instead be put into something new that is carefully designed like tar -p -a aka libarchive-tar -p -a, and has an --algorithm/-Y (i think only -Y is left). Or tar could have its --format/-H (-H not in libarchive-tar) extended to also accept "plain", meaning no archive at all, but only compression.
(Note it is not pax -p -x FORMAT. *No one* i know uses pax(1) or compress(1), if they are available at all.)

Now silent.

Issue History
Date Modified Username Field Change
2024-04-15 03:23 nrk New Issue
2024-04-15 03:23 nrk Name => NRK
2024-04-15 03:23 nrk Section => compress(1) utility
2024-04-15 03:23 nrk Page Number => n/a
2024-04-15 03:23 nrk Line Number => n/a
2024-04-15 03:24 nrk Issue Monitored: nrk
2024-04-16 05:57 dannyniu Note Added: 0006752
2024-04-16 05:58 dannyniu Note Edited: 0006752
2024-04-16 18:39 steffen Note Added: 0006753
2024-04-16 22:51 nrk Note Added: 0006754
2024-04-16 23:03 nrk Note Added: 0006755
2024-04-16 23:32 oguzismailuysal Note Added: 0006756
2024-04-17 00:10 steffen Note Added: 0006757
2024-04-17 00:32 nrk Note Added: 0006758
2024-04-17 01:36 larryv Note Added: 0006759
2024-04-17 06:16 oguzismailuysal Note Added: 0006760
2024-04-17 06:18 oguzismailuysal Note Edited: 0006760
2024-04-17 06:19 oguzismailuysal Note Edited: 0006760
2024-04-17 07:47 nrk Note Added: 0006761
2024-04-17 11:03 oguzismailuysal Note Added: 0006762
2024-04-17 18:01 steffen Note Added: 0006763

