DFS <nospam@dfs.com> writes:
On 3/21/2026 2:07 AM, Tim Rentsch wrote:
DFS <nospam@dfs.com> writes:
On 3/12/2026 2:24 AM, Bonita Montero wrote:
There was a programming-"contest" on comp.lang.c and I wanted to show the simpler code in C, here it is:
[C++ code]
C++ really rocks since you have to deal with much less detail than in C.
Is your strategy to just ignore reality, and keep making bogus claims that - for this challenge at least - you can't support?
Please don't encourage people who insist on using comp.lang.c
for other languages. It's better to just ignore them.
It doesn't bother me.
I'm sorry you place so little value on my participation.
It's no more off-topic than a discussion of sorting algorithms.
That's wrong. There have been lots of posts in the discussion of
sorting that are relevant to C. The problem with encouraging
discussion of C++, or Python, or other such languages, is they
tend to produce lots of postings that have nothing to do with C,
and rarely produce postings that have anything to do with C.
Not true at all [1]. But clc is a "No Kings" newsgroup, and nobody
can dictate who or what kind of posts are made. If Montero or I annoy
you, killfile us.
On 21/03/2026 02:35, DFS wrote:
On 3/20/2026 9:53 PM, Janis Papanagnou wrote:
On 2026-03-20 22:10, DFS wrote:
Sometimes listening to you clc guys is like listening to a room
full of CS professors (which I suspect some of you are or were).
CS professors sitting in their ivory tower and without practical
experiences, and programmers without a substantial CS background;
both of these extreme characters can be problematic (if fanatic).
It was meant as a compliment. Plenty of CS professors have good
practical and industry experience, too, which you'll see on their
bios and cv's.
It's got to be tough to find good CS teachers that stick around,
given they can probably make much more money in private industry.
Many folks here seem to have a good mix of the necessary practical
IT, project, and CS knowledge and proficiencies, as I'd rate it.
And a clear mind for discussing topics. Sometimes it gets heated and
personal, though; that's certainly nothing to foster.
I've definitely seen Bart take some heat for continuing to post code
in his scripting language, rather than C
People post bits of code in all sorts of languages when it suits
them. It's happened in this thread (Bash, AWK, Python as well as C++),
and in many others.
But I get a lot of heat because I use my personal languages,
Heapsort is only marginally more complex than bubble sort [...]
On average heapsort is not as fast as quicksort (the main factor
seems to be much worse cache behaviour),
-----------------------
void isort(char** data, int ll, int rr) {
char* temp;
int i = ll, j = rr;
char* pivot = data[(ll + rr) / 2];
do {
while (strcmp(pivot, data[i]) > 0 && i < rr) ++i;
while (strcmp(pivot, data[j]) < 0 && j > ll) --j;
if (i <= j) {
temp = data[i]; data[i] = data[j]; data[j] = temp;
++i;
--j;
}
} while (i <= j);
if (ll < j) isort(data, ll, j);
if (i < rr) isort(data, i, rr);
}
//For a char* array A of N elements, call as 'isort(A, 0, N-1)'.
Looks o.k. speed wise.
It is not idiomatic C,
On Fri, 20 Mar 2026 14:47:18 +0100
David Brown <david.brown@hesbynett.no> wrote:
The cost of doing a comparison does not affect the complexity of
an algorithm - but it can certainly affect the best choice of
algorithm for the task in hand.
In the case of a lexicographic sort, the comparison most certainly
affects BigO. BigO is about counting elementary operations or, at
least, about counting non-elementary operations whose complexity is
independent of N. In our specific case (described in the previous
paragraph) the average number of elementary comparisons within
strcmp() does depend on N. One can expect that it grows with N,
because for larger N quicksort spends relatively more time
comparing strings with longer and longer common prefixes.
I did a couple of experiments.
Here is the number of character comparisons during sorting by
quicksort with a strcmp-like comparison, as a function of the
number of sorted strings. Inputs used for the experiments were the
first N lines of two long books (those that I suggested to Bart in
another post in this thread).
Book 1:
1000 40367
2000 93404
3000 156995
4000 238062
5000 322275
6000 401332
7000 483294
8000 588128
9000 707559
10000 811159
11000 969934
12000 1061037
13000 1172776
14000 1327489
15000 1456356
16000 1565355
17000 1691351
18000 1811836
19000 1971700
20000 2101477
21000 2237132
22000 2348410
23000 2492149
24000 2618858
25000 2726828
26000 2881609
27000 3017903
28000 3183088
29000 3320716
30000 3523695
31000 3719782
32000 3852290
33000 4007476
34000 4149704
35000 4228143
36000 4399583
37000 4581223
38000 4761085
39000 4919182
40000 5110223
41000 5463271
42000 5605860
43000 5830282
44000 6096857
45000 6464758
46000 6860786
47000 7101972
48000 7434769
49000 7736464
50000 8028681
51000 8334358
52000 8560822
53000 8814096
54000 8986343
55000 9162604
56000 9358948
57000 9468853
58000 9638512
59000 9863370
60000 10000734
61000 10242187
62000 10500148
63000 10654166
64000 10882232
65000 11187624
66000 11290778
67000 11481047
68000 11664433
69000 11858529
70000 12059970
71000 12173240
72000 12232160
73000 12418818
74000 12577954
75000 12786122
76000 12939916
77000 13083515
78000 13275432
79000 13616563
80000 13797048
81000 14137362
82000 14388124
83000 14667272
84000 14914600
85000 15174195
86000 15227048
87000 15414838
88000 15526278
89000 15670034
90000 15966396
91000 16120859
92000 16267285
93000 16402168
94000 16718506
95000 16968960
96000 17169618
97000 17362159
98000 17486179
99000 17944209
100000 18085095
101000 18455071
102000 18733130
102100 18769521
Book 2:
1000 104846
2000 243080
3000 399564
4000 661436
5000 855870
6000 1017042
7000 1205936
8000 1341884
9000 1511727
10000 1681973
11000 1820114
12000 1978196
13000 2081813
14000 2162091
15000 2280537
16000 2375272
17000 2500672
18000 2617364
19000 2787782
20000 2960678
21000 3149483
22000 3286894
23000 3481047
24000 3638352
25000 3842186
26000 4001105
27000 4083466
28000 4239273
29000 4399250
30000 4505316
30384 4598719
It is easy to see that at the beginning the # of comparisons grows
faster than N*log(N). For the 1st book the trend continues
throughout the whole chart. For the 2nd book it does not.
I'd guess that the reason behind this is that the 1st book is
relatively homogeneous, while the 2nd book consists of rather
distinct parts, likely with different characteristics at the
beginnings of lines. In parts the 2nd book is full of periods, of
rhythmical prose close to poetry, and in other parts it is not. I
didn't have time to dig deeper.
Michael S <already5chosen@yahoo.com> writes:
[discussion of sorting code]
[isort code quoted above]
Looks o.k. speed wise.
It is not idiomatic C,
What about the code prompts you to say it is not idiomatic C?
Is it because there is a line with multiple statements on it,
or is there something else?
Personally I would not call this code non-idiomatic, apart from
having three statements on one line, a minor style exception.
Also I think Bart should be applauded for posting code that
AFAICT falls completely within the confines of ISO C, and
even looks fairly natural, style-wise.
Michael S <already5chosen@yahoo.com> writes:
On Fri, 20 Mar 2026 14:47:18 +0100
David Brown <david.brown@hesbynett.no> wrote:
The cost of doing a comparison does not affect the complexity of
an algorithm - but it can certainly affect the best choice of
algorithm for the task in hand.
In the case of a lexicographic sort, the comparison most certainly
affects BigO. BigO is about counting elementary operations or, at
least, about counting non-elementary operations whose complexity is
independent of N. In our specific case (described in the previous
paragraph) the average number of elementary comparisons within
strcmp() does depend on N. One can expect that it grows with N,
because for larger N quicksort spends relatively more time
comparing strings with longer and longer common prefixes.
This reasoning isn't right. The reason it isn't is that, even as
more words are sorted, the words don't get longer. Average match
length doesn't matter. All words in the two files mentioned are, as
best I can determine, not longer than 40 characters. Because the
length is bounded, so is the time needed to do a strcmp() call. In
other words the comparison operation is O(1). Using strcmp() can
change the constant factor, but it doesn't change BigO.
I did a couple of experiments.
Here is the number of character comparisons during sorting by
quicksort with a strcmp-like comparison, as a function of the
number of sorted strings. Inputs used for the experiments were the
first N lines of two long books (those that I suggested to Bart in
another post in this thread).
[comparison-count tables for Book 1 and Book 2, quoted in full above, elided]
It is easy to see that at the beginning the # of comparisons grows
faster than N*log(N). For the 1st book the trend continues
throughout the whole chart. For the 2nd book it does not.
Obviously what words (or lines) appear can affect the character
counts, but that still doesn't change BigO. By the way you don't
say whether you are sorting words or lines.
That choice has an
effect, but as long as line lengths are bounded the strcmp() calls
are still O(1).
I'd guess that the reason behind this is that the 1st book is
relatively homogeneous, while the 2nd book consists of rather
distinct parts, likely with different characteristics at the
beginnings of lines. In parts the 2nd book is full of periods, of
rhythmical prose close to poetry, and in other parts it is not. I
didn't have time to dig deeper.
A more careful experiment is needed.
On Mon, 06 Apr 2026 13:48:04 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Michael S <already5chosen@yahoo.com> writes:
[discussion of sorting code]
[isort code quoted above]
Looks o.k. speed wise.
It is not idiomatic C,
What about the code prompts you to say it is not idiomatic C?
Is it because there is a line with multiple statements on it,
or is there something else?
Something else.
Here is a mechanical translation of Bart's code to what I consider
more idiomatic C. Note that the translation was mechanical; I didn't
check whether the original logic is 100% correct or whether it is
possible to omit some checks.
void isort(char** data, int len) {
if (len > 1) {
char **i = data, **j = &data[len-1];
char* pivot = data[(len-1) / 2];
do {
while (strcmp(pivot, *i) > 0 && i < &data[len-1]) ++i;
while (strcmp(pivot, *j) < 0 && j > data) --j;
if (i <= j) {
char* temp = *i; *i = *j; *j = temp;
++i;
--j;
}
} while (i <= j);
if (j > data) isort(data, j+1-data);
if (i < &data[len-1]) isort(i, &data[len]-i);
}
}
Whether i and j are pointers or integers is less important.
But passing two indices into the call instead of one length, and the
fact that the second index points at the last element of the array
instead of one place past the last element, is certainly non-idiomatic.
On Thu, 19 Mar 2026 14:49:16 +0000
Bart <bc@freeuk.com> wrote:
On 19/03/2026 09:19, Michael S wrote:
On Wed, 18 Mar 2026 11:20:03 +0100
David Brown <david.brown@hesbynett.no> wrote:
Normally however (and again in scripting code) I'd use my built-in
sort() based on quicksort, which is nearly 1000 times faster than
bubble-sort for my test (sort 20K random strings), and some 300x
faster than your routines. It's not O(n-squared) either.
For a lexicographic sort of 20K random strings, plain quicksort is
probably quite sub-optimal.
If performance is important, I'd consider a combined method: first
pre-sort by a 3-char or 4-char prefix with radix sort (the 'least
significant character first' variation of the algorithm), then use
quicksort to sort sections with the same prefix. For strings taken
from the real world it will not work as well as for artificial
random strings, but it should still significantly outperform plain
quicksort.
On Mon, 06 Apr 2026 15:13:32 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Obviously what words (or lines) appear can affect the character
counts, but that still doesn't change BigO. By the way you don't
say whether you are sorting words or lines.
This sub-thread is about sorting lines with an average length of a
few dozen characters, i.e. many times longer than log2(N). That was
stated in one of the earlier posts.
Michael S <already5chosen@yahoo.com> writes:
On Mon, 06 Apr 2026 15:13:32 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Obviously what words (or lines) appear can affect the character
counts, but that still doesn't change BigO. By the way you don't
say whether you are sorting words or lines.
This sub-thread is about sorting lines with average length of few
dozens characters, i.e. many times longer than log2(N). That was
stated at one of earlier posts.
That has nothing to do with BigO, which is about asymptotic behavior
as N goes to infinity.
On Wed, 18 Mar 2026 11:20:03 +0100 [...]
David Brown <david.brown@hesbynett.no> wrote:
I don't think pure bubblesort has much serious use outside
educational purposes, but it is easy to understand, easy to
implement correctly in pretty much any language, and can do a
perfectly good job if you don't need efficient sorting of large
datasets (for small enough datasets, a hand-written bubblesort
in C will be faster than calling qsort).
IMHO, Bubble Sort is harder to understand than Straight Selection
Sort.
My conclusions...
The right place to start on sorting algorithms is with bubble
sort, not because it's a good algorithm, but because it's the
easiest one to show, and what is more important it's a good way
to teach about how algorithms evolve.
There is no reason to use the Knuth version of bubble sort.
Compared to ordinary bubble sort the upside is small and not
worth the downside of needing more intricate code.
If there is a good chance that input starts off being close to
sorted, ping pong sort can be considered.
In almost all cases prefer binary insertion sort to plain
insertion sort, not because the average case is better but
because the worst case is better. Moreover the extra code needed
is small and easily understood.
Writing a full-scale sorting algorithm, such as a replacement for
qsort, is a non-trivial undertaking.
Other considerations apply if the sort needs to be stable, but
surely this posting is long enough already. :)
bubble 75638850 37391291 100.000
knuth bubble 75611850 37391291 99.964
gnome 74794859 37391291 98.884
ping pong 49983149 37391291 66.081
selection 75638850 12289 100.000
insertion 37403579 37391291 49.450
titanic 75638850 37391291 100.000
seven up 37403579 37391291 49.450
heapsort 309459 168889 0.409
quicksort 192033 117831 0.254
binary insertion 150102 37391291 0.198
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 06 Apr 2026 15:13:32 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Obviously what words (or lines) appear can affect the character
counts, but that still doesn't change BigO. By the way you don't
say whether you are sorting words or lines.
This sub-thread is about sorting lines with average length of few
dozens characters, i.e. many times longer than log2(N). That was
stated at one of earlier posts.
That has nothing to do with BigO, which is about asymptotic
behavior as N goes to infinity.
An honest Big(O) analysis varies the length of the key with N. In
the practical range the key length may be constant, but fixing the
length gives an unrealistic problem for Big(O) analysis: without
varying key length there are finitely many keys, and sorting is
equivalent to counting how many times each key appears in the input.
On Mon, 06 Apr 2026 21:00:46 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 06 Apr 2026 15:13:32 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Obviously what words (or lines) appear can affect the character
counts, but that still doesn't change BigO. By the way you don't
say whether you are sorting words or lines.
This sub-thread is about sorting lines with average length of few
dozens characters, i.e. many times longer than log2(N). That was
stated at one of earlier posts.
That has nothing to do with BigO, which is about asymptotic behavior
as N goes to infinity.
So, what do you call the part of the curve from 1e2 to 3e5 lines? MiddleO?
Michael S <already5chosen@yahoo.com> writes:
On Mon, 06 Apr 2026 21:00:46 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 06 Apr 2026 15:13:32 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Obviously what words (or lines) appear can affect the character
counts, but that still doesn't change BigO. By the way you don't
say whether you are sorting words or lines.
This sub-thread is about sorting lines with average length of few
dozens characters, i.e. many times longer than log2(N). That was
stated at one of earlier posts.
That has nothing to do with BigO, which is about asymptotic behavior
as N goes to infinity.
So, what do you call the part of the curve from 1e2 to 3e5 lines? MiddleO?
Consider a thought experiment. We have a qsort-like sorting function;
for the sake of discussion let's say it uses heapsort. It is widely
understood that heapsort is an O(N*log(N)) algorithm (not counting
the theoretical objections mentioned in Waldek Hebisch's post and my
response to those comments).
Now suppose we have two distinct invocations of said function. In
both cases the arguments are length 1000 character strings, having
only spaces and alphabetic characters, and no duplicates between the
strings. In one call all the strings have distinct values in the
first six characters, and in the other call the strings are all the
same in the first 993 characters, differing only in the last six
characters. The comparison function handed to the sorting function
uses strcmp() to do its comparisons.
The heapsort algorithm is well-known to be N log N. All the strings
in both sorts are exactly 1000 characters, so the strcmp() calls have
a fixed maximum time cost. And yet one of these sorts runs almost 200
times as fast as the other.
I think what you're seeing is a data dependency that is orthogonal to
the time complexity of the algorithm. For example, the travelling
salesman problem is widely known to be NP-complete, and yet on some
inputs a solution can be found very quickly. BigO is about worst-case
behavior, or sometimes average-case behavior. Other kinds of behavior
can depend dramatically on the nature of the inputs.
antispam@fricas.org (Waldek Hebisch) writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 06 Apr 2026 15:13:32 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Obviously what words (or lines) appear can affect the character
counts, but that still doesn't change BigO. By the way you don't
say whether you are sorting words or lines.
This sub-thread is about sorting lines with average length of few
dozens characters, i.e. many times longer than log2(N). That was
stated at one of earlier posts.
That has nothing to do with BigO, which is about asymptotic
behavior as N goes to infinity.
Honest Big(O) varies length of the key with N. In practical range
key length may be constant, but fixing length gives unrealistic
problem for Big(O) analysis: without varying key length there are
finitely many keys and sorting is equivalent to counting how many
times each key appears in the input.
There's an important clarification to make here. There are two
independent parameters: N, the number of records to be sorted (a
record being a character string that is either a word or a line),
and the (maximum) length of any record, which in the discussion is
bounded above by a constant.
What is being asked about is the behavior as a function of N as N
increases without bound. Of course, theoretically, as the number of
records increases without bound, eventually the character strings
being sorted will have to have duplicates. But long before that
happens the index variable N will run out of bits. This property is
well understood in theoretical computer science, not just in terms
of how much time is used but how much storage is needed. In theory
log N bits are needed just to hold the index pointers. It is
customary though, outside of purely theoretical discussions, to
ignore that and treat the size of an index or pointer variable as
constant. In purely theoretical terms no sorting algorithm is
O(N*log(N)), because just incrementing a pointer takes more than
O(1) operations. Surely the discussions in Knuth's books take such
things into consideration.
On the practical side, which almost
always covers discussions that take place in usenet newsgroups,
these minor theoretical issues are ignored. Any actual computer in
the physical universe will never have occasion to process more than
2**512 records, due to the limitation of the number of elementary
particles in the universe, so a 512-bit address (or index value)
always suffices.
So yes, in theory, the considerations around processing an enormous
number of values are relevant. In the practical context of the
discussion underway here, they aren't.
In article <863416xid5.fsf@linuxsc.com>,
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
antispam@fricas.org (Waldek Hebisch) writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 06 Apr 2026 15:13:32 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Obviously what words (or lines) appear can affect the character
counts, but that still doesn't change BigO. By the way you
don't say whether you are sorting words or lines.
This sub-thread is about sorting lines with average length of few
dozens characters, i.e. many times longer than log2(N). That was
stated at one of earlier posts.
That has nothing to do with BigO, which is about asymptotic
behavior as N goes to infinity.
Honest Big(O) varies length of the key with N. In practical range
key length may be constant, but fixing length gives unrealistic
problem for Big(O) analysis: without varying key length there are
finitely many keys and sorting is equivalent to counting how many
times each key appears in the input.
There's an important clarification to make here. There are two
independent parameters: N, the number of records to be sorted (a
record being a character string that is either a word or a line),
and the (maximum) length of any record, which in the discussion is
bounded above by a constant.
What is being asked about is the behavior as a function of N as N
increases without bound. Of course, theoretically, as the number of
records increases without bound, eventually the character strings
being sorted will have to have duplicates. But long before that
happens the index variable N will run out of bits. This property is
well understood in theoretical computer science, not just in terms
of how much time is used but how much storage is needed. In theory
log N bits are needed just to hold the index pointers. It is
customary though, outside of purely theoretical discussions, to
ignore that and treat the size of an index or pointer variable as
constant. In purely theoretical terms no sorting algorithm is
O(N*log(N)), because just incrementing a pointer takes more than
O(1) operations. Surely the discussions in Knuth's books take such
things into consideration.
If by "Knuth's books" you're referring to TAOCP, then he does
not seem to give it too much attention. At the start of Volume
3, he spends considerable time discussing the combinatorics of
permutations as prefatory to discussing sorting, but if he talks
about the complexity of key comparison at all, it's not an
extended treatment. He refers to a "<" operation for comparing
keys, and he does talk about "multiprecision keys" and
lexicographic orderings, but doesn't spend a lot of ink talking
about how it might be implemented; one of the exercises
acknowledges that this can have a significant impact on
performance, but doesn't go into much detail. Another exercise
challenges the reader to write a MIX program for comparing keys,
but again, doesn't give details about complexity analysis _of
comparison_. There is another exercise that talks about this
tangentially, in which he suggests that sorting (it's implied, of
text data) slows down as the sort nears completion, since
more and more of the key value is now the same, and suggests
keeping track of the length of the common prefix to use as an
offset for subsequent comparisons.
The notion that the actual time required for individual operations
should be taken into account when analyzing time complexity does
appear in other books; Dasgupta, Papadimitriou, and Vazirani
(http://algorithmics.lsi.upc.edu/docs/Dasgupta-Papadimitriou-Vazirani.pdf)
talk about this in the context of computing Fibonacci numbers,
for example.
On the practical side, which almost
always covers discussions that take place in usenet newsgroups,
these minor theoretical issues are ignored. Any actual computer in
the physical universe will never have occasion to process more than
2**512 records, due to the limitation of the number of elementary
particles in the universe, so a 512-bit address (or index value)
always suffices.
So yes, in theory, the considerations around processing an enormous
number of values are relevant. In the practical context of the
discussion underway here, they aren't.
Indeed. As Rob Pike once put it, "Fancy algorithms are slow
when $n$ is small, and $n$ is usually small. Fancy algorithms
have big constants. Until you know that $n$ is frequently going
to get big, don't get fancy."
- Dan C.
antispam@fricas.org (Waldek Hebisch) writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 06 Apr 2026 15:13:32 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Obviously what words (or lines) appear can affect the character
counts, but that still doesn't change BigO. By the way you don't
say whether you are sorting words or lines.
This sub-thread is about sorting lines with average length of few
dozens characters, i.e. many times longer than log2(N). That was
stated at one of earlier posts.
That has nothing to do with BigO, which is about asymptotic
behavior as N goes to infinity.
Honest Big(O) varies length of the key with N. In practical range
key length may be constant, but fixing length gives unrealistic
problem for Big(O) analysis: without varying key length there are
finitely many keys and sorting is equivalent to counting how many
times each key appears in the input.
There's an important clarification to make here. There are two
independent parameters: N, the number of records to be sorted (a
record being a character string that is either a word or a line),
and the (maximum) length of any record, which in the discussion is
bounded above by a constant.
What is being asked about is the behavior as a function of N as N
increases without bound. Of course, theoretically, as the number of
records increases without bound, eventually the character strings
being sorted will have to have duplicates. But long before that
happens the index variable N will run out of bits. This property is
well understood in theoretical computer science, not just in terms
of how much time is used but how much storage is needed. In theory
log N bits are needed just to hold the index pointers. It is
customary though, outside of purely theoretical discussions, to
ignore that and treat the size of an index or pointer variable as
constant. In purely theoretical terms no sorting algorithm is
O(N*log(N)), because just incrementing a pointer takes more than
O(1) operations. Surely the discussions in Knuth's books take such
things into consideration. On the practical side, which almost
always covers discussions that take place in usenet newsgroups,
these minor theoretical issues are ignored. Any actual computer in
the physical universe will never have occasion to process more than
2**512 records, due to the limitation of the number of elementary
particles in the universe, so a 512-bit address (or index value)
always suffices.
So yes, in theory, the considerations around processing an enormous
number of values are relevant. In the practical context of the
discussion underway here, they aren't.
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
antispam@fricas.org (Waldek Hebisch) writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 06 Apr 2026 15:13:32 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Obviously what words (or lines) appear can affect the character
counts, but that still doesn't change BigO. By the way you don't
say whether you are sorting words or lines.
This sub-thread is about sorting lines with an average length of a
few dozen characters, i.e. many times longer than log2(N). That was
stated in one of the earlier posts.
That has nothing to do with BigO, which is about asymptotic
behavior as N goes to infinity.
Honest Big-O varies the length of the key with N. In a practical
range the key length may be constant, but fixing the length gives an
unrealistic problem for Big-O analysis: without varying the key
length there are only finitely many keys, and sorting is equivalent
to counting how many times each key appears in the input.
There's an important clarification to make here. There are two
independent parameters: N, the number of records to be sorted (a
record being a character string that is either a word or a line),
and the (maximum) length of any record, which in the discussion is
bounded above by a constant.
There is one choice, often made to simplify the problem. But there
is a quite universal choice: N is the total size of the input data.
In article <86y0iyvypl.fsf@linuxsc.com>,
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 06 Apr 2026 21:00:46 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 06 Apr 2026 15:13:32 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Obviously what words (or lines) appear can affect the character
counts, but that still doesn't change BigO. By the way you don't
say whether you are sorting words or lines.
This sub-thread is about sorting lines with an average length of a
few dozen characters, i.e. many times longer than log2(N). That was
stated in one of the earlier posts.
That has nothing to do with BigO, which is about asymptotic behavior
as N goes to infinity.
So, what do you call the part of the curve from 1e2 to 3e5 lines? MiddleO?
Consider a thought experiment. We have a qsort-like sorting function,
for the sake of discussion let's say it uses heapsort. It is widely
understood that heapsort is an O(N*log(N)) algorithm (not counting the
theoretical objections mentioned in Waldek Hebisch's post and my
response to those comments).
Now suppose we have two distinct invocations of said function. In
both cases the arguments are length 1000 character strings, having
only spaces and alphabetic characters, and no duplicates between the
strings. In one call all the strings have distinct values in the
first six characters, and in the other call the strings are all the
same in the first 993 characters, differing only in the last six
characters. The call to the sorting function points back to a
function that uses strcmp() to do its comparisons.
The heapsort algorithm is well-known to be N log N. All the strings
in both sorts are exactly 1000 characters, so the strcmp() calls have
a fixed maximum time cost. And yet one of these sorts runs almost 200
times as fast as the other.
I think what you're seeing is a data dependency that is orthogonal to
the time complexity of the algorithm. For example, the travelling
salesman problem is widely known to be NP complete, and yet on some
inputs a solution can be found very quickly.
A useful thought experiment, but it is perhaps easier to explain
that the asymptotic time complexity of a sorting algorithm, for
example heapsort being O(n log n), is expressed in terms of the
number of comparison operations required by the algorithm,
independent of the complexity of those comparisons.
BigO is about worst case behavior, or sometimes average case
behavior. Other kinds of behavior can depend dramatically on the
nature of the inputs.
I disagree with this. The "Big-O" of an algorithm isn't about
worst-case behavior, it's a notation for categorizing which term
dominates in an expression describing the complexity of an
algorithm, as the number of inputs to that algorithm grows.
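For reference, the formal definition that both framings appeal to can be written as follows; the disagreement is really over which cost function f(n) one plugs into it (worst-case time, average-case time, or a comparison count):

```latex
f(n) \in O(g(n))
\iff
\exists\, c > 0,\; n_0 \in \mathbb{N} :\quad
0 \le f(n) \le c \, g(n) \quad \text{for all } n \ge n_0
```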
cross@spitfire.i.gajendra.net (Dan Cross) writes:
In article <86y0iyvypl.fsf@linuxsc.com>,
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 06 Apr 2026 21:00:46 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 06 Apr 2026 15:13:32 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Obviously what words (or lines) appear can affect the character
counts, but that still doesn't change BigO. By the way you don't
say whether you are sorting words or lines.
This sub-thread is about sorting lines with an average length of a
few dozen characters, i.e. many times longer than log2(N). That was
stated in one of the earlier posts.
That has nothing to do with BigO, which is about asymptotic behavior
as N goes to infinity.
So, what do you call the part of the curve from 1e2 to 3e5 lines? MiddleO?
Consider a thought experiment. We have a qsort-like sorting function,
for the sake of discussion let's say it uses heapsort. It is widely
understood that heapsort is an O(N*log(N)) algorithm (not counting the
theoretical objections mentioned in Waldek Hebisch's post and my
response to those comments).
Now suppose we have two distinct invocations of said function. In
both cases the arguments are length 1000 character strings, having
only spaces and alphabetic characters, and no duplicates between the
strings. In one call all the strings have distinct values in the
first six characters, and in the other call the strings are all the
same in the first 993 characters, differing only in the last six
characters. The call to the sorting function points back to a
function that uses strcmp() to do its comparisons.
The heapsort algorithm is well-known to be N log N. All the strings
in both sorts are exactly 1000 characters, so the strcmp() calls have
a fixed maximum time cost. And yet one of these sorts runs almost 200
times as fast as the other.
I think what you're seeing is a data dependency that is orthogonal to
the time complexity of the algorithm. For example, the travelling
salesman problem is widely known to be NP complete, and yet on some
inputs a solution can be found very quickly.
A useful thought experiment, but it is perhaps easier to explain
that the asymptotic time complexity of a sorting algorithm, for
example heapsort being O(n log n), is expressed in terms of the
number of comparison operations required by the algorithm,
independent of the complexity of those comparisons.
No. This misses the essence of what I was trying to convey.
BigO is about worst case behavior, or sometimes average case
behavior. Other kinds of behavior can depend dramatically on the
nature of the inputs.
I disagree with this. The "Big-O" of an algorithm isn't about
worst-case behavior, it's a notation for categorizing which term
dominates in an expression describing the complexity of an
algorithm, as the number of inputs to that algorithm grows.
You misunderstood the point of my comment. Of course O()
notation is about expressing an asymptotic upper bound of
some function. The key issue here though is not about O()
but about what function should be the one under consideration.
In article <863416xid5.fsf@linuxsc.com>,
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
antispam@fricas.org (Waldek Hebisch) writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 06 Apr 2026 15:13:32 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Obviously what words (or lines) appear can affect the character
counts, but that still doesn't change BigO. By the way you don't
say whether you are sorting words or lines.
This sub-thread is about sorting lines with an average length of a
few dozen characters, i.e. many times longer than log2(N). That was
stated in one of the earlier posts.
That has nothing to do with BigO, which is about asymptotic
behavior as N goes to infinity.
Honest Big-O varies the length of the key with N. In a practical
range the key length may be constant, but fixing the length gives an
unrealistic problem for Big-O analysis: without varying the key
length there are only finitely many keys, and sorting is equivalent
to counting how many times each key appears in the input.
There's an important clarification to make here. There are two
independent parameters: N, the number of records to be sorted (a
record being a character string that is either a word or a line),
and the (maximum) length of any record, which in the discussion is
bounded above by a constant.
What is being asked about is the behavior as a function of N as N
increases without bound. Of course, theoretically, as the number of
records increases without bound, eventually the character strings
being sorted will have to have duplicates. But long before that
happens the index variable N will run out of bits. This property is
well understood in theoretical computer science, not just in terms
of how much time is used but how much storage is needed. In theory
log N bits are needed just to hold the index pointers. It is
customary though, outside of purely theoretical discussions, to
ignore that and treat the size of an index or pointer variable as
constant. In purely theoretical terms no sorting algorithm is
O(N*log(N)), because just incrementing a pointer takes more than
O(1) operations. Surely the discussions in Knuth's books take such
things into consideration.
If by "Knuth's books" you're referring to TAOCP, then he does
not seem to give it too much attention. [...]
On the practical side, which almost
always covers discussions that take place in usenet newsgroups,
these minor theoretical issues are ignored. Any actual computer in
the physical universe will never have occasion to process more than
2**512 records, due to the limitation of the number of elementary
particles in the universe, so a 512-bit address (or index value)
always suffices.
So yes, in theory, the considerations around processing an enormous
number of values are relevant. In the practical context of the
discussion underway here, they aren't.
Indeed. As Rob Pike once put it, "Fancy algorithms are slow
when $n$ is small, and $n$ is usually small. Fancy algorithms
have big constants. Until you know that $n$ is frequently going
to get big, don't get fancy."
cross@spitfire.i.gajendra.net (Dan Cross) writes:
In article <863416xid5.fsf@linuxsc.com>,
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
antispam@fricas.org (Waldek Hebisch) writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 06 Apr 2026 15:13:32 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Obviously what words (or lines) appear can affect the character
counts, but that still doesn't change BigO. By the way you don't
say whether you are sorting words or lines.
This sub-thread is about sorting lines with an average length of a
few dozen characters, i.e. many times longer than log2(N). That was
stated in one of the earlier posts.
That has nothing to do with BigO, which is about asymptotic
behavior as N goes to infinity.
Honest Big-O varies the length of the key with N. In a practical
range the key length may be constant, but fixing the length gives an
unrealistic problem for Big-O analysis: without varying the key
length there are only finitely many keys, and sorting is equivalent
to counting how many times each key appears in the input.
There's an important clarification to make here. There are two
independent parameters: N, the number of records to be sorted (a
record being a character string that is either a word or a line),
and the (maximum) length of any record, which in the discussion is
bounded above by a constant.
What is being asked about is the behavior as a function of N as N
increases without bound. Of course, theoretically, as the number of
records increases without bound, eventually the character strings
being sorted will have to have duplicates. But long before that
happens the index variable N will run out of bits. This property is
well understood in theoretical computer science, not just in terms
of how much time is used but how much storage is needed. In theory
log N bits are needed just to hold the index pointers. It is
customary though, outside of purely theoretical discussions, to
ignore that and treat the size of an index or pointer variable as
constant. In purely theoretical terms no sorting algorithm is
O(N*log(N)), because just incrementing a pointer takes more than
O(1) operations. Surely the discussions in Knuth's books take such
things into consideration.
If by "Knuth's books" you're referring to TAOCP, then he does
not seem to give it too much attention. [...]
In most of the chapter on Sorting, TAOCP uses the number of
comparisons as the basis of comparison. But not everywhere
in the chapter.
My statement was not meant to be limited to the discussion of
Sorting.
On the practical side, which almost
always covers discussions that take place in usenet newsgroups,
these minor theoretical issues are ignored. Any actual computer in
the physical universe will never have occasion to process more than
2**512 records, due to the limitation of the number of elementary
particles in the universe, so a 512-bit address (or index value)
always suffices.
So yes, in theory, the considerations around processing an enormous
number of values are relevant. In the practical context of the
discussion underway here, they aren't.
Indeed. As Rob Pike once put it, "Fancy algorithms are slow
when $n$ is small, and $n$ is usually small. Fancy algorithms
have big constants. Until you know that $n$ is frequently going
to get big, don't get fancy."
Whether the Rob Pike advisory is applicable or not is beside the
point.
My comment was about fancy mathematics, not fancy
algorithms. My statement is just as applicable to Tim Sort (one
of the fancier sorting algorithms) as it is to Bubble Sort.
On Thu, 9 Apr 2026 21:15:14 -0000 (UTC)
[...]
cross@spitfire.i.gajendra.net (Dan Cross) wrote:
Indeed. As Rob Pike once put it, "Fancy algorithms are slow
when $n$ is small, and $n$ is usually small. Fancy algorithms
have big constants. Until you know that $n$ is frequently going
to get big, don't get fancy."
One case where these considerations are not at all theoretical, and
where simple quicksort from the books performs very slowly precisely
because each lexicographic comparison takes more and more time as
sorting progresses, is the sort at the core of the Burrows-Wheeler
transform, which in turn is at the core of various compression
schemes, including bzip2. The problem hits you the worst when the
data set compresses well.
In the specific case of bzip2, they limited the block size to 900KB,
which is quite low, and did preprocessing on the input data, which
often seriously impacts the quality of compression. I can't say for
sure, but it seems to me that the reason was exactly that: avoiding
prohibitively slow sorting. Had they had the time and desire to use
"fancy algorithms", whether combinations of bucket and non-bucket
variants of radix sort, or a quicksort that memorizes common
prefixes, or even a combination of all three, then they would not
have needed preprocessing and small blocks, and would have ended up
with better compression ratios.
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
antispam@fricas.org (Waldek Hebisch) writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 06 Apr 2026 15:13:32 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Obviously what words (or lines) appear can affect the character
counts, but that still doesn't change BigO. By the way you don't
say whether you are sorting words or lines.
This sub-thread is about sorting lines with an average length of a
few dozen characters, i.e. many times longer than log2(N). That was
stated in one of the earlier posts.
That has nothing to do with BigO, which is about asymptotic
behavior as N goes to infinity.
Honest Big-O varies the length of the key with N. In a practical
range the key length may be constant, but fixing the length gives an
unrealistic problem for Big-O analysis: without varying the key
length there are only finitely many keys, and sorting is equivalent
to counting how many times each key appears in the input.
There's an important clarification to make here. There are two
independent parameters: N, the number of records to be sorted (a
record being a character string that is either a word or a line),
and the (maximum) length of any record, which in the discussion is
bounded above by a constant.
There is one choice, often made to simplify the problem. But there
is a quite universal choice: N is the total size of the input data.
What is being asked about is the behavior as a function of N as N
increases without bound. Of course, theoretically, as the number of
records increases without bound, eventually the character strings
being sorted will have to have duplicates. But long before that
happens the index variable N will run out of bits. This property is
well understood in theoretical computer science, not just in terms
of how much time is used but how much storage is needed. In theory
log N bits are needed just to hold the index pointers. It is
customary though, outside of purely theoretical discussions, to
ignore that and treat the size of an index or pointer variable as
constant. In purely theoretical terms no sorting algorithm is
O(N*log(N)), because just incrementing a pointer takes more than
O(1) operations. Surely the discussions in Knuth's books take such
things into consideration. On the practical side, which almost
always covers discussions that take place in usenet newsgroups,
these minor theoretical issues are ignored. Any actual computer in
the physical universe will never have occasion to process more than
2**512 records, due to the limitation of the number of elementary
particles in the universe, so a 512-bit address (or index value)
always suffices.
So yes, in theory, the considerations around processing an enormous
number of values are relevant. In the practical context of the
discussion underway here, they aren't.
It was you who insisted on asymptotic complexity, rejecting
practical amendments...
antispam@fricas.org (Waldek Hebisch) writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
antispam@fricas.org (Waldek Hebisch) writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 06 Apr 2026 15:13:32 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Obviously what words (or lines) appear can affect the character
counts, but that still doesn't change BigO. By the way you don't
say whether you are sorting words or lines.
This sub-thread is about sorting lines with an average length of a
few dozen characters, i.e. many times longer than log2(N). That was
stated in one of the earlier posts.
That has nothing to do with BigO, which is about asymptotic
behavior as N goes to infinity.
Honest Big-O varies the length of the key with N. In a practical
range the key length may be constant, but fixing the length gives an
unrealistic problem for Big-O analysis: without varying the key
length there are only finitely many keys, and sorting is equivalent
to counting how many times each key appears in the input.
There's an important clarification to make here. There are two
independent parameters: N, the number of records to be sorted (a
record being a character string that is either a word or a line),
and the (maximum) length of any record, which in the discussion is
bounded above by a constant.
There is one choice, often made to simplify the problem. But there
is a quite universal choice: N is the total size of the input data.
No, N is usually the number of records to be sorted, not the size of
the input. Also, in the discussion I was responding to, there were
clearly two distinct axes relevant to the discussion.
What is being asked about is the behavior as a function of N as N
increases without bound. Of course, theoretically, as the number of
records increases without bound, eventually the character strings
being sorted will have to have duplicates. But long before that
happens the index variable N will run out of bits. This property is
well understood in theoretical computer science, not just in terms
of how much time is used but how much storage is needed. In theory
log N bits are needed just to hold the index pointers. It is
customary though, outside of purely theoretical discussions, to
ignore that and treat the size of an index or pointer variable as
constant. In purely theoretical terms no sorting algorithm is
O(N*log(N)), because just incrementing a pointer takes more than
O(1) operations. Surely the discussions in Knuth's books take such
things into consideration. On the practical side, which almost
always covers discussions that take place in usenet newsgroups,
these minor theoretical issues are ignored. Any actual computer in
the physical universe will never have occasion to process more than
2**512 records, due to the limitation of the number of elementary
particles in the universe, so a 512-bit address (or index value)
always suffices.
So yes, in theory, the considerations around processing an enormous
number of values are relevant. In the practical context of the
discussion underway here, they aren't.
It was you who insisted on asymptotic complexity, rejecting
practical amendments...
This statement isn't right. BigO was already part of the discussion
when I joined in. Also, it is customary in discussion of sorting
algorithms to use the metric of number of comparisons done, without
regard to the size of the variables needed to hold the indices of
the records being sorted.
See Knuth chapter 5, on Sorting, in volume 3 of TAOCP.
On Tue, 07 Apr 2026 02:12:49 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
<snip>
My conclusions...
The right place to start on sorting algorithms is with bubble
sort, not because it's a good algorithm, but because it's the
easiest one to show, and what is more important it's a good way
to teach about how algorithms evolve.
If you write it in terms of moves rather than in terms of swaps,
bubble sort is no easier than insertion or selection sort.
There is no reason to use the knuth version of bubble sort.
Compared to ordinary bubble sort the upside is small and not
worth the downside of needing more intricate code.
The Knuth version has at least one merit. You claim that in practice
it's too rare for that merit to be exercised. Maybe that's true. Or
maybe the case of an almost-sorted array with misplaced elements
coming in pairs in close proximity is not that uncommon. I don't
know. But the merit exists.
OTOH, 'ordinary bubble sort' has no merits at all relative to simple
insertion sort.
BTW, thank you for 'boustrophedon'. Not that there is a serious
chance that I will remember the word tomorrow or even tonight, but
hopefully I'll remember that such a word exists.