Hi James, umm 'guarantees'? No no... It does NOT verify:
- whether the environment actually supports UTF8 fully
- whether multibyte functions are enabled
- whether the terminal supports UTF8
- whether the C library supports UTF8 normalization
(combining characters, etc. but it seems to work well here)
To be sure: It's not a UTF-8 capability test. It's only a
locale-string check. So it likely misses many valid UTF8
locale variants...
Here I'm running any mixture of: Windows/BSD/Linux Mint LMDE.
Windows has the ...W() APIs along with codepage-based APIs with
the ...A() suffix. The W() APIs support UTF-16, so no need for
We want portability across diverse OSs. In my case, the program
does NOT care what the character is, it simply needs to be able
to find it when searching data & displaying it in an ordered way.
The code below works perfectly:
#include <stdio.h>
#include <string.h>
int utf8_display_width(const char *s) {
int w = 0;
while (*s) {
unsigned char b = *s;
unsigned cp;
int n;
// UTF-8 decoder (minimal: assumes well-formed input; continuation
// bytes and premature end of string are not validated)
if (b <= 0x7F) { // 1-byte ASCII
cp = b;
n = 1;
} else if (b >= 0xC0 && b <= 0xDF) { // 2-byte
cp = ((b & 0x1F) << 6) |
(s[1] & 0x3F);
n = 2;
} else if (b >= 0xE0 && b <= 0xEF) { // 3-byte
cp = ((b & 0x0F) << 12) |
((s[1] & 0x3F) << 6) |
(s[2] & 0x3F);
n = 3;
} else if (b >= 0xF0 && b <= 0xF7) { // 4-byte
cp = ((b & 0x07) << 18) |
((s[1] & 0x3F) << 12) |
((s[2] & 0x3F) << 6) |
(s[3] & 0x3F);
n = 4;
} else { // invalid, treat as 1-byte
cp = b;
n = 1;
}
// display width
if (cp >= 0x0300 && cp <= 0x036F) {} // combining marks U+0300..U+036F (zero width)
else if ( // double-width characters...
(cp >= 0x1100 && cp <= 0x115F) || // hangul jamo
(cp >= 0x2E80 && cp <= 0xA4CF) || // cjk radicals & unified ideographs
(cp >= 0xAC00 && cp <= 0xD7A3) || // hangul syllables
(cp >= 0xF900 && cp <= 0xFAFF) || // cjk compatibility ideographs
(cp >= 0x1F300 && cp <= 0x1FAFF) // emoji + symbols
) { w += 2; }
// exceptional wide characters (unicode requirement I've read elsewhere)
else if (cp == 0x2329 || cp == 0x232A) { w += 2; }
else { w += 1; } // normal width for everything else
s += n;
}
return w;
}
int main(void) {
const char *tests[] = {
"hello",
"Café",
"漢字",
"✓",
"🙂",
NULL
};
// find maximum display width in 1st column
int maxw = 0;
for (int i = 0; tests[i]; i++) {
int w = utf8_display_width(tests[i]);
if (w > maxw) maxw = w;
}
// total padding after each 1st column + 3 spaces
int total_pad = maxw + 3;
for (int i = 0; tests[i]; i++) {
int w = utf8_display_width(tests[i]);
int sl = strlen(tests[i]);
printf("%s", tests[i]);
int pad = total_pad - w;
while (pad-- > 0) putchar(' ');
printf("strlen: %d utf8 display width: %d\n", sl, w);
}
return 0;
}
// eof
On 2025-12-03 13:33, Michael Sanders wrote:
...
We want portability across diverse OSs. In my case, the program
does NOT care what the character is, it simply needs to be able
to find it when searching data & displaying it in an ordered way.
The code below works perfectly:
[code snipped; identical to the listing earlier in the thread]
I find it confusing that this is supposed to "work perfectly" "across
diverse OSs". The amount of space that a character takes up varies
depending upon the installed fonts, especially on whether the font is
monospaced or proportional. Those fonts can be different for display on
screen or on a printer. I don't see any query to determine even what the
current font is, much less what its characteristics are. I don't know
of any OS-independent way of collecting such information. Does this
solution "work perfectly" only for your own particular favorite font?
This looks like a solution for a fixed-pitch font. I get this output
for a Windows console display (with - used for space):
hello---strlen: 5 utf8 display width: 5
Café----strlen: 5 utf8 display width: 4
漢字----strlen: 6 utf8 display width: 4
✓-------strlen: 3 utf8 display width: 1
🙂------strlen: 4 utf8 display width: 2
I was hoping this would be lined up, but already, in a Thunderbird
edit window, the last lines aren't lined up properly.
Same problem with Notepad (fixed pitch) and LibreOffice (fixed pitch).
It only looks alright in Windows and WSL consoles/terminals. But
maybe that's all that's needed.
bart <bc@freeuk.com> writes:
On 03/12/2025 19:01, James Kuyper wrote:[...]
[...]I find it confusing that this is supposed to "work perfectly"
"across
diverse OSs". The amount of space that a character takes up varies
depending upon the installed fonts, especially on whether the font is
monospaced or proportional. Those fonts can be different for display on
screen or on a printer. I don't see any query to determine even what the
current font is, much less what its characteristics are. I don't know
of any OS-independent way of collecting such information. Does this
solution "work perfectly" only for your own particular favorite font?
This looks like a solution for a fixed-pitch font. I get this output
for a Windows console display (with - used for space):
I think bart is right that this is specific to fixed-width fonts.
For a variable width font, 'W' is going to be wider than '|'.
See also the POSIX `int wcwidth(wchar_t wc)` function, which returns
the "number of column positions of a wide-character code". It does
depend on the current locale.
The assumption seems to be that fixed-width fonts are consistent
about the widths of characters.
On Wed, 3 Dec 2025 06:24:23 +0100, Bonita Montero wrote:
VC++ supports the C and C++ locales if you like to have it portable.
Hi Bonita.
Here I'm running any mixture of: Windows/BSD/Linux Mint LMDE.
Windows has the ...W() APIs along with codepage-based APIs with
the ...A() suffix. The W() APIs support UTF-16, so no need for
Yes that's correct, but...
- that assumes we know in advance what the character is
- it would only work under Windows
We want portability across diverse OSs. In my case, the program
does NOT care what the character is, it simply needs to be able
to find it when searching data & displaying it in an ordered way.
The code below works perfectly:
[code snipped; identical to the listing earlier in the thread]
Could you identify which document guarantees that every Unicode locale contains "UTF-8"?
On Tue, 18 Nov 2025 14:27:53 -0500, James Kuyper wrote:
Could you identify which document guarantees that every Unicode locale
contains "UTF-8"?
How else would it work? Bytes have to be 8-bit.
Lawrence D’Oliveiro <ldo@nz.invalid> writes:
On Tue, 18 Nov 2025 14:27:53 -0500, James Kuyper wrote:
Could you identify which document guarantees that every Unicode locale
contains "UTF-8"?
How else would it work? Bytes have to be 8-bit.
I can't figure out what point you're trying to make.
Obviously bytes in C have to be *at least* 8 bits, but I don't see
the relevance.
Take a look at the article to which you replied. How does your
followup have anything to do with it?
One of several points that you snipped is that locale names can
contain the string "utf8", not "UTF-8".
On 12/24/2025 12:22 AM, Keith Thompson wrote:
Lawrence D’Oliveiro <ldo@nz.invalid> writes:
On Tue, 18 Nov 2025 14:27:53 -0500, James Kuyper wrote:
Could you identify which document guarantees that every Unicode
locale contains "UTF-8"?
How else would it work? Bytes have to be 8-bit.
I can't figure out what point you're trying to make.
Obviously bytes in C have to be *at least* 8 bits, but I don't see
the relevance.
Take a look at the article to which you replied. How does your
followup have anything to do with it?
One of several points that you snipped is that locale names can
contain the string "utf8", not "UTF-8".
Did C never work on the 6-bit-character machines such as the Univac
1108 (36-bit) or the CDC 7600 (60-bit)?
Lynn