Forum: War Ensemble BBS

Re: u8"" c11 c23

From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.lang.c on Mon Dec 15 11:13:21 2025

From Newsgroup: comp.lang.c

Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

Thiago Adams <thiago.adams@gmail.com> writes:

speaking on signed x unsigned,

u8"a" in C11 had the type char [N]. Normally char is signed

I would have said "commonly" rather than "normally". Not an
important point.

in C23 it is char8_t [N], where char8_t is unsigned char.

when converting code from C11 to C23 we have an error here:
const char* s = u8"";

I generally cast "char*" to "unsigned char*" when handling
something with UTF-8. I am not using u8""; I use just " " with UTF-8
encoded source code and I just assume const char* is UTF-8.

That raises another issue.

The <uchar.h> header was introduced in C11. In C11 and C17,
that header defines char16_t and char32_t. C23 introduces char8_t.

There doesn't seem to be any way, other than checking the value of
__STDC_VERSION__, to determine whether char8_t is defined or not.
There are no *_MIN or *_MAX macros for these types, either in
<uchar.h> or in <limits.h>. A test program I just wrote would have
been a little simpler if I could have used `#ifdef CHAR8_MAX`.

Here's the test program:

    #include <stdio.h>
    #include <uchar.h>

    /* Map the type of an expression to its name as a string. */
    #define TYPEOF(x) \
        (_Generic(x, \
            char: "char", \
            signed char: "signed char", \
            unsigned char: "unsigned char", \
            short: "short", \
            unsigned short: "unsigned short", \
            int: "int", \
            unsigned int: "unsigned int", \
            long: "long", \
            unsigned long: "unsigned long", \
            long long: "long long", \
            unsigned long long: "unsigned long long"))

    int main(void) {
        printf("__STDC_VERSION__ = %ldL\n", __STDC_VERSION__);
        printf("u8\"a\"[0] is of type %s\n",
               TYPEOF(u8"a"[0]));
    #if __STDC_VERSION__ >= 202311L
        /* char8_t exists only from C23 on. */
        printf("char8_t is %s\n", TYPEOF((char8_t)0));
    #endif
        printf("char16_t is %s\n", TYPEOF((char16_t)0));
        printf("char32_t is %s\n", TYPEOF((char32_t)0));
    }

Its output with `gcc -std=c17`:

__STDC_VERSION__ = 201710L
u8"a"[0] is of type char
char16_t is unsigned short
char32_t is unsigned int

Its output with `gcc -std=c23`:

__STDC_VERSION__ = 202311L
u8"a"[0] is of type unsigned char
char8_t is unsigned char
char16_t is unsigned short
char32_t is unsigned int

Since C23 defines char8_t to be the same type as unsigned char,
it seems better to just define it when it isn't there:

#include <limits.h>

#if CHAR_BIT == 8 && __STDC_VERSION__ < 202311
typedef unsigned char char8_t;
#endif

--- Synchronet 3.21a-Linux NewsLink 1.2

From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.lang.c on Mon Dec 15 14:27:26 2025

From Newsgroup: comp.lang.c

Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

[...]

The <uchar.h> header was introduced in C11. In C11 and C17,
that header defines char16_t and char32_t. C23 introduces char8_t.

There doesn't seem to be any way, other than checking the value of
__STDC_VERSION__ to determine whether char8_t is defined or not.
There are not *_MIN or *_MAX macros for these types, either in
<uchar.h> or in <limits.h>. A test program I just wrote would have
been a little simpler if I could have used `#ifdef CHAR8_MAX`.

[...]

Since C23 defines char8_t to be the same type as unsigned char,
it seems better to just define it when it isn't there:

#include <limits.h>

#if CHAR_BIT == 8 && __STDC_VERSION__ < 202311
typedef unsigned char char8_t;
#endif

Yes. And the test for CHAR_BIT may not be necessary, depending on
the programmer's intent. char8_t is the same type as unsigned char
even if CHAR_BIT > 8. Similarly, char16_t and char32_t are the same
type as uint_least16_t and uint_least32_t, respectively.

But before C23, u8"a" is a syntax error.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */

From Thiago Adams@thiago.adams@gmail.com to comp.lang.c on Tue Dec 16 07:57:27 2025

From Newsgroup: comp.lang.c

On 12/15/2025 7:27 PM, Keith Thompson wrote:
...

But before C23, u8"a" is a syntax error.

u8"a" was introduced in C11.
u8'a' was introduced in C23.


From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.lang.c on Tue Dec 16 04:17:29 2025

From Newsgroup: comp.lang.c

Thiago Adams <thiago.adams@gmail.com> writes:

On 12/15/2025 7:27 PM, Keith Thompson wrote:
...

But before C23, u8"a" is a syntax error.

u8"a" was introduced in C11.
u8'a' was introduced in C23.

Thank you, I stand corrected.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */

From BGB@cr88192@gmail.com to comp.lang.c on Tue Dec 16 14:59:01 2025

From Newsgroup: comp.lang.c

On 10/20/2025 1:35 PM, Thiago Adams wrote:

speaking on signed x unsigned,

u8"a" in C11 had the type char [N]. Normally char is signed

in C23 it is char8_t [N], where char8_t is unsigned char.

when converting code from C11 to C23 we have an error here:
const char* s = u8"";

I generally cast "char*" to "unsigned char*" when handling something
with UTF-8. I am not using u8""; I use just " " with UTF-8 encoded source
code and I just assume const char* is UTF-8.

It may not be so simple, as source-code bytes don't necessarily map 1:1
with string literal bytes (and are more likely to be translated than
passed through as-is).

Implicitly, it may depend on the default locale and similar assumed by
the C compiler.

If the source-code is UTF-8, and the default locale is UTF-8, then OK.

More conservative though is to assume that the default locale's
character encoding is potentially something like 8859-1 or 1252, which
will not preserve UTF-8 codepoints if not mapped into an area supported
by the relevant encoding (so, things may get remapped).

So, you need a UTF-8 string literal or similar to specify that the
string does in-fact encode text as UTF-8.

In a compiler, one may need to try to detect and deal with text
encoding, say:
  ASCII text:
    No BOM, limited range of characters
    (0x20..0x7E, 0x09, 0x0D, 0x0A, etc).
  UTF-8:
    Also includes 80..EF
    Only allow valid codepoint sequences
    May include a BOM
  8859-1 or 1252:
    Includes 80..FF, excludes text which is also valid as UTF-8.
    No BOM.
  Other encodings possible,
    like 437 / KOI-8 / JIS / etc,
    but far less common than 1252.
    No good way to distinguish them reliably.
  UTF-16 (*1):
    Even number of bytes
    Strongly hinted if even or odd bytes are frequently NUL;
      Frequent even NUL: UTF-16, likely big-endian;
      Frequent odd NUL: UTF-16, likely little-endian;
    Excluded if matching the pattern for one of the above;
      If text is valid ASCII or UTF-8, assume these instead.
    May include a BOM.

*1: More commonly produced by older versions of Visual Studio or
Notepad, if a non-ASCII codepoint was present. Newer versions tend to
default to UTF-8 instead.

A compiler may normalize on UTF-8 or similar internally, but this again
doesn't mean it can be assumed for string literals (which are more
likely to be mashed into 1252 or something, such as with a compiler
like MSVC).

Though, that said, it does seem that GCC defaults to assuming UTF-8 if
nothing else is specified. So, UTF-8 => UTF-8 with default string
literals may be workable if one also assumes that the code is always
compiled with GCC or similar.

Though, curiously, it seems newer MSVC will still use UTF-8 with a
default string literal if the character is given as "\uXXXX", but will
use a single-byte encoding in other cases.

Checking, newer versions of MSVC are also aware of u8 literals.

...


From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.lang.c on Sun Dec 21 22:37:15 2025

From Newsgroup: comp.lang.c

Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:

[...]

The <uchar.h> header was introduced in C11. In C11 and C17,
that header defines char16_t and char32_t. C23 introduces char8_t.

There doesn't seem to be any way, other than checking the value of
__STDC_VERSION__ to determine whether char8_t is defined or not.
There are not *_MIN or *_MAX macros for these types, either in
<uchar.h> or in <limits.h>. A test program I just wrote would have
been a little simpler if I could have used `#ifdef CHAR8_MAX`.

[...]

Since C23 defines char8_t to be the same type as unsigned char,
it seems better to just define it when it isn't there:

#include <limits.h>

#if CHAR_BIT == 8 && __STDC_VERSION__ < 202311
typedef unsigned char char8_t;
#endif

Yes. And the test for CHAR_BIT may not be necessary, depending on
the programmer's intent. char8_t is the same type as unsigned char
even if CHAR_BIT > 8.

That's humorous. It's like a name designed to be confusing or
misleading. But thank you for the information, I wouldn't have
guessed it.

Similarly, char16_t and char32_t are the same type as
uint_least16_t and uint_least32_t, respectively.

Kind of weird, but at least it's consistent, and it explains why
char8_t is the same as unsigned char. Then again, why not
uint_least8_t? Has C23 changed to the point where unsigned char
and uint_least8_t have to be the same type? My recollection is
that in earlier editions of the C standard it is possible, even
if unlikely, for these types to be distinct.
