• AWK Technical Notes

    From Ben Collver@bencollver@tilde.pink to comp.misc on Thu Dec 18 19:08:07 2025
    From Newsgroup: comp.misc

    AWK technical notes
    ===================

    March 2023

    In the previous article Fascination With AWK we discussed why AWK is
    great for prototyping and is often the best alternative to the shell
    and Python. In this article I want to show you some interesting
    technical facts I learned about AWK.

Fascination With AWK
    <https://maximullaris.com/awk.html>

    Lack of GC
    ==========

AWK was designed so that its implementation does not require a GC
(garbage collector), just like sh/bash. (I learned this remarkable
fact from the oilshell blog, a rather interesting technical blog in
which the author describes his progress in creating a "better bash".)

    oilshell blog
    <https://www.oilshell.org/blog/tags.html?tag=awk#awk>

The most substantial consequence is that a function is forbidden to
return an array; it can return only a scalar value.

function f() {
    a[1] = 2
    return a # error
}

However, you can pass an array to a function and fill it there:

BEGIN {
    fill(arr)
    print arr[0] " " arr[1]
}
function fill(arr, i) {
    arr[i++] = "hello"
    arr[i++] = "world"
}

The point is that in the absence of a GC all heap allocations must be
deterministic. That is, an array declared locally in a function must
be destroyed the moment the function returns. That's why it is not
allowed to escape the function's declaration scope (via return).

The absence of a GC keeps the language implementation very simple,
and therefore fast and portable, with predictable memory consumption.
To me, this qualifies AWK as a perfect embeddable language, although
for some reason this niche is firmly occupied by the (GC-equipped)
Lua.

    Local variables
    ===============

All variables are global by default. However, if you add a variable
to the function parameters (like i above), it becomes local.
JavaScript works in a similar way, although there you have the more
suitable var/let/const keywords. In practice, it is customary to
separate "real" function parameters from "local" parameters with
additional spaces for clarity.
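
For example, a minimal sketch of this convention (the join function
and its names are mine, purely for illustration): everything after
the extra spaces in the parameter list is a local, not a parameter
the caller is expected to pass.

function join(arr, n, sep,    tmp, i) {  # tmp and i are "locals"
    for (i = 1; i <= n; i++)
        tmp = tmp (i > 1 ? sep : "") arr[i]
    return tmp
}
BEGIN {
    a[1] = "foo"; a[2] = "bar"
    print join(a, 2, ",")  # prints foo,bar
}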

    Although Brian Kernighan (the K in AWK) regrets this design, in
    practice it works just fine.

As he put it: "The notation for function locals is appalling (all my
fault too, which makes it worse)."

So it appears that local variables are also a mechanism for the
automatic release of resources. A small example:

function NUMBER( res) {
    return (tryParse1("-", res) || 1) &&
        (tryParse1("0", res) || tryParse1("123456789", res) &&
            (tryParseDigits(res) || 1)) &&
        (tryParse1(".", res) ? tryParseDigits(res) : 1) &&
        (tryParse1("eE", res) ?
            (tryParse1("-+", res) || 1) && tryParseDigits(res) :
            1) &&
        asm("number") &&
        asm(res[0])
}

The NUMBER function parses a number. res is a temporary array that is
automatically deallocated when the function exits.

    Autovivification
    ================

An associative array is declared simply by the fact that the
corresponding variable arr is used as an array:

    arr["a"] = "b"

Likewise, a variable that is treated as a number (i++) is implicitly
declared as having a numeric type, and so on.

Perl connoisseurs may know this feature as autovivification. In
general, AWK is quite unequivocally a prototype of Perl; you could
even say that Perl is a kind of AWK overgrown on steroids... But I
digress.

    Autovivification
    <https://en.wikipedia.org/wiki/Autovivification>

This is done, obviously, to make it possible to write the most
compact code in one-liners, which is what many of us are used to
using AWK for.
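
A classic illustration of autovivification at work is the
word-frequency one-liner: neither count nor its keys are declared
anywhere, they simply spring into existence on first use (the file
name here is just a placeholder).

awk '{ count[$1]++ } END { for (k in count) print k, count[k] }' access.log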

    About AWK syntax/grammar
    ========================

I want to share a couple of findings I encountered while implementing
the AWK parser for the intellij-awk project.

    intellij-awk
    <https://github.com/xonixx/intellij-awk>

    $ is a unary operator
    ---------------------

If you have used AWK, you have most likely used $0, $1, $2, etc. Some
have even used $NF.

But did you know that $ is an operator that can be applied to an
expression?

    So it's perfectly valid to write

    { second = 2; print $second }

    or

    { print $ (1 + 1) }

    or

    { i = 1; print $++i }

    With the same result as

    { print $2 }

It is also interesting to note that $ is the only operator allowed to
appear on the left side of an assignment, that is, you can write

    { $(7 - 5) = "hello" }

    or

    { $length("xx")="hello" }

    (same as)

    { $2 = "hello" }

    Quiz. What will be the output of

    echo "2 3 4 hello" | awk '{ print $$$$1 }'

    and why? Try to answer without running. Try adding even more $.
    Explain the behavior.

A function call f() doesn't allow a space before ( ...
-------------------------------------------------------

    ... but only for user-defined functions:

    awk 'BEGIN { fff () } function fff(){ }' # syntax error
    awk 'BEGIN { fff() } function fff(){ }' # OK

You can have a space with built-in functions:

    awk 'BEGIN { print substr ("abc", 1, 2) }' # OK, outputs ab

Why such a strange inconsistency? It's because of AWK's decision to
use an empty operator for string concatenation:

    BEGIN { a = "hello"; b = "world"; c = a b; print c } # helloworld

This means that AWK tries to parse fff (123) as a concatenation of
the variable fff and the value (123).

    Obviously fff () is just a syntax error, the same as fff (1,2).
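
You can see the concatenation interpretation directly when there is
something to concatenate (here fff is just a variable; no such
function is defined):

awk 'BEGIN { fff = "x"; print fff (123) }' # prints x123: concatenation, not a call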

As for built-in functions, AWK knows beforehand that the name is not
a variable name, so it can disambiguate.

Built-in functions are parsed as part of the syntax
---------------------------------------------------

If you take a look at the AWK specification in POSIX, at the Grammar
section (yes, the AWK grammar is part of the POSIX standard!), you'll
notice that the built-in functions are part of it. To be precise,
they are recognized at the lexing step, so they enter the parsing
step as ready-to-use tokens.

Grammar section
<https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html#tag_20_06_13_16>

The implication here is that you are not allowed to give your own
function or variable the name of any built-in function. It would be a
syntax error!

    BEGIN { length = 1 } # syntax error

Compare with Python:

    len = 1 # OK

Why is this? For flexibility. Remember, AWK's main goal was to be an
extremely terse yet productive language, well suited for one-liners.
So:

* it's allowed to omit () for built-in functions when no arguments
  are passed, as in:
      echo "hello" | awk '{ print length }'
  which is the same as:
      echo "hello" | awk '{ print(length()) }'

* the same function can be used with a different number of arguments,
  like sub(/regex/, "replacement", target) and
  sub(/regex/, "replacement"), where the omitted target is implied to
  be $0 (see the example after this list)
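
For instance, the two forms of sub() mentioned above behave like this
(the sample input is made up):

echo "hello world" | awk '{ sub(/world/, "there"); print }'              # hello there (implicit $0)
echo "hello world" | awk '{ s = $0; sub(/world/, "there", s); print s }' # hello there (explicit target)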

All these nuances require fairly ad-hoc parsing for built-in
functions, which is why they are part of the grammar. And if we take
the getline keyword, it's not even a function, but rather a very
versatile syntax construct.

    syntax construct <https://www.gnu.org/software/gawk/manual/html_node/Getline.html>
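
Two of getline's many forms, just as a taste (the command and the
file path are illustrative only):

# read a command's output into a variable
awk 'BEGIN { "date" | getline today; close("date"); print "today:", today }'

# read one line from a file into a variable
awk 'BEGIN { if ((getline first < "/etc/hostname") > 0) print "host:", first }'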

    ERE vs DIV lexing ambiguity
    ---------------------------

AWK's ad-hoc syntax, optimized for succinct code, has some inherent
ambiguities in its grammar.

inherent ambiguities
<https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html#:~:text=There%20is%20a%20lexical%20ambiguity%20between%20the%20token>

The problem resides in the lexing ambiguity between the tokens ERE
(extended regular expression, /regex/) and DIV (/). Naturally, a
lexer prefers the longest matching term. This causes a problem for
parsing code like:

    a(1 / 2, 3 / 4)

because it can be parsed as

    a(1 (/ 2, 3 /) 4)

instead of the correct

    a((1/2), (3/4))

This kind of problem is well known, and implementations usually
require the lexer hack:

    The solution generally consists of feeding information from the
    semantic symbol table back into the lexer. That is, rather than
    functioning as a pure one-way pipeline from the lexer to the
    parser, there is a backchannel from semantic analysis back to the
    lexer. This mixing of parsing and semantic analysis is generally
    regarded as inelegant, which is why it is called a "hack".

    Lexer hack
    <https://en.wikipedia.org/wiki/Lexer_hack>

the parser
<https://github.com/onetrueawk/awk/blob/d62c43899fd25fdc4883a32857d0f157aa2b6324/awkgram.y#L289>

    In the original AWK (sometimes called the One True Awk), identifying
    regular expressions is the job of the parser, which explicitly sets
    the lexer into "regex mode" when it has figured out that it should
    expect to read a regex:

reg_expr:
      '/' {startreg();} REGEXPR '/'    { $$ = $3; }
    ;

(startreg() is a function defined in lex.c.) The reg_expr rule itself
is only ever matched in contexts where a division operator would be
invalid.

lex.c
<https://github.com/onetrueawk/awk/blob/d62c43899fd25fdc4883a32857d0f157aa2b6324/lex.c#L515>

In intellij-awk, however, I managed to disambiguate this at the lexer
level, but it required creating a (somewhat sophisticated) lexer with
multiple states (note the use of the DIV_POSSIBLE state).

lexer with multiple states
<https://github.com/xonixx/intellij-awk/blob/main/src/main/java/intellij_awk/Awk.flex>

    You can check some other (Gawk-related) nuances I found in
    parser_quirks.md.

parser_quirks.md
<https://github.com/xonixx/intellij-awk/blob/main/doc/parser_quirks.md>

Overall, I noticed that many old programming languages have a very
ad-hoc syntax, and hence ad-hoc parsing.

I think this is partly because they wanted to make the programming
language very flexible (PL/1, Ada, C++, AWK, Perl, shell).

And partly because some languages tried to be as close to human
language as possible (SQL, or even COBOL, where almost every language
feature is a separate syntax construct).

    Or maybe because parsing theory wasn't that strong back then. So it
    was common to write ad-hoc parsers instead of using something like
    lex & yacc.

Nowadays, programming languages tend to have a much more regular
syntax. The most prominent example in this regard is probably Go.

regular syntax
<https://softwareengineering.stackexchange.com/questions/316217/why-does-the-type-go-after-the-variable-name-in-modern-programming-languages>

    From: <https://maximullaris.com/awk_tech_notes.html>
• From Lawrence D'Oliveiro@ldo@nz.invalid to comp.misc on Thu Dec 18 20:23:33 2025
    From Newsgroup: comp.misc

    On Thu, 18 Dec 2025 19:08:07 -0000 (UTC), Ben Collver wrote:

> In the previous article Fascination With AWK we discussed why AWK is
> great for prototyping and is often the best alternative to the shell
> and Python.

    Perl does everything Awk does, just as concisely, and more besides.

    If you want more readable code, then Python is your best bet these days.