Awk Technical Notes (2023)

Original link: https://maximullaris.com/awk_tech_notes.html


March 2023

In the previous article Fascination with AWK we discussed why AWK is great for prototyping and is often the best alternative to the shell and Python. In this article I want to show you some interesting technical facts I learned about AWK.

Lack of GC

AWK was designed to not require a GC (garbage collector) for its implementation. By the way, just like sh/bash. (I learned this remarkable fact from the oilshell blog – a rather interesting technical blog, where the author describes his progress in creating the “better bash”).

The most substantial consequence is that it’s forbidden to return an array from a function; you can return only a scalar value.

function f() {
   a[1] = 2
   return a # error
}

However, you can pass an array to a function and fill it there:

BEGIN {
   fill(arr)
   print arr[0] " " arr[1]
}
function fill(arr,   i) { arr[i++] = "hello"; arr[i++] = "world" }

The thing is, without a GC all heap allocations must be deterministic. That is, an array declared locally in a function must be destroyed at the moment the function returns. That’s why an array is not allowed to escape its declaration scope (via return).

The absence of GC allows the language implementation to stay very simple, thus fast and portable, and with predictable memory consumption. To me, this qualifies AWK as a perfect embeddable language, although, for some reason, this niche is firmly occupied by (GC-equipped) Lua.

Local variables

All variables are global by default. However, if you add a variable to the function parameters (like i above) it becomes local. JavaScript works in a similar way, although it has the more suitable var/let/const keywords. In practice, it is customary to separate “real” function parameters from “local” parameters with additional spaces, for clarity.
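
A minimal sketch of this convention (the names f and g here are just illustrative): i is padded off with extra spaces in the parameter list, so callers never pass it and it acts as a local, while g, never listed, stays global.

```shell
# i is local to f (it appears in the parameter list); g is global.
awk 'function f(x,   i) { i = x * 2; g = i; return i }
     BEGIN { print f(21) "|" g "|" i }'
# → 42|42|   (the BEGIN-level i was never touched, hence empty)
```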

Although Brian Kernighan (the K in AWK) regrets this design, in practice it works just fine.

“The notation for function locals is appalling (all my fault too, which makes it worse).”

So it appears, the use of local variables is also a mechanism for automatic release of resources. A small example:

function NUMBER(    res) {
  return (tryParse1("-", res) || 1) &&
    (tryParse1("0", res) || tryParse1("123456789", res) && (tryParseDigits(res)||1)) &&
    (tryParse1(".", res) ? tryParseDigits(res) : 1) &&
    (tryParse1("eE", res) ? (tryParse1("-+",res)||1) && tryParseDigits(res) : 1) &&
    asm("number") && asm(res[0])
}

The NUMBER function parses a number; res is a temporary array that will be automatically deallocated when the function exits.

Autovivification

An associative array is declared simply by the fact of using the corresponding variable arr as an array.

Likewise, a variable that is treated as a number (i++) will be implicitly declared as a numeric type, and so on.

To Perl connoisseurs, this feature may be known as autovivification. In general, AWK is quite unequivocally a prototype of Perl. You could even say that Perl is a kind of AWK overgrown on steroids… However, I digress.

This is done, obviously, to make it possible to write the most compact code in one-liners, for which many of us are used to using AWK.
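
A typical one-liner that leans on autovivification (a minimal sketch; n is an arbitrary name): the array n and each of its entries spring into existence on first use, initialized to 0, so no declarations are needed.

```shell
# Count word frequencies: n[$i]++ both creates the array n and
# auto-initializes each missing entry before incrementing it.
printf 'a b a\nb a\n' | awk '{ for (i = 1; i <= NF; i++) n[$i]++ }
                             END { print n["a"] "|" n["b"] }'
# → 3|2
```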

About AWK syntax/grammar

I want to tell about a couple of findings I encountered while implementing the parser for AWK for intellij-awk project.

$ is a unary operator

If you used AWK, most likely you’ve used $0, $1, $2, etc. Some even used $NF.

But did you know that $ is an operator that can be applied to an expression?

So it’s perfectly valid to write

{ second=2; print $second }

or, applying $ to any expression that evaluates to 2,

{ print $(1+1) }

with the same result as

{ print $2 }

Also, it’s interesting to note that $ is the only operator allowed to appear on the left side of an assignment. That is, you can write

{ $2="hello" }

or

{ $length("xx")="hello" }

(the same, since length("xx") is 2)

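Assigning to a field also rebuilds $0 accordingly, which is easy to check from the shell:

```shell
# Assigning to $2 updates the record, so print (i.e. print $0) shows the change.
echo "a b c" | awk '{ $2 = "hello"; print $0 }'
# → a hello c
```
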
Quiz. What will be the output of

echo "2 3 4 hello" | awk '{ print $$$$1 }'

and why? Try to answer without running. Try adding even more $. Explain the behavior.

Function call f() doesn’t allow a space before (

… but only for user-defined functions:

awk 'BEGIN { fff () } function fff(){ }' # syntax error
awk 'BEGIN { fff() }  function fff(){ }' # OK

You can have space for built-in functions:

awk 'BEGIN { print substr ("abc",1,2) }' # OK, outputs ab

Why such a strange inconsistency? It’s because of AWK’s decision to use the empty operator (juxtaposition) for string concatenation:

BEGIN { a = "hello"; b = "world"; c = a b; print c } # helloworld 

it means that AWK tries to parse fff (123) as a concatenation of the variable fff and the parenthesized expression (123).

Obviously fff () is just a syntax error, the same as fff (1,2).

As for built-in functions, AWK knows beforehand that the name is not a variable, so it can disambiguate.
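
This concatenation reading is easy to observe when fff is an ordinary variable (a minimal sketch; fff here is not a function):

```shell
# fff followed by a space and a parenthesized expression is concatenation.
awk 'BEGIN { fff = "abc"; print fff (123) }'
# → abc123   (the variable fff concatenated with 123, not a call)
```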

built-in functions are parsed as part of syntax

If you take a look at the AWK specification in POSIX, at the Grammar section (yes, the AWK grammar is part of the POSIX standard!), you’ll notice that AWK’s built-in functions are part of it. To be precise, they are recognized at the lexing step, so they enter the parsing step as ready-made tokens.

The implication here is that you are disallowed to name your own function or variable with the name of any built-in function. It will be a syntax error!

BEGIN { length = 1 } # syntax error

Compare to Python:

Why is this? For flexibility. Remember, AWK’s main goal was to be an extremely terse yet productive language, well suited for one-liners. So:

  • it’s allowed to omit () for built-in functions, when no arguments passed, like in echo "hello" | awk '{ print length }' – same as echo "hello" | awk '{ print(length()) }'
  • same function can be used with different number of arguments, like sub(/regex/, "replacement", target) and sub(/regex/, "replacement") – omitted target is implied as $0
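
Both points are easy to check directly: with no parentheses, length defaults to length($0), and with the target omitted, sub operates on $0 in place:

```shell
# Built-in called with no parentheses: length means length($0).
echo "hello" | awk '{ print length }'
# → 5

# Two-argument sub: the omitted target defaults to $0, modified in place.
echo "hello world" | awk '{ sub(/world/, "AWK"); print }'
# → hello AWK
```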

All these nuances require pretty ad-hoc parsing for built-in functions. This is why they are part of the grammar. If we take the getline keyword, it’s not even a function, but rather a very versatile syntax construct.

ERE vs DIV lexing ambiguity

AWK’s ad-hoc syntax, optimized for succinct code, has some inherent ambiguities in its grammar.

The problem resides in the lexing ambiguity of the tokens ERE (extended regular expression, /regex/) vs DIV (/). Naturally, a lexer prefers the longest matching term. This causes a problem for parsing code like

{ print b /c/ d }

Because a greedy lexer can emit /c/ as a single ERE token, making the line parse as the concatenation

{ print b ($0 ~ /c/) d }

instead of the correct chain of divisions

{ print (b / c) / d }
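
Real implementations resolve this case in favor of division when / follows a value, as a quick check shows:

```shell
# A slash after a variable is lexed as DIV, not as the start of a regex.
awk 'BEGIN { b = 6; c = 3; d = 2; print b /c/ d }'
# → 1   (parsed as (6 / 3) / 2)
```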

This kind of problem is well-known, and usually the implementation requires the Lexer hack:

The solution generally consists of feeding information from the semantic symbol table back into the lexer. That is, rather than functioning as a pure one-way pipeline from the lexer to the parser, there is a backchannel from semantic analysis back to the lexer. This mixing of parsing and semantic analysis is generally regarded as inelegant, which is why it is called a “hack”.

In the original AWK (sometimes called the One True Awk), identifying regular expressions is the job of the parser, which explicitly sets the lexer into “regex mode” when it has figured out that it should expect to read a regex:

reg_expr:
     '/' {startreg();} REGEXPR '/'     { $$ = $3; }
   ;

(startreg() is a function defined in lex.c) The reg_expr rule itself is only ever matched in contexts where a division operator would be invalid.

However, in intellij-awk I managed to disambiguate this at the lexer level, though this required creating a (somewhat sophisticated) lexer with multiple states (note the usage of the state DIV_POSSIBLE).


You can check some other (Gawk-related) nuances I found in parser_quirks.md.


Overall, I noticed that many old programming languages have very ad-hoc syntax, and hence ad-hoc parsing.

I think, partially, because they wanted to make the programming language very flexible (PL/1, Ada, C++, AWK, Perl, shell).

Partially, because some languages tried to be as close to human language as possible (SQL, or even COBOL – almost every language feature in them is a separate syntax construct).

Or maybe because parsing theory wasn’t that strong back then. So it was common to write ad-hoc parsers instead of using something like lex + yacc.

Nowadays, programming languages tend to have much more regular syntax. The most prominent example in this regard is probably Go.


If you noticed a typo or have other feedback, please email me at [email protected]
