Erlang's not about lightweight processes and message passing (2023)

Original link: https://stevana.github.io/erlangs_not_about_lightweight_processes_and_message_passing.html

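This article argues that Erlang's core strength lies not only in its lightweight processes and message passing but also in its generic components, called behaviours. Behaviours resemble interfaces in other languages: they define standard interfaces (for example `gen_server`, `gen_event`, and `supervisor`) that encapsulate complex concurrency infrastructure. The programmer only implements the sequential business logic, while the behaviour takes care of concurrency, fault tolerance, and best practices. The author highlights the supervisor behaviour and the "let it crash" philosophy, which yields robust systems through automatic restarts. Behaviours provide structure and simplify testing and formal verification. The article suggests that other languages should adopt this structured approach rather than merely copying the concurrency primitives. Building on Erlang's insights, the author proposes simulation testing of distributed systems, following FoundationDB's success. The article closes by outlining future research directions, including incorporating ideas from the LMAX Disruptor and Aeron for a faster event loop, asynchronous I/O, and a detailed supervisor implementation.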

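This Hacker News thread discusses an article arguing that Erlang's value goes beyond lightweight processes and message passing, emphasising its deeper features, in particular the behaviour/interface concept and its cost-effectiveness for building complex systems. Commenters share their experience, compare Elixir/Erlang with Node.js and other platforms, and praise the reliability of the BEAM virtual machine and the OTP platform in distributed systems. A recurring theme is that Erlang/Elixir adoption remains relatively low despite these strengths. Reasons given include hiring difficulty, Erlang's reputation as an "odd" language, and the availability of more "mediocre but well-integrated" tools. Some feel the ecosystem focuses mostly on process lifecycle and RPC, while others argue that the preemptive scheduler provides greater stability. The thread also covers historical background and the intricacies of BEAM scheduling and internals. Finally, it is noted that Erlang was "fired" (banned) in 1998, but that Armstrong was rehired in 2004.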

  • Original article

    Posted on Jan 18, 2023

    I used to think that the big idea of Erlang is its lightweight processes and message passing. Over the last couple of years I’ve realised that there’s a bigger insight to be had, and in this post I’d like to share it with you.

    Erlang has an interesting history. If I understand things correctly, it started off as a Prolog library for building reliable distributed systems, morphed into a Prolog dialect, before finally becoming a language in its own right.

    The goal seems to have always been to solve the problem of building reliable distributed systems. Erlang was developed at Ericsson and used to program their telephone switches. This was sometime in the 80s and 90s, before internet use became widespread. I suppose they were already dealing with “internet scale” traffic, i.e. hundreds of millions of users, with stricter SLAs than most internet services provide today. So in a sense they were ahead of their time.

    In 1998 Ericsson decided to ban all use of Erlang. The people responsible for developing it argued that if Ericsson was going to ban it, then they might as well open source it. Ericsson did so, and shortly afterwards most of the team that created Erlang quit and started their own company.

    One of these people was Joe Armstrong, who was also one of the main people behind the design and implementation of Erlang. The company was called Bluetail; it got bought up a couple of times, and in the end Joe got fired in 2002.

    Shortly after, still in 2002, Joe started writing his PhD thesis at the Swedish Institute of Computer Science (SICS). Joe was born in 1950, so he was probably 52 years old at this point. The topic of the thesis is “Making reliable distributed systems in the presence of software errors” and it was finished the year after, in 2003.

    It’s quite an unusual thesis in many ways. For starters, most theses are written by people in their twenties with zero experience of practical applications, whereas Joe had been working professionally on this topic since the 80s, i.e. for about twenty years. The thesis contains neither math nor theory; it’s merely a presentation of the ideas that underpin Erlang and how they used Erlang to achieve the original goal of building reliable distributed systems.

    I highly recommend reading his thesis and forming your own opinion, but to me it’s clear that the big idea there isn’t lightweight processes and message passing, but rather the generic components which in Erlang are called behaviours.

    I’ll first explain in more detail what behaviours are, and then I’ll come back to the point that they are more important than the idea of lightweight processes.

    Erlang behaviours are like interfaces in, say, Java or Go. A behaviour is a collection of type signatures which can have multiple implementations, and once the programmer provides such an implementation they get access to functions written against that interface. To make it more concrete here’s a contrived example in Go:
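
A minimal sketch of such an interface in Go. The names `HasName`, `Greet`, `Joe` and `Mike` are illustrative assumptions; the point is only that `Greet` is written once against the interface and works for every implementation.

```go
package main

import "fmt"

// HasName is the interface: a single method signature.
type HasName interface {
	Name() string
}

// Greet is written once against the interface and works for any implementation.
func Greet(n HasName) string {
	return fmt.Sprintf("Hello %s!", n.Name())
}

// Joe and Mike are two independent implementations of HasName.
type Joe struct{}

func (Joe) Name() string { return "Joe" }

type Mike struct{}

func (Mike) Name() string { return "Mike" }

func main() {
	fmt.Println(Greet(Joe{}))
	fmt.Println(Greet(Mike{}))
}
```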

    Generic server behaviour

    Next let’s have a look at a more complicated example in Erlang taken from Joe’s thesis (p. 136). It’s a key-value store where we can store a key-value pair or look up the value of a key. The handle_call part is the most interesting:
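
The same split between sequential logic and concurrent plumbing can be sketched in Go. This is not the thesis listing: `Request`, `Reply`, `Handler`, `Start` and `Call` are hypothetical names, and the goroutine loop only approximates what gen_server provides. The function `handleCall` plays the role of the callback and contains no concurrency at all.

```go
package main

import "fmt"

// Request is either a store or a lookup; Op selects which (hypothetical encoding).
type Request struct {
	Op, Key, Value string
}

// Reply carries the value; Found is false when a lookup misses.
type Reply struct {
	Value string
	Found bool
}

// Handler is the sequential callback the programmer supplies.
type Handler func(req Request, state map[string]string) Reply

// handleCall is the business logic: no goroutines, channels or locks.
func handleCall(req Request, state map[string]string) Reply {
	switch req.Op {
	case "store":
		state[req.Key] = req.Value
		return Reply{Value: req.Value, Found: true}
	case "lookup":
		v, ok := state[req.Key]
		return Reply{Value: v, Found: ok}
	}
	return Reply{}
}

// call pairs a request with the channel on which the reply is sent back.
type call struct {
	req   Request
	reply chan Reply
}

// Server is the concurrent half: it owns the state and serialises all calls.
type Server struct{ calls chan call }

// Start launches the server loop in its own goroutine.
func Start(handle Handler) *Server {
	s := &Server{calls: make(chan call)}
	go func() {
		state := map[string]string{}
		for c := range s.calls {
			c.reply <- handle(c.req, state)
		}
	}()
	return s
}

// Call sends a request and blocks until the server replies.
func (s *Server) Call(req Request) Reply {
	c := call{req: req, reply: make(chan Reply)}
	s.calls <- c
	return <-c.reply
}

func main() {
	s := Start(handleCall)
	s.Call(Request{Op: "store", Key: "joe", Value: "erlang"})
	fmt.Println(s.Call(Request{Op: "lookup", Key: "joe"}))
	fmt.Println(s.Call(Request{Op: "lookup", Key: "mike"}))
}
```

Because only the server goroutine touches the map, callers on any goroutine can use `Call` without locks.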

    Event manager behaviour

    Let’s come back to the behaviours we listed above first. We looked at gen_server, but what are the others for? There’s gen_event, which is a generic event manager: it lets you register event handlers that are then run when the event manager gets messages associated with the handlers. Joe says this is useful for, e.g., error logging and gives the following example of a simple logger (p. 142):
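
The idea of registering handlers with an event manager can be sketched in Go as follows. This is an illustration rather than the thesis example: `EventManager`, `AddHandler`, `Notify` and `Logger` are assumed names, events are plain strings, and everything runs synchronously in one goroutine.

```go
package main

import "fmt"

// Handler reacts to a single event (events are plain strings in this sketch).
type Handler func(event string)

// EventManager keeps the registered handlers and fans events out to them.
type EventManager struct {
	handlers []Handler
}

// AddHandler registers a handler; handlers run in registration order.
func (m *EventManager) AddHandler(h Handler) {
	m.handlers = append(m.handlers, h)
}

// Notify runs every registered handler on the event.
func (m *EventManager) Notify(event string) {
	for _, h := range m.handlers {
		h(event)
	}
}

// Logger is a simple handler that remembers only the last max events.
type Logger struct {
	max    int
	events []string
}

// Handle appends the event and drops the oldest one when the log is full.
func (l *Logger) Handle(event string) {
	l.events = append(l.events, event)
	if len(l.events) > l.max {
		l.events = l.events[1:]
	}
}

func main() {
	m := &EventManager{}
	logger := &Logger{max: 2}
	m.AddHandler(logger.Handle)
	for _, e := range []string{"boot", "disk full", "restart"} {
		m.Notify(e)
	}
	fmt.Println(logger.events)
}
```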

    State machine behaviour

    The gen_fsm behaviour has been replaced by gen_statem (for state machine) since the thesis was written. It’s very similar to gen_server, but more geared towards implementing protocols, which are often specified as state machines. I believe any gen_server can be implemented as a gen_statem and vice versa, so we won’t go into the details of gen_statem.

    Supervisor behaviour

    The next interesting behaviour is supervisor. Supervisors are processes whose sole job is to make sure that other processes are healthy and doing their job. If a supervised process fails then the supervisor can restart it according to some predefined strategy. Here’s an example due to Joe (p. 148):
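
A rough Go sketch of the restart idea, not Joe’s listing. `Worker`, `runOnce` and `Supervise` are hypothetical names, and the only strategy modelled is "restart the same worker until a restart limit is hit"; real OTP supervisors offer several strategies and time-based restart intensity.

```go
package main

import (
	"errors"
	"fmt"
)

// Worker is a unit of work that may fail, by returning an error or by panicking.
type Worker func() error

// runOnce runs the worker and converts a panic into an ordinary error.
func runOnce(w Worker) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("worker crashed: %v", r)
		}
	}()
	return w()
}

// Supervise restarts the worker each time it fails and gives up after
// maxRestarts restarts. It returns the number of restarts performed and the
// last error (nil if the worker eventually finished normally).
func Supervise(w Worker, maxRestarts int) (restarts int, err error) {
	for {
		err = runOnce(w)
		if err == nil || restarts == maxRestarts {
			return restarts, err
		}
		restarts++
	}
}

func main() {
	attempts := 0
	flaky := func() error {
		attempts++
		if attempts < 3 {
			panic("boom") // let it crash; the supervisor deals with it
		}
		return nil
	}
	restarts, err := Supervise(flaky, 5)
	fmt.Println(restarts, err)

	alwaysFails := func() error { return errors.New("bad config") }
	restarts, err = Supervise(alwaysFails, 2)
	fmt.Println(restarts, err)
}
```

The worker itself contains no error handling; recovery lives entirely in the supervising code, which is the separation "let it crash" relies on.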

    report from a biased company. Notice per year vs per week, but as we don’t know how either reliability number was obtained, it’s probably safe to assume that the truth is somewhere in the middle: still a big difference, but not 31.56 milliseconds (nine nines) of downtime per year vs 1.6 hours of downtime per week.
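
The arithmetic behind those two figures can be checked with a few lines of Go. The helper names `downtimePerYear` and `availability` are made up for this sketch, and a year is assumed to be 365.25 days.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

const (
	year = 8766 * time.Hour // 365.25 days
	week = 168 * time.Hour
)

// downtimePerYear returns the yearly downtime implied by n nines of
// availability, e.g. n = 3 means 99.9% available.
func downtimePerYear(nines int) time.Duration {
	unavailable := math.Pow(10, -float64(nines))
	return time.Duration(unavailable * float64(year))
}

// availability converts a downtime per period into a fraction between 0 and 1.
func availability(downtime, period time.Duration) float64 {
	return 1 - float64(downtime)/float64(period)
}

func main() {
	// Nine nines: roughly 31.56 milliseconds of downtime per year.
	fmt.Println(downtimePerYear(9))
	// 1.6 hours (96 minutes) of downtime per week, as a percentage.
	fmt.Printf("%.2f%%\n", 100*availability(96*time.Minute, week))
}
```

Put on the same scale, 1.6 hours per week corresponds to about 99.05% availability, i.e. roughly two nines against nine.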

    Application and release behaviours

    I’m not sure if application and release technically are behaviours, i.e. interfaces. They are part of the same chapter as the other behaviours in the thesis and they do provide a clear structure which is a trait of the other behaviours though, so we’ll include them in the discussion.

    So far we’ve presented behaviours from the bottom up. We started with “worker” behaviours gen_server, gen_statem and gen_event which together capture the semantics of our problem. We then saw how we can define supervisor trees whose children are other supervisor trees or workers, to deal with failures and restarts.

    Next level up is an application which consists of a supervisor tree together with everything else we need to deliver a particular application.

    A system can consist of several applications and that’s where the final “behaviour” comes in. A release packages up one or more applications. Releases also contain code to handle upgrades. If an upgrade fails, the release must be able to roll back to the previous stable state.

    I hope that by now I’ve managed to convince you that it’s not actually the lightweight processes and message passing by themselves that make Erlang great for building reliable systems.

    At best one might be able to claim that lightweight processes and supervisors are the key mechanisms at play, but I think it would be more honest to recognise the structure that behaviours provide and how that ultimately leads to reliable software.

    I’ve not come across any other language, library, or framework which provides such relatively simple building blocks that compose into big systems like the AXD301 (“over a million lines of Erlang code”, p. 167).

    This raises the question: why aren’t language and library designers stealing the structure behind Erlang’s behaviours, rather than copying the ideas of lightweight processes and message passing?

    Let’s take a step back. We said earlier that behaviours are interfaces and many programming languages have interfaces. How would we go about starting to implement behaviours in other languages?

    Let’s start with gen_server. I like to think of its interface signature as being:
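
One way to write such a signature down in Go, offered as an assumption rather than the author’s exact formulation: a function from an input and the current state to the next state and an output. `StateMachine`, `Run` and `counter` are hypothetical names.

```go
package main

import "fmt"

// StateMachine is one possible reading of a gen_server-like interface:
// given an input and the current state, produce the next state and an output.
type StateMachine[S, I, O any] func(input I, state S) (S, O)

// Run feeds the inputs through the machine in order and collects the outputs.
func Run[S, I, O any](sm StateMachine[S, I, O], state S, inputs []I) (S, []O) {
	outputs := make([]O, 0, len(inputs))
	for _, in := range inputs {
		var out O
		state, out = sm(in, state)
		outputs = append(outputs, out)
	}
	return state, outputs
}

// counter is a tiny implementation: each input is added to a running total.
func counter(input int, total int) (int, string) {
	total += input
	return total, fmt.Sprintf("total=%d", total)
}

func main() {
	final, outs := Run[int, int, string](counter, 0, []int{1, 2, 3})
	fmt.Println(final, outs)
}
```

Since the step function is pure, the same inputs always give the same outputs, which is what makes this shape convenient for testing.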

    simulation testing distributed systems à la FoundationDB.

    Simulation testing in a nutshell is running your system in a simulated world, where the simulation has full control over which messages get sent when over the network.

    FoundationDB built their own programming language, or dialect of C++ with actors, in order to do the simulation testing. Our team seemed to be able to get quite far with merely using state machines of type:

    talks he mentions how difficult it is to correctly implement distributed leader election.

    I believe this is a problem that would be greatly simplified by having access to a simulator. It’s a bit like how I imagine having access to a wind tunnel makes building an airplane easier. Both let you test your system under extreme conditions, such as unreliable networking or power loss, before they happen in “production”. Furthermore, this simulator can be generic in, or parametrised by, behaviours, which means that the developer gets it for free while the complexity of the simulator is hidden away, just like the concurrent code of gen_server!
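
A toy version of such a simulator in Go, to make the idea concrete. It is a sketch under simplifying assumptions, not FoundationDB’s or the author’s design: `Message`, `Node`, `Simulate` and `pinger` are hypothetical, the network is a single slice of in-flight messages, and the only faults are reordering and loss.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Message is addressed to a node by index (hypothetical wire format).
type Message struct {
	To      int
	Payload int
}

// Node is a state machine: it consumes a message, updates its own state and
// returns the messages it wants to send.
type Node interface {
	Step(m Message) []Message
}

// Simulate delivers in-flight messages in an order chosen by a seeded random
// generator, dropping each with probability dropRate. The same seed always
// replays the same schedule. It returns the number of delivered messages.
func Simulate(nodes []Node, initial []Message, seed int64, dropRate float64, maxSteps int) int {
	rng := rand.New(rand.NewSource(seed))
	inFlight := append([]Message(nil), initial...)
	delivered := 0
	for step := 0; step < maxSteps && len(inFlight) > 0; step++ {
		i := rng.Intn(len(inFlight)) // the simulator decides what arrives next
		m := inFlight[i]
		inFlight = append(inFlight[:i], inFlight[i+1:]...)
		if rng.Float64() < dropRate {
			continue // simulated network loss
		}
		inFlight = append(inFlight, nodes[m.To].Step(m)...)
		delivered++
	}
	return delivered
}

// pinger forwards a decreasing counter to its peer until it reaches zero.
type pinger struct {
	peer int
	seen []int
}

func (p *pinger) Step(m Message) []Message {
	p.seen = append(p.seen, m.Payload)
	if m.Payload == 0 {
		return nil
	}
	return []Message{{To: p.peer, Payload: m.Payload - 1}}
}

func main() {
	run := func(seed int64) int {
		nodes := []Node{&pinger{peer: 1}, &pinger{peer: 0}}
		return Simulate(nodes, []Message{{To: 0, Payload: 10}}, seed, 0.1, 1000)
	}
	fmt.Println(run(42) == run(42)) // same seed, same schedule
}
```

The nodes know nothing about the simulator; swapping `Simulate` for a real network loop would leave `pinger` unchanged, and a failing seed can be replayed exactly.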

    FoundationDB is a good example of simulation testing working, as witnessed by this tweet where somebody asked Kyle “aphyr” Kingsbury to Jepsen test FoundationDB:

    “haven’t tested foundation[db] in part because their testing appears to be waaaay more rigorous than mine.”

    Formal verification is also made easier if the program is written as a state machine. Basically all of Lamport’s model checking work with TLA+ assumes that the specification is a state machine. Also, more recently Kleppmann has shown how to exploit the state machine structure to do proof by (structural) induction to solve the state explosion problem.

    So there you have it, we’ve gone full circle. We started by taking inspiration from Joe and Erlang’s behaviours, and ended up using the structure of the gen_server behaviour to make it easier to solve a problem that Joe used to have.

    There are a bunch of related ideas that I have started working on:

    • Stealing ideas from Martin Thompson’s work on the LMAX Disruptor and Aeron to make a fast event loop, on top of which the behaviours run;
    • Enriching the state machine type with async I/O;
    • How to implement supervisors in more detail;
    • Hot code swapping of state machines.

    Feel free to get in touch if you find any of this interesting and would like to get involved, or if you have comments, suggestions or questions.
