Parsing Chemistry

Original link: https://re.factorcode.org/2025/10/parsing-chemistry.html

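This post walks through implementing a chemical formula parser in the Factor programming language, inspired by Python's `chemparse` library. The goal is to turn a formula such as "H2O", "C1.5O3", or "K4[Fe(SCN)6]" into a dictionary-like structure that maps each element to its count. The author uses Factor's EBNF support to define the grammar, splitting the problem into parsing symbols, numbers, and pairs (an element with its count, possibly nested inside parentheses or square brackets). The `split-formula` word parses a formula string into a list of pairs. The `flatten-formula` word then recursively expands nested groups, multiplying counts as it goes. Finally, `parse-formula` combines these steps to produce the element-to-count mapping. The implementation is demonstrated with several unit tests covering formulas of varying complexity, including fractional stoichiometry and nested groups, and is available on GitHub.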

In Python, the chemparse project is available as a “lightweight package for parsing chemical formula strings into python dictionaries” mapping chemical elements to numeric counts.

It supports parsing several variants of formula such as:

simple formulas like "H2O"
fractional stoichiometry like "C1.5O3"
groups such as "(CH3)2"
nested groups such as "((CH3)2)3"
square brackets such as "K4[Fe(SCN)6]"

I thought it would be fun to build similar functionality using Factor.

We are going to use Factor's EBNF syntax support, which makes it simpler to write a parsing expression grammar. As usual, it helps to break the problem down into steps. A symbol is an uppercase letter optionally followed by a lowercase letter, a number is an integer or float (optionally with an exponent), and a pair is a symbol or a bracketed group of pairs with an optional number before and after it. When both numbers are present they are multiplied together, and a missing number counts as 1.

EBNF: split-formula [=[

symbol = [A-Z] [a-z]? => [[ sift >string ]]

number = [0-9]+ { "." [0-9]+ }? { { "e" | "E" } { "+" | "-" }? [0-9]+ }?
       => [[ first3 [ concat ] bi@ "" 3append-as string>number ]]

pair   = number? { symbol | "("~ pair+ ")"~ | "["~ pair+ "]"~ } number?
       => [[ first3 swapd [ 1 or ] bi@ * 2array ]]

pairs  = pair+

]=]
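
To see what the semantic action on the pair rule does, it can help to run it by hand in the listener. The snippet below is only a sketch: the three-element arrays are hand-written stand-ins for what the parser is assumed to hand to the action (prefix, symbol or group, suffix), with f in place of an optional number that did not match.

```factor
USING: arrays kernel math prettyprint sequences ;

! Hypothetical parse results, written by hand: { prefix symbol suffix }.
! f marks an optional number that was absent.

! No prefix, suffix 2: the missing prefix defaults to 1, so the count is 1 * 2.
{ f "H" 2 } first3 swapd [ 1 or ] bi@ * 2array .   ! { "H" 2 }

! Neither number present: the count is 1 * 1.
{ f "O" f } first3 swapd [ 1 or ] bi@ * 2array .   ! { "O" 1 }

! Both present: prefix and suffix are multiplied, 3 * 2.
{ 3 "H" 2 } first3 swapd [ 1 or ] bi@ * 2array .   ! { "H" 6 }
```

Here swapd moves the symbol underneath the two numbers, [ 1 or ] bi@ replaces each missing number with 1, and 2array packs the symbol with the product.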

We can test that this works:

IN: scratchpad "H2O" split-formula .
V{ { "H" 2 } { "O" 1 } }

IN: scratchpad "(CH3)2" split-formula .
V{ { V{ { "C" 1 } { "H" 3 } } 2 } }

That gives us nested pairs, but we still need to recursively flatten them into an assoc mapping each element to its count.

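! Multiply the pair's count by the enclosing multiplier n, then either
! add it to the assoc (for a symbol) or recurse into the nested group.
: flatten-formula ( elt n assoc -- )
    [ [ first2 ] [ * ] bi* ] dip pick string?
    [ swapd at+ ] [ '[ _ _ flatten-formula ] each ] if ;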
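
As a quick check of the recursion, flatten-formula can be called directly on one nested pair. This sketch assumes the flatten-formula word above is already defined in the listener; the input is the pair that split-formula printed earlier for "(CH3)2", and the outer multiplier starts at 1.

```factor
USING: assocs kernel prettyprint ;

! Assumes flatten-formula (defined above) is already loaded.
! elt = the parsed group with its count, n = 1, assoc = a fresh hashtable.
{ V{ { "C" 1 } { "H" 3 } } 2 } 1 H{ } clone
[ flatten-formula ] keep .
```

The group's count 2 becomes the multiplier for its children, so the printed hashtable should map "C" to 2 and "H" to 6 (the order of the entries may differ).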

And combine those two steps to parse a formula:

! Split the formula into pairs, then fold each pair into a fresh hashtable.
: parse-formula ( str -- assoc )
    split-formula H{ } clone [
        '[ 1 _ flatten-formula ] each
    ] keep ;

We can now test that this works with a few unit tests that show each of the features we hoped to support:

{ H{ { "H" 2 } { "O" 1 } } } [ "H2O" parse-formula ] unit-test

{ H{ { "C" 1.5 } { "O" 3 } } } [ "C1.5O3" parse-formula ] unit-test

{ H{ { "C" 2 } { "H" 6 } } } [ "(CH3)2" parse-formula ] unit-test

{ H{ { "C" 6 } { "H" 18 } } } [ "((CH3)2)3" parse-formula ] unit-test

{ H{ { "K" 4 } { "Fe" 1 } { "S" 6 } { "C" 6 } { "N" 6 } } }
[ "K4[Fe(SCN)6]" parse-formula ] unit-test

This is available on my GitHub.
