(comments)

原始链接: https://news.ycombinator.com/item?id=41069829

The text describes a system made up of two networks, Gemini and AlphaZero. Gemini is a fine-tuned language model that translates natural language into formal language, while AlphaZero focuses on solving the hard problem itself, unburdened by natural language. Because it only plays a game within the proof space, where the moves are commands to the Lean theorem prover, its approach avoids the "hallucination" problems that LLMs run into. AlphaZero has achieved impressive results when adapted to a variety of games, demonstrating its ability to learn and excel across different strategic domains; for example, it beat a strong version of the chess program Stockfish, showing its versatility.

On the formalization side, one commenter tested the translation step (using GPT-4o) by supplying the answer "all even integers" and asking for a formal statement; the generated output needed small fixes to imports and syntax, and it failed to state both directions of the "if and only if", showing there is still room for improvement. Lean, besides being a formal language, also acts as a theorem prover, allowing users to prove statements. IMO participants, by contrast, cannot simply rely on automated tools, because they must show their own work. Solving an IMO problem also requires formalizing the original question, which the mathematician George Pólya stressed is a key part of the problem-solving process.

At the competition, gold and silver medals together go to the top 25% of IMO participants, so an AI performing at that level solves IMO problems better than three quarters of the competing high-school students, which is arguably even more impressive. Although solving several of the problems took much longer than the allotted time, the AI's success on individual IMO problems is still significant; comparing against students' completion times can give a misleading picture of the AI's abilities, and reporting the percentage of problems solved, or the score obtained, would give a clearer assessment. Ultimately, AI proves effective at solving mathematical problems provided suitable problems are chosen: ones that are genuinely hard for both humans and machines, since a computer can trivially overcome tasks that are easy to brute-force, whereas IMO participants need genuinely challenging problems to compete with the AI on equal terms. In short, AI has shown remarkable efficiency on mathematical challenges, although the degree of success depends on the complexity of the task. While there is no intention of pitting AI against


Original text


So I am extremely hyped about this, but it's not clear to me how much heavy lifting this sentence is doing:

> First, the problems were manually translated into formal mathematical language for our systems to understand.

The non-geometry problems which were solved were all of the form "Determine all X such that…", and the resulting theorem statements are all of the form "We show that the set of all X is {foo}". The downloadable solutions from https://storage.googleapis.com/deepmind-media/DeepMind.com/B... don't make it clear whether the set {foo} was decided by a human during this translation step, or whether the computer found it. I want to believe that the computer found it, but I can't find anything to confirm. Anyone know?



The computer did find the answers itself. I.e., it found "even integers" for P1, "{1,1}" for P2, and "2" for P6. It then also provided a Lean proof in each case.



It would make a lot of sense for the Lean formalisations of the problems that the researchers fed to the AI to be provided. Not assuming bad intent in not providing them, but it would help to understand the results better.



"AlphaProof is a system that trains itself to prove mathematical statements in the formal language Lean. It couples a pre-trained language model with the AlphaZero reinforcement learning algorithm, which previously taught itself how to master the games of chess, shogi and Go."



Huh, so MCTS to find the ‘best’ token using a (relatively) small, quick language model? Sounds like an interesting approach to small model text generation too…



It's not an English LLM (Large Language Model).

It's a math Language Model. Not even sure it's a Large Language Model. (Maybe shares a foundational model with an English LLM; I don't know)

It learns mathematical statements, and generates new mathematical statements, then uses search techniques to continue. Similar to Alpha Go's neural network, what makes it new and interesting is how the NN/LLM part makes smart guesses that drastically prune the search tree, before the brute-force search part.

(This is also what humans do to solve math problems. But humans are really, really slow at brute-force search, so we rely almost entirely on the NN pattern-matching, analogy-making part.)
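
A minimal sketch of that division of labour, with hypothetical stand-ins (`policy_model`, `lean` and their methods are made up here, not real APIs): the model proposes a handful of tactics, and the search only explores those, with the checker as ground truth.

    # Hedged sketch: the network prunes the branching factor, the search does the rest.
    def prove(goal, policy_model, lean, depth=8, beam=4):
        """Depth-first search over tactic applications, pruned by the model's suggestions."""
        if lean.is_solved(goal):
            return []                                # nothing left to prove
        if depth == 0:
            return None                              # give up on this branch
        for tactic in policy_model.suggest_tactics(goal, k=beam):   # the "smart guesses"
            new_goal = lean.apply_tactic(goal, tactic)               # checker decides legality
            if new_goal is None:
                continue                             # tactic failed, try the next guess
            rest = prove(new_goal, policy_model, lean, depth - 1, beam)
            if rest is not None:
                return [tactic] + rest               # proof of this subgoal found
        return None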



These kind of LLMs are also very interesting for software engineering. It's just a matter of replacing Lean with something that is more oriented towards proving software properties.

For example, write a formal specification of a function in Dafny or Liquid Haskell and get the LLM to produce code that is formally guaranteed to be correct. Logic-based + probability-based ML.

All GOFAI ideas are still very useful.
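
A toy sketch of that idea using Lean itself in place of Dafny or Liquid Haskell (an illustration only): the theorem is the specification, and the proof is the machine-checked guarantee that the code meets it.

    -- Sketch only: generated code plus a checked guarantee about it.
    def rev {α : Type} : List α → List α
      | []      => []
      | x :: xs => rev xs ++ [x]

    -- The specification the generated code must satisfy, verified by the checker:
    theorem rev_length {α : Type} (xs : List α) : (rev xs).length = xs.length := by
      induction xs with
      | nil => simp [rev]
      | cons x xs ih => simp [rev, ih]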



This is really interesting. I would have expected the understanding to be that humans make a guess, test it, and learn from what did or did not work. The lessons learned from the prior tests would impact future guesses.

Do you know if a system like the OP is learning from failed tests to guide future tests, or is it a truly a brute force search as if it were trying to mine bitcoin?



This quote from the article sounds like it learns from failed tests:

>We trained AlphaProof for the IMO by proving or disproving millions of problems, covering a wide range of difficulties and mathematical topic areas over a period of weeks leading up to the competition. The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found.



Reading between the lines a bit, that does answer the question I had, though I don't think I clarified it very well.

I read that to say the model's token weights are adjusted as it goes, so in an LLM sense it is kind of learning. It isn't reasoning through an answer in the way a human does though. Meaning, the model is still just statistically predicting what an answer may be and checking if it worked.

I wouldn't chalk that up to learning at all. An AI solving complex math doesn't even seem too impressive to me with the predictive-loop approach. Computers are well adept at math; throwing enough compute hardware at it to brute-force an answer isn't surprising. I'd be really impressed if it could reliably get there with a similar number of failed attempts as a human; that could indicate that it really learned and reasoned rather than rammed through a mountain of failed guesses.



yeah but it doesn't understand the exact syntax on an absolute level, does it...? I understood this to be the same as any language model applied to programming languages (aka Formal Languages). Is that mistaken?



As far as I understand, and I may be wrong here, the system is composed of two networks: Gemini and AlphaZero. Gemini, being an ordinary LLM with some fine-tunes, only does translation from natural to formal language. Then, AlphaZero solves the problem. AlphaZero, unburdened with natural language and only dealing with "playing a game in the proof space" (where the "moves" are commands to the Lean theorem prover), does not hallucinate in the same way an LLM does because it is nothing like an LLM.



The formal definition of the first theorem already contains the answer to the problem: "{α : ℝ | ∃ k : ℤ, Even k ∧ α = k}" (which means the set of reals that are even integers). If they say that they translated the first problem into a formal definition, then it is very interesting how they initially formalized the problem without including the answer in it.
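
For reference, the shape of that formalisation is roughly as follows (a paraphrase assuming `import Mathlib`, not DeepMind's exact file; the coercions may need tweaking to compile):

    -- Rough paraphrase of P1: the answer set is named inside the statement to be proved.
    theorem imo2024_p1 :
        {α : ℝ | ∀ n : ℕ, 0 < n → (n : ℤ) ∣ ∑ i in Finset.range n, ⌊((i : ℝ) + 1) * α⌋}
          = {α : ℝ | ∃ k : ℤ, Even k ∧ α = k} := by
      sorry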



That wasn't very nice. Are you curious about anything? Happy to help. I'd proactively do it, but I don't want to guess at what's in your mind. My initial guess is you think I think that engaging with the public is an infinite loop. I don't!



I would expect that in the data they train AlphaProof on, they have some concept of a "vague problem", which could just look like

{Formal description of the set in question} = ?

And then AlphaProof has to find candidate descriptions of this set and prove a theorem that they are equal to the above.

I doubt they would claim to solve the problem if they provided half of the answer.
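
In Lean terms, one hypothetical way to pose such a "vague problem" (a guess at the shape, not DeepMind's actual data format; assumes `import Mathlib`) is to leave the answer as a hole the agent must fill before proving the equality:

    -- Hypothetical sketch: the agent supplies `answer` (e.g. the even integers) and the proof.
    def answer : Set ℝ := sorry

    theorem answer_is_correct :
        {α : ℝ | ∀ n : ℕ, 0 < n → (n : ℤ) ∣ ∑ i in Finset.range n, ⌊((i : ℝ) + 1) * α⌋}
          = answer := by
      sorry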



I think maybe parent comment is referring to it essentially just employing a zerg rush but with the speed and reaction time of an AI? Not 100% sure... Unrelated, IIRC the StarCraft functionality was an early example of generalizing a pretrained NN, AlphaGo, and showing that it could adapt to learn and defeat games across strategic domains, especially after it learned so much strategy from the most difficult, widely played, and most strategically-varied physical game available.



> I doubt they would claim to solve the problem if they provided half of the answer.

This falls under extraordinary claims require extraordinary proof and we have seen nothing of the sort.



Exactly, a problem and its answer are just different ways of describing the same object. Every step of a proof is a transformation/translation of the same object. It would be disingenuous to say that some heavy lifting isn't done in formalizing a problem but it seems that step is also performed by a machine:

"We established a bridge between these two complementary spheres by fine-tuning a Gemini model to automatically translate natural language problem statements into formal statements, creating a large library of formal problems of varying difficulty."

I'm confused, is the formalization by Gemini or "manually"? Which is it?



To speak generally, that translation part is much easier than the proof part. The problem with automated translation is that the translation result might be incorrect. This happens a lot even when people do formal methods by hand, so I guess the researchers concluded that they'd have to audit every single translation regardless of whether they used an LLM or other tools.



You'd think that, but Timothy Gowers (the famous mathematician they worked with) wrote (https://x.com/wtgowers/status/1816509817382735986)

> However, LLMs are not able to autoformalize reliably, so they got them to autoformalize each problem many times. Some of the formalizations were correct, but even the incorrect ones were useful as training data, as often they were easier problems.

So they didn't actually solve autoformalization, which is why they still needed humans to translate the input IMO 2024 problems.

The reason why formalization is harder than you think is that there is no way to know if you got it right. You can use Reinforcement Learning with proofs and have a clear signal from the proof checker. We don't have a way to verify formalizations the same way.



> However, LLMs are not able to autoformalize reliably, so they got them to autoformalize each problem many times. Some of the formalizations were correct, but even the incorrect ones were useful as training data, as often they were easier problems.

A small detail wasn't clear to me: for these incorrectly formalized problems, how do they get the correct answer as ground truth for training? Have a human to manually solve them?

(In contrast to problems actually from "a huge database of IMO-type problems", they do have answers for these already).



> A small detail wasn't clear to me: for these incorrectly formalized problems, how do they get the correct answer as ground truth for training? Have a human to manually solve them?

A formal proof can be mechanically checked for correctness; we just don't know the answer in advance. Think of it as an extremely rigorous type system that typically requires really long type annotations, where the annotation itself is a complex program. So if AlphaProof happens to generate a proof that passes this checker, we know that it's correct.
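
A tiny concrete example of that "rigorous type system" at work:

    -- The checker accepts this, because the term really does have the stated type:
    example (a b : Nat) : a + b = b + a := Nat.add_comm a b

    -- The checker rejects this at compile time; `rfl` does not prove this statement:
    -- example (a b : Nat) : a + b = b + a := rfl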



You write proofs in a formal language that can be machine checked. If the checker is happy, the proof is correct (unless there is a bug in the checker, but that is unlikely).



They said the incorrectly formalized ones are usually easier, so I assume they just hire humans to solve them in the old way until the AI is smart enough to solve these easier problems.



> I assume they just hire humans to solve…

An incorrectly formalized problem is just a different problem, and a solution to any formalized problem is still useful for AI training, because such solutions can be mechanically checked for correctness, and this does not require hiring humans. What requires humans is the initial formalization process, since that is more of a language translation task, which requires nuance and judgment and is difficult to mechanically verify.



> We don't have a way to verify formalizations the same way.

While there is no perfect method, it is possible to use the agent to determine if the statement is false, has contradictory hypotheses, or a suspiciously short proof.



> To speak generally, that translation part is much easier than the proof part.

To you or me, sure. But I think the proof that it isn't for this AI system is that they didn't do it. Asking a modern LLM to "translate" something is a pretty solved problem, after all. That argues strongly that what was happening here is not a "translation" but something else, like a semantic distillation.

If you ask an AI (or person) to decide the halting problem, they can't. If you "translate" the question into a specific example that does halt, they can run it and find out.

I'm suspicious, basically.



The linked page says

> While the problem statements were formalized into Lean by hand, the answers within the problem statements were generated and formalized by the agent.

However, it's unclear what initial format was given to the agents that allowed this step



FWIW, GPT-4o transcribed a screenshot of problem 1 perfectly into LaTeX, so I don't think "munge the problem into machine-readable form" is per se a difficult part of it these days even if they did somehow take shortcuts (which it sounds like they didn't).



… no? After the LaTeX output, I told stock GPT4o that the answer was "all even integers", and asked for the statement in Lean. I had to make two changes to its output (both of which were compile-time errors, not misformalisations), and it gave me the formalisation of the difficult direction of the problem.

Both changes were trivial: it had one incorrect (but unnecessary) import, and it used the syntax from Lean 3 instead of Lean 4 in one lambda definition. A system that was trained harder on Lean would not make those mistakes.

The one actual error it made was in not proposing that the other direction of the "if and only if" is required. Again, I am quite confident that this formalisation failure mode is not hard to solve in a system that is, like, actually trained to do this.

Obviously formalising problems that a working mathematician solves is dramatically harder than formalising IMO problems, and is presumably way ahead of the state of the art.



This is really not the kind of problem LLMs are bad at! But since you insist, given the LaTeX, Claude 3.5 Sonnet correctly stated the theorem in full while inventing notation for the floor operation (it did correctly note unprompted what the right function was and how to obtain it from mathlib, but it incorrectly attempted to define syntax sugar for it).



I think that's exaggerating a bit. If you are familiar with both Lean and LaTeX then I think transcribing these problems to Lean only takes about twice as long as transcribing them to LaTeX.



Think of problems in NP - you can check the answer efficiently, but finding the answer to check is the hard part... This is basically what we're looking at here: The proof checker can quickly evaluate correctness, but we need something to produce the proof, and that's the hard part.



The AI is the "solver network", which performs the (directed) search over candidate proofs checked by Lean. The AI's role is in doing an efficient search, I suppose.

I'm also waiting for my answer on the role of the Gemini formalizer, but reading between the lines, it looks like it was only used during training the "solver network", but not used in solving the IMO problems. If so then the hyping is greatly premature, as the hybrid formalizer/solver is the whole novelty of this, but it's not good enough to use end-to-end?

You cannot say AlphaProof learned enough to solve problems if formalization made them easier to solve in the first place! You can say that the "solver network" part learned enough to solve formalized problems better than prior training methods.



> When presented with a problem, AlphaProof generates solution candidates and then proves or disproves them by searching over possible proof steps in Lean.

To me, this sounds like AlphaProof receives a "problem", whatever that is (how do you formalize "determine all X such that..."? One is asked to show that an abstract set is actually some easily understandable set...). Then it generates candidate theorems, presumably in Lean. I.e. the set is {n : P(n)} for some formula or something. Then it searches for proofs.

I think if Alphaproof did not find {foo} but it was given then it would be very outrageous to claim that it solved the problem.

I am also very hyped.



I, as someone with a maths degree who hasn't done this kind of thing for half a decade, was able to immediately guess the solution to (1), but actually proving it is much harder.



As a noob, I feel that formalizing is a major part of solving the problem yourself. My assessment is that once you identify certain patterns, you can solve problems by memorizing those patterns, but people like me can struggle with the first stage and end up solving the wrong problem.

Still good progress nonetheless. I won't call the system sufficient by itself, though.



My mathematician friend said problem 5 (I think? With the monsters) seems hard to formulate, so I spent 15 minutes formulating it in pseudo-haskell.

Then he gave me a huge hint to the solution, after which it only took me a couple of hours to solve.

(Formalizing the solution is of course the hardest part, and might serve as a good masters dissertation I think)



As is often the case, creating a well formed problem statement often takes as much knowledge (if not work) as finding the solution.

But seriously, if you can't ask the LLM to solve the right question, you can't really expect it to give you the right answer unless you're really lucky. "I'm sorry, but I think you meant to ask a different question. You might want to check the homework set again to be sure, but here's what I think you really want."



Presenting this just as "translating into formal language" omits important information.

Lean isn't just a formal language, it is also a theorem prover. Could the IMO participants use the nlinarith tactic? Could they use other tactics?

Of course not, they had to show their work!

Could they have mathematicians translate the problem statements into the formal language for them?

Of course not, they had to do it themselves. In "How to Solve It", Pólya stresses multiple times that formalizing the initial question is an important part of the process.

Then, the actual computational resources expressed in time are meaningless if one has a massive compute cloud.

I'm a bit dissatisfied with the presentation, same as with the AlphaZero comparison to an obsolete Stockfish version that has been debunked multiple times.



This is certainly impressive, but whenever IMO is brought up, a caveat should be put out: medals are awarded to 50% of the participants (high school students), with 1:2:3 ratio between gold, silver and bronze. That puts all gold and silver medalists among the top 25% of the participants.

That means that "AI solves IMO problems better than 75% of the students", which is probably even more impressive.

But, "minutes for one problem and up to 3 days for each remaining problem" means that this is unfortunately not a true representation either. If these students were given up to 15 days (5 problems at "up to 3 days each") instead of 9h, there would probably be more of them that match or beat this score too.

It really sounds like AI solved only a single problem in the 9h students get, so it certainly would not be even close to the medals. What's the need to taint the impressive result with apples-to-oranges comparison?

Why not be more objective and report that it took longer but was able to solve X% of problems (or scored X out of N points)?



I have met IMO competitors. They are insanely smart. I wouldn't imagine it's possible to be that smart before I started hanging in these circles. So more like 25% of 0.01% of high school students.

Time is not a very interesting dimension here, because humans don't use the same CPU as huge GPU clusters. The binary "is it able to reach a solution given enough resources?" is more interesting (for GPT/Claude the answer is a clear negative).



But many more of those students (including many not in the contest) could solve those problems given more time than the 4.5hr sessions.

If these problems were important to solve, redundantly by thousands or millions of people (like the real work that most people do), far more people would put in the effort to learn how to solve these problems.

It's just a weird comparison. Contests are very artificially structured in ways that don't make sense for comparing to computers.



I think what we need to look at is that AI systems were not able to do this before, and now they are. Soon, these systems with millions of dollars of compute will just scale the algorithms and beat humans at this as well - just like they did for chess, Go, shogi, etc.



> medals are awarded to 50% of the participants (high school students)

In case this confuses anyone: the high school students in question are not a standard sample of high school students. AFAIK, they are teams of the ~6 strongest competitive problem solving high school students from each country.



In my opinion (not Google's), the only reason they didn't get gold this year (apart from being unlucky on problem selection) is that they didn't want to try for any partial credit on P3 and P5. They are so close to the cut-off, and usually contestants with a little bit of progress can get 1 point. But I guess they didn't want to get a gold on a technicality--it would be bad press. So they settled for an indisputable silver.



The AI took a day on one of the problems, so it must have generated and discarded a lot of proofs that didn't work. How could it choose which one to submit as the answer, except by the objective fact of the proof passing in Lean?



When tackling IMO problems, the hard part is coming up with a good approach to the proof. Verifying your proof (and rejecting your false attempts) is much easier. You'll know which one to submit.

(Source: I am a two-time IMO silver medalist.)



You are a human, not an AI. You know whether your idea seems related to the solution. The AI has thousands of ideas and doesn't know which are better. Graders shouldn't accept a thousand guesses grubbing for 1 point.

If verifying a good idea is easy, then the evidence shows that the AI didn't have good ideas for the other 2 problems.



We are talking about Lean proofs. Given a formal statement and a proof, Lean can verify whether it's correct or not. It's like generating computer programs to solve a problem - the difficulty lies in generating useful solutions/sub-solutions so that the search is effective. They achieve this by using Gemini as a Lean proof generator, i.e. a world-model LLM fine-tuned to generate Lean proofs more effectively.

Humans are even better at this, as you mention - but effectively the approach is similar: come up with a lot of ideas and see which ones prove it.



Well, he does have twice the amount of silver medals... And can speak the English language... Although, an AI attempting to speak with the human race entirely through an esoteric math-proofing language would be an interesting take on the whole, "humans make ET contact, interact through universal language.... pause for half of the movie, military hothead wanting to blow it out of the sky, until pretty lady mathematician runs into the oval office waving a sheet of paper... OF MATH!" trope.... But now, it's a race between you and I, to see who can write the screenplay first!



Yeah, the progress has to be quite significant, no points are awarded for trivial observations. Thus scores are usually bimodal around 0 and 7. In the linked stats you can see that 1 point for P3/P5 was less common than full score on other problems.



Problems 2 and 5 have a lot of 1s. Sometimes there is an interesting advance that is easier than the full solution. Also, in problem 6 the only hope for most was to get that 1 point.



I don't believe anything was graded by the IMO; Google is just giving itself 7 for anything proved in Lean (which is reasonable, IMO), so they can't really try for partial credit so much as choose not to report a higher self-graded number.



From the article: "Our solutions were scored according to the IMO’s point-awarding rules by prominent mathematicians Prof Sir Timothy Gowers, an IMO gold medalist and Fields Medal winner, and Dr Joseph Myers, a two-time IMO gold medalist and Chair of the IMO 2024 Problem Selection Committee."



Fair enough. Although I think the question is whether P3/P5 were given zero points vs not evaluated vs evaluated but not published. I don't think it is surprising that Lean-verified proofs get a 7.



One key difference between giving humans more time and giving computer programs more time is that historically we have had more success making the latter run faster than the former.



> What's the need to taint the impressive result with apples-to-oranges comparison?

Most of DeepMind’s research is a cost-centre for the company. These press releases help justify the continued investment both to investors and to the wider public.



> Most of DeepMind’s research is a cost-centre for the company.

The effect of establishing oneself as the thought leader in a field is enormous.

For example, IBM's stock went up 15% the month after they beat Kasparov.



The complexity of truth doesn't fit in a headline, let alone attract enough clicks to win the competition for eyeballs and virality - which means that even if someone somehow succeeded in telling the whole truth in a headline, that article would lose in the race for eyeballs to the clickbait-headline version, which means your eyeballs would (statistically) never see it, and only the clickbait version.

I'd say "welcome to the web" but this was true in 1800s newspapers as well.



Just bigger chip area at a lower clock could do it too, if there is no threshold on cost anyway and they were limited by power. Google would hit enough accumulated production of AI chip area and cluster build-out even if Moore's law stopped. There will likely be big algorithmic improvements too.



We need to up the ante: getting human-like performance on any task is not impressive in itself; what matters is superhuman performance, orders of magnitude above. These comparisons with humans in order to create impressive-sounding titles disguise the fact that we are still in the stone age of intelligence.



This is the real deal. AlphaGeometry solved a very limited set of problems with a lot of brute-force search. This is a much broader method that I believe will have a great impact on the way we do mathematics. They are really implementing a self-feeding pipeline from natural language mathematics to formalized mathematics where they can train both formalization and proving. In principle this pipeline can also learn basic theory building, like creating auxiliary definitions and lemmas. I really think this is the holy grail of proof assistance and will allow us to formalize most of the mathematics we create very naturally. Humans will work post-rigorously and let the machine assist with filling in the details.



Agreed, this is a big step forward. Geometry problems are in a different class, since you can translate them into systems of polynomial equations and use well known computer algebra algorithms to solve them.

By contrast, this kind of open ended formalization is something where progress used to be extremely slow and incremental. I worked in an adjacent field 5 years ago and I cannot stress enough that these results are simply out of reach for traditional automated reasoning techniques.

Real automatic theorem proving is also useful for a lot more than pure mathematics. For example, it's simple to write out an axiomatic semantics for a small programming language in lean and pose a question of the form "show that there exists a program which satisfies this specification".
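
As a toy version of that kind of question (an illustration only, not a real benchmark), sketched in Lean:

    -- A two-constructor expression language, its semantics, and an existence question.
    inductive Expr
      | lit : Nat → Expr
      | add : Expr → Expr → Expr

    def eval : Expr → Nat
      | Expr.lit n   => n
      | Expr.add a b => eval a + eval b

    -- "Show that there exists a program which satisfies this specification."
    theorem exists_program : ∃ e : Expr, eval e = 42 :=
      ⟨Expr.lit 42, rfl⟩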

If this approach scales it'll be far more important than any other ML application that has come out in the last few years.



> Agreed, this is a big step forward. Geometry problems are in a different class, since you can translate them into systems of polynomial equations and use well known computer algebra algorithms to solve them.

The blog post indicates the opposite. The geometry problem in the IMO problem set was solved by AlphaGeometry 2, which is an LLM based on Google's Gemini. LLMs are considered relatively general systems. But the other three solved problems were proved by AlphaProof, which is a narrow RL system that is literally based on AlphaZero, the Go and Chess AI. Only its initial (bootstrapping) human training data (proofs) were formalized and augmented by an LLM (Gemini).



Only slightly more general. It only works for games that are zero-sum and deterministic, have no hidden information, and have discrete game states and moves. Other examples include Connect Four.



So finding Lean proofs can be conceptualized as a zero-sum game?

Another basic requirement is that valid moves / inference steps and the winning condition can be efficiently verified using some non-AI algorithm. Otherwise there would not be a reward signal for the reinforcement learning algorithm. This is different from answering most natural language questions, where the answer can't be checked trivially.



Theorem proving can be formulated as a game, see e.g., https://plato.stanford.edu/entries/logic-games/ and interactive theorem provers can verify that a proof is correct (and related sub problems, such as that a lemma application is valid).

Conceptually, if you're trying to show a conjunction, then it's the other player's turn and they ask you for a proof of a particular case. If you're trying to show a disjunction then it's your turn and you're picking a case. "Forall" is a potentially infinite conjunction, "exists" is a potentially infinite disjunction.

In classical logic this collapses somewhat, but the point is that this is still a search problem of the same kind. If you want to feel this for yourself, try out some proofs in lean or coq. :)
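
To make the game reading concrete, here are the two simplest "moves" written out in Lean (toy examples):

    -- Conjunction: the "opponent" may demand either half, so we must supply both.
    example (P Q : Prop) (hp : P) (hq : Q) : P ∧ Q := ⟨hp, hq⟩

    -- Disjunction: it is "our" move, so we pick the side we can actually win.
    example (P Q : Prop) (hq : Q) : P ∨ Q := Or.inr hq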



I don't think AlphaZero is related to this work, apart from both being NN-based. AlphaZero and its training pipeline fundamentally only works for "chess-like" two-player games, where the agent can play against itself and slowly improve through MCTS.



"AlphaProof is a system that trains itself to prove mathematical statements in the formal language Lean. It couples a pre-trained language model with the AlphaZero reinforcement learning algorithm, which previously taught itself how to master the games of chess, shogi and Go."



> a lot of brute force search

Don't dismiss search, it might be brute force but it goes beyond human level in Go and silver at IMO. Search is also what powers evolution which created us, also by a lot of brute forcing, and is at the core of scientific method (re)search.



And I understand the upper time limit for each question was 4.5 hours. So it solved one question almost immediately, two well over the allotted time (60 hrs), and two not at all. No medal for you, Grasshopper.



Contestants get 4.5 hours for each of the two days of competition. They have to solve three problems in that time, so on average you can spend 1.5 hours per problem (if you're aiming to finish all three).

That said, the gap from "can't do it at all" to "can do it in 60 hours" is probably quite a bit larger than the gap from 60 hours to 1.5 hours.



My old AI professor used to say that every problem is a search problem.

The issue is that to find solutions for useful problems you're often searching through highly complex and often infinite solution spaces.



For some problems validation is expensive. Like the particle collider or space telescope, or testing the COVID vaccine. It's actually validation that is the bottleneck in search not ideation.



I would argue that no actually searchable solution space is really infinite (if only because infinite turing machines can't exist). Finite solution spaces can get more than large enough to be intractable.



What about ℕ? Seems pretty infinite to me, unless with "actually" you mean finite in time and space, which would make your argument a tautology. Or am I missing something?



Almost every "number" "in" N doesn't actually exist. In the search for numbers that exist, we will most likely only ever find a finite set of numbers before the Universe or humanity dies.

("Scare quotes")



Searches happen in finite time and space and, more importantly, systems performing those searches have practical finite limits on parameters that determine the size of the space within which that search can take place (such as available memory).

Even within fairly modest finite limits, you can produce a solution space that cannot be significantly searched with the available finite matter and time available in the observable universe.

Thus, the problem with using search isn't that solution spaces can be infinite, but that finite solution spaces can be unimaginably large.



You mean that by improving search we can solve any problem? What if the solution space is infinite? Even if we make the search algorithm 10x or 100x more performant, the solution space will still be infinite, no?



What makes solving IMO problems hard is usually the limits of human memory, pattern-matching, and search, not creativity. After all, these are problems that are already solved, and it is expected that many people can solve the problems in about 1 hour's time.

That makes it, in principle, similar to or even easier than a championship-level chess move, which often takes more than 1 hour for a professional human (with more training than an IMO high school student) to solve.

Another interesting concern is that when posing a problem to humans, it's fine to pose an "easy" brute-forceable problem, but humans, being slow brute-searchers, need to find more clever solutions. But if you give such a problem to a computer, it can trivialize it. So to test a computer, you need to pose non-easily-brute-forceable problems, which are harder for the computer than the others, but equally difficult for humans as the other problems are.



Ok but if you read the actual solutions they aren't a bizarre mess of brute force.

They look like what a human would write if they were trying to come up with a formal proof (albeit it does some steps in a weird order).



Why do you say these are problems that are already solved? Sure, they're often variations on existing themes, but the same is true for chess positions and, honestly, almost everything else in any field of human endeavor.

Agreed that the absolute upper tier of chess players have trained longer and harder than most or all IMO contestants. Though I do wonder which (top-tier chess or the IMO) draws on a larger talent pool. To my understanding, a significant fraction of all high school students on Earth take some form of qualifying exam which can channel them into an IMO training program.

And as far as the being amenable to brute force (relative difficulty for humans vs. computers): it seems that chess was comparatively easier for computers, IMO problems are comparatively easier for humans, and the game of Go is somewhere in between.



These problems are literally already solved? Of course, the IMO problem designers make sure the problems have solutions before they use them. That's very different from math research, where it's not known in advance what the answer is, or even whether there is a good answer.



I imagine a system like this to be vastly more useful outside the realm of mathematics research.

You don't need to be able to prove very hard problems to do useful work. Proving just simple things is often enough. If I ask a language model to complete a task, organize some entries in a certain way, schedule this or that, or write code that accomplishes X, the result is typically not trustworthy directly. But if the system is able to translate parts of the problem to logic and find a solution, that might make the system much more reliable.



As resident strident AI skeptic, yeah, this is real.

But MCTS was always promising when married to large NNs and DeepMind/Brain were always in front on it.

I don’t know who fucked up on Gemini and it’s concerning for Alphabet shareholders that no one’s head is on a spike. In this context “too big to fail” is probably Pichai.

But only very foolish people think that Google is lying down on this. It’s Dean and Hassabis. People should have some respect.



The lede is a bit buried: they're using Lean!

This is important for more than Math problems. Making ML models wrestle with proof systems is a good way to avoid bullshit in general.

Hopefully more humans will write types in Lean and similar systems as a much better way of writing prompts.



And while AlphaProof is clearly extremely impressive, it does give the computer an advantage that a human doesn't have in the IMO: nobody's going to be constructing Gröbner bases in their head, but `polyrith` is just eight characters away. I saw AlphaProof use `nlinarith`.
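
For a sense of what that buys you, here is the kind of goal `nlinarith` closes in one call (a sketch assuming `import Mathlib`; the bracketed hint is the auxiliary square a human would otherwise have to think of):

    example (x y : ℝ) : x * y ≤ (x ^ 2 + y ^ 2) / 2 := by
      nlinarith [sq_nonneg (x - y)]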



> I want my AI to use all the advantages it has to reinvent the landscape of mathematics

The interesting thing about math, or science, and art in general, comparing it to games like chess or go is that science gives you the freedom to continue to excel as a human while in games we have lost the game and/or the league.

Science and art are infinite and no AI can produce infinity.



This does sound like a lot of fun. Since this AI only reaches silver level, presumably such a contest could be quite competitive, with it not yet being clear cut whether a human or a tool-assisted human would come out on top, for any particular problem.



Can you give some context on how using Lean benefits?

In my understanding, proofs are usually harder to transcribe into Lean, which is why nobody _writes_ proofs using Lean.

What is `nlinarith`?



Don't think of lean as a tool that the ai is using (ie, cheating), think of lean plus AlphaProof as a single system. There's no reason to draw artificial boundaries around where the AI is and where the tools that the AI is using are. Lean itself is a traditional symbolic artificial intelligence system.

People always want to knock generative AIs for not being able to reason, yet we've had automated systems that reason perfectly well for decades; for some reason that doesn't count as AI to people.



I think you might be thinking of the recent project to start formalising Fermat's Last Theorem? The Riemann hypothesis has been easy to state (given what's in Mathlib) for years.



> ... but whenever IMO is brought up, a caveat should be put out: medals are awarded to 50% of the participants (high school students), with 1:2:3 ratio between gold, silver and bronze. That puts all gold and silver medalists among the top 25% of the participants.

Yes, it is true, but getting onto the country-specific team is itself an arduous journey, and involves brutal winnowing every step of the way, e.g. the regional math olympiad, then the national math olympiad, etc.

This is then followed by further training specifically meant for this elite bunch, and maybe further eliminations, etc.

Suffice it to say that qualifying for a country-specific team is IMHO a big deal; getting a gold/silver from amongst that group is just plain awesome!



Some countries pull these kids out of school for an entire year to focus on training for it, while guaranteeing them entry into their nation's top university.

Source: a friend who got silver on the IMO



I'm seriously jealous of the people getting paid to work on this. Sounds great fun and must be incredibly satisfying to move the state of the art forward like that.



I don't know about that. A lot of the work that should have been very satisfying turned out to be boring as hell, if not toxic, while at the same time, some apparently mundane stuff turned out to be really exciting.

I found the work environment to be more important than the subject when it comes to work satisfaction. If you are working on a world changing subject with a team of assholes, you are going to have a bad time, some people really have a skill for sucking the fun out of everything, and office politics are everywhere, especially on world changing subjects.

On the other hand, you can have a most boring subject, say pushing customer data to a database, and have the time of your life: friendly team, well designed architecture, time for experimentation and sharing of knowledge, etc... I have come to appreciate the beauty of a simple thing that just works. It is so rare, maybe even more rare than scientific breakthroughs.

Now, you can also have an awesome work environment and an awesome subject, it is like hitting the jackpot... and a good reason to be envious.



Awesome work environment for one person can be not ideal for another.

Pretty much all the top AI labs are both intensely competitive and collaborative. They consist of many former IMO and IOI medalists. They don't believe in remote work, either. Even if you work at Google DeepMind, you really need to be in London for this project.



The open-source software projects these companies critically depend on are developed by collaborators who have never met in person, and yet these companies still believe you can only do great work in the office.



I work in this space (pretraining LLMs). It looks fancier than it really is. It does involve wrangling huge ymls and writing regular expressions at scale (ok I am oversimplifying a bit). I should be excited (and grateful) that I get to work on these things but shoddy tooling takes the joy out of work.



Huh. So I tried to look it up just now and I'm not sure if I understand the difference. (To the extent that there is one - apparently one can mean the other, but I imagine they're usually used as follows.)

It looks like "jealous" is more being afraid of losing something you have (most commonly e.g. a spouse's affection) to someone else, whereas "envious" is wanting what someone else has?



No worries. I learned this too a while ago (was also using jealous instead of envious and vice versa myself). From my understanding the use of jealous is when you have something but that is threatened by some external factor, eg a partner, a friend having more fun with somebody else. Envious is when you covet something that you do not have currently but wish to, which is in this case playing with exciting tech.



It's funny that if I could describe my entire career, it would probably be something similar to software janitor/maintenance worker.

I guess I should have pursued a PhD when I was younger.



Theorem proving is a single-player game with an insanely big search space; I always thought it would be solved long before AGI.

IMHO, the largest contributors to AlphaProof were the people behind Lean and Mathlib, who took the daunting task of formalizing the entirety of mathematics to themselves.

This lack of formalizing in math papers was what killed any attempt at automation, because AI researchers had to wrestle with the human element of figuring out the author's own notations, implicit knowledge, skipped proof steps...



> Theorem proving is a single-player game with an insanely big search space; I always thought it would be solved long before AGI.

This seems so weird to me - AGI is undefined as a term imo but why would you expect "producing something generally intelligent" (i.e. median human level intelligence) to be significantly harder than "this thing is better than Terrence Tao at maths"?



Because being good/competent at EVERYTHING is harder than being great at any one thing for a computer, since generality requires fluid intelligence but a single specialist skill can be achieved through brute force search, as we've seen over and over again from DeepBlue 30 years ago to the present time.

In the meantime, while Deep Blue beat the world chess champion, Kasparov, at chess, our best efforts at generalism - LLMs that many (not me!) think are the path to AGI - struggle to play tic-tac-toe.



My intuition tells me we humans are generally very bad at math. Proving a theorem, in an ideal way, mostly involves going from point A to point B in the space of all proofs, using previous results as stepping stones. This isn't particularly a "hard" problem for computers which are able to navigate search spaces for various games much more efficiently than us (chess, go...).

On the other hand, navigating the real world mostly consists in employing a ton of heuristics we are still kind of clueless about.

At the end of the day, we won't know before we get there, but I think my reasons are compelling enough to think what I think.



Oh for sure, when I say "soon" it's only relative to AGI.

What I meant to convey, is that theorem proving at least is a well-defined problem, and computers have had some successes in similar-ish search problems before.

Also I don't think pure brute-force was ever used to solve any kind of interesting problem.

Chess engines make use of alpha-beta pruning plus some empirical heuristics people came up with over the years. Go engines use Monte Carlo Tree Search with deep learning models for node evaluation. Theorem proving, when it is solved, will certainly use some kind of neural network.



Replace "this thing is better than Terrence Tao at maths" with "this thing is better than Garry Kasparov at chess" and your statement would sound equally reasonable in the early 90s.



Because "Generally Intelligent" is a very broad and vague term.

"Better than Terrence Tao at solving certain formalized problems" (not necessarily equal to "Better that Terrence Tao at maths) isn't.



Not to join the hate train, but it probably isn't better than Terrence Tao at solving the problems either, since Tao would probably have been able to solve the problems himself and would probably have taken less time. There were participants in this contest who were better than the AI.

Still, an impressive result for sure.



Seems fair - I strongly think that "certain formalized problems" is very very far away from doing actual maths! And that actual maths is sometimes much less about solving known conjectures and much more about developing new theories.



They didn't formalize the entirety of math. Good thing the IMO doesn't need the entirety. But they didn't even formalize enough for the IMO--this is probably why the combinatorics problem wasn't solved.



Machines have been better than humans at chess for decades.

Yet no one cares. Everyone's busy watching Magnus Carlsen.

We are human. This means we care about what other humans do. We only care about machines insofar as it serves us.

This principle is broadly extensible to work and art. Humans will always have a place in these realms as long as humans are around.



I'm sure humans will always enjoy chess and art regardless of how much better AI is at it. In the same way, there will probably always be mathematics hobbyists who study math for fun. But I seriously doubt that in the near future there will be mathematicians who will be publishing new advancements that aren't mostly or entirely discovered by AI. A human might get credit for a proof for asking the initial question, but there is pretty much no world where a computer can easily solve a meaningful mathematical problem but we insist on a human solving it more slowly and expensively instead.



Funny how people don't understand basic logic. If it is a proof in a logic, and the machine checked that proof, it is a proof, no matter that no human actually understands it.

A human doesn't need to understand the proof, they just have to understand why the proof is a proof.



The useful thing about proofs is that they are written in English (or another language), not formal logic. In general they can be mapped to formal logic, though. This means that people can digest them on an intuitive level. The actual goal of a proof is to create new knowledge (via the proof) which can be distributed across the mathematical community. If proofs exist but are not easily comprehensible, then they don’t accomplish this goal.



I think that is unlikely to be the case - the classic example of a proof that humans "can't understand" is the Four Colour Theorem, but that's because the proof is a reduction to like 100000 special cases which are checked by computer.

To what extent is the proof of Fermat's Last Theorem "incomprehensible to humans" because only like a dozen people on the planet could truly understand it - I don't know.

The point of new proofs is really to learn new things about mathematics, and I'm sure we would learn something from a proof of Goldbach's conjecture.

Finally, if it's not peer reviewed then it's not a real proof, eh.



I think the 4-color theorem is rather different though. It reduced to a large number of cases that can in principle each be verified by a human, if they were so inclined (indeed a few intrepid mathematicians have done so over the years, at least partially.) The point of using a computer was to reduce drudgery, not to prove highly non-obvious things.

Thinking back to Wiles' proof of FLT, it took the community several years of intense work just to verify/converge on the correct result. And that proof is ~130 pages.

So, what if the computer produced a provably correct, 4000-page proof of the Goldbach conjecture?



> This principle is broadly extensible to work and art

Nah, as a consumer it makes no difference to me if a meat packing factory or Amazon warehouse employs 5000 or 5 people. To art, this principle is totally real, but for work, it only applies to some/most of it.



Read the next sentence: "We only care about machines insofar as it serves us."

Imagine a machine doing "work" that only serves itself and other machines that does no service to humanity. It would have no economic value. In fact the whole concept of "work" only makes sense if it is assigned economic value by humans.



Then we agree, but your chess example made it sound like if a machine could automatically pack meat with little human intervention, people wouldn't want it. That's also not the next sentence. How is it broadly applicable to work?



I don't think this principle extends to math proofs. It's much, much easier to verify a proof than to create it, and a second proof will just be a footnote. Not many mathematicians will want to work on that. That said, there is a lot of distance between IMO and the frontiers of research math.



There are people that believe that mathematics is actually useful, in ways that chess or art are not. I know, most mathematicians don't think so. But let us just entertain this crazy thought for a moment. Then a proof is just a tool that tells us, oh, we have applied this piece of mathematics right. No understanding of the proof is actually required for that, and no one cares if some mathematician somewhere actually fully understands the proof. It will be OK, even expected, that the machine is better than us at finding and checking proofs.



> Everyone's busy watching Magnus Carlsen.

Actually, I was looking up Elo ratings of the top computer chess players, and learned that it is not that trivial to compare these, due to differences in hardware requirements and whatnot.



Tangentially: I found it fascinating to follow along with the solution to Problem 6: https://youtu.be/7h3gJfWnDoc (aquaesulian is a nod to the ancient name of Bath). There's no advanced math and each step is quite simple; I'd guess it's at about an average 8th-grade level.

Note that the 6th question is generally the hardest (“final boss”) and many top performers couldn’t solve it.

I don't know what Lean is or how to see the AI's proofs, but an AI system that could explain such a question on a par with the YouTuber above would be fantastic!



> First, the problems were manually translated into formal mathematical language for our systems to understand. In the official competition, students submit answers in two sessions of 4.5 hours each. Our systems solved one problem within minutes and took up to three days to solve the others.

Three days is interesting... Not technically silver medal performance I guess, but let's be real I'd be okay waiting a month for the cure to cancer.



I haven't read TFA as I'm at work, but I would be very interested to know what the system was doing in those three days. Were there failed branches it explored? Was it just fumbling its way around until it guessed correctly? What did the feedback loop look like?



I can't find a link to an actual paper; that just seems to be a blog post. But from what I gather the problems were manually translated to Lean 4, and then the program is doing some kind of tree search. I'm assuming they are leveraging the proof checker to provide feedback to the model.



Yeah, as a former research mathematician, I think “fumbling around blindly” is not an entirely unfair description of the research process.

I believe even Wiles in a documentary described his search for the proof of Fermat’s last theorem as groping around in a pitch black room, but once the proof was discovered it was like someone turned the lights on.



They just write "it's like AlphaZero". So presumably they used a version of MCTS where each terminal node is scored by Lean as either correct or incorrect.

Then they can train a network to evaluate intermediate positions (score network) and one to suggest things to try next (policy network).
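
Very roughly, one iteration of that loop might look like the following (a sketch with made-up interfaces - `root`, `policy_net`, `value_net` and `lean` are hypothetical stand-ins, not DeepMind's code):

    import math

    # Hedged sketch of AlphaZero-style proof search: the policy net proposes tactics,
    # the value net scores intermediate proof states, and only Lean's verdict on a
    # terminal state counts as ground truth.
    def mcts_iteration(root, policy_net, value_net, lean, c_puct=1.5):
        path, node = [root], root
        while node.children:                                  # selection (PUCT rule)
            node = max(node.children,
                       key=lambda ch: ch.q + c_puct * ch.prior *
                           math.sqrt(node.visits) / (1 + ch.visits))
            path.append(node)
        if lean.is_proved(node.state):                        # terminal: checker decides
            value = 1.0
        elif lean.is_dead_end(node.state):
            value = 0.0
        else:                                                 # expansion + evaluation
            for tactic, prior in policy_net.propose(node.state):
                node.add_child(tactic, prior)
            value = value_net.score(node.state)               # learned guess, not ground truth
        for n in path:                                        # backup along the path
            n.visits += 1
            n.q += (value - n.q) / n.visits
        return value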



> The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found.

So they had three days to keep training the model, on synthetic variations of each IMO problem.



Don't confuse interpolation with extrapolation. Curing cancer will require new ideas. IMO requires skill proficiency in tasks where the methods of solving are known.



The methods are known, but the solutions to the IMO problems weren't. So the AI did extrapolate a solution.

Also, there's no reason to affirm that an eventual cure for cancer requires fundamentally new methods. Maybe the current methods are sufficient, it's just that nobody has been "smart" enough to put the pieces together. (disclaimer: not an expert at all)



I think you are correct though. We don't need new physics to cure cancer. But we may need information-handling, reasoning and simulation systems which are orders of magnitude bigger and more complex than anything we have this year. We also need to stop pussy-footing and diddling with ideologies and start working on the root cause of cancer and almost every other disease, which is aging.



Search is extrapolation. Learning is interpolation. Search+Learn is the formula used by AZ. Don't forget AZ taught us humans a thing or two about a game we had a 2000-year head start in, starting from scratch rather than from human supervision.



no, search is not extrapolation. Extrapolation means taking some data and projecting out beyond the limits of that data. For example, if my bank account had $10 today and $20 tomorrow, then I can extrapolate and say it might have $30 the day after tomorrow. Interpolation means taking some data and inferring the gaps of that data. For example, if I had $10 today and $30 the day after tomorrow, I can interpolate and say I probably had $20 tomorrow.

Search is different from either of those things, it's when you have a target and a collection of other things, and are trying to find the target in that collection.



Search can go from a random init model to beating humans at Go. That is not interpolation.

- Search allows exploration of the game tree, potentially finding novel strategies.

- Learning compresses the insights gained from search into a more efficient policy.

- This compressed policy then guides future searches more effectively.

Evolution is also a form of search, and it is open-ended. AlphaProof solved IMO problems; those are chosen to be out of distribution, so simple imitation can't solve them. Scientists do (re)search; they find novel insights nobody else discovered before. What I want to say is that search is on a whole different level from what neural nets do: they can only interpolate their training data, while search pushes outside of the known data distribution.

It's actually a combo of search+learning that is necessary; learning is just the little brother of search, compressing novel insights into the model. You can think of training a neural net as search too - for the best parameters that fit the training set.



new doesn't necessarily mean "an extremal point that's not the average of two existing points". The set of existing knowledge is not necessarily continuous; the midpoint between two known points may be unknown, and thus would be a "new" point that could be obtained by interpolation.



I think this is kinda false actually on the cancer side. We have reached a point where we have known approaches that work. It's "just" a matter of putting them into practice, which will of course require solving many little details - very important and time-consuming work - but it doesn't require a super-human genius level of lateral thinking, just a few million man-years of grinding away at it.



The problem solved "within minutes" is also interesting. I'd interpret that as somewhere between 2 and 59 minutes; given the vagueness, probably on the higher end, otherwise they'd celebrate it more. The students had 6 problems in 9 hours, so 1.5 hours per problem on average. If you add the time a human would take to (correctly!) translate a problem into the system's input format, its best-case runtime is probably about as long as a silver medalist would take to solve the problem on their own.

But even if these systems aren't as fast as humans yet, this is very valuable, both as a stepping stone and because, beyond a certain point, compute is much easier to scale than skilled mathematicians.



They say "our systems" (presumably meaning AlphaProof and AlphaGeometry 2) solved one problem "within minutes", and later on the page they say that the geometry question (#4) was solved by AlphaGeometry in 19 seconds.

So either (1) "within minutes" was underselling the abilities of the system, or (2) what they actually meant was that the geometry problem was solved in 19 seconds, one of the others "within minutes" (I'd guess #1 which is definitely easier than the other two they solved), and the others in unspecified times of which the longer was ~3 days.

I'd guess it's the first of those.

(Euclidean geometry has been a kinda-solved domain for some time; it's not super-surprising that they were able to solve that problem quickly.)

As for the long solve times, I would guess they're related to this fascinating remark:

> The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found.



There are known algorithms that can solve _all_ problems in Euclidean (ruler-and-compass) geometry, no intuition required. Even the most effective algorithms of this type are quite inefficient, though, and (at least according to DeepMind) they don't do as well as AlphaGeometry at, e.g., IMO geometry problems.
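For a toy taste of what such intuition-free methods look like (a hand-rolled coordinate computation with sympy, not Wu's method, Gröbner bases, or anything DeepMind uses): coordinatize the configuration and reduce the claim to a polynomial identity.

    from sympy import symbols, simplify

    r, x, y = symbols('r x y', real=True)

    # Thales' theorem: A = (-r, 0) and B = (r, 0) are the ends of a diameter,
    # and C = (x, y) is any point on the circle x^2 + y^2 = r^2.
    # Claim: angle ACB is a right angle, i.e. (C - A) . (C - B) = 0.
    dot_product = (x + r) * (x - r) + y * y   # (C - A) . (C - B)
    hypothesis = x**2 + y**2 - r**2           # equals 0, since C is on the circle

    # Here the conclusion polynomial is literally the hypothesis polynomial;
    # in general one checks membership in the ideal generated by the hypotheses.
    print(simplify(dot_product - hypothesis))  # prints 0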



And suppose they did use 10% of all of GCP: would it be less impressive? This is a result that was considered by experts to be far beyond the state of the art; it's absolutely OK if it's not very efficient yet.

Also, for what it's worth, I'm pretty sure that I wouldn't have been able to solve it myself in three days, even if I had access to all of GCP, Azure and AWS (except if I could mine crypto to then pay actual IMO-level mathematicians to solve it for me).



The writing was on the wall for the last year and a half (in fact I lost a bet to an IMO medalist about AI getting IMO gold by 8/2023) but three years ago this was unimaginable.



The thing is, though, once we pass a benchmark, it’s pretty typical to be able to bring down the time required in short order through performance improvements and iterating on ideas. So if you knew you had AGI but it took 100% of all GCP for 3 years to give a result, within the next 5 years that would come down significantly (not least because you’d build hardware dedicated to accelerating the slow parts).



That's patently false for many classes of problems. We know exactly how to solve the traveling salesman problem, and have for decades, but we're nowhere close to exactly solving a random 1,000,000-city instance (exact solvers have handled instances in the tens of thousands of cities since the 2000s, and approximate methods can find good, but not optimal, tours on millions of cities).

And there are good reasons to believe that theorem finding and proof generation are at least NP-hard problems.
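To illustrate the gap being pointed at here, a toy sketch (plain nearest-neighbour, nothing state of the art): a heuristic produces a reasonable tour on 1,000 random cities almost instantly, while finding and certifying an optimal tour is the expensive part.

    import math
    import random

    random.seed(0)
    cities = [(random.random(), random.random()) for _ in range(1000)]

    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    def nearest_neighbour_tour(points):
        """Greedy heuristic: always move to the closest unvisited city."""
        unvisited = set(range(1, len(points)))
        tour = [0]
        while unvisited:
            last = points[tour[-1]]
            nxt = min(unvisited, key=lambda i: dist(last, points[i]))
            unvisited.remove(nxt)
            tour.append(nxt)
        return tour

    tour = nearest_neighbour_tour(cities)
    length = sum(dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
                 for i in range(len(tour)))
    print(f"heuristic tour length: {length:.2f}")  # decent, almost surely not optimal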



We're not talking about mathematical optimality here, either of the solution found or of the time taken. The point is whether this finds results more cheaply than a human can, and right now it's better on some problems and worse on others. Clearly, if a human can do it, there is a cheaper way to solve it, and it would be flawed reasoning to assume the current runtime is already close to optimal.

While I agree that not all problems show this kind of acceleration in performance, that's typically only true once you've already spent so much time on a problem that you've asymptoted toward the optimal solution. Right now we're nowhere near the asymptote for AI improvements. Additionally, there are so many research dollars flowing into AI precisely because the potential upside is nowhere near realized and there are lots of research lines still left to be explored. Geoffrey Hinton ended the AI winter.



> The point is whether this finds results more cheaply than a human can

If you need to solve 1000 problems in 3 days, you won't find the humans who can do it. Being cheaper doesn't matter if it's not possible at all.



Well, if it takes 10% of all of Google’s servers 3 days to solve one problem, you may find it difficult to scale out to solving 1000 problems in 3 days as well.

As for humans, 100 countries each send 6 students to solve these problems. That doesn’t mean the problems aren’t solvable by anyone else; these are just the “best 6”, where best = can solve them, and solve them most quickly. Given a three-day budget, 1000 problems could reasonably be solved, and you know exactly who to tap to try to solve them. Also, while the IMO is difficult and winners tend to go on to win other awards like Fields Medals, there are many professional mathematicians who never even bother because that type of competition isn’t interesting to them. It’s not unreasonable to expect that professional mathematicians could solve these problems as well if they wanted to spend 3 days on them.

But in terms of energy per solve, humans are definitely cheaper. As you note, the harder part is scaling it out, but so far the AI isn’t solving problems that are impossible for humans; given enough time, it managed to perform the same task. That’s a very promising result, but supremacy is still a ways off (this AI can’t win the competition yet).



Still waiting for the first one. I'm not holding my breath - just like fuzzing found a lot of vulnerabilities in low-level software, I expect novel automated analysis approaches will yield some vulnerabilities - but that won't be a catastrophic event just like fuzzing wasn't.



Why don't you think that AI models will, perhaps rather soon, surpass human capabilities in finding security vulnerabilities? Because an AI that's even equally competent would be a fairly catastrophic event.



It's rumored that the NSA has 600 mathematicians working for them. If they are the ones finding the exploits you will probably never hear about them until they are independently discovered by someone who can publish.



I hope it doesn't find a new class of bug. Finding another thing like Spectre could be problematic.

EDIT - If that new class of bug exists, I hope it is found. I hope it doesn't exist.



It feels pretty disingenuous to claim silver-medal status when your machine played by significantly different rules. The article is light on details, but it says they wired it up to a theorem prover, presumably with feedback sent back to the AI model for re-evaluation.

How many cycles of guess-and-check did it take over the course of three days to get the right answer?

If the IMO contestants were allowed to use theorem provers and were given 3 days (even factoring in sleep) would AlphaProof still have gotten silver?

> let's be real I'd be okay waiting a month for the cure to cancer.

I don't think these results suggest that we're on the brink of knowledge coming at a substantially faster rate than before. Humans have been using theorem provers to advance our understanding for decades. Now an LLM has been wired up to one too, but it still took 8x as long to solve the problems as our best humans did without any computer assistance.



I believe you are misreading this.

First of all, this is not a sport and the point is not to compare AI to humans. The point is to compare AI to IMO-difficulty problems.

Secondly, this is not some hacky trick where brute force and some theorem-prover magic are massaged to solve a select few problems and then you never hear about it again. They are building a general pipeline which turns informal natural language mathematics (of which we have ungodly amounts available) into formalized mathematics, and in addition trains a model to prove such kinds of mathematics. This can also work for theory building. This can become a real mathematical assistant that can help a mathematician test an argument, play with variations of a definition, try 100 combinations of some estimates, apply a classic but lengthy technique etc. etc.
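For concreteness, here is a minimal hypothetical sketch (not one of the actual IMO problems, and not DeepMind's formalization) of what a formalized "determine all X" statement can look like in Lean 4 with Mathlib: the answer set is written explicitly into the statement, and the proof obligation is that it equals the set the problem describes. Lemma names assume a recent Mathlib and may differ across versions.

    import Mathlib

    -- Toy "determine all n" statement: the answer (here, all of ℕ) is part of
    -- the statement itself, and the proof shows the described set equals it.
    theorem toy_answer_set : {n : ℕ | 2 ∣ n + n} = Set.univ := by
      ext n
      simp only [Set.mem_setOf_eq, Set.mem_univ, iff_true]
      exact ⟨n, (two_mul n).symm⟩  -- n + n = 2 * n, so every n is in the set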



> First of all, this is not a sport and the point is not to compare AI to humans. The point is to compare AI to IMO-difficulty problems.

If this were the case, then the headline would be "AI solves 4/6 IMO 2024 problems"; it wouldn't be claiming "silver-medal standard". Medals are generally awarded by comparison to other contestants, not to the challenges overcome.

> This can become a real mathematical assistant that can help a mathematician test an argument, play with variations of a definition, try 100 combinations of some estimates, apply a classic but lengthy technique etc. etc.

This is great, and I'm not complaining about what the team is working on; I'm complaining about how it's being sold. Headlines like these from lab press releases will feed the AI hype in counterproductive ways. The NYT literally has a headline right now: "Move Over Mathematicians, Here Comes AlphaProof".



At the IMO "silver medal" afaik is define as some tange of points, which more or less equals some range of problems solved. For me it is fair to say that "silver-medal performance" is IMO langauge for about 4/6 problems solved. And what's the problem if some clickbait websites totally spin the result? They would've done it anyways even with a different title, and I also don't see the harm. Let people be wrong.



> They are building a general pipeline which turns informal natural language mathematics

but this part currently sucks; they didn't trust it, so they formalized the problems manually.



And we don't have 100% accuracy when translating ambiguous text, because the system often needs domain knowledge, context, etc. Math has zero tolerance for mistakes.

I also expect that math formalized by machine will be machine-readable but hardly understandable by humans.



I'm not sure it matters that it had access to a theorem prover. The fact that it's possible to build a black box that solves hard proofs on its own at all is the fascinating bit.

> it still took 8x as long to solve the problems as our best humans did without any computer assistance.

Give it a year and that ratio will be reversed. At least. But also it matters less how long it takes if doubling the number of things reasoning at a best-human level is pronounced "ctrl-c, ctrl-v".



I am so exhausted by the AI hype nonsense. LLMs are not fucking curing cancer. Not now, not in five years, not in a hundred years. That's not what they do.

LLM/ML is fascinating tech that has a lot of legitimate applications, but it is not fucking intelligent, artificial or otherwise, and I am sick to death of people treating it like it is.



It seems unlikely people will employ only ML models, especially LLMs, to achieve great results: they will combine them with human insight (through direction and concrete algorithms).

It's obvious that's happening with LLMs even today, to ensure they don't spew out too much bullshit or harmful content. So let's first get to a point where we can trust AI as-is, and then talk about what's needed for the next milestone if and when we get there.

And I love asking every new iteration of ChatGPT/Gemini something along the lines of "What day was yesterday if yesterday was a Thursday?" It just makes me giggle.



A significant part of the problem is that a majority of people are unaware of just how simple what they consider "intelligence" really is. You don't need actual intelligence to replace the public-facing social role of a politician, or a talking head, or a reactive-only middle manager. You just need words strung together that fit a problem.
