LLM Performance in Elementary Arithmetic: A Study on Mathematical Reasoning

Can LLMs Really Do Math? An Investigation into the Mathematical Understanding of Large Language Models

Large language models (LLMs) continually impress with their capabilities in various areas, from text generation to translation. But how deep is their understanding of more complex concepts, especially in the field of mathematics? A recent study investigates this question using elementary addition and puts the supposed computational abilities of LLMs to the test.

The research focuses on the addition of two integers in the range of 0 to 2^64 and examines two fundamental mathematical properties: commutativity (A+B = B+A) and compositional generalization. The latter refers to whether LLMs can transfer learned rules to isomorphic symbolic representations, e.g., whether they understand that "7" and "y" can carry the same mathematical meaning in a given context.
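To make these two probes concrete, the following minimal Python sketch shows how such test prompts could be constructed. The symbol alphabet, function names, and prompt wording are illustrative assumptions, not the study's actual templates:

```python
import random

# Illustrative sketch of the two probes described above; the symbol
# alphabet and prompt wording are assumptions, not the paper's templates.

DIGITS = "0123456789"
SYMBOLS = "abcdefghij"  # arbitrary stand-ins, e.g. '7' is written as 'h'
TO_SYMBOL = str.maketrans(DIGITS, SYMBOLS)

def commutativity_probe(a: int, b: int) -> tuple[str, str]:
    """Two prompts whose answers must agree if the model respects A+B = B+A."""
    return f"What is {a} + {b}?", f"What is {b} + {a}?"

def symbolic_probe(a: int, b: int) -> str:
    """The same sum rewritten in an isomorphic symbol alphabet. A model that
    has learned the rules of addition, rather than memorized digit patterns,
    should still carry and sum correctly."""
    sa, sb = str(a).translate(TO_SYMBOL), str(b).translate(TO_SYMBOL)
    return (f"The digits {DIGITS} are written as {SYMBOLS}. "
            f"What is {sa} + {sb} in this notation?")

if __name__ == "__main__":
    a, b = random.randrange(2**64), random.randrange(2**64)
    print(*commutativity_probe(a, b), sep="\n")
    print(symbolic_probe(a, b))
```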

While state-of-the-art LLMs achieve an accuracy of 73.8-99.8% on numerical addition, their performance drops below 7.5% on symbolic representations. This suggests that the models have not generalized the underlying rules of addition. Further observations support this interpretation: performance does not scale monotonically with the number of digits, and the models frequently violate the commutative law (over 1,700 cases of A+B ≠ B+A).
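A commutativity check of this kind is easy to express in code. The sketch below assumes the model's answers have already been collected into a dictionary keyed by operand pairs; the data structure and function name are hypothetical:

```python
def count_commutativity_violations(answers: dict[tuple[int, int], int]) -> int:
    """Count operand pairs where the model answered A+B and B+A differently.

    `answers` maps (a, b) to the integer the model returned for "a + b";
    this structure is an assumption made for illustration.
    """
    violations = 0
    for (a, b), result in answers.items():
        swapped = answers.get((b, a))
        if swapped is not None and swapped != result:
            violations += 1
    return violations // 2  # each violating unordered pair is seen twice
```

Run over the collected answers, a nonzero count directly quantifies how often the commutative law is violated.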

Surprisingly, the performance of the LLMs deteriorates by 81.2% on average when they are explicitly given the rules of addition. At the same time, accuracy remains at baseline when the models are asked to explain their calculations. These results suggest that the way LLMs process arithmetic tasks does not align with human-defined mathematical principles.
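The three prompting conditions compared in the study (plain query, explicit rules, and self-explanation) can be illustrated with simple templates. The exact wording used by the authors is not reproduced here; these strings are assumptions for illustration:

```python
# Hypothetical prompt templates for the three conditions; the study's
# actual wording is not reproduced here.
BASELINE = "What is {a} + {b}? Answer with the number only."

EXPLICIT_RULES = (
    "Add the two numbers digit by digit from right to left, carrying a 1 "
    "whenever a column sums to 10 or more. What is {a} + {b}?"
)

SELF_EXPLANATION = "What is {a} + {b}? Explain each step of your calculation."

def build_prompts(a: int, b: int) -> dict[str, str]:
    """Instantiate all three prompt variants for one operand pair."""
    templates = {
        "baseline": BASELINE,
        "explicit_rules": EXPLICIT_RULES,
        "self_explanation": SELF_EXPLANATION,
    }
    return {name: tpl.format(a=a, b=b) for name, tpl in templates.items()}
```

The counterintuitive finding is that the second variant hurts accuracy while the third leaves it unchanged.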

The study concludes that current LLMs rely more on pattern recognition and retrieval of stored information than on a genuine understanding of mathematical rules. This exposes the limitations of current LLM architectures and underscores the need for new approaches that enable true mathematical reasoning in these models. The findings raise important questions about the future development of LLMs and call for further investigation into the limits and potential of this technology.

The results of this study are particularly relevant for companies like Mindverse, which specialize in the development and application of AI solutions. A deeper understanding of the functionality and limitations of LLMs is crucial for the development of robust and reliable AI applications, whether in chatbots, language assistants, AI search engines, or knowledge databases. The insights from this study can help advance the development of LLMs and fully realize their potential for future applications.

Bibliography:
https://arxiv.org/abs/2504.05262
https://arxiv.org/html/2504.05262v1
https://www.researchgate.net/publication/390601823_Do_PhD-level_LLMs_Truly_Grasp_Elementary_Addition_Probing_Rule_Learning_vs_Memorization_in_Large_Language_Models
https://www.themoonlight.io/review/do-phd-level-llms-truly-grasp-elementary-addition-probing-rule-learning-vs-memorization-in-large-language-models
https://deeplearn.org/arxiv/593967/do-phd-level-llms-truly-grasp-elementary-addition?-probing-rule-learning-vs.-memorization-in-large-language-models
https://www.themoonlight.io/fr/review/do-phd-level-llms-truly-grasp-elementary-addition-probing-rule-learning-vs-memorization-in-large-language-models
https://paperreading.club/page?id=297979
https://chatpaper.com/chatpaper/fr?id=3&date=1744041600&page=1
https://github.com/Xuchen-Li/llm-arxiv-daily
https://www.trendingpapers.com/similar?id=2412.07386