1 Introduction

The current state of the art for the creation of multibody dynamics simulation models can be classified into graphical user interface (GUI)-based methods, as in the commercial codes Adams (Hexagon) or RecurDyn (FunctionBay), specialized input files or script languages [9, 17], and the use of the underlying programming language of the simulation code [10, 18].

Natural Language Processing (NLP), an integral part of artificial intelligence, empowers computers to comprehend, interpret, and generate human language [15]. It explores diverse aspects of language, such as syntax – the structure of word arrangement in sentences – and semantics, the meaning of words in context. NLP has been profoundly impacted by the introduction of Large Language Models (LLM), particularly the transformer architecture [25]. These LLM, like the Generative Pre-trained Transformer (GPT), have a massive number of parameters and have shown capabilities beyond NLP, such as code completion and generation [1]. Furthermore, they have potential for transforming natural-language descriptions into programmed simulation models.

Recent developments show an upward trend in the size of LLM, see also Fig. 1. Kaplan et al. show in their study on scaling laws that the cross-entropy loss correlates with dataset size, computational power, and the number of parameters [16]. A research team at Google DeepMind found that training data should be scaled together with the model size [12]. Notably, GPT-3 showcased the improved few-shot performance of LLM [3], and public attention peaked with the introduction of ChatGPT in 2022. A competing chat system based on LLM is Google Bard, which initially used the Language Model for Dialogue Applications (LaMDA) [24], a 137-billion parameter model. As of May 2023, Bard has transitioned to PaLM-2, whose size remains undisclosed but, due to compute-optimal scaling of the parameter count, is smaller than its predecessor PaLM with 540 billion parameters. Chowdhery et al. [5] noted that in the development of their 540-billion parameter Pathways Language Model (PaLM) the effect of the model scale did not yet seem to be saturated. According to a leak, the GPT-4 model has 1760 billion parameters, but this is not officially disclosed.

Fig. 1: Number of parameters of LLM over the past five years. Significant advances were made by Megatron in 2019 and GPT-3 in 2020. The number of parameters of GPT-4 is not officially disclosed but was presumably leaked

A key challenge for LLM development is the curation and preparation of datasets. For instance, GPT-3 utilized a diverse mix of sources, including WebText2 and Wikipedia, totaling \(499\cdot 10^{9}\) tokens [3]. Similarly, PaLM was trained on \(780 \cdot 10^{9}\) tokens from various sources, such as GitHub and social-media conversations [5]. Ensuring that test tasks are not part of the training dataset is crucial to prevent dataset contamination.

Due to the large amount of data required for the training of LLM, HuggingFace has emerged as a central platform, offering a rich repository of training datasets essential for LLM development [28]. Its more than 69k datasets, ranging from tiny to very large ones, include comprehensive sets such as Wikipedia and GitHub.

1.1 LLM in science and engineering

Specialized LLM have been created, e.g., for science, see Galactica [23], for software engineering [13], or a method focusing on micromechanics, see Ref. [4]. SciBERT [2], based on a pretrained BERT (Bidirectional Encoder Representations from Transformers) [6] base model, has been specifically trained on a broad corpus of scientific publications and therefore outperforms BERT and surpasses previous models in several categories. Notably, models like BERT and SciBERT have a relatively compact size, allowing them to be trained without computers with petaFLOPS-range performance. Furthermore, the BERT model has been expanded into MatSciBERT [11], an LLM specifically focused on materials science.

1.2 Aims and methods

This paper introduces an approach to generate (multibody) dynamics simulation models using natural language. We briefly review GPT models and their current status, relying on open-source models and on closed models as far as data is accessible. Our exploration seeks to understand the capabilities of LLM that have been trained on code, in particular simulation codes, in order to showcase the current capabilities of existing LLM and reflect on future advancements. We provide a brief overview of the Exudyn Python interface for setting up simulation models and its status at the time the training data was created, which is September 2021. To validate our approach, we present a dedicated test set and evaluate performance across various LLM, with full-text responses available in the supplementary material. Using LLM to create Python code for geometrical modeling has been demonstrated, for example, with Blender [7]. In the present paper, we focus solely on the evaluation of existing LLM. We apply in-context learning [22], which is widely accepted as a means to improve the performance of LLM, but do not attempt full-scale training or fine tuning of open-source LLM.

1.3 Mechanical and simulation models

It should be mentioned that the creation of a simulation model implies an underlying mechanical model, which is why the term (mechanical) model is used in the following as a synonym for both the simulation and the mechanical model. To produce a (mechanical) model from natural language, an LLM needs more than just general NLP capabilities. It also requires foundational knowledge in mechanical engineering, including geometry, kinematics, statics, and dynamics. Training an LLM solely on the documentation of a simulation code would not be sufficient.

2 Brief notes on LLM and transformers

In this research we look into different LLM, namely GPT-3.5 and GPT-4 [19], which are proprietary models developed by OpenAI and well known through the ChatGPT application. Furthermore, Google’s Bard, which currently uses PaLM-2, as well as LLaMA-2 from Meta AI are considered in the present paper.

There are only technical reports on GPT-3.5, GPT-4, and PaLM-2, with little information on the underlying models. Many more details are available for LLaMA, which is why we focus on more detailed descriptions of LLaMA and LLaMA-2. However, the basic transformer structure is the same for all models used here, as far as the technical reports disclose.

2.1 Transformers and large language models

The transformer architecture is a neural-network design introduced in 2017 [25], primarily for sequence-to-sequence tasks in natural-language processing. Transformers depart from recurrent neural networks and long short-term memory (LSTM)-based models by relying entirely on attention mechanisms to draw global dependencies between input and output. The architecture consists of an encoder and/or a decoder, each comprising multiple layers of self-attention and feedforward neural networks. The key innovation, attention [25], allows the model to weigh the significance of different words in a sequence, enabling it to capture long-range dependencies and context effectively. This design has become foundational in the subsequent development of large-scale language models; the originally proposed architecture is shown in Fig. 2.

Fig. 2: The original transformer architecture and the operations performed in the multi-headed attention block [25]

Transformers have since been further developed, for example into the early model BERT [6, 21] and its subsequent iterations. Generative Pre-trained Transformers (GPT) are a subset of models specifically designed for language tasks; their core structure is a decoder-only variant of the transformer. Characterized by their massive number of parameters, usually between a billion and a trillion, these models excel in generating coherent, contextually relevant text over lengthy passages. The feedforward neural networks inside them, which are key elements next to the various attention heads of the transformer, encompass tens to hundreds of millions of parameters, necessitating extensive training data.

The central feature of the transformer is the self-attention mechanism, which enables a word to compute its context by checking the relevance of all other words in the sentence, see Fig. 2. This is done using Query (Q), Key (K), and Value (V) vectors. Herein, the query represents the current word or token that the attention mechanism is focusing on and determines how much attention should be paid to other tokens. The key represents all tokens in the input and scores each token’s relevance to the current focus. The value provides the content from the input tokens, which, when weighted by the attention scores, gives the output for the current token. In the multi-head attention mechanism of the transformer, multiple attention heads work in parallel, each producing its own output. These outputs are then concatenated and linearly transformed to produce the final output. The model also includes positional encodings to understand the sequence of words, as transformers do not inherently recognize order. The “pretrained” aspect means that GPT models are first trained on vast amounts of text to predict the next word in a sequence. This foundational training equips them with knowledge about grammar, context, and general information. They can be further fine-tuned for specific tasks, such as chat or code completion. To generate text, a GPT takes a user-defined text (prompt) as input and predicts subsequent words in sequence.
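To make the mechanism concrete, the following minimal sketch (our addition, not code from any of the discussed models) implements scaled dot-product attention for a single head, \(\mathrm{Attention}(Q,K,V) = \mathrm{softmax}(QK^{T}/\sqrt{d_{k}})\,V\), in plain NumPy; the dimensions and random inputs are purely illustrative:

  import numpy as np

  def softmax(z):
      e = np.exp(z - z.max(axis=-1, keepdims=True))  # numerically stable
      return e / e.sum(axis=-1, keepdims=True)

  def attention(Q, K, V):
      # scaled dot-product attention of [25]: softmax(Q K^T / sqrt(d_k)) V
      d_k = Q.shape[-1]
      scores = Q @ K.T / np.sqrt(d_k)   # relevance of every token to every other token
      return softmax(scores) @ V        # attention-weighted combination of the values

  rng = np.random.default_rng(0)
  X = rng.standard_normal((4, 8))       # 4 tokens, embedding dimension 8
  Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
  out = attention(X @ Wq, X @ Wk, X @ Wv)   # output of one attention head

In multi-head attention, several such heads run in parallel; their outputs are concatenated and passed through a final linear layer, as described above.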

In the context of programming, LLM can assist developers by predicting the next lines of code, suggesting code optimizations, or even generating entire code snippets based on a given prompt. LLM showed enormous capabilities in code completion and code generation rather early [1]. By training on vast repositories of code, including code examples, documentation, issue trackers, bug reports, discussion forums, and Q&A pages, these models have gained an understanding of various programming languages and their idiomatic patterns. This not only accelerates the coding process [20] but also aids in reducing bugs and enhancing code quality. If all code resources are available on open-source platforms, such as GitHub, larger LLM may already have been trained on that code.

In addition to direct code completion, LLM have demonstrated the ability to translate natural-language descriptions into functional code. This means that a developer can provide a plain English request, such as “create a function that calculates all prime numbers smaller than 20 in C++,” and the model can generate the corresponding code. This capability bridges the gap between domain experts without coding expertise and software development, enabling more intuitive and collaborative software-design processes.

In particular, LLM may also be trained with the repository of a simulation code, including documentation, the code itself, as well as code examples. However, in order to generate a simulation model from natural language, besides general NLP capabilities, the LLM also requires mechanical-engineering knowledge such as geometry, kinematics, statics, or dynamics. Therefore, training an LLM solely with the documentation of a simulation code could be insufficient.

2.2 The LLaMA model

LLaMA-2, introduced in July 2023 as a collaboration between Meta AI and Microsoft, represents a further refinement of LLaMA-1, often denoted simply as LLaMA. It was launched in three model sizes: 7B, 13B, and 70B parameters. While retaining the architecture of LLaMA-1, LLaMA-2 boasted a 40% increase in training data, with its foundational models being trained on a 2-trillion token dataset; the dataset was curated to omit sites that might disclose personal information and emphasized the inclusion of trustworthy sources. The model allows a maximum context length of 4096 tokens. While LLaMA-2 is freely available for many commercial applications, debates persist regarding its status as open source. LLaMA is based on the transformer architecture, which has been a cornerstone for language models since 2018. Notably, in contrast to GPT-3, LLaMA uses the SwiGLU activation function instead of ReLU, adopts rotary positional embeddings instead of absolute positional embeddings, and applies root-mean-square layer normalization.
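The two last-mentioned ingredients can be sketched in a few lines of NumPy; this is a simplified illustration of the published formulas, not code from the LLaMA implementation:

  import numpy as np

  def rms_norm(x, g, eps=1e-6):
      # root-mean-square layer normalization: rescaling without mean subtraction
      return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps) * g

  def swiglu(x, W, V):
      # SwiGLU: Swish(x W) gated elementwise by (x V), with Swish(a) = a * sigmoid(a)
      a = x @ W
      return (a / (1.0 + np.exp(-a))) * (x @ V)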

To operate LLaMA-2, sufficient memory, either as RAM or as VRAM on a GPU, is essential for model inference (obtaining the output from a trained model). The 7-billion parameter model demands 14 GB since the parameters are stored in half-precision floating-point format (16 bit). The larger models, with 13 billion and 70 billion parameters, require 26 GB and 138 GB, respectively. It is worth noting that these memory requirements are for inference only; training demands considerably more memory to store the gradients.
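These numbers follow directly from two bytes per parameter, as the short estimate below shows; note that the 138 GB reported above for the 70B model is slightly below the raw product \(2 \cdot 70 \cdot 10^{9}\) bytes:

  for n_params in (7e9, 13e9, 70e9):
      # half precision: 2 bytes per parameter, inference only (no gradients)
      print(f"{n_params/1e9:.0f}B parameters -> {2*n_params/1e9:.0f} GB")
  # prints: 7B -> 14 GB, 13B -> 26 GB, 70B -> 140 GB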

In addition to the vast memory requirements, the computational resources for training or fine tuning exceed what most researchers have available. For LLM, the number of operations needed for training is also called compute or training compute and is often given in floating-point operations (FLOPS), GPU-hours, or petaFLOPS-days, as the number of operations becomes very high. The state-of-the-art model LLaMA-2 70B, which is still too small to solve the problems mentioned in the present paper, required 1.7 million GPU-hours of training and very large data sources; performing similar training tasks would require at least a cluster with 1000 GPUs and appropriate data-transfer rates. For GPT-3, approximately \(3.14 \cdot 10^{23}\) FLOPS, or 3634 petaFLOPS-days, of compute were used. This leads to the conclusion that training an existing model from scratch is fully out of reach for the present research.

An alternative approach is Low-Rank Adaptation (LoRA) [14] of LLM. With LoRA, the weights of the pretrained model are frozen and additional trainable rank-decomposition matrices are injected into each layer of the transformer architecture. The authors claimed that this approach reduces the number of trainable parameters by a factor of 10 000 and the GPU memory requirement by a factor of 3 for the 175-billion parameter version of GPT-3, with performance similar to full fine tuning.
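The idea can be illustrated with a small NumPy sketch of a single adapted layer; the dimensions, rank, and scaling factor are illustrative and do not reproduce the exact setup of [14]:

  import numpy as np

  d, r = 512, 8                            # layer width and LoRA rank (illustrative)
  rng = np.random.default_rng(0)

  W0 = rng.standard_normal((d, d))         # pretrained weight, frozen during fine tuning
  A = 0.01 * rng.standard_normal((r, d))   # trainable rank-decomposition matrix (r x d)
  B = np.zeros((d, r))                     # trainable; zero init => no change at start

  def lora_forward(x, alpha=16.0):
      # adapted layer: W0 x + (alpha/r) * B A x; only A and B are trained
      return W0 @ x + (alpha / r) * (B @ (A @ x))

Instead of \(d^{2}\) trainable parameters per layer, only \(2rd\) remain, which explains the drastic reduction in trainable parameters and optimizer memory.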

3 Multibody simulation models built from Python code

Since 2019, the Python library Exudyn [10] has been developed for the creation and simulation of multibody system models. Several popular LLM have been trained on this code and have some basic capabilities to create models from natural language. In particular, the training data of the evaluated versions of GPT-3.5 and GPT-4 ends in September 2021, which is of particular interest for the subsequent investigations.

The Exudyn repository was uploaded to GitHub and made publicly available on January 9, 2020. On September 14, 2021 (Exudyn version 1.1.0), the library had only 4 forks and 15 stars, the latter being a common criterion for selecting repositories from GitHub for training. OpenAI (GPT-3.5 and GPT-4) does not provide information about which GitHub repositories have been used for training. The literature on code-focused LLM mentions rather large thresholds for the number of GitHub stars, at least 50 stars in the case of the dataset used by PolyCoder [29]. An explanation why several LLM, as shown later, chose Exudyn for training may be the completely open and clearly stated BSD-3 license, a larger set of files compared to many higher-ranked repositories, and a large number of annotated examples. In particular, the significant amount of documentation, thorough code commenting, and file headers may have further influenced the decision to train on this code, as this information is essential in the context of code completion. The highly structured data and widely documented code in the repository, which included 88 examples and 56 test models in version 1.1.0, may have been a further reason for being chosen for training by several LLM.

3.1 Setup of models in Exudyn

Exudyn models are created solely using the Python language. Python is also available in other multibody codes, such as ProjectChrono [18] or PyDy [8]. In contrast to the latter codes, Exudyn includes a large set of annotated examples. Furthermore, the setup of a rigid-body model, studied within the present paper, follows a simple and systematic approach based on redundant coordinates and constraints.

After importing basic Python libraries, a simulation model is set up by creating a new system, usually denoted as mbs. Hereafter, different items are added, such as nodes, objects, markers, loads, and sensors. Nodes are added for the definition of (unknown) kinematic quantities. Computational objects are then added to represent bodies, connectors, or joint constraints. The relation between nodes and bodies is rather simple, such as mass points requiring point nodes or rigid bodies requiring rigid-body nodes, e.g., based on Euler parameters. Connectors, such as spring-dampers, as well as constraints are attached to markers, which in turn are attached to bodies or nodes. For example, a spring-damper is created by providing two markers, a spring constant, a damping constant, and a reference length.

All of these items can be added to the mbs by simple commands such as n=mbs.AddNode(...) or o=mbs.AddObject(...), in which n is the returned node index and o is the corresponding object index. The definition of a node or object is embedded into a class structure, such that a mass point with 5 kg, attached to node n, is created by writing:

  • o=mbs.AddObject(MassPoint(nodeNumber = n, physicsMass=5))

The definition of a multibody system ultimately depends strongly on the geometry, joint constraints, inertia, and mechanical parameters. In order to create the Python code, the underlying mechanical model needs to be known to the LLM.

The subsequent functions for performing a transient simulation are straightforward and are well reproduced by state-of-the-art LLM. The basic steps for starting the simulation are:

  1) finalizing the multibody system: mbs.Assemble();

  2) setting up simulation settings: sims = exu.SimulationSettings();

  3) adjusting the simulation-settings parameters;

  4) calling the solver: mbs.SolveDynamic(sims).

There are some variants, such as static solvers as well as special explicit and implicit solver types, not mentioned here. Most of the examples also include commands to add visualization and to start the 3D visualization during or after the simulation, which is therefore also reproduced by LLM. A minimal model combining the item definitions and the simulation steps described above is sketched below.
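The following listing assembles these steps into a minimal mass point supported by a spring-damper under a constant force; it is a sketch based on the commands described in this section, the parameter values are illustrative, and exact item or argument names may differ slightly between Exudyn versions:

  import exudyn as exu
  from exudyn.itemInterface import *

  SC = exu.SystemContainer()
  mbs = SC.AddSystem()

  # ground object with marker to attach the spring-damper
  oGround = mbs.AddObject(ObjectGround(referencePosition=[0, 0, 0]))
  mGround = mbs.AddMarker(MarkerBodyPosition(bodyNumber=oGround, localPosition=[0, 0, 0]))

  # node (kinematic quantities) and mass point (computational object)
  n = mbs.AddNode(NodePoint(referenceCoordinates=[1, 0, 0]))
  o = mbs.AddObject(MassPoint(nodeNumber=n, physicsMass=5))
  mMass = mbs.AddMarker(MarkerNodePosition(nodeNumber=n))

  # spring-damper and constant load, both attached via markers
  mbs.AddObject(SpringDamper(markerNumbers=[mGround, mMass],
                             stiffness=100, damping=2, referenceLength=1))
  mbs.AddLoad(LoadForceVector(markerNumber=mMass, loadVector=[10, 0, 0]))

  # steps 1)-4): assemble, create and adjust settings, solve
  mbs.Assemble()
  sims = exu.SimulationSettings()
  sims.timeIntegration.endTime = 2
  sims.timeIntegration.numberOfSteps = 1000
  mbs.SolveDynamic(sims)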

3.2 In-context learning

Even though the approach in Exudyn is highly systematic and versatile, it appears more complicated than in other multibody simulation software. One reason for the higher complexity is the availability of flexible bodies, such as beams or modally reduced bodies, which require different approaches, e.g., to represent finite-element nodes. As a main difference from other Python modeling codes that most LLM have been trained on, rigid bodies require underlying nodes and cannot be created by one single function. Furthermore, joints are attached to markers and cannot be directly attached to bodies; the same holds for loads. This systematic but more complex approach often leads to wrong assumptions by LLM (and also confuses human users). As a solution, simplified functions have been added to Exudyn since May 2023, all of them available in version 1.7.0. Simplified functions obtain a prefix Create and can be called directly on the multibody system, such as mbs.CreateRigidBody(...), which adds a rigid body to the system with simple arguments and also allows gravity to be added without defining loads or markers.
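As a hedged sketch of the simplified interface, a rigid pendulum body could be set up along the following lines; the function names follow the Create-prefix convention described above, but exact signatures should be verified against the Exudyn 1.7.0 documentation:

  import exudyn as exu
  from exudyn.utilities import *   # item interface, InertiaCuboid, etc.

  SC = exu.SystemContainer()
  mbs = SC.AddSystem()

  # rigid body with gravity, without manually defining nodes, markers, or loads
  body = mbs.CreateRigidBody(referencePosition=[0.5, 0, 0],
                             inertia=InertiaCuboid(density=1000, sideLengths=[1, 0.1, 0.1]),
                             gravity=[0, -9.81, 0])

  # revolute joint to the ground, created directly between bodies
  mbs.CreateRevoluteJoint(bodyNumbers=[mbs.CreateGround(), body],
                          position=[0, 0, 0], axis=[0, 0, 1])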

In order to provide the information on the simplified commands, in-context learning is used in many of our tests. The following listing shows the first 22 lines of this information, later denoted as context information for mass points, which is pasted into the chat prompt at the beginning of each session:

(Listing: first lines of the context information for mass points; see the supplementary material for the full text.)

This context contains only information on systems with mass points, distance constraints, and spring-dampers and is represented by 680 tokens in GPT-3. For rigid-body systems, a more comprehensive file is used, see the supplementary material, which is represented by 2842 tokens and therefore requires a minimum context length of 4096 tokens to read the input and additionally generate some useful output. The specific comments at the beginning of the context information were added because initial tests resulted in recurring syntax errors, such as using 2D vectors instead of 3D.

The objective of the remainder of the paper is to evaluate the capabilities of different GPT-type models regarding the accurate creation of multibody dynamics models. In particular, the number of errors in the generated Python models is used to evaluate and compare different approaches. In order to improve performance, some recent simplifications, which have been added to Exudyn’s Python interface, are made available to the LLM in the local context. As we will show, the simplified modeling, as well as the additional context, boosts performance in particular for rigid-body systems.

4 Examples and tests

In the present research, we present six categories of examples for the evaluation of the performance of different LLM. The list of examples, related to knowledge of Exudyn and the creation of basic dynamic and multibody systems, is summarized in Table 2. All examples are tested and evaluated with several LLM, namely GPT-3.5, GPT-4, Bard (PaLM-2), and LLaMA-2, see Table 1 for details. GPT-3.5 and GPT-4 are accessed using ChatGPT.

Table 1 Overview of the used LLM. References are given to release notes and homepages; an asterisk marks undisclosed values. The number of parameters is only officially known for GPT-3 as well as LLaMA-2. The parameters for GPT-4 are from discussion forums and unofficial leaks. The number of parameters and max. tokens for Bard are those of PaLM-2, as Bard is said to be powered by PaLM-2

Note that the maximum number of tokens is crucial for our tests, as it limits the input as well as the output text. In order to be processed more efficiently, LLM convert text into tokens, often with a vocabulary of approximately 32 000 different tokens. Many common English words are represented by a single token. The numbers of tokens provided in Table 2 were obtained with Tokenizer. We assume that all LLM in this paper represent text by a similar number of tokens.
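Token counts as in Table 2 can also be reproduced programmatically; the following sketch uses the tiktoken library with the GPT-3.5/GPT-4 encoding, which we assume behaves equivalently to the online Tokenizer tool:

  # requires: pip install tiktoken
  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")   # encoding used by GPT-3.5/GPT-4
  text = "Create a mass point with mass of 5 kg."
  tokens = enc.encode(text)
  print(len(tokens))   # number of tokens consumed by this prompt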

Table 2 Examples and context information for the creation of basic dynamic and multibody dynamics systems; number of tokens counted with Tokenizer. For Examples 4–6.1, the additional context information is not included in the token count

All examples are given in text form only, using natural language with no or only a few technical instructions; e.g., there is no hint on which solver or which method to use. As all LLM generate nondeterministic results, tests are usually repeated with three trials. The expectation of the tests is to obtain Python code from the LLM that can be directly processed in Python using the Exudyn package. Tests are evaluated based on correct code syntax and on correct modeling of the dynamic system. In order to clearly distinguish between the two error types, syntax errors (counted by \(e_{syn}\)) are all errors that raise a Python error when the code is executed, except for Exudyn’s solver failures due to modeling errors. All remaining errors are model errors (counted by \(e_{mod}\)), which are more severe because the user has to detect such errors. Note that syntax errors can be considered even less severe, because they can often be resolved by feeding the error back to the LLM – as we will show in some of the examples.

4.1 Example 1: Create a mass–spring-damper in Python/SciPy

Example 1 aims at creating the mathematical model and computing the solution of a linear mass–spring-damper undergoing a constant force. The example is used to compare the different LLM, as all of them are able to generate Python code and to use SciPy [26]. It is not fully known to what extent the considered LLM have been trained on SciPy. However, looking at larger datasets available on HuggingFace [28], we find typical datasets that are solely related to Python, comprising 18k instructions, of which 67 are directly related to SciPy.

The definition of the model in text form is given in Table 3a). For clarity, the expected model is shown in Fig. 3a), which is not available to the LLM. Example 1.1 uses a slight variation of the input, see Table 3b), in order to evaluate the sensitivity to the specific input text. The results of these tests are summarized in Table 4 and some of the responses are given in Appendix A.1. It is clearly shown that all LLM have been trained on SciPy; however, LLaMA-2 generated many errors and even wrong Python syntax.
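A correct response essentially amounts to integrating \(m\ddot{x} + d\dot{x} + kx = F\) with SciPy; the following sketch shows one such solution with illustrative parameter values (the actual values are specified in the prompts of Table 3):

  import numpy as np
  from scipy.integrate import solve_ivp

  m, d, k, F = 1.0, 0.2, 10.0, 1.0      # mass, damping, stiffness, constant force

  def rhs(t, y):
      x, v = y                           # position and velocity
      return [v, (F - d*v - k*x) / m]    # from m*x'' + d*x' + k*x = F

  sol = solve_ivp(rhs, [0, 10], [0.0, 0.0], max_step=0.01)
  print(sol.y[0, -1])                    # displacement at the final time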

Fig. 3: Test examples based on mass points; a) is used for Example 1, Example 1.1, Example 3, and Example 4; b) for Example 4.1, c) for Example 5, and d) for Example 5.1

Table 3 Prompts for mass–spring-damper system, see Fig. 3a) using SciPy. a) is used for Example 1; b) for Example 1.1
Table 4 Number of modeling errors (\(e_{mod}\)) and syntax errors (\(e_{syn}\)) of Example 1 and Example 1.1 with SciPy

4.2 Example 2: Do you know Exudyn?

Before continuing with further Exudyn examples, we assess whether an LLM has been trained on Exudyn. While a definitive answer is elusive, we can infer from responses to specific questions. Representatively, the question “Do you know Exudyn?” is asked in a new prompt without any context. While all models answer yes, we check the answer for the following keywords that are specific to Exudyn, namely multibody, simulation, rigid and flexible bodies, connectors, Python, and C++. If most of these keywords are present in the answer, we consider the LLM to know the library, meaning that the according GitHub files have been used for training.
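The keyword evaluation can be expressed as a simple helper function; this is a hypothetical reimplementation of the described check, where the numerical threshold for “most keywords” is our assumption:

  # keywords considered specific to Exudyn, see above
  KEYWORDS = ["multibody", "simulation", "rigid", "flexible",
              "connector", "python", "c++"]

  def knows_exudyn(answer: str, threshold: int = 5) -> bool:
      # count how many Exudyn-specific keywords appear in the LLM answer
      hits = sum(kw in answer.lower() for kw in KEYWORDS)
      return hits >= threshold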

The results of these tests are summarized in Table 5 and the responses are given in Appendix A.2. It is clearly shown that only GPT-3.5, GPT-4, and Bard have been trained with sources and documentation of Exudyn, while LLaMA-2 cannot generate information clearly related to Exudyn; in particular, its answer shows a wrong focus on powder dynamics.

Table 5 Results of Example 2 for the keyword question “Do you know Exudyn?”. The LLM was trained (yes) or not trained (no) with sources and documentation of Exudyn

4.3 Example 3: Create a mass–spring-damper in Exudyn

This example aims at creating the simulation model of a linear mass–spring-damper undergoing a constant force with Exudyn. The example is used to compare the different LLM that have been trained on Exudyn. The definition of the model in text form is given in Table 6. For clarity, the model is shown in Fig. 3a), which is not available to the LLM. The results of these tests are summarized in Table 7. It is clearly shown in Table 7 that only GPT-3.5 and GPT-4 have been trained with sufficient code and documentation of Exudyn. Only one trial has been evaluated in detail, but we did not achieve any fully correct output within several trials.

Table 6 Prompt for mass–spring-damper in Exudyn, see Fig. 3
Table 7 Number of modeling errors (\(e_{mod}\)) and syntax errors (\(e_{syn}\)) of Example 3 with Exudyn. If the code is riddled with syntax or modeling errors, the number of modeling or syntax errors is not available (n.a.)

4.4 Examples 4, 4.1, 5, and 5.1 based on mass points in Exudyn with context

The examples of this section summarize the results of the LLM regarding mass points with spring-dampers and distance constraints modeled in Exudyn. Since none of the LLM could correctly solve the task posed in Sect. 4.3, in-context learning is applied from now on by prompting appropriate text, see Sect. 3.2, prior to the queries given in Appendix A.3. For clarity, the expected models are shown in Fig. 3, but are not available to the LLM. The results of these tests are summarized in Table 8 and Table 9. We observe the excellent performance of GPT-4, as it only produced one model error. The performance of GPT-3.5 and Bard is worse, while for GPT-3.5 in Example 4 it would be possible to feed back syntax errors within a second iteration. Remarkably, while we observed that Bard did not know Exudyn’s way of modeling, it could easily learn from the context and produced only a few errors compared to Example 3. LLaMA-2 could not sufficiently learn from the provided context and even produced highly erroneous Python syntax.

Table 8 Number of modeling errors (\(e_{mod}\)) and syntax errors (\(e_{syn}\)) of Example 4 and Example 4.1 with Exudyn with context. If the simple task of Example 4 was erroneous, we skipped (skp) Example 4.1. If the code is riddled with syntax or modeling errors, the number of modeling or syntax errors is not available (n.a.)
Table 9 Number of modeling errors (\(e_{mod}\)) and syntax errors (\(e_{syn}\)) of Example 5 and Example 5.1 with Exudyn with context. If the simple task of Example 5 was erroneous, we skipped (skp) Example 5.1

Clearly, the solution to Example 4 is almost included in the context information. The LLM only has to select the right commands and adjust input parameters, which is already too difficult for LLaMA-2. Example 4.1 goes further beyond the context information, as, for example, spring-dampers are added between mass points, which is not described in the context information.

We also observe variations in the output of GPT-4, which uses a gravity constant of 10 in Example 5 (and 5.1), trial 1, versus 9.81 in the other trials of the same example. As the specification in Example 5.1 was intentionally left slightly unclear, the solutions differ considerably. For example, the distances of the 10 mass points differ between the trials, but we marked them as correct. Furthermore, the ways to create bodies and joints vary notably in Example 5.1, showing the ability of GPT-4 to work with the learned context.

4.5 Examples 6 and 6.1 based on rigid bodies in Exudyn with context

Similar to the examples with mass points, in-context learning based on information for the creation of rigid bodies, joints, spring-dampers, and mass points is also used for Examples 6 and 6.1. For details of the context information for rigid bodies, see the supplementary material; we note that it does not contain particular information for creating chain-like or slider-crank mechanisms. As shown in Table 2, the number of tokens exceeds 2048, which should theoretically only work with GPT-4, Bard, and LLaMA-2. However, the evaluation shows that GPT-3.5 is also able to generate some correct output.

The definitions of the models in text form are given in Appendix A.4. For clarity, the models are shown in Fig. 4, but are not available to the LLM. The results of these tests are summarized in Table 10 and some exemplary outputs are given in Appendix A.4. We observe that GPT-4 produced two wrong lines of code, which could be fixed by feeding the error message back as a prompt. Still, GPT-3.5 and GPT-4 performed comparatively well, mostly using wrong position vectors for joints relative to the bodies’ reference positions.

Fig. 4: Test examples based on rigid bodies; a) is used for Example 6 and b) for Example 6.1

Table 10 Number of modeling errors (\(e_{mod}\)) and syntax errors (\(e_{syn}\)) of Example 6 and Example 6.1 in Exudyn with context. If the simple task of Example 6 was erroneous, we skipped (skp) Example 6.1. If the code is riddled with syntax or modeling errors, the number of modeling or syntax errors is not available (n.a.)

In some cases, feeding errors back in a second iteration, appropriate training, improved fine tuning, or vision-based inputs (with multimodal LLM) could resolve such problems in the future.

Figure 5 shows a screenshot of the triple pendulum (Example 6) and of the slider-crank mechanism (Example 6.1) created from the output of GPT-4. Note that the geometry of the slider-crank mechanism would not work for full revolutions of the crank, because the crank and the connecting rod have the same length. Nevertheless, the direct extension from the triple pendulum to a slider-crank mechanism has been performed correctly.

Fig. 5: Visualization in Exudyn using the Python scripts created by GPT-4: triple pendulum with rigid bodies (left, Example 6) and slider-crank mechanism (right, Example 6.1); the visualization parameters (shadow, colors, loads, font size) and the drawing size of joints have been slightly adapted in order to improve visibility (Color figure online)

5 Conclusions and outlook

The performed experiments can be summarized as follows. Even the smallest tested LLM are able to sketch basic simulation codes for dynamic systems, for example in Python/SciPy, nevertheless producing many syntax and some model errors. Advanced LLM, such as GPT-4, which is currently leading in many LLM benchmarks, are highly reliable in writing basic simulation codes for dynamic systems. Furthermore, as shown in [3], a simplified modeling language and appropriate information provided in the context greatly improve the reliability of LLM in producing correct simulation code. Regarding multibody system dynamics, advanced LLM show potential for creating even advanced rigid-body systems with joints. In general, we observe an unsurpassed speed; e.g., a chain-like multibody system can be created by the LLM in less than a minute. In software designed for structured data representation in tabular formats, the integration of AI-driven functionalities is already evident, offering automated recommendations for data-visualization strategies and intelligent formula suggestions derived from the inherent dataset characteristics. Consequently, multibody models generated from natural language could be positioned as a supplementary tool augmenting GUI modeling rather than a complete replacement.

While the simulation models generated by LLM are not always error free, they can provide significant relief in practice by offering a quick initial approach to implementation, so that users do not need to master a complex description language. Although not shown here, the LLM could also create parts of models.

Due to the way Exudyn data is available, many LLM could learn solely from code examples. This means they learned blindly, without any description of kinematics by images. This may, in particular, explain the limitations in geometric and kinematic understanding that became evident during the performed tests. The latter limitation is about to be remedied, as ChatGPT-4V (vision) became available in September 2023, which also supports graphical input. This could enable the creation of multibody models from (hand) sketches.

As a main result, we could demonstrate large differences in the quality of the investigated LLM, showing a strong relation between the quality and the size of the LLM. This allows us to conclude that future LLM, not only increasing in size, but also with longer training times, advanced model structures such as chain-of-thought [27], improved hyperparameters, larger training datasets, and detailed training on Exudyn or similar libraries, could perform much better than the models available today. Therefore, we conclude that future LLM could create highly complex rigid or flexible multibody systems from natural language.