Cluster energy prediction based on multiple strategy fusion whale optimization algorithm and light gradient boosting machine

Background Clusters, a novel hierarchical material structure that emerges from atoms or molecules, possess unique reactivity and catalytic properties, crucial in catalysis, biomedicine, and optoelectronics. Predicting cluster energy provides insights into electronic structure, magnetism, and stability. However, the structure of clusters and their potential energy surface is exceptionally intricate. Searching for the global optimal structure (the lowest energy) among these isomers poses a significant challenge. Currently, modelling cluster energy predictions with traditional machine learning methods has several issues, including reliance on manual expertise, slow computation, heavy computational resource demands, and less efficient parameter tuning. Results This paper introduces a predictive model for the energy of a gold cluster comprising twenty atoms (referred to as Au20 cluster). The model integrates the Multiple Strategy Fusion Whale Optimization Algorithm (MSFWOA) with the Light Gradient Boosting Machine (LightGBM), resulting in the MSFWOA-LightGBM model. This model employs the Coulomb matrix representation and eigenvalue solution methods for feature extraction. Additionally, it incorporates the Tent chaotic mapping, cosine convergence factor, and inertia weight updating strategy to optimize the Whale Optimization Algorithm (WOA), leading to the development of MSFWOA. Subsequently, MSFWOA is employed to optimize the parameters of LightGBM for supporting the energy prediction of Au20 cluster. Conclusions The experimental results show that the most stable Au20 cluster structure is a regular tetrahedron with the lowest energy, displaying tight and uniform atom distribution, high geometric symmetry. Compared to other models, the MSFWOA-LightGBM model excels in accuracy and correlation, with MSE, RMSE, and R2 values of 0.897, 0.947, and 0.879, respectively. 
Additionally, the MSFWOA-LightGBM model possesses outstanding scalability, offering valuable insights for material design, energy storage, sensing technology, and biomedical imaging, with the potential to drive research and development in these areas.


Background
Clusters represent a novel material structure and serve as an intermediate transition state during the transformation of substances from atoms and molecules to macroscopic objects. They are assemblies formed by the bonding of numerous atoms, molecules, or ions, driven by physical or chemical forces, with sizes ranging between one-tenth and a hundred nanometres [1][2][3]. Clusters [4][5][6][7][8][9], unlike individual atoms, molecules, and macroscopic solids, exhibit distinctive chemical reactivity and catalytic performance due to characteristics [10] such as quantum size effects, surface effects, and a higher surface area-to-volume ratio. Therefore, they find wide applications in fields including catalysis [11,12], materials adsorption [13], biomedical applications [14], optics and optoelectronics [15][16][17]. The energy of clusters is pivotal for comprehending their stability and characteristics. Analysing and comparing these energies enhances our comprehension of the energy differences and relative levels among various clusters [18][19][20]. This in-depth understanding aids in predicting and explaining the electronic structure, magnetism, and optical properties of clusters. It leads to the optimization of the energy band structure of materials and the active sites of catalysts, and also enables advanced predictions of cluster formation and stability under experimental conditions. This guidance in experimental design saves time, reduces costs, and further promotes the development of new materials, new catalysts and new energy technologies. In particular, gold clusters are crucial in Surface-Enhanced Raman Spectroscopy (SERS) and photothermal therapy [21,22]. However, their structure is exceptionally complex and possesses an abundance of isomers [23][24][25][26]. Taking the Au20 cluster [27][28][29] as an example, six of its isomer structures are shown in Fig. 1. Thus, the search for the globally optimal structure among various isomers presents a substantial challenge, and establishing theoretical computational models for cluster energy holds significant research value and promising applications.
The theoretical calculation models of cluster energy can be mainly divided into two categories. One category is the ab initio method, rooted in the first principles of quantum mechanics [30,31], which includes Density Functional Theory (DFT) [32,33], Hartree-Fock theory (HF) [34], second-order Møller-Plesset perturbation theory [35,36], Complete Active Space Perturbation theory (CASPT2) [37], Multi-Reference Configuration Interaction (MRCI) [38] and others. These methods predict the energy of clusters by describing the electronic structure and interactions among electrons. Their main challenge lies in analysing wave functions in a high-dimensional space, requiring complex computations with many degrees of freedom and parameters. As the number of atoms increases, so does the computational complexity, exponentially increasing demands on computing resources and time. Therefore, these methods are not an ideal or efficient solution in practical applications. The other category relies on empirical potential energy function methods, primarily including Lennard-Jones [39,40], Morse [20,41], Gupta [42,43], Sutton-Chen [44,45] and Reactive Empirical Bond Order (REBO) [46,47]. These methods provide approximate predictions of cluster energy by constructing empirical potential energy functions to describe atomic interactions. While offering fast computational speed and low computational cost, their precision and applicability are significantly limited by the selected potential energy function and its associated parameters. The rapid advancement of artificial intelligence has prompted researchers to increasingly utilize machine learning techniques for challenges in regression and classification. Machine learning methods, being data-driven in nature, make decisions by discerning patterns and associations within datasets. Consequently, they find widespread application across various domains, including computational physics, chemistry, and materials science. Hansen et al. [48] employed the linear regression method to establish the relationship model between the structure information and energy of clusters, successfully achieving energy prediction. The model demonstrated good performance, and the experiment illustrated that the application of machine learning methods to describe atomic interactions can accelerate the energy prediction process.
However, there are two issues that need to be explored and addressed: (1) The energy prediction model, which relies on traditional machine learning methods, encounters challenges. It depends on human expertise and struggles with slower processing speed and greater computational requirements when dealing with the intricate relationship between cluster structure and energy in high-dimensional nonlinear data. (2) The performance of the model is closely linked to the hyperparameters' values. Previous methods for setting hyperparameters through exhaustive searches were not only less efficient but also produced unsatisfactory results, particularly in scenarios with a large number of hyperparameters.
This paper puts forward feasible solutions for tackling the two aforementioned issues. Firstly, we draw inspiration from several advanced machine learning methods, including Random Forest (RF), Gradient Boosting Decision Tree (GBDT), eXtreme Gradient Boosting (XGBoost), and LightGBM. These methods have demonstrated impressive performance in handling highly nonlinear feature problems [49][50][51][52][53][54]. Given LightGBM's strong robustness and resilience [55,56], we propose employing it for cluster energy prediction. Secondly, swarm intelligence optimization algorithms offer notable benefits in the field of optimization. Specifically, the WOA stands out for its fewer parameters, ease of implementation and adjustment, and efficient global search capability [57][58][59][60][61]. Therefore, we utilize the WOA to search for hyperparameters.
The main contributions of this paper are as follows: (1) We employ an advanced machine learning technique to predict the energy of Au20 clusters. By analysing the relationship between atoms, we transform the spatial structure information of the cluster into a numerical matrix and extract its features. By utilizing this feature sequence as input and energy

Feature representation
Choosing the appropriate cluster representation method is crucial for the performance of machine learning methods. For a cluster containing N atoms, we require a function $E_N : \mathbb{R}^{3N} \times \mathbb{N} \to \mathbb{R}$ to convert the cluster's structural information into a numerical vector. Thus, we adopt the Coulomb representation proposed by Rupp et al. [62].
Based on the atom's nuclear charge and three-dimensional coordinates, the cluster data is encoded into an N × N dimensional matrix. The equation for the Coulomb matrix is as follows:

$$C_{ij} = \begin{cases} 0.5\, Z_i^{2.4}, & i = j \\ \dfrac{Z_i Z_j}{\lVert R_i - R_j \rVert}, & i \neq j \end{cases} \tag{1}$$

where $Z_i$ and $R_i$ denote the nuclear charge and the three-dimensional spatial coordinates of atom i, respectively. The matrix's diagonal elements are derived by fitting the total energy of free atoms and the nuclear charge using a polynomial, while the off-diagonal elements signify the Coulombic repulsion between two atoms within the cluster. Subsequently, we compute eigenvalues for the N × N Coulomb matrix to extract its characteristics. Eigenvalues are crucial in representing matrix information and are commonly used for matrix dimensionality reduction. The equation for eigenvalue computation is as follows:

$$C\, v = \lambda\, v \tag{2}$$

The eigenvalues $\lambda$ obtained from Eq. (2) are used as the feature sequence for the Au20 cluster.
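The feature extraction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the Coulomb matrix in the form given by Rupp et al., the function names `coulomb_matrix` and `eigen_features` are hypothetical, the eigenvalues are sorted in descending order as an assumed convention, and no unit conversion of the coordinates is applied.

```python
import numpy as np


def coulomb_matrix(charges, coords):
    """Build the N x N Coulomb matrix from nuclear charges and 3D coordinates."""
    z = np.asarray(charges, dtype=float)
    r = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances between all atoms.
    dist = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    # Placeholder on the diagonal to avoid division by zero; overwritten below.
    np.fill_diagonal(dist, 1.0)
    c = np.outer(z, z) / dist            # off-diagonal: Z_i * Z_j / |R_i - R_j|
    np.fill_diagonal(c, 0.5 * z ** 2.4)  # diagonal: 0.5 * Z_i^2.4
    return c


def eigen_features(charges, coords):
    """Return the eigenvalues of the Coulomb matrix, sorted in descending order."""
    # eigvalsh is appropriate because the Coulomb matrix is real and symmetric.
    eigenvalues = np.linalg.eigvalsh(coulomb_matrix(charges, coords))
    return eigenvalues[::-1]


if __name__ == "__main__":
    # Illustrative input only: 20 gold atoms (Z = 79) at random positions.
    rng = np.random.default_rng(0)
    demo_coords = rng.uniform(-5.0, 5.0, size=(20, 3))
    print(eigen_features([79] * 20, demo_coords).shape)
```

Because the eigenvalues do not depend on the order in which the atoms are listed, the resulting twenty-element vector gives the same features for any relabelling of the same structure.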

Light gradient boosting machine
LightGBM [63] is a highly efficient distributed ensemble algorithm that evolved from GBDT [64] in 2017. The core idea of GBDT is to replace the output residuals of the previous tree with the direction of steepest descent of the loss function (the negative gradient direction) when generating a new decision tree. During the iteration, GBDT keeps the current model unchanged and learns an additional function to approximate the actual values more accurately. The final prediction is obtained by combining the outputs of multiple decision trees. Given an input X, the model can be written as

$$F_K(X) = \sum_{k=1}^{K} f_k(X) \tag{3}$$

where $f_k(X)$ denotes the k-th decision tree, and K is the total number of decision trees.
Initialize the model with $F_0(X) = 0$. At the t-th iteration, the model and loss function are expressed as Eqs. (4, 5), respectively:

$$F_t(X) = F_{t-1}(X) + f_t(X) \tag{4}$$

$$L(Y, F_t(X)) = L(Y, F_{t-1}(X) + f_t(X)) \tag{5}$$

The first derivative $L'$ and the second derivative $L''$ of the loss function L are calculated by Eqs. (6, 7).

$$L'(Y, F_t(X)) = \frac{\partial L(Y, F_t(X))}{\partial F_t(X)} \tag{6}$$

$$L''(Y, F_t(X)) = \frac{\partial^2 L(Y, F_t(X))}{\partial F_t(X)^2} \tag{7}$$

According to Eq. (4) and the first-order Taylor expansion $f(x + \Delta x) = f(x) + f'(x) \times \Delta x$, the first derivative $L'$ is modified to Eq. (8).

$$L'(Y, F_t(X)) = L'(Y, F_{t-1}(X)) + L''(Y, F_{t-1}(X)) \times f_t(X) \tag{8}$$

Setting $L'(Y, F_t(X)) = 0$, the t-th decision tree is given by Eq. (9):

$$f_t(X) = -\frac{L'(Y, F_{t-1}(X))}{L''(Y, F_{t-1}(X))} \tag{9}$$

Substituting it into Eq. (4), the t-th learner is given by Eq. (10):

$$F_t(X) = F_{t-1}(X) - \frac{L'(Y, F_{t-1}(X))}{L''(Y, F_{t-1}(X))} \tag{10}$$

Ultimately, the strong learner $F_K(X)$ is obtained, which represents the optimal solution of the model.
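As a worked check of this update rule, consider the squared-error loss. The choice of loss is illustrative and is not stated in the text. The check also assumes that Eq. (9) has the Newton-step form $f_t = -L'/L''$, which is what setting the Taylor-expanded derivative to zero yields. Under these assumptions the new tree is asked to fit the residual of the previous model:

```latex
L(Y, F) = \tfrac{1}{2}\,(Y - F)^2
\quad\Rightarrow\quad
L'(Y, F) = F - Y, \qquad L''(Y, F) = 1
\\[4pt]
f_t(X) = -\frac{L'(Y, F_{t-1}(X))}{L''(Y, F_{t-1}(X))} = Y - F_{t-1}(X)
\\[4pt]
F_t(X) = F_{t-1}(X) + f_t(X) = Y
```

In practice each tree only approximates this target, so the residual shrinks over several iterations instead of vanishing in a single step.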
During the construction of the decision tree, GBDT calculates information gain values for all data points and employs a level-wise growth strategy (as shown in Fig. 2a), which leads to slower model execution and higher complexity. LightGBM adopts a histogram-based algorithm (as shown in Fig. 2c), which discretizes continuous features into multiple bins and records information such as the number of samples in each bin. This approach enables the discovery of the optimal split point with just a single pass through the feature data. Additionally, LightGBM employs a leaf-wise growth strategy (as shown in Fig. 2b). At each step, this strategy splits only the leaf with the highest gain instead of splitting every node in a layer. Consequently, it reduces model complexity and accelerates training. Moreover, LightGBM utilizes Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) for data preprocessing. GOSS retains all high-gradient samples while randomly selecting only a few low-gradient samples, significantly reducing the sample size. After sampling is completed, EFB reduces the feature count by bundling a group of mutually exclusive features, that is, features that rarely take non-zero values at the same time, into a new feature bundle. This process merges features with almost no loss. Furthermore, LightGBM is also optimized for parallel computation. In summary, LightGBM has faster speed than deep neural networks and higher precision than other machine learning methods. As a result, we choose LightGBM for the energy prediction task of the Au20 cluster.
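The GOSS step can be illustrated with a short standard-library sketch. It is a simplified illustration and not LightGBM's internal code. The function name `goss_sample` is hypothetical. The arguments `top_rate` and `other_rate` borrow the names of the corresponding LightGBM options, and the default values are arbitrary. The re-weighting factor `(1 - top_rate) / other_rate` follows the published description of GOSS, where it compensates for the under-sampled small-gradient instances.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {sample_index: weight} for the instances kept by GOSS."""
    n = len(gradients)
    # Rank the instances by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = round(top_rate * n)
    n_other = round(other_rate * n)

    top = order[:n_top]    # every large-gradient instance is kept
    rest = order[n_top:]   # small-gradient instances are subsampled
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_other, len(rest)))

    # Amplify the sampled small-gradient instances so that the gain
    # estimate stays approximately unbiased.
    amplification = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplification for i in sampled})
    return weights
```

With the default rates, 30% of the instances take part in each split search while the largest gradients, which carry most of the information, are always present.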

Whale optimization algorithm
The WOA, a novel intelligent search algorithm that simulates the hunting behavior of humpback whales, was proposed by Mirjalili et al. [65] in 2016. The algorithm has three phases: encircling prey, bubble-net attacking, and random search. During the whale's search process, a random probability p decides, with a 50% chance each, whether the whale encircles or attacks the prey. Meanwhile, the value of parameter A determines whether to expand the predatory search circle or to shrink the encircling circle, and it changes with the convergence factor a.

$$A = 2a \cdot r - a$$

$$a = 2 - \frac{2t}{T} \tag{13}$$

where A is a coefficient vector with values in the range [−a, a]. The value of the convergence factor a linearly decreases from 2 to 0, r is a random number between 0 and 1, and t and T represent the current iteration number and the maximum number of iterations, respectively.
(1) Encircling prey When p < 0.5 and |A| < 1, the whale's position is continuously updated based on the optimal individual whale position calculated by the fitness function, thus encircling the prey.

$$X_i^{t+1} = X_{Best} - A \cdot D_{iB}^{t} \tag{14}$$

$$D_{iB}^{t} = \left| C \cdot X_{Best} - X_i^{t} \right|$$

$$C = 2r$$

where $X_i^{t}$ and $X_i^{t+1}$ represent the positions of the i-th whale in the t-th and (t + 1)-th iterations, respectively. $X_{Best}$ denotes the current optimal whale's position (i.e., the prey's position). $D_{iB}^{t}$ is the distance between the i-th whale and the prey in the t-th iteration, and C is a coefficient vector.
(2) Random searching prey When p < 0.5 and |A| ≥ 1, we randomly select one whale from the whale group as a search proxy, then update the positions of other whales based on the search proxy's location.

$$X_i^{t+1} = X_{rand} - A \cdot D \tag{17}$$

$$D = \left| C \cdot X_{rand} - X_i^{t} \right|$$

where $X_{rand}$ represents the position of the randomly selected whale. D denotes the distance between the i-th whale and the selected one.
(3) Bubble-net attacking prey When p ≥ 0.5, the whale selects the bubble-net feeding mechanism, a unique predation method of the whale. It moves upward in a spiral path while updating the whale's position and conducting a "bubble net attack" to capture its prey.

$$X_i^{t+1} = D'_t \cdot e^{bl} \cdot \cos(2\pi l) + X_{Best} \tag{19}$$

$$D'_t = \left| X_{Best} - X_i^{t} \right|$$

where $D'_t$ is the distance between the i-th whale and the current prey in the t-th iteration, b is a constant that defines the shape of the logarithmic spiral, and l is a random number between 0 and 1.
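The three phases can be put together in a compact sketch. It is a minimal illustration under stated assumptions and not the authors' code. It assumes the standard WOA update rules of Mirjalili et al. and the linear convergence factor described above. `A` and `C` are drawn as one scalar per whale instead of as vectors. New positions are clamped to the search bounds. `l` is drawn from [0, 1) as stated in the text, whereas the original WOA paper draws it from [−1, 1]. The function name `woa_minimize` and its default arguments are hypothetical.

```python
import math
import random


def woa_minimize(fitness, dim, bounds, n_whales=20, max_iter=100, b=1.0, seed=0):
    """Minimise `fitness` over a box with a basic Whale Optimization Algorithm."""
    rng = random.Random(seed)
    lo, hi = bounds
    whales = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_whales)]
    best = min(whales, key=fitness)[:]
    best_fit = fitness(best)

    for t in range(max_iter):
        a = 2.0 - 2.0 * t / max_iter          # linear decrease from 2 towards 0
        for i, x in enumerate(whales):
            A = 2.0 * a * rng.random() - a    # A lies in [-a, a]
            C = 2.0 * rng.random()
            p = rng.random()
            if p < 0.5:
                # |A| < 1: encircle the best whale; otherwise follow a random whale.
                ref = best if abs(A) < 1.0 else whales[rng.randrange(n_whales)]
                new = [ref[d] - A * abs(C * ref[d] - x[d]) for d in range(dim)]
            else:
                # Bubble-net attack: logarithmic spiral around the best whale.
                l = rng.random()
                new = [
                    abs(best[d] - x[d]) * math.exp(b * l) * math.cos(2.0 * math.pi * l)
                    + best[d]
                    for d in range(dim)
                ]
            whales[i] = [min(max(v, lo), hi) for v in new]  # clamp to the bounds

        # Track the best position (the prey) found so far.
        for x in whales:
            f = fitness(x)
            if f < best_fit:
                best, best_fit = x[:], f
    return best, best_fit


if __name__ == "__main__":
    sphere = lambda x: sum(v * v for v in x)
    print(woa_minimize(sphere, dim=2, bounds=(-10.0, 10.0)))
```

Passing a different `fitness` callable, for example a cross-validated RMSE, turns the same loop into a hyperparameter search.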

Multiple strategy fusion whale optimization algorithm
(1) Tent chaotic map initialization In the WOA, the initial whale positions are randomly generated, which may result in a non-uniform distribution of the whales and an increased risk of falling into local optima. To address this concern, we introduced the Tent chaotic mapping. The chaotic sequence generated by the Tent mapping exhibits both exploratory and random traits. Consequently, using it for population initialization can lead to a more uniform distribution of initial solutions within the solution space, thereby enhancing the algorithm's exploration capabilities and making it easier to find the global optimum.

$$Z_{n+1} = \begin{cases} 2 Z_n, & 0 \le Z_n \le 0.5 \\ 2\,(1 - Z_n), & 0.5 < Z_n \le 1 \end{cases}$$

where n represents the number of mappings, and $Z_n$ is the value of the n-th mapping.
(2) Cosine convergence factor In the optimization process, changes in the search range of the whale swarm play a crucial role in the algorithm's convergence accuracy and efficiency. As illustrated by Eq. (13), the convergence factor a decreases linearly as the number of iterations increases, which may lead to an imbalance in the algorithm's search capabilities between the early and later stages of iteration. A suitable value of parameter A can balance the algorithm's ability between global exploration and local optimization. Therefore, we design a cosine convergence factor to dynamically adjust the search range and the value of parameter A.
(3) Inertia weight updating strategy The weight of whales is a fixed value, which is insufficient to handle the complex nonlinear variations during the optimization process. Therefore, based on the cosine nonlinear variation characteristics of the convergence factor a, we introduce the inertia weight factor w to adjust the proportion of global and local search in the WOA. In this way, the algorithm can quickly converge to a local optimal solution while retaining a high probability of jumping out of the local optimum and performing a global search, which helps to improve the search efficiency and quality of the algorithm.
where $w_{\min}$ and $w_{\max}$ represent the minimum and maximum values of the weight, which are 0.4 and 0.9, respectively. Since a ∈ [0, 2], the weight w lies in [0.4, 1.4]. There is a 3/5 probability for local search and a 2/5 probability for global search.
Substituting the inertia weight factor w into Eqs. (14, 17, 19), we get an updated equation for the whale's position:
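The three strategies can be sketched as small helper functions. Every formula in this sketch is an assumption made for illustration and may differ from the authors' equations. The Tent map uses the common slope-2 form. The cosine factor is taken as `2 * cos(pi * t / (2 * T))`, which falls from 2 to 0. The inertia weight is taken as `w_min + (w_max - w_min) * a`, chosen only because it reproduces the stated range [0.4, 1.4] for a in [0, 2]. All function names are hypothetical.

```python
import math
import random


def tent_map(z):
    """One step of the slope-2 Tent map on [0, 1] (assumed form)."""
    return 2.0 * z if z < 0.5 else 2.0 * (1.0 - z)


def tent_init(n_whales, dim, lo, hi, seed=0):
    """Initialise whale positions from short Tent-map sequences.

    Each whale starts from its own seed value because the slope-2 map
    degenerates to zero after a few dozen steps in floating point.
    """
    rng = random.Random(seed)
    population = []
    for _ in range(n_whales):
        z = rng.uniform(0.01, 0.99)
        whale = []
        for _ in range(dim):
            z = tent_map(z)
            whale.append(lo + z * (hi - lo))  # scale from [0, 1] to [lo, hi]
        population.append(whale)
    return population


def cosine_factor(t, max_iter):
    """Assumed cosine convergence factor: slow decrease early, faster later."""
    return 2.0 * math.cos(math.pi * t / (2.0 * max_iter))


def inertia_weight(a, w_min=0.4, w_max=0.9):
    """Assumed inertia weight, linear in the convergence factor a."""
    return w_min + (w_max - w_min) * a
```

In a full MSFWOA loop these helpers would replace the random initialisation and the linear convergence factor of the basic WOA, and the weight `w` would scale the reference position in each update rule.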

Model architecture
The MSFWOA-LightGBM model comprises three stages: feature preparation, model construction, and prediction and analysis, as illustrated in Fig. 3. The first stage is feature preparation. Using the Coulomb representation, we compute the Coulomb matrix from the atomic coordinates of each cluster in the dataset. Next, we compute the eigenvalues of the Coulomb matrix to generate a feature sequence. In the second stage, which focuses on model construction, we employ MSFWOA to optimize LightGBM, itself an improvement on GBDT, thereby establishing the MSFWOA-LightGBM model. The third stage is the prediction and analysis of the model. We input the feature sequence into the MSFWOA-LightGBM model for training and testing using ten-fold cross-validation, ultimately outputting the energy of Au20. We then conduct a comparative analysis of the sample values and the predicted values to assess the model's performance. Meanwhile, we also compare it in multiple aspects with other optimization algorithms and machine learning algorithms to verify the superiority of this model.

Model evaluation criterion
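We use the mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and coefficient of determination ($R^2$) to evaluate the model:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$$

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{N} \left( y_i - \bar{y} \right)^2}$$

where N represents the number of samples, $y_i$ and $\hat{y}_i$ are the experimental and predicted values of the i-th sample, respectively, and $\bar{y}$ is the average of all samples.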
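For reference, the four criteria used in this paper (MAE, MSE, RMSE, and R2) can be computed with the standard definitions as in the sketch below. The function name `regression_metrics` is hypothetical, and the code assumes the textbook formulas and not any library-specific variant.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for paired experimental and predicted values."""
    n = len(y_true)
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    ss_tot = sum((yt - mean_true) ** 2 for yt in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}
```

Since RMSE is the square root of MSE, the two values reported in the abstract are mutually consistent: the square root of 0.897 is approximately 0.947.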

Data preparation and processing
The experimental data come from the MathorCup University Mathematical Modeling Challenge and comprise structural files of 999 Au20 clusters. Each file contains a cluster's energy and the three-dimensional Cartesian coordinates of its twenty atoms. By analyzing the dataset statistically, we obtain values for various indicators, such as the maximum value, minimum value, average value, standard deviation, variance, lower quartile, median, and upper quartile, as shown in Table 1. The absolute differences between the upper and lower quartiles and the median for all columns do not exceed 2. The extreme values for the X, Y, and Z axes deviate from the median by less than 10. However, in the energy column, this deviation is 5.553403 from the maximum value and as much as 20.747694 from the minimum value. Figure 4 shows a notably anomalous data point with an energy value of -1530.908363, which deviates significantly from the overall data. Given that this point might influence subsequent research, we consider it an outlier and exclude it from our analysis, focusing solely on the remaining 998 gold cluster isomers for further study and analysis.

Results
The operating environment is a 64-bit Windows 10 OS (16 GB of memory and an Intel® Core™ i7-8700 processor). The software is Spyder with Python 3.7. Initially, based on the nuclear charge and the number of atoms in the Au20 cluster, we calculate the Coulomb matrix and obtain the feature sequence through eigenvalue decomposition.
Next, we train and validate the MSFWOA-LightGBM model using ten-fold cross-validation and employ MSFWOA to search for the hyperparameters of LightGBM to ensure the model achieves its best performance. Lastly, we evaluate the predictive performance of MSFWOA-LightGBM and analyze the relationship among cluster atom distribution, energy, and structure.
In this experiment, we utilize the MSFWOA algorithm to optimize seven key hyperparameters of LightGBM. The fitness function is the RMSE. During the iterations, we consistently update and track the position of the optimal whale (the prey). Ultimately, we identify the set of hyperparameters that results in the lowest RMSE, as presented in Table 2. The table presents descriptions, corresponding values, and search ranges for all parameters.
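One way to connect a whale position to LightGBM is to treat each coordinate as a value in [0, 1] and scale it to the search range of one hyperparameter. The sketch below is hypothetical throughout. The seven names are real LightGBM scikit-learn options, but which hyperparameters the authors tuned and the ranges they searched are given only in Table 2, so the entries here are placeholders. The `cv_rmse` callable, which would train LightGBM under ten-fold cross-validation and return the RMSE, is left to the reader.

```python
# Hypothetical search space: (name, lower bound, upper bound, type).
SEARCH_SPACE = [
    ("num_leaves", 8, 128, int),
    ("max_depth", 3, 12, int),
    ("learning_rate", 0.01, 0.3, float),
    ("n_estimators", 100, 1000, int),
    ("min_child_samples", 5, 50, int),
    ("subsample", 0.5, 1.0, float),
    ("colsample_bytree", 0.5, 1.0, float),
]


def decode(position):
    """Map a position in [0, 1]^7 to a dictionary of hyperparameter values."""
    params = {}
    for u, (name, lo, hi, kind) in zip(position, SEARCH_SPACE):
        u = min(max(u, 0.0), 1.0)  # clip coordinates that left the unit box
        value = lo + u * (hi - lo)
        params[name] = int(round(value)) if kind is int else float(value)
    return params


def make_fitness(cv_rmse):
    """Wrap a user-supplied `cv_rmse(params) -> float` as a position fitness."""
    def fitness(position):
        return cv_rmse(decode(position))
    return fitness
```

The optimizer then minimises `make_fitness(cv_rmse)` over the unit box, and the best position decodes to the final hyperparameter set.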
According to the experimental results, we compare the difference between the experimental and the predicted values and analyze the errors, as shown in Fig. 5. The fitting performance of the MSFWOA-LightGBM model for the experimental and predicted values in the training and test sets is illustrated in Fig. 5a, b, respectively. The diagonal indicates that the experimental value is equal to the predicted value. There is a significant count of samples with energy levels ranging from −1555 to −1545, whereas the number of samples between −1545 and −1540 is limited, displaying a discrete distribution. In Fig. 5a, the data points are all distributed near the diagonal, indicating that the error between predicted and experimental values in the training set is small, demonstrating good model performance. In Fig. 5b, the majority of data points are close to the diagonal, with only a small number of data points in sparsely distributed areas being distant from the line. This observation shows that the non-uniform data distribution has a discernible impact on the model's performance. Overall, the model exhibits a high degree of fitting and performs well.
In Fig. 5c, d, we construct error curve graphs with the average of experimental and predicted results on the vertical axis and use the standard deviation to create error bars. Additionally, we draw violin plots based on error values. In the training set, most error bars range from 0 to 1, and only six data points exhibit errors within [1, 2]. In the test set, there are fewer than ten data points with relatively long error bars, which indicate larger errors. According to the violin plots, 85% of the data have error values within [0, 1], with only three data points having errors greater than 2. Overall, the training and test sets exhibit low errors and few outliers, so the model has high accuracy and good stability.
To assess the effectiveness of the features extracted by the model, we record the evaluation metric values on the best-performing training and test sets, as well as the values of SHapley Additive exPlanations (SHAP) for the twenty features in this experiment, as shown in Fig. 6. From Fig. 6a-c
We utilize the Visual Molecular Dynamics (VMD) software to visualize the Au20 cluster with the lowest energy and find it to be a regular tetrahedral structure, as depicted in Fig. 7. The yellow nodes represent atoms, and the purple segments are connections between adjacent atoms when the Distance Cutoff of DynamicBonds is 2.8. The stereogram consists of twenty atoms and sixty bonds, with each atom bonded to multiple neighboring atoms. Within it, four atoms are each connected to three neighboring atoms, four atoms to nine, and twelve atoms to six. In the three views, it is clear that each face of the tetrahedron is an identical equilateral triangle. Hence, the tetrahedron is a regular tetrahedron, and the atoms on each face are equivalent, exhibiting tetrahedral symmetry. In summary, the structure of the Au20 cluster with the lowest energy primarily exhibits a dense and uniform atomic distribution. The structure is a highly symmetrical, rotationally invariant, and tightly packed tetrahedron.

Analysis of results with various parameter optimization algorithms
We compared the Bayesian optimization algorithm (BO), the Grey Wolf optimization algorithm (GWO), and the WOA with our proposed MSFWOA algorithm, following Qiu et al. [66]. To maintain consistency and fairness in the comparison across different optimization algorithms, we utilized a uniform LightGBM parameter search range. Additionally, in order to visually demonstrate the performance differences among the algorithms, we conducted statistical analyses of the errors and various evaluation metrics, as shown in Fig. 8. Figure 8a, d show the boxplot of sample value errors and the error count statistics for various parameter optimization algorithms on the test set. The rectangular box represents 50% of the data, with the line inside the box indicating the median. The upper and lower whiskers depict the range of 80% of the data, while the diamond-shaped data points indicate significant outliers. The evaluation metric data for various algorithms on the test dataset are displayed in Fig. 8b, c.
In Fig. 8a, the LightGBM model has more outlier data points than other models, and these outliers exhibit the greatest deviation. This phenomenon indicates that the LightGBM model has a higher number of errors with larger error values. The BO-LightGBM, GWO-LightGBM and WOA-LightGBM models have roughly the same number of outlier data points. However, the WOA-LightGBM model has smaller outlier values, indicating that the LightGBM model, optimized through WOA, exhibits more stable predictive performance. Therefore, WOA demonstrates stronger optimization capabilities than BO and GWO, showing a certain advantage. Furthermore, for the MSFWOA-LightGBM model, the median of the boxplot is the smallest, and the majority of data points are clustered around the median. This phenomenon suggests that this model has the smallest overall error, with most errors falling within the range of [−3, 3]. In Fig. 8d, it can be observed that MSFWOA-LightGBM has the highest bar near 0, suggesting that the majority of data points exhibit errors centered around 0. From the two bar charts, the values of MAE, MSE, and RMSE for the LightGBM models with a parameter optimization algorithm are all smaller than those of the original LightGBM model, while the $R^2$ value is greater than that of the original LightGBM model. Furthermore, the MSFWOA-LightGBM model exhibits the smallest values for MAE, MSE and RMSE, as well as the largest value for $R^2$. Overall, it is evident that the LightGBM model with a parameter optimization algorithm performs more robustly than the one without optimization, and the improved MSFWOA-LightGBM model excels at minimizing prediction errors, delivering superior performance.

Analysis of results with different machine learning methods
Following the methodologies described by Li et al. [67], we selected four machine learning algorithms, namely RF, GBDT, XGBoost and LightGBM, for comparison with the model proposed in this paper. These algorithms were selected based on their proven effectiveness and relevance, as extensively detailed in Li et al.'s work. We created prediction distribution charts and performance comparison charts based on the experimental results from all models on the test dataset, as depicted in Fig. 9.
In Fig. 9a, the variation in experimental data is displayed. Data points are the predictions of the five models, and the vertical distance between the data points and the points on the line represents the error between the experimental and the predicted values. The MSFWOA-LightGBM model exhibits a smaller vertical distance than the other models. In Fig. 9b
When combining the results from Fig. 9e and Table 3, it becomes evident that the MSFWOA-LightGBM model exhibits a significant advantage in terms of model accuracy and correlation. While GBDT, XGBoost and LightGBM demonstrate similar performance, RF not only performs worst but also requires the longest processing time. In comparison to GBDT and XGBoost, LightGBM is almost three times faster, even though the differences in model performance are not substantial. Therefore, MSFWOA-LightGBM excels in terms of time efficiency and outperforms LightGBM in overall performance.

Analysis of cluster's energy and structure
By comparing different isomers of Au20 clusters, we analyze the relationship between atomic distribution, energy, and structure. The twelve isomers of the Au20 cluster, as shown in Fig. 10, are arranged in ascending order of energy with an energy difference of approximately 0.5 between them, and each isomer exhibits a certain degree of symmetry. In Fig. 10a, a regular tetrahedral structure is evident, consisting of four faces, four vertices, and six edges. The distribution of atoms in space is uniform, demonstrating significant rotational symmetry. The regular tetrahedron has seven rotation axes of two kinds. Three two-fold axes each pass through the midpoints of a pair of opposite edges. A rotation of 180 degrees around one of these axes exchanges the vertices in pairs and leaves the entire tetrahedral structure unchanged. Four three-fold axes are each perpendicular to one face of the tetrahedron and pass through the centroid of that face and the opposite vertex. A rotation of 120 degrees around one of these axes cyclically permutes the three vertices of that face. The regular tetrahedron also has six planes of symmetry, each containing one edge and the midpoint of the opposite edge, through which it can be divided into two symmetrical parts. It does not possess a centre of inversion. In Fig. 10b, the atomic distribution is relatively uniform. However, the distance between each pair of atoms is greater than that in a regular tetrahedron, and the structure has only one rotation operation. In Fig. 10h, three atoms are relatively distant from the other 17 atoms. The number of planes of symmetry is limited, with only one present. In Fig. 10l, all atoms are distributed on the same plane. The difference in the number of neighboring atoms for each atom is significant, leading to a non-uniform atomic distribution.
The tetrahedral structure is more stable and has lower energy than other isomers, mainly explained by the following four points:
(1) High degree of geometric symmetry. The regular tetrahedral structure, due to its high degree of symmetry, ensures a uniform distribution of atomic spacing. Such a configuration results in a uniform electron cloud distribution, reducing its localization and thus establishing a stable electronic environment. Furthermore, this structure minimizes electron repulsion and instability. The uniform atomic spacing and angles further alleviate structural distortion and internal stress, enhancing stability. As a closed configuration, the regular tetrahedron, with fewer surface atoms and a smaller surface area, significantly reduces surface energy and total energy.
(2) Multi-center bonding and balanced coordination environment. The tetrahedral structure allows each atom to form strong Au-Au bonds with multiple neighboring atoms, establishing a multi-center bonding system that enhances the structure's overall stability. This uniform coordination environment leads the system to a more harmonious and balanced state, thereby reducing the system's total energy. It generates a more even charge distribution, reducing the repulsion between electrons and associated instability.
(4) Lower entropy effect. The highly symmetrical structure limits how atoms can arrange, resulting in a reduction in configurational entropy. It is related to other thermodynamic properties of the system, especially as it lowers the free energy, making this structure more stable in various physical and chemical processes.
Overall, the tetrahedral structure of the Au20 cluster exhibits the lowest energy, due to its high degree of geometric symmetry and optimized electronic structure.This configuration fosters strong interatomic bonding and a balanced electron cloud distribution, establishing it as the most stable isomer.

Conclusions
In this paper, we utilized the Coulomb matrix representation and eigenvalue solutions for feature extraction. By incorporating the Tent chaotic mapping, cosine convergence factor, and inertia weight updating strategy to enhance the WOA, we developed the MSFWOA. Employing the feature sequence as input and energy as output, with MSFWOA as the parameter search method, we constructed the MSFWOA-LightGBM model for predicting the energy of the Au20 cluster. The experiment demonstrates that Au20 clusters with a regular tetrahedral structure exhibit the lowest energy.
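The feature extraction step can be illustrated with a minimal sketch of the standard Coulomb matrix and its eigenvalue spectrum. The diagonal term 0.5 Z^2.4 and the off-diagonal term Z_i Z_j / |R_i - R_j| follow the commonly used definition. The function names, the coordinate units, and the ordering of eigenvalues by decreasing absolute value are assumptions here, since this section does not state the exact conventions used in this work.

```python
import numpy as np

AU_ATOMIC_NUMBER = 79  # nuclear charge Z of gold


def coulomb_matrix(positions, charges):
    """Standard Coulomb matrix for atoms at `positions` with nuclear `charges`.

    Diagonal: 0.5 * Z_i ** 2.4; off-diagonal: Z_i * Z_j / |R_i - R_j|.
    Units of `positions` are an assumption of this sketch.
    """
    positions = np.asarray(positions, dtype=float)
    charges = np.asarray(charges, dtype=float)
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, 1.0)  # placeholder to avoid division by zero
    cm = np.outer(charges, charges) / dist
    np.fill_diagonal(cm, 0.5 * charges ** 2.4)
    return cm


def eigenvalue_features(positions, charges):
    """Eigenvalues of the Coulomb matrix, sorted by decreasing magnitude.

    The spectrum does not depend on the order in which atoms are listed,
    which makes it usable as a fixed-length feature vector.
    """
    eigvals = np.linalg.eigvalsh(coulomb_matrix(positions, charges))
    return eigvals[np.argsort(-np.abs(eigvals))]


if __name__ == "__main__":
    # Hypothetical random coordinates standing in for one Au20 isomer.
    rng = np.random.default_rng(0)
    xyz = rng.uniform(0.0, 10.0, size=(20, 3))
    z = np.full(20, AU_ATOMIC_NUMBER)
    features = eigenvalue_features(xyz, z)
    print(features.shape)
```

For a 20-atom cluster this yields a 20-dimensional feature vector per isomer, which could then serve as the input sequence to a regressor such as LightGBM with the energy as the target.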
In this structure, the atoms are evenly distributed, and each atom forms strong Au-Au bonds with multiple neighboring atoms. This ensures high symmetry and an optimized electronic structure, provides strong interatomic bonds, and contributes to a uniform electron cloud distribution, thus imparting the structure with the highest stability. The MSFWOA-LightGBM model not only demonstrates excellent predictive performance in energy prediction but also outperforms other comparative models in terms of prediction accuracy, correlation, stability, and computational efficiency. It also gives valuable insights into using swarm intelligence optimization algorithms for parameter tuning. Furthermore, it offers helpful guidance for applying clusters in chemistry, condensed matter physics, and new energy materials. While the MSFWOA-LightGBM model has yielded satisfactory results in the experiments, there are still some issues to explore when investigating the intricate relationship between energy and clusters. For instance, we can develop novel feature encoding methods to better capture cluster information. We can also explore the integration of machine learning techniques with potential energy functions to enhance predictive performance. In the future, we will continue to deepen our understanding of the correlation between atomic distribution, energy, and structure to promote the development of innovative functional materials.

Fig. 2 Decision tree algorithm schematic. a Level-wise growth strategy; b Leaf-wise growth strategy; c Histogram-based algorithm

Fig. 3 Overview of the method used in this work

[...] the bars in the training set are shorter than those in the test set, with the training set exhibiting smaller MAE, MSE, and RMSE values, indicating lower errors. Additionally, in Fig. 6d, the bars in the training set are slightly taller than those in the test set, and the training set has a higher value of R2 compared to the test set, showing a stronger correlation coefficient. Therefore, the [...]

Fig. 5 Prediction results and error distribution for the MSFWOA-LightGBM model on the training and test sets. a Scatter of predicted and experimental values on the training set; b Scatter of predicted and experimental values on the test set; c The error distribution for the training set; d The error distribution for the test set

Fig. 6 The performance and feature effectiveness of the MSFWOA-LightGBM model. a MAE; b MSE; c RMSE; d R2; e SHAP value

[...] the diagonal marks where the experimental values equal the predicted values, and the distribution of the predictions of the five models around this diagonal is displayed. The predicted values of the MSFWOA-LightGBM model are closely clustered around the diagonal. The bottom-right corner of Fig. 9b shows the error distribution for the five models. The line segments represent the distribution of absolute error values, and the central point represents the average error. The MSFWOA-LightGBM model exhibits the smallest overall average absolute error. As shown in Fig. 9c, the [...]

Fig. 7 Structure of the Au20 cluster with the lowest energy. a Atomic distribution diagram; b Frontal stereogram; c Side stereogram; d Top view; e Front view; f Left view

Fig. 8 Performance comparison of various parameter optimization algorithms. a Distribution of errors; b Values of evaluation metrics; c Comparison of evaluation metrics; d Count distribution of errors

Fig. 9 Results and performance of various machine learning models on the test set. a Vertical distance distribution of predicted and experimental values; b Scatter distribution of predicted and experimental values; c Density distribution of absolute errors; d Distribution of absolute errors; e Line plots of evaluation metrics for different methods

(3) Closed-shell structure. The regular tetrahedral structure achieves a fully filled electron shell through the closed-shell configuration, eliminating instability caused by unpaired electrons. It harmonizes with the geometric construct of the tetrahedron, together forming a highly stable and low-energy [...]

Table 1
Values of the eight statistical indicators for the experimental data

Table 3
Performance of Various Machine Learning Algorithms