The ML_AB file contains H and O; I am training for a system with H, O and C. The run just terminates early, before even a single SCF step gets done.
OK with ML_ISTART = 0.
ML_ISTART = 1 doesn't work with different element types - v6.4.1
-
- Newbie
- Posts: 3
- Joined: Mon Jun 05, 2023 2:01 pm
-
- Global Moderator
- Posts: 473
- Joined: Mon Nov 04, 2019 12:44 pm
Re: ML_ISTART = 1 doesn't work with different element types - v6.4.1
Please send all necessary files to be able to run and check the calculation.
This means POSCAR, POTCAR, KPOINTS, INCAR, OUTCAR, ML_AB, ML_LOGFILE and stdout.
-
- Newbie
- Posts: 3
- Joined: Mon Jun 05, 2023 2:01 pm
Re: ML_ISTART = 1 doesn't work with different element types - v6.4.1
Thanks for the quick reply. See attached files.
-
- Newbie
- Posts: 3
- Joined: Mon Jun 05, 2023 2:01 pm
Re: ML_ISTART = 1 doesn't work with different element types - v6.4.1
I found the reason: there's a significant jump in memory requirements from ML_ISTART = 0 to ML_ISTART = 1.
I increased mem/cpu to 9200 MB, and then it worked.
Hopefully, memory allocation can be improved in the future for ML_ISTART = 1? Or am I missing something?
-
- Global Moderator
- Posts: 473
- Joined: Mon Nov 04, 2019 12:44 pm
Re: ML_ISTART = 1 doesn't work with different element types - v6.4.1
It's hard to do anything about the memory allocation.
Here are some explanations:
At the moment we have to statically allocate memory at the beginning. This is mainly due to the use of shared memory MPI. We have seen several times that problems arise if shared memory MPI needs to be reallocated. I don't know if this problem will ever be solved for all compilers.
So how can the memory grow so much in your case?
1) New element types entered the calculation. We use multidimensional allocatable arrays in Fortran, so the local reference dimension is allocated with the same maximum for all element types. Ideally one wants to have the same number of local reference configurations for all element types. Of course this is often hard to achieve for dopants, where we are limited by the few atoms available as local reference candidates from the training structures. In this case we waste some memory. Your case might belong to that.
2) It's a continuation run, and if you don't specify anything, then min(1500, NSW) is added on top of the already available data. Please see the documentation of ML_MB and ML_MCONF (https://www.vasp.at/wiki/index.php/ML_MB and https://www.vasp.at/wiki/index.php/ML_MCONF). This default has worked quite nicely until now, but if it turns out to be problematic for the majority of users then we will change it.
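To make the two effects above more concrete, here is a minimal Python sketch. It is not taken from the VASP source: the function names and the per-element counts are hypothetical, and it only counts array slots, not bytes. It assumes, as described above, that every element type is padded to the largest per-type count and that an unset limit grows by min(1500, NSW) on top of the existing data.

```python
def padded_slots(n_ref_per_type):
    """Return (allocated, used) slots when every element type
    is padded to the largest per-type count (assumption from the text)."""
    n_types = len(n_ref_per_type)
    allocated = n_types * max(n_ref_per_type.values())
    used = sum(n_ref_per_type.values())
    return allocated, used


def default_continuation_limit(n_existing, nsw):
    """Existing data plus min(1500, NSW), the unset-tag case described above."""
    return n_existing + min(1500, nsw)


# Hypothetical numbers of local reference configurations per element type.
counts = {"H": 1200, "O": 900, "C": 40}
allocated, used = padded_slots(counts)
print(f"allocated={allocated} used={used} wasted={allocated - used}")

# Hypothetical continuation run: 300 existing entries, NSW = 5000.
print(default_continuation_limit(n_existing=300, nsw=5000))
```

With these made-up numbers, the sparsely sampled element (C) is padded to the same size as H, which is where the wasted slots come from.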
What can you do?
1) Check if you compiled with shared memory MPI ("-Duse_shmem").
2) Adjust ML_MB and ML_MCONF.
3) Go to a larger number of compute nodes, since the design matrix, which needs the most memory, is distributed linearly over the number of cores.
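As a rough illustration of point 3, the following Python sketch assumes the design matrix is split evenly over all cores and ignores every other memory consumer. The total size used here is a hypothetical placeholder, not a value from this calculation, and the function names are made up for the example.

```python
import math


def per_core_share_mb(total_mb, n_cores):
    """Per-core share of the design matrix, assuming an even (linear) split."""
    return total_mb / n_cores


def min_cores_for_limit(total_mb, limit_mb_per_core):
    """Smallest core count whose per-core share fits under the given limit."""
    return math.ceil(total_mb / limit_mb_per_core)


# Hypothetical total design-matrix size in MB (assumption for illustration).
total_mb = 400_000
print(per_core_share_mb(total_mb, n_cores=64))
print(min_cores_for_limit(total_mb, limit_mb_per_core=9200))
```

Doubling the number of cores halves the per-core share in this simplified picture, which is why moving to more nodes can help when the per-core memory limit is fixed.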