VASP linear response problem keeps failing. Out of memory?
-
- Newbie
- Posts: 3
- Joined: Mon Jul 15, 2013 12:11 pm
VASP linear response problem keeps failing. Out of memory?
I'm running linear response problem III, on a spin-polarized system of 42 atoms (Fe36N6 supercell). The computation keeps failing without reporting exactly what went wrong.
I suspect that the job is running out of memory. What puzzles me, however, is that this happens after 12 hours or more, and that there is no indication of memory problems in the form of error messages. The job just fails with a segmentation fault. I would expect that if the job failed to allocate the requested memory, it would detect that and print it in the output.
I'm running it on my local workstation (32 GB memory) and on a cluster (single node, 8 processors, 32 GB memory).
Any ideas of how to debug this problem?
I'm not specifying parallelization (NPAR), as that's not supported with linear response problems. VASP fails if I try that.
[INCAR]
ISMEAR = 1
VOSKOWN = 1
ISPIN = 2
MAGMOM = 36*3 6*0.5
PREC = HIGH
EDIFF = 1E-05
LCHARG = .FALSE.
LWAVE = .FALSE.
RANDOM_SEED = 1
IBRION = 8
[KPOINTS]
K-Points
0
Auto
45 ! Length
[Console output]
running on 6 total cores
distrk: each k-point on 6 cores, 1 groups
distr: one band on 1 cores, 6 groups
using from now: INCAR
vasp.5.3.3 18Dez12 (build Aug 09 2013 13:42:53) complex
POSCAR found type information on POSCAR Fe N
POSCAR found : 2 types and 42 ions
scaLAPACK will be used
WARNING: for PREC=h ENMAX is automatically increase by 25 %
this was not the case for versions prior to vasp.4.4
WARNING: for PREC=h ENMAX is automatically increase by 25 %
this was not the case for versions prior to vasp.4.4
LDA part: xc-table for Pade appr. of Perdew
generate k-points for: 6 6 5
POSCAR, INCAR and KPOINTS ok, starting setup
FFT: planning ...
WAVECAR not read
entering main loop
N E dE d eps ncg rms rms(c)
DAV: 1 0.304592402548E+04 0.30459E+04 -0.12490E+05 26568 0.160E+03
DAV: 2 0.987209509215E+02 -0.29472E+04 -0.27687E+04 26568 0.371E+02
DAV: 3 -0.337402588496E+03 -0.43612E+03 -0.35865E+03 28626 0.169E+02
DAV: 4 -0.389066717707E+03 -0.51664E+02 -0.47665E+02 39024 0.567E+01
DAV: 5 -0.390885138501E+03 -0.18184E+01 -0.17954E+01 35520 0.126E+01 0.729E+01
DAV: 6 -0.357217883508E+03 0.33667E+02 -0.44892E+02 32604 0.985E+01 0.532E+01
DAV: 7 -0.341369795827E+03 0.15848E+02 -0.67836E+01 31938 0.436E+01 0.239E+01
DAV: 8 -0.343553785063E+03 -0.21840E+01 -0.14914E+01 33168 0.782E+00 0.125E+01
DAV: 9 -0.342857664658E+03 0.69612E+00 -0.21096E+00 37692 0.554E+00 0.351E+00
DAV: 10 -0.342966148829E+03 -0.10848E+00 -0.66436E-01 30054 0.247E+00 0.145E+00
DAV: 11 -0.342967212784E+03 -0.10640E-02 -0.70143E-02 34548 0.695E-01 0.491E-01
DAV: 12 -0.342972814370E+03 -0.56016E-02 -0.30386E-02 33516 0.474E-01 0.414E-01
DAV: 13 -0.342972013272E+03 0.80110E-03 -0.10009E-03 34116 0.103E-01 0.245E-01
DAV: 14 -0.342971841916E+03 0.17136E-03 -0.14985E-03 37620 0.752E-02 0.763E-02
DAV: 15 -0.342971872052E+03 -0.30136E-04 -0.15219E-04 27000 0.398E-02 0.347E-02
DAV: 16 -0.342971875239E+03 -0.31864E-05 -0.85503E-06 16662 0.850E-03
1 F= -.34297188E+03 E0= -.34297948E+03 d E =0.228175E-01 mag= 87.5534
Linear response reoptimize wavefunctions to high precision
DAV: 1 -0.342971877135E+03 -0.18958E-05 -0.61269E-06 36432 0.698E-03
DAV: 2 -0.342971877147E+03 -0.12173E-07 -0.12102E-07 26946 0.136E-03
DAV: 3 -0.342971877147E+03 -0.18190E-09 -0.10727E-09 14460 0.995E-05
Linear response DOF= 4
Linear response progress:
Degree of freedom: 1/ 4
generate k-points for: 6 6 5
N E dE d eps ncg rms rms(c)
RMM: 1 -0.171116802305E+00 -0.17112E+00 -0.12199E-01172816 0.754E-01
RMM: 2 -0.164793427051E+00 0.63234E-02 -0.41623E-03 99925 0.281E-01 0.829E-01
RMM: 3 -0.169079119482E+00 -0.42857E-02 -0.72983E-03117291 0.252E-01 0.111E+00
RMM: 4 -0.171211599611E+00 -0.21325E-02 -0.85577E-03 93259 0.366E-01 0.122E+00
RMM: 5 -0.164667327575E+00 0.65443E-02 -0.21559E-03 92782 0.191E-01 0.236E-01
RMM: 6 -0.164582709631E+00 0.84618E-04 -0.26977E-04 94502 0.645E-02 0.141E-01
RMM: 7 -0.164622748492E+00 -0.40039E-04 -0.89592E-05 99835 0.384E-02 0.136E-01
RMM: 8 -0.164585936436E+00 0.36812E-04 -0.21273E-06 96321 0.310E-02 0.488E-02
RMM: 9 -0.164633815294E+00 -0.47879E-04 0.43418E-05110349 0.151E-02 0.511E-02
RMM: 10 -0.164632097392E+00 0.17179E-05 0.55945E-05 98532 0.105E-02 0.128E-02
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
vasp 0000000000E62448 Unknown Unknown Unknown
vasp 000000000113FBD7 Unknown Unknown Unknown
vasp 0000000000473791 Unknown Unknown Unknown
vasp 00000000004420DC Unknown Unknown Unknown
libc.so.6 00002B6BA955AEAD Unknown Unknown Unknown
vasp 0000000000441FB9 Unknown Unknown Unknown
--------------------------------------------------------------------------
mpiexec noticed that process rank 3 with PID 13725 on node wheezy2 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Last edited by bakkedal on Sat Nov 09, 2013 5:55 pm, edited 1 time in total.
-
- Hero Member
- Posts: 593
- Joined: Tue Nov 16, 2004 2:21 pm
- License Nr.: 5-67
- Location: Germany
VASP linear response problem keeps failing. Out of memory?
Hi,
I would try
ALGO = N
in the INCAR. But I'm not sure.
Cheers,
alex
Last edited by alex on Mon Nov 11, 2013 5:11 pm, edited 1 time in total.
-
- Newbie
- Posts: 3
- Joined: Mon Jul 15, 2013 12:11 pm
VASP linear response problem keeps failing. Out of memory?
Hi,
That is the default value (http://cms.mpi.univie.ac.at/vasp/vasp/ALGO_tag.html), so it should already be enabled for this run. However, I now have reason to believe that it was actually running out of memory. I logged memory statistics in the background, and the machine was indeed running very low on memory:
[Before the job crashed]
Sat Nov 9 15:15:01 CET 2013
total used free shared buffers cached
Mem: 32169 31677 492 0 155 759
-/+ buffers/cache: 30761 1407
Swap: 0 0 0
[Just after the job crashed]
Sat Nov 9 15:16:01 CET 2013
total used free shared buffers cached
Mem: 32169 3876 28292 0 155 760
-/+ buffers/cache: 2960 29208
Swap: 0 0 0
I'm still puzzled why it fails with a segmentation fault. If the process tries to allocate memory and the operating system can't deliver any more, the process should be able to diagnose that condition and show a proper error message. Maybe this is a bug? This is a linear response DFPT job.
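For anyone who wants to collect the same kind of statistics, here is a minimal sketch of a background logger in Python. It assumes a Linux machine where /proc/meminfo is readable; the function names, the log file name and the 60-second interval are my own choices for illustration, not anything VASP provides.

```python
import re
import time


def parse_meminfo(text):
    """Return the numeric fields of /proc/meminfo as a dict (values in kB)."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"(\w+):\s+(\d+)", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def free_mb(fields):
    """Free memory plus buffers and page cache, in MB."""
    kb = (fields.get("MemFree", 0)
          + fields.get("Buffers", 0)
          + fields.get("Cached", 0))
    return kb // 1024


def log_memory(path="/proc/meminfo", interval=60, logfile="memory.log"):
    """Append one timestamped line per interval until interrupted."""
    while True:
        with open(path) as handle:
            fields = parse_meminfo(handle.read())
        total_mb = fields.get("MemTotal", 0) // 1024
        with open(logfile, "a") as out:
            out.write(f"{time.ctime()} total={total_mb} MB "
                      f"free={free_mb(fields)} MB\n")
        time.sleep(interval)
```

Start it in a second terminal with `log_memory()` before launching the job and stop it with Ctrl-C afterwards. The last few lines of memory.log then show how close the machine was to its limit when the job died.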
Last edited by bakkedal on Mon Dec 02, 2013 11:20 am, edited 1 time in total.
-
- Administrator
- Posts: 2921
- Joined: Tue Aug 03, 2004 8:18 am
- License Nr.: 458
VASP linear response problem keeps failing. Out of memory?
I suppose the diagnosis should be written by the OS (libc.so.6) rather than by VASP itself.
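One place to look for an OS-side diagnosis is the kernel log: if the Linux out-of-memory killer ended a process, it normally leaves an entry there. The following is only a sketch for scanning such a log in Python. The exact message wording differs between kernel versions, so the pattern is an assumption, and an empty result does not rule out a memory problem.

```python
import re

# Assumed wording of the kernel's OOM-killer message; older kernels write
# "Kill process", newer ones "Killed process".
OOM_PATTERN = re.compile(
    r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)"
)


def find_oom_kills(log_text):
    """Return (pid, command) pairs for OOM-killer entries in kernel log text."""
    return [(int(pid), name) for pid, name in OOM_PATTERN.findall(log_text)]
```

Pass it the text printed by `dmesg`, or the contents of the system's kernel log file, and compare any PIDs it returns with the PID reported by mpiexec.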
Last edited by admin on Mon Dec 02, 2013 6:49 pm, edited 1 time in total.
-
- Newbie
- Posts: 15
- Joined: Tue Jan 08, 2008 7:58 pm
VASP linear response problem keeps failing. Out of memory?
I am having the same problem. The program just gets stuck for two days without reporting any problem. It occurs after the first DOF of the linear response is finished.
Last edited by abalone on Sun Jan 12, 2014 3:06 pm, edited 1 time in total.
-
- Hero Member
- Posts: 593
- Joined: Tue Nov 16, 2004 2:21 pm
- License Nr.: 5-67
- Location: Germany
VASP linear response problem keeps failing. Out of memory?
Hi there again,
I would try
ALGO = N
in the INCAR. It looks like DFPT uses the fast algorithm. Check the 'RMM' in the cycles.
Cheers,
alex
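To check which labels appear in the cycles without reading the whole log by eye, a small Python sketch like the one below can tally them. It assumes the console output (or OSZICAR) was saved to a file and that each electronic step starts with a three-letter label such as DAV or RMM followed by a colon and the step number, as in the output posted above. The function name is hypothetical.

```python
import re
from collections import Counter


def count_solver_steps(output_text):
    """Count electronic steps per label (e.g. DAV, RMM) in VASP console output."""
    counts = Counter()
    for line in output_text.splitlines():
        # A step line looks like "DAV: 1 0.3045...E+04 ..." or "RMM: 10 ...".
        match = re.match(r"\s*([A-Z]{3}):\s+\d+\s", line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

For example, `count_solver_steps(open("vasp.out").read())` returns a Counter keyed by label, where vasp.out is a placeholder for wherever the output was redirected.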
Last edited by alex on Mon Jan 13, 2014 10:05 am, edited 1 time in total.
-
- Newbie
- Posts: 1
- Joined: Sat Jan 18, 2014 2:36 pm
- License Nr.: NO LICENSE
VASP linear response problem keeps failing. Out of memory?
You mean Pulay stress? Could you explain more?
Last edited by salina on Sat Jan 18, 2014 2:38 pm, edited 1 time in total.
salina