Fnetdata: Generating a Dataset#

[Input: recipes/fortformat/fnetdata/]

This chapter serves as a tutorial that guides you through creating your first dataset with the Fnetdata Python class. Target values come in two types, global and atomic, and separate sections are dedicated to:

  • global system properties

  • atomic properties

  • and a mixture of global and atomic properties

After this tutorial, you will be able to create a Fortnet-compatible dataset based on the output files of your simulation package of choice (e.g. VASP).

Global Properties#

[Input: recipes/fortformat/fnetdata/globalTargets/]

If training on system-wide, global properties (e.g. the total energy) is desired, this section is an ideal introduction to generating a suitable dataset.

As an application example, the \(E\)-\(V\) scan of a primitive silicon unit cell in the diamond phase is used. The calculations were carried out with the Vienna Ab initio Simulation Package (VASP) [6, 7, 8, 9] for nearest-neighbor distances in the interval \([2.00,3.50]\,\mathrm{Å}\) with a step size of \(0.05\,\mathrm{Å}\). The raw data is stored at recipes/fortformat/fnetdata/globalTargets/vaspdata/ in the form of structure information (POSCAR) and simulation output (OUTCAR). The following Python script shows one possible way to build a dataset containing the total energy of each system from this raw data.

#!/usr/bin/env python3

'''
Application example of the Fnetdata class, based on a dataset
that provides global system properties as target values to fit.
'''

import os
import numpy as np
from fortformat import Fnetdata
from ase.io.vasp import read_vasp, read_vasp_out

def main():
    '''Main driver routine.'''

    nndists = np.arange(2.00, 3.50 + 0.05, 0.05)

    inpaths = [os.path.join(os.getcwd(), 'vaspdata', entry)
               for entry in sorted(os.listdir('vaspdata'))]

    strucs = []
    energies = np.empty((len(inpaths), 1))

    for ii, inpath in enumerate(inpaths):
        strucs.append(read_vasp(os.path.join(inpath, 'POSCAR')))
        props = read_vasp_out(os.path.join(inpath, 'OUTCAR'))
        energies[ii, 0] = props.get_total_energy()

    fnetdata = Fnetdata(atoms=strucs, globaltargets=energies)
    fnetdata.dump('fnetdata.hdf5')

if __name__ == '__main__':
    main()

Following the necessary imports, the main method first generates the nearest-neighbor distances mentioned above (they are not used further in this script). A list comprehension then builds the list of input paths. While iterating over all input paths, an ASE Atoms object is appended to an initially empty list of structures. The individual total energies of the datapoints are stored in a preallocated Numpy array, where the number of rows is determined by the number of datapoints and the number of columns by the number of global targets per datapoint. Finally, an Fnetdata object is instantiated from the gathered information and written to the file fnetdata.hdf5.
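
The shape convention described above can be checked without any VASP data. The following is a minimal sketch that assumes only NumPy. It builds the distance grid with `np.linspace` (an alternative to the `np.arange` call in the script, chosen here because it fixes the number of points explicitly) and allocates a placeholder target array. The energies are dummy zeros, not simulation results.

```python
import numpy as np

# Scan parameters taken from the text: [2.00, 3.50] Å in steps of 0.05 Å.
START, STOP, STEP = 2.00, 3.50, 0.05

# Number of grid points, both interval ends included.
npoints = int(round((STOP - START) / STEP)) + 1

# linspace takes the point count directly, which avoids floating-point
# surprises at the upper bound that a float step in arange can cause.
nndists = np.linspace(START, STOP, npoints)

# One row per datapoint, one column per global target (here: total energy).
# Zeros are placeholders only; the real script fills in the VASP energies.
ntargets = 1
energies = np.zeros((npoints, ntargets))

print(nndists.shape, energies.shape)
```

For the scan used in this section, the grid has 31 points, so the target array has the shape (31, 1).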

Atomic Properties#

[Input: recipes/fortformat/fnetdata/atomicTargets/]

If training on atom-specific properties (e.g. atomic forces or charges) is desired, this section is an ideal introduction to generating a suitable dataset.

As an application example, the \(E\)-\(V\) scan of a primitive silicon unit cell in the diamond phase is used. The calculations were carried out with the Vienna Ab initio Simulation Package (VASP) [6, 7, 8, 9] for nearest-neighbor distances in the interval \([2.00,3.50]\,\mathrm{Å}\) with a step size of \(0.05\,\mathrm{Å}\). The raw data is stored at recipes/fortformat/fnetdata/atomicTargets/vaspdata/ in the form of structure information (POSCAR) and simulation output (OUTCAR). The following Python script shows one possible way to build a dataset containing the total energy per atom of each system from this raw data. Please note that this is for demonstration purposes only and has no direct physical relevance. A more sensible dataset could, for example, contain the atomic forces or charges as targets.

#!/usr/bin/env python3

'''
Application example of the Fnetdata class, based on a dataset
that provides atomic system properties as target values to fit.
'''

import os
import numpy as np
from fortformat import Fnetdata
from ase.io.vasp import read_vasp, read_vasp_out

def main():
    '''Main driver routine.'''

    nndists = np.arange(2.00, 3.50 + 0.05, 0.05)

    inpaths = [os.path.join(os.getcwd(), 'vaspdata', entry)
               for entry in sorted(os.listdir('vaspdata'))]

    strucs = []
    energies = []

    for ii, inpath in enumerate(inpaths):
        struc = read_vasp(os.path.join(inpath, 'POSCAR'))
        strucs.append(struc)
        props = read_vasp_out(os.path.join(inpath, 'OUTCAR'))
        # assign the same share of the total energy to every atom
        tmp = np.empty((len(struc), 1))
        tmp[:, 0] = props.get_total_energy() / len(struc)
        energies.append(tmp)

    fnetdata = Fnetdata(atoms=strucs, atomictargets=energies)
    fnetdata.dump('fnetdata.hdf5')

if __name__ == '__main__':
    main()

The procedure is largely analogous to the global target example above. Following the necessary imports, the main method first generates the nearest-neighbor distances mentioned above. A list comprehension then builds the list of input paths. While iterating over all input paths, an ASE Atoms object is appended to an initially empty list of structures. Since these structures will in general have different numbers of atoms, the target values are stored in a list of Numpy arrays, where the number of rows is determined by the number of atoms and the number of columns by the number of targets per atom. Finally, an Fnetdata object is instantiated from the gathered information and written to the file fnetdata.hdf5.
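
To illustrate the list-of-arrays layout for structures of different size, here is a small sketch that uses made-up atom counts instead of ASE objects. The helper `check_atomic_targets` is hypothetical and not part of Fortformat. It only verifies the shape convention stated above, and it additionally assumes that every structure provides the same number of targets per atom.

```python
import numpy as np


def check_atomic_targets(natoms_per_struc, atomictargets):
    '''Checks the (natoms, ntargets) layout of each target array.

    Hypothetical helper for illustration, not part of Fortformat.
    Returns the common number of targets per atom.
    '''
    if len(natoms_per_struc) != len(atomictargets):
        raise ValueError('Need exactly one target array per structure.')

    ntargets = None
    for natoms, targets in zip(natoms_per_struc, atomictargets):
        targets = np.asarray(targets)
        if targets.ndim != 2 or targets.shape[0] != natoms:
            raise ValueError('Each array must have shape (natoms, ntargets).')
        if ntargets is None:
            ntargets = targets.shape[1]
        elif targets.shape[1] != ntargets:
            # Assumption: the number of targets per atom is constant.
            raise ValueError('Number of targets per atom must be constant.')

    return ntargets


# Made-up atom counts of three structures (placeholders, no real geometries).
natoms_per_struc = [2, 8, 5]

# One array per structure: rows = atoms, columns = targets per atom.
# The values are dummy zeros; three columns could hold force components.
atomictargets = [np.zeros((natoms, 3)) for natoms in natoms_per_struc]

ntargets = check_atomic_targets(natoms_per_struc, atomictargets)
```

In a real script, the atom counts would come from `len(struc)` of each ASE Atoms object, as in the example above.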

Weighting Datapoints#

[Input: recipes/fortformat/fnetdata/weighting/datapoints/]

There are conceivable situations in which weighting individual datapoints makes sense. Inserting a datapoint several times to achieve this is not only cumbersome but also inefficient, since exactly the same input features (e.g. ACSF) and gradients would be calculated multiple times. To circumvent this, Fnetdata and Fortnet offer the possibility of weighting individual datapoints of a dataset. After an Fnetdata object has been instantiated, the desired weights can be handed over via a setter function. The following code snippet shows what this could look like:

# start with homogeneous weighting
weights = np.ones((31,), dtype=int)
# possibly, certain datapoints are more important
weights[4:13] = 3

fnetdata = Fnetdata(atoms=strucs, globaltargets=energies)
fnetdata.weights = weights
fnetdata.dump('fnetdata.hdf5')

For Fortformat to correctly recognize the weights, they must be specified as a one-dimensional list or Numpy array of positive integers. If these requirements are not met, an error message is issued, so that nothing can go terribly wrong (fingers crossed).
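
The stated requirements can also be checked before the weights are handed over. The following sketch assumes only NumPy. The function `validate_weights` is a hypothetical pre-check and not part of Fortformat, and the rule of exactly one weight per datapoint is an assumption made here.

```python
import numpy as np


def validate_weights(weights, ndatapoints):
    '''Returns the weights as an integer array if they are valid.

    Hypothetical pre-check for illustration, not part of Fortformat.
    '''
    weights = np.asarray(weights)
    if weights.ndim != 1:
        raise ValueError('Weights must be one-dimensional.')
    if weights.shape[0] != ndatapoints:
        # Assumption: exactly one weight per datapoint is expected.
        raise ValueError('Need exactly one weight per datapoint.')
    if not np.issubdtype(weights.dtype, np.integer):
        raise ValueError('Weights must be integers.')
    if np.any(weights < 1):
        raise ValueError('Weights must be positive.')
    return weights


# Same weighting as in the snippet above: 31 datapoints, nine emphasized.
weights = np.ones((31,), dtype=int)
weights[4:13] = 3

checked = validate_weights(weights, 31)
```

A float-valued array, a nested list or a weight of zero would each raise a `ValueError` in this sketch.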

Weighting Atomic Gradients#

[Input: recipes/fortformat/fnetdata/weighting/gradients/]

Furthermore, there might be a need to weight the atomic contributions differently in the training process. This makes it possible to change the contribution of specific atoms to the training, as well as to switch off atoms completely if the respective target is not defined. Therefore, Fnetdata and Fortnet offer the possibility of setting atom-resolved weights after an Fnetdata object has been instantiated. The desired weights can be handed over via a setter function. The following code snippet shows what this could look like:

# fix random seed for reproduction purposes
np.random.seed(42)

atomicweights = []
  .
  .
for ii, atom in enumerate(atoms):
        .
        .
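    # float-valued atomic gradient weighting in interval [1, 10]
    # (randint excludes the upper bound, hence 11; np.asfarray
    # was removed in NumPy 2.0, so astype(float) is used instead)
    atomicweights.append(
        np.random.randint(1, 11, len(atom)).astype(float))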

fnetdata = Fnetdata(atoms=strucs, globaltargets=energies)
fnetdata.atomicweights = atomicweights
fnetdata.dump('fnetdata.hdf5')

Alternatively, the weights can be boolean-valued, which allows the contributions of individual atoms to be switched on or off. Currently, these are internally converted to the floats 0.0 (False) and 1.0 (True), so there is no performance advantage. In the future, however, the corresponding gradient calculations will be skipped, yielding a significant performance gain:

# fix random seed for reproduction purposes
np.random.seed(42)
sample = [True, False]

atomicweights = []
  .
  .
for ii, atom in enumerate(atoms):
        .
        .
    # randomly activate/deactivate atomic contributions
    atomicweights.append(np.random.choice(sample, size=len(atom)))

fnetdata = Fnetdata(atoms=strucs, globaltargets=energies)
fnetdata.atomicweights = atomicweights
fnetdata.dump('fnetdata.hdf5')

For Fortformat to correctly recognize the weights, they must be specified as a list of one-dimensional lists or Numpy arrays with values \(\geq 0\). If these requirements are not met, an error message is issued, so that nothing can go terribly wrong (again, fingers crossed).
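
The boolean-to-float conversion and the requirements on atom-resolved weights can be sketched with NumPy alone. The helper `normalize_atomicweights` is hypothetical and not part of Fortformat. It mimics the described behavior under the assumption that exactly one weight per atom is expected.

```python
import numpy as np


def normalize_atomicweights(natoms_per_struc, atomicweights):
    '''Converts atom-resolved weights to float arrays after basic checks.

    Hypothetical helper for illustration, not part of Fortformat.
    '''
    if len(natoms_per_struc) != len(atomicweights):
        raise ValueError('Need exactly one weight array per structure.')

    normalized = []
    for natoms, weights in zip(natoms_per_struc, atomicweights):
        # Boolean entries become 0.0 (False) and 1.0 (True).
        weights = np.asarray(weights).astype(float)
        if weights.ndim != 1 or weights.shape[0] != natoms:
            # Assumption: exactly one weight per atom is expected.
            raise ValueError('Need exactly one weight per atom.')
        if np.any(weights < 0.0):
            raise ValueError('Weights must be greater than or equal to 0.')
        normalized.append(weights)

    return normalized


# Made-up example: two structures with two and three atoms.
natoms_per_struc = [2, 3]
atomicweights = [
    np.array([True, False]),   # second atom switched off
    [1.0, 4.0, 9.0],           # float-valued weights as a plain list
]

normalized = normalize_atomicweights(natoms_per_struc, atomicweights)
```

Here the first structure ends up with the weights 1.0 and 0.0, while the second keeps its float values unchanged.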

External Atomic Features#

[Input: recipes/fortformat/fnetdata/extfeatures/]

Since Atom-Centered Symmetry Functions (ACSF) are currently the only available mapping from geometries to suitable network inputs, Fnetdata and Fortnet offer the possibility of processing user-specified external atomic features. Thus any conceivable kind of input feature can be used in the training and prediction process, which significantly expands the versatility of Fortnet. Of course, the user is responsible for checking the suitability of the features handed over. Charge-related quantities such as the Mulliken populations of the individual atoms are one conceivable choice.

The transfer of the selected features to the Fnetdata class is straightforward. The keyword argument features expects a list of Numpy arrays, where the first dimension corresponds to the number of atoms of the associated geometry and the second to the number of features per atom. The example below shows what such a specification could look like. Random numbers are used as features, so they serve a demonstrative purpose only.

#!/usr/bin/env python3

'''
Application example of the Fnetdata class, based on a dataset
that provides global system properties as target values to fit.
The dataset is extended by user specified external atomic features.
'''

import os
import numpy as np
from fortformat import Fnetdata
from ase.io.vasp import read_vasp, read_vasp_out

def main():
    '''Main driver routine.'''

    np.random.seed(42)

    inpaths = [os.path.join(os.getcwd(), '../globalTargets/vaspdata', entry)
               for entry in sorted(os.listdir('../globalTargets/vaspdata'))]

    strucs = []
    features = []
    energies = np.empty((len(inpaths), 1))

    for ii, inpath in enumerate(inpaths):
        struc = read_vasp(os.path.join(inpath, 'POSCAR'))
        strucs.append(struc)
        props = read_vasp_out(os.path.join(inpath, 'OUTCAR'))
        energies[ii, 0] = props.get_total_energy()
        features.append(np.random.random_sample((len(struc), 3)))

    fnetdata = Fnetdata(atoms=strucs, globaltargets=energies,
                        features=features)
    fnetdata.dump('fnetdata.hdf5')

if __name__ == '__main__':
    main()
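
Instead of random numbers, the feature arrays would typically be assembled from per-atom quantities. The following is a minimal sketch, assuming only NumPy, of how several atom-resolved quantities could be stacked into the expected (atoms, features) layout. All numbers are made-up placeholders, and the variable names `populations` and `descriptors` are hypothetical.

```python
import numpy as np

# Made-up per-atom quantities for two structures (placeholders, not results).
# The first could hold Mulliken populations, the second any other
# atom-resolved descriptor.
populations = [np.array([4.0, 4.0]), np.array([3.9, 4.1, 4.0])]
descriptors = [np.array([0.5, 0.5]), np.array([0.2, 0.7, 0.1])]

# One array per structure with shape (natoms, nfeatures): each per-atom
# quantity becomes one column.
features = [np.column_stack((pop, desc))
            for pop, desc in zip(populations, descriptors)]

# All structures should provide the same number of features per atom
# (assumed here, since the network input size is fixed).
nfeatures = {feat.shape[1] for feat in features}
assert len(nfeatures) == 1

for feat in features:
    print(feat.shape)
```

The resulting list has the same layout as the `features` list in the script above and could be passed to the `features` keyword argument in the same way.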