Generating a Dataset

[Input: recipes/fnetdata/]

This chapter is a tutorial that guides you through creating your first dataset with the Fortformat Python class. Two types of target values are distinguished, each covered in its own section:

  • global system properties

  • and atomic properties

After this tutorial, you will be able to create a Fortnet-compatible dataset from the output files of your simulation package of choice (e.g. VASP).

Fortformat: Basic Fortnet Input Format Class

This Python package provides a basic class which implements the associated Fortnet input file format. The input features and targets, stored in lists of Numpy arrays, conveniently get dumped to disk as fnetdata.xml files.

Installation

Please note that this package has been tested with Python 3.X. It additionally requires Numerical Python (the Numpy module).

Since the Fortformat class expects, among other inputs, the so-called Atoms objects of the Atomic Simulation Environment (ASE), this dependency will sooner or later have to be satisfied as well.

System install

You can install the script package via the standard Python setup mechanism. If you want to install it system-wide into your normal Python installation, you simply issue

python setup.py install

in the tools/fortformat/ directory, with an appropriate level of permission.

Local install

Alternatively, you can install it locally in your home space, e.g.:

python setup.py install --user

If the local Python install directory is not in your path, you should add it. For the bash shell, include the following line in your .bashrc:

export PATH=$PATH:/home/user/.local/bin
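
To find out which directory a local install uses on your machine, Python's standard site module can be queried. The following is only a small sketch: the bin subdirectory is the usual script location on Linux and is an assumption here, so other platforms may differ.

```python
import os
import site

# Base directory of per-user installations (e.g. ~/.local on Linux)
user_base = site.getuserbase()
# Assumption: scripts are placed in the 'bin' subdirectory (typical on Linux)
user_bin = os.path.join(user_base, 'bin')

# Check whether this directory is already listed in the PATH variable
path_entries = os.environ.get('PATH', '').split(os.pathsep)
on_path = user_bin in path_entries

print('user script directory:', user_bin)
print('already on PATH:', on_path)
```

If the second line reports False, the export statement above (with the printed directory) is what needs to go into the .bashrc.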

Global Properties

[Input: recipes/fnetdata/globalTargets/]

If you want to train on system-wide, global properties (e.g. the total energy), this section introduces how to generate a suitable dataset.

As an application example, the \(E\)-\(V\) scan of a primitive silicon unit cell in the diamond phase is used. The calculations were carried out with the Vienna Ab initio Simulation Package (VASP) [6, 7, 8, 9] for nearest-neighbor distances in the interval \([2.00,3.50]\,\mathrm{Å}\) with a step size of \(0.05\,\mathrm{Å}\). The raw data is stored at recipes/fnetdata/globalTargets/vaspdata/ in the form of structure information (POSCAR) and simulation output (OUTCAR). The following Python script shows one possible way to get a dataset containing the total energy of the respective system, based on this raw data.

#!/usr/bin/env python3

'''
Application example of the Fortformat class, based on a dataset
that provides global system properties as target values to fit.
'''

import os
import numpy as np
from fortformat import Fortformat
from ase.io.vasp import read_vasp, read_vasp_out

def main():
    '''Main driver routine.'''

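    # 31 distances from 2.00 to 3.50 in steps of 0.05; np.linspace avoids the
    # floating-point endpoint ambiguity of np.arange with a non-integer step
    nndists = np.linspace(2.00, 3.50, 31)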

    inpaths = [os.path.join(os.getcwd(), 'vaspdata', entry)
               for entry in sorted(os.listdir('vaspdata'))]
    outpaths = [os.path.join(os.getcwd(), 'dataset', 'nndist_{:.3f}'
                             .format(nndist)) for nndist in nndists]

    strucs = []
    energies = np.empty((len(inpaths), 1))

    for ii, inpath in enumerate(inpaths):
        strucs.append(read_vasp(os.path.join(inpath, 'POSCAR')))
        props = read_vasp_out(os.path.join(inpath, 'OUTCAR'))
        energies[ii, 0] = props.get_total_energy()

    fnetdata = Fortformat(strucs, outpaths, targets=energies,
                          atomic=False, frac=True)
    fnetdata.dump()

if __name__ == '__main__':
    main()

Following the necessary imports, the main method first generates the nearest-neighbor distances mentioned above. Two list comprehensions then build the lists of input and output paths. While iterating over all input paths, one ASE Atoms object per datapoint is appended to an initially empty list of structures. The individual total energies are written into a preallocated Numpy array, whose number of rows is determined by the number of datapoints and whose number of columns by the number of global targets per datapoint. Finally, a Fortformat object is instantiated from the gathered information, together with keyword arguments that specify whether atomic properties are present (default: False) and whether the coordinates should be saved in fractional rather than absolute format (default: False).
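
The array layout described above can be tried out without any VASP data. The following sketch uses made-up placeholder energies (not simulation results) and np.linspace to obtain exactly 31 distances; it only illustrates the expected shapes and path names and does not call Fortformat itself.

```python
import os
import numpy as np

# 31 nearest-neighbor distances from 2.00 to 3.50 (step size 0.05)
nndists = np.linspace(2.00, 3.50, 31)

# One output directory per datapoint, named after the distance
outpaths = [os.path.join('dataset', 'nndist_{:.3f}'.format(nndist))
            for nndist in nndists]

# Placeholder energies (made-up values, NOT VASP results)
fake_energies = -10.0 + (nndists - 2.35)**2

# Global targets: one row per datapoint, one column per target
energies = np.empty((len(nndists), 1))
energies[:, 0] = fake_energies

print(energies.shape)
print(outpaths[0], outpaths[-1])
```

The number of rows of the target array has to match the number of structures and output paths, which is worth checking before the dataset is dumped.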

Atomic Properties

[Input: recipes/fnetdata/atomicTargets/]

If you want to train on atom-specific properties (e.g. atomic forces), this section introduces how to generate a suitable dataset.

As an application example, the \(E\)-\(V\) scan of a primitive silicon unit cell in the diamond phase is used. The calculations were carried out with the Vienna Ab initio Simulation Package (VASP) [6, 7, 8, 9] for nearest-neighbor distances in the interval \([2.00,3.50]\,\mathrm{Å}\) with a step size of \(0.05\,\mathrm{Å}\). The raw data is stored at recipes/fnetdata/atomicTargets/vaspdata/ in the form of structure information (POSCAR) and simulation output (OUTCAR). The following Python script shows one possible way to get a dataset containing the total energy per atom of the respective system, based on this raw data. Please note that this is for demonstration purposes only and has no direct physical relevance. A more sensible dataset could, for example, contain the atomic forces as targets.

#!/usr/bin/env python3

'''
Application example of the Fortformat class, based on a dataset
that provides atomic system properties as target values to fit.
'''

import os
import numpy as np
from fortformat import Fortformat
from ase.io.vasp import read_vasp, read_vasp_out

def main():
    '''Main driver routine.'''

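    # 31 distances from 2.00 to 3.50 in steps of 0.05; np.linspace avoids the
    # floating-point endpoint ambiguity of np.arange with a non-integer step
    nndists = np.linspace(2.00, 3.50, 31)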

    inpaths = [os.path.join(os.getcwd(), 'vaspdata', entry)
               for entry in sorted(os.listdir('vaspdata'))]
    outpaths = [os.path.join(os.getcwd(), 'dataset', 'nndist_{:.3f}'
                             .format(nndist)) for nndist in nndists]

    strucs = []
    energies = []

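    for inpath in inpaths:
        struc = read_vasp(os.path.join(inpath, 'POSCAR'))
        strucs.append(struc)
        props = read_vasp_out(os.path.join(inpath, 'OUTCAR'))
        # one row per atom, one column per atomic target
        tmp = np.empty((len(struc), 1))
        # distribute the total energy evenly over the atoms of the cell
        # (two atoms in the primitive silicon unit cell)
        tmp[:, 0] = props.get_total_energy() / len(struc)
        energies.append(tmp)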

    fnetdata = Fortformat(strucs, outpaths, targets=energies,
                          atomic=True, frac=True)
    fnetdata.dump()

if __name__ == '__main__':
    main()

The procedure is nearly analogous to the global target example above: following the necessary imports, the main method first generates the nearest-neighbor distances mentioned above. Two list comprehensions then build the lists of input and output paths. While iterating over all input paths, one ASE Atoms object per datapoint is appended to an initially empty list of structures. Since the structures will in general have different numbers of atoms, the target values are stored in a list of Numpy arrays, each with the number of rows determined by the number of atoms and the number of columns by the number of targets per atom. Finally, a Fortformat object is instantiated from the gathered information, together with keyword arguments that specify whether atomic properties are present (default: False) and whether the coordinates should be saved in fractional rather than absolute format (default: False).
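
The difference to the global case is easiest to see for structures of different size. The sketch below uses hypothetical atom counts and made-up energies (not simulation results) to build such a list of per-structure target arrays; it does not call Fortformat itself.

```python
import numpy as np

# Hypothetical atom counts of three structures (illustration only)
natoms_per_struc = [2, 8, 16]
# Made-up total energies, one per structure (NOT simulation results)
total_energies = [-10.8, -43.4, -86.7]

targets = []
for natoms, etot in zip(natoms_per_struc, total_energies):
    # one row per atom, one column per atomic target
    tmp = np.empty((natoms, 1))
    tmp[:, 0] = etot / natoms
    targets.append(tmp)

for tmp in targets:
    print(tmp.shape)
```

A single two-dimensional array would not work here, because the number of rows differs from structure to structure.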

Weighting Datapoints

[Input: recipes/fnetdata/weighting/]

There are many situations in which weighting individual datapoints makes sense. The workaround of inserting a datapoint multiple times is not only cumbersome but also inefficient, since exactly the same input features (e.g. ACSF) and gradients would be calculated multiple times. To circumvent this, Fortformat and Fortnet offer the possibility of individually weighting certain datapoints of a dataset. After a Fortformat object has been instantiated, the desired weights can be handed over via a setter function. The following code snippet shows what this could look like:

# start with homogeneous weighting
weights = np.ones((31,), dtype=int)
# possibly, certain datapoints are more important
weights[4:13] = 3

fnetdata = Fortformat(strucs, outpaths, targets=energies,
                      atomic=False, frac=True)
fnetdata.weights = weights
fnetdata.dump()

For Fortformat to correctly recognize the weights, they must be specified as a one-dimensional list or Numpy array of positive integers. If these requirements are not met, an error message is issued, so that nothing can go terribly wrong (fingers crossed).
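
If you prefer to catch malformed weights before they reach Fortformat, the stated requirements are easy to check by hand. The helper check_weights below is hypothetical and not part of Fortformat; the additional requirement of one weight per datapoint is an assumption of this sketch.

```python
import numpy as np

def check_weights(weights, ndatapoints):
    '''Hypothetical helper: checks the requirements stated above.'''
    weights = np.asarray(weights)
    if weights.ndim != 1:
        raise ValueError('weights must be one-dimensional')
    # Assumption of this sketch: exactly one weight per datapoint
    if weights.shape[0] != ndatapoints:
        raise ValueError('one weight per datapoint is required')
    if not np.issubdtype(weights.dtype, np.integer):
        raise ValueError('weights must be integers')
    if np.any(weights < 1):
        raise ValueError('weights must be positive')
    return weights

# Same weights as in the snippet above
weights = np.ones((31,), dtype=int)
weights[4:13] = 3
checked = check_weights(weights, 31)
```

Floating-point weights such as [1.0, 2.0], or weights containing zero, are rejected by this helper with a ValueError.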

Contiguous Dataset File

[Input: recipes/fnetdata/contiguous/]

If the dataset contains several hundred thousand or even millions of datapoints, individual files that each contain a single datapoint become impractical. To address this, both Fortformat and Fortnet support a contiguous format. If only a single path is specified as a string, instead of a list of output paths, Fortformat automatically switches to the contiguous format and writes a single file to the specified location. The modified lines of code from the former section on global targets would look like this:

fnetdata = Fortformat(strucs, 'fnetdata.xml', targets=energies,
                      atomic=False, frac=True)
fnetdata.dump()

For Fortnet to correctly recognize the contiguous format, the path of the file containing the primary dataset (Dataset) must end with fnetdata.xml and that of the secondary dataset (Validset) must end with fnetvdata.xml. A correct specification in the fortnet_in.hsd input file could therefore look similar to this:

Data {
     .
     .
     .
  Dataset = '/home/user/training/fnetdata.xml'
  Validset = '/home/user/validation/fnetvdata.xml'
}
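
The naming rule can be checked in a script before Fortnet is started. The helper is_contiguous_dataset below is hypothetical and only tests the file name endings described above; it does not inspect the file contents.

```python
def is_contiguous_dataset(path, validation=False):
    '''Hypothetical helper: checks the file name ending described above.'''
    suffix = 'fnetvdata.xml' if validation else 'fnetdata.xml'
    return path.endswith(suffix)

# Paths taken from the input example above
dataset = '/home/user/training/fnetdata.xml'
validset = '/home/user/validation/fnetvdata.xml'

print(is_contiguous_dataset(dataset))
print(is_contiguous_dataset(validset, validation=True))
```

Note that the two endings are not interchangeable: a path ending in fnetvdata.xml does not end in fnetdata.xml, and vice versa.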

External Atomic Features

[Input: recipes/fnetdata/extfeatures/]

Since Atom-Centered Symmetry Functions (ACSF) are currently the only available mapping of geometries to suitable network inputs, Fortformat and Fortnet both offer the possibility of processing user-specified external atomic features. Thus any conceivable kind of input feature can be used in the training and prediction process, which significantly expands the versatility of Fortnet. Of course, the user is responsible for checking the suitability of the features handed over. Charge specifications such as the Mulliken populations of the individual atoms are one conceivable example.

The transfer of the selected features to the Fortformat class is straightforward. The keyword argument features expects a list of Numpy arrays, where the first dimension corresponds to the number of atoms of the associated geometry and the second to the number of features per atom. The example below shows what such a specification could look like. Random numbers are used as features, which therefore serve a demonstrative purpose only.

#!/usr/bin/env python3

'''
Application example of the Fortformat class, based on a dataset
that provides global system properties as target values to fit.
The dataset is extended by user specified external atomic features.
'''

import os
import numpy as np
from fortformat import Fortformat
from ase.io.vasp import read_vasp, read_vasp_out

def main():
    '''Main driver routine.'''

    np.random.seed(42)

    inpaths = [os.path.join(os.getcwd(), '../globalTargets/vaspdata', entry)
               for entry in sorted(os.listdir('../globalTargets/vaspdata'))]

    strucs = []
    features = []
    energies = np.empty((len(inpaths), 1))

    for ii, inpath in enumerate(inpaths):
        struc = read_vasp(os.path.join(inpath, 'POSCAR'))
        strucs.append(struc)
        props = read_vasp_out(os.path.join(inpath, 'OUTCAR'))
        energies[ii, 0] = props.get_total_energy()
        features.append(np.random.random_sample((len(struc), 3)))

    fnetdata = Fortformat(strucs, 'fnetdata.xml', targets=energies,
                          features=features, atomic=False, frac=True)
    fnetdata.dump()

if __name__ == '__main__':
    main()
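
Since the user is responsible for the features handed over, a quick shape check before instantiating Fortformat can save some debugging. The helper check_features below is hypothetical and not part of Fortformat; that every structure uses the same number of features per atom is an assumption of this sketch.

```python
import numpy as np

def check_features(natoms_per_struc, features):
    '''Hypothetical helper: checks the shapes described above.'''
    if len(features) != len(natoms_per_struc):
        raise ValueError('one feature array per structure is required')
    nfeatures = None
    for natoms, feat in zip(natoms_per_struc, features):
        feat = np.asarray(feat)
        # first dimension: atoms, second dimension: features per atom
        if feat.ndim != 2 or feat.shape[0] != natoms:
            raise ValueError('expected shape (natoms, nfeatures)')
        # Assumption of this sketch: same feature count for all structures
        if nfeatures is None:
            nfeatures = feat.shape[1]
        elif feat.shape[1] != nfeatures:
            raise ValueError('number of features per atom differs')
    return nfeatures

# Random demonstration features for three hypothetical structures
rng = np.random.default_rng(42)
natoms_per_struc = [2, 2, 8]
features = [rng.random((natoms, 3)) for natoms in natoms_per_struc]
nfeatures = check_features(natoms_per_struc, features)
print(nfeatures)
```

In the script above, the atom counts would be obtained as len(struc) for each ASE Atoms object in strucs.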