molml.atom module

A module to compute atom based representations.

This module contains a variety of methods to extract features from molecules based on the atoms in the molecule. This means that every molecule will result in an array of values with shape (n_atoms, n_features).

class molml.atom.Shell(input_type='list', n_jobs=1, depth=1, use_coordination=False, add_unknown=False)

Bases: molml.base.SetMergeMixin, molml.base.BaseFeature

A feature that counts the number of elements in a distance shell from the starting atom. This is similar to the features developed in Qu et al., with the exception that it is atom-based rather than bond-based.
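As a rough illustration of the idea (not MolML's implementation), the sketch below counts the elements found exactly depth bonds away from a starting atom using a breadth-first search. The function name shell_counts and the adjacency-dictionary input are hypothetical, and treating a "shell" as the atoms at exactly that bond distance is an assumption.

```python
from collections import Counter, deque


def shell_counts(elements, adjacency, start, depth=1):
    """Count the elements exactly `depth` bonds away from `start`.

    `elements` is a list of element symbols and `adjacency` maps each atom
    index to the indices of its bonded neighbors. Both input formats are
    hypothetical choices made for this sketch.
    """
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        if dist[atom] == depth:
            continue  # do not expand past the requested shell
        for nbr in adjacency[atom]:
            if nbr not in dist:
                dist[nbr] = dist[atom] + 1
                queue.append(nbr)
    return Counter(elements[i] for i, d in dist.items() if d == depth)


# Methanol: C(0) is bonded to O(1) and H(2..4); O(1) is bonded to H(5).
elements = ["C", "O", "H", "H", "H", "H"]
adjacency = {0: [1, 2, 3, 4], 1: [0, 5], 2: [0], 3: [0], 4: [0], 5: [1]}
first_shell = shell_counts(elements, adjacency, start=0, depth=1)
second_shell = shell_counts(elements, adjacency, start=0, depth=2)
```

Repeating this for every atom in the molecule would give one row of counts per atom, which matches the (n_atoms, n_features) layout described at the top of this module.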

Parameters:
input_type : string, default=’list’

Specifies the format the input values will be (must be one of ‘list’ or ‘filename’).

n_jobs : int, default=1

Specifies the number of processes to create when generating the features. Positive numbers specify a specific number of processes, and numbers less than 1 will use the number of cores the computer has.

depth : int, default=1

The length of the atom chains to generate for connections.

use_coordination : boolean, default=False

Specifies whether or not to use the coordination number of the atoms (C1 vs C2 vs C3 vs C4).

add_unknown : boolean, default=False

Specifies whether or not to include an extra UNKNOWN count in the feature vector.

References

Qu, X.; Latino, D. A.; Aires-de Sousa, J. A Big Data Approach to the Ultra-fast Prediction of DFT-calculated Bond Energies. J. Cheminf. 2013, 5, 34.

Attributes:
_elements : tuple

All the elements/types that are in the fit molecules.

ATTRIBUTES = ('_elements',)
LABELS = (('get_shell_labels', '_elements'),)
get_shell_labels(self, elements)
class molml.atom.LocalEncodedBond(input_type='list', n_jobs=1, segments=100, smoothing='norm', start=0.2, end=6.0, slope=20.0, min_depth=0, max_depth=0, spacing='linear', form=1, add_unknown=False, use_comb_idxs=False)

Bases: molml.base.FormMixin, molml.base.SetMergeMixin, molml.base.EncodedFeature

A smoothed histogram of atomic distances.

This is a method to generalize the idea of bond counting. Instead of seeing bonds as a discrete count that is thresholded at a given length, they are seen as general distance histograms. This is supplemented with smoothing functions. This is a slight modification of the EncodedBond to use with atoms.
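The following is a minimal sketch of a smoothed distance histogram for a single atom, assuming linear spacing and a standard normal density for the 'norm' smoothing. The helper names norm_pdf and smoothed_histogram are hypothetical, and the way slope is applied (multiplying the offset between bin position and distance before smoothing) follows the parameter description below rather than the library source.

```python
import math


def norm_pdf(x):
    """Standard normal density, standing in for the 'norm' smoothing."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def smoothed_histogram(distances, segments=100, start=0.2, end=6.0,
                       slope=20.0, smoothing=norm_pdf):
    """Return bin positions and smoothed counts for a list of distances."""
    step = (end - start) / (segments - 1)
    centers = [start + i * step for i in range(segments)]
    # Each distance contributes a smooth bump centered on itself, so
    # nearby bins share the count instead of a single bin taking all of it.
    hist = [sum(smoothing(slope * (c - r)) for r in distances)
            for c in centers]
    return centers, hist


# Distances (in angstroms) from one atom to three of its neighbors.
centers, hist = smoothed_histogram([1.1, 1.1, 2.4])
peak = centers[hist.index(max(hist))]
```

A larger slope narrows each bump, which moves the result closer to a hard-thresholded bond count.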

Parameters:
input_type : string, default=’list’

Specifies the format the input values will be (must be one of ‘list’ or ‘filename’).

n_jobs : int, default=1

Specifies the number of processes to create when generating the features. Positive numbers specify a specific number of processes, and numbers less than 1 will use the number of cores the computer has.

segments : int, default=100

The number of bins/segments to use when generating the histogram.

smoothing : string or callable, default=’norm’

A string or callable to use to smooth the histogram values. If a callable is given, it must take just a single argument that is a float. For a list of supported default functions look at SMOOTHING_FUNCTIONS.

start : float, default=0.2

The starting point for the histogram sampling in angstroms.

end : float, default=6.0

The ending point for the histogram sampling in angstroms.

slope : float, default=20.

A parameter to tune the smoothing values. This is applied as a multiplication before calling the smoothing function.

min_depth : int, default=0

A parameter to set the minimum geodesic distance to include in the interactions. A value of np.inf signifies including only intermolecular interactions.

max_depth : int, default=0

A parameter to set the maximum geodesic distance to include in the interactions. A value of 0 signifies that all interactions are included.

spacing : string or callable, default=’linear’

The histogram interval spacing type. Must be one of “linear”, “inverse”, or “log”. Linear spacing is normal spacing. For inverse spacing, the distances are evaluated as 1/r and the start and end points are 1/x. For log spacing, the distances are evaluated as numpy.log(r) and the start and end points are numpy.log(x). If the value is callable, then it should take a float or vector of floats and return a mapping similar to the other methods.

form : int, default=1

The histogram splitting style to use. This value changes the scaling of this method to be O(E) or O(1) for 1 or 0 respectively (where E is the number of elements).

add_unknown : boolean, default=False

Specifies whether or not to include an extra UNKNOWN count in the feature vector.

use_comb_idxs : bool, default=False

Whether or not to use all combinations of indices when doing the subselection. If this is false, a middle out scheme will be used.

Attributes:
_elements : tuple

A tuple of all the elements in the fit molecules.

ATTRIBUTES = ('_elements',)
LABELS = (('get_encoded_labels', '_elements'),)
class molml.atom.LocalEncodedAngle(input_type='list', n_jobs=1, segments=100, smoothing='norm', slope=20.0, min_depth=0, max_depth=0, r_cut=6.0, form=2, add_unknown=False)

Bases: molml.base.FormMixin, molml.base.SetMergeMixin, molml.base.EncodedFeature

A smoothed histogram of atomic angles.

This method is similar to EncodedBond but for angles in molecules. This is done by enumerating triplets of atoms and computing the angle between them. The bins are then smoothed with smoothing functions. This is a slight modification of the EncodedAngle to work with single atoms at a time. This sets the vertex of the angle to be the atom being examined.

Note: The angles used are 0 to \(\pi\).
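To make the vertex convention concrete, here is a small sketch that computes the angle at a central atom between two other atoms. The function name vertex_angle is hypothetical and the code is only an illustration of the geometry, not the library's implementation.

```python
import math


def vertex_angle(center, a, b):
    """Angle in radians, in [0, pi], at `center` between atoms `a` and `b`."""
    u = [p - q for p, q in zip(a, center)]
    v = [p - q for p, q in zip(b, center)]
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    # Clamp to guard against round-off pushing the cosine outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))


right_angle = vertex_angle((0, 0, 0), (1, 0, 0), (0, 1, 0))
straight = vertex_angle((0, 0, 0), (1, 0, 0), (-2, 0, 0))
```

Because math.acos returns values between 0 and pi, the result always falls in the range stated in the note above.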

Parameters:
input_type : string, default=’list’

Specifies the format the input values will be (must be one of ‘list’ or ‘filename’).

n_jobs : int, default=1

Specifies the number of processes to create when generating the features. Positive numbers specify a specific number of processes, and numbers less than 1 will use the number of cores the computer has.

segments : int, default=100

The number of bins/segments to use when generating the histogram.

smoothing : string or callable, default=’norm’

A string or callable to use to smooth the histogram values. If a callable is given, it must take just a single argument that is a float. For a list of supported default functions look at SMOOTHING_FUNCTIONS.

slope : float, default=20.

A parameter to tune the smoothing values. This is applied as a multiplication before calling the smoothing function.

min_depth : int, default=0

A parameter to set the minimum geodesic distance to include in the interactions. A value of np.inf signifies including only intermolecular interactions.

max_depth : int, default=0

A parameter to set the maximum geodesic distance to include in the interactions. A value of 0 signifies that all interactions are included.

form : int, default=2

The histogram splitting style to use. This value changes the scaling of this method to be O(E^2), O(E), or O(1) for 2, 1, or 0 respectively (where E is the number of elements).

add_unknown : boolean, default=False

Specifies whether or not to include an extra UNKNOWN count in the feature vector.

use_comb_idxs : bool, default=False

Whether or not to use all combinations of indices when doing the subselection. If this is false, a middle out scheme will be used.

Attributes:
_pairs : tuple

A tuple of all the element pairs in the fit molecules.

ATTRIBUTES = ('_pairs',)
LABELS = (('get_encoded_labels', '_pairs'),)
f_c(self, R)
class molml.atom.LocalCoulombMatrix(input_type='list', n_jobs=1, max_occupancy=4, r_cut=10.0, alpha=6, use_reduced=False, use_decay=False)

Bases: molml.base.BaseFeature

An implementation of the Coulomb Matrix where only the local atom environment is used by using a cutoff radius.

Parameters:
input_type : string, default=’list’

Specifies the format the input values will be (must be one of ‘list’ or ‘filename’).

n_jobs : int, default=1

Specifies the number of processes to create when generating the features. Positive numbers specify a specific number of processes, and numbers less than 1 will use the number of cores the computer has.

max_occupancy : int, default=4

The maximum number of atoms to be included in the local environment.

r_cut : float, default=10.0

The maximum distance allowed for atoms to be considered local to the “central atom”.

alpha : number, default=6

The exponent applied to the distance term in the Coulomb matrix.

use_reduced : bool, default=False

This setting uses only the first row of the local coulomb matrix and the diagonal. This reduces the feature from scaling as O(max_occupancy ** 2) to just O(max_occupancy).

use_decay : bool, default=False

This setting defines an extra decay for the values as they get further away from the “central atom”. This is to alleviate issues that arise as atoms enter or leave the cutoff radius.

\[\begin{split}M_{ij} = \begin{cases} \frac{Z_{p_i} Z_{p_j}}{(\| R_{p_1} - R_{p_i} \|_2 + \| R_{p_1} - R_{p_j} \|_2 + \| R_{p_i} - R_{p_j} \|_2 )^{\alpha}}, & i \neq j \\ 0.5 Z_{p_i}^{2.4} & i = j \end{cases}\end{split}\]
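The formula above can be evaluated directly for a small environment. The sketch below is a hedged illustration only: the function name local_coulomb is hypothetical, index 0 is assumed to be the central atom (p_1 in the formula), and no cutoff radius or max_occupancy padding is applied.

```python
import math


def local_coulomb(numbers, coords, alpha=6):
    """Evaluate M_ij from the formula above.

    `numbers` are atomic numbers and `coords` are Cartesian positions,
    both ordered so that index 0 is the central atom.
    """
    n = len(numbers)
    center = coords[0]
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i][j] = 0.5 * numbers[i] ** 2.4
            else:
                # Sum of the three pairwise distances among the central
                # atom, atom i, and atom j.
                total = (math.dist(center, coords[i])
                         + math.dist(center, coords[j])
                         + math.dist(coords[i], coords[j]))
                M[i][j] = numbers[i] * numbers[j] / total ** alpha
    return M


# An oxygen at the origin with two hydrogens at unit distance.
numbers = [8, 1, 1]
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
M = local_coulomb(numbers, coords, alpha=1)
```

With alpha=1 the entry between the central oxygen and either hydrogen is 8 / (0 + 1 + 1), which makes the role of the summed distances easy to check by hand.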

References

Barker, J.; Bulin, J.; Hamaekers, J.; Mathias, S. Localized Coulomb Descriptors for the Gaussian Approximation Potential. arXiv 1611.05126

ATTRIBUTES = None
LABELS = (('get_local_coulomb_labels', None),)
fit(self, X, y=None)

No fitting is required because it is defined by the parameters.

get_local_coulomb_labels(self)
class molml.atom.BehlerParrinello(input_type='list', n_jobs=1, r_cut=6.0, r_s=1.0, eta=1.0, lambda_=1.0, zeta=1.0)

Bases: molml.base.SetMergeMixin, molml.base.BaseFeature

An implementation of the descriptors used in Behler-Parrinello Neural Networks.

Parameters:
input_type : string, default=’list’

Specifies the format the input values will be (must be one of ‘list’ or ‘filename’).

n_jobs : int, default=1

Specifies the number of processes to create when generating the features. Positive numbers specify a specific number of processes, and numbers less than 1 will use the number of cores the computer has.

r_cut : float, default=6.

The maximum distance allowed for atoms to be considered local to the “central atom”.

r_s : float, default=1.0

An offset parameter for computing gaussian values between pairwise distances.

eta : float, default=1.0

A decay parameter for the gaussian distances.

lambda_ : float, default=1.0

This value sets the orientation of the cosine function for the angles. It should only take values in {-1., 1.}.

zeta : float, default=1.0

A decay parameter for the angular terms.

References

Behler, J.; Parrinello, M. Generalized Neural-Network Representation of High-Dimensional Potential-Energy Surfaces. Phys. Rev. Lett. 2007, 98, 146401.

Attributes:
_elements : tuple

A set of all the elements in the molecules.

_element_pairs : tuple

A set of all the element pairs in the molecules.

ATTRIBUTES = ('_elements', '_element_pairs')
LABELS = ('_elements', ('get_chain_labels', '_element_pairs'))
calculate_Theta(self, R_vecs)

Compute the angular term for all triples of atoms.

\[\Theta_{ijk} = \frac{R_{ij} \cdot R_{ik}}{|R_{ij}| |R_{ik}|}\]

Right now this is a fairly naive implementation so this could be optimized quite a bit.

Parameters:
R_vecs : array, shape=(N_atoms, 3)

An array of the Cartesian coordinates of all the atoms.

Returns:
Theta : array, shape=(N_atoms, N_atoms, N_atoms)

The angular term for all the atoms given.
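A naive pure-Python sketch of the angular term defined above is shown here for illustration. The name calculate_theta is hypothetical, and leaving entries at 0.0 when j or k equals i (where a displacement vector has zero length) is a choice made for this sketch, not documented library behavior.

```python
import math


def calculate_theta(coords):
    """Return Theta[i][j][k] = (R_ij . R_ik) / (|R_ij| |R_ik|)."""
    n = len(coords)
    theta = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if i == j or i == k:
                    continue  # R_ij or R_ik would have zero length
                r_ij = [b - a for a, b in zip(coords[i], coords[j])]
                r_ik = [b - a for a, b in zip(coords[i], coords[k])]
                dot = sum(x * y for x, y in zip(r_ij, r_ik))
                norms = math.hypot(*r_ij) * math.hypot(*r_ik)
                theta[i][j][k] = dot / norms
    return theta


coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
theta = calculate_theta(coords)
```

The triple loop scales as O(N^3) in the number of atoms, which is the kind of cost the note about optimization refers to.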

f_c(self, R)
g_1(self, R, elements)

A radial symmetry function.

\[G^1_i = \sum_{j \neq i} \exp(- \eta (R_{ij} - R_s)^2) f_c(R_{ij})\]
Parameters:
R : array, shape=(N_atoms, N_atoms)

A distance matrix for all the atoms (scipy.spatial.cdist).

Returns:
total : array, shape=(N_atoms, N_elements)

The atom-wise g_1 evaluations.
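The radial formula above can be sketched in a few lines. Two assumptions are made here and should be checked against the source: f_c is taken to be the cosine cutoff commonly paired with these descriptors, and the sum runs over all neighbors at once instead of being split into one column per element as the documented return shape suggests. The standalone function signatures are hypothetical.

```python
import math


def f_c(r, r_cut=6.0):
    """Cosine cutoff (an assumed form): smooth decay to zero at r_cut."""
    if r > r_cut:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / r_cut) + 1.0)


def g_1(R, r_cut=6.0, r_s=1.0, eta=1.0):
    """Evaluate G^1_i for every atom from a distance matrix R."""
    n = len(R)
    return [
        sum(math.exp(-eta * (R[i][j] - r_s) ** 2) * f_c(R[i][j], r_cut)
            for j in range(n) if j != i)
        for i in range(n)
    ]


# Two atoms separated by exactly r_s, so the Gaussian factor equals 1.
R = [[0.0, 1.0], [1.0, 0.0]]
values = g_1(R)
```

In this two-atom case each value reduces to f_c(1.0), which shows how the cutoff alone scales the contribution of a neighbor sitting at the Gaussian center.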

g_2(self, Theta, R, elements)

An angular symmetry function.

\[G^2_i = 2^{1-\zeta} \sum_{j,k \neq i} (1 + \lambda \cos(\Theta_{ijk}))^\zeta \exp(-\eta (R_{ij}^2 + R_{ik}^2 + R_{jk}^2)) f_c(R_{ij}) f_c(R_{ik}) f_c(R_{jk})\]

This function needs to be optimized.

Parameters:
Theta : array, shape=(N_atoms, N_atoms, N_atoms)

An array of triplet angles.

R : array, shape=(N_atoms, N_atoms)

A distance matrix for all the atoms (scipy.spatial.cdist).

elements : list

A list of all the elements in the molecule.

Returns:
total : array, shape=(N_atoms, len(self._element_pairs))

The atom-wise g_2 evaluations.
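A hedged sketch of the angular formula follows. It assumes the cosine cutoff for f_c, takes the cosine of each triplet angle as its first argument (named cos_theta here), skips terms with j equal to k, and sums over all neighbors without splitting the result by element pair. All of these are assumptions of the sketch, and the standalone signatures are hypothetical.

```python
import math


def f_c(r, r_cut=6.0):
    """Cosine cutoff (an assumed form): smooth decay to zero at r_cut."""
    if r > r_cut:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / r_cut) + 1.0)


def g_2(cos_theta, R, r_cut=6.0, eta=1.0, lambda_=1.0, zeta=1.0):
    """Evaluate G^2_i for every atom.

    `cos_theta[i][j][k]` is the cosine of the angle at atom i between
    atoms j and k, and `R` is the distance matrix.
    """
    n = len(R)
    out = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            for k in range(n):
                if j == i or k == i or j == k:
                    continue
                angular = (1.0 + lambda_ * cos_theta[i][j][k]) ** zeta
                radial = math.exp(
                    -eta * (R[i][j] ** 2 + R[i][k] ** 2 + R[j][k] ** 2))
                cutoff = (f_c(R[i][j], r_cut) * f_c(R[i][k], r_cut)
                          * f_c(R[j][k], r_cut))
                total += angular * radial * cutoff
        out.append(2.0 ** (1.0 - zeta) * total)
    return out


# Right-angle arrangement: atom 0 at the corner, atoms 1 and 2 at unit distance.
s = math.sqrt(2.0)
R = [[0.0, 1.0, 1.0], [1.0, 0.0, s], [1.0, s, 0.0]]
cos_theta = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
cos_theta[1][0][2] = cos_theta[1][2][0] = 1.0 / s
cos_theta[2][0][1] = cos_theta[2][1][0] = 1.0 / s
values = g_2(cos_theta, R)
```

For atom 0 the cosine is zero, so the angular factor is 1 and the value is twice the radial and cutoff product, once for each ordering of the two neighbors.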

get_chain_labels(self, chains)