molml.molecule module

A module to compute molecule based representations.

This module contains a variety of methods to extract features from molecules based on the entire molecule. All of the methods included here will produce one vector per molecule input.

class molml.molecule.Connectivity(input_type='list', n_jobs=1, depth=1, use_bond_order=False, use_coordination=False, add_unknown=False, do_tfidf=False)

Bases: molml.base.SetMergeMixin, molml.base.BaseFeature

A collection of feature types based on the connectivity of atoms.

Parameters:
input_type : string, default=’list’

Specifies the format the input values will be (must be one of ‘list’ or ‘filename’).

n_jobs : int, default=1

Specifies the number of processes to create when generating the features. Positive numbers specify a specific number of processes, and numbers less than 1 will use the number of cores the computer has.

depth : int, default=1

The length (number of atoms) of the atom chains to generate for connections.

use_bond_order : boolean, default=False

Specifies whether or not to use bond order information (C-C versus C=C). Note: for depth=1, this option does nothing.

use_coordination : boolean, default=False

Specifies whether or not to use the coordination number of the atoms (C1 vs C2 vs C3 vs C4).

add_unknown : boolean, default=False

Specifies whether or not to include an extra UNKNOWN count in the feature vector.

do_tfidf : boolean, default=False

Apply weighting to counts based on their inverse document (molecule) frequency.

References

Collins, C.; Gordon, G.; von Lilienfeld, O. A.; Yaron, D. Constant Size Molecular Descriptors For Use With Machine Learning. arXiv:1701.06649

Attributes:
_base_groups : tuple, tuples

All the chains that are in the fit molecules.
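
As a rough illustration of what a chain count at a given depth means, here is a minimal pure-Python sketch. It is not molml's implementation: the function name count_chains, the bond-list input, and the choice to treat a chain and its reverse as the same chain are assumptions made for this example.

```python
from collections import Counter


def count_chains(elements, bonds, depth=1):
    """Count element chains made of `depth` bonded atoms (illustrative only)."""
    # Build an adjacency list from the undirected bond list.
    neighbors = {i: set() for i in range(len(elements))}
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Grow simple paths (no repeated atoms) one atom at a time.
    paths = [(i,) for i in range(len(elements))]
    for _ in range(depth - 1):
        paths = [p + (n,) for p in paths for n in neighbors[p[-1]] if n not in p]

    counts = Counter()
    for path in paths:
        label = tuple(elements[i] for i in path)
        # Assumption: a chain and its reverse are the same chain,
        # so both are stored under one canonical label.
        counts[min(label, label[::-1])] += 1

    # Every chain longer than one atom was walked once from each end.
    if depth > 1:
        counts = Counter({key: value // 2 for key, value in counts.items()})
    return counts


# Water: one oxygen bonded to two hydrogens.
water_elements = ["O", "H", "H"]
water_bonds = [(0, 1), (0, 2)]
depth1 = count_chains(water_elements, water_bonds, depth=1)
depth2 = count_chains(water_elements, water_bonds, depth=2)
depth3 = count_chains(water_elements, water_bonds, depth=3)
```

With depth=1 this reduces to an element count, which is consistent with the note that use_bond_order has no effect at that depth: a single atom has no bond to describe.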

ATTRIBUTES = ('_base_groups', '_idf_values')
LABELS = (('get_chain_labels', '_base_groups'),)
fit(self, X, y=None)

Fit the model.

Parameters:
X : list, shape=(n_samples, )

A list of objects to use to fit.

Returns:
self : object

Returns the instance itself.

get_chain_labels(self, chains)
class molml.molecule.ConnectivityTree(input_type='list', n_jobs=1, depth=1, use_bond_order=False, use_coordination=False, preserve_paths=False, use_parent_element=True, add_unknown=False, do_tfidf=False)

Bases: molml.molecule.Connectivity

A collection of feature types based on a connectivity tree of atoms.

Parameters:
input_type : string, default=’list’

Specifies the format the input values will be (must be one of ‘list’ or ‘filename’).

n_jobs : int, default=1

Specifies the number of processes to create when generating the features. Positive numbers specify a specific number of processes, and numbers less than 1 will use the number of cores the computer has.

depth : int, default=1

The depth of the atom trees to generate for connections.

use_bond_order : boolean, default=False

Specifies whether or not to use bond order information (C-C versus C=C). Note: for depth=1, this option does nothing.

use_coordination : boolean, default=False

Specifies whether or not to use the coordination number of the atoms (C1 vs C2 vs C3 vs C4).

preserve_paths : boolean, default=False

Include the local index to the parent node in each tuple. This helps to differentiate elements at the same depth, but with different parents. Note: for depth<3, this option does nothing.

use_parent_element : boolean, default=True

Include the parent node's element type. This helps to differentiate elements with different parent elements, but not to the same extreme as preserve_paths. Note: this does nothing if use_bond_order is set, as the two are redundant.

add_unknown : boolean, default=False

Specifies whether or not to include an extra UNKNOWN count in the feature vector.

do_tfidf : boolean, default=False

Apply weighting to counts based on their inverse document (molecule) frequency.

Attributes:
_base_groups : tuple, tuples

All the trees that are in the fit molecules.

ATTRIBUTES = ('_base_groups', '_idf_values')
LABELS = (('get_tree_labels', '_base_groups'),)
get_tree_labels(self, trees)
class molml.molecule.Autocorrelation(input_type='list', n_jobs=1, depths=(0, 1, 2, 3), properties=None)

Bases: molml.base.BaseFeature

A molecular descriptor based on autocorrelation functions of atomic properties.

This is a compact (only depends on the number of properties used and the number of depths) molecule representation that uses the graph distance between atoms to extract information.

\[V_d = \sum_i \sum_j P_i P_j \delta(d_{ij}, d)\]
Parameters:
input_type : string, default=’list’

Specifies the format the input values will be (must be one of ‘list’ or ‘filename’).

n_jobs : int, default=1

Specifies the number of processes to create when generating the features. Positive numbers specify a specific number of processes, and numbers less than 1 will use the number of cores the computer has.

depths : list/tuple, default=(0, 1, 2, 3)

A list of depths to use for computing the autocorrelation functions.

properties : list/tuple, default=None

A list/tuple of properties to use. Each property should be defined for a single atom in the molecule. Each property can be either a function (that takes a LazyValues object and returns a vector with one element per atom) or one of the following strings: ‘Z’, ‘EN’, ‘CN’, ‘I’, ‘R’. These keys correspond, respectively, to the atomic number, the electronegativity, the coordination number, the identity function (always returns 1), and the covalent radius. If this value is None, then all the predefined properties will be used.

References

Janet, J. P. and Kulik, H. J. Resolving Transition Metal Chemical Space: Feature Selection for Machine Learning and Structure-Property Relationships. J. Phys. Chem. A 2017, 121, 8939-8954
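
The autocorrelation sum V_d can be worked through by hand on a small molecule. The following is a minimal sketch under stated assumptions, not the library's code: the helper names graph_distances and autocorrelation are hypothetical, the graph distance d_ij is taken to be the number of bonds on the shortest path, and the double sum runs over ordered pairs (i, j) exactly as the formula is written.

```python
from collections import deque


def graph_distances(n_atoms, bonds):
    """All-pairs bond-count distances via breadth-first search."""
    neighbors = [[] for _ in range(n_atoms)]
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # None marks atom pairs that are not connected by any path.
    dist = [[None] * n_atoms for _ in range(n_atoms)]
    for start in range(n_atoms):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for nxt in neighbors[current]:
                if dist[start][nxt] is None:
                    dist[start][nxt] = dist[start][current] + 1
                    queue.append(nxt)
    return dist


def autocorrelation(prop, dist, depths=(0, 1, 2, 3)):
    """V_d = sum_i sum_j P_i P_j delta(d_ij, d), one value per depth."""
    n = len(prop)
    return [
        sum(prop[i] * prop[j] for i in range(n) for j in range(n) if dist[i][j] == d)
        for d in depths
    ]


# Water with the atomic number as the property ('Z'): O=8, H=1, H=1.
dist = graph_distances(3, [(0, 1), (0, 2)])
values = autocorrelation([8, 1, 1], dist)
```

For this example, depth 0 collects the squared properties (64 + 1 + 1), depth 1 collects the four ordered O-H pairs, and depth 2 collects the two ordered H-H pairs. The result has one entry per depth for each property, which is why the representation size depends only on the number of properties and depths.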

ATTRIBUTES = None
FUNCTIONS = {'CN': <function <lambda>>, 'EN': <function <lambda>>, 'I': <function <lambda>>, 'R': <function <lambda>>, 'Z': <function <lambda>>}
LABELS = ('_labels',)
fit(self, X, y=None)

No fitting is required because it is defined by the parameters.

class molml.molecule.EncodedAngle(input_type='list', n_jobs=1, segments=40, smoothing='norm', slope=20.0, min_depth=0, max_depth=0, form=3, r_cut=6.0, add_unknown=False)

Bases: molml.base.SetMergeMixin, molml.base.EncodedFeature

A smoothed histogram of atomic angles.

This method is similar to EncodedBond but for angles in molecules. This is done by enumerating all triplets of atoms and computing the angle between them. The bins are then smoothed with smoothing functions. Note: The angles used are 0 to pi.

Parameters:
input_type : string, default=’list’

Specifies the format the input values will be (must be one of ‘list’ or ‘filename’).

n_jobs : int, default=1

Specifies the number of processes to create when generating the features. Positive numbers specify a specific number of processes, and numbers less than 1 will use the number of cores the computer has.

segments : int, default=40

The number of bins/segments to use when generating the histogram. Empirically, it has been found that there is no benefit to having more than 40-50 segments.

smoothing : string or callable, default=’norm’

A string or callable to use to smooth the histogram values. If a callable is given, it must take just a single argument that is a float (or vector of floats). For a list of supported default functions look at SMOOTHING_FUNCTIONS.

slope : float, default=20.

A parameter to tune the smoothing values. This is applied as a multiplication before calling the smoothing function.

min_depth : int, default=0

A parameter to set the minimum geodesic distance to include in the interactions. A value of np.inf signifies including only intermolecular interactions.

max_depth : int, default=0

A parameter to set the maximum geodesic distance to include in the interactions. A value of 0 signifies that all interactions are included.

form : int, default=3

The histogram splitting style to use. This changes the scaling of this method to be O(E^3), O(E^2), O(E), or O(1) for 3, 2, 1, or 0 respectively (where E is the number of elements).

r_cut : float, default=6.

The maximum distance allowed for atoms to be considered local to the “central atom”.

add_unknown : boolean, default=False

Specifies whether or not to include an extra UNKNOWN count in the feature vector.

Attributes:
_groups : tuple, tuples

A tuple of all the groups (element chains) in the fit molecules.
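
To make the triplet enumeration concrete, here is a small sketch of the geometric step only: computing the angle at a central atom for every triplet. It is an illustration under assumptions, not molml's implementation; the function names are hypothetical, and the smoothing, the r_cut cutoff, and the grouping by element are all left out.

```python
import math
from itertools import combinations


def angle(a, center, b):
    """Angle (radians, in [0, pi]) at `center` between points a and b."""
    v1 = [p - c for p, c in zip(a, center)]
    v2 = [p - c for p, c in zip(b, center)]
    dot = sum(x * y for x, y in zip(v1, v2))
    cos_theta = dot / (math.dist(a, center) * math.dist(b, center))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cos_theta)))


def all_angles(coords):
    """Enumerate every (i, center, k) triplet and return its angle."""
    result = []
    for center in range(len(coords)):
        others = [i for i in range(len(coords)) if i != center]
        for i, k in combinations(others, 2):
            result.append(((i, center, k), angle(coords[i], coords[center], coords[k])))
    return result


# Three atoms at the corners of a right triangle.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
angles = all_angles(coords)
```

Because acos returns values between 0 and pi, the angles fall in the same range the class description mentions.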

ATTRIBUTES = ('_groups',)
LABELS = (('get_encoded_labels', '_groups'),)
f_c(self, R)
class molml.molecule.EncodedBond(input_type='list', n_jobs=1, segments=100, smoothing='norm', start=0.2, end=6.0, slope=20.0, min_depth=0, max_depth=0, spacing='linear', form=2, add_unknown=False)

Bases: molml.base.SetMergeMixin, molml.base.EncodedFeature

A smoothed histogram of atomic distances.

This is a method to generalize the idea of bond counting. Instead of seeing bonds as a discrete count that is thresholded at a given length, they are seen as general distance histograms. This is supplemented with smoothing functions.

Parameters:
input_type : string, default=’list’

Specifies the format the input values will be (must be one of ‘list’ or ‘filename’).

n_jobs : int, default=1

Specifies the number of processes to create when generating the features. Positive numbers specify a specific number of processes, and numbers less than 1 will use the number of cores the computer has.

segments : int, default=100

The number of bins/segments to use when generating the histogram. Empirically, it has been found that values beyond 50-100 have little benefit.

smoothing : string or callable, default=’norm’

A string or callable to use to smooth the histogram values. If a callable is given, it must take just a single argument that is a float (or vector of floats). For a list of supported default functions look at SMOOTHING_FUNCTIONS.

start : float, default=0.2

The starting point for the histogram sampling in angstroms.

end : float, default=6.0

The ending point for the histogram sampling in angstroms.

slope : float, default=20.

A parameter to tune the smoothing values. This is applied as a multiplication before calling the smoothing function.

min_depth : int, default=0

A parameter to set the minimum geodesic distance to include in the interactions. A value of np.inf signifies including only intermolecular interactions.

max_depth : int, default=0

A parameter to set the maximum geodesic distance to include in the interactions. A value of 0 signifies that all interactions are included.

spacing : string or callable, default=’linear’

The histogram interval spacing type. Must be one of “linear”, “inverse”, or “log”, or a callable. Linear spacing is normal spacing. Inverse spacing evaluates the distances as 1/r, and the start and end points are 1/x. For log spacing, the distances are evaluated as numpy.log(r) and the start and end points are numpy.log(x). If the value is callable, then it should take a float or vector of floats and return a mapping similar to those of the other methods.

form : int, default=2

The histogram splitting style to use. This changes the scaling of this method to be O(E^2), O(E), or O(1) for 2, 1, or 0 respectively (where E is the number of elements).

add_unknown : boolean, default=False

Specifies whether or not to include an extra UNKNOWN count in the feature vector.

References

Collins, C.; Gordon, G.; von Lilienfeld, O. A.; Yaron, D. Constant Size Molecular Descriptors For Use With Machine Learning. arXiv:1701.06649

Attributes:
_element_pairs : tuple, tuples

A tuple of all the element pairs in the fit molecules.
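
The idea of a smoothed distance histogram can be sketched in a few lines. The sketch below is an assumption-laden illustration rather than the library's code: it handles a single group of distances, uses linear spacing, and assumes the 'norm' smoothing is a standard normal density evaluated at slope * (grid point - distance). The function names are hypothetical.

```python
import math


def norm_pdf(x):
    """Standard normal probability density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def encoded_distances(distances, segments=100, start=0.2, end=6.0, slope=20.0):
    """Smoothed histogram of `distances` on a linear grid (illustrative only)."""
    step = (end - start) / (segments - 1)
    grid = [start + k * step for k in range(segments)]
    # Each distance contributes a bump centered on itself; the slope
    # multiplies the offset before the smoothing function is applied,
    # so a larger slope gives a narrower bump.
    return [sum(norm_pdf(slope * (theta - r)) for r in distances) for theta in grid]


# One pair of atoms 1.0 angstrom apart, on a grid with 0.1 angstrom spacing.
hist = encoded_distances([1.0], segments=59, start=0.2, end=6.0)
peak_index = hist.index(max(hist))
```

In this example the peak lands on the grid point at 1.0 angstrom, and neighboring grid points receive smaller but nonzero values, which is what distinguishes this from a thresholded bond count.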

ATTRIBUTES = ('_element_pairs',)
LABELS = (('get_encoded_labels', '_element_pairs'),)
class molml.molecule.CoulombMatrix(input_type='list', n_jobs=1, sort=False, eigen=False, drop_values=False, only_lower_triangle=False)

Bases: molml.base.BaseFeature

A molecular descriptor based on Coulomb interactions.

This is a feature that uses a Coulomb-like interaction between all atoms in the molecule to generate a matrix that is then vectorized.

\[\begin{split}C_{ij} = \begin{cases} \frac{Z_i Z_j}{\| r_i - r_j \|} & i \neq j \\ 0.5 Z_i^{2.4} & i = j \end{cases}\end{split}\]
Parameters:
input_type : string, default=’list’

Specifies the format the input values will be (must be one of ‘list’ or ‘filename’).

n_jobs : int, default=1

Specifies the number of processes to create when generating the features. Positive numbers specify a specific number of processes, and numbers less than 1 will use the number of cores the computer has.

sort : bool, default=False

Specifies whether or not to sort the Coulomb matrix based on the sum of the rows (same as L1 norm).

eigen : bool, default=False

Specifies whether or not to use the eigenspectrum of the Coulomb matrix rather than the matrix itself. This changes the scaling to be linear in the number of atoms.

drop_values : bool, default=False

Specifies whether or not to drop the atoms from molecules larger than the training set. If this value is set to False, and the molecule is too large to transform, the transform will throw a ValueError. If it is set to True, then it will truncate the molecule to only include the first _max_size atoms of the molecule.

only_lower_triangle : bool, default=False

Specifies whether or not to only use the lower triangle of the Coulomb Matrix. This cuts the dimensionality in half by removing the duplicate values in the upper triangle of the matrix. This does nothing if eigen is set.

References

Rupp, M.; Tkatchenko, A.; Muller, K.-R.; von Lilienfeld, O. A. Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning. Phys. Rev. Lett. 2012, 108, 058301.

Hansen, K.; Montavon, G.; Biegler, F.; Fazli, S.; Rupp, M.; Scheffler, M.; von Lilienfeld, O. A.; Tkatchenko, A.; Muller, K.-R. Assessment and Validation of Machine Learning Methods for Predicting Molecular Atomization Energies. J. Chem. Theory Comput. 2013, 9, 3404-3419.

Attributes:
_max_size : int

The size of the largest molecule in the fit molecules by number of atoms.
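
The matrix definition above translates directly into code. This is a minimal sketch of the formula only, not the class itself: the function name coulomb_matrix is hypothetical, atomic numbers are passed in explicitly, and sorting, the eigenspectrum, triangle extraction, and vectorization are omitted.

```python
import math


def coulomb_matrix(numbers, coords):
    """Build C_ij from atomic numbers and Cartesian coordinates."""
    n = len(numbers)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal term: 0.5 * Z_i ** 2.4
                matrix[i][j] = 0.5 * numbers[i] ** 2.4
            else:
                # Off-diagonal term: Z_i * Z_j / |r_i - r_j|
                matrix[i][j] = numbers[i] * numbers[j] / math.dist(coords[i], coords[j])
    return matrix


# Two hydrogen atoms one distance unit apart.
h2 = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
```

The matrix is symmetric, which is why keeping only the lower triangle removes no information.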

ATTRIBUTES = ('_max_size',)
LABELS = (('get_coulomb_labels', '_max_size'),)
fit(self, X, y=None)

Fit the model.

Parameters:
X : list, shape=(n_samples, )

A list of objects to use to fit.

Returns:
self : object

Returns the instance itself.

get_coulomb_labels(self, max_size)
class molml.molecule.BagOfBonds(input_type='list', n_jobs=1, drop_values=False, add_atoms=False)

Bases: molml.base.BaseFeature

A molecular descriptor that groups interactions from the Coulomb Matrix.

This feature starts the same as the Coulomb Matrix, and then interaction terms of the same element pair are grouped together and then sorted before they are vectorized.

Parameters:
input_type : string, default=’list’

Specifies the format the input values will be (must be one of ‘list’ or ‘filename’).

n_jobs : int, default=1

Specifies the number of processes to create when generating the features. Positive numbers specify a specific number of processes, and numbers less than 1 will use the number of cores the computer has.

drop_values : bool, default=False

Specifies whether or not to drop interactions if there are more than were seen in the training set. If this value is set to False, and the molecule is too large to transform, it will throw a ValueError. If it is set to True, then it will truncate that particular bag to only include the largest _bag_sizes[ele1, ele2] interactions of the molecule.

add_atoms : bool, default=False

Adds the diagonal of the Coulomb Matrix to the bags.

References

Hansen, K.; Biegler, F.; Ramakrishnan, R.; Pronobis, W.; von Lilienfeld, O. A.; Muller, K.-R.; Tkatchenko, A. Machine Learning Predictions of Molecular Properties: Accurate Many-body Potentials and Nonlocality in Chemical Space. J. Phys. Chem. Lett. 2015, 6, 2326-2331.

Attributes:
_bag_sizes : dict, element pair->int

A dictionary mapping element pairs to the maximum size of that element pair block in all the fit molecules.
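
The grouping step can be illustrated with a short sketch. It is not molml's implementation: the function names, the descending sort order within each bag, the zero padding up to the fitted bag size, and the alphabetical ordering of bags in the final vector are assumptions made for this example.

```python
import math
from collections import defaultdict


def bag_of_bonds(elements, numbers, coords):
    """Group off-diagonal Coulomb terms by element pair (illustrative only)."""
    bags = defaultdict(list)
    n = len(elements)
    for i in range(n):
        for j in range(i + 1, n):
            # Element pairs are unordered, so sort the key.
            key = tuple(sorted((elements[i], elements[j])))
            value = numbers[i] * numbers[j] / math.dist(coords[i], coords[j])
            bags[key].append(value)
    # Assumption: each bag is sorted from largest to smallest.
    return {key: sorted(values, reverse=True) for key, values in bags.items()}


def vectorize(bags, bag_sizes):
    """Concatenate bags in key order, zero-padding each to its fitted size."""
    vector = []
    for key in sorted(bag_sizes):
        values = bags.get(key, [])
        if len(values) > bag_sizes[key]:
            # Mirrors the drop_values=False behavior described above.
            raise ValueError("bag %r is larger than any seen during fit" % (key,))
        vector.extend(values + [0.0] * (bag_sizes[key] - len(values)))
    return vector


# HCN laid out on the x axis, one distance unit between neighbors.
hcn_bags = bag_of_bonds(
    ["H", "C", "N"],
    [1, 6, 7],
    [(-1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
)
# Hypothetical fitted sizes: another training molecule had two C-H pairs.
sizes = {("C", "H"): 2, ("C", "N"): 1, ("H", "N"): 1}
hcn_vector = vectorize(hcn_bags, sizes)
```

Padding every bag to a fixed size is what lets molecules with different numbers of atoms map to vectors of the same length.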

ATTRIBUTES = ('_bag_sizes',)
LABELS = (('get_bob_labels', '_bag_sizes'),)
fit(self, X, y=None)

Fit the model.

Parameters:
X : list, shape=(n_samples, )

A list of objects to use to fit.

Returns:
self : object

Returns the instance itself.

get_bob_labels(self, bag_sizes)