molml.crystal module

A module to compute molecule-based representations.

This module contains a variety of methods to extract features based on the entire molecule. Every method included here produces one vector per input molecule.

class molml.crystal.GenerallizedCrystal(input_type=None, n_jobs=1, transformer=None, radius=None, units=None)

Bases: molml.base.InputTypeMixin, molml.base.BaseFeature

A wrapper around other features to facilitate faking crystals.

This is done by a brute force expansion of atoms in the molecules based on a given unit cell. This is highly inefficient, but it does set a baseline.

Parameters:
input_type : string, default='list'

Specifies the format of the input values (must be one of ‘list’ or ‘filename’).

n_jobs : int, default=1

Specifies the number of processes to create when generating the features. Positive numbers specify a specific amount, and numbers less than 1 will use the number of cores the computer has.

transformer : BaseFeature, default=None

The transformer that will be used once the atoms have been expanded into the crystal.

radius : float, default=None

The cutoff radius, in angstroms, for including unit cells.

units : list or int, default=None

The number of unit cells to include for each axis (if this is an int, then it is the same for all).
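
The brute-force expansion described for this class can be pictured with a short, plain-Python sketch. It is only an illustration and not molml's implementation: the function name `expand_unit_cell`, the representation of the unit cell as three lattice vectors, and the convention that `units=n` means offsets from -n to n along each axis are all assumptions. The `radius` cutoff is not modelled here.

```python
from itertools import product


def expand_unit_cell(elements, coords, unit_cell, units):
    """Replicate atoms over a block of neighbouring unit cells.

    Assumption: `units=n` means cell offsets -n..n along each axis; the
    library may use a different convention.
    """
    if isinstance(units, int):
        # A single int applies to all three axes.
        units = [units] * 3
    ranges = [range(-n, n + 1) for n in units]
    new_elements, new_coords = [], []
    for offsets in product(*ranges):
        # Translation vector: sum over k of offsets[k] * unit_cell[k].
        shift = [
            sum(o * vec[axis] for o, vec in zip(offsets, unit_cell))
            for axis in range(3)
        ]
        for ele, xyz in zip(elements, coords):
            new_elements.append(ele)
            new_coords.append(tuple(x + s for x, s in zip(xyz, shift)))
    return new_elements, new_coords


# Illustrative two-atom cell with a cubic lattice (values are made up).
elements = ["Na", "Cl"]
coords = [(0.0, 0.0, 0.0), (2.8, 0.0, 0.0)]
unit_cell = [(5.6, 0.0, 0.0), (0.0, 5.6, 0.0), (0.0, 0.0, 5.6)]
big_elements, big_coords = expand_unit_cell(elements, coords, unit_cell, units=1)
```

With `units=1` under this convention, the two atoms are copied into 3 x 3 x 3 = 27 cells, which shows why the approach scales poorly: the atom count grows with the cube of the number of cells per axis.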

References

Faber, F.; Lindmaa, A.; von Lilienfeld, O. A.; Armiento, R. Crystal Structure Representations for Machine Learning Models of Formation Energies. arXiv:1503.07406

ATTRIBUTES = None
LABELS = None
convert_input(self, X)

Convert the input (as specified in self.input_type) to a usable form.

Parameters:
X : list or string (depends on the instance value of input_type)

An object that stores the data for a single molecule. See the Notes for more details.

Returns:
values : Object

An object that allows the lazy evaluation of different properties

Raises:
ValueError

If the input_type given is not allowed.

Notes

If input_type is ‘list’, then it must be an iterable of (elements, coordinates) pairs, one for each molecule, where the elements are an iterable of the form (ele1, ele2, …, elen) and the coordinates are an iterable of the form [(x1, y1, z1), (x2, y2, z2), …, (xn, yn, zn)]. This also allows connections to be included. The connections are a dictionary where the keys are the indices of the atoms and the values are dictionaries, with the key being another atom index and the value being the bond order (one of ‘1’, ‘Ar’, ‘2’, or ‘3’). Example for methane:

{
    0: {1: "1", 2: "1", 3: "1", 4: "1"},
    1: {0: "1"},
    2: {0: "1"},
    3: {0: "1"},
    4: {0: "1"},
}

If input_type is ‘filename’, then it must be an iterable of paths/filenames for each molecule. Currently, the supported formats are: xyz, mol2, and a simple xyz format (.out).

If input_type is a list, then its items are treated as labels for each of the arguments passed in via a tuple. For example, input_type=‘list’ can be reproduced with [‘elements’, ‘coords’] or [‘elements’, ‘coords’, ‘connections’].

If input_type is a callable, then it is assumed that the callable returns a LazyValues object.
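
As a concrete illustration of the ‘list’ format described in these Notes, the sketch below builds a single methane entry with connections. The coordinates are approximate tetrahedral values chosen for illustration, and `check_molecule` is a hypothetical helper, not part of molml; its symmetry check is an assumption drawn from the methane example, not a documented requirement.

```python
ALLOWED_BOND_ORDERS = {"1", "Ar", "2", "3"}

# One molecule in the 'list' format: (elements, coordinates[, connections]).
elements = ["C", "H", "H", "H", "H"]
coords = [
    (0.000, 0.000, 0.000),
    (0.629, 0.629, 0.629),
    (-0.629, -0.629, 0.629),
    (-0.629, 0.629, -0.629),
    (0.629, -0.629, -0.629),
]
connections = {
    0: {1: "1", 2: "1", 3: "1", 4: "1"},
    1: {0: "1"},
    2: {0: "1"},
    3: {0: "1"},
    4: {0: "1"},
}
methane = (elements, coords, connections)
X = [methane]  # the feature classes take an iterable of molecules


def check_molecule(molecule):
    """Hypothetical sanity check for one 'list'-format molecule."""
    elements, coords = molecule[0], molecule[1]
    if len(elements) != len(coords):
        raise ValueError("elements and coordinates differ in length")
    if len(molecule) > 2:
        connections = molecule[2]
        for i, neighbours in connections.items():
            for j, order in neighbours.items():
                if order not in ALLOWED_BOND_ORDERS:
                    raise ValueError("unknown bond order: %r" % (order,))
                # Assumed: every bond is listed from both of its atoms.
                if connections.get(j, {}).get(i) != order:
                    raise ValueError("bond %d-%d is not symmetric" % (i, j))
    return True
```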

fit(self, X, y=None)

Fit the model.

Parameters:
X : list, shape=(n_samples, )

A list of objects to use to fit.

Returns:
self : object

Returns the instance itself.

fit_transform(self, X, y=None)

A naive default implementation of fitting and transforming.

Parameters:
X : list, shape=(n_samples, )

A list of objects to use to fit and then transform

Returns:
array : array, shape=(n_samples, n_features)

The transformed features

transform(self, X, y=None)

Framework for a potentially parallel transform.

Parameters:
X : list, shape=(n_samples, )

A list of objects to use to transform

Returns:
array : array, shape=(n_samples, n_features)

The transformed features

class molml.crystal.EwaldSumMatrix(input_type='list', n_jobs=1, L_max=10, G_max=10, sort=False, eigen=False)

Bases: molml.molecule.CoulombMatrix

This construction follows the form of the Ewald sum: the interaction is broken into three parts, which are then added together.

The interaction between two atoms is defined as follows

\[x_{ij} = x_{ij}^{(r)} + x_{ij}^{(m)} + x_{ij}^0.\]

The components are defined as follows

\[\begin{aligned}
x_{ij}^{(r)} &= Z_i Z_j \sum_L \frac{\text{erfc}(\alpha \| r_i - r_j + L \|_2)}{\| r_i - r_j + L \|_2} \\
x_{ij}^{(m)} &= \frac{Z_i Z_j}{\pi V} \sum_G \frac{e^{-\|G\|_2^2 / (2 \alpha)^2}}{\|G\|_2^2} \cos(G \cdot (r_i - r_j)) \\
x_{ij}^0 &= -(Z_i^2 + Z_j^2) \frac{\alpha}{\sqrt{\pi}} - (Z_i + Z_j)^2 \frac{\pi}{2 V \alpha^2} \\
x_{ii} &= -Z_i^2 \frac{\alpha}{\sqrt{\pi}} - Z_i^2 \frac{\pi}{2 V \alpha^2} \\
\alpha &= \sqrt{\pi} \left(\frac{0.01 M}{V}\right)^{1/6}
\end{aligned}\]

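
A minimal sketch of how some of these terms could be evaluated, using only the Python standard library. It is not molml's implementation. The text does not define \(M\) and \(V\); the sketch assumes, following the cited paper, that \(M\) is the number of atoms in the unit cell and \(V\) is the unit cell volume. All function names are hypothetical, and the real-space sum is truncated to whatever lattice translations the caller supplies.

```python
import math


def ewald_alpha(n_atoms, volume):
    """alpha = sqrt(pi) * (0.01 * M / V) ** (1/6); M and V are assumed meanings."""
    return math.sqrt(math.pi) * (0.01 * n_atoms / volume) ** (1.0 / 6.0)


def constant_term(z_i, z_j, alpha, volume):
    """x_ij^0 for i != j."""
    first = -(z_i ** 2 + z_j ** 2) * alpha / math.sqrt(math.pi)
    second = -((z_i + z_j) ** 2) * math.pi / (2.0 * volume * alpha ** 2)
    return first + second


def diagonal_term(z_i, alpha, volume):
    """x_ii as written in the equations."""
    first = -(z_i ** 2) * alpha / math.sqrt(math.pi)
    second = -(z_i ** 2) * math.pi / (2.0 * volume * alpha ** 2)
    return first + second


def real_space_term(z_i, z_j, r_i, r_j, alpha, lattice_shifts):
    """x_ij^(r) for i != j, summed over the given lattice translations L."""
    total = 0.0
    for shift in lattice_shifts:
        dist = math.sqrt(
            sum((a - b + s) ** 2 for a, b, s in zip(r_i, r_j, shift))
        )
        total += math.erfc(alpha * dist) / dist
    return z_i * z_j * total


# Illustrative values: two atoms in a cell of volume 8.0 cubic angstroms.
alpha = ewald_alpha(n_atoms=2, volume=8.0)
x0 = constant_term(11, 17, alpha, volume=8.0)
xr = real_space_term(11, 17, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), alpha,
                     lattice_shifts=[(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)])
```
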
Parameters:
input_type : string, default='list'

Specifies the format of the input values (must be one of ‘list’ or ‘filename’).

n_jobs : int, default=1

Specifies the number of processes to create when generating the features. Positive numbers specify a specific amount, and numbers less than 1 will use the number of cores the computer has.

sort : bool, default=False

Specifies whether or not to sort the Coulomb matrix based on the sum of the rows (same as L1 norm).

eigen : bool, default=False

Specifies whether or not to use the eigenvalue spectrum of the Coulomb matrix rather than the matrix itself. This makes the size of the feature vector linear in the number of atoms.
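
The `sort` option can be pictured with the sketch below, which reorders the rows and columns of a symmetric matrix together according to the L1 norm of each row. This is an illustration only: the helper name `sort_by_row_norm` is hypothetical, and the descending order is an assumption, since the description does not state a direction. For matrices with non-negative entries the L1 norm equals the plain row sum.

```python
def sort_by_row_norm(matrix):
    """Permute rows and columns together by descending row L1 norm.

    Assumption: largest norm first; the library may order differently.
    """
    order = sorted(
        range(len(matrix)),
        key=lambda i: sum(abs(v) for v in matrix[i]),
        reverse=True,
    )
    # Applying the same permutation to both axes keeps the matrix symmetric.
    return [[matrix[i][j] for j in order] for i in order]


example = [
    [1.0, 2.0, 0.0],
    [2.0, 10.0, 3.0],
    [0.0, 3.0, 5.0],
]
sorted_example = sort_by_row_norm(example)
```

Sorting this way makes the representation independent of the order in which the atoms were listed, at the cost of possible discontinuities when two rows have nearly equal norms.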

References

Faber, F.; Lindmaa, A.; von Lilienfeld, O. A.; Armiento, R. Crystal Structure Representations for Machine Learning Models of Formation Energies. arXiv:1503.07406

Attributes:
_max_size : int

The size, in number of atoms, of the largest molecule seen during fitting.

ATTRIBUTES = ('_max_size',)
LABELS = None
class molml.crystal.SineMatrix(input_type='list', n_jobs=1, sort=False, eigen=False)

Bases: molml.molecule.CoulombMatrix

A molecular descriptor based on Coulomb interactions.

This is a feature that uses a Coulomb-like interaction between all atoms in the molecule to generate a matrix that is then vectorized.

\[\begin{split}C_{ij} = \begin{cases} Z_i Z_j \Phi(r_i, r_j) & i \neq j \\ 0.5 Z_i^{2.4} & i = j \end{cases}\end{split}\]

where \(\Phi(r_i, r_j)\) is

\[\left\| B \cdot \sum_{k \in \{x,y,z\}} \hat e_k \sin^2 \left[ \pi \hat e_k B^{-1} \cdot (r_i - r_j) \right] \right\|_2^{-1}\]

and \(B\) is a matrix of the lattice basis vectors.
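
The definition of \(\Phi\) can be made concrete with a small standard-library sketch. It is not molml's implementation: the function names are hypothetical, and it assumes the columns of `basis` are the lattice vectors, so that \(B^{-1}(r_i - r_j)\) gives fractional coordinates. Two atoms separated by an exact lattice translation would give a zero norm, which this sketch does not guard against.

```python
import math


def invert_3x3(m):
    """Inverse of a 3x3 matrix via the adjugate (cofactor) formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if det == 0:
        raise ValueError("lattice vectors are linearly dependent")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]


def mat_vec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def phi(r_i, r_j, basis):
    """Inverse norm of B . sin^2(pi * B^-1 . (r_i - r_j)), taken per axis."""
    diff = [a - b for a, b in zip(r_i, r_j)]
    frac = mat_vec(invert_3x3(basis), diff)
    sin_sq = [math.sin(math.pi * f) ** 2 for f in frac]
    vec = mat_vec(basis, sin_sq)
    return 1.0 / math.sqrt(sum(v * v for v in vec))


def sine_matrix(numbers, coords, basis):
    """Build C_ij from atomic numbers Z, positions r and lattice basis B."""
    n = len(numbers)
    mat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                mat[i][j] = 0.5 * numbers[i] ** 2.4
            else:
                mat[i][j] = numbers[i] * numbers[j] * phi(coords[i], coords[j], basis)
    return mat


# Illustrative cubic cell with edge length 2 and two atoms.
basis = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
C = sine_matrix([11, 17], [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], basis)
```

Because \(\sin^2\) has period \(\pi\), shifting either atom by a whole lattice vector leaves \(\Phi\) unchanged, which is what makes the descriptor periodic without an explicit sum over cells.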

Parameters:
input_type : string, default='list'

Specifies the format of the input values (must be one of ‘list’ or ‘filename’).

n_jobs : int, default=1

Specifies the number of processes to create when generating the features. Positive numbers specify a specific amount, and numbers less than 1 will use the number of cores the computer has.

sort : bool, default=False

Specifies whether or not to sort the Coulomb matrix based on the sum of the rows (same as L1 norm).

eigen : bool, default=False

Specifies whether or not to use the eigenvalue spectrum of the Coulomb matrix rather than the matrix itself. This makes the size of the feature vector linear in the number of atoms.

References

Faber, F.; Lindmaa, A.; von Lilienfeld, O. A.; Armiento, R. Crystal Structure Representations for Machine Learning Models of Formation Energies. arXiv:1503.07406

Attributes:
_max_size : int

The size, in number of atoms, of the largest molecule seen during fitting.

ATTRIBUTES = ('_max_size',)
LABELS = None