molml.base module

A collection of all the base transformer constructions.

This module is a collection of all the base classes and mixins for use with the other transformers.

class molml.base.BaseFeature(input_type='list', n_jobs=1)

Bases: object

A base class for all the features.

Parameters:
input_type : str, list of str, or callable, default=’list’

Specifies the format of the input values (must be one of ‘list’, ‘filename’, a list of strings, or a callable). If it is a list of strings, the strings specify the order of (and whether they are included) the different molecule attributes (coords, elements, numbers, connections). If a callable is given, then it is assumed to return a LazyValues object.

n_jobs : int, default=1

Specifies the number of processes to create when generating the features. Positive numbers specify a specific amount, and numbers less than 1 will use the number of cores the computer has.

check_fit(self)

Check if the transformer has been fit

Raises:
ValueError

The transformer has not been fit.

convert_input(self, X)

Convert the input (as specified in self.input_type) to a usable form.

Parameters:
X : list or string (depends on the instance value of input_type)

An object that stores the data for a single molecule. See the Notes for more details.

Returns:
values : Object

An object that allows the lazy evaluation of different properties

Raises:
ValueError

If the input_type given is not allowed.

Notes

If input_type is ‘list’, then it must be an iterable of (elements, coordinates) pairs for each molecule, where the elements are an iterable of the form (ele1, ele2, …, elen) and the coordinates are an iterable of the form [(x1, y1, z1), (x2, y2, z2), …, (xn, yn, zn)]. This also allows connections to be included. The connections are given as a dictionary where the keys are the indices of the atoms and the values are dictionaries with another atom index as the key and the bond order as the value (one of ‘1’, ‘Ar’, ‘2’, or ‘3’). Example for methane:

{
    0: {1: "1", 2: "1", 3: "1", 4: "1"},
    1: {0: "1"},
    2: {0: "1"},
    3: {0: "1"},
    4: {0: "1"},
}

If input_type is ‘filename’, then it must be an iterable of paths/filenames for each molecule. Currently, the supported formats are: xyz, mol2, and a simple xyz format (.out).

If input_type is a list, then the strings will be treated as labels for each of the arguments passed in via a tuple. For example, input_type=”list” can be reproduced with [“elements”, “coords”] or [“elements”, “coords”, “connections”].

If input_type is a callable, then it is assumed that the callable returns a LazyValues object.
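
For illustration, here is a minimal sketch of the default ‘list’ input format using a water molecule. CoulombMatrix is used purely as an example of a BaseFeature subclass; any transformer accepting this input_type is used the same way.

from molml.features import CoulombMatrix

# A single molecule as an (elements, coordinates) pair in the 'list' format
H2O = (
    ["O", "H", "H"],
    [[ 0.000, 0.000, 0.000],
     [ 0.757, 0.586, 0.000],
     [-0.757, 0.586, 0.000]],
)

feat = CoulombMatrix(input_type="list")
features = feat.fit_transform([H2O])  # array, shape=(1, n_features)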

fit(self, X, y=None)

Fit the model.

Parameters:
X : list, shape=(n_samples, )

A list of objects to use to fit.

Returns:
self : object

Returns the instance itself.

fit_transform(self, X, y=None)

A naive default implementation of fitting and transforming.

Parameters:
X : list, shape=(n_samples, )

A list of objects to use to fit and then transform

Returns:
array : array, shape=(n_samples, n_features)

The transformed features

classmethod get_citation(cls)

get_labels(self)

Get the labels for the features in the transformer

Returns:
values : tuple

All of the labels of the resulting features.

get_params(self)

Get a dictionary of all the feature parameters.

Returns:
params : dict

A dictionary of all the feature parameters.

map(self, f, seq)

Parallel implementation of map.

Parameters:
f : callable

A function to map to all the values in ‘seq’

seq : iterable

An iterable of values to process with ‘f’

Returns:
results : list, shape=[len(seq)]

The evaluated values
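
As a rough sketch of the idea (not the actual molml implementation), a parallel map can dispatch to a process pool when n_jobs is not 1 and fall back to the builtin map otherwise. Note that with multiprocessing the function f must be picklable.

from multiprocessing import Pool, cpu_count

def parallel_map(f, seq, n_jobs=1):
    # Serial fallback avoids the overhead of spawning processes.
    if n_jobs == 1:
        return list(map(f, seq))
    # Values less than 1 mean "use all available cores".
    procs = n_jobs if n_jobs > 0 else cpu_count()
    with Pool(processes=procs) as pool:
        return pool.map(f, seq)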

reduce(self, f, seq)

Parallel implementation of reduce.

This changes the problem from being O(n) steps to O(lg n)

Parameters:
f : callable

A function to use to reduce the values of ‘seq’

seq : iterable

An iterable of values to process

Returns:
results : object

A single reduced object based on ‘seq’ and ‘f’
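
A rough sketch of how a tree-style reduction reaches O(lg n) steps (illustration only, not the library's internal code): neighbouring elements are combined pairwise, so each pass halves the remaining sequence.

def tree_reduce(f, seq):
    values = list(seq)
    while len(values) > 1:
        # Combine neighbouring pairs; each pass halves the list.
        reduced = [f(a, b) for a, b in zip(values[::2], values[1::2])]
        if len(values) % 2:
            # Carry an unpaired trailing element to the next pass.
            reduced.append(values[-1])
        values = reduced
    return values[0]

tree_reduce(lambda a, b: a + b, range(10))  # 45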

save_json(self, f)

Save the model data in a json file

Parameters:
f : str or file descriptor

The path to save the data or a file descriptor to save it to.

set_params(self, **kwargs)

Set the feature parameter values.

Parameters:
kwargs : kwargs

Key value pairs to set for the feature parameters. Keys that are not valid parameters will be ignored.
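
For example (reusing the CoulombMatrix sketch from the Notes above), invalid keys are silently dropped while valid ones update the instance; the key not_a_real_parameter below is intentionally bogus.

feat = CoulombMatrix(input_type="list")
feat.set_params(n_jobs=2, not_a_real_parameter=123)
feat.get_params()  # n_jobs is now 2; the bogus key does not appear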

slugify(self)

Convert an instance to a simple string.

Returns:
string : str

The slug string

to_json(self)

Return model data as a json compatible dict

This will recursively convert other transformer objects as well.

Returns:
data : dict

The json data
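
A short usage sketch tying to_json and save_json together (the file name is only illustrative), again reusing the transformer from the earlier sketch.

data = feat.to_json()                  # nested, JSON-compatible dict
feat.save_json("coulomb_matrix.json")  # or write it straight to disk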

transform(self, X, y=None)

Framework for a potentially parallel transform.

Parameters:
X : list, shape=(n_samples, )

A list of objects to use to transform

Returns:
array : array, shape=(n_samples, n_features)

The transformed features
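
For illustration, the fit/transform split lets a transformer be fit on one set of molecules and then applied to new ones with the same feature layout. Here train_molecules and new_molecules are placeholder lists of (elements, coords) pairs in the ‘list’ format described in the Notes above.

feat = CoulombMatrix(input_type="list")
feat.fit(train_molecules)
X_train = feat.transform(train_molecules)  # array, shape=(n_train, n_features)
X_new = feat.transform(new_molecules)      # same columns as the training features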

class molml.base.EncodedFeature(input_type='list', n_jobs=1, segments=100, smoothing='norm', slope=20.0, start=0.2, end=6.0, spacing='linear')

Bases: molml.base.BaseFeature

This is a generalized class to handle all kinds of encoding feature representations. These approaches seem to be a fairly general way of making lists of scalar values more effective to use in machine learning models. Essentially, it can be viewed as kernel smoothed histograms over the values of interest; a minimal sketch of this idea is given after the references below.

Parameters:
input_type : string, default=’list’

Specifies the format of the input values (must be one of ‘list’ or ‘filename’).

n_jobs : int, default=1

Specifies the number of processes to create when generating the features. Positive numbers specify a specific amount, and numbers less than 1 will use the number of cores the computer has.

segments : int, default=100

The number of bins/segments to use when generating the histogram. Empirically, it has been found that values beyond 50-100 have little benefit.

smoothing : string or callable, default=’norm’

A string or callable to use to smooth the histogram values. If a callable is given, it must take just a single argument that is a float (or vector of floats). For a list of supported default functions look at SMOOTHING_FUNCTIONS.

start : float, default=0.2

The starting point for the histogram sampling in angstroms.

end : float, default=6.0

The ending point for the histogram sampling in angstroms.

slope : float, default=20.0

A parameter to tune the smoothing values. This is applied as a multiplication before calling the smoothing function.

spacing : string or callable, default=’linear’

The histogram interval spacing type. Must be one of (“linear”, “inverse”, or “log”). Linear spacing is normal spacing. Inverse evaluates the distances as 1/r, with the start and end points given as 1/x. For log spacing, the distances are evaluated as numpy.log(r), with the start and end points given as numpy.log(x). If the value is callable, then it should take a float or vector of floats and return a similar mapping to the other methods.

References

Collins, C.; Gordon, G.; von Lilienfeld, O. A.; Yaron, D. Constant Size Molecular Descriptors For Use With Machine Learning. arXiv:1701.06649
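
A minimal sketch of the kernel smoothed histogram idea behind this class (illustration only, not the library's internal code): each scalar value is spread over a grid of segment centers by a smoothing function (a Gaussian here, matching the ‘norm’ default) after scaling by slope, and the contributions are summed.

import numpy
from scipy.stats import norm

def encode(values, segments=100, start=0.2, end=6.0, slope=20.0):
    centers = numpy.linspace(start, end, segments)
    # Smooth each value over all the segment centers, then sum the contributions.
    diffs = slope * (centers[None, :] - numpy.asarray(values)[:, None])
    return norm.pdf(diffs).sum(axis=0)  # array, shape=(segments, )

encode([1.09, 1.09, 1.78])  # e.g. a few C-H and H-H distances in angstroms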

encode_values(self, iterator, lengths, saved_lengths=0)

Encodes an iterable of values into a uniform length array. These values can then be indexed to allow binning them in different sections of the array. After the values are processed, the array can be flattened down to a desired number of axes.

Parameters:
iterator : iterable

The collection of values to encode. Each item in the iterable must contain (idx, value, scaling), where idx is a tuple of integer values indicating which encoding bucket the value goes in, value is the value to encode, and scaling is a factor that gets multiplied with the final encoded subvector before it is added to the total (this is mostly used to mask values and scale their influence with distance). If idx is None, then the value will be skipped.

lengths : tuple of ints

The number of encoding axes to create. In terms of EncodedBonds, this would be the number of element pairs.

saved_lengths : int

The number of axis components to retain. The order that they get saved is the same order that is given in lengths. For example, when doing atom encodings, this should be 1 to retain the atom axis.

Returns:
vector : array

The final concatenated vector of all the subvectors. This will have a shape of lengths[:saved_lengths] + (product(lengths[saved_lengths:]) * segments, ). For example, with lengths=(3, 3), saved_lengths=1, and segments=100, the resulting shape is (3, 300).

get_encoded_labels(self, groups)

Parameters:
groups : list

A list of all the groups.

Returns:
labels : list

A list of all the feature labels.

get_group_order(self, groups)

Parameters:
groups : list

A list of all the groups.

Returns:
value_order : list

A list of all groups in order.

class molml.base.FormMixin(form=1, add_unknown=False, use_comb_idxs=False, *args, **kwargs)

Bases: object

A simple mixin for handling form transformations

This mixin handles how index mapping is done when going from higher dimensional attributes to lower dimensional ones. By default, this mixin uses the first value in ATTRIBUTES as the basis for the index mapping.

Parameters:
use_comb_idxs : bool, default=False

Whether or not to use all combinations of indices when doing the subselection. If this is False, a middle-out scheme will be used.

get_group_order(self, groups)

Parameters:
groups : list

A list of all the groups. This is ignored.

Returns:
value_order : list

A list of all groups in order.

get_idx_map(self)

Lazily load the idx_map.

Returns:
idx_map : IndexMap

The IndexMap object for this form and add_unknown setting.

transform(self, X, y=None)

Framework for a potentially parallel transform.

Parameters:
X : list, shape=(n_samples, )

A list of objects to use to transform

Returns:
array : array, shape=(n_samples, n_features)

The transformed features

class molml.base.InputTypeMixin

Bases: object

A simple mixin to check input_types if there are multiples.

This mixin adds a method to check if a transformer parameter does not have the same input_type as the parent object.

check_transformer(self, transformer)

Check a transformer.

Parameters:
transformer : BaseFeature

A transformer object.

Raises:
ValueError

If the input_type pairing given is not allowed.

class molml.base.SetMergeMixin

Bases: object

A simple mixin that will merge sets.

This mixin replaces all the duplicate code that just merges sets when doing the parallel fits. For this to work, it requires that the subclasses define ATTRIBUTES.

fit(self, X, y=None)

Fit the model.

Parameters:
X : list, shape=(n_samples, )

A list of objects to use to fit.

Returns:
self : object

Returns the instance itself.