molml.fragment module

A module to compute fragment-based representations.

This module contains a variety of methods to extract features from molecules based on defined fragments in the molecule. This means that every molecule will result in an array of values with shape (n_fragments, n_features). Note: If atom-wise features are used, the shape extends to (n_fragments, n_atoms, n_features).

class molml.fragment.FragmentMap(input_type='filename', n_jobs=1, transformer=None, filename_to_label='basename', label_to_filename=('.', ))

Bases: molml.base.BaseFeature

Extract information based on features from fragments.

Conceptually, each molecule of interest is described by n fragments, and each of these n fragments has its own feature vector. The fragment vectors are then used together as a single representation. The output fragment vectors are in the same order in which the fragments are given.

For example,

FragmentMap().fit_transform([['A', 'B'], ['C', 'A'], ['B', 'C']])

would produce arrays like

[[f_A, f_B], [f_C, f_A], [f_B, f_C]]

for a final shape of (3, 2, n_features).
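
The label-to-vector lookup described above can be sketched in plain Python. This is a minimal illustration, not molml's implementation: `fragment_vectors`, `map_fragments`, and `nested_shape` are hypothetical names, and the 4-element vectors are made-up placeholders for whatever the transformer would produce.

```python
# Placeholder feature vectors (assumption: n_features = 4). In FragmentMap
# these would come from the transformer applied to each fragment.
fragment_vectors = {
    "A": [1.0, 0.0, 0.0, 0.0],
    "B": [0.0, 1.0, 0.0, 0.0],
    "C": [0.0, 0.0, 1.0, 0.0],
}


def map_fragments(samples, vectors):
    """Replace every label with its vector, keeping the given order."""
    return [[vectors[label] for label in sample] for sample in samples]


def nested_shape(nested):
    """Return the shape of a regular nested list as a tuple."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)


samples = [["A", "B"], ["C", "A"], ["B", "C"]]
features = map_fragments(samples, fragment_vectors)
print(nested_shape(features))  # (n_samples, n_fragments, n_features)
```

With three samples of two fragments each, the nested result has the (3, 2, n_features) layout described above.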

input_type : str, default='filename'

Specifies the format of the input values (must be one of 'label' or 'filename').

n_jobs : int, default=1

Specifies the number of processes to create when generating the features. A positive number sets the exact number of processes, and a number less than 1 uses the number of cores the computer has.
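
The n_jobs rule can be written out as a small helper. This is only a sketch of the stated rule: `resolve_n_jobs` is a hypothetical name, and using `multiprocessing.cpu_count()` to obtain the core count is an assumption about how the lookup might be done.

```python
import multiprocessing


def resolve_n_jobs(n_jobs):
    """Turn the n_jobs setting into a concrete process count."""
    if n_jobs < 1:
        # Values below 1 mean "use every core on this machine".
        return multiprocessing.cpu_count()
    return n_jobs


print(resolve_n_jobs(2))
print(resolve_n_jobs(-1))
```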

transformer : BaseFeature, default=None

A feature extractor that takes inputs and converts them to a numpy array of data. It should convert each fragment into a vector representation. Because the information given to this class is at the label/filename level, the transformer must be able to work with filenames directly, either through the standard 'filename' input_type or through a user-defined function.

filename_to_label : callable or str, default='basename'

The function to use to convert filenames into labels. The function should take a single str argument (the filename) and return the label to use for that filename. This conversion is not strictly required, but it may allow for simpler bookkeeping outside this class. Some predefined functions are available in cls.LABEL_FUNCTIONS ('identity', 'basename') as recommendations for what to use.
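
A rough sketch of how a name or a callable could be resolved to a label function follows. The bodies of the two lambdas are assumptions about what 'identity' and 'basename' do (the reference above only shows them as lambdas), and `resolve_label_function` is a hypothetical helper.

```python
import os

# Assumed behavior: 'identity' returns the path unchanged, and 'basename'
# strips the directory and the extension. Check cls.LABEL_FUNCTIONS for
# the real definitions.
LABEL_FUNCTIONS = {
    "identity": lambda path: path,
    "basename": lambda path: os.path.splitext(os.path.basename(path))[0],
}


def resolve_label_function(filename_to_label):
    """Accept either a predefined name or a user-supplied callable."""
    if callable(filename_to_label):
        return filename_to_label
    return LABEL_FUNCTIONS[filename_to_label]


to_label = resolve_label_function("basename")
label = to_label(os.path.join("fragments", "benzene.xyz"))
print(label)
```

Under these assumptions, a file stored as benzene.xyz in any directory gets the label benzene, which keeps labels independent of where the fragment files live.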

label_to_filename : callable or list of str, default=('.', )

A function to convert labels into filenames to pass to the transformer. The function should take a single str argument and return a valid path. If a valid path does not exist, this should raise a ValueError. If this is a list, then it will be interpreted as a list of paths to search for files. Specifically, these are used in globs of the form os.path.join(dir_name, label + '.*'). Note: This will only use the first file that is found matching that label. The directories will be searched in the order given.
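
The directory search described above might look roughly like the following. `find_fragment_file` is a hypothetical name, and sorting the glob matches is an added assumption to make the "first file" choice deterministic in this sketch.

```python
import glob
import os
import tempfile


def find_fragment_file(label, search_dirs=(".",)):
    """Return the first file matching label + '.*' in the given directories."""
    for dir_name in search_dirs:
        # Directories are tried in the order given. sorted() is an assumption
        # of this sketch so that the result does not depend on glob ordering.
        matches = sorted(glob.glob(os.path.join(dir_name, label + ".*")))
        if matches:
            return matches[0]
    raise ValueError("No file found for label %r" % label)


# Usage: two search directories, only the second holds the fragment file.
with tempfile.TemporaryDirectory() as first, tempfile.TemporaryDirectory() as second:
    open(os.path.join(second, "benzene.xyz"), "w").close()
    found = find_fragment_file("benzene", [first, second])
    found_name = os.path.basename(found)

print(found_name)
```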

_x_fragments : dict, str->numpy.array

Dictionary mapping label strings to their corresponding feature vectors.

ATTRIBUTES = ('_x_fragments',)
LABELS = (('get_mapping_labels', None),)
LABEL_FUNCTIONS = {'basename': <function <lambda>>, 'identity': <function <lambda>>}
convert_input(self, X)

Convert the input (as specified in self.input_type) to a usable form.

X : list or string (depends on the instance value of input_type)

An object that stores the data for a single molecule. See the Notes for more details.

values : Object

An object that allows the lazy evaluation of different properties.


If the input_type given is not allowed.


If input_type is 'list', then it must be an iterable of (elements, coordinates) pairs, one for each molecule. The elements are an iterable of the form (ele1, ele2, …, elen) and the coordinates are an iterable of the form [(x1, y1, z1), (x2, y2, z2), …, (xn, yn, zn)]. This also allows for connections to be included. The connections are a dictionary where the keys are the indices of the atoms and the values are dictionaries with the key being another index and the value being the bond order (one of '1', 'Ar', '2', or '3'). Example for methane:

    {
        0: {1: "1", 2: "1", 3: "1", 4: "1"},
        1: {0: "1"},
        2: {0: "1"},
        3: {0: "1"},
        4: {0: "1"},
    }
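
The methane example lists every bond from both ends. A small consistency check for such a dictionary could look like this. `check_connections` and `VALID_ORDERS` are hypothetical names, and the requirement that each bond appear in both directions is inferred from the example rather than stated by the library.

```python
# Bond orders named in the description above.
VALID_ORDERS = {"1", "Ar", "2", "3"}

# Connections for methane, copied from the example above.
methane = {
    0: {1: "1", 2: "1", 3: "1", 4: "1"},
    1: {0: "1"},
    2: {0: "1"},
    3: {0: "1"},
    4: {0: "1"},
}


def check_connections(connections):
    """Return True if all bond orders are valid and every bond is mirrored."""
    for i, bonded in connections.items():
        for j, order in bonded.items():
            if order not in VALID_ORDERS:
                return False
            # The same bond should be listed from atom j back to atom i.
            if connections.get(j, {}).get(i) != order:
                return False
    return True


print(check_connections(methane))
```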

If input_type is 'filename', then it must be an iterable of paths/filenames for each molecule. Currently, the supported formats are: xyz, mol2, and a simple xyz format (.out).

If input_type is a list, then its entries will be treated as labels for each of the arguments passed in via a tuple. For example, input_type="list" can be reproduced with ["elements", "coords"] or ["elements", "coords", "connections"].

If input_type is a callable, then it is assumed that the callable returns a LazyValues object.

fit(self, X, y=None)

Fit the model.

X : list, shape=(n_samples, )

A list of objects to use to fit.

self : object

Returns the instance itself.
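
Because fit returns the instance, calls can be chained. The toy class below illustrates that convention together with a `_x_fragments` dictionary filled during fitting. `ToyFragmentMap` and its `featurize` argument are hypothetical stand-ins, not molml's implementation, and computing each unique label once is an assumption of this sketch.

```python
class ToyFragmentMap:
    """Hypothetical stand-in used only to illustrate fit returning self."""

    def __init__(self, featurize):
        self.featurize = featurize  # callable: label -> list of floats
        self._x_fragments = {}

    def fit(self, X, y=None):
        # Collect the unique labels and compute each vector once.
        labels = sorted({label for sample in X for label in sample})
        self._x_fragments = {label: self.featurize(label) for label in labels}
        return self  # returning self is what allows call chaining

    def transform(self, X):
        return [[self._x_fragments[label] for label in sample] for sample in X]


X = [["A", "B"], ["C", "A"], ["B", "C"]]
model = ToyFragmentMap(lambda label: [float(ord(label[0]))])
result = model.fit(X).transform(X)
print(sorted(model._x_fragments))
```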