Latent Structure Pattern Mining

Abstract. Pattern mining methods for graph data have largely been restricted to ground features, such as frequent or correlated subgraphs. Kazius et al. have demonstrated the use of elaborate patterns in the biochemical domain, summarizing several ground features at once. Such patterns bear the potential to reveal latent information not present in any individual ground feature. However, those patterns were handcrafted by chemical experts. In this paper, we present a data-driven bottom-up method for pattern generation that takes advantage of the embedding relationships among individual ground features. The method works fully automatically and does not require data preprocessing (e.g., to introduce abstract node or edge labels). Controlling the process of generating ground features, it is possible to align them canonically and merge (stack) them, yielding a weighted edge graph. In a subsequent step, the subgraph features can further be reduced by singular value decomposition (SVD). Our experiments show that the resulting features enable substantial performance improvements on chemical datasets that have been problematic so far for graph mining approaches.

News

  • 13 Apr 2012: High level tutorial explaining LAST-PM integration in Lazar.
  • 29 Mar 2012: Development version supports Ruby 1.9.
  • 04 Jan 2012: Java bindings available (in addition to Ruby, Python).
  • 04 Nov 2011: LAST-PM is cited in the new and highly accessed  OpenBabel paper.
  • 14 Aug 2011: The OpenTox paper is one of the most-accessed Open-Access articles. Before, OpenTox was already featured on Slashdot. LAST-PM is available as REST web service within OpenTox.
  • 01 Jul 2011: Added experimental multinomial support and regression support, i.e. pattern selection according to multiple class (>2) values and numeric values.
  • 22 Jun 2010: The Latent Structure Pattern Mining paper (co-authored by Christoph Helma, Tobias Cramer, and Stefan Kramer) has been accepted for the ECML/PKDD 2010 conference (Sep 20 – Sep 24 2010 in Barcelona) for a presentation at the conference and inclusion in the conference proceedings.

trRow_img View embedded version of the paper (you may need to be logged in with your Google account).

Usage

Note 1: I recommend LAST-PM for small- and medium-sized datasets of <800 compounds.
Note 2: The algorithm produces a sparse selection of class-correlated subgraph descriptors. Therefore, if your dataset is very skewed (say with very few actives) a non-class-correlated method might be more appropriate.
Note 3: Although a switch for regression is implemented, the algorithm has only been tested for mining class-correlated patterns in the context of binary classification.
Moreover, scientific evaluation was only done for the settings given in the paper. Find instructions how to reproduce the experimental results from ECML/PKDD 2010 below in the database downloads.

The below calls are for the purpose of demonstration. The most important use-cases and how to trigger them are described in tabular format on a separate page.

Usage via command line

Assuming a checked out and compiled algorithm in fminer2 subdirectory, a typical call is:

$ fminer2/fminer/fminer fminer2/liblast/liblast.so -f14 \
       ofsdata/folds1/yoshida.0.train.smi \
       ofsdata/folds1/yoshida.0.train.class

This calculates LAST-PM descriptors with minimum frequency 14 for Yoshida Dataset fold no 1/10. For help on switches and file formats, execute fminer2/fminer/fminer fminer2/liblast/liblast.so -h and read the README.
You would then convert this output to LAST-SMARTS using LAST-UTILS.

Note: The “\” in the code above indicate that the line does not end there. They are not part of the actual command and must not be entered.

Summary of command line options Installation and technical documentation API

Usage via REST web services

LAST-PM is available as a REST web service within OpenTox. Besides distributed computing, this allows for convenient direct access to the LAST-SMARTS format of the descriptors. However, you will have to upload your dataset first to an OpenTox dataset service (more details).

Assuming your dataset is available as http://hostname/dataset/212 and the prediction feature is given within the dataset as http://hostname/dataset/212/feature/Hamster%20Carcinogenicity, then you would call LAST-PM as follows:

curl -X POST http://hostname/algorithm/fminer/last/ \
-d "dataset_uri=http%3A%2F%2Fhostname%2Fdataset%2F212" \
-d "prediction_feature=http://hostname/\
    dataset/212/feature/Hamster%2520Carcinogenicity"

Note that you will have to URI-escape the arguments in any case, which could, as in the above example, lead to double URI-encoding for the prediction_feature value.

You can provide a minimum frequency constraint using parameter min_frequency. Follow the procedure outlined here to find an appropriate value for the parameter.

Summary of complete command line options

Interpretation Example

We inspected the molecular fragments generated for the classification of the bloodbarr dataset. A total number of 310 fragments was found by LAST-PM in the msa variant and 19 of these have ambiguities in the atomic number which represents a node of the structure.
Taking into account its statistical significance, we chose the highest relevant fragment with ambiguities for a closer inspection; its expression in SMARTS reads [#7,#8]-[#6]-[#6]-[#6]-[#7](-[#6])-[#6]-[#6]. A significant number of molecules containing this fragment are unable to cross the blood-brain barrier according to the binary classification of the dataset.

Example LAST-PM structure

Example LAST-PM structure

Depicted above are two examples for molecules which contain the fragment. Chemically interesting is the interchangeability of the hetero atoms nitrogen ([#7]) and oxygen ([#8]), marked by arrows. Although different in their chemical behavior, these atoms have in common a strong electronegativity. They become negatively charged when bound to aliphatic carbon atoms, as is the case in these fragments. In addition, the ambiguous position is separated by three carbon atoms from another negatively charged amine group. Therefore, the generated pattern represents molecules which possess the characteristic electrostatic structure of two negatively charged sites separated by a fixed distance. The occurrence of the pattern is a clear indication for a polar molecule.

This example structure involves just a single atom ambiguity. In general however, LAST-PM finds much more powerful and more flexible ambiguities. Read on to learn details about how this is expressed in SMARTS notation.

Latent Structure Output

The output is GraphML, an XML-based graph description language. It features the following extra fields:

  • act-binary target class
  • hops-how many ground patterns contributed to this meta pattern
  • lab-label of node or edge
  • weight-edge weight between 0 and hops
  • del-downweighted to 0 by SVD; indicates deletion.

Atom labels are atom count, or atom count + 150, if aromatic. Bond labels are bond order 1-3, or 4, if aromatic.

Example Output:

    <graph id="1" edgedefault="undirected">
        <data key="act">1</data>

        <data key="hops">3</data>
        <node id="0">
            <data key="lab_n">6</data>
        </node>
        <node id="1">

            <data key="lab_n">8</data>
        </node>
        <node id="2">
            <data key="lab_n">6</data>
        </node>

        <node id="3">
            <data key="lab_n">6</data>
        </node>
        <edge source="0" target="1">
            <data key="lab_e">4</data>

            <data key="weight">3</data>
            <data key="del">0</data>
        </edge>
        <edge source="1" target="2">
            <data key="lab_e">4</data>

            <data key="weight">2</data>
            <data key="del">0</data>
        </edge>
        <edge source="2" target="3">
            <data key="lab_e">4</data>

            <data key="weight">1</data>
            <data key="del">1</data>
        </edge>
    </graph>

This denotes four atoms, namely two carbons, one oxygen, and another carbon, interconnected by aromatic bonds with weights 3, 2, and 1.

LAST-UTILS

Code (Ruby)

Using last-utils, LAST-PM output is converted to a special type of SMARTS patterns (LAST-SMARTS). Formally, LAST-SMARTS are defined as follows in Extended Backus-Naur Form:

    AN := '17' | '35' | '5' | '6' | '7' | '8' | '15' | '16' | '9' | '53'
    A  := (AN ',' A) | AN
    SB := '-' | '=' | '#' | ':'
    E  := (SB ',' E) | SB
    N  := '[#' A ']'
    LR := (L ',' LR) | L
    L  := N '(' E LS ')' ('(' E N ')')+
    BN := '[#' A ';$' '(' LR ')' ']' '(~*)'+
    LS := (N | BN) | LS E (N | BN)

It uses an organic subset of atoms and four different bond types (in practice, the single bond ‘-‘ can be left out). The first two variants make use of recursive SMARTS to describe optional branches.

Take the following example of a latent structure converted to LAST-SMARTS. It describes two branches as optional:

[#7][#6;$([#6]([#7])([#7])=[#6]),$([#6]([#7][#6])([#7])=[#6])](~*)=[#6]

This denotes a nitrogen (single-)connected to a carbon double-connected to a carbon (bold). The middle carbon’s local environment, attached via conjunction (‘;‘), is described using two recursive SMARTS, denoted by $(...), and a disjunction (‘,‘).

Note: This is an example for larger optional parts of a structure, involving branches. Ambiguities on the atomic level do not require recursive SMARTS (not shown here).

Each recursive SMARTS consists of the atom itself, back, and forward links (underlined), but additionally specifies either a nitrogen or a nitrogen/carbon branch.
The back and forward links are needed, since the standard is truly recursive, i.e. nothing inside the

$(...)

is identified with the outside. For the same reason, we need

(~*)

(arbitrary branch) to enforce that at least one optional branch is actually attached.

The example would match the structures NC(N)=C and NC(NC)=C (one branch each), and also NC(N)(NC)=C (both branches), but not NC=C (no branch).

Try it out here.

Note: Variants msa and nls produce LAST-SMARTS with optional parts of the structures (with recursive SMARTS), while nop disallows optional parts of the structures (ambiguities only on the atom / edge level).

Main Database

Every .arff file contains train and test data (including features) in contiguous 90%/10% blocks. Ten such files make up a 10CV.

Database (including the 20 folds) with binary class labels, taken from the study Ulrich Rückert and Stefan Kramer, Optimizing Feature Sets for Structured Data, inStan Matwin and Dunja Mladenic, editors, 18th ECML. Springer, 2007.
The NCTRER dataset deals with the binding activity of small molecules at the estrogen receptor, the Yoshida dataset classifies molecules according to their bio-availability, and the bloodbarr dataset deals with the degree to which a molecule can cross the bloodbrain barrier.
LAST-PM, ALL, BBRC, MOSS instantiations for repeated 10-fold crossvalidation[*] .

  • Unzip file, start WEKA, open Experimenter.
  • On the first tab, click on “Open…” and load file “combined.exp”.
  • Run experiments on second tab.
  • View results on the third tab. Open “combined.csv”. Hit “Perform test”.

External Datasets

Bioavailability Datasets for external testing[*], taken from: Fumitaka Yoshida and John G. Topliss. QSAR Model for Drug Human Oral Bioavailability. Journal of Medicinal Chemistry, 43(13):2575–2585, Jun 2000.
The .arff files contain train and test data in two contiguous blocks.

  • Unzip file, start WEKA, open Experimenter.
  • On the first tab, click on “Open…” and load file “combined.exp”.
  • Run experiments on second tab.
  • View results on the third tab: Open “combined.csv”. Hit “Perform test”.

Blood-Brain Barrier Datasets for external testing[*], taken from: Hu Li, Chun Wei Yap, Choong Yong Ung, Ying Xue, Zhi Wei Cao, and Yu Zong Chen. Effect of Selection of Molecular Descriptors on the Prediction of Blood-Brain Barrier Penetrating and Nonpenetrating Agents by Statistical Learning Methods. Journal of Chemical Information and Modeling, 45(5):1376–1384, Aug 2005.
Every .arff file contains either train or test data (recognizable from the filename).

  • Unzip file, start WEKA, open Explorer.
  • On the first tab, click on “Open file…” and load a train file.
  • On the second tab, click on “Choose…” under “Classifier” and select “functions->SMO”.
  • Select “Supplied test set”, click “Set…” and load the test file corresponding to the train file.
  • Click “Start”.

Blood-Brain Barrier Datasets for repeated 10-fold crossvalidation[*] , taken from: T. J. Hou and X. J. Xu. ADME Evaluation in Drug Discovery. 3. Modeling Blood-Brain Barrier Partitioning Using Simple Molecular Descriptors. Journal of Chemical Information and Computer Sciences, 43(6):2137–2152, Oct 2003.
The .arff files contain train and test data in contiguous 90%/10% blocks. Ten such files make up a crossvalidaton.

  • Unzip file, start WEKA, open Experimenter.
  • On the first tab, click on “Open…” and load file “combined.exp”.
  • Run experiments on second tab.
  • View results on the third tab: Open “combined.csv”. Hit “Perform test”.

Links

Related work:

LAST-SMARTS related:

[*] Remarks:
The datasets are instantiated for WEKA (results were obtained with WEKA v3.6.1), i.e. include instances and features in .arff file format. Where possible, source files with chemical structures and activities are included for the external datasets.

You can recalculate the features as follows: File names such as

<dataset>.<min-frequency>.<fold>.train.<svd-compression>.<variant>.arff

have meanings, allowing to recalculate the features at any time using LAST-PM algorithm. Example:

yoshida.f14.9.train.svd20.a.nop.comb.arff

means

  • min-frequency 14,
  • fold no 9,
  • 20% SVD compression (standard) and
  • nop variant.

So the command line for fminer would be:

fminer /path/to/liblast.so -f14 \
/path/to/ofsdata/folds/yoshida.9.train.smi \
/path/to/ofsdata/folds/yoshida.9.train.class > my_features.graphml

The features are converted with last-utils, version dc6e291299dfa05e2b93c5ee734bb53eb013ddc0 as follows:

/path/to/ruby last-utils.rb 1 nop < my_features.graphml > my_features.smarts

Newer versions of last-utils are not guaranteed to produce the exact same descriptors due to improvements made later on.

Comments are closed.