Backbone Refinement Class Mining

Abstract. We present a new approach to large-scale graph mining based on so-called backbone refinement classes. The method efficiently mines tree-shaped subgraph descriptors under minimum frequency and significance constraints, using classes of fragments to reduce feature set size and running times. The classes are defined in terms of fragments sharing a common backbone. The method is able to optimize structural inter-feature entropy as opposed to purely occurrence-based criteria, which is characteristic for open or closed fragment mining.

News

  • 13 Apr 2012: High level tutorial explaining BBRC integration in Lazar.
  • 29 Mar 2012: Development version supports Ruby 1.9.
  • 04 Jan 2012: Java bindings available (in addition to Ruby, Python).
  • 04 Nov 2011: BBRC is cited in the new and highly accessed  OpenBabel paper.
  • 14 Aug 2011: The OpenTox paper is one of the most-accessed Open-Access articles. Before, OpenTox was already featured on Slashdot. BBRC is available as REST web service within OpenTox.
  • 01 Jul 2011: Added experimental multinomial support and regression support, i.e. pattern selection according to multiple class (>2) values and numeric values.
  • 22 Sep 2010: Transition package for fminer2 is available. You can use it to reproduce without much effort the crossvalidation results for BBRC descriptors in the paper, using ver 2 of the fminer software.
  • 05 May 2010: Machine Learning Journal article is available.
  • 29 Nov 2009: Added configure script and bugfixes.
  • 10 Jul 2009: KDD conference proceedings are online.
  • 30 Apr 2009: The paper has been selected for oral presentation at MLG 2009.
  • 29 Apr 2009: The Backbone Refinement Class paper (co-authored by Christoph Helma and Stefan Kramer) has been accepted for the KDD 2009 conference on Data Mining and Knowledge Discovery (Jun 28 – Jul 1 2009 in Paris) for a presentation at the conference and inclusion in the conference proceedings.

trRow_img View embedded version of the paper (you may need to be logged in with your Google account).

Usage

Note 1: I recommend BBRC mining for datasets of > 800 compounds up to very large sizes. Smaller datasets will yield a very sparse selection.
Note 2: The algorithm produces a sparse selection of class-correlated subgraph descriptors. Therefore, if your dataset is very skewed (say with very few actives) a non-class-correlated method might be more appropriate.
Note 3: Although a switch for regression is implemented, the algorithm has only been tested for mining class-correlated patterns in the context of binary classification. There is a package that you can use to automatically reproduce the KDD 2009 results, see http://github.com/amaunz/fminer2-transition.

The below calls are for the purpose of demonstration. The most important use-cases and how to trigger them are described in tabular format on a separate page.

Usage via command line

Assuming a checked out and compiled algorithm in fminer2 subdirectory, as an example, call:

$ fminer2/fminer/fminer fminer2/libbbrc/libbbrc.so -f6 \
   cpdbdata/salmonella_mutagenicity/salmonella_mutagenicity_alt.smi \
   cpdbdata/salmonella_mutagenicity/salmonella_mutagenicity_alt.class

This calculates BBRC descriptors with minimum frequency 6 for the Salmonella dataset. For help on switches and file formats, execute fminer2/fminer/fminer fminer2/libbbrc/libbbrc.so -h and read the README.

Note: The “\” in the code above indicate that the line does not end there. They are not part of the actual command and must not be entered.

Summary of command line options Installation and technical documentation API

Usage via REST web services

BBRC is available as a REST web service within OpenTox. Besides distributed computing, this allows for convenient access to the BBRC features in SMARTS format. However, you will have to upload your dataset first to an OpenTox dataset service (more details).

Assuming your dataset is available as http://hostname/dataset/212 and the prediction feature is given within the dataset as http://hostname/dataset/212/feature/Hamster%20Carcinogenicity, then you would call BBRC as follows:

curl -X POST http://hostname/algorithm/fminer/bbrc/ \
-d "dataset_uri=http%3A%2F%2Fhostname%2Fdataset%2F212" \
-d "prediction_feature=http://hostname/dataset/\
    212/feature/Hamster%2520Carcinogenicity"

Note that you will have to URI-escape the arguments in any case, which could, as in the above example, lead to double URI-encoding for the prediction_feature value.

You can provide a minimum frequency constraint using parameter min_frequency. Follow the procedure outlined here to find an appropriate value for the parameter.

Summary of complete command line options.

Euclidean Embedding of Molecules and Backbone Refinement Class Features based on Co-Occurrence and Entropy

(Schulz et al.)

The 2-D embedding reflects feature-feature and instance-feature co-occurrence as well as entropy of the features with respect to the target classes.

Usage for animations (Flash-plugin required):

  1. (De)activating features are (red) green, (In)active instances (salmon) blue.
  2. Point your mouse to a feature (instance). The matching instances (features) will be marked. The brighter a feature, the more significantly it is correlated with the endpoint.
  3. Use the mouse wheel to zoom in and out and the left mouse button to drag.

Data: CPDB salmonella mutagenicity
Thresholds: 95% significance, minimum frequency 6 (Switches: -f 6)
Euclidean embedding of a Backbone Refinement Class Descriptors and Salmonella Mutagenicity data
Static version.
Animation.

Data: CPDB Multicell Call
Thresholds: 95% significance, minimum frequency 6 (Switches: -f 6)
Euclidean embedding of a Backbone Refinement Class Descriptors and Multicell call data
Static version.
Animation.

Data: Mouse Carcinogenicity
Thresholds: 95% significance, minimum frequency 6 (Switches: -f 6)
Euclidean embedding of a Backbone Refinement Class Descriptors and Mouse Carcinogenicity data
Static version.
Animation.

Data: Rat Carcinogenicity
Thresholds: 95% significance, minimum frequency 6 (Switches: -f 6)
Euclidean embedding of a Backbone Refinement Class Descriptors and Rat Carcinogenicity data
Static version.
Animation.

Observations: Separation of target classes along top left to bottom right, many highly descriptive features. Good distribution of features and instances, and well-characterized groups of instances in the outer parts. The data seems suitable for classification tasks.

Validation and Repeatability

Links to the datasets used in this study (the thresholds used and switches for fminer are given below):

  • CPDB carcinogenicity and mutagenicity datasetsTo reproduce experiments (Fminer arguments) : 95% significance (default), minimum frequency 6 (-f6).
  • Yeast anti-cancer datasetsTo reproduce experiments (Fminer arguments) : 95% significance (default), minimum frequency given in the paper.

CPDB and Large-Scale Datasheet, comparison to Open Trees, Maximal Trees (large scale), Static vs Upper Bound Pruning. Excel Sheet (.xls) for MS Office 2000/2003/2007 (132K)

  • Runtime
  • Feature Count
  • Leave-One-Out Crossvalidation Accuracy
  • Large-Scale Analysis

Repeatability: the thresholds used and switches for fminer are given in the file.

Comments are closed.