cs.maunz.de  

Computer Science

Backbone Refinement Class Mining (BBRC).


Abstract. We present a new approach to large-scale graph mining based on so-called backbone refinement classes. The method efficiently mines tree-shaped subgraph descriptors under minimum frequency and significance constraints, using classes of fragments to reduce feature set size and running times. The classes are defined in terms of fragments sharing a common backbone. The method is able to optimize structural inter-feature entropy as opposed to purely occurrence-based criteria, which is characteristic for open or closed fragment mining. [draft pdf, in publications, full source code, API]

News

29 Apr 2009: The Backbone Refinement Class paper (co-authored by Christoph Helma and Stefan Kramer) has been accepted for the KDD 2009 conference on Data Mining and Knowledge Discovery (Jun 28 - Jul 1 2009 in Paris) for a presentation at the conference and inclusion in the conference proceedings.

30 Apr 2009: The paper has been selected for oral presentation at MLG 2009.

10 Jul 2009: KDD conference proceedings are online.

29 Nov 2009: Added configure script and bugfixes.

05 May 2010: Machine Learning Journal article is available.


Usage Example

Assuming a checked out and compiled algorithm in fminer2 subdirectory, as an example, call:
$ fminer2/fminer/fminer fminer2/libbbrc/libbbrc.so -f6 cpdbdata/salmonella_mutagenicity/salmonella_mutagenicity_alt.smi \
   cpdbdata/salmonella_mutagenicity/salmonella_mutagenicity_alt.class > salmonella.bbrc
This calculates BBRC descriptors with minimum frequency 6 for the Salmonella dataset. For help on switches and file formats, execute fminer2/fminer/fminer fminer2/libbbrc/libbbrc.so -h and read the README.

Find the API documentation here.



Euclidean Embedding of Molecules and Backbone Refinement Class Features based on Co-Occurrence and Entropy

(Schulz et al.)

The 2-D embedding reflects feature-feature and instance-feature co-occurrence as well as entropy of the features with respect to the target classes.

Usage: (De)activating features are (red) green, (In)active instances (salmon) blue. Point your mouse to a feature (instance). The matching instances (features) will be marked. The brighter a feature, the more significantly it is correlated with the endpoint. Use the mouse wheel to zoom in and out and the left mouse button to drag.

Observations: Separation of target classes along top left to bottom right, many highly descriptive features. Good distribution of features and instances, and well-characterized groups of instances in the outer parts. The data seems suitable for classification tasks.

Links to the datasets used in this study (the thresholds used and switches for fminer are given below):

Data: CPDB salmonella mutagenicity
Thresholds: 95% significance, minimum frequency 6 (Switches: -f 6)

Euclidean embedding of a Backbone Refinement Class Descriptors and Salmonella Mutagenicity data

Click here for the static version (opens in a separate window).

Click here for the animation (opens in a separate window).

Data: CPDB Multicell Call
Thresholds: 95% significance, minimum frequency 6 (Switches: -f 6)

Euclidean embedding of a Backbone Refinement Class Descriptors and Multicell call data

Click here for the static version (opens in a separate window).

Click here for the animation (opens in a separate window).

Data: Mouse Carcinogenicity
Thresholds: 95% significance, minimum frequency 6 (Switches: -f 6)

Euclidean embedding of a Backbone Refinement Class Descriptors and Mouse Carcinogenicity data

Click here for the static version (opens in a separate window).

Click here for the animation (opens in a separate window).

Data: Rat Carcinogenicity
Thresholds: 95% significance, minimum frequency 6 (Switches: -f 6)

Euclidean embedding of a Backbone Refinement Class Descriptors and Rat Carcinogenicity data

Click here for the static version (opens in a separate window).

Click here for the animation (opens in a separate window).

BBRC Feature Validation

CPDB and Large-Scale Datasheet, comparison to Open Trees, Maximal Trees (large scale), Static vs Upper Bound Pruning.

Excel Sheet (.xls) for MS Office 2000/2003/2007 (132K)

  • Runtime
  • Feature Count
  • Leave-One-Out Crossvalidation Accuracy
  • Large-Scale Analysis

Repeatability: the thresholds used and switches for fminer are given in the file.

Contact

Dipl.-Inf. Andreas Maunz
Machine Learning Lab
University Freiburg
Georges-Köhler-Allee 79
79110 Freiburg, Germany
Phone: +49761/203-8442, Fax: +49761/203-7700
Email: maunza@fdm.uni-freiburg.de
Web: http://cs.maunz.de