LibBbrc (libbbrc)

LibBbrc

This is the Bbrc library, available at http://github.com/amaunz/fminer2/tree/master , subdirectory libbbrc (see below for download and build instructions).
The Fminer frontend application is available from http://github.com/amaunz/fminer2/tree/master , subdirectory fminer.
You may download the scientific documentation from http://cs.maunz.de . The paper is entitled "Large Scale Graph Mining using Backbone Refinement Classes".
Supporting information is available here: http://bbrc.maunz.de .

Contents

Contact details are located at the end of this page.



News

29 Apr 2009: The Backbone Refinement Class paper (co-authored by Christoph Helma and Stefan Kramer) has been accepted for the KDD 2009 conference on Data Mining and Knowledge Discovery (Jun 28 - Jul 1 2009 in Paris) for a presentation at the conference and inclusion in the conference proceedings.

30 Apr 2009: The paper has been selected for oral presentation at MLG 2009.

10 Jul 2009: KDD conference proceedings are online.

29 Nov 2009: Added configure script and bugfixes.

19 Apr 2010: Machine Learning paper accepted.



Abstract

We present a new approach to large-scale graph mining based on so-called backbone refinement classes. The method efficiently mines tree-shaped subgraph descriptors under minimum frequency and significance constraints, using classes of fragments to reduce feature set size and running times. The classes are defined in terms of fragments sharing a common backbone. The method is able to optimize structural inter-feature entropy as opposed to occurrences, which is characteristic for open or closed fragment mining. In the experiments, the proposed method reduces feature set sizes by >90 % and >30 % compared to complete tree mining and open tree mining, respectively. Evaluation using crossvalidation runs shows that their classification accuracy is similar to the complete set of trees but significantly better than that of open trees. Compared to open or closed fragment mining, a large part of the search space can be pruned due to an improved statistical constraint (dynamic upper bound adjustment), which is also confirmed in the experiments in lower running times compared to ordinary (static) upper bound pruning. Further analysis using large-scale datasets yields insight into important properties of the proposed descriptors, such as the dataset coverage and the class size represented by each descriptor. A final cross-validation run confirms that the novel descriptors render large training sets feasible which previously might have been intractable.

Euclidean embedding of a Backbone Refinement Class Descriptors and Salmonella Mutagenicity data
Co-occurrence-based 2D embedding of molecules and backbone refinement class features showing close to perfect separation of target classes along top left to bottom right. (De)activating features are (red) green, (In)active instances (salmon) blue. Data: CPDB salmonella mutagenicity; Euclidean embedding: Schulz et. al,.
Click here for a flash-animated version, indicating occurrences.

License

LibBbrc is licensed under the terms of the GNU General Public License (GPL, see LICENSE). LibBbrc is derived from (i.e. includes code from) the following project, licensed under GPL:

LibBbrc uses (i.e. links to) the following projects, also licensed under GPL:

These licensing conditions mean essentially that your published program may only use (i.e., link to) and/or derive code from LibBbrc under the condition that your source code is also freely available. This is to secure public availability and freedom of use.



Installation

LibBbrc is a library, written in C++. It dynamically links to OpenBabel and GSL libraries. This section describes the installation of both the library and the frontend application for Linux.

Compiling from source

Linux SO: install development tools (gcc, g++, GNU make) and GSL as well as OpenBabel development package, then compile LibBbrc. On Ubuntu, you can e.g. do it like this:

Language Portability

The API can be made available to other languages. Follow the installation instructions above. A config file for Swig to automagically create languages bindings exists (rbbrc_wrap.i).

The Makefile features a target that creates ruby bindings using this file. On Ubuntu, you can e.g. do this:

Guidance on Using (Lib)Bbrc

Most setting are sensible by default, see description of constructors and objects below.

I would suggest to manipulate the minimum frequency only at first. The number of fragments output should not be more than a multitude of the number of input graphs. For most chemical databases, a minimum frequency threshold of 2-3% will deliver good results. LibBbrc does not support percentage values, you will have to calculate absolute numbers.

Environment Variables

FMINER_SMARTS: Produce output in SMARTS format. In this case, each line is a YAML sequence, containing SMARTS fragment, p-value, and two sequences denoting positive and negative class occurrences (line numbers in Smiles file):

  - [ smarts,    p_chisq,    occ_list_active,    occ_list_inactive ]

Documentation for YAML can be found at: http://yaml.org/spec/cvs/current.html# (e.g. export FMINER_SMARTS=1).
FMINER_LAZAR: Produce output in linfrag format which can be used as input to Lazar (e.g. export FMINER_LAZAR=1).
FMINER_PVALUES: Produce p-values instead of chi-square values (e.g. export FMINER_PVALUES=1).
FMINER_NO_AROMATIC: Disallow aromatic bonds on non-aromatic bonds (e.g. export FMINER_NO_AROMATIC=1).

Examples using the LibBbrc API

LibBbrc uses the 'singleton' design pattern known from software engineering, i.e., class instantiation is restricted to one object. To empty the database after a run to feed new compounds, use the Bbrc::Reset() routine.

The following code demonstrate the use of the Bbrc API from C++ and ruby. It feeds a set of class-labelled molecules in SMILES format (the API currently allows no gSpan input, use the frontend application for that) and calculates a vector of fragments along with statistical relevance and occurrences and prints them out. Every root node corresponds to a single chemical element. The output consists of gSpan graphs.

C++

This example uses libBbrc in a C++ program. The example assumes that you have created the C++ library using make.

 #include "bbrc.h"
 #include <iostream>
 #include <string.h>
 using namespace std;

 Bbrc* MyFminer;
 int main(int argc, char *argv[], char *envp) {
   MyFminer= new Bbrc();
   MyFminer->AddCompound ("COC1=CC=C(C=C1)C2=NC(=C([NH]2)C3=CC=CC=C3)C4=CC=CC=C4" , 1);
   MyFminer->AddCompound ("O=C1NC(=S)NC(=O)C1C(=O)NC2=CC=CC=C2" , 2);
      // ... continue adding compounds
   MyFminer->AddActivity((bool) true, 1);
   MyFminer->AddActivity((bool) false, 2);
      // ... continue adding activities (true for active, false for inactive)
   cerr << MyFminer->GetNoCompounds() << " compounds" << endl;
   // Toy example: special settings for mining all fragments
   MyFminer->SetChisqSig(0); // use no significance constraint
   MyFminer->SetRefineSingles(true); // refine structures with support 1
   // gather results for every root node in vector instead of immediate output
   MyFminer->SetConsoleOut(false);
   for ( int j = 0; j < (int) MyFminer->GetNoRootNodes(); j++ ) {
      vector<string>* result = MyFminer->MineRoot(j);
      for( int i = 0; i < result->size(); i++) {
        cout << (*result)[i] << endl;
      }
   }
   delete MyFminer;
 }

Ruby

This example assumes that you have created ruby bindings using make ruby.

 require 'bbrc'
 MyFminer = Bbrc::Bbrc.new()
 MyFminer.AddCompound("COC1=CC=C(C=C1)C2=NC(=C([NH]2)C3=CC=CC=C3)C4=CC=CC=C4" , 1)
 MyFminer.AddCompound("O=C1NC(=S)NC(=O)C1C(=O)NC2=CC=CC=C2" , 2)
    # ... continue adding compounds
 MyFminer.AddActivity(true, 1)
 MyFminer.AddActivity(false, 2)
    # ... continue adding activities (true for active, false for inactive)
 print MyFminer.GetNoCompounds()  
 puts " compounds"
 # Toy example: special settings for mining all fragments
 # use no significance constraint
 MyFminer.SetChisqSig(0) 
 # refine structures with support 1
 MyFminer.SetRefineSingles(true) 
 # gather results for every root node in vector instead of immediate output
 MyFminer.SetConsoleOut(false)
 (0 .. MyFminer.GetNoRootNodes()-1).each do |j|
    result = MyFminer.MineRoot(j)
    puts "Results"
    result.each do |res|
        puts res
   end
 end

Description of Constructors and Options

For the purpose of demonstration we used a toy database of two compounds and an unusual parameter configuration. Please note, that in general the defaults set by the standard constructor are sensible for most databases. They switch on BBRC mining with upper bound pruning for 95% significance and a minimum frequency of 2. The complete standard settings are:

  Bbrc ();

There also exist more flexible constructors:

 Bbrc (int type, unsigned int minfreq);
 Bbrc (int type, unsigned int minfreq, float chisq_val, bool do_backbone);

It is recommended to increase minimum frequency as a first step when too many features are generated.



Contact

Dipl.-Inf. Andreas Maunz
Freiburg Center for Data Analysis and Modelling
Hermann-Herder-Str. 3a
79104 Freiburg, Germany
Phone: +49761/203-8442, Fax: +49761/203-7700
Email: maunza@fdm.uni-freiburg.de
Web: http://cs.maunz.de

Author:
(c) 2010 by Andreas Maunz, 2010
Generated on Mon Jul 26 17:57:44 2010 for libbbrc by  doxygen 1.6.3