|
libbbrc
|
LibBbrc
This is the Bbrc library, available at http://github.com/amaunz/fminer2/tree/master , subdirectory libbbrc (see below for download and build instructions).
The Fminer frontend application is available from http://github.com/amaunz/fminer2/tree/master , subdirectory fminer.
You may download the scientific documentation from http://cs.maunz.de . The paper is entitled "Large Scale Graph Mining using Backbone Refinement Classes".
Supporting information is available here: http://bbrc.maunz.de .
Note that scientific evaluation was only done for the settings given in the paper. Thus, for other settings no results can be guaranteed. The switches for the leave-one-out crossvalidation in the paper were:
There is a package that you can use to automatically reproduce the KDD 2009 results on BBRC features, see http://github.com/amaunz/fminer2-transition.
Contact details are located at the end of this page.
We present a new approach to large-scale graph mining based on so-called backbone refinement classes. The method efficiently mines tree-shaped subgraph descriptors under minimum frequency and significance constraints, using classes of fragments to reduce feature set size and running times. The classes are defined in terms of fragments sharing a common backbone. The method is able to optimize structural inter-feature entropy as opposed to occurrences, which is characteristic for open or closed fragment mining. In the experiments, the proposed method reduces feature set sizes by >90 % and >30 % compared to complete tree mining and open tree mining, respectively. Evaluation using crossvalidation runs shows that their classification accuracy is similar to the complete set of trees but significantly better than that of open trees. Compared to open or closed fragment mining, a large part of the search space can be pruned due to an improved statistical constraint (dynamic upper bound adjustment), which is also confirmed in the experiments in lower running times compared to ordinary (static) upper bound pruning. Further analysis using large-scale datasets yields insight into important properties of the proposed descriptors, such as the dataset coverage and the class size represented by each descriptor. A final cross-validation run confirms that the novel descriptors render large training sets feasible which previously might have been intractable.
|
|
Co-occurrence-based 2D embedding of molecules and backbone refinement class features showing close to perfect separation of target classes along top left to bottom right. (De)activating features are (red) green, (In)active instances (salmon) blue. Data: CPDB salmonella mutagenicity; Euclidean embedding: Schulz et. al,. Click here for a flash-animated version, indicating occurrences. |
LibBbrc is licensed under the terms of the GNU General Public License (GPL, see LICENSE). LibBbrc is derived from (i.e. includes code from) the following project, licensed under GPL:
LibBbrc uses (i.e. links to) the following projects, also licensed under GPL:
These licensing conditions mean essentially that your published program may only use (i.e., link to) and/or derive code from LibBbrc under the condition that your source code is also freely available. This is to secure public availability and freedom of use.
LibBbrc is a library, written in C++. It dynamically links to OpenBabel and GSL libraries. This section describes the installation of both the library and the frontend application for Linux.
Linux SO: install development tools (gcc, g++, GNU make) and GSL as well as OpenBabel development package, then compile LibBbrc. On Ubuntu, you can e.g. do it like this:
apt-get install build-essential # development tools apt-get install libgsl0-dev # GSL binary lib and headers
apt-get build-dep libopenbabel-dev # build dependencies for OB apt-get source libopenbabel-dev # extracts OB source to the current dir
apt-get install libopenbabel-dev # OB binary lib and headers
git clone git://github.com/amaunz/fminer2.git) and cd to libbbrc subdirectory. Use ./configure to configure the Makefile automatically or, in the Makefile, adjust the include (-I) and linker (-L) flags. Run make.fminer subdirectory. Use ./configure to configure the Makefile automatically or, in the Makefile, adjust the include (-I) and linker (-L) flags. Run make.The API can be made available to other languages.
The Makefile features a target that creates ruby bindings. On Ubuntu, you can e.g. do this:
sudo apt-get install ruby1.8-dev./configure <version> to configure the Makefile automatically or, adjust the include flags (-I) in the Makefile in the line INCLUDE_RB = ... so that the directory contains file ruby.h. Also, let RUBY = ... point to the right executable.make ruby. Use make rbtest to test. The configuration was tested with swig 1.3.40 and Ruby 1.8.The Makefile features a target that creates python bindings. On Ubuntu, you can e.g. do this:
sudo apt-get install python2.7-devINCLUDE_PY = ... so that the directory contains file Python.h. Also, let PYTHON = ... point to the right executable.make python. Use make pytest to test. The configuration was tested with Python 2.7.The Makefile features a target that creates java bindings. On Ubuntu, you can e.g. do this:
sudo apt-get install openjdk-6-jdkINCLUDE_JAVA = ... so that the directory contains file jni.h. Also, make sure that the paths in LD_PRELOAD=... in the jtest line are correct.make java. Use make jtest to test. The configuration was tested with OpenJDK 1.6.Important: There are swig interface files (*.i) and pre-configured swig output files (*.cxx). You need to re-create those output files if you are deploying for newer versions of the target languages, and you can find the necessary swig calls in the Makefile (commented out).
Bbrc descriptors are a sparse collection of structurally dissimilar, class-correlated descriptors. You must provide input molecules in SMILES format and a target class for every molecule (see examples below), maximally five different classes. You can also provide numeric values instead of target classes. In that case you must use SetRegression(true).
Note: Always do SetRegression(true) first, before adding any numeric value.
Most setting are sensible by default, see description of constructors and objects below. I would suggest to manipulate the minimum frequency only at first. The number of fragments output should not be more than a multitude of the number of input graphs. For minimum frequency, LibBbrc does not support percentage values. You will have to calculate absolute numbers.
FMINER_SMARTS: Produce output in SMARTS format. In this case, each line is a YAML sequence, containing SMARTS fragment, p-value, and sequences denoting class occurrences (line numbers in Smiles file):
- [ smarts, p_chisq, occ_list_class1, occ_list_class2, ... ]
Documentation for YAML can be found at: http://yaml.org/spec/cvs/current.html# (e.g. export FMINER_SMARTS=1).
FMINER_LAZAR : Produce output in linfrag format which can be used as input to Lazar (e.g. export FMINER_LAZAR=1).
FMINER_PVALUES : Produce p-values instead of chi-square values (e.g. export FMINER_PVALUES=1).
FMINER_NO_AROMATIC_WC: Disallow aromatic wildcard bonds on aliphatic bonds, when aromatic perception was switched off ('-a') (e.g. export FMINER_NO_AROMATIC_WC=1).
FMINER_SILENT : Redirect STDERR (debug output) of fminer to local file 'fminer_debug.txt'
FMINER_NR_HITS : Display (in the occurrence lists) the number of times each fragment occurs in a molecule.
Note: The value you set the environment variables to is irrelevant. Use unset to disable the environment variables, e.g. unset FMINER_LAZAR.
LibBbrc uses the 'singleton' design pattern known from software engineering, i.e., class instantiation is restricted to one object. To empty the database after a run and before feeding new compounds, use the Bbrc::Reset() routine.
The following code demonstrate the use of the Bbrc API from C++, python, and ruby. It feeds a set of class-labelled molecules in SMILES format (the API currently allows no gSpan input, use the frontend application for that) and calculates a vector of fragments along with statistical relevance and occurrences and prints them out. Every root node corresponds to a single chemical element. The output consists of gSpan graphs.
This example uses libBbrc in a C++ program. The example assumes that you have created the C++ library using make.
#include "bbrc.h" #include <iostream> #include <string.h> using namespace std; Bbrc* MyFminer; // global singleton instance int main(int argc, char *argv[], char *envp) { MyFminer= new Bbrc(); // Toy example: special settings for mining all fragments MyFminer->SetChisqSig(0); // use no significance constraint MyFminer->SetRefineSingles(true); // refine structures with support 1 MyFminer->SetConsoleOut(false); // Add compounds below. IMPORTANT! Do not change settings after adding compounds! MyFminer->AddCompound ("COC1=CC=C(C=C1)C2=NC(=C([NH]2)C3=CC=CC=C3)C4=CC=CC=C4" , 1); MyFminer->AddCompound ("O=C1NC(=S)NC(=O)C1C(=O)NC2=CC=CC=C2" , 2); // ... continue adding compounds MyFminer->AddActivity((bool) 1.0, 1); // 1.0 denotes one class in this example, MyFminer->AddActivity((bool) 0.0, 2); // 0.0 denotes the other class (you can use more than two classes, max 5). // ... continue adding activities (true for active, false for inactive) cerr << MyFminer->GetNoCompounds() << " compounds" << endl; // gather results for every root node in vector instead of immediate output for ( int j = 0; j < (int) MyFminer->GetNoRootNodes(); j++ ) { vector<string>* result = MyFminer->MineRoot(j); for( int i = 0; i < result->size(); i++) { cout << (*result)[i] << endl; } } delete MyFminer; // or call Reset() and start over. }
This example assumes that you have created ruby bindings using make ruby.
require 'bbrc' MyFminer = Bbrc::Bbrc.new() # global singleton instance # Toy example: special settings for mining all fragments # use no significance constraint MyFminer.SetChisqSig(0) # refine structures with support 1 MyFminer.SetRefineSingles(true) # gather results for every root node in vector instead of immediate output MyFminer.SetConsoleOut(false) # Add compounds below. IMPORTANT! Do not change settings after adding compounds! MyFminer.AddCompound("COC1=CC=C(C=C1)C2=NC(=C([NH]2)C3=CC=CC=C3)C4=CC=CC=C4" , 1) MyFminer.AddCompound("O=C1NC(=S)NC(=O)C1C(=O)NC2=CC=CC=C2" , 2) # ... continue adding compounds MyFminer.AddActivity(1,0, 1) # 1.0 denotes one class in this example... MyFminer.AddActivity(0.0, 2) # 0.0 denotes the other class (you can use more than two classes, max 5). # ... continue adding activities (true for active, false for inactive) print MyFminer.GetNoCompounds() puts " compounds" (0 .. MyFminer.GetNoRootNodes()-1).each do |j| result = MyFminer.MineRoot(j) puts "Results" result.each do |res| puts res end end # Call Reset() to start over.
This example assumes that you have created python bindings using make python.
import bbrc MyFminer = bbrc.Bbrc() # global singleton instance # Toy example: special settings for mining all fragments # use no significance constraint MyFminer.SetChisqSig(0) # refine structures with support 1 MyFminer.SetRefineSingles(1) MyFminer.SetConsoleOut(0) # Add compounds below. IMPORTANT! Do not change settings after adding compounds! MyFminer.AddCompound("COC1=CC=C(C=C1)C2=NC(=C([NH]2)C3=CC=CC=C3)C4=CC=CC=C4" , 1) MyFminer.AddCompound("O=C1NC(=S)NC(=O)C1C(=O)NC2=CC=CC=C2" , 2) # ... continue adding compounds MyFminer.AddActivity(1.0, 1) # 1.0 denotes one class in this example... MyFminer.AddActivity(0.0, 2) # 0.0 denotes the other class (you can use more than two classes, max 5). # ... continue adding activities (true for active, false for inactive) print repr(MyFminer.GetNoCompounds()) + ' compounds.' # gather results for every root node in vector instead of immediate output for j in range(0, MyFminer.GetNoRootNodes()): result = MyFminer.MineRoot(j); for i in range(0, result.size()): print result[i]; # Call Reset() to start over.
This example assumes that you have created java bindings using make java.
public class test { public static void main(String args[]) { System.loadLibrary("bbrc"); Bbrc MyFminer; MyFminer = new Bbrc(); // Toy example: special settings for mining all fragments MyFminer.SetChisqSig(0); // use no significance constraint MyFminer.SetRefineSingles(true); // refine structures with support 1 MyFminer.SetConsoleOut(false); // Add compounds below. IMPORTANT! DO NOT CHANGE SETTINGS AFTER ADDING COMPOUNDS! MyFminer.AddCompound ("COC1=CC=C(C=C1)C2=NC(=C([NH]2)C3=CC=CC=C3)C4=CC=CC=C4", 1); MyFminer.AddCompound ("O=C1NC(=S)NC(=O)C1C(=O)NC2=CC=CC=C2", 2); // ... continue adding compounds MyFminer.AddActivity(1.0F, 1); MyFminer.AddActivity(0.0F, 2); // ... continue adding activities (1.0F for active, 0.0F for inactive) System.out.println(MyFminer.GetNoCompounds() + " compounds"); // gather results for every root node in vector instead of immediate output for (int j = 0; j < (int) MyFminer.GetNoRootNodes(); j++) { SVector result = MyFminer.MineRoot(j); for(int i = 0; i < result.size(); i++) { System.out.println(result.get(i)); } } MyFminer = null; } }
For the purpose of demonstration we used a toy database of two compounds and an unusual parameter configuration. Please note, that in general the defaults set by the standard constructor are sensible for most databases. They switch on BBRC mining with upper bound pruning for 95% significance and a minimum frequency of 2. The complete standard settings are:
Bbrc ();
There also exist more flexible constructors:
Bbrc (int type, unsigned int minfreq); Bbrc (int type, unsigned int minfreq, float chisq_val, bool do_backbone);
It is recommended to increase minimum frequency as a first step when too many features are generated.
Dipl.-Inf. Andreas Maunz
Institute for Physics
Hermann-Herder-Str. 3
79104 Freiburg, Germany
Email: maunza@fdm.uni-freiburg.de
Web: http://cs.maunz.de