Software for Learning Bayesian Networks
Datasets

These packages contain all of the files necessary to reproduce the experiments in many of our papers.

UAI 2015 - Quality Study

PSS files - This was a large study investigating the quality of Bayesian network structures learned using a variety of exact and approximate algorithms.

AAAI 2014 - Portfolio Study

PSS files - This was a large study investigating the prediction of runtimes for learning Bayesian network structures. As part of this study, a large number of PSS files were created.

AAAI 2014; UAI 2014

Published datasets (csv, pss, binary scores) - All of the files to reproduce the experiments in our AAAI 2014 and UAI 2014 papers.

UAI 2013

Experiment set 2 - All of the files to reproduce the experiments in Section 5.4 of the paper, including the synthetic networks (Net folder), sampled datasets (csv folder), and parent set scores (pss folder). This spreadsheet (UAI 2013 sheet) gives runtime information for using the score program in the C++ version of URLearning to calculate the necessary parent set scores from the sampled datasets. The calculations were run on nodes with 32GB of RAM and two Intel Xeon E5540 2.53GHz CPUs. Each CPU has 4 cores, so one node has a total of 8 cores; with hyperthreading enabled (as I believe it was), a node exposes 16 hardware threads. 10 threads were used during the score calculations.

Papers prior to UAI 2013

Published datasets (csv only) - The input files necessary to reproduce the experiments in most of our papers prior to UAI 2013, including the input datasets (csv folder). Most of these datasets are processed versions of datasets from the UCI machine learning repository. Continuous variables were binarized around their means. Each value of a categorical variable was mapped arbitrarily to an integer (e.g., a variable with four categories would be mapped to {0, 1, 2, 3}); these integer values were then binarized around their mean. For most datasets, records with missing values were removed. This process sometimes results in variables with only a single observed value; these variables can affect the scores in unexpected ways, especially for fNML. I am working to remove these datasets.
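
The preprocessing above can be sketched in a few lines of Python. This is only an illustration of the described steps, not the script we actually used; the function names are mine, and I assume "binarized around the mean" means strictly-above-mean maps to 1.

```python
def drop_missing(rows, missing="?"):
    """Remove any record containing a missing value."""
    return [r for r in rows if missing not in r]

def encode_categorical(values):
    """Arbitrarily map each category to an integer, in order of first appearance."""
    codes = {c: i for i, c in enumerate(dict.fromkeys(values))}
    return [codes[v] for v in values]

def binarize_around_mean(values):
    """Map each numeric value to 1 if above the mean, else 0 (an assumption)."""
    mean = sum(values) / len(values)
    return [1 if v > mean else 0 for v in values]

# Example column: a categorical variable is first integer-coded,
# then binarized around the mean of those integer codes.
column = ["red", "green", "blue", "green", "red", "blue", "blue"]
encoded = encode_categorical(column)   # -> [0, 1, 2, 1, 0, 2, 2]
binary = binarize_around_mean(encoded) # -> [0, 0, 1, 0, 0, 1, 1]
```

Note that the integer coding is arbitrary, so the resulting binarization of a categorical variable depends on the (arbitrary) category order; this is one reason the scheme may not reflect real-world usage.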

Scores were calculated for these datasets with a parent limit of 8; furthermore, a time limit of 10 minutes was imposed on the actual score calculation for each variable (note that post-processing pruning and writing to disk are not included in this limit). The scores were calculated on a node which has XXX. This spreadsheet (Published Datasets sheet) gives runtime information for using the score program in the C++ version of URLearning to calculate, prune, and write the scores to disk. Due to technical problems with the automation, some of the running times may not be exact; I am working to correct these. The following scores are available:

Unpublished datasets - The preprocessing scheme described for the published datasets may not reflect real-world usage scenarios, so I recalculated the scores using a more realistic preprocessing scheme. First, records with missing values were removed. Next, continuous variables were discretized according to the NML-optimal histogram of Kontkanen and Myllymäki. Then, categorical variables with more than 10 values were typically removed (a note is included if some other step was taken to handle categorical variables with large cardinality). Finally, variables with a single observed value were removed from the dataset.
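
As a rough sketch, the pipeline looks like the Python below. The NML-optimal histogram of Kontkanen and Myllymäki is nontrivial, so the simple equal-width binning here is only a stand-in for that step; all function names and the bin count are illustrative, not part of URLearning.

```python
def drop_missing(rows, missing="?"):
    """Step 1: remove any record containing a missing value."""
    return [r for r in rows if missing not in r]

def discretize(values, bins=3):
    """Step 2 (stand-in): equal-width binning in place of the
    NML-optimal histogram used for the actual datasets."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # avoid division by zero for constant columns
    return [min(int((v - lo) / width), bins - 1) for v in values]

def keep_variable(values, max_cardinality=10):
    """Steps 3 and 4: drop variables with more than 10 values
    or with only a single observed value."""
    k = len(set(values))
    return 1 < k <= max_cardinality
```

A constant column fails the `keep_variable` check (only one observed value), which is how the final filtering step removes the problematic single-value variables mentioned for the published datasets.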

This preprocessing scheme results in more large-cardinality variables than the simple scheme used for the published results. Consequently, it can lead to more parent set pruning for scoring functions with a heavy complexity penalty, such as BIC. On the other hand, larger parent sets can be more informative, which can reduce the amount of parent set pruning. An interesting avenue for future work is to study how this preprocessing affects learning.

File Formats

Unless otherwise noted, the files use the following formats.