Conflict and Dependency Graphs of Highly Configurable Software

  1. Heradio, Ruben (1)
  2. Cambelo, Luis (1)
  3. Olivero, Miguel Angel (2)
  4. Sanchez, Jose Manuel (3)
  5. Fernandez-Amoros, David (1)

Affiliations:
  1. Universidad Nacional de Educación a Distancia, Madrid, Spain (ROR: https://ror.org/02msb5n36)
  2. Universidad de Sevilla, Sevilla, Spain (ROR: https://ror.org/03yxnpp24)
  3. Universidad de Cádiz, Cádiz, Spain (ROR: https://ror.org/04mxxkb11)

Publisher: Zenodo

Year of publication: 2025

Type: Dataset

Version: 1

License: CC BY 4.0

Dataset versions:
  Version 1, created 06-10-2025, DOI: 10.5281/zenodo.17277789

Abstract

This repository contains a dataset of conflict and dependency graphs for 5,709 highly configurable software systems, along with the code needed to replicate and validate the dataset.

1. Motivation and Purpose

Software Product Lines (SPLs) enable developers to create diverse software products by combining and configuring reusable blocks of code, commonly referred to as features or configurable options. These systems are typically represented by Variability Models (VMs), which capture the relationships and constraints between options. However, VMs often fail to represent explicitly the indirect and transitive relationships that emerge from chaining direct constraints. These hidden relationships can make large-scale models, such as the Linux Kernel, difficult to reason about, validate, and configure manually.

This dataset addresses these limitations by providing transitive-closure graphs for 5,709 variability models from industrial and open-source systems. This graph-based representation offers several advantages:

-   Explicit relationships: it makes all direct and indirect "strong relationships" (dependencies and conflicts) explicit. A strong dependency means that whenever option A is enabled, option B must also be enabled. A strong conflict means that options A and B can never be enabled simultaneously.
-   Advanced analysis: it enables the use of powerful graph analysis tools (such as Gephi, Pajek, and igraph) to compute metrics, detect structural patterns (e.g., small-world structure), identify connected components, and find dominant features.
-   Improved reasoning: it helps propagate user configuration decisions automatically and simplifies reasoning about the impact of turning specific options on or off.

The goal of this dataset is to facilitate advanced research on dependency patterns, feature centrality, and the evolution of configurability in complex systems by providing a curated collection of VMs transformed into a more expressive graph-based format.

2. Dataset Format

The repository is organized into three main folders: graphs, replication, and validation.

2.1 summary.csv

The root folder contains a summary.csv file that provides detailed metadata for each of the 5,709 variability models in the dataset. Each row corresponds to one model and includes columns such as:

-   Identifier: a unique string that identifies all files associated with a model (e.g., "LinuxX86__3-9__SystemsSoftware__Kuiter25__KMax").
-   NumNodes, NumDead, NumCore, NumExcludes, NumRequires: statistics on the number of nodes, dead options, core options, conflict edges, and dependency arcs.
-   Name: name of the model.
-   Version: version of the model.
-   Domain: application area of the variability model.
-   Bibliographic and repository information (BibliographicID, BibliographicDOI, Repository).
-   ToolToGetTheFormula: translator used to transform the variability model into a Boolean formula (e.g., "KMax").

The dataset was not generated directly from the original variability models (such as Kconfig files). Instead, the starting point for building the graphs was a collection of Boolean formulas that had been previously synthesized and published by other research teams. These formulas, which logically encode the constraints of all 5,709 configurable systems, were created by the following researchers using specific tools:

-   Sundermann et al., who used the TraVarT translator.
-   Kuiter et al., who used the KMax and KConfigReader translators.
-   Fernandez et al., who used the KconfigSampler tool.

All these pre-existing formulas are stored in the /replication/formulas folder and are written in the standard DIMACS CNF (Conjunctive Normal Form) format. The authors of this dataset then took this collection of formulas as the input to the process that computes the strong-relationship graphs.
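For example, the metadata in summary.csv can be loaded directly with base R. The following is a minimal sketch, not part of the released code; it assumes the repository root as the working directory, and the column names follow the description above:

```r
# Load the per-model metadata shipped in the repository root.
models <- read.csv("summary.csv", stringsAsFactors = FALSE)

# Inspect the available columns (Identifier, NumNodes, NumDead, NumCore, ...).
str(models)

# List the ten models with the most configuration options (nodes).
largest <- models[order(-models$NumNodes),
                  c("Identifier", "NumNodes", "NumRequires", "NumExcludes")]
head(largest, 10)
```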
2.2 /graphs Folder

This folder contains the dataset itself. For each of the 5,709 models, there are four files, distinguished by the model's unique identifier:

-   <identifier>__requires.net: a directed graph representing strong dependencies. An arc from node A to node B means that A strongly depends on B.
-   <identifier>__excludes.net: an undirected graph representing strong conflicts. An edge between nodes A and B means that they are mutually exclusive.
-   <identifier>__core.txt: a text file listing all core options, i.e., options that must be enabled in every valid configuration. The file is empty if there are none.
-   <identifier>__dead.txt: a text file listing all dead options, i.e., options that can never be enabled in any valid configuration. The file is empty if there are none.

The dependency and conflict graphs (.net files) are in the Pajek NET format, which is widely supported by graph analysis tools such as Gephi, igraph, and Pajek. These files are structured with a *Vertices section defining the nodes and an *Arcs (for dependency graphs) or *Edges (for conflict graphs) section defining the links. To simplify analysis and visualization, links involving core and dead options are omitted from the graphs; these options are listed in the separate .txt files instead.
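As an illustration of how the .net files can be consumed, the sketch below uses the igraph package for R; the identifier is the example quoted above from summary.csv, and the paths assume the repository root as the working directory:

```r
library(igraph)

# Example identifier taken from summary.csv; any other identifier works too.
id <- "LinuxX86__3-9__SystemsSoftware__Kuiter25__KMax"

# Strong dependencies: a directed Pajek graph (*Vertices + *Arcs).
requires <- read_graph(paste0("graphs/", id, "__requires.net"), format = "pajek")

# Strong conflicts: an undirected Pajek graph (*Vertices + *Edges).
excludes <- read_graph(paste0("graphs/", id, "__excludes.net"), format = "pajek")

# Options that many other options strongly depend on (highest in-degree).
head(sort(degree(requires, mode = "in"), decreasing = TRUE), 10)

# Number of connected components in the conflict graph.
components(excludes)$no
```

Because core and dead options are factored out into the .txt files, the loaded graphs contain only the remaining strong relationships between ordinary options.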
3. How to Replicate the Dataset

All code required to regenerate the dataset from the original Boolean formulas is provided and has been tested on Linux. The process starts from the formulas located in the /replication/formulas directory, which are encoded in the standard DIMACS CNF format.

To replicate the entire dataset:

1.  Navigate to the /replication/bin directory.
2.  Run the following command in your terminal:

        bash get_all_transitive_graphs.sh

This script internally calls the get_transitive_graph executable, which implements the algorithms that compute strong relationships from the Boolean formulas.

If you wish to compile the C++ source code from scratch:

1.  Navigate to the /replication/src directory.
2.  Run the command:

        bash makegtg.sh

This compiles the get_transitive_graph.cpp source code and creates the executable. The implementation uses the MiniSat 2.2.0 solver through the IPASIR interface to speed up computations.

4. How to Validate the Dataset

The repository includes scripts to validate the correctness of the generated dataset. The validation process checks two key aspects: (1) the lists of core and dead options, and (2) the dependency and conflict links for a sample of nodes in each graph. It uses an external tool, [MiniBones](https://github.com/MikolasJanota/minibones), to independently compute the backbone of the Boolean formulas for verification.

To run the validation:

1.  First, ensure that the R interpreter is installed.
2.  Navigate to the /validation directory.

You can then choose one of two validation options:

-   Validate a single model: to validate a specific model M by checking N random nodes, execute

        Rscript validate_transitive_graph.R -f M -c N

    where M is the model's DIMACS file path and N is the number of nodes to check.

-   Validate the entire dataset: to validate all models, run

        bash validate_all_transitive_graphs.sh

    By default, this script tests 1,000 nodes for each model; the number can be adjusted inside the script. The dataset was successfully tested with N = 1,000.
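Independently of the official validation scripts, a lightweight sanity check can cross-reference the core and dead listings against the NumCore and NumDead columns of summary.csv. The following R sketch is an informal illustration, not part of the released code; it assumes the repository root as the working directory and that each .txt file lists one option per line:

```r
# Informal consistency check: compare the number of entries in each model's
# __core.txt and __dead.txt files with the NumCore and NumDead columns
# reported in summary.csv.
models <- read.csv("summary.csv", stringsAsFactors = FALSE)

count_options <- function(path) {
  # Assumes the file exists (the dataset ships all four files per model)
  # and lists one option per line; empty files mean no such options.
  if (file.size(path) == 0) return(0L)
  length(readLines(path, warn = FALSE))
}

core_counts <- sapply(models$Identifier, function(id)
  count_options(file.path("graphs", paste0(id, "__core.txt"))))
dead_counts <- sapply(models$Identifier, function(id)
  count_options(file.path("graphs", paste0(id, "__dead.txt"))))

# Indices of any models whose listed options disagree with the summary metadata.
which(core_counts != models$NumCore | dead_counts != models$NumDead)
```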