Description of the algorithm for creating chains of machine learning models
Purpose of the algorithm
Creation of a chain of machine learning (ML) models for solving applied tasks.
For each task to be solved, the modified genetic programming algorithm composes a chain of several ML models contained in a given set. A chain can contain two types of nodes:
Secondary nodes. Nodes that combine several models; their input features are formed from the predictions of the models merged by the node.
Primary nodes. Nodes representing independent first-level models, whose input features are taken directly from the task being solved.
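This data flow between the two node types can be sketched as follows (a minimal NumPy illustration with hypothetical stand-in models, not the framework's API):

```python
import numpy as np

# Hypothetical stand-ins for two primary-node models: each maps the raw
# input features of the task to a vector of predictions.
def primary_model_a(features):
    return features.mean(axis=1)

def primary_model_b(features):
    return features.max(axis=1)

raw_features = np.array([[1.0, 2.0], [3.0, 4.0]])

# Primary nodes predict directly from the task's input features.
pred_a = primary_model_a(raw_features)
pred_b = primary_model_b(raw_features)

# A secondary node receives the merged predictions as its input features:
# one row per sample, one column per merged model.
secondary_input = np.column_stack([pred_a, pred_b])
print(secondary_input.shape)  # (2, 2)
```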
Data is transferred between nodes along the chain structure: the predictions of lower-level models become the input features of the next model. The algorithm takes the following inputs:
1. Algorithm resources (composer_requirements)
- primary and secondary - two sets of models that can be used as Primary nodes and Secondary nodes, respectively, taken from the generated model repository.
- num_of_generations - the number of generations of the evolutionary algorithm.
- pop_size - the size of the tree population.
- max_depth - the maximum depth of a tree.
- max_arity - the maximum arity of a Secondary node (the maximum number of models that can be merged by a node).
- crossover_prob - crossover probability in the genetic algorithm.
- mutation_prob - mutation probability in the genetic algorithm.
2. Dataset of the task to be solved.
3. A quality metric for the task solution (taken from the existing quality metrics repository quality_metrics_repository.py).
For example, to apply the ROC AUC quality metric in the algorithm, initialize it as follows:
Metric function code
metric_function = MetricsRepository().metric_by_id(ClassificationMetricsEnum.ROCAUC)
The output is the tree of ML models found by the evolutionary algorithm, converted into a chain of ML models using the _tree_to_chain method.
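The conversion of a tree into a linear sequence of models can be sketched as a post-order traversal (an illustrative toy, not the actual _tree_to_chain implementation):

```python
# A toy tree: the root model combines the predictions of two leaves.
tree = {'model': 'xgboost',
        'children': [{'model': 'logit', 'children': []},
                     {'model': 'knn', 'children': []}]}

def tree_to_model_sequence(node):
    # children first, so every model appears after the models feeding it
    sequence = []
    for child in node['children']:
        sequence.extend(tree_to_model_sequence(child))
    sequence.append(node['model'])
    return sequence

print(tree_to_model_sequence(tree))  # ['logit', 'knn', 'xgboost']
```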
Description of the operation principle
Solutions in genetic programming are represented as trees. The trees may be n-ary, i.e. the number of child nodes of each tree node is not greater than n. Usually, tree nodes in genetic programming belong to one of two sets: terminal (T) or functional (F). The functional set contains the functions used to form a solution, while the terminal set contains constants and variables. Within this framework, Secondary nodes are the equivalent of functional nodes, and Primary nodes correspond to terminal ones.
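Such an n-ary tree can be sketched with a minimal node class (illustrative only; the framework uses its own node classes):

```python
class TreeNode:
    """A node of an n-ary solution tree: functional (secondary) nodes hold
    children, terminal (primary) nodes have none."""
    def __init__(self, model_name, children=None):
        self.model_name = model_name
        self.children = children or []   # empty for terminal nodes

    def depth(self):
        # a terminal node has depth 1; otherwise 1 + the deepest subtree
        if not self.children:
            return 1
        return 1 + max(child.depth() for child in self.children)

# A functional root combining two terminal nodes.
tree = TreeNode('xgboost', [TreeNode('logit'), TreeNode('knn')])
print(tree.depth())  # 2
```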
Let’s have a look at the detailed description of the algorithm:
The evolutionary process driving the genetic programming algorithm can be described as follows: a population of several chains is randomly initialized, and then covers the best regions of the search space through the random processes of tree selection, mutation, and recombination.
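This process can be sketched as a generic generational loop (a schematic with hypothetical operator functions supplied by the caller; population-size management and elitism are omitted):

```python
# Schematic generational loop: 'fitness', 'select', 'crossover' and 'mutate'
# are hypothetical operator functions, not the framework's implementations.
def evolve(population, fitness, select, crossover, mutate, n_generations):
    for _ in range(n_generations):
        fitnesses = [fitness(ind) for ind in population]
        parents = select(population, fitnesses)
        # recombine consecutive parent pairs, then mutate the offspring
        population = [mutate(crossover(p1, p2))
                      for p1, p2 in zip(parents[::2], parents[1::2])]
    return min(population, key=fitness)  # best individual (metric minimized)

# Toy usage with numbers as "individuals": fitness is the absolute value.
best = evolve([4, 2, 8, 6], fitness=abs,
              select=lambda pop, fit: sorted(pop),
              crossover=lambda a, b: (a + b) // 2,
              mutate=lambda x: x, n_generations=1)
print(best)  # 3
```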
Scheme of the genetic programming algorithm for model chaining:
In the first stage, a population of trees is generated. The growing method is used for initialization in genetic programming. With this method, primary nodes can be located at any depth except zero (the root always holds a functional node), but the depth of any node in a tree does not exceed d (a user-defined parameter).
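The growing initialization can be sketched as follows (the model names and the 0.5 early-termination probability are illustrative assumptions):

```python
import random

# Illustrative model sets (assumptions, not the framework's repository).
SECONDARY_MODELS = ['xgboost', 'rf']
PRIMARY_MODELS = ['logit', 'knn', 'lda']

def grow_tree(depth=0, max_depth=3, max_arity=2):
    at_limit = depth >= max_depth - 1
    # depth zero always holds a functional (secondary) node; deeper nodes
    # may terminate early, and must terminate at the depth limit
    if depth > 0 and (at_limit or random.random() < 0.5):
        return {'model': random.choice(PRIMARY_MODELS), 'children': []}
    n_children = random.randint(2, max_arity)
    return {'model': random.choice(SECONDARY_MODELS),
            'children': [grow_tree(depth + 1, max_depth, max_arity)
                         for _ in range(n_children)]}
```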
2. Calculation of the fitness of individuals
Based on the quality metric selected from the repository.
The developed implementation uses tournament selection of individuals, realized in the tournament_selection method. Under this strategy, a randomly chosen group of n individuals (n ≥ 2) from the population competes, and one individual is selected to participate in crossover as a parent. The fittest individual of the group is selected and the rest are discarded. The number n is called the tournament size; typical values are n = 2, n = 5 and n = 9. The advantages of this selection type are its simplicity and the absence of any need to rank individuals, which would require considerably more computational resources.
Selection is implemented as follows:
from random import randint

def tournament_selection(fitnesses, group_size=5):
    # indices of the individuals chosen as parents
    selected = []
    for _ in range(len(fitnesses)):
        # pick group_size random contenders from the population
        tournir = [randint(0, len(fitnesses) - 1) for _ in range(group_size)]
        fitness_obj_from_tour = [fitnesses[tournir[i]] for i in range(group_size)]
        # the contender with the best (lowest) fitness wins the tournament
        winner = tournir[fitness_obj_from_tour.index(min(fitness_obj_from_tour))]
        selected.append(winner)
    return selected
The standard crossover type, implemented in the standard_crossover method, was used. After the parents are chosen at the selection step, for each parent pair:
- a crossover point is chosen in each parent;
- the parents exchange the subtrees below the crossover point.
Crossover is implemented as follows:
import random
from copy import deepcopy
from random import randint, choice

def standard_crossover(tree1, tree2, max_depth, crossover_prob):
    # no crossover for identical parents, or with probability (1 - crossover_prob)
    if tree1 is tree2 or random.random() > crossover_prob:
        return deepcopy(tree1)
    tree1_copy = deepcopy(tree1)
    # choose a random layer (crossover point) in each parent
    rn_layer = randint(0, tree1_copy.get_depth_down() - 1)
    rn_self_layer = randint(0, tree2.get_depth_down() - 1)
    if rn_layer == 0 and rn_self_layer == 0:
        return deepcopy(tree2)
    changed_node = choice(tree1_copy.get_nodes_from_layer(rn_layer))
    node_for_change = choice(tree2.get_nodes_from_layer(rn_self_layer))
    # exchange subtrees only if the offspring would not exceed max_depth
    if changed_node.get_depth_up() + node_for_change.get_depth_down() - \
            node_for_change.get_depth_up() < max_depth + 1:
        # graft a copy of node_for_change in place of changed_node (omitted here)
        ...
    return tree1_copy
In the standard_mutation method, a point mutation scheme was implemented: a randomly selected node in the tree is changed to a randomly selected element of the same type. That is, a primary node is replaced by a randomly chosen member of the set of valid models for primary nodes, and a secondary node by a member of the set of valid models for secondary nodes.
Mutation is implemented as follows:
import random
from copy import deepcopy

def standard_mutation(root_node, secondary_requirements,
                      primary_requirements, probability=None):
    if not probability:
        probability = 1.0 / root_node.get_depth_down()

    def replace_node_to_random(node):
        if node.nodes_from:  # secondary node
            if random.random() < probability:
                node.eval_strategy.model = random.choice(secondary_requirements)
            for child in node.nodes_from:
                replace_node_to_random(child)
        elif random.random() < probability:  # primary node
            node.eval_strategy.model = random.choice(primary_requirements)

    result = deepcopy(root_node)
    replace_node_to_random(result)
    return result
How to use the algorithm for applications
First, you must define the model classes that you want to combine into chains. Their initialization and execution are delegated to the external software system in which the algorithm is embedded.
Binding to external functions is implemented through callback mechanics: an object or function is passed to the algorithm's optimizer as an external parameter and can, for example, track changes in the population.
At the end of each iteration of the algorithm, the function self.history_callback() is called and receives the required parameters (fitness function values, best/worst individuals, etc.). Any desired functionality - console output, visualization and so on - is then implemented in the callback. An example of this approach can be found in the Keras documentation.
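The callback mechanics can be sketched as follows (the class and parameter names here are illustrative, not the framework's API):

```python
# A minimal optimiser with a history callback invoked after each iteration.
class ToyOptimiser:
    def __init__(self, history_callback=None):
        self.history_callback = history_callback

    def optimise(self, n_iterations=3):
        best_fitness = 10.0
        for iteration in range(n_iterations):
            best_fitness -= 1.0  # placeholder for one evolutionary step
            if self.history_callback:
                # report the state after each iteration (fitness values, etc.)
                self.history_callback(iteration, best_fitness)

# The callback here just records history; it could equally print or plot it.
log = []
ToyOptimiser(history_callback=lambda it, fit: log.append((it, fit))).optimise()
print(log)  # [(0, 9.0), (1, 8.0), (2, 7.0)]
```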
Evolutionary operators are preferably kept in separate scripts: selection.py, crossover.py, mutation.py, etc.
Guide for the advanced developer
For practical application of GPComp, it is useful to create a GPComposer wrapper class. To abstract away from a particular implementation of chains and models within a framework, only the functions providing their creation are passed as arguments to the GPChainOptimiser class, which implements the GPComp algorithm.
def compose_chain(self, data: InputData, initial_chain: Optional[Chain],
                  composer_requirements: Optional[ComposerRequirements],
                  metrics: Optional[Callable]) -> Chain:
    # bind the chosen metric and data to the node-level evaluation function
    metric_function_for_nodes = partial(self._metric_for_nodes,
                                        metrics, data)
    # (other optimiser arguments are omitted here for brevity)
    optimiser = GPChainOptimiser(initial_chain=initial_chain,
                                 requirements=composer_requirements)
    best_found, history = optimiser.optimise(metric_function_for_nodes)
    best_chain = GPComposer._tree_to_chain(tree_root=best_found, data=data)
    return best_chain
Compositional requirements can be specified in a dedicated GPComposerRequirements class inherited from the general ComposerRequirements.
Composer requirements code
composer_requirements = GPComposerRequirements(
    primary=[LogRegression(), KNN(), LDA(), XGBoost(), DecisionTree(), RandomForest()],
    secondary=[LogRegression(), KNN(), LDA(), XGBoost(), DecisionTree(), RandomForest()],
    max_arity=3, max_depth=2, pop_size=20, num_of_generations=10)
The obtained chains are visualized using the networkx library:
Chain visualisation example code
graph, node_labels = _as_nx_graph(chain=deepcopy(chain))
pos = node_positions(graph.to_undirected())
nx.draw(graph, pos=pos, labels=node_labels, with_labels=True,
        font_size=12, font_family='calibri', font_weight='bold')
To evaluate the quality of the scoring model, the ROC AUC metric is used; its implementation can be obtained from the sklearn.metrics library.
Metric example code
def calculate_validation_metric(chain: Chain, dataset_to_validate: InputData) -> float:
    # the execution of the obtained composite model
    predicted = chain.predict(dataset_to_validate)
    # the quality assessment for the simulation results
    roc_auc_value = roc_auc(y_true=dataset_to_validate.target,
                            y_score=predicted.predict)
    return roc_auc_value
One should obtain input data from the file as follows:
Input data example code
data_frame = pd.read_csv(file_path)
data_frame = _convert_dtypes(data_frame=data_frame)
data_array = np.array(data_frame).T
idx = data_array[0]
features = data_array[1:-1].T
target = data_array[-1].astype(float)
return InputData(idx=idx, features=features, target=target)
The combination of predictions from the outputs of several models (to transmit them to the input of the next model in the chain) can be done as follows:
From prediction node example code
def from_predictions(outputs: List['OutputData'], target: np.array):
    # all outputs of the merged models share the same index
    idx = outputs[0].idx
    features = list()
    for elem in outputs:
        # the predictions of each model become one feature column
        features.append(elem.predict)
    return InputData(idx=idx, features=np.array(features).T, target=target)
One can use the Chain class to integrate multiple models into a chain. New nodes are added using the chain_add method. Use the NodeGenerator class methods to create a node with the specified model type.
Complex chain example code
complex_chain = Chain()
last_node = NodeGenerator.secondary_node(MLP())
y1 = NodeGenerator.primary_node(XGBoost(), dataset_to_train)
y2 = NodeGenerator.primary_node(LDA(), dataset_to_train)
y3 = NodeGenerator.secondary_node(RandomForest(), [y1, y2])
y4 = NodeGenerator.primary_node(KNN(), dataset_to_train)
y5 = NodeGenerator.primary_node(DecisionTree(), dataset_to_train)
y6 = NodeGenerator.secondary_node(QDA(), [y4, y5])
last_node.nodes_from = [y3, y6]