Complexity analysis of algorithms: A case study on bioinformatics tools

The volume of data currently produced by the omic sciences has been driven by the adoption of next-generation sequencing (NGS) platforms. Among the analyses performed on these data are mapping, genome assembly, genome annotation, pangenomic analysis, quality control, and redundancy removal, among others. Regarding redundancy removal, several tools perform this task with accuracy demonstrated in their scientific publications, but they lack an assessment of algorithmic complexity. Thus, this work aims to perform an empirical analysis of the algorithmic complexity of computational tools for removing redundancy from the raw reads produced by DNA sequencing. The analysis was performed on sixteen raw-read datasets, processed with the following tools: MarDRe, NGSReadsTreatment, ParDRe, FastUniq, and BioSeqZip, and analyzed on the R statistical platform through the GuessCompx package. The results show that BioSeqZip and ParDRe presented the lowest complexity in this analysis.

Execution time is a deterministic criterion in the analysis of the complexity of algorithms, because an efficient algorithm presents the smallest variation in execution time as the input grows towards infinity (Levitin, 2012).

OBJECTIVES: This work aims at an empirical analysis of algorithmic complexity, performed on computational tools developed to remove redundancy from raw reads produced by DNA sequencing, through the GuessCompx package (Agenis-Nevers et al., 2021), using the processing time of each tool as input.

MATERIALS AND METHODS: Tools and data source: The tools selected for this analysis were MarDRe (Expósito et al., 2017), ParDRe, FastUniq (Xu et al., 2012), NGSReadsTreatment (Gaia et al., 2019), and BioSeqZip (Urgese et al., 2020). They were chosen because they can handle NGS data independently of the sequencing platform and are freely available to the scientific community. Sixteen genome sequencing datasets obtained from NCBI, listed in Table 1, were used in this analysis. The analysis: To measure the total processing time of each tool, an in-house script was developed in the Python programming language, version 3.8. The open-source GuessCompx package was then used to empirically estimate the complexity of each tool from its total processing times. The estimate of algorithmic complexity was obtained with the glm function (https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/glm), available on the R statistical platform.
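As a rough illustration, a timing wrapper of this kind can be written in a few lines of Python. The sketch below is not the in-house script used in the study; the dataset names, tool command lines, and output file are placeholders that would have to be replaced by each tool's real invocation.

```python
import csv
import subprocess
import time

# Placeholder dataset list and command templates: these are NOT the real tools'
# command lines, only stand-ins to show the structure of the timing loop.
datasets = ["dataset_1.fastq", "dataset_2.fastq"]
tools = {
    "tool_A": "tool_A_dedup {infile} {outfile}",
    "tool_B": "tool_B_dedup {infile} {outfile}",
}

with open("processing_times.csv", "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["tool", "dataset", "seconds"])
    for tool_name, template in tools.items():
        for infile in datasets:
            cmd = template.format(infile=infile, outfile=infile + ".dedup")
            start = time.perf_counter()               # wall-clock start
            subprocess.run(cmd, shell=True, check=True)
            elapsed = time.perf_counter() - start     # total processing time (s)
            writer.writerow([tool_name, infile, round(elapsed, 2)])
```

The resulting table of input sizes and processing times is what is subsequently fed to the complexity estimation step.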

The glm function consists of a generalized linear model that fits the candidate complexity functions to the time values and input sizes, returning the function that indicates the complexity of each model. Through Big-O notation, these functions can be ordered by increasing asymptotic growth rate (Goodrich et al., 2014). Table 2 lists the complexity orders considered in this analysis, from the lowest to the highest asymptotic growth rate.

Table 2. Orders of algorithmic complexity.
O(1), constant: regardless of the size of the input dataset, the algorithm always runs in the same time (Goodrich et al., 2014).
O(log log n), double logarithmic: the order that divides the problem twice into smaller problems, processing a quarter of the data at each iteration (Cormen et al., 1992).
O(log n), logarithmic: the order that divides the problem into smaller problems, processing half of the data at each iteration (Levitin, 2012).
O(n), linear: the order in which running time increases linearly, in direct proportion to the size of the input dataset (Goodrich et al., 2014).
O(n log n), linearithmic: the problem is divided into smaller problems, which are solved independently and then merged (Goodrich et al., 2014).
O(n²), quadratic: performance grows proportionally to the square of the input dataset size (Goodrich et al., 2014).
O(n³), cubic: performance grows proportionally to the cube of the input dataset size (Goodrich et al., 2014).
O(n⁴), quartic (quadruple): performance grows proportionally to the fourth power of the input dataset size (Goodrich et al., 2014).
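To make the fitting idea concrete, the sketch below (illustrative only, not the GuessCompx implementation, and using invented measurements) fits each candidate order f(n) from Table 2 to a set of (input size, time) pairs by least squares and keeps the order with the smallest residual error.

```python
import numpy as np

# Hypothetical measurements: number of reads per dataset and processing time (s).
n = np.array([1e6, 2e6, 4e6, 8e6, 16e6, 32e6])
t = np.array([12.0, 14.1, 16.3, 18.2, 20.4, 22.5])

# Candidate complexity functions, from Table 2.
candidates = {
    "O(1)":         lambda n: np.ones_like(n),
    "O(log log n)": lambda n: np.log(np.log(n)),
    "O(log n)":     lambda n: np.log(n),
    "O(n)":         lambda n: n,
    "O(n log n)":   lambda n: n * np.log(n),
    "O(n^2)":       lambda n: n ** 2,
    "O(n^3)":       lambda n: n ** 3,
    "O(n^4)":       lambda n: n ** 4,
}

def residual(f):
    # Fit t = a*f(n) + b by least squares and return the residual sum of squares.
    X = np.column_stack([f(n), np.ones_like(n)])
    coef, *_ = np.linalg.lstsq(X, t, rcond=None)
    return np.sum((X @ coef - t) ** 2)

best = min(candidates, key=lambda name: residual(candidates[name]))
print("Best-fitting order:", best)
```

GuessCompx performs a more careful model fit and comparison than this sketch, but the underlying principle, selecting the growth function that best explains the measured times, is the same.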

RESULTS AND DISCUSSION:
After processing the datasets with each tool, the number of reads per dataset and the total processing time per tool, in seconds, were obtained, as shown in Table 3. These data were used as input to estimate the algorithmic complexity of each tool. Figure 1 shows the graph generated after sorting the datasets in ascending order of size, illustrating the behavior of each tool as the size of the datasets increases: the vertical axis shows the time each tool took to remove duplicate reads in each dataset, and the horizontal axis shows the size of the datasets. Figure 2 shows the result generated from these data by the GuessCompx package using the glm function, presenting the best-fitted model for the original input sizes and execution times and thus indicating the time complexity of each tool.

Table 3. List of datasets processed by each tool.

From the worst to the best time complexity, that is, from the highest to the lowest asymptotic growth rate, the NGSReadsTreatment tool presents a performance of order O(n log n). Next, MarDRe and FastUniq, which behaved similarly, obtained a growth of order O(log n). Finally, the best performances in this analysis were obtained by BioSeqZip and ParDRe, with O(log log n) complexity.

CONCLUSION: In this work, a complexity analysis was performed on five computational tools used to remove redundancy from raw reads produced by DNA sequencing, using input size and time values as parameters. It is important to emphasize that, although the complexity of algorithms is not a new subject, there is a lack of material within computing and mathematics addressing the functions related to the complexity of algorithms. Based on the results obtained, among the five chosen tools, BioSeqZip and ParDRe proved to be the most efficient on the datasets used in this analysis, presenting complexity of order O(log log n). The analysis of algorithmic complexity in computational tools applied to the omic sciences is therefore necessary: as the volume of data constantly increases, processing becomes more demanding, and the more predictable a tool is in terms of time cost, the more useful it becomes, serving as an evaluation criterion to help users choose the tool that best meets the needs of their research.
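To put the reported ordering in perspective, the short snippet below (purely illustrative, using arbitrary read counts rather than the study's datasets) evaluates log log n, log n, and n log n for increasing n, showing how slowly the double-logarithmic order grows compared with the others.

```python
import math

# Evaluate the three fitted growth functions for increasing input sizes.
for n in (1e6, 1e7, 1e8, 1e9):
    print(f"n={n:.0e}  log log n={math.log(math.log(n)):5.2f}  "
          f"log n={math.log(n):6.2f}  n log n={n * math.log(n):.2e}")
```

Over a thousand-fold increase in n, log log n changes by only a fraction of a unit, which is consistent with the near-flat scaling observed for BioSeqZip and ParDRe.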

CONFLICT OF INTERESTS:
The authors have no conflict of interest.

ACKNOWLEDGMENT: Thanks to the Brazilian Research Council (CNPq) and the Federal University of Pará; this work received support from PROPESP/UFPA. It is part of the research developed by BIOD (Bioinformatics, Omics, and Development research group, www.biod.ufpa.br). AAOV thanks the Federal University of Pará (UFPA), and PHCGS thanks PVTA341-2020 from the Federal Rural University of Amazonia (UFRA).