This proposal links two research domains in a coordinated, multi-disciplinary programme of analysis and development targeting metagenomic studies. Studies on the natural conservation and evolution of metagenomic samples will be carried out using the software prototypes developed by the teams. In particular, the proposed research includes the biological analysis of metagenomes, the fine-tuning and optimization of custom software for metagenomic comparisons, and the development of a scientific methodology to establish evolutionary and conservation relationships across metagenomic samples. The described methodology is expected to lay the basis for an automated, non-parametric workflow for discovering feature relationships in metagenomic samples as a function of their conservation levels.
Metagenomics is the science of studying genomic material from uncultured organisms. Samples can be gathered relatively quickly and inexpensively and produce huge datasets. However, metagenomes, as uncultivated samples, often include a certain degree of noise, suffer from high variability, and their composition can easily be biased by both over- and under-representation. These inherent characteristics make both the computational comparison of metagenomes and the biological analysis and interpretation of results extremely difficult. In this line, several tools have emerged to perform coarse-grained comparisons and extract big-picture information from metagenomes, such as COMMET [1] or MASH [2]. However, these algorithms rely heavily on parameters, which can strongly change the outcome of an experiment. Moreover, weakly conserved signals are difficult for these algorithms to detect.
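To illustrate the kind of coarse-grained, parameter-dependent estimate such tools compute, the following is a minimal MinHash sketch comparison in the spirit of MASH (a simplified illustration, not MASH's actual implementation; the function names and default parameters are assumptions for this sketch):

```python
import hashlib

def kmers(seq, k):
    """Yield all overlapping k-mers of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def sketch(seq, k=4, size=8):
    """Keep the `size` smallest k-mer hashes (a MinHash 'bottom sketch')."""
    hashes = {int(hashlib.sha1(km.encode()).hexdigest(), 16) for km in kmers(seq, k)}
    return set(sorted(hashes)[:size])

def jaccard_estimate(a, b, k=4, size=8):
    """Estimate the Jaccard similarity of two sequences from their sketches."""
    sa, sb = sketch(a, k, size), sketch(b, k, size)
    merged = set(sorted(sa | sb)[:size])      # bottom sketch of the union
    return len(merged & sa & sb) / len(merged)
```

The estimate depends directly on the choice of `k` and the sketch `size`, which is precisely the parameter sensitivity noted above: changing either can alter the measured similarity between the same pair of samples.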
To overcome such limitations, a tool for exhaustive pairwise metagenomic analysis, IMSAME (Incremental Multi-Stage Alignment of MEtagenomes) [3, 4], was developed at the University of Malaga (UMA). This tool offers a high level of sensitivity, which enables it to discover weakly conserved signals. Given the inherent properties of metagenomes, it becomes mandatory to use highly sensitive algorithms to minimize the number of undiscovered signals. Furthermore, these signals can provide additional information that could be used to trace different features (such as the historic migration of populations, phenotypes or health conditions) between metagenomic datasets as a function of their similarity level.
From a different perspective, the Laboratório Nacional de Computação Científica (LNCC) has implemented a pipeline for mapping metagenomic data onto the species present in a reference database (e.g. GenBank [6]) and comparing the abundance of species in each sample. The steps involved in the identification of species are (1) quality trimming of the sequenced reads, (2) identification and filtering of sequences derived from the host, and (3) alignment of the retained reads against a non-redundant database (using, e.g., BLASTX [5] and MEGAN [7]) to detect taxonomic and functional trends via a similarity-based approach, considering distinct hierarchy levels in the taxonomic and functional categorizations.
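As a rough sketch of step (2) alone, the host-filtering stage can be emulated with exact k-mer matching. Real pipelines align the reads against the host genome with a proper aligner; the helper names, the k-mer size and the threshold below are illustrative assumptions, not the LNCC pipeline itself:

```python
def kmer_set(seq, k=8):
    """All overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def filter_host_reads(reads, host_seq, k=8, max_shared=0):
    """Retain reads sharing at most `max_shared` k-mers with the host.

    Toy stand-in for step (2): reads resembling the host are discarded,
    the rest are retained for the downstream database alignment (step 3)."""
    host = kmer_set(host_seq, k)
    return [r for r in reads if len(kmer_set(r, k) & host) <= max_shared]
```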
By matching features to similarity levels, we expect to find underlying relationships between the conservation state of genomic regions of interest and their functional role in the organism, such as:
- Highly conserved regions should identify historical characteristics of the host, such as provenance or ancestry, which are the result of long-term genomic evolution.
- Weakly conserved regions should identify recent changes in the host, such as pathogens from a recent disease, or the microbial community found in types of food that are exotic with respect to the host.
Furthermore, between the highly and weakly conserved regions, other features are expected to be matched as a result of the analysis. Eventually, a detailed correlation between the conservation status of homologies and host changes will be established, helping the scientific community in several ways, such as:
- Reduction of computation times. By knowing the expected conservation level of the target queries, the search space can be limited to a smaller exploration range.
- Greater insight into the biology of samples, e.g. discovering the conservation status of features throughout different samples.
- Faster and lighter classification of metagenomes based on their distributions of conservation levels.
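The first point above can be sketched as a pre-filter: given an expected conservation band, a cheap identity proxy discards candidate pairs outside the band before any expensive alignment is attempted. This is a hypothetical illustration of the idea, not IMSAME's actual filtering strategy, and all names are assumptions:

```python
def quick_identity(a, b):
    """Cheap identity proxy: shared 4-mers over the smaller 4-mer set."""
    ka = {a[i:i + 4] for i in range(len(a) - 3)}
    kb = {b[i:i + 4] for i in range(len(b) - 3)}
    return len(ka & kb) / min(len(ka), len(kb))

def candidates_in_band(queries, targets, lo, hi):
    """Keep only the pairs whose rough identity falls inside the expected
    conservation band [lo, hi]; only these would go to full alignment."""
    return [(q, t) for q in queries for t in targets
            if lo <= quick_identity(q, t) <= hi]
```

By narrowing `[lo, hi]` to the conservation level of interest, most of the quadratic pairwise search space is skipped, which is where the reduction in computation time would come from.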
As a proof of concept, two metagenomic cases will be studied: (1) the well-known Turnbaugh [8] core gut microbiome experiment and (2) a dataset of nasopharyngeal metagenomic samples from children with severe respiratory diseases across the world. Considering that about thirty percent of pneumonia cases in children do not show any etiologic agent while, on the other hand, asymptomatic children carry a large portion of these microorganisms, the identification of the microbiota involved in the appearance of symptomatic pneumonia is a relevant public health issue in different countries. Ten metagenomic samples from children with severe respiratory diseases, aged 0 to 5 years, were sequenced at the Darcy Fontoura de Almeida Genomic Unit at the LNCC (4 from Cambodia, 2 from France, 2 from Mali and 2 from Brazil); these will be used for taxonomic identification and comparison. In the first experiment, we will attempt to match conservation levels to relationships between body mass index and family ties (such as siblinghood and motherhood). In the second experiment, conservation levels will be matched to origin and disease, correlating the epidemiology of viruses and bacteria with the virulence and the severity of the disease.
Careful and exhaustive biological examination of the results from the different datasets and computational strategies will take place. In particular, it will be mandatory to assess the biological correctness of the feature-matching procedure in order to generalize it for further experiments. That is, extensive testing will be carried out by performing all-vs-all pairwise comparisons between the conserved homologies and reference databases (such as gene, protein or genome databases). For this purpose, trends in the bacterial functional profile will be obtained using the COG [9] and SEED [10] databanks, indexed in the MEGAN tool, in addition to CAZy [11] and KEGG [12], available at LNCC. The functional role of the conserved homologies will be discussed and backed up by biological evidence and background.
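Once per-read functional category labels are available (e.g. COG or SEED assignments exported from MEGAN), the per-sample functional trends can be tallied and compared as sketched below. The helper names and the simple L1 distance are illustrative choices for this sketch, not the project's prescribed method:

```python
from collections import Counter

def functional_profile(annotations):
    """Relative abundance of functional categories from per-read labels."""
    counts = Counter(annotations)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

def profile_distance(p, q):
    """L1 distance between two relative-abundance profiles, scaled to [0, 1]."""
    cats = set(p) | set(q)
    return sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cats) / 2
```

Applied all-vs-all over the samples, such distances give a compact view of which metagenomes share a functional profile, complementing the homology-level comparisons.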
In this project, a software suite for the analysis of pairwise metagenomic comparisons will be developed, along with a scientific methodology to study the linkage between features and conservation levels in similarity. The software and the methodology will be backed up by biological evidence and reasoning.
More specifically, in this proposal we aim to complete the following measurable objectives:
- The study of the different signals contained in a set of metagenomes through a computational approach. That is, extracting features from all types of conserved signals (both weakly and highly conserved homologies) and inferring relationships between the samples based on such features. Expected outcome: per experiment performed, a list of features (e.g. diseases) across metagenomes together with their average level of similarity (traceability of the feature).
- The identification of a detailed correlation matching features to host changes. Expected outcome: a series of features will be linked by their similarity level to clusters (e.g. metagenomes belonging to hosts with a given condition will only group into clusters when a particular similarity level is used).
- Besides the analytical dimension of this proposal, additional optimizations, mechanisms and pipelines will be designed and developed to assist in the process of extracting the conserved homologies. Expected outcome: benchmarking with different datasets will show a reduction in CPU time and memory usage with respect to other software.
- The transfer of knowledge in the field and the opening of new collaborations in metagenomic analysis. Expected outcome: scientific publications and related works derived from the generated results.
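The clustering behaviour anticipated in the second objective can be illustrated with a simple single-linkage grouping at a similarity threshold; the sample names, the similarity table and the union-find helper below are hypothetical:

```python
def clusters_at_threshold(names, sim, t):
    """Group samples whose pairwise similarity >= t into connected
    components (single-linkage clustering at threshold t).

    `sim` maps ordered pairs (a, b), with a before b in `names`,
    to a similarity in [0, 1]."""
    parent = {n: n for n in names}

    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if sim[(a, b)] >= t:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())
```

At a high threshold the samples sharing a condition separate from the rest, while lowering the threshold merges everything into one cluster, which is exactly the similarity-level dependence the objective describes.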
The project duration will be 6 months. The following sections describe the secondments and working plan of the project, and the different tasks carried out by both collaborators.
The global research approach (see Figure 1) follows well-known concepts in defining and developing software architectures and is strongly connected to the objectives described in the preceding subsection. In the first task (T1), a requirements analysis is carried out. The requirements are analyzed from a double perspective: the software modules to be developed and the datasets used to validate the procedures, which are derived especially from the metagenomic use cases. These requirements are the input for the data acquisition process (T2) and the subsequent software modeling and design steps (T3). In this line, new software libraries will be developed on top of the existing prototypes to solve the computational bottleneck frequently present when dealing with metagenomic data, particularly given the exhaustiveness of the presented scenario. A continuous circle of development, feedback and improvement will be active during the life cycle of these core tasks. Besides this continuous process of improvement, formal modeling and verification (T4) will take place. T4 includes a full testing environment that makes use of workflow concepts to enable repeatable and measurable outcomes. Biological analysis will be used to back up and support hypotheses on the evidence found. T5 is centered on the dissemination of results and the production of publishing material. A continuous coordination task (T6) is carried out throughout the life cycle of the project to perform knowledge transfer across both research groups.
Figure 1. Scheme illustrating the general organization of the working plan. Black ellipses identify the task number, whereas the blue ones indicate the number of months assigned to the work. Rectangles display the project partner. The project is planned for 6 months, and most of the tasks overlap in their scheduled time and will therefore be carried out in parallel.
Figure 2. Gantt chart of the project. Cyan colored fortnights represent secondments. Green squares represent active tasks during the given time. Yellow squares are used to indicate a continuously active task.
More specifically, the tasks comprise:
- Preliminary analysis: Definition of the data requirements needed to meet the criteria for the expected results. Definition of the software system, inputs and outputs, pipelines and processing methods.
- Data procedures: Survey the state of the art in metagenomic data, gather data and perform suitability tests, exploratory analysis and cleaning of outliers. Study data availability and authorship. Define the data parameters and constraints of the experiment.
- Design and development: Design the software modules and components involved in the pipeline. Design the automated procedures for repeatability and reproducibility of the executions. Develop the software prototypes to meet desired outputs and computational specifications (e.g. runtimes and resources usage).
- Integration and validation: Consolidate the software modules into a curated suite. Perform biological analysis on the outcomes and evaluate the results with respect to computational demands and biological expectations. Custom fine-tuning and optimization of the different software pieces to accelerate the processing steps.
- Dissemination of results: Index and summarize a scientific report of the performed experiments. Develop a step-by-step methodology for the processing of metagenomes to extract the proposed outcomes. Write a scientific article and submit it to a high-quality journal.
- Training and knowledge transfer. Continuously impart and receive feedback and training on the different parts involved in the project (i.e. biological analysis and software manipulation and design) in a bidirectional fashion.
A competitive collaboration with extensive know-how and expertise in the addressed domains will be created: the University of Malaga (UMA, Spain), whose research is focused on High Performance Computing applied to Comparative Genomics and Metagenomics, will provide the necessary computational infrastructure, applications and developments to make the highly demanding computational processes possible; and the Laboratório Nacional de Computação Científica (LNCC, Brazil), whose expertise in Microbiology, Computational Biology and Metagenomics will provide the necessary background to successfully analyze and study the biological implications and aspects of the research proposal.
The collaboration combines the application domain of Biology (LNCC) with the technical research fields of computer architecture and High Performance Computing (UMA). Both research groups have a long and consolidated history of previous successful collaborative developments, such as [13, 14, 15].
Furthermore, the groups have a long history of collaboration in international projects, such as “High Performance, Cloud and Symbolic Computing in Big-Data problems applied to mathematical modeling of Comparative Genomics” [16], “Research, Infrastructure and Training in High Performance and Cloud Computing applied to Next Generation Sequencing and Metagenomics data analysis” [17] and “Aplicações biomédicas em plataformas computacionais de alto desempenho” [18].
The exchange of personnel will be carried out by Prof. Trelles and Prof. Ribeiro de Vasconcelos in two secondments, at the University of Malaga and at the Laboratório Nacional de Computação Científica, which are expected to take place during 10th January – 6th March and 1st May – 15th June, respectively.
PROFILE OF THE RESEARCHERS
Oswaldo Trelles, PhD, is a full Professor at the Computer Architecture Department of the University of Malaga, Spain, and a visiting professor at the Johannes Kepler University of Linz, Austria. He obtained his PhD degree in Industrial Engineering at the University of Malaga, an M.Sc. degree in Computer Science at the Universidad Politécnica de Madrid, and an M.Sc. degree in Cellular and Molecular Biology, also in Malaga. His main research interests include parallel paradigms for high performance computing on different architectures, from distributed to shared memory, Cloud Computing and multicore parallel computers; data mining and automatic knowledge discovery in the fields of biological and biomedical research; and the design of software platforms for the integration of data and services targeting Big-Data problems. His group provides a top-level multidisciplinary scientific environment with long experience of interfacing with biologists and medical researchers.
Ana Tereza Vasconcelos, PhD, is a full researcher at the National Laboratory for Scientific Computation (LNCC). She obtained her PhD degree at the University of Texas MD Anderson Cancer Center, United States, a Master's degree in Biophysics at the Universidade Federal do Rio de Janeiro (UFRJ) and a Doctorate in Genetics at UFRJ, Rio de Janeiro, Brazil. She founded the LABINFO (Bioinformatics Laboratory) in 2000 under the auspices of the Program of Biotechnology and Genetic Resources of the Brazilian Ministry of Science, Technology and Innovation, and in 2008 the ‘Darcy Fontoura de Almeida’ Computational Genomics Unit (Unidade de Genômica Computacional Darcy Fontoura de Almeida – UGCDFA), which harbors a high-throughput sequencing facility. She is also coordinator of the International Associated Laboratory (CNRS) in the field of Bioinformatics. She has experience in Bioinformatics and Computational Biology, working mainly on the following subjects: genomics, the development of mathematical methods and computational tools applied to genomics, and the annotation of genomes. She has expertise in processing and analyzing nucleotide sequences (genomes, metagenomes, exomes and cDNA) generated from large-scale sequencing in Brazil and abroad, using computational programs and databases, several of which were developed by her group. In particular, she has long-standing experience in the field of metagenomics (see [19, 20, 21, 22]). She has published more than 140 international journal papers, 4 book chapters and 3 patents.
REFERENCES
[1] Maillet, Nicolas, et al. “COMMET: comparing and combining multiple metagenomic datasets.” Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference on. IEEE, 2014.
[2] Ondov, Brian D., et al. “Mash: fast genome and metagenome distance estimation using MinHash.” Genome Biology 17.1 (2016): 132.
[3] Pérez-Wohlfeil, Esteban, Oscar Torreno, and Oswaldo Trelles. “Pairwise and incremental multi-stage alignment of metagenomes: a new proposal.” International Conference on Bioinformatics and Biomedical Engineering. Springer, Cham, 2017.
[4] Pérez-Wohlfeil, Esteban, Oscar Torreno, and Oswaldo Trelles. “Accelerating exhaustive pairwise metagenomic comparisons.” International Conference on Algorithms and Architectures for Parallel Processing. Springer, Cham, 2017.
[5] Altschul, Stephen F., et al. “Basic local alignment search tool.” Journal of Molecular Biology 215.3 (1990): 403-410.
[6] Bilofsky, Howard S., and Christian Burks. “The GenBank® genetic sequence data bank.” Nucleic Acids Research 16.5 (1988): 1861-1863.
[7] Huson, Daniel H., et al. “MEGAN analysis of metagenomic data.” Genome Research 17.3 (2007): 377-386.
[8] Turnbaugh, Peter J., et al. “A core gut microbiome in obese and lean twins.” Nature 457.7228 (2009): 480-484.
[9] Tatusov, Roman L., et al. “The COG database: an updated version includes eukaryotes.” BMC Bioinformatics 4.1 (2003): 41.
[10] Overbeek, Ross, et al. “The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes.” Nucleic Acids Research 33.17 (2005): 5691-5702.
[11] Cantarel, Brandi L., et al. “The Carbohydrate-Active EnZymes database (CAZy): an expert resource for glycogenomics.” Nucleic Acids Research 37.suppl_1 (2008): D233-D238.
[12] Kanehisa, Minoru. “The KEGG database.” ‘In Silico’ Simulation of Biological Processes (Novartis Foundation Symposium 247). 2002. 91-103.
[13] Siqueira, Franciele Maboni, et al. “Microbiome overview in swine lungs.” PLoS ONE 12.7 (2017): e0181503.
[14] Vilasbôas, Fabrício, et al. “Análise da execução do algoritmo CFRK.”
[15] Vilasbôas, Fabrício, et al. “Método computacional baseado em workflow para contabilização da frequência de repetição de k-mers.”
[16] High Performance, Cloud and Symbolic Computing in Big-Data problems applied to mathematical modeling of Comparative Genomics. Funded by the Industry-Academia Partnerships and Pathways (IAPP) – Marie Curie Programme, EU (2.616.332,63 Euros). Code: 324554. February 2013 to January 2017. PI: Dr. Oswaldo Trelles (project coordinator).
[17] Research, Infrastructure and Training in High Performance and Cloud Computing applied to Next Generation Sequencing and Metagenomics data analysis. Funded by Programa Ciência sem Fronteiras, Projetos MEC/MCTI/CAPES/CNPq/FAPs Nº 61/2011 (244.200,00 Reais / 20.740,00 US$). October 2013 to December 2016. PI: Dr. Oswaldo Trelles (project coordinator).
[18] Aplicações biomédicas em plataformas computacionais de alto desempenho. Funded by the Programa Hispano-Brasileño de Cooperación Universitaria, Organização de Seminários, Workshops e outras atividades binacionais (7.650 Euros + 9.906,60 Reais). October 2013. PI: Dr. Oswaldo Trelles (project coordinator).
[19] Guerra, Alaine B., et al. “Metagenome enrichment approach used for selection of oil-degrading bacteria consortia for drill cutting residue bioremediation.” Environmental Pollution 235 (2018): 869-880.
[20] Babujia, Letícia Carlos, et al. “Impact of long-term cropping of glyphosate-resistant transgenic soybean [Glycine max (L.) ] on soil microbiome.” Transgenic Research 25.4 (2016): 425-440.
[21] Tavares, Tallita Cruz Lopes, et al. “Metagenomic analysis of sediments under seaports influence in the Equatorial Atlantic Ocean.” Science of the Total Environment 557 (2016): 888-900.
[22] Pacchioni, Ralfo G., et al. “Taxonomic and functional profiles of soil samples from Atlantic forest and Caatinga biomes in northeastern Brazil.” MicrobiologyOpen 3.3 (2014): 299-315.