INFORMATION TECHNOLOGIES APPLIED TO LIFE SCIENCES
This is the website of the Bioinformatics and Information Technologies Laboratory (BITLAB) group, part of the Computer Architecture Department at the University of Malaga (AC-UMA).
BITLAB is a multi-disciplinary research group that focuses on developing advanced-computing solutions for data management and data analysis problems in bioinformatics and biomedicine. By bringing together expertise in data analysis, medicine, and biology, BITLAB is able to deliver user-friendly platforms that bridge the gap between research and applied science.
THE MAIN RESEARCH LINES IN BITLAB ARE:
- 01 CLOUD COMPUTING
- 02 BUILDING PARALLEL APPLICATIONS
- 03 COMPARATIVE GENOMICS
- 04 DATA AND SERVICES INTEGRATION
- 05 KDD IN BIOLOGICAL AND CLINICAL DATA
- 06 SUPPORT FOR GENOMICS PROJECT MANAGEMENT
- 07 ANIMAL BEHAVIOUR BY SEMANTIC VIDEO ANALYSIS
- 08 DISTANCE E-LEARNING TOOLS
- 09 GENE-EXPRESSION DATA ANALYSIS
- 10 LASER-TISSUE INTERACTION MODELLING
- 11 SOFTWARE DEVELOPMENTS
A core competence of our research group is the efficient computation of scientific applications on top of cloud infrastructures and the problems this raises. Among the cloud service models with which we have experience, IaaS is the most suitable for this purpose owing to its broad range of configuration possibilities. In addition, our analyses comparing different PaaS and IaaS solutions suggest that IaaS offers better performance for HPC applications. In our research projects we have worked with public cloud providers such as Amazon, Windows Azure and IBM Smart Cloud Enterprise, and we also have extensive experience with private cloud solutions such as OpenStack.
Galaxy is a workflow management system that enables the definition and sharing of scientific workflows. Although it was originally developed for biological data, it is currently used by a number of public servers (https://wiki.galaxyproject.org/PublicGalaxyServers) across different research domains. One of its main objectives is to make workflow management easier for scientists who do not necessarily have computer programming experience.
The Terascale Open-source Resource QUEue manager (TORQUE) (http://www.clusterresources.com/products/torque/) is a popular and widely used open-source solution for managing high-throughput computing (HTC) clusters. From the usability point of view, both end-users and administrators are familiar with its job submission language and configuration, which are very similar to those of the Portable Batch System (PBS), making its adoption easier. From the computational point of view, TORQUE can handle large clusters with tens of thousands of nodes and jobs, and large jobs eventually spanning hundreds of thousands of processors. Besides its built-in scheduling algorithms, TORQUE can also communicate with external schedulers such as Maui (the one used in this work) and Moab through a clearly defined scheduling interface.
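As an illustration of TORQUE's PBS-style submission interface, the sketch below assembles a minimal job script programmatically. The queue name, resource values and command are invented placeholders, not taken from any particular deployment.

```python
# Sketch: building a PBS-style job script such as TORQUE accepts.
# Queue, resources and command below are illustrative placeholders.

def build_pbs_script(name, nodes, ppn, walltime, command, queue="batch"):
    """Return a TORQUE/PBS job script as a single string."""
    return "\n".join([
        "#!/bin/bash",
        f"#PBS -N {name}",                    # job name
        f"#PBS -q {queue}",                   # target queue
        f"#PBS -l nodes={nodes}:ppn={ppn}",   # nodes and processors per node
        f"#PBS -l walltime={walltime}",       # wall-clock limit
        "cd $PBS_O_WORKDIR",                  # run from the submission directory
        command,
    ]) + "\n"

script = build_pbs_script("blast_search", 4, 8, "01:30:00",
                          "mpirun ./search_db input.fasta")
print(script)
```

In practice the generated file would be handed to the batch system with `qsub`; the scheduler (e.g. Maui) then decides when and where it runs.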
Our work with the two tools mentioned above has focused on improving their data management, task scheduling and node allocation features to optimise a number of criteria, including waiting time in the queue, makespan, throughput and resource utilisation. A further goal has been to make the exploitation of the underlying infrastructure easier via user-friendly interfaces.
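The criteria just listed can be computed directly from job timestamps. The sketch below does so for a toy three-job trace; all timestamps are invented for illustration.

```python
# Sketch: the scheduling criteria mentioned above (waiting time, makespan,
# throughput) computed from a toy trace. Each job is a tuple of
# (submit, start, finish) timestamps in seconds; the values are invented.

def schedule_metrics(jobs):
    waits = [start - submit for submit, start, _ in jobs]
    makespan = max(f for _, _, f in jobs) - min(s for s, _, _ in jobs)
    return {
        "mean_wait": sum(waits) / len(waits),  # average time spent queued
        "makespan": makespan,                  # span from first submit to last finish
        "throughput": len(jobs) / makespan,    # completed jobs per second
    }

jobs = [(0, 0, 60), (0, 10, 40), (5, 40, 90)]  # (submit, start, finish)
m = schedule_metrics(jobs)
```

A scheduler or auto-scaling policy would aim to shrink `mean_wait` and `makespan` while keeping allocated nodes busy.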
- Torreño O., Ramet D., Karlsson T.J.M., Lago J., Bodenhofer U. and Trelles O. (2012); “Is Cloud Computing an attractive alternative for Bioinformatics?”; XI Jornadas de Bioinformática / XI Spanish Bioinformatics Workshop; Barcelona – España.
- Johan Karlsson, Oscar Torreño, Oswaldo Trelles (2012); “jORCA – Jumping to the Cloud”; XI Jornadas de Bioinformática / XI Spanish Bioinformatics Workshop; Barcelona – España.
- J. Lago; D. Ramet; O. Torreño, J. Karlsson; J.Falgueras; N. Chelbat; M. Krieger and O.Trelles (2012); “Mr.Cirrus: A Map-Reduce approach for High level Cloud Computing”; ISCB Latin America 2012 Conference on Bioinformatics; Santiago de Chile – Chile
- J. Karlsson, Daniel Ramet, O. Torreño, Günter Klambauer, M.Cano and O. Trelles (2012); “Enabling Large-Scale Bioinformatics Data Analysis with Cloud Computing”; International Workshop on Heterogeneous Architectures and Computing (10th IEEE International Symposium on Parallel and Distributed Processing with Applications); Leganes – Madrid.
- Tor Johan Mikael Karlsson, Óscar Torreño Tirado, Daniel Ramet, Juan Lago, Juan Falgueras Cano, Noura Chelbat, and Oswaldo Trelles (2013); “Running Legacy Bioinformatics Applications with Cloud Computing Resources”; IWANN 2013; Tenerife – España. Published: Part II, LNCS 7903, pp. 200–207, 2013. Springer-Verlag Berlin Heidelberg 2013.
- Oscar Torreno, Johan Karlsson, Paul Heinzlreiter and Oswaldo Trelles (2014); “Running workflows in the cloud”; Jornadas Sarteco (Jornadas de paralelismo 2014); Valladolid – España.
- Oscar Torreño and Oswaldo Trelles; “Auto-scaling strategy for OpenStack cloud resources managed by TORQUE”; Jornadas de Paralelismo 2015. (Jornadas Sarteco); Córdoba – España.
- Krieger, M. T., Torreno, O., Trelles, O., & Kranzlmüller, D. (2016). Building an open source cloud environment with auto-scaling resources for executing bioinformatics and biomedical workflows. Future Generation Computer Systems.
The broad spectrum of demanding applications within genomics (CPU-intensive, memory-hungry, storage-heavy and I/O-bound algorithms) presents a challenge for high-performance computing.
We drive new solutions that seemed unaffordable only a few years ago. These strategies use parallel computers to solve computationally expensive algorithms that range from computationally regular patterns, such as database searching applications, to very irregularly structured patterns, such as phylogenetic trees. Fine and coarse-grained parallel strategies are addressed for this very diverse set of applications. Different computer architectures are also used, ranging from networks of commodity multi-computers and more powerful workstations to super-computers. In order to avoid the most common sources of inefficiency in parallel computing, our approaches include dynamic load distribution, speculative computation, network-bandwidth optimisation, and intelligent task scheduling.
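Of the techniques listed above, dynamic load distribution is the easiest to illustrate in a few lines: workers pull the next work item as soon as they finish their current one, so uneven item costs do not leave CPUs idle. The sketch below uses a thread pool and a stand-in scoring function; `score` and the toy database are invented placeholders, not our actual kernels.

```python
# Sketch: dynamic load distribution over an uneven workload, in the spirit
# of the database-searching parallelisations described above. score() is a
# placeholder for a real alignment/scoring kernel.
from concurrent.futures import ThreadPoolExecutor

def score(seq):
    # stand-in for an expensive kernel; cost grows with sequence length
    return sum(ord(c) for c in seq) % 97

# toy "database": sequences of very different lengths (uneven work items)
database = ["ACGT" * n for n in (1, 50, 2, 40, 3, 30)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # the executor hands out items one by one as workers become free
    # (dynamic), rather than pre-assigning fixed static blocks
    results = list(pool.map(score, database))

best = max(range(len(database)), key=lambda i: results[i])
```

With static block partitioning, a worker stuck with several long sequences would finish last while the others sit idle; the pull-based assignment avoids exactly that inefficiency.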
- O.Trelles-Salazar, E.L.Zapata, J.M.Carazo; “On an efficient parallelization of exhaustive sequence comparison algorithms”; Computer Applications in BioSciences (1994) 10(5):509-511
- Ceron, C., Dopazo, J., Zapata, E.L., Carazo, J.M. and Trelles, O.; “Parallel Implementation for DNAml Program on Message-Passing Architectures”; Parallel Computing and Applications, vol 24 (5-6),June 98 pp.701-716
- Trelles O., Andrade M.A., Valencia A., Zapata E.L., and Carazo J.M.; “Computational Space Reduction and Parallelization of a new Clustering Approach for Large Groups of Sequences”; Bioinformatics vol.14 no.5 1998 (pp.439-451); (formerly CABIOS)
- Trelles O.; “On the parallelisation of bioinformatics applications”; Briefings in bioinformatics (May, 2001) vol.2 (2) pp. 181-194
- Trelles Oswaldo and Rodríguez Andrés; “Parallel Metaheuristics in Bioinformatics: A new class of algorithms”; Wiley Series on Parallel and Distributed Computing, Edited by: Enrique Alba, (ISBN-13-978-0-471-67806-9) pp: 517-549
Genome comparison is a classical problem with high memory and CPU-time requirements. State-of-the-art software faces limitations in the computational space that can be explored and in the amount of memory used during computation, which prevent the comparison of very large sequences. Our efforts have therefore concentrated on overcoming these limitations: we have redesigned the comparison process and applied HPC techniques in its implementation. The result is a program able to compare genomes without the limitations of the original software.
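The first stage of such a comparison is typically seed finding: hashing fixed-length words (k-mers) of one sequence and matching them against the other. The toy sketch below shows that stage only, with tiny in-memory strings; production tools stream hits to disk precisely to escape the memory limits discussed above.

```python
# Sketch of the seed-finding stage of a pairwise genome comparison:
# k-mers of sequence X are indexed, then looked up along sequence Y.
# Both sequences are tiny illustrative strings, not real genomes.
from collections import defaultdict

def seed_hits(seq_x, seq_y, k=8):
    index = defaultdict(list)               # k-mer -> positions in seq_x
    for i in range(len(seq_x) - k + 1):
        index[seq_x[i:i + k]].append(i)
    hits = []                               # (pos_x, pos_y) ungapped seeds
    for j in range(len(seq_y) - k + 1):
        for i in index.get(seq_y[j:j + k], ()):
            hits.append((i, j))
    return hits

x = "GATTACAGATTACACCGT"
y = "TTGATTACACCA"
hits = seed_hits(x, y, k=6)
```

Collinear runs of such hits would then be extended into the ungapped local alignments that the next paragraph takes as its starting point.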
Starting from the simple collection of ungapped local alignments produced by genome comparison procedures, we are also able to detect and identify blocks of large rearrangements, taking into account repeats, tandem repeats and duplications. To the best of our knowledge, ours is the first such approach available as a coherent workflow, outperforming current state-of-the-art software tools. In addition, we are able to classify the type of rearrangement, which is not possible with equivalent software. The results obtained are an important source of information for breakpoint refinement and characterisation, as well as for estimating the frequencies of evolutionary events used in inter-genome distance measures.
The field of metagenomics, defined as the direct genetic analysis of uncultured genomes contained within an environmental sample, is gaining increasing popularity. The aim of metagenomic studies is to determine the species present in an environmental community and to identify changes in the abundance of species under different conditions. Our work in this field has focused on developing new tools and datafile specifications that facilitate the identification of differences in the abundance of reads assigned to taxa (mapping), enable the detection of reads from low-abundance bacteria (producing evidence of their presence), provide new concepts for filtering spurious matches, etc. In addition, we propose innovative visualisation ideas for improved display of metagenomic diversity, to better understand how reads are mapped to taxa.
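The core abundance comparison described above can be sketched in a few lines: count reads per taxon, normalise to relative abundances, and take the difference between conditions. Taxon names and counts below are invented for illustration.

```python
# Sketch: comparing the relative abundance of taxa between two conditions,
# the kind of difference the metagenomic tools above report.
# Read assignments and taxon names are invented.
from collections import Counter

def relative_abundance(assignments):
    counts = Counter(assignments)            # taxon -> number of assigned reads
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}

# one taxon label per read, as produced by a taxonomic classifier
sample_a = ["E.coli"] * 70 + ["B.subtilis"] * 25 + ["rare_sp"] * 5
sample_b = ["E.coli"] * 40 + ["B.subtilis"] * 55 + ["rare_sp"] * 5

abund_a = relative_abundance(sample_a)
abund_b = relative_abundance(sample_b)
shift = {t: abund_b[t] - abund_a[t] for t in abund_a}
```

Note that `rare_sp` keeps a small but non-zero abundance in both samples; detecting and retaining such low-abundance evidence, rather than filtering it out as noise, is one of the goals stated above.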
- Óscar Torreño and Oswaldo Trelles. Breaking the computational barriers of pairwise genome comparison. BMC Bioinformatics, 2015, 16:250. DOI: 10.1186/s12859-015-0679-9.
- Arjona-Medina, J. A., & Trelles, O. (2015, November). Computational Synteny Block: A framework to identify evolutionary events. In Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference on (pp. 5-12). IEEE.
- Arjona-Medina, J. A., & Trelles, O. (2016). Refining borders of genome-rearrangements including repetitions. BMC genomics, 17(8), 433.
- Pérez-Wohlfeil, E., Arjona-Medina, J. A., Torreno, O., Ulzurrun, E., & Trelles, O. (2016). Computational workflow for the fine-grained analysis of metagenomic samples. BMC genomics, 17(8), 351.
The scientific knowledge needed to produce a more complete view of any biological process is scattered across various external, heterogeneous and geographically distributed databases. The heterogeneity in formats and storage media, as well as the diversity and dispersion of data, makes it difficult to use this plethora of interrelated information. As such, the integration of these information sources for unified access is a clear and important technological priority.
In this research area we focus on the design and deployment of integration software architectures based on metadata repositories and programmatic interfaces, providing features such as tool discovery, invocation and documentation, data persistence systems, etc. The resulting platform can access, link and query biological data sets easily and efficiently by integrating a large number of dispersed web-based services.
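The metadata-repository idea can be sketched minimally: each registry entry records what a tool consumes and produces, so a client can discover which services accept a given datatype without knowing them in advance. Service and datatype names below are hypothetical, not entries from any actual catalogue.

```python
# Sketch: service discovery over a metadata repository, the integration
# pattern described above. All service and datatype names are invented.

REGISTRY = [
    {"name": "runBlast",    "input": "FASTA",       "output": "BlastReport"},
    {"name": "runClustalW", "input": "FASTA",       "output": "Alignment"},
    {"name": "parseReport", "input": "BlastReport", "output": "HitTable"},
]

def discover(input_type):
    """Return the names of registered services able to consume a datatype."""
    return [s["name"] for s in REGISTRY if s["input"] == input_type]

tools = discover("FASTA")
```

Because outputs are typed too, the same metadata lets a client chain services whose output type matches the next service's input type, which is the basis for building workflows over such a registry.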
The general goal of Knowledge Discovery in Databases (KDD) technologies is the extraction and abstraction of patterns, perturbations, relationships or associations from the analysed data. Association rules, which disclose hidden co-occurrences of items as a set of antecedents and associated consequents, are one of the most useful KDD outcomes.
We have extensively applied KDD techniques to gene-expression experiments (DNA arrays). The experimental outputs are combined to form a consolidated table with the expression values of genes (rows) in the different experiments (columns). The main component of this table is the gene-expression matrix. However, each row frequently contains additional information about the particular properties of the gene (function, pathway, chromosome location, GO terms, etc.), known as gene-metadata. At the same time, the experiments or samples are described using different descriptors such as provenance, morphology and clinical information, known as sample-metadata.
In this research area, we develop new algorithms to extend the correlation beyond the expression values by taking into account both gene and sample metadata.
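A minimal version of such rule mining computes the support and confidence of a candidate rule over rows that mix annotations with discretised expression levels. The items and rows below are invented for illustration.

```python
# Sketch: support and confidence for an association rule over a toy table
# combining gene-metadata with discretised expression, as described above.
# Item names and rows are invented.

rows = [  # each set lists the items attached to one gene
    {"pathway:glycolysis", "GO:kinase", "expr:high"},
    {"pathway:glycolysis", "expr:high"},
    {"pathway:glycolysis", "GO:kinase", "expr:low"},
    {"GO:kinase", "expr:low"},
    {"pathway:glycolysis", "expr:high"},
]

def rule_stats(rows, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    n = len(rows)
    both = sum(1 for r in rows if antecedent <= r and consequent <= r)
    ante = sum(1 for r in rows if antecedent <= r)
    support = both / n
    confidence = both / ante if ante else 0.0
    return support, confidence

sup, conf = rule_stats(rows, {"pathway:glycolysis"}, {"expr:high"})
```

Here the rule "glycolysis genes are highly expressed" holds in 3 of 5 rows (support 0.6) and in 3 of the 4 glycolysis rows (confidence 0.75); real miners enumerate candidate rules and keep those above chosen thresholds.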
A high number of genomics projects are well underway, producing large quantities of comparative genomic data and leading to growing information-management problems. In this context, experimental data management becomes a time-consuming and error-prone task, due to the diversity of interrelated datatypes, the high volume of data produced in geographically dispersed laboratories, and the need for flexible and powerful means to query, combine, share, and protect the information. In this research area we have designed a generic, customisable and flexible genomics project management system named pUMA (projects at the UMA). pUMA has been used to manage several on-going interdisciplinary projects, including the Spanish Solanaceae project (ESP-SOL) and RIRAAF, amongst others.
In this area, we apply automatic analysis systems for semantic content recognition in videos of living organisms. The system is able to extract metadata from these videos describing the actions and behaviour of individual organisms (cells, bacteria, mice, etc.). Starting with image processing procedures to identify objects of interest, the system tracks the movements and actions of the organism in the video sequences, detecting and recording the relevant behavioural events.
The metadata obtained, which describe the behaviour of the organisms, are also organised in a database. The database search system enables the retrieval and visualisation of selected video sequences matching a given query-by-content criterion. The efficiency and accuracy of the presented system increase the analytical power available for uncovering and studying the effect of drugs on a variety of behavioural models.
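The event-metadata database and its query-by-content search can be sketched with an in-memory SQLite store. The schema, video names and event labels below are illustrative placeholders, not the system's actual ones.

```python
# Sketch: storing behavioural events extracted from video as metadata and
# answering a query-by-content request. Schema, file names and event
# labels are invented for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE event
               (video TEXT, organism TEXT, action TEXT,
                t_start REAL, t_end REAL)""")
con.executemany("INSERT INTO event VALUES (?,?,?,?,?)", [
    ("exp01.avi", "cell_3", "moving",   0.0, 4.2),
    ("exp01.avi", "cell_3", "tumbling", 4.2, 5.0),
    ("exp02.avi", "cell_7", "stopping", 1.1, 9.8),
])

# query-by-content: retrieve every sequence in which an organism tumbles
clips = con.execute(
    "SELECT video, t_start, t_end FROM event WHERE action = ?",
    ("tumbling",)).fetchall()
```

Each returned row identifies a time interval within a video, so the visualisation front-end only needs to seek to `t_start` and play until `t_end`.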
Keywords: Computer vision, video analysis, object recognition, tracking, medicine.
- Rodriguez, A.; Shotton,D.; Guil, N. and Trelles, O.; “Automatic analysis of the content of cell biological video and database organization of their metadata descriptors”; IEEE Transactions in Multimedia (February 2004) Vol 6, No. 1, pages 119-128
- Rodríguez, A.; Shotton,D.; Guil, N. and Trelles, O.; “Analysis and Description of the Semantic Content of Cell Biological Videos”; Multimedia Tools and Applications, 25, 37-58, 2005
- Manuel J. Martín-Vázquez; Mario A. Trelles; Alejandro Sola; R. Glen Calderhead and Oswaldo Trelles; (2006); “A new user-friendly software platform for systematic classification of skin lesions to aid in their diagnosis and prognosis”; Journal of Laser in Medical Science (2006) 21: 54–60
We work on the development of a controlled collaborative teaching environment specifically focused on emulating a real classroom. The main features this tool offers are: a) controlled access based on a central registry; b) a distributed architecture that allows simultaneous online classrooms; c) centralised control of student intercommunication; d) simulation of requests to speak (“raise hands”) in the classroom; e) optimised bandwidth usage that reduces unneeded transmission, offering fluid communication between participants; f) an easy and intuitive interface, even for very young children; g) a set of teaching resources such as a blackboard and the display of presentations.
Our group focuses on computational methodologies for the analysis and interpretation of large-scale expression data sets generated by DNA microarray experiments (see the PreP and double-Scan software at www.bitlab-es.com/prep and engene at www.bitlab-es.com/engenet). Moreover, our group has a long tradition of interacting with life-sciences teams, offering not only technological support for data management but also data-analysis support.
Image processing and intelligent systems are of vital importance in industrial and biomedical research, dermatological diagnostic applications and surgery, and post-surgical assessment.
In particular, an accurate and objective method of measuring skin or lesion state (e.g. coarseness, pore size, degree of wrinkling) and pigment changes, using measurable quantitative parameters, is required for clinical applications. The true value of skin or lesion “quality” can only be obtained in this manner, because visual inspection and grading is subjective, and is affected by several factors including: viewing geometry, ambient illumination, assessor’s experience and visual acuity, etc. Visual inspection and grading therefore lacks accuracy, and the ability to record and reproduce the measurements in a consistent manner.
The general approach of our developments in this area is to build a sampling catalogue based on both expert-assessed and automatically computer-assessed tissue quality. By “tissue quality” we refer mainly to colour and texture (coarseness, wrinkles, directionality, etc.). Based on this catalogue, differences in tissue quality can be assessed as a function of the differences in the catalogue positions of the samples (e.g. before and after treatment). As such, these differences can be used as an objective comparative measurement of the improvement obtained by treatment.
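The catalogue-position idea can be sketched as a nearest-neighbour lookup in feature space. The grades, the two-dimensional feature vectors (mean colour uniformity, texture coarseness) and all values below are invented for illustration, not measured data.

```python
# Sketch: placing a tissue sample in a feature-space catalogue and reading
# improvement off the change of catalogue position, as described above.
# Grades, features and values are invented.
import math

catalogue = {                   # grade -> reference feature vector
    "grade_1": (0.9, 0.1),      # smooth, evenly coloured
    "grade_2": (0.6, 0.4),
    "grade_3": (0.3, 0.8),      # coarse, heavily wrinkled
}

def nearest_grade(sample):
    """Assign a sample to the closest catalogue entry (Euclidean distance)."""
    return min(catalogue, key=lambda g: math.dist(catalogue[g], sample))

before = nearest_grade((0.35, 0.75))   # pre-treatment measurement
after = nearest_grade((0.55, 0.45))    # post-treatment measurement
```

Because both measurements are mapped into the same catalogue, the shift from one grade to another is an objective, reproducible record of the treatment effect, independent of any single assessor's judgement.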
In the following section, we present a brief overview of how the research in which BITLAB is involved has evolved over time.
In the early 90s, the group was working on what was at the time a typical task in a computer architecture department: the development of parallel compilers focused on the optimisation of sparse matrix computations. Moving in a different direction, the group became interested in input/output-bound applications: software with a low computational load over large sets of data. In particular, the emerging field of bioinformatics, very much in its infancy, piqued our interest.
The features of the diverse set of applications in this field appealed to us: large, heterogeneous, ever growing datasets typically stored in various locations worldwide.
Our initial research centred on the parallelisation of “regular” I/O applications on commodity multi-computers formed by linking together simple PCs and more powerful workstations. Database searching was an ideal fit, and allowed us to test our generic parallel approach, which included dynamic load distribution [1], speculative computation in phylogenetic tree construction [2], network-bandwidth optimisation, and intelligent task scheduling to minimise idle CPU power [3]. In short, ways to avoid the most common sources of inefficiency in parallel computing.
These developments provided us with an efficient parallel computational model to address more CPU demanding problems such as database searching in structural databases, in particular in the field of data-mining algorithms. A summary of the research conducted in this first epoch can be found in [4, 5].
At the time, searching for homologies and evolutionary relationships between sequences was the most frequently used strategy for assigning function to new sequences. However, for query sequences with no clear homologues in the sequence databases, this functional annotation process is unattainable. For these situations we proposed a strategy based on the identification of small fragments shared by the input sequence and several database sequences [6], using association rules [7] to correlate the fragments with biological annotations.
In addition, we became interested in video analysis, specifically the identification of basic animal behaviour. Image segmentation and tracking were used to characterise simple movements and to identify more complex events (tumbling, stopping, moving, etc.) [8, 9]. A complementary development in this area was our foray into the field of medical imaging, in particular the objective assessment of laser tissue treatments [10, 11]. Our collaborations with medical teams resulted in the development of a numerical model to simulate the effects of laser radiation in tissue [12].
Returning to bioinformatics, we shifted our focus to gene-expression data processing, with particular emphasis on double-colour experiments with cDNA biological material. Error removal and data quality [13] formed the core of our research. The main outcomes were applications designed and implemented in close collaboration with the end-user. An important contribution in the field was the development of a new procedure for extending the dynamic range of gene-expression values that removed quantization and saturation errors [14]. Clustering and classification procedures were implemented, thereby completing the analysis pipeline [15]. In addition, we introduced the concept of association rules for gene-expression data, combining expression with biological annotations [16].
At the beginning of the 21st century we were selected to develop the software integration architecture for the National Institute of Bioinformatics in Spain. In this work we addressed the design and implementation of an integration platform [17] and diverse versatile web clients [18] to access BioMOBY-compatible services: a system by which a client can interact with multiple sources of biological data, regardless of the underlying format or schema, using the service descriptions stored in the BioMOBY catalogue.
Currently we are working on the paradigm of scientific workflows as a crucial element in helping researchers use distributed and heterogeneous resources in a repeatable and well-defined manner. One of our first contributions in this area was a workflow platform that allowed the execution of workflows using the tools provided by the Spanish National Institute of Bioinformatics (INB). This platform was based on a flexible, lightweight architecture for publishing biological data and services, and was designed to use BioMOBY services, with separate persistent, private storage for each user. It was also able to handle long-running services by means of the asynchronous BioMOBY specification.
Moving with the times, we switched our focus to Galaxy, a widely used workflow platform. In particular, we concentrate our research on connecting the Galaxy workflow platform with Cloud Computing and High Performance Computing (HPC) resources [19]. The set of applications we are developing includes comparative genomics tools (e.g. sequence comparison and evolutionary events) [20, 21], biomedical applications (e.g. GWAS data processing) [22] and metagenomics (e.g. taxonomical classification) [23].
The group continues to be involved in both national and international collaborative projects. This includes the Spanish Allergy Network (RIRAAF), the pan-European ELIXIR-EXCELERATE bioinformatics capacity & training initiative, and the FP7 Industry-Academia Partnerships and Pathways (IAPP) project Mr.Symbiomath (http://www.mrsymbiomath.eu/) that we are coordinating.
[1] O.Trelles-Salazar, E.L.Zapata, J.M.Carazo; “On an efficient parallelization of exhaustive sequence comparison algorithms”; Computer Applications in BioSciences (1994) 10(5):509-511
[2] Ceron, C., Dopazo, J., Zapata, E.L., Carazo, J.M. and Trelles, O.; “Parallel Implementation for DNAml Program on Message-Passing Architectures”; Parallel Computing and Applications, vol. 24 (5-6), June 1998, pp. 701-716
[3] Trelles O., Andrade M.A., Valencia A., Zapata E.L., and Carazo J.M.; “Computational Space Reduction and Parallelization of a new Clustering Approach for Large Groups of Sequences”; Bioinformatics vol.14 no.5 1998 (pp.439-451); (formerly CABIOS)
[4] Trelles O.; “On the parallelisation of bioinformatics applications”; Briefings in Bioinformatics (May, 2001) vol.2 (2) pp. 181-194
[5] Trelles Oswaldo and Rodríguez Andrés; “Parallel Metaheuristics in Bioinformatics: A new class of algorithms”; Wiley Series on Parallel and Distributed Computing, Edited by: Enrique Alba, (ISBN-13-978-0-471-67806-9) pp: 517-549
[6] Perez A.J., Rodríguez, A., Trelles O., Thode G.; “A computational strategy for protein function assignment which addresses the multidomain problem”; Comparative and Functional Genomics, 2002, (3) pp. 423-440
[7] Rodriguez, A.; Carazo, J.M. and Trelles, O.; “Mining Association Rules from Biological Databases”; Journal of the American Society for Information Science and Technology 56(5):493–504, 2005
[8] Rodriguez, A.; Shotton, D.; Guil, N. and Trelles, O.; “Automatic analysis of the content of cell biological video and database organization of their metadata descriptors”; IEEE Transactions in Multimedia (February 2004) Vol 6, No. 1, pages 119-128
[9] Rodríguez, A.; Shotton, D.; Guil, N. and Trelles, O.; “Analysis and Description of the Semantic Content of Cell Biological Videos”; Multimedia Tools and Applications, 25, 37-58, 2005
[10] Manuel J. Martín-Vázquez; Mario A. Trelles; Alejandro Sola; R. Glen Calderhead and Oswaldo Trelles; (2006); “A new user-friendly software platform for systematic classification of skin lesions to aid in their diagnosis and prognosis”; Journal of Laser in Medical Science (2006) 21: 54–60
[11] M.A.Trelles, X.Alvarez, M.J.Martín-Vázquez, O.Trelles, M.Velez, J.L.Levy and I.Allones; “Assessment of the Efficacy of Nonablative Long-Pulsed 1064-nm Nd:YAG Laser Treatment of Wrinkles Compared at 2, 4 and 6 Months”; Facial Plastic Surgery; vol. 21, num. 2, 2005 (special issue: Lasers: New Technology and Emerging Trends)
[12] L.F. Romero, A. Rodríguez, A. Muñoz C., M.A.Trelles, E.L.Zapata and O.Trelles; (1996); “Efficient Computational Parallel Solutions for Laser/Tissue Interaction Modelling”; BIOS-Europe’96; Vienna, Austria
[13] Jorge García de la Nava, Sacha van Hijum and Oswaldo Trelles; “PreP: gene expression data pre-processing”; Bioinformatics 2003 Nov 22; 19 (17): 2328-2329
[14] Jorge García de la Nava, Sacha A.F.T. van Hijum and Oswaldo Trelles; “Saturation and quantization reduction in microarray experiments using two scans at different sensitivities”; Statistical Applications in Genetics and Molecular Biology; vol. 3, Issue 1, article 11. The Berkeley Electronic Press. ISSN: 1544-6115. (2004)
[15] Jorge García de la Nava, Daniel Franco Santaella, Jesús Cuenca Alba, José María Carazo, Oswaldo Trelles, Alberto Pascual-Montano; “Engene: The processing and exploratory analysis of gene expression data”; Bioinformatics vol.19 no.5 (2003) pp.657-658
[16] P. Carmona-Sáez, M. Chagoyen, A. Rodríguez, O. Trelles, J. M. Carazo and A. Pascual-Montano; “Integrated analysis of gene expression by association rules discovery”; BMC Bioinformatics 2006, 7:54
[17] J.F. Aldana, M. Roldán-Castro, I. Navas, M.M. Roldán-García, M. Hidalgo-Conde, and O.Trelles; “Bio-Broker: Integration of Biological Data Sources and Data Analysis Tools”; Software, Practice and Experience 2006; 36:1585-1604. Published Online: www.interscience.wiley.com
[18] Ismael Navas-Delgado, Maria del Mar Rojano-Muñoz, Sergio Ramírez, Antonio J. Pérez, Eduardo Andrés León, Jose F. Aldana-Montes, and Oswaldo Trelles; “Intelligent client for integrating bioinformatics services”; Bioinformatics, vol.22 no.1 2006 pages 106-111
[19] Krieger, M. T., Torreno, O., Trelles, O., & Kranzlmüller, D. (2016). “Building an open source cloud environment with auto-scaling resources for executing bioinformatics and biomedical workflows”. Future Generation Computer Systems.
[20] Torreno, O., & Trelles, O. (2015). “Breaking the computational barriers of pairwise genome comparison”. BMC Bioinformatics, 16(1), 1.
[21] J. A. Arjona-Medina and O. Trelles, “Computational Synteny Block: A Framework to Identify Evolutionary Events”, in IEEE Transactions on NanoBioscience, vol. 15, no. 4, pp. 343-353, June 2016. doi: 10.1109/TNB.2016.2554150
[22] Alex Upton, Oswaldo Trelles, José Antonio Cornejo-Garcia, James Richard Perkins. “High Performance Computing to Detect Epistasis in Genome Scale Datasets”. Briefings in Bioinformatics. 2015, Aug 13. PMID: 26272945.
[23] Perez-Wohlfeil, E., Arjona-Medina, J. A., Torreno, O., Ulzurrun, E., & Trelles, O. (2016). “Computational workflow for the fine-grained analysis of metagenomic samples”. BMC Genomics, 17(8), 802.