Integration of Bioinformatics to molecular research in forest species: the case of Holm oak (Quercus ilex)

Guerrero-Sánchez, Victor M.

dc.contributor.advisor	Jorrín-Novo, Jesús V.
dc.contributor.advisor	Valledor, Luis
dc.contributor.author	Guerrero-Sánchez, Victor M.
dc.date.accessioned	2020-05-12T11:29:44Z
dc.date.available	2020-05-12T11:29:44Z
dc.date.issued	2020
dc.identifier.uri	http://hdl.handle.net/10396/19947
dc.description.abstract	The term Bioinformatics, coined by Paulien Hogeweg and Ben Hesper in 1970 to describe “the study of informatic processes in biotic systems”, can be defined as “research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioural or health data, including those to acquire, represent, describe, store, analyze, or visualize such data” or as “the development and application of data-analytical and theoretical methods, mathematical modelling, and computational simulation techniques to the study of biological, behavioural, and social systems”. The first definition deals with biological information management, and the second with computational biology. The general objective and methodology of the current Thesis, “Integration of Bioinformatics to molecular research in forest species: the case of Holm oak (Quercus ilex)”, fall under the first definition. Bioinformatic tools (algorithms, programs, databases and repositories) have been used to construct the transcriptome, proteome and metabolome of Holm oak, and to integrate them in order to define the metabolism and the responses to drought in this species. Since the end of the last century, biological research has moved from a reductionist to a holistic paradigm, which has been possible thanks to great technological advances, especially in molecular biology. Thus, the appearance of platforms based on Next Generation Sequencing (NGS), for genomics and transcriptomics, and Mass Spectrometry (MS), for proteomics and metabolomics, has made it possible to obtain from hundreds to thousands of data points in a single experiment, whose management and analysis is impossible without informatics tools. The use of high-throughput techniques and their combination with classic approaches is what defines “Systems Biology”. 
Systems Biology involves not only the analysis of thousands of molecular entities of an individual, but also their integration and the creation of predictive models. This is quite feasible with model organisms (e.g. Arabidopsis), but it is a real challenge for orphan and recalcitrant experimental systems such as Q. ilex. The study of this species is justified by its environmental and economic importance in Spain and because it faces increasing tree mortality associated with the decline syndrome, a situation that can worsen in a climate change scenario. Biotechnology can contribute to solving this problem through breeding programs based on marker-assisted selection of elite genotypes that are more tolerant and resistant to biotic and abiotic stresses and more resilient to climate change. As a continuation of the work carried out since 2004 by the research group “Agroforestry and Plant Biochemistry, Proteomics, and Systems Biology”, mostly based on classic biochemistry, physiology and proteomics, and considering that the Holm oak genome has not yet been sequenced and that no DNA or protein sequences are available in public databases, the first objective of the Thesis was the construction of the first reference transcriptome for this species. The work is presented in chapter 3 and has been published in Frontiers in Molecular Biosciences. For that purpose, mRNA extracted from homogenized tissue from acorn embryo, leaves, and roots was sequenced on an Illumina HiSeq 2500 platform. Three different assemblers were employed: TRINITY, RAY, and MIRA. The assemblies obtained were aligned against the most accurate and phylogenetically closest transcriptome currently available, that of Quercus robur and Quercus petraea. MIRA generated more and longer contigs than RAY and TRINITY (MIRA>RAY>TRINITY). The MIRA assembly was therefore used for the annotation of the Q. 
ilex transcriptome, which yielded 31973 sequences annotated with Blast2GO using Swiss-Prot as the reference database. As a continuation of the previous work, and as a second objective, a new sequencing platform, Ion Torrent, was evaluated for the construction and analysis of the Q. ilex transcriptome. The results are presented in chapter 4 and have already been published in PLoS ONE. Raw sequence reads obtained from Illumina and Ion Torrent were assembled with three different programs: MIRA, RAY and TRINITY. A hybrid transcriptome combining reads from both sequencing technologies was also assembled using RAY. The hybrid assembly generated the most complete transcriptome. The MIRA assembly of Ion Torrent reads showed the highest percentage of sequences shared with the oak transcriptome (84.8%). In addition, an in silico proteomic analysis was carried out using the translated assemblies as databases. The Ion Torrent assemblies yielded more proteins than the Illumina and hybrid assemblies. All the assembled transcripts from the hybrid transcriptome were annotated and grouped according to the corresponding biological processes, molecular functions and cellular components (Gene Ontology). This newly generated transcriptome is a valuable tool for conducting differential gene expression studies in response to biotic and abiotic stresses and for assisting and validating the ongoing Q. ilex whole genome sequencing. Using the above-mentioned plant sample, the transcriptomic (NGS-Illumina), proteomic (shotgun LC-MS/MS, Orbitrap), and metabolomic (GC-MS) profiles were analysed. Results are presented in chapter 5 and have already been published in Frontiers in Plant Science. The annotated Q. ilex transcriptome was compared against the complete in silico proteomes of Arabidopsis thaliana (UP000006548), Oryza sativa subsp. 
japonica (UP000059680), Populus trichocarpa (UP000006729), and Eucalyptus grandis (UP000030711) in order to identify the unique and shared sequences. The EC numbers of each proteome were also compared to obtain a complete picture of the differences in metabolic pathway coverage among the proteomes of these species. The descriptive analysis, and the gene-by-gene visualization of the data on schematic diagrams (maps) of the biological processes described in Mapman, resulted in the identification of 62629 transcripts, 2380 protein species, and 62 metabolites. Data were compared with those reported for model plant species whose genomes have been sequenced and well annotated, including Arabidopsis, japonica rice, poplar, and eucalyptus. The integration of this large amount of data using bioinformatics tools allowed the Holm oak metabolic network to be partially reconstructed. Of the 127 metabolic pathways reported in the KEGG pathway database, 123 can be visualized with the described methodology. They include carbohydrate and energy metabolism, amino acid metabolism, lipid metabolism, nucleotide metabolism, and biosynthesis of secondary metabolites. The TCA cycle was the best represented pathway, with 5 out of 10 metabolites, 6 out of 8 protein enzymes, and 8 out of 8 enzyme transcripts. On the other hand, the gaps (missing pathways) included the metabolism of terpenoids and polyketides and lipid metabolism. The multi-omics resource generated in this work will set the basis for ongoing and future studies, bringing the Holm oak closer to model species. As a final objective of the current Thesis, an integrated transcriptomics and proteomics analysis of the response to drought in Q. ilex seedlings was carried out. Seedlings were subjected to drought by withholding water, and leaf tissue was sampled at two times during the experiment, 20 and 25 days. 
RNA and proteins were extracted and analysed by RNA-seq (Illumina) and shotgun proteomics (LC-MS/MS, Orbitrap). Data are presented in chapter 6, which corresponds to a manuscript to be submitted for publication. Gene products were identified and quantified at the transcript and protein levels, and correlations were established between transcript abundance and the abundance of the corresponding protein. Gene Ontology (GO) analysis was performed to classify the identified transcripts and proteins in terms of biological process, molecular function and cellular component. A multivariate analysis of the total and variable datasets at the transcript and protein levels was performed with mixOmics. To obtain an integrated visualization of Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway maps, the total transcript and protein datasets, with the variable transcripts and proteins specified, were analysed with Paintomics 3 (v0.4.5), using Arabidopsis thaliana as the model reference. Pathways with a p-value < 0.05 were considered significant. Interaction networks were constructed using the GeneMANIA plugin for Cytoscape (v3.4.0). The interaction networks included were prediction, co-expression, co-localization, and shared protein domains. This software also finds functionally similar genes that are not in the input gene list. RNA-seq analysis generated 47868 transcripts corresponding to 21000 unigenes, with 3588 qualitative or quantitative differences between irrigated and drought-stressed seedlings (1149 up and 2439 down). Shotgun proteomics identified 4008 protein species, corresponding to 2767 different genes. Of these, 640 showed qualitative or quantitative differences in abundance between treatments (353 more and 287 less abundant under drought conditions). A wide reorganization of gene expression, with up- and down-regulation, was observed at the two omics levels; it was either transitory (observed at 20 or 25 days) or permanent (observed at 20 and 25 days). 
The functional groups whose genes were most altered in response to drought were “stress-related” and “chloroplasts”. The most affected metabolic pathways included protein translation, photosynthesis, carbohydrates, amino acids and phenolics. Variable gene products were observed at either the transcriptomic or the proteomic level, with only a small number detected at both levels. These included, for example, RPS2, 4CL2, PSB28, and RIN4. From the variable transcript and protein datasets, two networks were constructed. The first included the up-accumulated CLPB2, CLPB3, HSP70, HSP17.4, FtsH6, AT1G23740, SMT1, and UGP3 genes and the down-accumulated ABA2, RPS1, ADK, and RPL4 genes; the second included the up-accumulated CLPB2, CLPB3, HSP70, HSP17.4, FtsH6, AT1G23740, AP1, INVE, AT4G2740, CAD4, FEN1, and HIPP27 genes and the down-accumulated ABA2 gene. From a biological point of view, and in terms of stress response and tolerance, Q. ilex seedlings were characterized by an increase in general abiotic stress-related gene products, including CLPB2, CLPB3, FTSH6 and PSB28. These variable gene products, overexpressed under drought conditions, can be proposed as molecular markers of response and tolerance to drought stress. As a general conclusion, it is necessary to emphasize that without bioinformatics it would be impossible to analyse the huge amount of data generated by omics approaches but also, and more importantly, that manual evaluation and validation of the results is needed before translating them to the biological context, thus avoiding much speculation. Living organisms are much more complex than we realize, and improvements in wet-lab and in silico analysis will be necessary to deepen our knowledge, to shed some light on their biology, to understand the mechanisms connecting genotype and phenotype, and to identify the gene and gene product interactions linked to different biological processes such as the plant response to stresses. 
This knowledge will allow better progress in speed breeding programs and biotechnology-related approaches.	es_ES
dc.description.abstract	The term bioinformatics, coined by Paulien Hogeweg and Ben Hesper in 1970 to describe "the study of informatic processes in biotic systems", can be defined as "the research, development, or application of computational tools and approaches that enable the handling of biological, medical or behavioural data, including those to acquire, represent, describe, store, analyse or visualize such data" or as "the development and application of analytical and theoretical methods, mathematical models and computational simulation techniques to study biological, social or behavioural systems". The first definition refers to the management of biological information, and the second to computational biology. The general objective and the methodology employed in the present thesis, "Integration of bioinformatics in molecular research in forest species: the case of Holm oak (Quercus ilex)", fall into the first group: the use of bioinformatic tools (algorithms, programs, databases and repositories) for the analysis of data, mainly omics data, and the construction of the reference transcriptome, proteome and metabolome of Holm oak, together with the integration of these omics to define the metabolism and the response to drought in this species. Since the end of the last century, biological research has moved from a reductionist paradigm to a holistic approach, thanks to great technological advances, especially in the discipline of molecular biology. The appearance of platforms for next generation sequencing (NGS), in the case of genomics and transcriptomics, and mass spectrometry (MS), in the case of proteomics and metabolomics, has made it possible to obtain from hundreds to thousands of data points from a single experiment; processing and analysing them is practically unfeasible without informatics tools. 
The use of high-throughput techniques and their combination with classic approaches is what defines "Systems Biology", the new direction established in biological research. Systems Biology includes not only the analysis of thousands of molecular entities, but also their integration and the establishment, from them, of predictive models. This is quite feasible, and a reality today, with model organisms (for example, Arabidopsis); however, for recalcitrant systems that are orphans of molecular studies, as is the case of Q. ilex, it constitutes a real challenge. The study of this species is justified both by its environmental and economic interest for our region and by the increase in tree mortality observed in recent decades and associated with biotic and abiotic stresses, which together constitute the so-called decline syndrome ("síndrome de la seca"). Tree death and the loss of forest mass may be aggravated in a climate change scenario. Biotechnology can contribute to solving this problem through breeding programs based on molecular marker-assisted selection for the identification of elite genotypes that are more tolerant to biotic and abiotic stresses and more resilient to climate change. As a continuation of the work carried out since 2004 by the research group "Agroforestry and Plant Biochemistry, Proteomics, and Systems Biology", focused mainly on studies of classic biochemistry, physiology and proteomics, and considering the absence of DNA and protein sequences in Holm oak, the first objective of the present thesis was the construction of the first reference transcriptome for this species. This work is presented in chapter 3 and has been published in Frontiers in Molecular Biosciences. 
For this purpose, mRNA was extracted from homogenized tissue from embryo, leaves and roots, and subsequently sequenced on the Illumina HiSeq 2500 platform. Three different assemblers were employed: TRINITY, RAY and MIRA. The assembled sequences were aligned against the transcriptome of Quercus robur and Q. petraea, considered the most accurate transcriptome and the phylogenetically closest to Q. ilex. MIRA generated a larger number of contigs than RAY and TRINITY (MIRA>RAY>TRINITY). Therefore, the sequences assembled with MIRA were used to continue with the annotation of the Q. ilex transcriptome, which resulted in 31973 annotated sequences obtained by Blast2GO using Swiss-Prot as the reference database. As a continuation of the work described in chapter 4, and as a second objective, a new sequencing platform, Ion Torrent, was evaluated for the construction and analysis of the Q. ilex transcriptome. The results obtained have been published in PLoS ONE. As in the previous chapter, the reads obtained from Illumina and Ion Torrent were assembled using three different programs: MIRA, RAY and TRINITY. The MIRA assembly of Illumina reads and the TRINITY assembly of Ion Torrent reads generated the largest numbers of annotated transcripts (62628 and 74058, respectively). The MIRA assembly of Ion Torrent reads generated the largest number of sequences shared with the oak transcriptome (84.8%). RAY generated the best results in terms of number and length of contigs, with E90N50 values of 1122 bp. All the transcripts of the new reference transcriptome were annotated and grouped in Gene Ontology terms ("Biological Process", "Cellular Component" and "Molecular Function"). This transcriptome was translated in silico, yielding a protein database that will be used in proteomics experiments for the identification of gene products. 
The use of this database notably increased the number of protein species identified and the confidence parameters of the identification. From the databases generated and the multi-omics data obtained from a Holm oak sample consisting of a pool of extracts from different tissues (embryo, leaf and root), different metabolic pathways were reconstructed as they occur in Q. ilex. The results are presented in chapter 5 and have been published in Frontiers in Plant Science. RNA, proteins and metabolites were extracted independently from the same sample, and the omics profile was established by NGS-Illumina (RNA), shotgun LC-MS/MS, Orbitrap (proteins) and GC-MS (metabolites). A total of 62629 transcripts, 2380 protein species and 62 metabolites were identified. Gene products corresponding to enzymes were identified by comparison with reference genomes including Arabidopsis thaliana (UP000006548), Oryza sativa subsp. japonica (UP000059680), Populus trichocarpa (UP000006729), and Eucalyptus grandis (UP000030711). Of the 127 metabolic pathways described in KEGG, 123 were visualized using Mapman, among them those of energy, carbohydrate, amino acid, lipid, nucleotide and secondary metabolism. The tricarboxylic acid (TCA) cycle was the best represented pathway, with 5 of 10 metabolites, 6 of 8 enzyme proteins and 8 of 8 transcripts. On the other hand, some pathways were not observed or were very poorly represented, such as those of lipid, terpenoid and polyketide metabolism. As the final objective of the present thesis, an integrated transcriptomic and proteomic analysis of the response to drought in Q. ilex seedlings was carried out. The results are presented in chapter 6, which corresponds to a manuscript that will be submitted for publication. Seedlings of Q. 
ilex were grown in pots with perlite and subjected to drought conditions by withholding water for 30 days. Leaf samples were taken at two times, when leaf fluorescence had decreased by 30% and 50% (20 and 25 days). After extraction, RNA and proteins were analysed by RNA-Seq (Illumina) and shotgun proteomics (LC-MS/MS, Orbitrap). The RNA-seq analysis generated 47868 transcripts corresponding to 21000 unigenes, with 3588 qualitative or quantitative differences between irrigated and non-irrigated seedlings (1149 overexpressed and 2439 repressed). Shotgun proteomics identified 4008 proteoforms, products of 2767 different genes; of these, 640 showed qualitative or quantitative differences in abundance between treatments (353 more and 287 less abundant under drought conditions). The variable gene products were classified in Gene Ontology terms (biological process, molecular function and cellular component) and, in the case of enzymes, in KEGG metabolic pathways. The variable dataset was subjected to multivariate statistical analysis, PCA and sPLS. Finally, GeneMANIA was used to construct interaction networks. There were important changes in the gene expression pattern, with the stress response and chloroplast groups being the most affected. Regarding metabolic pathways, changes were detected in protein synthesis, photosynthesis, carbohydrates, amino acids and phenolics. Transitory changes (observed at a single time) or permanent changes (common to both times) were detected at the transcript and/or protein level. The number of variable gene products detected by both platforms was minimal, among them RPS2, 4CL2, PSB28 and RIN4. 
From the variable transcript and protein dataset, two interaction networks were constructed: the first included the overexpressed genes CLPB2, CLPB3, HSP70, HSP17.4, FtsH6, AT1G23740, SMT1 and UGP3, and the repressed genes ABA2, RPS1, ADK and RPL4, and the second included the overexpressed genes CLPB2, CLPB3, HSP70, HSP17.4, FtsH6, AT1G23740, AP1, INVE, AT4G2740, CAD4, FEN1 and HIPP27 and the repressed gene ABA2. Genes overexpressed at both times and detected at both the transcript and protein level are proposed as marker genes of response and tolerance to drought in Holm oak. Only a limited number of genes meet these characteristics, including possible heat shock response proteins, CLPB2 and CLPB3, a chloroplast metalloprotease, FTSH6, and the photosystem II reaction centre protein, PSB28. As a general conclusion, it is necessary to emphasize the need for bioinformatic tools for the analysis of the large amount of data generated by omics techniques and, at the same time, the need for manual review and validation of the results in order to reach a correct, non-speculative biological interpretation. Living organisms are much more complex than we could imagine, and understanding their biology requires improvements in laboratory techniques and in silico analysis, in order to deepen our knowledge of the mechanisms connecting genotype and phenotype and to identify gene products and their interactions associated with different biological processes, such as the response and tolerance/resistance to stresses in plants. This knowledge will make it possible to undertake breeding programs through biotechnological approaches.	es_ES
dc.format.mimetype	application/pdf	es_ES
dc.language.iso	eng	es_ES
dc.publisher	Universidad de Córdoba, UCOPress	es_ES
dc.rights	https://creativecommons.org/licenses/by-nc-nd/4.0/	es_ES
dc.subject	Holm oak	es_ES
dc.subject	Quercus ilex	es_ES
dc.subject	Bioinformatics	es_ES
dc.subject	Omics technologies	es_ES
dc.subject	Metabolomics	es_ES
dc.subject	Proteomics	es_ES
dc.subject	Transcriptomics	es_ES
dc.subject	Encina	es_ES
dc.subject	Bioinformática	es_ES
dc.subject	Tecnologías ómicas	es_ES
dc.subject	Metabolómica	es_ES
dc.subject	Proteómica	es_ES
dc.subject	Transcriptómica	es_ES
dc.title	Integration of Bioinformatics to molecular research in forest species: the case of Holm oak (Quercus ilex)	es_ES
dc.title.alternative	Integración de la Bioinformática en la investigación molecular en especies forestales: el caso de la encina (Quercus ilex)	es_ES
dc.type	info:eu-repo/semantics/doctoralThesis	es_ES
dc.rights.accessRights	info:eu-repo/semantics/openAccess	es_ES

Files in this item

Name: 2020000002104.pdf
Size: 25.90Mb
Format: PDF
