QUALIFYING EXAM committee: INACIO GOMES MEDEIROS

A DOCTORAL QUALIFYING EXAM committee has been registered by the program.
STUDENT: INACIO GOMES MEDEIROS
DATE: 16/09/2019
TIME: 14:00
LOCATION: BioME - PPg-Bioinfo
TITLE:

BioSurfMiner: a pipeline for identification of biosurfactant-synthesis proteins


KEYWORDS:

Biosurfactants. Metagenomics. Machine learning. Pipeline.


PAGES: 49
BROAD AREA: Biological Sciences
AREA: General Biology
ABSTRACT:

Biosurfactants are compounds produced by microorganisms that reduce the surface and interfacial tension of a mixture. In the oil industry, they can be used to optimize oil well recovery. Each well, however, has its own physicochemical properties, so a generic choice of biosurfactant may not suit every well. It is therefore essential to know the genes and metabolic pathways employed to produce biosurfactants in the well of interest, as this enables customized and efficient solutions. This knowledge can be acquired with metagenomics, the study of the genetic material of an environmental sample. One limitation is that metagenomics relies heavily on sequence database searches, a strategy that may fail to discover proteins absent from those databases. Machine learning-based computational techniques help to bridge this gap. This work proposes an in silico pipeline for identifying biosurfactant-synthesis proteins in oil well metagenomic data using alignment and supervised learning techniques, aiming to be accurate enough that the discovered proteins have a high chance of success in in vitro tests. It consists of two steps. In the first, input proteins are aligned against BioSurfDB synthesis proteins, using homology parameters based on the literature and on BioSurfDB itself. In the second, run in parallel with the first, biological properties related to the frequencies of different types of amino acids and to physicochemical characteristics, such as isoelectric point, molecular weight, and GRAVY (grand average of hydropathy), among others, are calculated and analyzed by a supervised machine learning algorithm, which indicates which of the proteins are synthesis proteins. Four algorithms (Support Vector Machines, Decision Tree, Random Forest, and Gaussian Naive Bayes) were evaluated for selection. Initially, they were trained and tested with BioSurfDB synthesis proteins and a negative control assembled from the curated part of UniProt, using sensitivity and specificity as metrics. They then classified all proteins from the curated part of UniProt, and we checked which of the proteins classified as synthesis proteins had homology to the BioSurfDB synthesis proteins or had a synthesis gene name. All algorithms showed sensitivity below 20% and specificity above 99%. Support Vector Machines and Random Forest obtained 100% specificity, but the former showed 0% sensitivity and was discarded for the second selection stage. Of the remaining algorithms, Gaussian Naive Bayes was the only one for which some of the proteins classified as synthesis proteins had homology with BioSurfDB synthesis proteins or had a synthesis gene name, but that portion was very small (4 proteins had both homology and a gene name, 19 had only a gene name, and 2 had only homology) compared to the 3346 proteins classified. We find that the combination of the biological properties used with the learning algorithms tested can, in some cases, discard false positives entirely, but is quite deficient at finding synthesis proteins, including those that are already known. As future work, we intend to explore other algorithms not covered in this work, as well as other biological properties related to the secondary structure of proteins and their functional domains.
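
As an illustration only, the sketch below shows how two of the sequence-derived properties mentioned in the abstract (amino acid frequencies and GRAVY) and the two evaluation metrics (sensitivity and specificity) could be computed. It is not the BioSurfMiner implementation: the function names and the toy labels are hypothetical, GRAVY is assumed to use the Kyte-Doolittle hydropathy scale, and sequences are assumed to contain only the 20 standard amino acids in upper case.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def amino_acid_frequencies(sequence):
    """Relative frequency of each standard amino acid in the sequence."""
    counts = Counter(sequence)
    return {aa: counts[aa] / len(sequence) for aa in KYTE_DOOLITTLE}


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels.

    Label 1 means "synthesis protein", label 0 means "negative control".
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Toy labels (hypothetical, not the study's data): a classifier that
# rejects every negative but recovers only one of five positives.
toy_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
toy_sensitivity, toy_specificity = sensitivity_specificity(toy_true, toy_pred)
```

On the toy labels, specificity is perfect while sensitivity is low, which is the same qualitative pattern the abstract reports: false positives are discarded, but most true synthesis proteins are missed.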


COMMITTEE MEMBERS:
Chair - 1149647 - LUCYMARA FASSARELLA AGNEZ LIMA
Internal - 1893445 - EUZEBIO GUIMARAES BARBOSA
External to the Program - 2859562 - LEONARDO CESAR TEONACIO BEZERRA
Notice registered on: 10/09/2019 11:10