Countering Information Overload
CARNEGIE MELLON UNIVERSITY
Amlos
Creating an inclusion/exclusion algorithm using a Naive Bayes model to screen scientific articles for cancer research

Role
Research Assistant
Timeline
May - August 2022
Skills
Classification
Research
Tools
Colaboratory
Sysrev
Zotero
Python
Project Overview
With so much information available at our fingertips, it can be difficult to make sense of it all. Using a research methodology called knowledge synthesis, my research team and I collected relevant information on cancer immunotherapy to surface important research articles.

My team and I developed advanced search strategies for databases and learned how Natural Language Processing can be used to annotate and search documents. Through hands-on experience with each step of the research process, I was able to contribute to an ongoing research project that produced a novel dataset.
Pain Points
Using Sysrev, a document review platform, and Zotero, a reference management tool, my team and I were tasked with organizing research articles based on their relevance to cancer immunotherapy. We classified these articles by checking whether they contained specific keywords such as "cancer", "nanoparticles", "nano", "disease", "tissue", and "immunotherapy".
After marking these articles for inclusion, we read through them again to confirm their relevance, reviewing over 300 articles in order to build a dataset for the model.
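To illustrate how this manual screening step could be expressed in code, here is a minimal sketch of keyword-based flagging; the keyword list comes from the criteria above, while the function name and article fields are hypothetical:

```python
# Minimal sketch of keyword-based screening (illustrative; field names are assumptions).
KEYWORDS = {"cancer", "nanoparticles", "nano", "disease", "tissue", "immunotherapy"}

def flag_for_inclusion(title: str, abstract: str) -> bool:
    """Return True if any screening keyword appears in the title or abstract."""
    text = f"{title} {abstract}".lower()
    return any(keyword in text for keyword in KEYWORDS)

# Example usage with a made-up record
article = {
    "title": "Nanoparticle delivery systems for cancer immunotherapy",
    "abstract": "We review recent advances in targeted tissue delivery...",
}
print(flag_for_inclusion(article["title"], article["abstract"]))  # True
```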
Then, using Google Colaboratory, my team and I downloaded the inclusion/exclusion data and organized it into fields such as Article Title, Included/Excluded status, Category of Article, Abstract, and Prediction Percentage, among others. Our overall goal was to build a model that could include or exclude articles the way we were doing manually, but far more efficiently.
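A rough sketch of how such an export might be loaded and sectioned with pandas in Colaboratory is shown below; the file name and column labels are assumptions based on the fields described above, not the actual Sysrev export schema:

```python
import pandas as pd

# Hypothetical export file and column names mirroring the fields described above.
df = pd.read_csv("inclusion_exclusion_export.csv")

columns_of_interest = [
    "Article Title",
    "Included/Excluded",
    "Category of Article",
    "Abstract",
    "Prediction Percentage",
]
subset = df[columns_of_interest]

# Quick look at the class balance between included and excluded articles
print(subset["Included/Excluded"].value_counts())
```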
We split the data into training and testing sets and evaluated the model on each. Using a confusion matrix, I classified articles by whether they mentioned disease but not cancer, cancer but not disease, both, or neither. On the training data, the model predicted the correct category 99.13% of the time.
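The pipeline could look roughly like the following scikit-learn sketch, which trains a multinomial Naive Bayes classifier on article abstracts, holds out a test split, and prints a confusion matrix; the file name, column names, and split parameters are assumptions, and the actual notebook may have differed:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical export file and column names based on the fields described above.
df = pd.read_csv("inclusion_exclusion_export.csv")
texts = df["Abstract"].fillna("")
labels = df["Category of Article"]

# Bag-of-words features feed a multinomial Naive Bayes classifier.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42
)

model = MultinomialNB()
model.fit(X_train, y_train)

# Evaluate on the held-out test split.
predictions = model.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(f"Accuracy: {accuracy_score(y_test, predictions):.2%}")
```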
This was only a fraction of the work we completed overall, since we also classified other fields such as article titles, abstracts, journals, and authors. For example, we trained classifiers on article titles alone to see whether important keywords in the title could serve as an early indicator of an article's relevance (see the sketch below). Through this work, our team gathered a substantial amount of data from the included articles, which we used to further research cancer immunotherapy methodologies, while the excluded articles were set aside for future research studies in disease fields other than cancer.
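As a sketch of the title-only variant mentioned above, the same bag-of-words Naive Bayes approach can be pointed at the title column instead of the abstracts; again, the file and column names are assumptions:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical export; column names mirror the fields described earlier.
df = pd.read_csv("inclusion_exclusion_export.csv")
titles = df["Article Title"].fillna("")
labels = df["Category of Article"]

# Bag-of-words over titles only, as a cheap first-pass relevance signal.
X = CountVectorizer(stop_words="english").fit_transform(titles)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42
)

title_model = MultinomialNB().fit(X_train, y_train)
print(f"Title-only accuracy: {title_model.score(X_test, y_test):.2%}")
```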
Reflection
This project was very useful for understanding how statistical machine learning models can improve the efficiency of repetitive manual review tasks. In particular, I had the opportunity to delve into the complexities of implementing statistical machine learning models and to see their impact on real-world data.
One key aspect that stood out during the project was the importance of meticulous data preprocessing. Cleaning and transforming raw data into a suitable format for machine learning algorithms proved to be a critical step in ensuring the accuracy and reliability of our models. This process involved handling missing values, normalizing features, and addressing outliers, highlighting the crucial role of data quality in the success of machine learning endeavors.
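For illustration, the kinds of cleaning steps described here might look like the following in pandas; the file and column names are placeholders rather than the project's actual schema:

```python
import pandas as pd

# Illustrative cleaning steps of the kind described above;
# the file and column names are placeholders, not the project's actual schema.
df = pd.read_csv("raw_export.csv")

# Handle missing values: drop rows with no abstract, fill missing categories.
df = df.dropna(subset=["Abstract"])
df["Category of Article"] = df["Category of Article"].fillna("Unknown")

# Drop outliers more than three standard deviations from the mean
# on a hypothetical numeric field.
col = "Publication Year"
z_scores = (df[col] - df[col].mean()) / df[col].std()
df = df[z_scores.abs() <= 3]

# Normalize the same field to the [0, 1] range.
df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
```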
Experimenting with various models allowed us to grasp the nuances of each algorithm and discern their strengths and limitations in different scenarios. Understanding the trade-offs between model complexity and interpretability became a recurring theme, emphasizing the need for a thoughtful approach in model selection.
Furthermore, collaboration within the team was a key factor in navigating the challenges posed by the project. Sharing ideas, discussing results, and collectively troubleshooting issues fostered a collaborative spirit that proved instrumental in overcoming obstacles.
Reflecting on the project, I recognize the continuous learning curve that statistical machine learning presents. The ever-expanding landscape of algorithms, techniques, and tools necessitates a commitment to ongoing learning and adaptation. This experience has not only equipped me with practical skills in statistical machine learning but has also instilled in me a deeper appreciation for the interdisciplinary nature of data science.