
UC Davis Epicenter for Disease Dynamics Project - Extraction of virus/host relation pairs from academic papers using the Snorkel package


EricaXia/snorkel


Extracting Virus-Host Relations from Research Text - Machine Learning Project


In this project, we build a machine learning system to extract and identify correct mentions of virus and animal host species relations from academic research papers in the context of epidemiological research.

Read summary here

Because a large majority of infectious diseases spread from animals to humans, zoonotic diseases have become an important research topic. Various viruses, such as flaviviruses, can cause outbreaks of viral zoonotic disease. The relations between viral and animal host species are therefore major factors in understanding the transmission and characteristics of zoonotic diseases. Natural language processing (NLP) extraction techniques can be used to identify species-level mentions of virus-host relations in academic text.

In this project, we build a system to extract and identify correct mentions of virus and animal host species from academic research papers. The goal is to provide insight into the scientific writing and international research conducted on species linked to zoonotic disease. After extracting the frequencies with which specific virus-host relations are mentioned, we use supervised machine learning techniques to label entity pairs as having positive or negative associations.

One challenge in applying supervised learning methods is the creation of large, labeled training sets. Our project requires training sets of confirmed viral and host species relations. We therefore use data programming, by way of a training set creation package called Snorkel (created by HazyResearch, from the Stanford DAWN project), to create the training set. The training sets are noisy, machine-labeled sets created by applying user-defined heuristics, called labeling functions, to extracted candidate pairs. A generative model is then used to unify the labeling functions and reduce noise in the final training set. Finally, end extraction is performed by an LSTM model that predicts correct relation mentions.

Code

The code is organized by pipeline step, one part per step.

Part 1 - Document Preparation, Preprocessing, and Candidate Extraction

  • Read in a corpus of documents in .tsv format
  • Extract candidates through dictionary matching
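
The two steps above can be sketched in plain Python. This is a minimal illustration and not the project's Snorkel code: the two-column `.tsv` layout (`doc_id`, `text`), the tiny `VIRUSES` and `HOSTS` dictionaries, and the helper names `find_mentions` and `extract_candidates` are all assumptions made for the example.

```python
import csv
import io
import re
from itertools import product

# Hypothetical dictionaries; the project's real term lists are not shown here.
VIRUSES = {"west nile virus", "dengue virus", "zika virus"}
HOSTS = {"crow", "mosquito", "macaque"}


def find_mentions(sentence, terms):
    """Return sorted (start, end, term) tuples for dictionary terms in the sentence."""
    lowered = sentence.lower()
    mentions = []
    for term in terms:
        # Allow an optional plural "s" so that "crows" matches the entry "crow".
        pattern = r"\b" + re.escape(term) + r"s?\b"
        for match in re.finditer(pattern, lowered):
            mentions.append((match.start(), match.end(), term))
    return sorted(mentions)


def extract_candidates(tsv_text):
    """Build one candidate per (virus, host) mention pair found in the same sentence."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    candidates = []
    for doc_id, text in reader:
        # Naive sentence split on end punctuation followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            viruses = find_mentions(sentence, VIRUSES)
            hosts = find_mentions(sentence, HOSTS)
            for virus, host in product(viruses, hosts):
                candidates.append({
                    "doc_id": doc_id,
                    "sentence": sentence,
                    "virus": virus[2],
                    "host": host[2],
                })
    return candidates


# Toy corpus standing in for the real .tsv file.
corpus = (
    "doc1\tWest Nile virus was isolated from dead crows. Sampling continued in 2003.\n"
    "doc2\tZika virus infects macaques, while dengue virus was absent.\n"
)
candidates = extract_candidates(corpus)
for c in candidates:
    print(c["doc_id"], c["virus"], "->", c["host"])
```

Pairing every virus mention with every host mention in a sentence deliberately over-generates. Deciding which pairs are real relations is left to the later parts of the pipeline.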

Part 2 - Labeling Functions Development

  • Develop Labeling Functions to label candidates as true or false
  • Compare LF performance with hand labeled set (gold labels)
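
As a rough illustration of this part, the sketch below writes three toy labeling functions as ordinary Python functions and scores them against a small hand-labeled set. The label convention (1 = true, -1 = false, 0 = abstain), the cue phrases, the example candidates, and the `lf_summary` helper are assumptions for the example. They are not the project's actual LFs or Snorkel's own analysis tools.

```python
TRUE, FALSE, ABSTAIN = 1, -1, 0  # assumed label convention

POSITIVE_CUES = ("isolated from", "detected in", "infects", "reservoir")
NEGATIVE_CUES = ("not detected", "absent", "no evidence")


def span_between(c):
    """Lower-cased text from the start of the first mention to the start of the second."""
    s = c["sentence"].lower()
    i, j = s.find(c["virus"]), s.find(c["host"])
    if i < 0 or j < 0:
        return ""
    lo, hi = sorted((i, j))
    return s[lo:hi]


def lf_positive_cue(c):
    return TRUE if any(cue in span_between(c) for cue in POSITIVE_CUES) else ABSTAIN


def lf_negative_cue(c):
    sentence = c["sentence"].lower()
    return FALSE if any(cue in sentence for cue in NEGATIVE_CUES) else ABSTAIN


def lf_listed_together(c):
    # Virus and host joined only by "and" are probably just co-mentioned.
    return FALSE if " and " in span_between(c) else ABSTAIN


def lf_summary(lfs, candidates, gold):
    """Return {lf_name: (coverage, accuracy on non-abstaining votes)}."""
    summary = {}
    for lf in lfs:
        votes = [lf(c) for c in candidates]
        labeled = [(v, g) for v, g in zip(votes, gold) if v != ABSTAIN]
        coverage = len(labeled) / len(candidates)
        accuracy = sum(v == g for v, g in labeled) / len(labeled) if labeled else None
        summary[lf.__name__] = (coverage, accuracy)
    return summary


# Toy candidates with hand-assigned gold labels (illustrative only).
candidates = [
    {"sentence": "West Nile virus was isolated from dead crows.",
     "virus": "west nile virus", "host": "crow"},
    {"sentence": "Dengue virus was not detected in any macaque sampled.",
     "virus": "dengue virus", "host": "macaque"},
    {"sentence": "Zika virus and mosquito control were discussed.",
     "virus": "zika virus", "host": "mosquito"},
]
gold = [TRUE, FALSE, FALSE]

summary = lf_summary([lf_positive_cue, lf_negative_cue, lf_listed_together], candidates, gold)
for name, (coverage, accuracy) in summary.items():
    print(name, coverage, accuracy)
```

In the second candidate, `lf_positive_cue` fires on "detected in" even though the sentence is negated, so it conflicts with `lf_negative_cue`. Conflicts of this kind are the noise that Part 3 is meant to resolve.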

Part 3 - Generative Model Training

  • Unify the LFs and reduce their noise
  • Use marginal predictions from the model as the probabilistic training labels (for the end extraction model in Part 4)
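
To make the idea of marginal predictions concrete, here is a simplified stand-in for the generative model, not Snorkel's implementation. It assumes the labeling functions vote independently, that the classes are balanced, and that each LF's accuracy is already known. Snorkel's generative model instead estimates such parameters from the unlabeled votes. The `marginals` function and the accuracy values below are hypothetical.

```python
import math


def marginals(label_matrix, accuracies):
    """Combine LF votes (1, -1, or 0 for abstain) into P(y = 1) per candidate.

    Each LF contributes its vote times log(acc / (1 - acc)); abstentions add
    nothing. The summed log-odds are squashed with a sigmoid.
    """
    weights = [math.log(a / (1 - a)) for a in accuracies]
    probs = []
    for row in label_matrix:
        score = sum(w * vote for w, vote in zip(weights, row))
        probs.append(1 / (1 + math.exp(-score)))
    return probs


# Rows are candidates, columns are three hypothetical LFs.
label_matrix = [
    [1, 0, 0],    # one weak positive vote
    [1, -1, 0],   # a weak positive against a strong negative
    [0, 0, -1],   # one negative vote
    [0, 0, 0],    # every LF abstains
]
assumed_accuracies = [0.6, 0.9, 0.8]  # illustrative values, not measured

probs = marginals(label_matrix, assumed_accuracies)
for row, p in zip(label_matrix, probs):
    print(row, round(p, 3))
```

The second row shows the denoising effect. The more accurate LF outweighs the weaker one, so the conflicting votes yield a probability well below 0.5 rather than a tie. These probabilities would then serve as the soft training labels for Part 4.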

Part 4 - Discriminative End Extraction Model Training

  • Train an LSTM model using training labels from Part 3
  • Evaluate model performance on a blind test set
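
The LSTM itself is beyond the scope of a short example, but the two ideas in this part can be sketched without a deep learning library. The first function is a cross-entropy loss computed against probabilistic labels rather than hard 0/1 labels. Whether the project's model uses exactly this loss is an assumption. The second computes precision, recall, and F1 on a held-out test set. The function names and the sample numbers are illustrative only.

```python
import math


def soft_cross_entropy(pred_probs, soft_labels):
    """Mean cross-entropy between predicted P(y = 1) and probabilistic targets."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for p, t in zip(pred_probs, soft_labels):
        p = min(max(p, eps), 1 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(pred_probs)


def evaluate(pred_probs, gold, threshold=0.5):
    """Return (precision, recall, F1) for gold labels in {1, -1}."""
    predicted = [1 if p > threshold else -1 for p in pred_probs]
    tp = sum(1 for y, g in zip(predicted, gold) if y == 1 and g == 1)
    fp = sum(1 for y, g in zip(predicted, gold) if y == 1 and g == -1)
    fn = sum(1 for y, g in zip(predicted, gold) if y == -1 and g == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Training signal: the model predicts 0.5 for a candidate whose soft label is 0.6.
loss = soft_cross_entropy([0.5], [0.6])

# Blind test set: made-up model outputs with hand-assigned gold labels.
test_probs = [0.9, 0.8, 0.3, 0.6, 0.1]
test_gold = [1, 1, 1, -1, -1]
precision, recall, f1 = evaluate(test_probs, test_gold)
print(loss, precision, recall, f1)
```

Training against soft targets lets the end model give less weight to candidates whose labeling functions disagreed, while evaluation still uses hard gold labels that the labeling functions never saw.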

Additional Data
