
Information Extraction from Text Documents using Pipeline Pilot

With so much text-based information already available, and more appearing all the time, effective and efficient means of extracting information from relevant documents are ever more vital. It is simply not possible to read every document yourself. On the other hand, fully automating the processing of documents can be dangerous, since machine processing can never match the human ability for language comprehension and reasoning. The solution is an approach that combines the respective strengths of human and machine processing: the machine component takes care of the repetitive, automatable tasks, so that the key documents can be presented to the user in a form that facilitates review and decision-making.

In the following use case, we are interested in determining the relationship of a gene (BRAF) with various forms of cancer. We don't want to have to perform a separate search for BRAF and each form of cancer. Instead we want an efficient, repeatable approach that takes a gene query and relates the retrieved documents to cancer terms.

Create the Text Query

The first step is to create a text query. For a single query, start with the Generate Text Query component and set the query to the gene "BRAF" and ask for the first 50 hits:
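Outside Pipeline Pilot, the same request (one term, first 50 hits) can be sketched as a call to NCBI's E-utilities ESearch service. This is only an illustration of what such a query contains, not the Generate Text Query component itself; the helper name build_pubmed_query is hypothetical, and the code builds the URL without sending it.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (public PubMed search service).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(term: str, max_hits: int = 50, start: int = 0) -> str:
    """Return an ESearch URL asking PubMed for `max_hits` records matching `term`.

    Hypothetical helper for illustration; it is not part of Pipeline Pilot.
    """
    params = {
        "db": "pubmed",      # search the PubMed database
        "term": term,        # the text query, e.g. a gene symbol
        "retstart": start,   # zero-based index of the first hit to return
        "retmax": max_hits,  # number of hits to return
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_pubmed_query("BRAF", max_hits=50))
```

Keeping the term and the hit limit as separate parameters mirrors the two settings described above, so either can be changed without touching the other.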

Expand the Text Query with Synonyms

Genes, as with many other concepts that appear in text, have a number of synonyms. To ensure that you perform a comprehensive search, it is important to expand your query to include any synonymous terms. PubMed does this automatically, but if you will be searching other data sources, or want the most comprehensive search possible, you can expand your query using a dictionary of terms. To do this you can use any of the concept dictionaries that come prepackaged with Pipeline Pilot (e.g., MeSH, the Medical Subject Headings dictionary from the National Library of Medicine) or you can create your own custom concept dictionaries.

The following protocol shows how a copy of the Entrez Gene file containing gene IDs and synonyms, downloaded from NCBI, can be used to create a custom dictionary of gene synonyms:

Note that to keep this up to date you could include the download step as part of the protocol, and set the job to run every night.
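For readers who want to see what such a dictionary contains, the building step might look like this in Python. It is a sketch, not the protocol itself. It assumes the tab-separated layout of NCBI's gene_info file, with the official symbol in the third column and a pipe-delimited synonym list in the fifth ("-" meaning none). The sample rows are abbreviated and illustrative (the second row is a made-up placeholder), and build_synonym_dictionary is a hypothetical name.

```python
import csv
import io
from typing import Dict, List, TextIO

# Abbreviated, illustrative rows in the assumed gene_info layout:
# tax_id, GeneID, Symbol, LocusTag, Synonyms (pipe-delimited, "-" if none).
# The BRAF synonym list is not complete; EXAMPLE1 is a made-up placeholder.
SAMPLE_GENE_INFO = (
    "#tax_id\tGeneID\tSymbol\tLocusTag\tSynonyms\n"
    "9606\t673\tBRAF\t-\tB-RAF1|BRAF1|RAFB1\n"
    "9606\t000000\tEXAMPLE1\t-\t-\n"
)


def build_synonym_dictionary(handle: TextIO) -> Dict[str, List[str]]:
    """Map each gene symbol to [symbol, synonym1, synonym2, ...]."""
    dictionary: Dict[str, List[str]] = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#") or len(row) < 5:
            continue  # skip blank lines, the header, and short rows
        symbol, synonyms = row[2], row[4]
        terms = [symbol]
        if synonyms != "-":
            terms.extend(s for s in synonyms.split("|") if s and s != symbol)
        dictionary[symbol] = terms
    return dictionary


if __name__ == "__main__":
    gene_dictionary = build_synonym_dictionary(io.StringIO(SAMPLE_GENE_INFO))
    print(gene_dictionary["BRAF"])
```

Because the function reads from any text handle, the same code would work on a freshly downloaded file opened with open(), which suits the nightly refresh described above.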

Once the dictionary is created, use the "Expand Text Query with Synonyms" component to expand the BRAF query:
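Conceptually, the expansion replaces the single term with an OR of the term and all of its synonyms. The following is a minimal sketch of that idea, assuming a plain Python dict as the concept dictionary and a Boolean OR query syntax; expand_query and the small dictionary are hypothetical stand-ins for the component and the Entrez Gene dictionary.

```python
from typing import Dict, List

# Hypothetical, abbreviated dictionary: gene symbol -> some known synonyms.
GENE_SYNONYMS: Dict[str, List[str]] = {
    "BRAF": ["B-RAF1", "BRAF1", "RAFB1"],
}


def expand_query(term: str, dictionary: Dict[str, List[str]]) -> str:
    """Return `term` OR-ed with its synonyms, each quoted as a phrase.

    A term missing from the dictionary is returned on its own (quoted).
    """
    seen = set()
    terms = []
    for candidate in [term, *dictionary.get(term, [])]:
        key = candidate.lower()
        if key not in seen:  # drop case-insensitive duplicates
            seen.add(key)
            terms.append(f'"{candidate}"')
    return " OR ".join(terms)


if __name__ == "__main__":
    print(expand_query("BRAF", GENE_SYNONYMS))
```

Quoting each term keeps hyphenated synonyms such as B-RAF1 from being split by search engines that treat the hyphen as a separator; whether a given data source needs this is an assumption to check against that source.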

Performing the Search

The next step is to perform the search; in this example we will search PubMed. Searching can occur in one of two ways: either by starting with a search component (see the PubMed Trend Analysis case study), or by using streaming data as the search input. In this example we use the latter approach, which allows us to search not just for a single gene (BRAF) but to run an individual search for each gene in a list of genes of interest.
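The streaming approach amounts to treating each incoming record as its own query. The sketch below illustrates this with a Python generator. The names search_stream and fake_search are hypothetical, the second gene (KRAS) is an arbitrary example, and the search function is a stand-in that returns made-up identifiers rather than contacting PubMed.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# A search function takes (term, max_hits) and returns document IDs.
SearchFn = Callable[[str, int], List[str]]


def search_stream(genes: Iterable[str], search: SearchFn,
                  max_hits: int = 50) -> Iterator[Dict[str, object]]:
    """Run one search per incoming gene and yield a record for each.

    Records are produced lazily, so downstream steps can start work
    before the whole gene list has been searched.
    """
    for gene in genes:
        hits = search(gene, max_hits)
        yield {"gene": gene, "hits": hits[:max_hits]}


def fake_search(term: str, max_hits: int) -> List[str]:
    """Stand-in for a real PubMed search; returns made-up document IDs."""
    return [f"{term}-doc{i}" for i in range(1, 4)][:max_hits]


if __name__ == "__main__":
    for record in search_stream(["BRAF", "KRAS"], fake_search):
        print(record["gene"], record["hits"])
```

Passing the search function as an argument keeps the per-gene loop independent of the data source, so the same stream could be pointed at a different source by swapping in another function.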

Extracting Information from the Documents

Once the documents have been retrieved, the next step is to find their relationships with types of cancer. To do this we Annotate using Concepts, finding all occurrences of "cancer" terms from the MeSH dictionary in the retrieved documents. We then keep only those documents that contain such terms. In this simple example we will just display the results, but in more advanced use cases we could do further processing to count the number of documents by type of cancer, filter for specific forms of cancer, calculate trends and correlations between BRAF and various forms of cancer, or perform any number of other analyses, according to our research interest.
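The annotate-then-filter step can be approximated with a simple dictionary lookup. This is a sketch under stated assumptions: the four terms below are a hypothetical handful standing in for the much larger set of MeSH cancer terms, matching is by whole word with an optional plural "s", and the function names are invented for illustration.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical handful of cancer terms; the real MeSH dictionary is far larger.
CANCER_TERMS = ["carcinoma", "melanoma", "neoplasm", "leukemia"]


def annotate(text: str, terms: Iterable[str]) -> List[str]:
    """Return the terms found in `text` as whole words (case-insensitive),
    allowing a trailing plural "s"."""
    found = []
    for term in terms:
        pattern = rf"\b{re.escape(term)}s?\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


def keep_annotated(documents: Iterable[Dict[str, str]],
                   terms: Iterable[str]) -> List[Dict[str, object]]:
    """Annotate each document and keep only those with at least one match."""
    term_list = list(terms)
    kept = []
    for doc in documents:
        matches = annotate(doc["text"], term_list)
        if matches:
            kept.append({**doc, "cancer_terms": matches})
    return kept


if __name__ == "__main__":
    # Made-up example documents, not real PubMed records.
    docs = [
        {"id": "doc1", "text": "BRAF mutations in papillary thyroid carcinoma."},
        {"id": "doc2", "text": "BRAF signalling in normal development."},
    ]
    for doc in keep_annotated(docs, CANCER_TERMS):
        print(doc["id"], doc["cancer_terms"])
```

Storing the matched terms on each kept document is what makes the later steps (counting, filtering, trend analysis) possible without re-scanning the text.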

Viewing the Output

The following graphic shows the output of this protocol. The header of the webpage output shows the online source (PubMed), the result range (1 - 50), the total number of matching documents (734) and the expanded text query. The first matching document is then shown, with the query term highlighted in yellow and the matching MeSH cancer terms (e.g., carcinoma) highlighted in blue.

A second result shows that BRAF is also associated with melanoma:

As well as viewing the individual documents, you can collate them into a table showing the count of documents and the recent publication frequency to reveal what forms of cancer are most commonly, or most rarely, associated with BRAF, and when those documents were published:
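A collation of this kind reduces to counting documents per cancer term, overall and within a recent window. The sketch below assumes each document carries a publication year and a list of matched cancer terms; the field names, the collate function, and the sample records are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def collate(documents: Iterable[Dict[str, object]],
            recent_from: int) -> List[Tuple[str, int, int]]:
    """Return rows of (term, total documents, documents from `recent_from` on).

    Rows are sorted by total count (descending), then alphabetically.
    """
    totals: Counter = Counter()
    recent: Counter = Counter()
    for doc in documents:
        for term in set(doc["cancer_terms"]):  # count a document once per term
            totals[term] += 1
            if doc["year"] >= recent_from:
                recent[term] += 1
    rows = [(term, totals[term], recent[term]) for term in totals]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


if __name__ == "__main__":
    # Made-up records for illustration only.
    docs = [
        {"year": 2005, "cancer_terms": ["melanoma"]},
        {"year": 2007, "cancer_terms": ["melanoma", "carcinoma"]},
        {"year": 2008, "cancer_terms": ["carcinoma", "carcinoma"]},
        {"year": 2008, "cancer_terms": ["melanoma"]},
    ]
    for term, total, recent_count in collate(docs, recent_from=2007):
        print(f"{term:<12}{total:>6}{recent_count:>8}")
```

Sorting by total count puts the most commonly associated forms of cancer at the top and the rarest at the bottom, which matches the two ends of the table a reviewer is likely to care about.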

This relatively simple protocol brings together the collective resources of Entrez Gene, PubMed and MeSH, representing thousands of hours of research by hundreds of individual scientists, to retrieve and highlight the most important documents for you to review and analyze. The deeper your research questions go, the more evident the power of pipelining your text analyses becomes.