Benjamin Applegate

The Future of Data-Driven Research through Scientific Landscape Analysis



Since the early days of the internet, the US government has provided public databases of medical publications and clinical trials, available to researchers and patients alike. PubMed.gov and ClinicalTrials.gov, two libraries of health studies and trials, are easily accessed and can be used to find publications, often with links to the full text, and to view the status of clinical trials. Although they are publicly available, their dense scientific language is best navigated by professionals in medical fields. Research-minded experts have no problem interpreting these documents. A different problem arises, however, when the task is to use these resources to inform institutions of research trends and white space opportunities, and to guide product positioning: the data banks are too large to sort through manually, even with a search scope narrowed by condition and treatment.


For pharmaceutical organizations and other medical institutions, a customized compendium of research materials can be a valuable asset, one that helps shape future research goals and helps teams organize their search. A client sought out Ringer Sciences for this exact purpose: to identify trends in specific treatment methods for a variety of cancers in medical research over time. Strategizing our approach and methods was an iterative process, and we began by constructing the framework for the task. I was slightly familiar with these websites before taking on a professional data analysis request from a client, so I knew the advanced search features at a surface level. Learning how Boolean searches work, and identifying the exact keywords associated with niche publications, has made finding these sources much easier. Using “AND”, “OR”, and “NOT” to pinpoint conditions and treatments can be tricky, but with collaboration on all sides to build a list of relevant search terms, an accurate and complete query can be created. Building the search is only half the battle, however: if you are searching across a number of medications for a particular disease, the search results should be tagged by medication for quick reference.
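To make the Boolean search and tagging steps concrete, here is a minimal sketch in Python. It is not the tooling used on the project: the function names `build_query` and `tag_record`, and the example conditions and treatments, are assumptions chosen purely for illustration.

```python
def build_query(conditions, treatments, excluded=()):
    """Combine term lists into one Boolean search string."""
    def any_of(terms):
        # Quote each term so multi-word phrases are searched as phrases.
        return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

    # A record must match at least one condition AND at least one treatment.
    query = f"{any_of(conditions)} AND {any_of(treatments)}"
    if excluded:
        query += f" NOT {any_of(excluded)}"
    return query


def tag_record(title, treatments):
    """Return the treatments mentioned in a title (case-insensitive)."""
    lowered = title.lower()
    return [t for t in treatments if t.lower() in lowered]


# Hypothetical search terms, for illustration only.
conditions = ["melanoma", "skin cancer"]
treatments = ["pembrolizumab", "nivolumab"]

query = build_query(conditions, treatments, excluded=["review"])
tags = tag_record("Nivolumab plus ipilimumab in advanced melanoma", treatments)
print(query)
print(tags)
```

Simple substring matching like this can miss brand names and abbreviations, so in practice the agreed term list would likely need synonyms for each treatment.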


Initial Goals and Objectives of Research Cultivation

At the outset of this project, we asked how best to capture the research material from PubMed and ClinicalTrials.gov, and we started refining the data output into centralized documents. As we made progress, my team and I realized that there were many steps we could take to improve the output of this customized data search. In addition to the tagged libraries of publications and clinical trials, with data points labeled for mentions of treatments and drugs, it can be beneficial to create an index of the terms found throughout the MeSH (Medical Subject Headings) queries and titles, providing an easily searchable list for closer inspection.
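One way to picture such an index of terms is as a lookup from each term to the records that contain it. The sketch below assumes each record is a dictionary with `id`, `title`, and `mesh` fields; that layout, the function name `build_term_index`, and the sample records are all hypothetical.

```python
from collections import defaultdict


def build_term_index(records):
    """Map each lowercased term to the sorted IDs of records containing it."""
    index = defaultdict(set)
    for rec in records:
        # Index MeSH headings as whole phrases.
        for heading in rec["mesh"]:
            index[heading.lower()].add(rec["id"])
        # Index title words individually, stripping surrounding punctuation.
        for word in rec["title"].lower().split():
            word = word.strip(".,;:()")
            if word:
                index[word].add(rec["id"])
    return {term: sorted(ids) for term, ids in sorted(index.items())}


# Hypothetical records, for illustration only.
records = [
    {"id": "A1", "title": "Immunotherapy in melanoma",
     "mesh": ["Melanoma", "Immunotherapy"]},
    {"id": "A2", "title": "Melanoma screening outcomes",
     "mesh": ["Melanoma"]},
]

index = build_term_index(records)
print(index["melanoma"])
```

Looking up a term then gives the list of records to open for closer inspection.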


Tracking the volume of research across multiple time periods makes it possible to construct charts that highlight how these publications and studies change over time. By showcasing these trends, parties can see what kinds of research have become more frequent in recent years. The Ringer team also sought to include a supplementary user manual detailing how to use the data files. All of these features focus on clarity and usability, something the raw output from these databases lacks.
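The numbers behind such a trend chart can be as simple as a count of tagged records per year. This is a minimal sketch, assuming each record carries a `year` and a list of `tags`; the field names, function names, and sample data are hypothetical.

```python
from collections import Counter


def yearly_counts(records, tag):
    """Count the records carrying a given tag, grouped by publication year."""
    counts = Counter(r["year"] for r in records if tag in r["tags"])
    return dict(sorted(counts.items()))


def year_over_year_change(counts):
    """Difference between each year and the previous year present in the data."""
    years = sorted(counts)
    return {cur: counts[cur] - counts[prev] for prev, cur in zip(years, years[1:])}


# Hypothetical tagged records, for illustration only.
records = [
    {"year": 2019, "tags": ["drug_a"]},
    {"year": 2020, "tags": ["drug_a"]},
    {"year": 2020, "tags": ["drug_a", "drug_b"]},
    {"year": 2021, "tags": ["drug_a"]},
    {"year": 2021, "tags": ["drug_a"]},
    {"year": 2021, "tags": ["drug_a"]},
    {"year": 2021, "tags": ["drug_b"]},
]

counts = yearly_counts(records, "drug_a")
change = year_over_year_change(counts)
print(counts)
print(change)
```

Note that years with no matching records are absent from the counts rather than listed as zero, so a charting step would need to fill those gaps before plotting.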


Process Reflections and the Potential of Automation

Beyond the usefulness of data science for tagging and tracking these research documents, a fair amount of this process lends itself to automation in the near future. Because building a complete library involves running the same query repeatedly across different time periods, automation can handle this repetition and produce the output. Machine learning is a powerful tool that can help automate curation of the data through learned models. Refining this process is a matter of partnership with the client, reaching agreement on terms and methods to deliver the highest quality product.
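The repetition described here, the same query run once per time period, is straightforward to script. The sketch below is an illustration only: `time_windows`, `run_across_windows`, and the `search` callable are hypothetical names, the date filter assumes PubMed's `[dp]` (publication date) range syntax, and the stand-in search function simply echoes the query instead of contacting any database.

```python
def time_windows(start_year, end_year, span):
    """Split an inclusive year range into consecutive windows of `span` years."""
    windows = []
    year = start_year
    while year <= end_year:
        # The final window is shortened if the range does not divide evenly.
        windows.append((year, min(year + span - 1, end_year)))
        year += span
    return windows


def run_across_windows(base_query, windows, search):
    """Run the same query once per window, keyed by (first_year, last_year)."""
    results = {}
    for first, last in windows:
        # Assumed PubMed-style publication date range filter.
        dated_query = f"({base_query}) AND {first}:{last}[dp]"
        results[(first, last)] = search(dated_query)
    return results


def echo_search(query):
    """Stand-in for a real search call; returns the query it was given."""
    return query


windows = time_windows(2010, 2021, 5)
results = run_across_windows("melanoma AND nivolumab", windows, echo_search)
for window, outcome in results.items():
    print(window, outcome)
```

Passing the search step in as a function keeps the looping logic separate from whichever database client is eventually used.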


This approach of querying and tagging large amounts of research, and analyzing the trends of such studies and trials, is an incredibly beneficial practice for deciding which conditions and treatments to concentrate on. As clients invest resources into data-focused studies, it is imperative to assess white space within both research and the industry. A consolidated database of research not only aids greatly in organizing work at the individual level but can also help illustrate overall trends. It is my hope that refinements to this process of searching for publications and clinical trials will increase the speed of scientific advancement and, by informing market research and positioning for pharma companies, lead to a happier and healthier future.

