Big Data

For Better Hearts


How to Use Cardiovascular Data to Develop an Open Access Informatics Platform? - An interview with Richard Dobson and James Teo

Richard Dobson, Professor in Medical Informatics at University College London, Professor and Head of Bioinformatics at King's College London, and WP3 co-lead of BigData@Heart, and James Teo, Consultant Neurologist at King's College Hospital NHS Foundation Trust, were interviewed for BigData@Heart by Duane Schulthess, Managing Director of Vital Transformation. The interview gives an overview of their current work in BigData@Heart’s Work Package 3 and the use of cardiovascular data to develop an open access informatics platform for atrial fibrillation and general heart failure.

Duane Schulthess: I’d like to thank the two of you for your time today. Can you please provide an overview of your current work in BigData@Heart’s Work Package 3 and the use of cardiovascular data to develop an open access informatics platform for atrial fibrillation and general heart failure?

Richard Dobson: The first objective of the Work Package is to do a systematic literature review, so to survey what's out there and get a picture of the data sets at our disposal and the overall landscape. Novartis are leading on that, along with the European Society of Cardiology (ESC). We comprehensively characterised data sources on the basis of data type, design, region, country and type of cardiovascular (CV) disease, with an initial emphasis on heart failure (HF). This has resulted in a comprehensive data catalogue of 362 data sources included in non-interventional studies of HF and disseminated in the public domain.

The next stage is to think about how we can use that information to build up a metadata catalogue, or in other words, a fingerprint of the data landscape with a view to facilitating interactions between people to identify opportunities for data sharing. With that systematic literature review, we're creating an opportunity for generating a resource that allows people to look up datasets as well as the kind of variables that are available in those datasets, including the contact details for principal investigators.

This systematic literature review and dataset will then provide a tool for facilitating interactions, networking, and collaborations between researchers working in cardiovascular disease (CVD). It doesn't provide a platform for sharing the data itself; it provides a platform for sharing the metadata that describes the various datasets.

DS: So, when you talk about sharing the metadata information, it's not necessarily the raw patient data, it's the metrics and outcomes contained in the data, is that what you're focusing on?

RD: Right. For example, you may be a researcher interested in finding out which datasets are available for heart failure that also have been genotyped, and also have particular characteristics you need.  Perhaps you need a certain age range, or maybe a certain ethnicity? This will enable you to query these datasets to identify which meet your selection criteria for the studies you're interested in researching.

It provides a view of the landscape in terms of the data that are available and then facilitates your engagement with the data owners and data controllers to negotiate, collaborate and share these data sets.
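
The kind of metadata lookup described above can be illustrated with a minimal Python sketch. It is not the BigData@Heart catalogue itself: the `DatasetFingerprint` fields, the `find_datasets` helper, and every catalogue entry below are hypothetical, chosen only to mirror the criteria mentioned in the interview (disease, genotyping, age range, ethnicity, and a contact for the principal investigator).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetFingerprint:
    """Metadata describing one dataset; no patient-level data is held here."""
    name: str
    disease: str
    genotyped: bool
    min_age: int
    max_age: int
    ethnicities: frozenset
    contact: str


def find_datasets(catalogue, disease, genotyped=None, age_range=None, ethnicity=None):
    """Return the fingerprints that satisfy every criterion that was supplied."""
    matches = []
    for fp in catalogue:
        if fp.disease != disease:
            continue
        if genotyped is not None and fp.genotyped != genotyped:
            continue
        if age_range is not None:
            low, high = age_range
            # Keep the dataset only if its age span overlaps the requested one.
            if fp.max_age < low or fp.min_age > high:
                continue
        if ethnicity is not None and ethnicity not in fp.ethnicities:
            continue
        matches.append(fp)
    return matches


# Hypothetical catalogue entries, invented purely for illustration.
CATALOGUE = [
    DatasetFingerprint("Registry A", "heart failure", True, 40, 90,
                       frozenset({"white", "south asian"}), "pi-a@example.org"),
    DatasetFingerprint("Cohort B", "heart failure", False, 18, 65,
                       frozenset({"white"}), "pi-b@example.org"),
    DatasetFingerprint("Cohort C", "atrial fibrillation", True, 50, 95,
                       frozenset({"white", "black"}), "pi-c@example.org"),
]

if __name__ == "__main__":
    hits = find_datasets(CATALOGUE, "heart failure", genotyped=True, age_range=(60, 80))
    for fp in hits:
        print(fp.name, fp.contact)
```

The query returns only descriptions and contact details, so the researcher still has to approach the data controller to negotiate access, as described in the interview.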

DS: This type of distributed network approach has been very successful with projects such as Sentinel, the FDA-funded outcomes research database. It has the benefit of avoiding issues around data access and interoperability. Do you see this approach, keeping the data stored and owned in place where it lives and then pulling only the results, as the natural way forward?

RD: I think it's a pragmatic solution. Obviously, there are lots of legacy data sets that haven't been necessarily generated with interoperability in mind. Some of these are quite old and they've not necessarily been managed or set-up using data standards that we would use if we were setting them up from scratch today.

So, in terms of what we're doing, we're generating a fingerprint template which we then ask people to complete in order to describe their dataset. The underlying data itself may still not be harmonised, but it's a fairly low-cost way to generate an overview of the landscape and facilitate collaboration.

In parallel work, we're taking selected datasets such as CALIBER, SwedeHF, SwedeHEART and ABUCASIS, and mapping the data from these studies into the Observational Medical Outcomes Partnership (OMOP) common data model so they are in a harmonised state.

DS: Essentially, OMOP will give you one Rosetta Stone that you can use to search across these legacy data sets from these studies.

RD: Yes, so we’re doing these two things in parallel; we’ll get this broad view of the data that's available with the ability to query the metadata, and then on selected datasets from these specific studies we're very explicitly taking them and harmonising them so you're able to query across the data at the individual level.
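
To make the idea of harmonisation concrete, here is a small Python sketch under stated assumptions. The two source names, the `CONCEPT_MAP` lookup and the concept ids (900001, 900002) are hypothetical placeholders, not real OMOP concept ids and not the project's actual mapping. The output keys follow column names from the OMOP `condition_occurrence` table, and 0 is used for codes that have no mapping.

```python
from datetime import date

# Hypothetical lookup from (source, local code) to a shared concept id.
# The ids are placeholders for illustration, NOT real OMOP concept ids.
CONCEPT_MAP = {
    ("registry_x", "HF"): 900001,   # local label for heart failure
    ("hospital_y", "I50"): 900001,  # ICD-10 code for heart failure
    ("registry_x", "AF"): 900002,   # local label for atrial fibrillation
    ("hospital_y", "I48"): 900002,  # ICD-10 code for atrial fibrillation and flutter
}


def to_condition_occurrence(source, person_id, local_code, start):
    """Map one source diagnosis onto a row shaped like OMOP condition_occurrence."""
    return {
        "person_id": person_id,
        "condition_concept_id": CONCEPT_MAP.get((source, local_code), 0),  # 0 = unmapped
        "condition_start_date": start,
        "condition_source_value": local_code,  # keep the original code for audit
    }


def persons_with_concept(rows, concept_id):
    """Query the harmonised rows without caring which source they came from."""
    return sorted({row["person_id"] for row in rows if row["condition_concept_id"] == concept_id})


if __name__ == "__main__":
    # Invented records from two differently coded sources.
    rows = [
        to_condition_occurrence("registry_x", 1, "HF", date(2015, 3, 1)),
        to_condition_occurrence("hospital_y", 2, "I50", date(2016, 7, 12)),
        to_condition_occurrence("hospital_y", 3, "I48", date(2017, 1, 5)),
    ]
    print(persons_with_concept(rows, 900001))
```

Once both sources share one concept id per condition, a single query covers them all, which is the "Rosetta Stone" effect discussed above.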

DS: Many of your recent articles highlight that the vast majority of valuable health data in the world is unstructured. How does your work harnessing technology to access unstructured data come into play in the context of BigData@Heart?

RD: There's a lot of information stored in free text which is incredibly valuable but usually underused. It comes with its own set of challenges, but it's incredibly important to be able to get access to this treasure trove of additional deep phenotypic information that goes beyond the thin supernatant of structured information.

Clinicians often commit information in narrative form that they wouldn't be prepared to commit in structured form, and they provide more finely detailed levels of information in those free texts. We think that it's important to generate a more complete view of the patient, including those subtleties relating to comorbidities, or subtle effects of treatment or treatment response. In order to stratify patients, it's important to have this additional information relating to comorbidities, which provides the opportunity for early markers of outcomes with information that's captured but often unstructured.

Because it is unstructured, we’ve worked very hard to find ways of using natural language processing to map the data and make it available for searching, and inferring structure based on knowledge bases and ontologies. This enables us to generate more accurate and complete views by combining this unstructured and structured information.

DS: Are you building your own tools, or are you working with outside vendors for your natural language processing platforms?

RD: We have a history of using natural language processing open-source tools.

James Teo: We mainly use systems built in-house from open-source components for the natural language processing, e.g. GATE and SemEHR. I think the principal issue is that as data systems evolve, the nomenclature, the language, keeps changing over time, and one has to build systems which are adaptable and not tied to certain nomenclatures or arcane forms of information or ontologies. We've largely worked on unstructured as well as structured data, in that sense, so that allows us the flexibility to impose structure during the interpretation phase rather than putting ourselves in a straitjacket.

DS: Are you concerned that there are going to be the same lock-in problems with natural language processing, being tied to a vendor by contract, that we've seen with EHRs and vendor-specific data platforms? Is this what you're trying to avoid?

JT: Yeah, I think there is a risk that may happen. Certainly, in clinical practice there's a difference in language, nomenclature, acronyms, and data recording which risks us ending up speaking different languages. Languages evolve, and so we need to have some sort of natural language processing which is adaptable.

RD: We've worked very hard to develop generalised pipelines, and we're very aware of the problems of different hospitals having different proprietary systems which are kind of locked down and suffer from issues of interoperability. So, we've tried to focus on using open source tools to generate pipelines and toolkits that will pull data from heterogeneous data sources, harmonise that data, and optionally de-identify that data. It serves that data up for enterprise search capability as well as specific kinds of natural language processing toolkits and applications that we've developed generating semantic annotation from the underlying unstructured data.

That's all open-source and available for other people to use. We've had considerable success with generating access for research, but also because of the way that we pull the data in a timely fashion. The way the toolkit is implemented, we're able to send information back to clinicians and have queries running close to real-time. In this way we’re able to spot events that happen and respond rapidly, generating the right kind of access to the right people for the right data at the right time.

DS: How do you think we could use this real-world observational data, this heterogeneous data that's unstructured, to ensure better validity of outcomes as well as improve research in cardiovascular disease specifically?

JT: From my perspective, most of the studies which involve new therapies largely work around screening out comorbidities in the exclusion criteria, and I think with observational data we are able to start including patients with these comorbidities to perform virtual trials. Using patients with additional comorbidities, one might expect to see different population behaviour or different effect size of interventions. When you bring these comorbidities into research, you limit your ability to use randomised controlled trials (RCTs).

Another area that is very useful is the ability to track off-licence indications. The existing medicines regulatory framework classes this as Phase IV trials, but there isn’t any systematic process for capturing this data, and there is a lot of clinician behaviour using a variety of therapies for off-licence indications where you have anecdotal evidence and anecdotal reports of these therapies working. This is a situation where virtual trials can be conducted when a conventional trial would never be funded.

RD: With RCTs, we're reporting effects across a narrow trial population. They're costly and difficult to implement, and there's this reporting of average effects across the population over a short period of time. I think there are issues with patient and clinician discomfort with randomisation, and there's this incredibly long pathway to translation. I think we're able to get at much greater insights using real-world data (to augment RCTs, not replace them); it's much broader, larger and captures the longitudinal use of medications. It captures the intended, as well as the unintended, consequences in subgroups of people.

JT: As well, there's now an increasing interest in the UK in funding trials which involve repurposing drugs for alternative indications, especially drugs where the original patent has expired. If you have observational data to suggest that off-licence use may be anecdotally effective, one could run a virtual trial of the off-licence use to see whether it is worthwhile to do a proper trial for a full licence.

DS: In what core clinical areas of cardiovascular disease do you see the desire to harness these data sets of real-world evidence and unstructured data to address unmet medical needs?

JT: I think the biggest cardiovascular area will be atrial fibrillation (AF). A lot of information for clinical decision making and risk scoring is captured in structured and unstructured data with varying nomenclature from a variety of sources, and AF is an area with plenty of accessible treatments for preventable, high-cost disease like stroke. Clinical scores for decision-making and for deciding on appropriate treatment (CHA2DS2-VASc and HAS-BLED) are widely used, but the comorbidity information that feeds them is also held in a variety of formats and sources.

By unifying and harmonising the data, BigData@Heart will allow development of tools which derive this information algorithmically. Daniel Bean, HDR UK research fellow, and I are working on developing an automated natural language processing tool for deriving these scores from the free text of all documents.
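
As a rough illustration of deriving such a score from free text, the Python sketch below computes CHA2DS2-VASc from comorbidities spotted in a clinical note. It is not the tool described in the interview: the keyword patterns and the sample note are assumptions made for this example, and plain keyword spotting ignores negation ("no history of stroke"), family history and misspellings, all of which a real natural language processing pipeline would need to handle. The point weights follow the published CHA2DS2-VASc definition.

```python
import re

# Illustrative keyword patterns; an assumption, not a validated clinical vocabulary.
COMORBIDITY_PATTERNS = {
    "chf": r"\b(heart failure|chf)\b",
    "hypertension": r"\b(hypertension|htn)\b",
    "diabetes": r"\b(diabetes|t1dm|t2dm)\b",
    "stroke_tia": r"\b(stroke|tia|thromboembolism)\b",
    "vascular": r"\b(myocardial infarction|peripheral arterial disease|aortic plaque)\b",
}


def extract_comorbidities(text):
    """Return the comorbidity labels whose keywords appear in the note (no negation handling)."""
    lowered = text.lower()
    return {label for label, pattern in COMORBIDITY_PATTERNS.items()
            if re.search(pattern, lowered)}


def cha2ds2_vasc(age, female, comorbidities):
    """Compute the CHA2DS2-VASc stroke risk score (range 0 to 9)."""
    score = 0
    score += 1 if "chf" in comorbidities else 0           # C: congestive heart failure
    score += 1 if "hypertension" in comorbidities else 0  # H: hypertension
    if age >= 75:                                         # A2: age 75 or over
        score += 2
    elif age >= 65:                                       # A: age 65 to 74
        score += 1
    score += 1 if "diabetes" in comorbidities else 0      # D: diabetes mellitus
    score += 2 if "stroke_tia" in comorbidities else 0    # S2: stroke, TIA or thromboembolism
    score += 1 if "vascular" in comorbidities else 0      # VA: vascular disease
    score += 1 if female else 0                           # Sc: sex category (female)
    return score


# Invented note, for illustration only.
NOTE = "78 year old woman with AF. History of hypertension and T2DM. Previous TIA in 2015."

if __name__ == "__main__":
    found = extract_comorbidities(NOTE)
    print(sorted(found), cha2ds2_vasc(age=78, female=True, comorbidities=found))
```

Age and sex are passed in as structured values here, which reflects the interview's point that a complete score draws on both structured fields and free text.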

RD: Our goal is to intervene ahead of time, and move towards more complex, actionable analytics that might predict stroke in AF, for example, and then generate great research. Because, if we can serve the data up in a timely fashion, near real-time, then it gives you the opportunity to develop decision assistance or decision support tools for medicine optimisation.

Published on: 03/19/2018