Conor's Blog
The VA’s 45 billion triples
For us, medical expression boils down to triples, simple subject-predicate-object statements …
The problem is Hypertension
of-patient "Joe Smith"
recorded-on 2011-10-21
diagnosed-by "Dr Fred Jones"
… one triple after another, capturing a patient’s health. So a key metric for us is: how many triples does a typical patient get? Without a figure, we can’t know how big a patient triple store would be, or how big a complete CCD or Caregraf is.
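To make the counting concrete, here is a minimal sketch of the statements above held as plain subject-predicate-object tuples. The subject identifier `problem/1` and the helper names are made up for illustration; they are not FMQL or Caregraf identifiers.

```python
from collections import Counter

# Hypothetical identifier for "the problem"; the example above doesn't name one.
PROBLEM = "problem/1"

# Each statement is one (subject, predicate, object) tuple.
triples = [
    (PROBLEM, "is", "Hypertension"),
    (PROBLEM, "of-patient", "Joe Smith"),
    (PROBLEM, "recorded-on", "2011-10-21"),
    (PROBLEM, "diagnosed-by", "Dr Fred Jones"),
]


def count_triples(statements):
    """The metric we care about: how many statements there are."""
    return len(statements)


def predicates_used(statements):
    """Tally how often each predicate appears."""
    return Counter(predicate for _, predicate, _ in statements)


print(count_triples(triples))
print(predicates_used(triples))
```

Counting a whole patient is the same operation over a much longer list.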
For an answer, we need access, direct unencumbered access, to an EHR, and thanks to FMQL we can have that for the VA’s VistA, the most widely deployed EHR. Of course, we need FMQL on a real VistA, not just on our demo system, and thanks, indirectly (ssh!), to the VA themselves, we have that too.
So some numbers. Take a very sick man …
Number of Visits: 116
Number of Problems: 38
Number of Orders: 240
Number of Concepts: 1478
...
Number of Triples: 44,281
Size of Caregraf (Turtle format): 2.7 MB
44,281 – that’s a lot of statements. But hold on, you say, number of orders? That means you’re counting the process of care, health-care workflow, as opposed to just descriptions of a patient’s health. True, and worse, we’re also capturing billing information. In fact, for this man, we’re capturing 163 types of medical fact. Yes, Prescription, but Order and Billing information too. If we home in on “facts about patient health”, we end up with 19,352 triples, which is still a lot. Can we pare down more?
Well, if we exclude many inpatient medications, IVs and their ilk, nix inactive problems, cut out the nursing notes, cut all the stuff of immediate care that isn’t relevant for subsequent care – down comes the number to 5000 or so.
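As a rough check on how much each cut removes, this sketch works out the share of the original 44,281 triples that survives each step. The 5000 figure is the "or so" estimate from above, and the function name is just for illustration.

```python
ALL_TRIPLES = 44_281        # everything captured for this patient
HEALTH_FACTS = 19_352       # "facts about patient health" only
SUBSEQUENT_CARE = 5_000     # approximate: relevant for subsequent care


def share(part, whole):
    """Fraction of `whole` that `part` represents."""
    return part / whole


print(f"health facts:    {share(HEALTH_FACTS, ALL_TRIPLES):.1%}")
print(f"subsequent care: {share(SUBSEQUENT_CARE, ALL_TRIPLES):.1%}")
```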
We’ve dealt with hospitals with information on 200,000 patients. In the unlikely event that all their patients were 5000-triple sick, that adds up to 1 billion triples. Doable? Can current triple stores handle a billion?
DBpedia, which holds the structured information of Wikipedia, has over a billion triples in its store. In W3C’s discussion of large triple stores, five to fifteen billion triples is commonplace and one trillion is the new benchmark. On top of this, SPARQL, the SQL of triple stores, now supports distributed querying, so you have the scale of the web itself on your side.
Which gets me back to the VA. They have medical data on nine million people. If these were all 5000-triple sick then we get 45 billion triples. Not a trillion, not close. The VA could put all their patient data in one triple store and query away. Begone so much plumbing, this-and-that custom extraction mechanism. One store, to answer anything. Why not?
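The back-of-the-envelope sums, spelled out. This is only the multiplication from the post, assuming every patient carries the pared-down 5000 triples; `store_size` is a name made up for this sketch.

```python
TRIPLES_PER_PATIENT = 5_000          # pared-down per-patient estimate
ONE_TRILLION = 1_000_000_000_000     # the "new benchmark" for large stores


def store_size(patients, triples_per_patient=TRIPLES_PER_PATIENT):
    """Total triples if every patient has the same number of triples."""
    return patients * triples_per_patient


hospital = store_size(200_000)       # a 200,000-patient hospital
va = store_size(9_000_000)           # the VA's nine million people

print(f"hospital: {hospital:,} triples")
print(f"VA:       {va:,} triples")
print(f"VA as a share of a trillion: {va / ONE_TRILLION:.1%}")
```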

Your Thoughts?