Health 3.0: Populating a Snomed CT Property Graph with Synthetic Patient Data

Apr 8, 2022

By John Newberry, Remedy Biomedical & Machine Learning Development Lead

Bioinformatics is the future of healthcare. AI and Machine Learning have the opportunity to flourish wherever large amounts of information can be found, and the human body is an incredible store of information. When combined with population statistics, this is a staggering amount of information. However, access to real-world health information is difficult because of security and privacy issues. A partial solution is the generation of synthetic patient information. Using the Snomed CT ontology we can create the skeleton of a property graph to which we can bind synthetic patient data. This can be accomplished using three tools: the Snomed CT ontology, Synthea (synthetic patient generation), and the Neo4j graph database. From here the possibilities are endless, but more on that later!

First we will create a graph instance of Snomed CT using Neo4j. Next we will generate synthetic patient data using Synthea, then import this information into our graph, and finally visualize it with Neo4j Bloom.

Requirements

Access to Snomed CT
Neo4j Desktop
Snomed database loader on Github
Synthea synthetic Patient generation on Github

Snomed CT Download

The Snomed CT Ontology is an incredibly robust representation of medical knowledge including clinical findings, symptoms, diagnoses, procedures, body structures, organisms and other etiologies, substances, pharmaceuticals, devices and specimens. Downloading the dataset can be tricky, but if you live in a member country the download can be found here. For this case make sure to download the latest International edition; the download should include Delta, Full, and Snapshot folders.

Loading Snomed CT into a Neo4j Graph

Neo4j is a powerful graph database tool that comes packaged with a variety of tools and plugins ranging from graph visualization to graph data analytics. There are a variety of ways to access the architecture but for this case we will be using Neo4j Desktop which can be downloaded here.

We will be using the snomed-database-loader github repository to load Snomed into a Neo4j instance. This is all open-source information but special thanks to Scott Campbell and his team from the University of Nebraska Medical Center, Omaha, NE for making this available.

This process requires at least Java 8 and at least Python 3.5 with the py2neo Python library. Once these requirements are met, open up Neo4j Desktop, create a project, start up a local DBMS and set a password. From here, some changes must be made in the settings, which can be found to the right of the DBMS:

Initial and max heap sizes must be set to 4GB, as shown below.
Comment out the line constraining the import file location; this will allow the import file to be specified in the run command.
Uncomment the line allowing CSV import; this will be important later.

## Java Heap Size
dbms.memory.heap.initial_size=4G
dbms.memory.heap.max_size=4G

## Together these allow import from anywhere on the system
#dbms.directories.import=import
dbms.security.allow_csv_import_from_file_urls=true
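As a quick sanity check, the three settings above can be verified from a script before starting the load. The following is only a minimal sketch in Python: the function names `parse_conf` and `check_neo4j_conf` are hypothetical, the setting names are copied from the listing above, and the code only inspects configuration text; it does not talk to Neo4j.

```python
def parse_conf(text):
    """Return a dict of the active (uncommented) key=value settings."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def check_neo4j_conf(text):
    """List the problems that would get in the way of the steps in this article."""
    conf = parse_conf(text)
    problems = []
    for key in ("dbms.memory.heap.initial_size", "dbms.memory.heap.max_size"):
        # Assumption: 4G and 4096m are treated as the same heap size.
        if conf.get(key, "").lower() not in ("4g", "4096m"):
            problems.append(key + " should be 4G")
    if "dbms.directories.import" in conf:
        problems.append("dbms.directories.import is still active; comment it out")
    if conf.get("dbms.security.allow_csv_import_from_file_urls", "").lower() != "true":
        problems.append("dbms.security.allow_csv_import_from_file_urls should be true")
    return problems
```

Pasting the contents of the settings panel into `check_neo4j_conf` should return an empty list once all three changes have been made.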

Next, locally, set up a folder with the Snomed CT release files, the snomed-database-loader GitHub repository, and an empty folder named Output. The Output folder must be empty on execution, so if there are issues with loading and a restart is necessary, make sure to empty it before restarting. Now it's time to start up your database: hit the start button in Neo4j! Once the DB is running, open the terminal from Neo4j (below the settings tab we just used), cd into the NEO4J folder of the snomed-database-loader repository (C:\Users\…\snomed-database-loader-master\NEO4J), and input the code below once all prerequisites are met!

python snomed_g_graphdb_build_tools.py db_build --release_type full --mode build --action create --rf2 <Full-rf2-release-directory> --neopw <password> --output_dir <output-directory-path>

Make sure <Full-rf2-release-directory> points to the Full folder within the Snomed download, <password> is the password set previously, and <output-directory-path> is the empty Output folder.
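Because a non-empty Output folder is an easy thing to forget after a failed run, a small pre-flight check can help. This is a sketch only, written in Python since the loader already requires it; the function name `preflight` and its two parameters are hypothetical and are not part of snomed-database-loader.

```python
import os


def preflight(rf2_full_dir, output_dir):
    """Return a list of problems to fix before running the loader."""
    problems = []
    # The --rf2 argument must point at an existing Full release folder.
    if not os.path.isdir(rf2_full_dir):
        problems.append("RF2 Full directory not found: " + rf2_full_dir)
    # The --output_dir folder must exist and contain nothing.
    if not os.path.isdir(output_dir):
        problems.append("Output directory not found: " + output_dir)
    elif os.listdir(output_dir):
        problems.append("Output directory is not empty: " + output_dir)
    return problems
```

Calling `preflight` with the same two paths you intend to pass to `--rf2` and `--output_dir` returns an empty list when it is safe to start the build.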

Once the code is running it will take some time to import the entire ontology into Neo4j, but the following output should indicate a correct install!

sequence did not exist, primed
JOB_START
FIND_ROLENAMES
FIND_ROLEGROUPS
MAKE_CONCEPT_CSVS
MAKE_DESCRIPTION_CSVS
MAKE_ISA_REL_CSVS
MAKE_DEFINING_REL_CSVS
TEMPLATE_PROCESSING
CYPHER_EXECUTION
CHECK_RESULT
JOB_END
RESULT: SUCCESS

For more specific installation information follow the instructions in the readme.

Once this is done you can now use Cypher to query the Database using Neo4j Browser or visualize the data using Neo4j Bloom! Some work needs to be done to make effective use of the database but if you play around in Bloom, you should have something that looks similar to this at this point!

Zoomed view of graph so far using Neo4j Bloom.

This graph will act as the skeleton to which we will bind all of the patient information.

Generating Synthetic Patient Data

Synthea is a repository used for the generation of synthetic patient data. A portion of this data is generated with the associated Snomed CT coding and can therefore be used here. Synthea patient generation is pretty simple but can get quite complex depending on your needs. In our case we just need simple generation and output using CSV files.

Load the git repo, cd into the folder and run the code below to initialize the system:

run_synthea

We want the output to be a series of CSV files, and for now we want 150 patients. This can be done using the following code:

run_synthea -p 150 --exporter.csv.export true

You can find the output we are looking for in C:\…\synthea-master\output\csv. It will have the files shown below:

Synthea output.

Feel free to comb through this data but we are only interested in any Snomed CT compatible data. This will include the allergies, careplans, conditions, devices, patients, procedures, and supplies files.

Integrating Patient Data into Graph

Now for the meat of this project: patient data integration into our graph database. A quick data manipulation note: not all allergies had Snomed coding, so these needed to be converted. The names of the substances can be searched using the Snomed CT Browser Tool, and the correct coding found there. Additionally, patient names are generated with numbers; these were removed as demonstrated here.
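One way the name clean-up could be done is sketched below in Python. It rests on a few assumptions: that the raw name columns in patients.csv are called FIRST and LAST, that the cleaned values are written to new columns named FIRSTN and LASTN (inferred from the Cypher used later in this article), and that names look like the illustrative "Abe604". The function names are hypothetical.

```python
import csv
import io
import re


def strip_digits(name):
    """Remove the digits appended to a generated name."""
    return re.sub(r"\d+", "", name).strip()


def add_clean_names(csv_text):
    """Return the CSV text with FIRSTN and LASTN columns appended.

    FIRST/LAST as input columns and FIRSTN/LASTN as output columns
    are assumptions, not something the original data guarantees.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames) + ["FIRSTN", "LASTN"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["FIRSTN"] = strip_digits(row["FIRST"])
        row["LASTN"] = strip_digits(row["LAST"])
        writer.writerow(row)
    return out.getvalue()
```

The original columns are left untouched, so the raw Synthea values remain available if they are needed later.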

Each CSV has tons of information, but we are only concerned with two columns: the patient ID (PATIENT) and the Snomed code (CODE).
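To get a feel for what those two columns contain before loading anything, the pairs can be pulled out with a few lines of Python. This is a sketch under stated assumptions: the function name `patient_code_pairs` is hypothetical, and the de-duplication is an addition of this example, not a step from the workflow above. It is worth knowing about because a file can hold several rows for the same patient and code, and a Cypher CREATE would add one relationship per row.

```python
import csv
import io


def patient_code_pairs(csv_text):
    """Return the distinct (PATIENT, CODE) pairs in first-seen order."""
    seen = set()
    pairs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        pair = (row["PATIENT"], row["CODE"])
        if pair not in seen:  # skip repeated rows for the same patient and code
            seen.add(pair)
            pairs.append(pair)
    return pairs
```

Comparing the length of this list with the number of rows in the file gives a rough idea of how many repeated relationships a straight import would produce.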

First, we need to generate new nodes for each patient based on their unique identifiers (PATIENT). Open up the Neo4j Browser and input the following code, which first places a constraint on the database, preventing multiple nodes from being created for patients with the same ID.

// Constrains DB to prevent Patient node duplication based on ID
CREATE CONSTRAINT personIdConstraint FOR (patient:Patient) REQUIRE patient.pid IS UNIQUE

Next, we load the patients.csv file and create unique nodes with the label Patient, giving each node 3 properties: a unique ID, a first name, and a last name. When using Cypher, Neo4j's graph query language, it is important that each block be run separately; do not concatenate these blocks into one script.

// Loads CSV, and creates nodes with pid, first and last names
LOAD CSV WITH HEADERS FROM "file:///C:/Users/.../synthea-master/output/csv/patients.csv" AS csv
CREATE (p:Patient {pid:csv.Id, fname: csv.FIRSTN, lname: csv.LASTN})
RETURN p

Now to start adding some relationships between the Snomed CT skeleton and each patient! The code below matches nodes with the corresponding patient ID, and the Snomed CT code and then creates a HAS relationship between the two nodes and returns 10 examples of this.

LOAD CSV WITH HEADERS FROM "file:///C:/Users/.../synthea-master/output/csv/allergies.csv" AS csv
MATCH (p:Patient {pid:csv.PATIENT})
MATCH (n:ObjectConcept {sctid:csv.CODE})
CREATE (p)-[:HAS]->(n)
RETURN p, n
LIMIT 10

The next few steps simply follow this pattern: loading the corresponding CSV, matching on patient ID and Snomed CT code, and creating a relationship. Next is careplans.csv.

LOAD CSV WITH HEADERS FROM "file:///C:/Users/.../synthea-master/output/csv/careplans.csv" AS csv
MATCH (p:Patient {pid:csv.PATIENT})
MATCH (n:ObjectConcept {sctid:csv.CODE})
CREATE (p)-[:HAS]->(n)
RETURN p, n
LIMIT 10

On to a big one, conditions.csv.

LOAD CSV WITH HEADERS FROM "file:///C:/Users/.../synthea-master/output/csv/conditions.csv" AS csv
MATCH (p:Patient {pid:csv.PATIENT})
MATCH (n:ObjectConcept {sctid:csv.CODE})
CREATE (p)-[:HAS]->(n)
RETURN p, n
LIMIT 10

Next is devices.csv; note that the relationship changes here to USED.

LOAD CSV WITH HEADERS FROM "file:///C:/Users/.../synthea-master/output/csv/devices.csv" AS csv
MATCH (p:Patient {pid:csv.PATIENT})
MATCH (n:ObjectConcept {sctid:csv.CODE})
CREATE (p)-[:USED]->(n)
RETURN p, n
LIMIT 10

For procedures.csv, the relationship changes to UNDERWENT.

LOAD CSV WITH HEADERS FROM "file:///C:/Users/.../synthea-master/output/csv/procedures.csv" AS csv
MATCH (p:Patient {pid:csv.PATIENT})
MATCH (n:ObjectConcept {sctid:csv.CODE})
CREATE (p)-[:UNDERWENT]->(n)
RETURN p, n
LIMIT 10

Finally, supplies.csv with relationship USED.

LOAD CSV WITH HEADERS FROM "file:///C:/Users/.../synthea-master/output/csv/supplies.csv" AS csv
MATCH (p:Patient {pid:csv.PATIENT})
MATCH (n:ObjectConcept {sctid:csv.CODE})
CREATE (p)-[:USED]->(n)
RETURN p, n
LIMIT 10
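Since the six blocks above differ only in file name and relationship type, they can also be generated rather than typed. The Python sketch below only builds the Cypher text; it does not connect to Neo4j, and each statement still needs to be run separately. The names `RELATIONSHIPS`, `load_statement`, `all_statements`, and the `csv_dir_url` parameter are hypothetical; pass in whatever file URL points at your own Synthea csv folder.

```python
# File name -> relationship type, as used in the blocks above.
RELATIONSHIPS = {
    "allergies.csv": "HAS",
    "careplans.csv": "HAS",
    "conditions.csv": "HAS",
    "devices.csv": "USED",
    "procedures.csv": "UNDERWENT",
    "supplies.csv": "USED",
}


def load_statement(csv_dir_url, filename, rel_type):
    """Build one LOAD CSV statement following the pattern in this article."""
    return (
        f'LOAD CSV WITH HEADERS FROM "{csv_dir_url}/{filename}" AS csv\n'
        "MATCH (p:Patient {pid:csv.PATIENT})\n"
        "MATCH (n:ObjectConcept {sctid:csv.CODE})\n"
        f"CREATE (p)-[:{rel_type}]->(n)\n"
        "RETURN p, n\n"
        "LIMIT 10"
    )


def all_statements(csv_dir_url):
    """Return one statement per file, in the order used above."""
    return [load_statement(csv_dir_url, name, rel)
            for name, rel in RELATIONSHIPS.items()]
```

Keeping the mapping in one place also makes it easier to change a relationship type later, or to add the files that were skipped here.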

And with that we have all patient information integrated into the graph. It's a lot of information to visualize, but we can now see all patients and all relationships associated with each of those patients. Below we have an example of all patients with all relationships at once, and a single patient with all of their relationships.

All Patients and all of their relationships using Neo4j Bloom.

Individual patient with all relationships expanded.

Right now it just looks like a bunch of colours and numbers, not useful for us but we will change that shortly.

Applying More Labels to Nodes

At this point there are 4 node Labels, ObjectConcept, RoleGroup, Description, and Patient. Patient and Description are obvious enough, but the other two are less so. RoleGroup is the basis for connections outside of the hierarchical design of Snomed CT. For example, an ankle sprain (disorder) can be due to a traumatic event (event), and these are connected by a RoleGroup node. The ObjectConcept label is the default node label for everything else. Selecting an ObjectConcept node and looking at the FSN property, you can see the Fully Specified Name for each Concept, with a hierarchical term in brackets. We are going to use this term in brackets to generate new labels for these nodes to better help differentiate between Node function.
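To make the "term in brackets" concrete, here is a minimal Python sketch that pulls it out of an FSN string. The function name `semantic_tag` is hypothetical and the example strings in the usage note are illustrative only; the labeling in this article is done in Cypher, not with this code.

```python
import re

# Matches the last bracketed term at the very end of the string.
_TAG = re.compile(r"\(([^()]+)\)\s*$")


def semantic_tag(fsn):
    """Return the term in trailing brackets of an FSN, or None if absent."""
    match = _TAG.search(fsn)
    return match.group(1) if match else None
```

For example, `semantic_tag("Sprain of ankle (disorder)")` gives `"disorder"`, while a string without a trailing bracketed term gives `None`.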

We will be adding labels for the following: substance, finding, body structure, observable entity, organism, product, procedure, disorder, morphologic abnormality, situation, qualifier value, occupation, environment, physical object, physical force, medicinal product, person, ethnic group, cell structure, event, cell, regime/therapy, special concept, social concept, specimen, and record artifact.

Each block of code below queries the database, looks for nodes whose FSN property ends in one of the above terms, and then adds a new label to those nodes. Again, each block must be run separately for proper labeling.

MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '(substance)'
SET n:Substance
RETURN n LIMIT 10

MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '(finding)'
SET n:Finding
RETURN n LIMIT 10

MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '(body structure)'
SET n:BodyStructure
RETURN n LIMIT 10

MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '(observable entity)'
SET n:ObservableEntity
RETURN n LIMIT 10

MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '(organism)'
SET n:Organism
RETURN n LIMIT 10

MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '(product)'
SET n:Product
RETURN n LIMIT 10

MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '(procedure)'
SET n:Procedure
RETURN n LIMIT 10

MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '(disorder)'
SET n:Disorder
RETURN n LIMIT 10

MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '(morphologic abnormality)'
SET n:MorphologicAbnormality
RETURN n LIMIT 10

MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '(situation)'
SET n:Situation
RETURN n LIMIT 10

MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '(qualifier value)'
SET n:QualifierValue
RETURN n LIMIT 10

MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '(occupation)'
SET n:Occupation
RETURN n LIMIT 10

MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '(environment)'
SET n:Environment
RETURN n LIMIT 10

MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '(physical object)'
SET n:PhysicalObject
RETURN n LIMIT 10

MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '(physical force)'
SET n:PhysicalForce
RETURN n LIMIT 10

MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '(medicinal product)'
SET n:MedicinalProduct
RETURN n LIMIT 10

MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '(person)'
SET n:Person
RETURN n LIMIT 10

MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '(ethnic group)'
SET n:EthnicGroup
RETURN n LIMIT 10

MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '(cell structure)'
SET n:CellStructure
RETURN n LIMIT 10

MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '(event)'
SET n:Event
RETURN n LIMIT 10

MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '(cell)'
SET n:Cell
RETURN n LIMIT 10

MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '(regime/therapy)'
SET n:RegimeorTherapy
RETURN n LIMIT 10

MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '(special concept)'
SET n:SpecialConcept
RETURN n LIMIT 10

MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '(social concept)'
SET n:SocialConcept
RETURN n LIMIT 10

MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '(specimen)'
SET n:Specimen
RETURN n LIMIT 10

MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '(record artifact)'
SET n:RecordArtifact
RETURN n LIMIT 10
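The labeling blocks all share one shape, so they can be generated from the list of bracketed terms. The Python sketch below is one possible way to do that; the names `TAGS`, `OVERRIDES`, `tag_to_label`, and `label_statement` are hypothetical, and the output is plain Cypher text that still has to be run one statement at a time.

```python
# The 26 bracketed terms labeled in this article.
TAGS = [
    "substance", "finding", "body structure", "observable entity",
    "organism", "product", "procedure", "disorder",
    "morphologic abnormality", "situation", "qualifier value",
    "occupation", "environment", "physical object", "physical force",
    "medicinal product", "person", "ethnic group", "cell structure",
    "event", "cell", "regime/therapy", "special concept",
    "social concept", "specimen", "record artifact",
]

# Label spellings from this article that plain CamelCase would not produce.
OVERRIDES = {"regime/therapy": "RegimeorTherapy"}


def tag_to_label(tag):
    """Convert a term such as 'body structure' to 'BodyStructure'."""
    if tag in OVERRIDES:
        return OVERRIDES[tag]
    return "".join(word.capitalize() for word in tag.split())


def label_statement(tag):
    """Build the Cypher statement that adds the label for one term."""
    return (
        f"MATCH (n:ObjectConcept) WHERE n.FSN ENDS WITH '({tag})'\n"
        f"SET n:{tag_to_label(tag)}\n"
        "RETURN n LIMIT 10"
    )
```

Printing `label_statement(tag)` for each entry in `TAGS` reproduces the blocks above, which is a convenient way to check the list for typos before running anything.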

Functionally our graph is set! We now just have a few more things to do for proper visualization with Neo4j Bloom, but things are starting to look a little more colourful!

Colour coded labels complete!

Visualization using Neo4j Bloom

Bloom is a powerful visualization tool that is included with Neo4j Desktop. We will just scratch the surface of what is possible here.

Bloom allows you to visualize your data using perspectives, and when you open it up it will have one generated automatically for you. Go through the menu on the left and create a new blank perspective; this may take a few minutes to initialize.

Once initialization is complete, open up the perspective and expand the menu on the left. We are going to add a category for each label that we have created, except for the label ObjectConcept. In total we should have 29 categories: the 26 new labels plus RoleGroup, Description, and Patient.

Once created, categories will show up on both the left and right.

Now when we display the data it will be colour coded according to the label, but when you zoom in it will still just display numbers on each node. This is an easy fix. Each node has a number of properties, and when you select a category in the left menu, it will display all potential properties for that label. What we are interested in is the FSN property, or Fully Specified Name. Select only this under caption and it will be displayed in this perspective.

Selecting what will be displayed on each node.

Note that for the RoleGroup, Description, and Patient categories there is no FSN. For Description select term, and for Patient select fname and lname. And that’s it! Now if we query the graph and zoom in, all labels are colour coded, and they display useful information.

Expansion of a small portion of the completed graph.

Graph with information displayed on each node.

Final Thoughts and Future Work

Synthetic patient information can be a tricky thing. It provides a framework to work from, but any relationships pulled from the database must be taken with a grain of salt, as they will not necessarily be mirrored in real-world data. The utility in what we have built here comes from the potential. Actual health data is very difficult to gain access to, and security and privacy will always be issues. But why let a lack of access to real data slow progress? Here we have a framework to build from. Tools for analysis can be designed and tested on this database, and the pipeline for data entry can be streamlined or even automated. The entire process can be fine-tuned and perfected before any real information even enters the system!

Moving forward, we have two options: increase the volume of data per patient, or start building algorithms to analyze the data. First, we can increase the number of relationships connected to each of our patients, as we still have information generated with Synthea that we did not use, specifically immunization, medication, and observation information. We didn’t use this data because converting the coding these files use to Snomed CT is non-trivial, but it is entirely doable. In addition, all of the information that we generated has timestamps associated with it, so we could inject this data into the graph. Second, we could start building algorithms to analyse what we have so far, identifying patterns and other useful information held within the graph. Machine Learning and Deep Learning on graph databases is an exploding field. With what we have, we could do some clustering and data reduction, find similar patient populations, and from there predict potential future connections. And that just scratches the surface.

Either way, the future is exciting, and this project provides a ton of opportunities moving forward.

Thanks for reading!