- Organizations are challenged by constantly increasing amounts of health data as more connected and digital devices contribute data. The rate at which the data is growing is also causing healthcare data lakes to flood. Organizations need to resolve storage and analytics challenges to use the data collected in data lakes.
A data lake is a centralized repository that stores structured and unstructured data.
Structured data is data stored within fixed confines, such as a file. Structured data is easier to analyze and store because it has straightforward boundaries and is created and stored in a standardized format.
Patient demographic information, diagnosis and procedure codes, medication codes, and certain other data from the electronic health record are typically generated in a standardized, structured way. Traditional data warehouses are usually equipped to handle structured data.
Unstructured data does not have a standardized format and is not organized. Unstructured data comes from many different data sources and can contain images, numbers, and complex data sets.
This unstructured data lives in the data lake and are often too vast to be retrieved conveniently or used for analytics.
Currently healthcare organizations are trying to implement ways to make better use of the data stored in data lakes. Different tools can sort through the data to make it more accessible and actionable. Utilizing tools such as Hadoop, object storage, and even blockchain can turn data lakes from storage dumps to active tools.
Hadoop is an open-source distributed data storage and analytics application. Hadoop is not a data warehouse but acts as a software framework to handle structured and unstructured data.
Hadoop distributes large amounts of data to different processing nodes, then combines the collected results. This approach allows data to be processed faster, since the system is working with smaller batches of localized data instead of the contents of the entire warehouse.
READ MORE: Top 10 Cloud Data Storage Companies
HDFS is the primary distributed storage used by Hadoop applications. HDFS is not a physical database, but it collects data and stores it in clusters until an organization is ready to use it.
Hadoop separates unstructured data into nodes that are individual parts of a larger data structure. The nodes are linked together and able to combine the data stored within to produce results based on parameters set by an organization.
Healthcare organizations also need scalability and automation to support their data lake and may look to object storage as a solution.
Gartner defines distributed object storage as software and hardware solutions that are based on "shared nothing architecture and support object and/or scale-out file technology to address requirements for unstructured data growth.”
Object storage manages data as objects instead of files or blocks. Objects are kept in a storage pool that does not have a hierarchical structure.
Object storage uses unique identifiers that allow data to be stored anywhere in the storage pool. Storing data using object storage gives healthcare organizations more possibilities for data analytics and offers a scalable infrastructure.
While object storage is not the fastest storage solution, it’s a good place to store large amounts of data because it is designed to be balanced among all systems.
Blockchain is a potential remedy for stagnant data lakes. Knowing how to trace the data back to where it originated to give it context is one of the most significant challenges that unstructured data in the data lake currently poses.
Blockchain creates an unchanagble ledger that can trace permission, access, and transmission of data to create a controlled data lake, according to Obesity PPM CEO Heather Flannery.
“There will be clearer definitions of property,” Flannery explained to HITInfrastructure.com. “It will also can have patient ownership pools in that same context.”
“All of that is going to drive volumes into the cloud and deliver a scalable economic and secure mechanism by which we could start to really work with and compute on sophisticated, highly variable health data,” she continued. “It’s an important opportunity to see the evolution of cloud computing really drive to it value points.”
Blockchain can potentially build a data lake that is assembled using blockchain technology. This will give organizations a clear and traceable repository of data pulled together from hundreds or even thousands of desperate data sources. This is the kind of data lake that can be used to make counterintuitive discoveries using AI and machine learning.
While this use of blockchain has yet to be realized, it is a future possibility using a cloud environment.
As healthcare data continues to grow exponentially, organizations need to focus on their data lakes and determine how to approach the growing storage and compute needs to make that data actionable. Once the data in the data lake is actionable, it can be used to give clinicians better insight into patient health.