HITInfrastructure

How to Approach Building a Healthcare Data Lake Roadmap

Navigating how to prepare a healthcare data lake can be a challenge, but focusing on flexibility and scalability while understanding data usage is key.

By Elizabeth O'Dowd

Collecting, storing, and using data produced by patients is a major challenge, and organizations often have data they can’t sort through and use efficiently. Healthcare data lakes contain valuable information that can be used to improve patient care, but organizing the data and making it available takes significant IT infrastructure planning.

Organizations preparing to make collected health data actionable should create a roadmap that will enable them to leverage data to improve workflow and patient care.

“The core data lake is really becoming an advanced clinical research information system to accelerate the speed to knowledge,” EMC Global Healthcare Business CTO Dave Dimond told HealthITAnalytics.com in a previous interview.

“It’s very powerful technology that is a key part of retooling how providers are engaging patients at the point of care, for example, and how they are building a bridge between them. It’s about creating a learning health system.”

Big data analytics and population health are two uses for the data collected in the data lake. Organizations interested in using data to enhance population health and analytics need to understand the nature of the data being collected, how to store and access that data, and how to make the data actionable.

Planning for structured and unstructured data

Data collected by clinicians, patients, and connected devices is either structured or unstructured. Structured data is stored in fixed, predefined fields, such as the columns of a database table. It is easier to store and analyze because it has clear boundaries and is created and stored in a standardized format.

Patient demographic information, diagnosis and procedure codes, medication codes, and certain other data from the EHR are typically generated in a standardized, structured way. Traditional data warehouses are usually equipped to handle structured data.

Unstructured data does not have a standardized format and is not organized. Unstructured data comes from many different data sources and can contain images, numbers, and complex data sets.

Unstructured data lives in the data lake and is often too vast to be retrieved conveniently or used for analytics.
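The contrast between the two kinds of data can be illustrated with a minimal Python sketch. The record layout, field names, and sample values below are hypothetical and chosen only for illustration; they do not come from any particular EHR.

```python
from dataclasses import dataclass, fields


@dataclass
class StructuredEncounter:
    """Hypothetical structured record: every value sits in a named, typed field."""
    patient_id: str
    birth_year: int
    diagnosis_code: str
    medication_code: str


def field_names(record):
    """List the fixed fields a structured record exposes."""
    return [f.name for f in fields(record)]


# Structured: the diagnosis can be read straight from its field.
structured = StructuredEncounter("P-001", 1958, "E11.9", "RX-0001")

# Unstructured: the same kind of information is buried in free text,
# so a tool has to search or interpret the text to find it.
unstructured_note = (
    "Pt reports feeling dizzy after starting new medication last week. "
    "Advised to monitor blood pressure at home."
)

print(field_names(structured))
print(structured.diagnosis_code)
print("dizzy" in unstructured_note)
```

The structured record can be filtered or aggregated by field name, while the free-text note needs extra processing before it can answer the same questions.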

Organizations need tools that can handle both structured and unstructured data, sorting and organizing it so that it becomes accessible and actionable. Tools such as Hadoop can turn data lakes from storage dumps into active resources.

Hadoop is an open-source software framework for distributed data storage and processing. It is not a data warehouse, but it can handle both structured and unstructured data.

Hadoop distributes large amounts of data to different processing nodes, then combines the collected results. This approach allows data to be processed faster, since each node works on a smaller batch of localized data instead of the contents of the entire warehouse.
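The distribute-then-combine approach described above can be sketched in plain Python. This is not Hadoop code and does not use any Hadoop API; the function names and sample records are hypothetical, and the "nodes" are simulated as in-memory lists rather than separate machines.

```python
from collections import Counter
from functools import reduce


def split_into_batches(records, num_nodes):
    """Assign records round-robin to num_nodes batches (stand-ins for nodes)."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def map_batch(batch):
    """Map step: a node counts diagnosis codes in its own batch only."""
    return Counter(record["diagnosis_code"] for record in batch)


def combine(partials):
    """Reduce step: merge the per-node counts into one overall result."""
    return reduce(lambda left, right: left + right, partials, Counter())


def count_diagnoses(records, num_nodes=3):
    batches = split_into_batches(records, num_nodes)
    # On a real cluster these calls would run in parallel on separate machines.
    partials = [map_batch(batch) for batch in batches]
    return combine(partials)


# Made-up sample records, for illustration only.
records = [
    {"patient_id": 1, "diagnosis_code": "E11.9"},
    {"patient_id": 2, "diagnosis_code": "I10"},
    {"patient_id": 3, "diagnosis_code": "E11.9"},
    {"patient_id": 4, "diagnosis_code": "J45.909"},
    {"patient_id": 5, "diagnosis_code": "I10"},
    {"patient_id": 6, "diagnosis_code": "E11.9"},
]

print(count_diagnoses(records))
```

Because each batch is counted independently and the partial counts are simply added together, the final result does not depend on how many nodes the work is spread across.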

Hadoop stores and analyzes data mainly through two components: the Hadoop Distributed File System (HDFS) and MapReduce.

HDFS is the primary distributed storage used by Hadoop applications. HDFS is not a database, but it collects data and stores it across a cluster of machines until an organization is ready to use it.

Hadoop splits large data sets into blocks and distributes them across nodes, the individual machines that make up a cluster. The nodes are linked together and able to combine the data stored within them to produce results based on parameters set by an organization.

Assessing cloud service models for the data lake

Storing data in the cloud gives organizations a level of flexibility that they often can’t achieve with on-premises deployments.

Cloud data storage can also save organizations money by allowing them to purchase more storage space as needed, rather than investing in additional on-premises servers.

According to a HIMSS study, connectivity “should easily ‘scale up,’ as more applications are moved to the cloud or more compute cycles are accessed for analytics.”

Moving data to the cloud not only gives organizations an easier way to expand but also cuts the cost of on-premises server hardware and of the additional IT staff needed to manage and maintain it.

This added capacity gives organizations the resources to deploy tools like Hadoop and gain more control over their IT infrastructure.

Organizations then need to decide if they wish to deploy their tools in the public cloud, private cloud, or a combination of both. Public cloud is the most scalable data storage solution. Storage space can be added or dropped as the size of an organization changes. This makes public cloud popular for temporary projects as well as data migration.

Private cloud gives organizations more control over where their data resides and its accessibility to users. The private cloud gives health IT staff direct control over the contents stored in the cloud. Healthcare organizations may benefit from private cloud because they can keep a close eye on PHI.

The deciding factors between public and private cloud are budget, staff, and the amount of data that needs to be stored. Public cloud is often the less expensive option for health systems that have a lot of unstructured data and a lower budget that can’t cover private cloud deployments.

No matter what a hospital’s budget is, the more data it produces, the more expensive that data will be to store. Keeping that in mind when planning how to support the data lake will help keep cloud storage manageable.

Considering object storage for the data lake

“Object storage provides an inexpensive way to store vast pools of data, multiple petabytes up to exabyte scale within a single space,” said Key Information Systems Director of Cloud Service Clayton Weise. “The data stored using object is always accessible, unlike tape where you have to know the serial number, track the tape, and physically retrieve it.”

Object storage manages data as objects instead of files or blocks. Objects are kept in a storage pool that does not have a hierarchical structure.

Instead, object storage uses unique identifiers that allow data to be stored anywhere in the storage pool. Storing data using object storage gives healthcare organizations more possibilities for data analytics and offers a scalable infrastructure.
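As a rough illustration of a flat pool addressed by unique identifiers, the following Python sketch models an object store in memory. The `ObjectStore` class, its method names, and the metadata fields are assumptions made for this sketch, not the interface of any real object storage product.

```python
import uuid


class ObjectStore:
    """Toy in-memory object store: a flat pool with no folder hierarchy."""

    def __init__(self):
        # Flat mapping from object ID to (data, metadata).
        self._pool = {}

    def put(self, data, metadata):
        """Store an object and return the unique identifier assigned to it."""
        object_id = str(uuid.uuid4())
        self._pool[object_id] = (data, dict(metadata))
        return object_id

    def get(self, object_id):
        """Retrieve an object directly by its identifier."""
        return self._pool[object_id]

    def find(self, **criteria):
        """Return the IDs of objects whose metadata matches every criterion."""
        return [
            object_id
            for object_id, (_, metadata) in self._pool.items()
            if all(metadata.get(key) == value for key, value in criteria.items())
        ]


store = ObjectStore()
scan_id = store.put(b"<image bytes>", {"type": "mri", "patient_id": "P-001"})
note_id = store.put(b"<note text>", {"type": "note", "patient_id": "P-001"})

data, metadata = store.get(scan_id)
print(metadata["type"])
print(len(store.find(patient_id="P-001")))
```

Nothing in the sketch depends on where an object sits in the pool. Retrieval needs only the identifier, and the metadata kept with each object is what allows searching without a directory tree.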

“One of the biggest challenges is the exponential growth in the amount of data healthcare organizations are keeping,” said Weise. “Hospital regulatory requirements vary and there are certain rules and regulations that require patient data to be kept seven years. Some hospitals keep information for as long as the patient’s alive, and even then, they may not be deleting it.”

Data that is stored for use in analytics does not need to be accessed regularly like an application. Object storage is not where the analytics solution would run, because it’s not the fastest storage solution, but it is an option for storing large amounts of data in a way that makes it accessible when it’s needed.

No matter how the data lake is approached, organizations should classify their data and understand what it will be used for. Once data use is determined, building a roadmap to utilize the right storage tools for the data produced becomes easier. Organizing data and making it accessible when needed is a key step in making the data actionable for analytics.
