Organizations looking to embrace data analytics for improved patient care may want to consider Hadoop as a solution for their healthcare data infrastructure.
Hadoop is an open-source framework for distributed data storage and analytics. It is not a data warehouse per se, but a software framework for handling both structured and unstructured data. Hadoop distributes large amounts of data across different processing nodes, then combines the collected results. This approach allows data to be processed faster, since each node works with a smaller batch of localized data instead of the contents of the entire warehouse.
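The distribute-then-combine approach can be sketched in miniature. The snippet below is an illustrative simulation, not Hadoop's actual API: each "node" tallies words in its own local partition, and the partial tallies are then merged into one result.

```python
from collections import Counter
from functools import reduce

# Hypothetical partitions of text, standing in for data blocks
# that Hadoop would distribute to different processing nodes.
partitions = [
    "patient admitted with chest pain",
    "patient discharged after observation",
    "chest x-ray ordered for patient",
]

def count_partition(text):
    # Each "node" processes only its own localized slice of the data.
    return Counter(text.split())

# Process every partition independently, then combine the results.
partial_counts = [count_partition(p) for p in partitions]
total = reduce(lambda a, b: a + b, partial_counts)

print(total["patient"])  # each partition mentions "patient" once -> 3
```

No single step ever touches the whole data set, which is why the approach scales: adding partitions (and nodes to process them) leaves each local step the same size.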
The majority of healthcare organizations are still in search of the most efficient big data analytics tools to improve patient care and allow them to participate in predictive analytics and population health management.
One of the major challenges for healthcare providers is understanding and reconciling the two major types of data: structured and unstructured information.
Structured data is data stored within fixed fields, such as the rows and columns of a database table. It is easier to store and analyze because it has straightforward boundaries and is created in a standardized format. Patient demographic information, diagnosis and procedure codes, medication codes, and certain other data from the electronic health record are typically generated in a standardized, structured way. Traditional data warehouses are usually equipped to handle structured data.
Unstructured data may give healthcare organizations more trouble. It comes in many forms, including, but not limited to, emails, audio files, videos, text documents, and social media posts. Because unstructured data lacks a predefined model, it cannot be analyzed the same way as structured data.
Electronic health records (EHRs) pose a unique challenge to healthcare organizations because many EHRs allow free-text input for clinical notes and other narrative data collection fields. Unstructured data needs to be extracted, processed, and normalized before it can be analyzed. Extraction takes time and is another expense for organizations that may be under strict budget restrictions.
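As a minimal sketch of what normalizing free-text notes can involve, the snippet below lowercases a note and expands a few clinical abbreviations. The note text and abbreviation table are invented for illustration; production systems rely on full NLP pipelines and standard vocabularies such as SNOMED CT and RxNorm.

```python
import re

# Hypothetical free-text clinical note (invented for illustration).
note = "Pt c/o chest pain.  Hx of HTN; started Lisinopril 10mg."

# A tiny abbreviation dictionary standing in for a real terminology service.
abbreviations = {
    "pt": "patient",
    "c/o": "complains of",
    "hx": "history",
    "htn": "hypertension",
}

def normalize(text):
    # Lowercase, collapse runs of whitespace, and expand known abbreviations.
    text = re.sub(r"\s+", " ", text.lower())
    for abbr, full in abbreviations.items():
        text = re.sub(rf"\b{re.escape(abbr)}\b", full, text)
    return text

print(normalize(note))
# -> patient complains of chest pain. history of hypertension; started lisinopril 10mg.
```

Only after a pass like this can free-text fields be mapped onto the structured codes a warehouse expects, which is where the extraction time and expense come from.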
The Hadoop Distributed File System (HDFS) is the primary distributed storage used by Hadoop applications. HDFS is not a physical database; it stores data in files spread across clusters of servers until an organization is ready to use it.
Hadoop splits large data sets into blocks that are distributed across the nodes of a cluster. The nodes are linked together and combine the data stored within them to produce results based on parameters set by an organization.
MapReduce processes the data. MapReduce is a Java-based programming framework that pulls the requested data out of the Hadoop cluster in two phases: a map phase that filters and transforms the data on each node, and a reduce phase that aggregates the partial results.
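Although MapReduce jobs are natively written in Java, Hadoop Streaming lets mappers and reducers be written in other languages as executables reading stdin and writing stdout. The two phases can be sketched as plain functions and run locally for illustration, here counting hypothetical diagnosis-code records:

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Map phase: emit a (key, 1) pair for every token seen on a node.
    for line in lines:
        for token in line.split():
            yield (token, 1)

def reducer(pairs):
    # Between phases, Hadoop shuffles and sorts pairs by key;
    # sorted() stands in for that step here.
    # Reduce phase: sum the counts for each key.
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (key, sum(count for _, count in group))

# Hypothetical input records (invented diagnosis codes).
records = ["code E11 code I10", "code E11"]
print(dict(reducer(mapper(records))))  # -> {'E11': 2, 'I10': 1, 'code': 3}
```

On a real cluster the mapper runs in parallel on each node's local blocks, and only the much smaller intermediate pairs travel over the network to the reducers.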
Implementing Hadoop as part of a data warehouse allows organizations to handle and process data that may have been previously impossible to analyze.
Fully implementing Hadoop into a data warehouse may require updates to servers. Investing in more on-premises servers or considering a hybrid storage solution will help prevent scalability and capacity issues. Hadoop is a fairly large implementation, and organizations need to consider the kinds of data they expect to analyze and whether their current infrastructure can handle it.
Healthcare organizations always need to consider cost-effectiveness when implementing a new solution into their infrastructure. Organizations need to be fully committed and ready to realize the benefits of a solution like Hadoop.
Traditional databases and data warehouses have not outlasted their usefulness and can still be effectively implemented in hybrid Hadoop solutions.
According to a blog post by big-data-as-a-service vendor Qubole, “hybrid systems, which integrate Hadoop platforms with traditional relational databases, are gaining popularity as cost-effective ways for companies to leverage the benefits of both platforms.”
Considering a database solution on the scale of Hadoop is a necessary first step for the healthy growth of an organization's health IT infrastructure.
Healthcare organizations continue to seek more effective ways to treat patients, a goal that depends on collecting and analyzing as much data as possible. Organizations collecting data on both patients and employees can more easily see where improvements need to be made and where ineffective efforts can be reduced.