For ZirMed Inc., a Louisville, Ky.-based vendor of software-as-a-service applications for healthcare administration, SQL querying support was a crucial factor in making the deployment of a Hadoop cluster feasible. “If we didn’t have a SQL level on top of Hadoop, we would not have implemented Hadoop,” said Chris Schremser, ZirMed’s CTO.
Schremser said that programming MapReduce jobs in Java wouldn’t have been “conducive to developer productivity” in his organization. Instead, ZirMed is using Hortonworks’ Hadoop distribution along with Apache Hive, open source software that lets SQL-savvy developers and business users at the company query data stored in the Hadoop Distributed File System (HDFS).
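To make the productivity argument concrete, here is a minimal sketch of the same aggregation written two ways: as hand-coded map, shuffle and reduce steps, and as a single SQL statement. It is an illustration only, not ZirMed's code. The claim records, payer names and function names are hypothetical, and Python's built-in `sqlite3` module stands in for Hive so the example runs without a cluster.

```python
import sqlite3
from collections import defaultdict

# Hypothetical claim records as (payer, amount_usd) pairs. Not ZirMed data.
claims = [
    ("PayerA", 120.0),
    ("PayerB", 75.5),
    ("PayerA", 30.0),
    ("PayerB", 24.5),
    ("PayerC", 10.0),
]

# --- Hand-written MapReduce-style version -------------------------------
def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for payer, amount in records:
        yield payer, amount

def shuffle_phase(pairs):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Collapse each key's values into a single total."""
    return {key: sum(values) for key, values in groups.items()}

mapreduce_totals = reduce_phase(shuffle_phase(map_phase(claims)))

# --- SQL version (sqlite3 used here as a stand-in for Hive) -------------
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (payer TEXT, amount_usd REAL)")
conn.executemany("INSERT INTO claims VALUES (?, ?)", claims)
sql_totals = dict(
    conn.execute("SELECT payer, SUM(amount_usd) FROM claims GROUP BY payer")
)
conn.close()

print(mapreduce_totals)
print(sql_totals)
```

Both versions produce the same per-payer totals. The SQL version expresses the whole job in one declarative statement, which is the kind of difference Schremser was pointing to.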
ZirMed installed the Hadoop cluster in the fall of 2014, with 29 compute nodes and 1.2 PB of raw storage capacity. The system is being used to collect data on medical payments and insurance claims from the company’s customers for analysis. The information is straightforward transaction data, not the unstructured or semi-structured forms of data often found in HDFS, and the jobs that Schremser’s team runs against it with Hive aren’t highly interactive. But he called them “big queries” — for example, answering questions on insurance eligibility that require churning through more than two years of patient records and transactions.
“We don’t have an unstructured [data] problem today,” Schremser said. “Our issue is quantity. We had so much data that querying it could be a problem.”
Previously, ZirMed ran some queries on a data warehouse appliance and others directly on its transaction processing system because of storage limitations in the data warehouse. In the latter case, query jobs that took hours — or multiple days — to complete against the transaction database have been reduced to a handful of minutes on the Hadoop cluster, according to Schremser.
ZirMed also put Hadoop and Hive into production with an eye toward reduced costs going forward, he continued. According to a case study write-up posted on the Hortonworks website, the hardware for the Hadoop cluster cost $235,000, about 30% of what Schremser said the company had spent on the data warehouse system. And for that price, the cluster contains five times more usable storage space than the warehouse provided.
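The cost comparison can be worked through as simple arithmetic. The sketch below uses only the figures reported above and rests on two assumptions: that "about 30%" is treated as exactly 0.30, and that "five times more" usable storage means a 5x multiple. The variable names are illustrative, and the results are rough estimates rather than figures from ZirMed or Hortonworks.

```python
# Figures reported in the article
cluster_cost_usd = 235_000
cluster_share_of_warehouse_cost = 0.30  # "about 30%", treated as exact (assumption)
storage_multiple = 5                    # "five times more", read as 5x (assumption)

# Warehouse spend implied by the 30% ratio
implied_warehouse_cost_usd = cluster_cost_usd / cluster_share_of_warehouse_cost

# Cluster cost per unit of usable storage, relative to the warehouse
relative_cost_per_unit = cluster_share_of_warehouse_cost / storage_multiple

print(f"Implied warehouse spend: about ${implied_warehouse_cost_usd:,.0f}")
print(f"Relative cost per unit of usable storage: about {relative_cost_per_unit:.0%}")
```

Under those assumptions, the warehouse spend works out to roughly $783,000, and each unit of usable storage on the cluster costs about 6% of what it did on the warehouse.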