Cloudera Makes Hadoop a Big Player in Big Data

I had the pleasure of attending Cloudera’s recent analyst summit. Presenters reviewed the work the company has done since its founding six years ago and outlined its plans to use Hadoop to further empower big data technology to support what I call information optimization. Cloudera’s executive team includes the company’s co-founders, who worked at Facebook, Oracle and Yahoo when they developed and used Hadoop. Last year they brought in CEO Tom Reilly, who led successful organizations at ArcSight, HP and IBM. Cloudera now has more than 500 employees, 800 partners and 40,000 users trained in its commercial version of Hadoop. Hadoop has brought to the market an integration of computing, memory and disk storage; Cloudera has expanded the capabilities of this open source software for its customers through its own extensions and commercialization for enterprise use. The importance of big data is undisputed now: For example, our latest research in big data analytics finds it to be very important in 47 percent of organizations. However, we also find that only 14 percent are very satisfied with their use of big data, so there is plenty of room for improvement. How well Cloudera moves forward this year and next will determine its ability to compete in big data over the next five years.

Cloudera’s technology supports what it calls an enterprise data hub (EDH), which ties together a series of integrated components for big data that include batch processing, analytic SQL, a search engine, machine learning, event stream processing and workload management; this is much like the way relational databases and tools evolved in the past. These components also can handle the types of big data most often used: Our research finds that 40 percent or more of organizations use each of five types, from transactional data (60%) to machine data (42%). Hadoop combines layers of the data and analytics stack, from collection, staging and storage to data integration and integration with other technologies. For its part Cloudera has a sophisticated focus on both engineering and customer support. Its goal is to enable enterprise big data management that can connect and integrate with other data and applications from its range of partners. Cloudera also seeks to facilitate converged analytics. One of these partners, Zoomdata, demonstrated the potential of big data analytics in analytic discovery and exploration through its visualization on the Cloudera platform; its integrated and interactive tool can be used by business people as well as professionals in analytics, data management and IT.

Cloudera’s latest major release, Cloudera Enterprise 5, brought a range of enterprise advancements, including in-memory processing, resource management, data management and data protection, to name a few. Cloudera also announced a range of product options intended to make it easier to adopt its Hadoop technology. Cloudera Express is its free version of Hadoop, and the company provides three editions licensed through subscription: Basic, Flex and Data Hub. The Flex Edition of Cloudera Enterprise supports analytic SQL, search, machine learning, event stream processing and online NoSQL through the Hadoop components HBase, Impala, Spark and Navigator; a customer organization can use one of these per Hadoop cluster. The Enterprise Data Hub (EDH) Edition enables use of any of the components in any configuration. Cloudera Navigator is a product for managing metadata, discovery and lineage, and in 2014 it will add search, annotation and registration of metadata. Cloudera uses Apache Hive to support SQL through HiveQL, and Cloudera Impala provides a unique SQL interface to the Hadoop file system HDFS. This is in line with what our research shows organizations prefer: More than half (52%) use standard SQL to access Hadoop. This range of choices for getting to data within Hadoop helps Cloudera’s customers realize a broad range of uses, including predictive customer care, market risk management and customer experience, areas where very large volumes of information can be applied to applications that were not cost-effective before. With the EDH Edition Cloudera can compete directly with large players IBM, Oracle, SAS and Teradata, all of which have ambitions to provide the hub of big data operations for enterprises.
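To make that concrete, here is a minimal sketch of what standard SQL access to Hadoop looks like through Impala, using the open source impyla client; the host, port and table names are my own hypothetical examples, not anything taken from Cloudera’s documentation.

```python
# A minimal sketch of standard SQL access to Hadoop through Impala,
# using the open source impyla client (pip install impyla).
# The host, port and table names below are hypothetical.
from impala.dbapi import connect

# Impala daemons accept HiveServer2-protocol clients on port 21050 by default.
conn = connect(host='impala-node.example.com', port=21050)
cursor = conn.cursor()

# Plain SQL over files stored in HDFS -- no MapReduce code required.
cursor.execute("""
    SELECT customer_id, COUNT(*) AS events
    FROM clickstream
    GROUP BY customer_id
    ORDER BY events DESC
    LIMIT 10
""")
for customer_id, events in cursor.fetchall():
    print(customer_id, events)

cursor.close()
conn.close()
```

The point of the sketch is the research finding above: a business analyst who already knows SQL can get at data in Hadoop without learning the MapReduce programming model.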

Because of Hadoop’s open source roots, community is especially important to it. Part of building a community is providing training to certify and validate skills. Cloudera has enrolled more than 50,000 professionals in its Cloudera University and works with online learning provider Udacity to increase the number of certified Hadoop users. It also has developed academic relationships to promote the teaching of Hadoop skills to computer science students. Our research finds that this sort of activity is necessary: The most common challenge in big data analytics processes, for two out of three (67%) organizations, is not having enough skilled resources; we have found similar issues in the implementation and management of big data. The other aspect of building a community is enlisting partners that offer specific capabilities. I am impressed with Cloudera’s range of partners, from OEMs, system integrators and channel resellers such as Cisco, Dell, HP, NetApp and Oracle to cloud providers including Amazon, IBM and Verizon.

To help it keep up, Cloudera announced it has raised another $160 million from the likes of T. Rowe Price, Michael Dell Ventures and Google Ventures, adding to earlier financing from venture capital firms. With this funding Cloudera outlined its investment focus for 2014, which will concentrate on advancing database and storage technology, security, in-memory computing and cloud deployment. I believe that it will need to go further to meet the growing needs for integration and analytics and to prove that it can provide a high-value integrated offering directly as well as through partners. Investing in its Navigator product also is important, as our research finds that quality and consistency of data is the most challenging aspect of the big data analytics process in 56 percent of organizations. At the same time, Cloudera should focus on optimizing its infrastructure for the four types of data discovery that our analysis finds are required.

Cloudera’s advantage is being the focal point of the Hadoop ecosystem while others are still trying to match its numbers of developers and partners to serve big data needs. Our research finds substantial growth opportunity here: Hadoop will be used in 30 percent of organizations through 2015, and another 12 percent are planning to evaluate it. Our research also finds a significant lead for Cloudera in Hadoop distributions, but other options like Hortonworks and MapR are growing. The research finds that most of these organizations are seeking the ability to respond faster to opportunities and threats; to do that they will need a next generation of skills to apply to big data projects. Our research in information optimization finds that over half (56%) of organizations are planning to use big data, and Hadoop will be a key focus for those efforts. Cloudera has a strong position in the expanding big data market because it focuses on the fundamentals of information management and analytics through Hadoop. But it faces stiff competition from established providers of RDBMSs and data appliances that are blending Hadoop with their technology, as well as from a growing number of providers of commercial versions of Hadoop. Cloudera is well managed and has the finances to meet these challenges; now it needs to show many high-value production deployments in 2014 as the center of businesses’ big data strategies. If you are building a big data strategy with Hadoop, Cloudera should be a priority in your evaluation.

Regards,

Mark Smith

CEO & Chief Research Officer

EMC Looks to Be Pivotal for Big Data

The big data landscape just got a little more interesting with the release of EMC’s Pivotal HD distribution of Hadoop. Pivotal HD takes Apache Hadoop and extends it with a data loader and command center capabilities to configure, deploy, monitor and manage Hadoop. Pivotal HD, from EMC’s Pivotal Labs division, integrates with Greenplum Database, a massively parallel processing (MPP) database from EMC’s Greenplum division, and uses HDFS as the storage technology. The combination should help organizations capture a key part of big data’s value in information optimization.

Greenplum and EMC have been working with Hadoop technology to provide robust database and analytic technology offerings. EMC is using Hadoop and HDFS as a foundation to support a new generation of information architectures, on top of which the company provides a value-added layer of data and analytic processing to support a range of big data needs. The aim is to address one of the benefits of big data technology, which is to increase the speed of analysis; our big data benchmark research found that to be a key benefit for 70 percent of organizations.

EMC is placing a bet by building its distribution on top of Apache Hadoop 2.0.2, which has yet to be officially released. The company is testing its software on a thousand-node cluster to ensure it will be ready. While EMC calls Pivotal HD the most powerful Hadoop distribution, it is one of many new providers that are building on Hadoop technologies and commercializing them for organizations looking for direct support and services or for value-added technology on top of Hadoop. Oddly, however, EMC’s new offering appears to compete with its own licensing of MapR for a product it calls Greenplum MR.

EMC has given the advanced database processing technology that comes with Pivotal HD a new name: HAWQ. It provides the ability to use ANSI SQL against big data in an optimized manner through a query parser and optimizer, with its own HAWQ nodes executing queries against HDFS data nodes. HAWQ also has its own Xtension Framework for adaptability to other technologies. HAWQ improves on the performance of regular SQL access because it is a specialized technology that manages distributed, optimized queries against data in Hadoop.
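Because HAWQ descends from Greenplum Database, which is itself PostgreSQL-based, standard PostgreSQL client libraries can generally speak to it. The sketch below, using Python’s psycopg2, is a hedged illustration of issuing ANSI SQL to HAWQ under that assumption; the host, credentials and table are invented for the example.

```python
# A hedged sketch of issuing ANSI SQL to HAWQ. Because HAWQ derives from
# the PostgreSQL-based Greenplum Database, a standard PostgreSQL driver
# such as psycopg2 should generally be able to connect; the host,
# credentials and table below are hypothetical.
import psycopg2

conn = psycopg2.connect(host='hawq-master.example.com', port=5432,
                        dbname='analytics', user='gpadmin', password='secret')
cur = conn.cursor()

# HAWQ's parser and optimizer plan the statement; its nodes execute it
# against the data stored in HDFS.
cur.execute("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    WHERE sale_date >= DATE '2013-01-01'
    GROUP BY region
    ORDER BY total_sales DESC
""")
for region, total_sales in cur.fetchall():
    print(region, total_sales)

cur.close()
conn.close()
```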

By supporting SQL as the language for getting to Hadoop, HAWQ simplifies standardized access to big data, providing query optimization through its query planning and pipelining methods. Providing a SQL interface and an ODBC connection is not new; many Hadoop distributions now provide ODBC connectivity, including Cloudera, Hortonworks and MapR. EMC, however, uses its optimized query engine and SQL connection in HAWQ as an accelerator, which lets it stack its software technology up against any data and analytic technology, not just Hadoop. The question for organizations thinking about making an investment in this approach is whether they would be limiting their access to future Hadoop advancements by investing in HAWQ technology that operates only with the Pivotal HD distribution, or whether the gains provide enough immediate value to offset the challenges of optimizing a Hadoop infrastructure. It is my belief that an organization that adopts the HAWQ path will need to invest in an information architecture that includes integration technology at the HDFS level, as businesses will inevitably operate against varying flavors of Hadoop.
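To illustrate how interchangeable this ODBC access layer can be, here is a minimal sketch using Python’s pyodbc; the DSN name and query are my own assumptions, and only the driver configured behind the DSN changes when switching between distributions.

```python
# A minimal sketch of SQL access to big data over ODBC, which Cloudera,
# Hortonworks, MapR and HAWQ all support through their drivers. The DSN
# name and table below are hypothetical; swapping distributions means
# reconfiguring the driver behind the DSN, not rewriting this code.
import pyodbc

conn = pyodbc.connect('DSN=BigDataSQL;UID=analyst;PWD=secret')
cursor = conn.cursor()

# The same portable SQL runs regardless of which engine sits behind the DSN.
cursor.execute("SELECT product, COUNT(*) AS orders FROM sales GROUP BY product")
for product, orders in cursor.fetchall():
    print(product, orders)

cursor.close()
conn.close()
```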

Another area of differentiation EMC promises for HAWQ is performance. EMC claims dramatic performance improvements using its query optimizer and SQL versus using Hive to access HDFS or Cloudera Impala and native Hadoop; in fact, its own benchmark claims performance 19 to 648 times faster. Since these benchmarks were not run independently, it is hard to place much weight on them for now. I made inquiries to many Hadoop software providers, including Cloudera, and they said these metrics are probably not accurate and invited performance comparisons against their technologies. Clearly these benchmarks should have been released to the Hadoop community so its members could design optimized queries using Hive for more accurate comparisons, but EMC is hoping that its results will entice IT professionals to try it for themselves.

EMC’s stature in the market and its work with a broad range of technology partners make it an important player in the big data market. Tableau Software is one of those partners, providing discovery and analytics on data from HAWQ and Pivotal HD. Cirro also announced support for Pivotal HD, enabling a new generation of what I call big data integration. These partners give EMC a more complete, enterprise-ready stack of big data technologies, from the analyst’s desktop to connectivity with other data sources.

EMC can deploy its big data technology across a variety of methods, including public cloud with OpenStack and Amazon Web Services (AWS), private cloud using VMware, and on-premises. Our big data research shows faster growth planned for hosted (59%) and software-as-a-service (65%) deployments than for future on-premises deployments. While EMC is not permitted to publicly name its customer references, and I have yet to validate them, the company says they include some of the largest banks and manufacturers.

Meanwhile, the Hadoop community’s new Tez project provides an alternative that bypasses MapReduce to improve performance; it uses Hadoop YARN for a more efficient runtime and better query performance. Also, the Stinger Initiative is a project to improve interactive query support for Hive.

EMC acknowledges the open source efforts that focus on improving the performance of accessing HDFS, and it looks forward to those advancements and to incorporating them into its Pivotal HD product where possible, but it points to its query optimizer and ANSI SQL as a better approach. It also did not deny that its performance comparisons could have been better optimized. But EMC is betting that its HAWQ efforts and its reliance on the next release of Apache Hadoop 2, open source technology expected to be released in 2013, will place it in a good market position.

This move to introduce Pivotal HD Enterprise and HAWQ is clearly an opportunity to accelerate EMC’s efforts. Greenplum’s technology needed assistance to grow its adoption as it competes with approaches that encompass not only Hadoop but also in-memory, appliance and RDBMS technology. Only time will tell how EMC’s focus on big data with Pivotal HD and HAWQ will play out. The battle among big data providers continues to be very competitive, with dozens of approaches. As each organization moves from experimentation to development to production, it must carefully determine what technology will best meet its unique needs. Organizations should evaluate HAWQ and Pivotal HD not just on the merits of performance or SQL access but on IT’s architectural and management needs, which span adaptability, manageability, reliability and usability, and on the business value this technology delivers compared with other Hadoop and big data approaches.

Regards,

Mark Smith

CEO & Chief Research Officer