It’s time to establish big data standards
The deployment of big data tools is being held back by the lack of standards in a number of growth areas.
Technologies for streaming, storing, and querying big data have matured to the point where the computer industry can usefully establish standards. As in other areas of engineering, standardization allows practitioners to port their learnings across a multitude of solutions, and to more easily employ different technologies together; standardization also allows solution providers to take advantage of sub-components to expeditiously build more compelling solutions with broader applicability.
Unfortunately, little has been done to standardize big data technologies so far. There are all sorts of solutions but few standards to address the challenges just mentioned. Areas of growth that would benefit from standards are:
- Stream processing
- Storage engine interfaces
- Querying
- Benchmarks
- Security and governance
- Metadata management
- Deployment (including cloud / as a service options)
- Integration with other fast-growing technologies, such as AI and blockchain
The following sections will look at each area.
Streaming
Big data came about with the influx of high-volume, high-velocity streaming data. Several products, both proprietary and open source, offer solutions for processing streaming data: Amazon Web Services, Azure, and innumerable tools contributed to the Apache Software Foundation, including Kafka, Pulsar, Storm, Spark, and Samza. But each has its own interface. Unlike the situation with SQL, there is no standard API or interface for handling this data, although Apache is now promoting a meta-interface called Beam. This makes it hard for solution providers to integrate with these rapidly evolving solutions.
Nor is there an easy way for Internet of Things (IoT) application developers to use these technologies interchangeably and retain portability, so that they don't get tied down by proprietary interfaces; these are essentially the same guiding principles that were behind the ANSI SQL standards.
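To make the portability argument concrete, here is a minimal sketch of a Beam pipeline in Python; it assumes the apache-beam package is installed, and the events.txt input and counts output prefix are illustrative placeholders. The pipeline is written once against Beam's API, and the execution engine is chosen through options rather than baked into the application code.

```python
# A minimal Apache Beam sketch: the pipeline is defined once against Beam's API,
# and the execution engine is selected via options rather than in the code.
# (events.txt and the "counts" output prefix are illustrative placeholders.)
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Swap in FlinkRunner, SparkRunner, DataflowRunner, etc., provided the
# corresponding runner dependencies are installed.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("events.txt")            # one event per line
        | "ExtractWords" >> beam.FlatMap(lambda line: line.split())
        | "Count" >> beam.combiners.Count.PerElement()             # (word, count) pairs
        | "Format" >> beam.MapTuple(lambda word, n: f"{word},{n}")
        | "Write" >> beam.io.WriteToText("counts")
    )
```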
Storage engine interfaces
With the proliferation of NoSQL storage engines (CouchDB, Cassandra, HBase, MongoDB, etc.), we again face a plethora of incompatible APIs. In addition, new types of applications call for a radical rethinking of how to process data. Such rethinking includes document stores (with JSON becoming the prevalent data interchange format) and graph databases, with Gremlin, SPARQL (a W3C standard), and Cypher serving as interfaces to Neo4j, JanusGraph, and other databases. Geographic information systems (GIS) provide a very different model for interacting with their complex form of data. Apache Lucene and related search engines are also unique in the extensive capabilities they provide.
Applications cannot swap storage engines when needed. Also, SQL query engines such as Apache Trafodion, EsgynDB, Apache Spark, Apache Hive, and Apache Impala must each implement a separate integration with every storage engine.
Just as ODBC and JDBC facilitated the development of many BI and ETL tools that work with any database engine, a standard interface could facilitate access to data from any of these storage engines. Furthermore, it would substantially expand the ecosystem of solutions that could be used with the storage engine.
Finally, even though parallelism is important for the data flowing between the query engine and the storage engine, no standard interface facilitates it, and the partitioning of the data can change as it flows between the two.
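To make the idea concrete, a standard storage-engine interface might look something like the following sketch. This is purely hypothetical: the StorageEngine protocol and its method names are illustrative, not part of any existing standard, and a real specification would also have to cover predicate pushdown, transactions, parallelism, and partitioning.

```python
# Hypothetical sketch of a standard storage-engine interface (not an existing API).
# A query engine written against this protocol could work with any compliant
# key-value, document, or wide-column store.
from typing import Iterator, Optional, Protocol


class StorageEngine(Protocol):
    def put(self, table: str, key: bytes, value: bytes) -> None:
        """Insert or overwrite the row/document identified by key."""
        ...

    def get(self, table: str, key: bytes) -> Optional[bytes]:
        """Return the stored value, or None if the key is absent."""
        ...

    def scan(self, table: str, start: bytes, end: bytes) -> Iterator[tuple[bytes, bytes]]:
        """Yield (key, value) pairs in key order for keys in [start, end)."""
        ...

    def delete(self, table: str, key: bytes) -> None:
        """Remove the row/document if present."""
        ...
```

A query engine coded to such a protocol, and BI or ETL tools layered on top of it, could then target HBase, Cassandra, MongoDB, or a new engine through the same calls, much as ODBC and JDBC did for relational databases.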
Querying
Data models supported by NoSQL databases differ just as much as their interfaces. The main standard with some applicability to big data is ANSI SQL. Although many of the NoSQL databases explicitly rejected it in the first decade of the 2000s, many have since adopted it as an alternative API because of its prevalence, its familiarity among developers, and the ecosystem supporting it. SQL is still evolving and is doing a credible job of handling big data challenges. For instance, JSON support and table-valued predicates were added in the 2016 standard.
But even SQL has not kept pace with the changes in the big data space, given that standards take a lot of collaboration, deliberation, and effort to get right and to establish. Two other familiar standards in the relational database world, ODBC and JDBC, have not changed much for quite some time, especially given big data's need to handle parallelism for large volumes of data, the variety of data structures and models, and the changed paradigm of high-velocity streaming data.
The SQL standard needs to evolve to support:
- Streaming data
- Publish/subscribe interfaces
- Windowing on streams (an illustrative window query follows this list):
  - Time, count, and content triggers
  - Tumbling, sliding, and session windows
  - Event-time versus processing-time semantics
- Rules for joining streams of data
- Interfaces to sophisticated search solutions
- Interfaces to GIS systems
- Interfaces to graph databases, so that users can submit standard queries against graph databases and map the results to a tabular format, similar to the elegant JSON-to-relational mapping in the ANSI SQL 2016 standard
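To give a flavor of what a standardized streaming extension might look like, here is an illustrative tumbling-window query, written as a Python string so it could be handed to whichever engine supports it. The syntax is loosely modeled on existing streaming SQL dialects (Flink SQL's TUMBLE function, for example); no ANSI standard defines it today, and sensor_stream, device_id, and event_time are hypothetical names.

```python
# Illustrative only: a hypothetical standard streaming-SQL query with a one-minute
# tumbling window, loosely modeled on existing dialects. Nothing here is ANSI SQL yet.
windowed_counts = """
SELECT device_id,
       TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
       COUNT(*)                                      AS readings
FROM   sensor_stream
GROUP  BY device_id,
          TUMBLE(event_time, INTERVAL '1' MINUTE)
"""

# A portable client could submit the same text to any compliant streaming engine,
# e.g. engine.execute(windowed_counts) -- where 'engine' is a hypothetical handle.
```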
Benchmarks
The workloads for big data span the gamut from streaming to operational to reporting and ad hoc querying to analytical; many of these have real-time, near real-time, batch, and interactive aspects. Currently, no benchmark assesses the price/performance of these hybrid operational and analytical processing workloads. Many vendors claim to support these varied workloads, but definitions are lacking, and no benchmarks exist to test them.
To evaluate potential big data products, most customers turn to benchmarks created by the Transaction Processing Performance Council (TPC). The TPC-DS benchmark, intended to measure BI and analytical workloads, offered considerable promise. But this promise was subverted in two ways. First, vendors present customers with altered versions of the benchmark, distorted to favor the vendor's product; many of the TPC-DS results shared by vendors do not account for common usage, including queries and other workloads running at different levels of concurrency and at big data scale, as outlined in the specification. Second, unlike most TPC standards, TPC-DS was never bolstered by audited results that would enable customers to assess relative price/performance.
Security and governance
There are various security and governance infrastructures for big data deployments. For instance, the Hadoop environment has Apache Ranger and Apache Sentry, and each cloud provider has its own security information and event management systems. It is difficult for applications and solution providers to integrate with these environments because, again, each implementation has a different API.
Standardizing the API for security administration and monitoring would be very beneficial, allowing enterprises to use standard mechanisms to configure and enforce security. These standards would enable more solutions to use these security systems. Consider the integration of crucial security events across various enterprise data sources. If the security products used a standard API, it would be a lot easier for them to interface with any data source, providing the client more choice and flexibility. The same holds true when deploying role and user privileges across enterprise data sources.
Even more conveniently, when data is moved to another storage system, or accessed by another sub-system, the access rights to that data could move automatically and without sysadmin effort, so that the same people would still have access to the same fields and rows, regardless of the tools they use to access that data.
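At its simplest, a standard security-administration API might look like the sketch below. Again, this is hypothetical: the Grant shape and the method names are illustrative, not drawn from Ranger, Sentry, or any cloud provider's API.

```python
# Hypothetical sketch of a standard security-administration API (not an existing one).
# A policy defined once through such an interface could be enforced by any compliant
# data source, and audit events could be consumed by any monitoring system that
# speaks the same standard.
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass(frozen=True)
class Grant:
    principal: str              # user or role, e.g. "role:analysts"
    resource: str               # e.g. "sales.orders" or "sales.orders.credit_card"
    privileges: frozenset[str]  # e.g. frozenset({"SELECT"})


class SecurityAdmin(Protocol):
    def grant(self, grant: Grant) -> None: ...
    def revoke(self, grant: Grant) -> None: ...
    def list_grants(self, resource: str) -> Iterable[Grant]: ...
    def audit_events(self, since_epoch_ms: int) -> Iterable[dict]: ...
```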
Metadata management
With data proliferating across multiple storage engines, the metadata for that data needs a central repository. Regardless of whether a table is stored in HBase, ORC, Parquet, or Trafodion, registering it in a single set of metadata tables would make it much easier for client tools to access the data than the current situation, in which clients must connect to a different set of metadata tables for each engine.
The goal here is to standardize the information schema these client tools see and to centralize security administration, presenting a federated view of all the database objects.
Extending this metadata with business information about the objects would facilitate governance and data lineage, instead of these services having to be provided separately across different metadata repositories. This would make metadata and master data management easier.
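In practice, a shared information schema could be as simple as one catalog that every client queries the same way, regardless of where the underlying table lives. The sketch below is hypothetical; the catalog's columns are illustrative, and SQLite merely stands in for whatever shared service a standard would define.

```python
# Hypothetical sketch: a client browsing a central, federated metadata catalog.
# The catalog records where each table physically lives (HBase, Parquet, ORC, ...),
# so tools query one information schema instead of one per engine.
import sqlite3  # stands in for a shared catalog service in this sketch

catalog = sqlite3.connect(":memory:")
catalog.execute("""
    CREATE TABLE tables (
        schema_name    TEXT,
        table_name     TEXT,
        storage_engine TEXT,   -- e.g. 'HBase', 'Parquet', 'ORC', 'Trafodion'
        location       TEXT,   -- engine-specific locator
        owner          TEXT
    )
""")
catalog.execute(
    "INSERT INTO tables VALUES ('sales', 'orders', 'Parquet', 's3://bucket/orders/', 'etl')"
)

# Any client tool asks the same question, whatever the engine underneath:
for row in catalog.execute(
    "SELECT table_name, storage_engine, location FROM tables WHERE schema_name = 'sales'"
):
    print(row)
```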
Deployment
Each cloud provider requires its own way to provision, configure, monitor, and manage a database and the resources it needs. This means that any client who wants to change cloud providers, or make databases on different providers work together, must change all their procedures and scripts, and perhaps much more.
A standard set of APIs would make this task easier, regardless of whether the customer was deploying the database on a public or private cloud, or in a hybrid configuration. Standards have been proposed but have not succeeded in the market. For instance, OpenStack has open source, community-driven interfaces for many of these tasks, but they have gained no adoption among the services chosen by most customers (Amazon Web Services, Google Cloud, and Microsoft Azure). VMware defined a vSphere standard some years ago, but it is almost completely ignored.
When cloud providers offer comparable services, those services should be standardized as well. For instance, an application that needs object storage, such as AWS S3 or Azure Blob Storage, should be able to get access through a standard interface.
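A provider-neutral object-storage interface might look like the sketch below. This is hypothetical: today an application uses one SDK for S3 and another for Blob Storage, each with its own calls, and the ObjectStore protocol here is illustrative rather than an existing standard.

```python
# Hypothetical sketch of a provider-neutral object-storage interface (not an existing
# standard). Each cloud provider would ship an adapter; applications code to the protocol.
from typing import Iterable, Protocol


class ObjectStore(Protocol):
    def put_object(self, bucket: str, key: str, data: bytes) -> None: ...
    def get_object(self, bucket: str, key: str) -> bytes: ...
    def list_objects(self, bucket: str, prefix: str = "") -> Iterable[str]: ...
    def delete_object(self, bucket: str, key: str) -> None: ...


def archive_report(store: ObjectStore, report: bytes) -> None:
    # Works unchanged against an S3-backed or a Blob-backed adapter.
    store.put_object("reports", "q1.csv", report)
```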
Integration with other emerging technologies
It is also important to think about how to standardize the integration between databases and emerging technologies: machine learning tools and libraries such as TensorFlow, R, and Python packages; the analysis of unstructured data, including sentiment analysis, image processing, and natural language processing (NLP); blockchain; and so on. Today, each solution in these areas has a unique interface, so integrating it with a database is always a custom effort.
Although user-defined functions and table-valued user-defined functions are covered by good standards, there is no way for one database to call user-defined functions, or even stored procedures, written for another database. It would be far more effective if users could take a large library of UDFs developed for any database and plug it into their own. Such a library would also give the developers of these functions a much larger customer base in which their functions could be deployed.
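A portable UDF might be nothing more than an ordinary function plus standardized registration metadata, as in the hypothetical sketch below; the udf decorator, its arguments, and the toy sentiment_score function are illustrative, not part of any current standard.

```python
# Hypothetical sketch of a "plug-and-play" UDF: the function body is plain Python,
# and the decorator carries the metadata (name, argument types, return type) that a
# standard could let any compliant database use to register and invoke it.
from typing import Callable

UDF_REGISTRY: dict[str, dict] = {}


def udf(name: str, arg_types: list[str], return_type: str) -> Callable:
    def register(fn: Callable) -> Callable:
        UDF_REGISTRY[name] = {
            "function": fn,
            "arg_types": arg_types,       # declared in standard SQL type names
            "return_type": return_type,
        }
        return fn
    return register


@udf(name="sentiment_score", arg_types=["VARCHAR"], return_type="DOUBLE")
def sentiment_score(text: str) -> float:
    # Toy scoring logic; a real UDF might call an NLP library here.
    words = text.lower().split()
    positive = sum(w in {"good", "great", "excellent"} for w in words)
    negative = sum(w in {"bad", "poor", "terrible"} for w in words)
    return float(positive - negative)
```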
Conclusion
The deployment of big data tools is now being held back by the lack of standards in the areas I have listed. The development of these standards is in no way intended to thwart innovation or keep providers from providing unique solutions. On the contrary—look at the ANSI SQL standard: it has facilitated the dramatic growth of a large number of database solutions.
An important aspect of standards is ensuring compliance. The National Institute of Standards and Technology (NIST), part of the U.S. Department of Commerce, did that for SQL. Although SQL has admirably broad adoption as a standard, this does not guarantee smooth interoperability. First, SQL has evolved, so vendors can pick and choose which version of the standard to adhere to. Even then, how much of that version they adhere to is not clear without certification. Second, vendors offer numerous non-standard enhancements.
Some of the areas identified here would provide more value to users and providers than others. Prioritizing them, based on input from the user and vendor communities, could help guide efforts to develop standards.
These efforts could be facilitated via new standards bodies or with the cooperation and under the tutelage of existing standards bodies such as ANSI, ISO, TPC, and W3C. The committee members of these standards organizations have tremendous experience in developing excellent standards. They can skillfully navigate the bumpy road to achieve consensus across participants who otherwise compete. But it is up to the end users and providers to apply the pressure. Do we think we can start a movement to do so?