What Problem Is DataOps Trying To Solve?

Peng Feng
Summary: In this article, we give a brief overview of DataOps and why it matters to every company that wants to get real value out of its data.


I started using Hadoop at my first job (Ask.com) in 2008, when the company had to shift to Hadoop because its expensive Oracle clusters could no longer handle the growing analysis workload. Then, as a data engineer at Twitter, I was on the front line, promoting the use of data to power almost all of its products (rather than "big data," I prefer to simply call it "data"). Since 2008, I've seen the power of data and how it affects the world. If you have read about how Cambridge Analytica influenced the 2016 US election, you will have felt the significance of this change.


However, more than ten years after the buzzword 'Big Data' emerged, it seems to work for only a few companies. In Silicon Valley, almost every unicorn uses big data extensively to drive its success. In China, companies like BAT have mastered the art of big data, and super unicorns like ByteDance are built largely on big data technology, yet jokes about how hard big data is to use still abound. The sad truth is that for most companies, big data remains either a buzzword or something genuinely difficult to implement. Fortunately, a new discipline is emerging as the key to bringing data capabilities to the average company: DataOps.


The name is deliberately reminiscent of DevOps, and so is its role in software development: DataOps is how data engineers aim to simplify the use of data and truly make data the engine of business success. Today, we'll give a brief overview of DataOps and why it's important to every company interested in getting real value out of its data.

What Is DataOps?

The definition of DataOps in Wikipedia is:

DataOps is a set of practices, processes, and technologies that combines an integrated and process-oriented perspective on data with automation and methods from agile software engineering to improve quality, speed, and collaboration and promote a culture of continuous improvement in the area of data analytics.

The DataOps page on Wikipedia, created in February 2017, provides a detailed description of this new discipline. The definition of DataOps is certain to evolve, but its key objective is to improve the quality and shorten the cycle time of data analysis.


In Gartner's 2018 Hype Cycle for Data Management, DataOps appeared for the first time, in the initial "Innovation Trigger" stage. By the 2021 curve, it had moved rapidly toward the "Peak of Inflated Expectations." In the meantime, several venture-backed startups building data products around the DataOps concept have emerged in Silicon Valley, such as Fivetran and Airbyte, which focus on data integration; dbt, which focuses on SQL development management; and Astronomer, which centers on scheduling.

(Figure: Gartner Hype Cycle for Data Management, 2021)

DataOps can lower the barriers to data analysis, but it does not make it an easy task. Implementing a successful data project still requires a lot of work, such as a deep understanding of the relationship between data and business, good data usage practices, and cultivating a company's data-driven culture. However, DataOps is expected to significantly increase efficiency and lower the barrier to using data. Companies can start to use data faster, earlier, and better, and with lower costs and risks.


Most applications of big data can be categorized as AI (artificial intelligence) or BI (business intelligence). AI, in this context, refers to a broad range of artificial-intelligence functions, including machine learning, data mining, and other techniques for deriving previously unknown knowledge from data. BI is more about using statistical methods to aggregate large amounts of data into simpler reports that are accessible and understandable. In short, AI uses algorithms to compute new things from data, while BI counts numbers that people can understand.
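
To make the distinction concrete, here is a minimal Python sketch in which the dataset and column names are invented for illustration: the BI half aggregates raw events into a readable report, while the AI half fits a model to predict something that is not directly in the data.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical user-activity data; all columns are invented for illustration.
events = pd.DataFrame({
    "region":  ["NA", "EU", "NA", "APAC", "EU", "NA"],
    "visits":  [3, 7, 2, 9, 4, 6],
    "spend":   [10.0, 80.0, 5.0, 120.0, 40.0, 55.0],
    "churned": [1, 0, 1, 0, 0, 0],
})

# BI: aggregate raw events into a simple, human-readable report.
report = events.groupby("region").agg(
    total_spend=("spend", "sum"),
    avg_visits=("visits", "mean"),
)
print(report)

# AI: derive something not directly in the data -- a churn prediction.
X = events[["visits", "spend"]].to_numpy()
model = LogisticRegression().fit(X, events["churned"])
print(model.predict_proba([[5, 30.0]]))  # churn probability for a new user
```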


Writing AI/BI programs is not difficult. You can set up a TensorFlow face-recognition demo in a few hours, and producing some figures in MATLAB or Excel is not hard either. The problem is that to use the results to support user-facing products, or to decide the fate of your company based on these magic numbers, you need far more than manual, one-off work.

A survey by Dimensional Research found that the following issues are the most difficult for companies wanting to implement big data applications:

  • Ensuring data quality
  • Controlling costs
  • Meeting business needs and expectations
  • Quantifying the value of big data projects
  • Hiring people with big data expertise
  • Fixing performance and configuration issues
  • Choosing the right data framework
  • Insufficient technical resources
  • Maintaining operational reliability
  • Projects taking longer than expected
  • Too many technologies or vendors to manage
  • Providing access to more consumer data
  • Difficulty producing operational insights
  • Complex troubleshooting and debugging

Another study, by Google, found that in most machine learning projects only about 5% of the time is spent writing ML code; the other 95% goes into setting up the infrastructure needed to run that code.

From these two studies, it is easy to see that much of the hard work is not writing code. Preparing the entire infrastructure and running production-grade code efficiently is very time-consuming and often comes with various risks.


In the Google study, the authors quote my former colleagues Jimmy Lin and Dmitry Ryaboy (of the Twitter analytics team): much of this work can be described as "data plumbing." DataOps makes the plumber's job easier and more efficient.

DataOps Target Functions

DataOps aims to reduce the overall cycle time of data analysis. Therefore, from building the infrastructure to using the results, a DataOps implementation typically needs to provide the following functions:

  • Deployment: This covers both infrastructure and applications. Configuring a new system environment should be quick and easy regardless of the underlying hardware, and deploying a new application should take seconds rather than hours or days.
  • Maintenance: Scalability, availability, monitoring, recovery, and reliability of systems and applications. Users should not have to worry about maintenance and can focus on business logic.
  • Governance: Data security, quality, and integrity, including auditing and access control. All data is managed coherently and in a controlled manner, in a secure environment that supports multi-tenancy.
  • Availability: Users should be able to select the tools they want for their data, run them easily, and develop applications as needed. Support for different analytics/ML/AI frameworks should be integrated into the system.
  • Production: Analysis programs can easily be turned into production applications through scheduling and data monitoring; a production-grade data pipeline, from data extraction to data analysis, can be built; and consuming the data should be easy and managed by the system (see the sketch after this list).
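
To illustrate the Production point above, here is a minimal sketch of a scheduled pipeline in Apache Airflow (the scheduler at the core of Astronomer, mentioned earlier). The DAG name and the extract/transform steps are hypothetical placeholders, not a real pipeline:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder: pull raw data from a source system.
    print("extracting raw data")


def transform():
    # Placeholder: clean and aggregate the extracted data.
    print("transforming data")


with DAG(
    dag_id="daily_sales_report",      # hypothetical pipeline name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",       # the scheduler runs it once per day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task    # extract must finish before transform
```

Once a pipeline is expressed this way, the scheduler rather than the analyst takes care of retries, monitoring, and logging on every run, which is exactly the path from code to production that the DevOps analogy below describes.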

In short, the approach is similar to DevOps: the path from writing code to production deployment, including scheduling and monitoring, should be handled by the same person and follow the system's management standards. And just as DevOps provides standard CI, deployment, and monitoring tools for rapid delivery, standardizing a large set of big data components lets newcomers quickly build production-grade big data platforms and unlock the value of their data.

DataOps Methodology

The main methodology of DataOps is still in a phase of rapid development. Companies such as Facebook and Twitter often have a dedicated data platform team that handles data operations and implements data projects. However, their implementations are mostly tied to the company's existing ops infrastructure and are therefore not usually applicable to others. We can learn from their success and build a common big data platform that every company can easily adopt.


To build the common platform needed for DataOps, we think the following technologies are needed:

  • Cloud architecture: We must use cloud-based infrastructure to support resource management, scalability, and operational efficiency.
  • Containers: Containers are critical in DevOps implementations, and their role in isolating resources and providing consistent dev/test/ops environments is just as critical for data platforms.
  • Real-time and stream processing: Real-time and stream processing are becoming increasingly important in data-driven platforms and should be first-class citizens of a modern data platform (a minimal sketch follows this list).
  • Multiple analysis engines: MapReduce is the traditional distributed processing framework, but frameworks such as Spark and TensorFlow are being used more and more widely and should be integrated.
  • Integrated application and data management: Application and data management, including lifecycle management, scheduling, monitoring, and logging, is critical for a production data platform.
  • Multi-tenancy and security: Data security is arguably the most important issue in a data project: if data cannot be protected, it cannot be used. The platform should provide a secure environment in which everyone can use the data, with every operation authorized, validated, and audited.
  • Dev and ops tools: The platform should provide effective tools for data scientists to analyze data and build analysis programs, tools for data engineers to build big data pipelines, and a way for everyone else to consume data and results.
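
As a minimal illustration of the stream-processing requirement, here is the classic word-count example in PySpark Structured Streaming (the socket source on localhost:9999 is purely illustrative, e.g. fed by `nc -lk 9999`):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# Unbounded stream of text lines arriving on a local socket.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Continuously maintained word counts over the whole stream.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Print updated counts to the console as new data arrives.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```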

Cloud-Native DataOps Application Scenarios


In cloud-native scenarios, the core requirement enterprises have of a big data system is to implement diverse, heterogeneous data applications quickly and efficiently in a unified environment, respond agilely to business needs, and manage the entire data application lifecycle.


Small and medium-sized companies want to use a DPaaS (Data Platform as a Service) directly in the public cloud: available out of the box, with no operations and maintenance burden, paid for as a service. At the same time, many reference data applications are available for direct use, and the resulting applications can support enterprise production decisions. If a private deployment is later required, the company can migrate quickly.


For medium and large enterprises, building a cloud-native big data platform on a public or private cloud reduces the complexity and cost of operations and maintenance through standardized components, speeds up the development of data applications through DataOps tooling, and improves resource efficiency through resource consolidation and finer-grained scheduling.


For large conglomerates, building a private Data Platform as a Service on a private or hybrid cloud lets business units share data platform capabilities on a multi-tenant basis, avoiding duplicated effort. At the same time, it unifies data development processes and standards, avoids data silos, improves data sharing, enables application isolation and resource accounting between internal departments, and improves the ROI of data.


Typical data application scenarios that can be built with cloud-native DataOps include:

  • Data integration and interactive queries.
  • Real-time big-screen displays.
  • Data-driven applications.
  • Data API services (a minimal sketch follows this list).
  • Machine learning models.
  • BI reporting.
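
As a sketch of the "data API services" scenario, the snippet below exposes a precomputed metric over HTTP with FastAPI; the metric store, endpoint path, and numbers are all hypothetical:

```python
from fastapi import FastAPI, HTTPException

app = FastAPI()

# Hypothetical precomputed metrics, e.g. loaded from a warehouse table.
DAILY_ACTIVE_USERS = {"2022-10-18": 10432, "2022-10-19": 11120}

@app.get("/metrics/dau/{date}")
def daily_active_users(date: str):
    # Serve one day's metric, or 404 if it has not been computed yet.
    if date not in DAILY_ACTIVE_USERS:
        raise HTTPException(status_code=404, detail="no data for this date")
    return {"date": date, "dau": DAILY_ACTIVE_USERS[date]}
```

Saved as app.py, this runs with `uvicorn app:app`, and a query to /metrics/dau/2022-10-19 returns the metric as JSON.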

The cloud-native DataOps back-office support and management system should cover global data users, data quality inspection and management, data application scheduling, and multi-tenant resource accounting.

Conclusion

Current big data technologies are powerful but still too difficult for ordinary users. Deploying a production-ready data platform remains a daunting task, and even at companies that have started the journey, data platform teams still spend most of their time on the same plumbing work.


Some companies are already aware of these issues and are starting to take a different approach. More and more enterprises are adopting container-based solutions, and traditional Hadoop-centric platforms are gradually migrating to cloud-native systems.


But the easiest way for businesses to adopt cloud-native DataOps is to find the right tools to support the DataOps methodology.


--


About Author :


Peng Feng, co-founder and CEO of SmartCloud, has over 20 years of experience in software development, big data, and cloud computing. He was a big data architect and technical lead at Twitter, an engineering director at Ask.com, and an angel investor in Silicon Valley. He holds a Ph.D. in Computer Science from the University of Maryland, USA, and a BSc and MSc in Computer Science from Wuhan University.
