Chapter 1. Data Analytics in the Modern Enterprise
When you were growing up, did you ever wonder why you had to study history in school? I did. But it wasn’t until I was much older that I realized history’s importance. Understanding how things were in the past and where they are now allows us to build on them and anticipate what is to come next—and, in some cases, to also influence what is to come next. As I’m writing this, there is a lot of buzz about AI and machine learning (ML). But how did we go from simple calculations to chatbots conversing coherently with humans?
From statistical extrapolation of sales figures to planning stocks based on predicted sales, enterprises have come a long way. This journey from data producing to data consuming and finally data driven is what I explore in this chapter. First, I will delve into the details of how data analytics has evolved over time, what predictive analytics is at its roots, and what its place is in the analytical world. Then, I will discuss its role in data modeling and ML and the tools available today that can help you with data modeling and ML.
The Evolution of Data Analytics
If you still like shopping in person, you know that shopping is as much about the experience as it is about the actual purchase. Experienced sales professionals understand this. What separates the good salespeople from the great ones is their ability to effectively read customers, ask them questions, and understand and direct them toward a successful sale.
The moment you enter a shop you are eyed by a salesperson on the floor. Once they have established that you are not just window-shopping, they approach you politely and ask what you’re looking for. The good ones will direct you to that section, but the great ones will ask you a second and maybe a third question. Let’s imagine such a conversation:
You (Y): “I am looking for a formal shirt.”
Great Salesperson (GS): “Awesome, let me help you with that. May I ask what occasion you are shopping for in order to help you better?”
Y: “Oh, it’s the annual sales kickoff at our office, and I want to look sharp.”
GS: “Thanks for sharing that. Since it’s for a formal gathering, might I recommend blues and pinks or something in classic white?”
Y: “Classic white is more my jam.”
GS: “Excellent choice. Let me show you our collection of formal white shirts.”
Y: “Great!”
GS: “Since this is for a formal meeting, would you be open to looking at some formal trousers? I have some really great ones in khaki that will complement the shirt you’ve chosen.”
Y: “I wasn’t planning on it, but let’s have a look.”
GS: “Might I suggest the slim-fit ones? Perhaps you should try the combination first, before you make up your mind.”
Y: (After putting on the trousers and shirt) “I like it. It looks good.”
GS: “I couldn’t agree more.”
Y: “Great, that will be all then.”
GS: “Sure, let me help you with that. Before you change back into your clothes, can I just ask you for one final indulgence? There’s a new collection of lightweight formal shoes that I feel will complete the look you’re going for. I can quickly bring you a pair in your size, and you can try them on and see how you feel about them.”
Y: “I’m really not looking for shoes, but I can spend another minute, I guess.”
GS: “Awesome, how about this beige pair with a touch of maroon…”
Most of us are all too familiar with going into a clothes store to buy one thing and ending up buying three. The same concept applies whether you are buying a car, hunting for furniture, or shopping for groceries. While the order might change, the fundamentals remain the same: read the customer, ask questions, understand the requirement, and direct the sale.
Now let’s step into the world of ecommerce. Think Amazon, eBay, or your favorite local ecommerce store. Open the app or the website and you have stepped into their shop. The platform is your very “smart” salesperson. In most cases, it already knows your name, your age, where you live, and your preferences, because of your purchase history. Based on all this, the platform personalizes your experience. This could involve a multitude of things, from color themes to targeted product placement to discount codes for your favorite brands. The idea is to show you more of what you are likely to buy. Now imagine a physical store transforming to your preferences as you enter it. This level of personalization in the sales process is unprecedented in the brick-and-mortar world of shopping. As you search, filter, and browse the mobile app, your actions are the equivalent of your body language in person. What categories are you browsing? How much time did you spend on a particular item? Are there any items that you have added to your cart? These and many other metrics are constantly being evaluated to provide you with a unique shopping experience, make the next sale, increase the size of your shopping cart, and compel you to come back later for more.
The next time you are shopping online, notice the “Frequently bought together,” “Customers who viewed this item also viewed this,” and “What other items do customers buy” sections. These are all examples of personalization of content based on your activity history, the behavior of users similar to you, and demographics, among other things. The platform understands your needs, interacts with you, and directs you. At this point, you are probably wondering one of two things: “What does this have to do with predictive analytics?” or “Why don’t we stick to the great salesperson instead of these heartless platforms?” The answer to the first question comes much later in the book, but I will try to answer the second question right now.
Imagine a physical store that has virtually unlimited capacity in terms of the products it can sell, somewhat like a magic store that can keep growing (Figure 1-1). Now imagine you have a genie that can sift through all these products to get you what you want in a matter of seconds. The genie is your friend and knows your likes and dislikes. The genie is also friends with other genies and therefore knows what people similar to you are buying these days. The really good genies can find you the best price for what you are looking for. Shopping at this magic store gives you magic coins that you can use to buy other things. When you’re finished shopping you don’t need to wait in a queue at the checkout counter. You just say the word and the items are paid for. Last but not least, you don’t need to carry those shopping bags home. The genie will get them to your home in a jiffy.
Wouldn’t that be great? Now stop imagining, because what I just described is your everyday ecommerce platform. Add to that same-day delivery, easy returns, and being able to shop from the comfort of your bedroom 24/7, and you’ll never want to go back to shopping at a mall again.
But did all this happen overnight? Of course not. As early as the late 19th century, governments started collecting census data. They wanted to know more about the population so that they could plan for taxation, food distribution, and army recruitment, among other things. Fast-forward a few decades and the same data was being used to figure out where to build the next road or how to optimize power distribution. Governments realized that subsequent collections of data could be compared with previous collections to understand population growth and help with planning.
Now consider that industrial workers have been clocking in and out of work long before computers were invented. While the primary reason for this form of data collection was to calculate workers’ daily wages, business owners later realized they could use this data for further analysis—comparing, for example, the production efficiency of two different factory locations or figuring out optimal workforce timing to maximize production. At the time, the main sources of data were the workforce, inventory, and revenue (Figure 1-2).
With the advent of computers, the way in which data was collected, stored, and processed was profoundly impacted. While the first computers were mainly focused on computation, soon they were being used to not only process information, but store it as well. Some years later, information storage moved from flat files to databases. While there were a few initial attempts to build databases, a clear breakthrough occurred in the early 1970s in the form of relational databases.1 Relational databases allowed users to store data in the form of entities and relationships, while optimizing for storage. Most of them initially supported Structured Query Language (SQL). This standard language allowed users to easily explore and manipulate (relatively) large amounts of data stored in relational databases. Common data sources soon expanded to include employee, operational, and financial data in these databases. Organizations could extract monthly revenue reports or find the total sales of a certain product during a particular season.
As the internet became mainstream, computers became connected to one another, and users began interacting with their own computers as well as with other computers on the World Wide Web. The dot-com boom saw the advent of search engines like Yahoo! and Google, which collected large amounts of HTTP traffic and user click data. Searching everything on the web was not the same as querying a structured database, so companies came up with a different breed of database: developers needed a better way to handle data that was neither as well structured as an enterprise application's nor centralized in a single location or enterprise. This family of databases came to be known as NoSQL.
The need to search large bodies of loosely structured text gave birth to dedicated search engine databases. Governments wanted to store and process census data with genealogical references, and that led to graph databases. Large amounts of clickstream logging data led to time series databases. These databases were as different from each other as they were from relational databases, but each served its niche use case and was designed for storing and processing large amounts of data.
Around this time, organizations realized they needed to collect all the data in the different databases and bring it together in the form of a centralized decision-making system. This was the origin of what we know as data warehouses today. Simply put, data warehouses brought data in from multiple sources, transformed it, and made it available to business users for analysis and decision making. As the data grew, so did the infrastructure requirements for data warehouses, to the point that it was no longer economically feasible to invest in these mammoths. This gave rise to horizontally scaling distributed storage and processing frameworks that allowed data collection and processing at unprecedented scales.
Somewhere between relational databases and distributed frameworks, organizations started utilizing data to drive businesses. Business intelligence (BI) practices evolved, and several tools were created to allow businesses to derive actionable insights from their data. Furthermore, organizations started to analyze large datasets to identify patterns so that they could predict customer and business needs. While the internet opened up the world of data, processing these large amounts of data required significant investment in compute and storage infrastructure. This meant that only very large organizations and governments could really tap into the true power of data. Public cloud providers such as Amazon Web Services (AWS), Google Cloud Platform (GCP), Azure, and Alibaba Cloud removed the entry barriers to collecting, transforming, and analyzing large sets of data. Infrastructure investment moved from a capex (capital expenditure) model to an opex (operational expenditure) model, and we saw further optimizations in compute and storage subsystems.2,3 New data sources today include social media platforms, mobile user data, and Internet of Things (IoT) data, among others. Now that infrastructure is commoditized, any organization (big or small) is able to process large amounts of information to make data-driven decisions.
Ever-improving connectivity and democratization of infrastructure have allowed for tremendous advancements in the fields of AI and ML in the past decade. AI/ML is now being used to map the stars in the universe, build robots, power self-driving vehicles, predict fraud in financial transactions, and build interactive chatbots for companies, among many other applications. While there was a clear distinction between operational and analytical workloads, these lines are now blurring. Data-driven decision making dictates that data analysis is no longer a sidecar running batch reports. It is an integral part of the daily operations of any modern enterprise. The journey from punch cards to AI has been long and arduous and is a culmination of the continued work and dedication of some of the most brilliant minds in history.
Different Types of Data Analytics
You can think of data analytics as the (art and) science of talking to your business. It helps us understand our business better by giving us insights into what happened in the past, what might be happening right now, and why certain things might have happened. Armed with this information, analytics can also help us (within scientific probability) gauge what is likely to happen and figure out what to do next. Organizations employ several data analytics practices to collect, process, analyze, and store their data (Figure 1-3). Depending on the literature that you are referring to, these might vary; however, the ones that appear in most places are the following:
The past (and present):

- Descriptive analytics helps us understand what happened.
- Diagnostic analytics helps us identify why something might have happened.

The future:

- Predictive analytics can help us determine what might happen next.
- Prescriptive analytics is about what we can/should do next.4
Note
My frequent use of terms such as might and likely is deliberate. It does not mean we are talking about something scientifically inaccurate. It’s to drive home the point that most of what we are discussing is based on scientific probabilities. Therefore, it’s important not only to understand the data and its conclusions but also to be able to figure out the level of accuracy of our deductions.
Descriptive Analytics
I’m baffled by the amount of time my doctors spend browsing and typing on their computers versus talking to me. And what really grinds my gears is that I can’t see what they are doing.
Let’s analyze a typical medical appointment. The nurse (or assistant) asks for your name and identification. This gives them access to your registration details and medical history. Then, they ask you to accompany them to an exam room where they ask you the reason for your visit. They check your weight, height, blood pressure, and any known allergies, and then you patiently wait to see the doctor. Once the doctor arrives, they usually ask you about your symptoms, when they started, their frequency, and their intensity. You can think of this entire phase as historical and current data collection.
After this, there is a bit (or a lot) of computer time involved (depending on the doctor), at the end of which the doctor either gives you a potential diagnosis or recommends further tests. Medical professionals have access to medical data that is classified by age, demographics, gender, and predisposition to other medical conditions, among other things. For example, they can check your height and weight and plot you on a body mass index (BMI) chart to see whether you are in the healthy range. They can check your blood pressure and see whether you are at risk for hypertension. They can combine these results to see whether you are at risk for heart disease. You can think of this as trend analytics that allows them to map your health.
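As a tiny illustration of the kind of calculation behind that BMI chart, here is a sketch using the standard formula and the commonly published adult ranges; the patient numbers are made up and purely illustrative:

def bmi(weight_kg, height_m):
    # Body mass index: weight (kg) divided by height (m) squared.
    return weight_kg / height_m ** 2

def bmi_category(value):
    # Standard adult BMI boundaries; illustrative, not diagnostic.
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "healthy range"
    if value < 30:
        return "overweight"
    return "obese"

patient_bmi = bmi(82, 1.78)
print(round(patient_bmi, 1), bmi_category(patient_bmi))  # 25.9 overweight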
Note
Recording and understanding your blood pressure and other health metrics is part of descriptive analytics. However, mapping these metrics to make predictive conclusions about your health goes beyond the realm of such analysis.
Another great example of data analytics and trend analysis can be found with streaming platforms such as Netflix, Amazon Prime, and OSN+. While the details of implementation and the accuracy of results might vary, most of these platforms collect large amounts of user activity data to understand the content users are watching in various demographics and during a certain time period. This data is then analyzed to make recommendations to users in various geographies and age groups on currently trending content. The data can be further processed to understand what kinds of content, genres, and actors, among other things, are being favored by audiences at this time. This can help with decision making with regard to continuation of existing contracts, future productions, and marketing campaigns.
Descriptive analytics involves analyzing historical and current data to understand the trends around what has happened. Various techniques are used to understand these trends. You can aggregate or roll up the data into summaries, such as averages across large datasets. You can drill into the details of the data, take a slice of the data (e.g., Netflix usage) to understand the most-watched series in a particular month, or further dice the data by creating smaller datasets from larger ones. As we have already seen, this not only allows trend analysis for better decision making but also helps identify outliers for detecting anomalies. We saw a more personalized example of this in the medical use case. However, heavy use of descriptive analytics can also be found in other verticals, such as automated fraud detection in the financial industry.
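To make slicing, dicing, and rolling up concrete, here is a minimal sketch using pandas on a handful of made-up viewing records (the titles, countries, and hours are hypothetical):

import pandas as pd

# Hypothetical viewing-history records, loosely modeled on the streaming example.
views = pd.DataFrame({
    "title":   ["Show A", "Show A", "Show B", "Show C", "Show B", "Show A"],
    "genre":   ["Drama", "Drama", "Animation", "Drama", "Animation", "Drama"],
    "country": ["US", "DE", "US", "US", "DE", "US"],
    "month":   ["2023-07"] * 4 + ["2023-08"] * 2,
    "hours":   [1.5, 2.0, 0.5, 3.0, 1.0, 2.5],
})

# Roll up: total hours watched per genre (aggregation).
print(views.groupby("genre")["hours"].sum())

# Slice: most-watched titles in a particular month.
july = views[views["month"] == "2023-07"]
print(july.groupby("title")["hours"].sum().sort_values(ascending=False))

# Dice: a smaller dataset cut by both country and genre.
print(views[(views["country"] == "US") & (views["genre"] == "Drama")])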
Diagnostic Analytics
Once we understand what happened (or is happening), the next logical step is to figure out why it happened. Diagnostic analytics answers this question. The analysis can be in the form of a root cause analysis, or it can involve analyzing trends to understand the relationships between different but seemingly related occurrences. Diagnostic analytics can be used to draw better conclusions about why something happened.
It is important to differentiate causality from correlation. Two (or more) variables are correlated if a relationship can be established in the way they change. In statistical terms, two variables are positively correlated if they move together in the same direction: if one increases, the other increases with it. Two variables are negatively correlated if they move in opposite directions. Correlation, however, does not mean that one variable is causing the change in the other. For example, rainy weather is often described as gloomy: some people feel sad or low when it rains, so we can establish a correlation between rainy weather and people feeling down. But concluding that rain causes depression would be incorrect. Scientifically speaking, exposure to sunlight triggers the release of serotonin in the brain, a hormone associated with improved mood and focus. A more plausible explanation, then, is that the reduced sunlight on rainy days lowers serotonin levels, which in turn dampens mood. Diagnostic analytics aims to understand the relationships between variables and trends to establish correlation and, where possible, causation.
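A quick sketch of how you might measure such a correlation, using made-up daily observations of sunlight hours and a self-reported mood score:

import numpy as np

# Hypothetical daily observations: hours of sunlight and a self-reported mood score.
sunlight_hours = np.array([1, 2, 3, 5, 7, 8, 9, 10])
mood_score     = np.array([4, 5, 5, 6, 7, 8, 8, 9])

# Pearson correlation coefficient: close to +1 means the two move together.
r = np.corrcoef(sunlight_hours, mood_score)[0, 1]
print(round(r, 2))  # strong positive correlation, but this alone proves no causation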
Diagnostic analytics can be performed manually, as in the preceding example where we analyzed the relationship between rain and people’s mood; by using regression (e.g., determining the relationship between marketing spend and the total sales of an item in retail to identify important ranges and outliers); or by using data analysis tools such as Microsoft Excel. Going back to the medical use case, if a doctor has established that a patient has hypertension, they would then strive to understand why the patient has hypertension. When doing so, they have access to tools such as common questions to ask, common causes of hypertension, and complications related to hypertension. This analysis helps the doctor get to the root cause of the problem for which hypertension is the symptom. While medication can help a patient keep their blood pressure in check, every doctor will tell you that the proper way to do that is for the patient to make lifestyle changes that address the root cause of their high blood pressure.
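Here is a minimal sketch of the regression idea using NumPy, with hypothetical marketing-spend and sales figures:

import numpy as np

# Hypothetical monthly figures: marketing spend (in thousands) and units sold.
spend = np.array([10, 15, 20, 25, 30, 40, 50])
sales = np.array([120, 150, 190, 220, 260, 330, 400])

# Fit a simple linear regression: sales ≈ slope * spend + intercept.
slope, intercept = np.polyfit(spend, sales, deg=1)
print(f"sales ≈ {slope:.1f} * spend + {intercept:.1f}")

# Flag the month whose actual sales deviate most from the fitted line (a potential outlier).
residuals = sales - (slope * spend + intercept)
print("largest deviation:", residuals[np.argmax(np.abs(residuals))])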
Looking back at the Netflix example, once Netflix has figured out what titles are the most watched in a month, it would want to understand why this is the case. For example, it might realize that in the summer months the titles from the animation genre are the most watched. A root cause analysis might uncover that this is because most school-age children are at home in the summer. Over time, such an analysis can be automated and fed into Netflix’s decision-making process with regard to planning and launching new titles and continuation (or discontinuation) of older ones.
Predictive Analytics
Predictive analytics is a branch of data analysis that deals with using historical and current data to predict future events. We have not yet created machines that can predict the future. However, we are inching toward a distant future that just might allow us to do that. Coming back to reality, the predictions could be a simple binary; that is, the likelihood of something happening or not. Or they could be more complex, such as predicting a selling price of fuel that optimizes profits based on location and weather conditions. Predictive analysis can be done the old-school way, by analyzing data manually; or it can be done using predictive data modeling and ML. It is one of the most critical data analytical practices employed by organizations to collect, process, and analyze their data. The idea is to use data to understand the business better and produce actionable insights that will lead to prescriptive analytics.
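As a taste of the "simple binary" case, here is a minimal sketch that fits a logistic regression on a handful of made-up patient records; it is purely illustrative and not a clinical model:

from sklearn.linear_model import LogisticRegression

# Hypothetical history: [systolic blood pressure, BMI] and whether heart disease developed.
X = [[120, 22], [135, 27], [150, 31], [118, 24], [160, 33], [142, 29], [125, 23], [155, 35]]
y = [0, 0, 1, 0, 1, 1, 0, 1]  # 1 = developed heart disease

model = LogisticRegression().fit(X, y)

# Predict the likelihood for a new (hypothetical) patient.
new_patient = [[148, 30]]
print(model.predict(new_patient))        # predicted class: 0 or 1
print(model.predict_proba(new_patient))  # probabilities for each class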
Tip
As we talk about the various forms of analytics, it is important to understand that organizations are at different stages in their analytics journey. These broad categories mostly feed into each other and will usually have some form of overlap. When working with deep technology, it often helps to go back to the business problem you are trying to solve and then to figure out the best way to achieve the solution. Just as you would not use a tank to kill a fly, it’s not always better to use the most complex analytics solution.
Now let’s discuss the details of the business problem that predictive analytics helps solve, the different technologies that data analysts and organizations can leverage, and how to set up predictive analytics for various use cases. Continuing with our medical use case, if the doctor learns that their patient has high blood pressure caused by a high-stress lifestyle, the doctor will more often than not investigate other factors contributing to the patient’s high blood pressure. For example, a patient who is overweight and has high levels of cholesterol along with hypertension is at a high risk of developing heart disease. This prediction can help the doctor treat the patient’s hypertension as well as develop a holistic approach to improving the patient’s overall health.
Businesses like Netflix take this one step further. By understanding customer behavior on the platform, Netflix can predict what the customer is likely to watch next. Content personalization is one of the major use cases for predictive analytics that we’ll dig much deeper into in later chapters. Simply put, companies such as Netflix may analyze what kind of content the user watches, how much time they spend on the platform and on different categories, and their age group to determine, using predictive modeling, what kind of content the user is likely to watch. If you were to observe the content on different Netflix profiles within the same household, you would realize that the recommended content varies among the profiles. This is because of deep content personalization, which caters to users’ individual needs and preferences. If Netflix were only using descriptive analytics for personalization, we would see very similar content across users within the same geography.
Prescriptive Analytics
Prescriptive analytics is the last leg of data-driven decision making. It helps businesses determine the optimal next step. To answer the question of what to do next, businesses evaluate multiple variables to chart the best course of action. Like the other forms of analytics, prescriptive analytics can be performed manually with analytical tools, or it can be automated using algorithms.
When we address a problem, we are working with a certain set of limitations and constraints. A problem might have more than one feasible solution. If these solutions are limited in number, it is easy to go through them one by one to figure out which one produced the best outcome within the given constraints. There are, however, some problems that are unsolvable, and for those, one needs to relax the limitations and constraints to solve them. Of the solvable problems, there are those that are optimally solvable, meaning we can say with mathematical certainty that a particular solution is the best. Optimization techniques that help achieve this include linear, nonlinear, and stochastic mathematical programming.
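Here is a minimal sketch of the linear programming idea using SciPy, with a made-up staffing problem and hypothetical costs and limits:

from scipy.optimize import linprog

# Toy staffing problem: choose full-time (x1) and part-time (x2) hours
# to minimize cost while meeting a coverage requirement.
#   minimize  25*x1 + 15*x2        (hourly cost)
#   subject to x1 + x2 >= 120      (at least 120 staffed hours)
#              0 <= x1 <= 80, 0 <= x2 <= 60
c = [25, 15]
A_ub = [[-1, -1]]   # -x1 - x2 <= -120 encodes x1 + x2 >= 120
b_ub = [-120]
bounds = [(0, 80), (0, 60)]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(result.x, result.fun)  # optimal staffing mix and its total cost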
When a method can provide a solution but we are unable to confirm that it is the best feasible solution, the problem is understood to be heuristically solvable. This means we can keep finding progressively better solutions, but we cannot confirm whether any of them is the best one. Heuristic problem-solving can include simulation modeling.5 Simulations can be thought of as computer representations of real-world scenarios. While a simulation is a representation of reality, it is nearly impossible to capture all the nuances and variables of the real world. It is nonetheless very effective at estimating outcomes based on current conditions or potential future actions. For example, a model can simulate the spread of a disease across a certain geography with reasonable accuracy so that preventive or corrective actions can be taken in advance.
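A very rough sketch of the simulation idea in plain Python follows; the population, contact, and transmission numbers are invented, and the model is far too simple for real epidemiology:

import random

def simulate_outbreak(population=10_000, initially_infected=5, contacts_per_day=8,
                      p_transmit=0.03, days_infectious=7, days=60, seed=42):
    # Very simplified stochastic model of disease spread in a closed population.
    random.seed(seed)
    susceptible = population - initially_infected
    infected = initially_infected
    recovered = 0
    for _ in range(days):
        new_infections = 0
        # Each infected person meets a number of random people every day.
        for _ in range(infected * contacts_per_day):
            # A contact transmits only if it lands on a susceptible person.
            if random.random() < (susceptible / population) * p_transmit:
                new_infections += 1
        new_infections = min(new_infections, susceptible)
        new_recoveries = infected // days_infectious
        susceptible -= new_infections
        infected += new_infections - new_recoveries
        recovered += new_recoveries
    return {"still_infected": infected, "recovered": recovered,
            "never_infected": susceptible}

print(simulate_outbreak())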
Going back to our patient who is likely to develop heart disease, the doctor would probably conduct a few more tests and interviews to determine the best course of action. Note that the course of action is dependent not only on our prediction of the patient developing heart disease, but also on the limitations and constraints of the patient and the doctor. These limitations can be financial, such as the kind of insurance the patient has; or they can be medical, such as other conditions that might prevent the patient from taking certain medications. Based on this, the doctor puts together a treatment plan that is optimized for the patient.
You may ask what Netflix (or other companies) could still do that it hasn’t already done. It figured out what a user is most likely to watch, so why not just propose that content? Well, if the business problem it was trying to solve was to ensure that it showed content the viewer is likely to watch, and if the accuracy of its prediction was high enough, this would be one feasible solution. However, there are many other questions Netflix needs to answer. For example: “If we push the exact content that we predicted, will it maximize the time the user spends on the platform?” “If we maximize this time, will it translate into increased revenue?” “Are we trying to maximize revenue?” “Are we trying to increase the user base?” “What content needs to be pushed to optimize sponsors?” And so on. Based on this, the platform can attempt to find the most optimized content that should be pushed to the user.
Knowledge Acquisition, Machine Learning, and the Role of Predictive Analytics
When driving, your natural reaction when you see a pedestrian is to slow down. The decisions of whether to stop or continue driving and what is a safe distance to be from the pedestrian depend on a number of factors. Is it a child, adult, or senior citizen? Are they looking down at the pavement or looking at the road, waiting to cross? What does their body language tell you? Is there a pedestrian crosswalk in sight? These are just some of the many questions the mind considers and for which it computes the answers to predict the behavior of the pedestrian and decide accordingly.
This process is based on knowledge. In this case, such knowledge is mostly acquired during your initial training from a driving instructor, academic training, and practice. But does your learning stop once you acquire a driver’s license? Probably not. As you continue to drive, you keep learning and improving your skills by exploring different terrains, driving at different times of the day, experiencing different traffic conditions, and so on; hence the term experienced driver. This seemingly trivial activity helps us understand some of the types of knowledge we acquire:
- Knowledge we are born with (instincts and behaviors)
- Knowledge we gain from institutions, books, conversations, and the internet (learning from the experiences of others)
- Knowledge we acquire from our own experiences
- And more
Our dependence on machines is ever increasing. From simple, everyday tasks such as buying groceries to complex tasks such as driving from point A to point B, we see the use of machines everywhere. There are smartphones, smart vacuums, smart cars, and even smart refrigerators. We are surrounded by them. If a machine (and by “a machine” I mean anything programmable) needs to survive, it is no longer enough for it to just keep doing what it does best. There is an inherent need for the machine to acquire knowledge.
Machine learning is our attempt to re-create, in machines, the process of learning and improving. In machine learning, algorithms are trained to make predictions. These predictions can be as simple as a true-or-false classification, such as whether an email message is or isn't spam, or as complex as an estimate of the effective lifespan of a piece of industrial equipment. These predictions form the basis of actionable insights that eventually drive business decisions. But how do machines actually learn? Consider the following three components (a brief code sketch follows the list):
- Decision process: This takes the input values, applies a series of calculations (you can think of them as recipes), and then attempts to make predictions using patterns within the data.
- Error function: This assesses the prediction for its accuracy. When known examples are available (mostly in the form of historical data), the error function can help evaluate whether the prediction was correct and, if it wasn't, how far off the prediction was from the actual values.
- Optimization: This takes the known examples and the predicted values and tweaks the model (we will talk about what this tweaking means later) to minimize the divergence between them. This process can be automated and continuous, allowing the model to achieve a certain level of accuracy.6
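To make these three components concrete, here is a minimal sketch (not a production recipe) of a one-parameter model trained with gradient descent on made-up data:

# Minimal illustration of the three components: a one-parameter model y = w * x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]  # (x, observed y), roughly y = 2x

w = 0.0              # the model's single tweakable parameter
learning_rate = 0.01

for step in range(500):
    # Decision process: use the current model to make predictions.
    predictions = [(x, w * x, y) for x, y in data]

    # Error function: mean squared difference between predictions and known labels.
    error = sum((pred - y) ** 2 for _, pred, y in predictions) / len(data)

    # Optimization: nudge w in the direction that reduces the error (gradient descent).
    gradient = sum(2 * (pred - y) * x for x, pred, y in predictions) / len(data)
    w -= learning_rate * gradient

print(round(w, 2), round(error, 4))  # w converges to roughly 2.0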
What I just described is known as supervised learning. Supervised learning uses labeled datasets (often known as training data) to train algorithms. Think of these labels as useful tags about the training data. For example, a large sample of data about tumor classification can be used to train an algorithm to predict whether a tumor is malignant or benign. The decision process makes predictions, the error function uses the labels to figure out the error of its ways, and then the optimization function adjusts the algorithm to minimize the difference between the predictions and the labels.
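As a simplified stand-in for the tumor example, here is a short supervised-learning sketch using scikit-learn's bundled breast cancer dataset; it is illustrative, not a clinical workflow:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# A labeled dataset: tumor measurements with a benign/malignant label for each sample.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Training uses the labels to repeatedly correct the model (supervised learning).
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Evaluate on samples the model has never seen.
predictions = model.predict(X_test)
print("accuracy:", round(accuracy_score(y_test, predictions), 3))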
As you might have guessed, the other type of learning in machines is called unsupervised learning. It is used when we do not have access to labeled data. An algorithm identifies relationships and patterns in the data without any external help (such as historical labels or user input). But what happens if you have only a small amount of labeled data? In such cases, an approach known as semi-supervised learning is used. This approach combines a smaller labeled dataset with a much larger set of unlabeled data during training, allowing the algorithm to learn to label the unlabeled data and then produce predictions. Then there is reinforcement learning, where the model is not trained using labeled data. Rather, the algorithm learns much as a child learns new things: through trial and error, with rewards to reinforce successful outcomes and penalties to discourage failures. This feedback allows the algorithm to learn from its own experiences.7
The democratization of machine learning in the past couple of decades, for business analysis and for decision making, has been the driving force behind industry disruptors and allowed companies to achieve unprecedented growth in record time. Circling back to where we started, the argument boils down to how programmable machines acquire and consume data. Drawing a parallel to what we discussed earlier:
- Machines can be programmed with a set of models to operate with.
- They can learn from historical and current data to constantly optimize their operational models (training and machine learning).
- They can learn from experience.
Where the machines have a leg up on us is in their ability to efficiently collect, analyze, and consume very large amounts of data in a very short span of time. Add to this autonomous machine-to-machine communication, and we effectively have a hive where each individual entity is constantly learning and improving from the past and present experiences of others, thus forming the foundation of ever-improving data-driven decision making powered by predictive analytics.
To make data-driven decisions, enterprises need to achieve the following:
- Collect and store large amounts of data from various data sources
- Preprocess this data to:
  - Ensure that the data is collected (or adjusted) for the business context of the problem they are trying to solve
  - Clean, classify, and label the data (when applicable)
- Use the data to understand the history and current state of the business:
  - What happened/is happening
  - Why it happened/is happening
- Use the data to predict the future of the business
- Use these predictions to make data-driven decisions
And they need to do all of that in a loop.
Predictive analytics sits at the heart of this process, driving how a business continuously acquires knowledge through data, training, and experience. It helps convert raw data into knowledge that can then be used to make predictions about the future of the business. So, what are some questions that companies might try to answer?
- Business question: A hospitality chain would like to understand how much staff it should deploy in each of its locations based on the season and the weather.
  Needed prediction: How many guests can the chain expect in each of its locations?
- Business question: A logistics company would like to understand how much stock of each product it should maintain for the coming season.
  Needed prediction: What are the expected sales for each product for the coming season?
- Business question: An electricity provider would like to determine how much energy should be generated for the upcoming cycle.
  Needed prediction: How much power will consumers utilize for the given block in the upcoming cycle?
Without accurate predictions, businesses risk miscalculating these key metrics. If the logistics company were to overstock products, it would run into unwarranted space and labor expenses, not to mention unsold items expiring or becoming obsolete. Likewise, if it were to underestimate product demand, it would lose potential sales and see decreased customer satisfaction. Similarly, overproduction of energy is wasteful and overallocation of staff is a waste of human resources; both translate into an unnecessary financial burden on the business. And not being able to produce enough energy or not having enough staff would directly impact the customer experience. Accurate predictions of these metrics are undoubtedly an operational necessity and a key component of data-driven decision making.
Tools, Frameworks, and Platforms in the Predictive Analytics World
Enterprises have leveraged analytics for a long time, and that is not about to change anytime soon. If anything, analytics is becoming more mainstream day by day. We live in a world where competition is fierce and time is a very expensive commodity. Hence, enterprises are continuously looking for new ways to reduce their time to market for new services and are in a continuous cycle of improving customer experience to stay ahead of the competition. Analytics is one of the tools they employ to better understand their business and their customers and to make relevant changes more quickly.
Enterprises understand that competitive advantage cannot be bought but needs to be developed. One of the most common open source technologies out there for building big data analytics systems is Hadoop. Hadoop, at its core, is a distributed storage system that allows users to store very large datasets across a distributed architecture. Usually coupled with Hadoop is MapReduce, which helps process these large datasets across this distributed architecture.
Though based on whitepapers that came out of Google, Hadoop is an open source project, often known as Apache Hadoop, released around the 2005–2006 time frame. Since then, the ecosystem has evolved both for the better and arguably for the worse. Real-time processing requirements are now supported by the Spark and Flink projects. Additionally, abstraction layers such as Hive (for SQL-like querying) and Pig (a higher-level data flow language) have evolved on top of MapReduce, and nonrelational database functionality is supported by the likes of HBase running on top of Hadoop. Organizations tend to build these platforms either on premises or in the cloud in a self-managed fashion. There are, of course, third-party vendors that have developed expertise and tools around Hadoop, and they offer their own versions and add-ons. There are also fully managed versions from cloud providers and third-party vendors that are ready to consume and are offered in a software as a service (SaaS) model.
But this is not a book about Hadoop or general analytics, so this is the extent to which we will talk about these subjects. (If you want to learn more about the Hadoop ecosystem, SimpliLearn Solutions offers a great learning resource.) Instead, we will delve deeper into predictive analytics and the languages, libraries, and services needed to develop predictive functionality within organizations.
Languages and Libraries
We'll start with Python. Python is an object-oriented programming language that has taken the world by storm. Its object-oriented nature makes code easy to traverse and understand, and it is dynamically typed: variables need not be declared with a datatype, because the type is bound automatically from the value assigned to them (a process known as dynamic binding), as shown in the following code:
X = 10
print("Value of X is:", end=" ")
print(X)
print("Type of X is:", end=" ")
print(type(X))
# The variable type binding automatically becomes integer
print("X + 10 =", end=" ")
print(X + 10)
# You can do integer addition on X
print("")

X = "ABC"
# X is now assigned the value "ABC"
print("Value of X is:", end=" ")
print(X)
print("Type of X is:", end=" ")
print(type(X))
# The variable type binding automatically becomes string
print("X + 10 =", end=" ")
print(X + str(10))
# Addition results in string concatenation

Running this produces the following output:

Value of X is: 10
Type of X is: <class 'int'>
X + 10 = 20

Value of X is: ABC
Type of X is: <class 'str'>
X + 10 = ABC10
These attributes make Python ideal for rapid application development, thereby optimizing developer productivity and reducing time to market for new functionality. Python allows you to develop packages and modules so that code can be modular and reusable. Python also boasts more than 100,000 libraries that allow users to do all forms of data processing and data visualization.8 One very famous library that will be a major part of our discussions is TensorFlow.
TensorFlow is an open source library for performing large-scale numerical computation. It was developed by Google and was released to the general public in 2015. On top of the core library, TensorFlow provides the tools and community support that allow development of applications that leverage machine learning. TensorFlow's core is built in C++, and it provides APIs in C++, Python, and JavaScript for developing applications.9 In this book, you will learn how to leverage the TensorFlow libraries in Python to utilize, train, and test ML models. We will work with prediction functions that can push out results to operational applications. We'll also look at further advancements in TensorFlow 2.0 and its Keras model-building and training API.
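To give you an early feel for what this looks like, here is a minimal sketch of a Keras model in TensorFlow trained on a tiny synthetic binary-classification problem:

import numpy as np
import tensorflow as tf

# Tiny synthetic problem: label is 1 when the two features sum to more than 1.
rng = np.random.default_rng(0)
X = rng.random((500, 2)).astype("float32")
y = (X.sum(axis=1) > 1.0).astype("float32")

# A minimal Keras model: one hidden layer, one sigmoid output for the binary label.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

model.fit(X, y, epochs=20, batch_size=32, verbose=0)
print(model.predict(np.array([[0.9, 0.8]], dtype="float32")))  # probability close to 1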
Services
While TensorFlow with Python gives us a strong foundation in machine learning and its use for predictive analytics, it’s worthwhile to also explore AI/ML services from major public cloud providers. For the purposes of this book, we will be working with the services from AWS and GCP.
AWS services
We will start with AWS SageMaker. SageMaker is a fully managed service from AWS geared toward machine learning. It caters to the entire ML life cycle, as it allows you to develop and train ML models and then deploy them for production use. SageMaker supports the Jupyter Notebook platform for easy data exploration and analysis. You can quickly get started with commonly used algorithms or bring your own algorithms for further customization. This service is for those who like to get their hands dirty with ML and want to customize their development according to the needs of their application.
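As a rough sketch of that workflow, the following uses the SageMaker Python SDK; the IAM role ARN, S3 bucket, and training script are hypothetical placeholders, and parameter names can vary between SDK versions:

import sagemaker
from sagemaker.sklearn.estimator import SKLearn

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # hypothetical IAM role

# Wrap your own training script (hypothetical train.py) in a managed scikit-learn estimator.
estimator = SKLearn(
    entry_point="train.py",
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    framework_version="1.2-1",
    sagemaker_session=session,
)

# Train on data already uploaded to S3, then deploy a real-time inference endpoint.
estimator.fit({"train": "s3://my-bucket/training-data/"})  # hypothetical bucket
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
print(predictor.predict([[1.2, 3.4, 5.6]]))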
At the other end of the spectrum are users who want to stand on the shoulders of giants (the likes of Amazon). To that end, Amazon Personalize is a fully managed service that allows developers to build applications that focus on personalizing the user experience. This can be via product recommendations and ranking or via customized marketing specific to user activity and preferences. The platform does not require a lot of ML knowledge, as it does most of the legwork for you. Everything from deploying the needed infrastructure to processing data, selecting the optimal ML models, training, and then eventually deploying the models is handled by the platform. All developers need to interact with is the API and its integration with their business applications. While Amazon is a retail behemoth, the service can be used for other industry verticals such as streaming and online entertainment services.
A similar service that focuses on a different aspect of the business is Amazon Forecast. Amazon Forecast is a fully managed service that allows businesses to use their historical time series data to develop demand forecasts. Think about being able to understand the demand for various products and services in advance and planning inventory and resource allocation accordingly. Similar to Amazon Personalize, Forecast requires little to no knowledge of machine learning and does all the heavy lifting, so businesses need only worry about feeding it historical data and then consuming API-based demand forecasts. In addition to historical data, Forecast can integrate with local weather data to further fine-tune the accuracy of the predictions. While it abstracts the complexity away from developers, it gives them visibility into monitoring metrics so that they can continuously gauge the accuracy of forecasted predictions.
Amazon Lookout for Equipment caters to the realm of predictive maintenance, one of the important use cases of predictive analytics. It follows the same theme as the Personalize and Forecast services, taking the complexity of identifying, building, and training ML models for predictive maintenance away from the users. It uses industrial sensor data as input for training, as well as real-time analytics to identify potential faults and failures in the future. This allows industries to perform predictive maintenance and avoid huge costs incurred due to unforeseen downtime or industrial accidents.
When we work with data in AWS there are a number of underlying services that help glue everything together. Many of these services are beyond the scope of this book. The three that we will cover here are AWS Lambda, AWS Kinesis, and AWS S3.
AWS Lambda allows you to write serverless code for applications. In our case, we will mostly be writing code for data processing. AWS Kinesis allows ingestion, processing, and analysis of streaming data in real time. Think about time series IoT data flowing from industrial equipment or heartbeat data from mobile applications. AWS S3 is the object storage service provided by AWS. S3 will be used across data pipelines from ingest to classification and also to store analyzed results. It is a globally distributed and highly available service that offers an economical solution for storage and processing of large datasets.
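As a small illustration of how these three services fit together, here is a sketch of a Lambda handler that decodes records arriving from a Kinesis stream and lands them in S3 for later analysis; the bucket name is hypothetical:

import base64
import json

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Each Kinesis record arrives base64-encoded inside the Lambda event payload.
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        reading = json.loads(payload)
        # Persist the raw reading to S3 so that it can be analyzed in batch later.
        s3.put_object(
            Bucket="my-analytics-bucket",  # hypothetical bucket name
            Key=f"raw/{record['kinesis']['sequenceNumber']}.json",
            Body=json.dumps(reading),
        )
    return {"processed": len(event["Records"])}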
GCP services
As analytics becomes more mainstream, it is imperative for organizations to have a multicloud strategy. Like their operational applications, analytics platforms need to avoid vendor lock-in. Hence, we will explore the services from GCP that can assist us in our predictive analytics journey. While there seems to be a bit of overlap among the GCP services, we will explore how they augment each other to help enterprises build effective predictive analytics platforms.
Vertex AI is the first of the services that we will explore. It is a fully managed ML platform that allows for building, training, and deploying ML models easily. Similar to what we saw with the Amazon services, the Vertex AI platform requires little to no ML knowledge. It acts as a central repository for all your ML models while providing strong integration with vision, video, and natural language processing (NLP) APIs for cognitive analytics integration. It also provides integration with open source AI/ML platforms, such as Spark, and with GCP’s data services, such as BigQuery and Dataproc. Furthermore, it integrates with common ML frameworks such as TensorFlow, scikit-learn, and PyTorch, and it works across the entire life cycle of creating, training, tuning, deploying, and monitoring ML models.
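Here is a rough sketch of that life cycle using the google-cloud-aiplatform client library; the project, bucket path, and column names are hypothetical, and exact parameters may differ between library versions:

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # hypothetical project

# Register a tabular dataset stored in Cloud Storage (hypothetical path and columns).
dataset = aiplatform.TabularDataset.create(
    display_name="historical-sales",
    gcs_source="gs://my-bucket/sales.csv",
)

# Let Vertex AI's AutoML pick and tune a model for a regression target.
job = aiplatform.AutoMLTabularTrainingJob(
    display_name="sales-forecast",
    optimization_prediction_type="regression",
)
model = job.run(dataset=dataset, target_column="units_sold")

# Deploy the trained model behind an endpoint for online predictions.
endpoint = model.deploy(machine_type="n1-standard-4")
print(endpoint.predict(instances=[{"marketing_spend": "30.0", "avg_price": "9.99"}]))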
AutoML is another service from GCP geared toward the use of ML models for analytics. As its services overlap somewhat with the services provided by Vertex AI, we will explore the use of AutoML for automatically building and deploying ML models based on structured data. AutoML also provides other services based on image, video, and text analysis that can act as data sources along with predictions from the ML models for decision making via prescriptive analytics.
As the name suggests, Timeseries Insights API is a fully managed service from GCP that allows you to process large amounts of time series data in real time for anomaly detection and trend forecasting. This service can be used in applications such as financial fraud analysis (e.g., detection of fraudulent credit card transactions) and industrial predictive maintenance. It is well integrated with Google Cloud Storage to allow for seamless storage and processing of company data.
BigQuery is the final GCP service that we will explore. It is Google’s data warehouse as a service with built-in machine learning and business intelligence capabilities. I will cover predictive modeling using BigQuery ML in detail in later chapters. Other interesting features of BigQuery include multicloud data analysis for data sources in AWS and Azure, and analytics on geospatial data.
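As a preview of that capability, here is a sketch of training and querying a BigQuery ML model from Python; the project, dataset, and column names are hypothetical:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

# Train a simple regression model directly inside the data warehouse.
client.query("""
    CREATE OR REPLACE MODEL `my_dataset.sales_forecast`
    OPTIONS (model_type = 'linear_reg', input_label_cols = ['units_sold']) AS
    SELECT marketing_spend, avg_price, units_sold
    FROM `my_dataset.historical_sales`
""").result()

# Use the trained model for predictions with ML.PREDICT.
rows = client.query("""
    SELECT *
    FROM ML.PREDICT(MODEL `my_dataset.sales_forecast`,
                    (SELECT 30.0 AS marketing_spend, 9.99 AS avg_price))
""").result()
for row in rows:
    print(dict(row))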
Conclusion
There are a large number of open source ML frameworks and other analytics services available from cloud providers and third-party vendors. It would take more than an encyclopedia to cover everything. That is not the approach I am taking with this book. My goal is for you to look at the prevalent frameworks and services as a representative subset and understand in detail how they can be used to practically build applications that operationalize predictive analytics. This will allow you to learn the finer details of predictive analytics and will give you a baseline that you can then port to other frameworks and platforms.
1 Kenneth D. Foote, “A Brief History of Analytics”, Dataversity, accessed September 20, 2021.
2 Capital expenditures are large expenses that organizations undertake, usually for multiple years.
3 Operational expenditures are smaller operational expenses that are usually related to the daily requirements of running a business.
4 Catherine Cote, “What Is Predictive Analytics? 5 Examples”, Harvard Business School Online, accessed October 26, 2021.
5 Dursun Delen, Prescriptive Analytics: The Final Frontier for Evidence-Based Management and Optimal Decision Making (Pearson FT Press: Upper Saddle River, NJ, 2019).
6 “What Is Machine Learning (ML)?”, Berkeley School of Information, last updated February 2022.
7 “What Is Machine Learning (ML)?” Berkeley School of Information.
8 Mehedi Hasan, “The 30 Best Python Libraries and Packages for Beginners”, ubuntupit.com, last updated November 12, 2022.
9 “TensorFlow”, Wikipedia, accessed February 22, 2024.