
Approaches for Analytics and Data Science

By John Garrett.
Sample Chapter is provided courtesy of Cisco Press.
Date: Oct 28, 2018.

Chapter Information

Model Building and Model Deployment
Analytics Methodology and Approach
Logical Models for Data Science and Data
Summary

Chapter Description

In this sample chapter from Data Analytics for IT Networks: Developing Innovative Use Cases, you will explore model building and deployment, analytics methodology and approach, the distinction between the use case and the solution, and logical models for data science and data.

Logical Models for Data Science and Data

This section discusses analytics solutions that you model and build for the purpose of deployment to your environment. When I was working with Cisco customers in the early days of analytics, it became clear that setting up the entire data and data science pipeline as a working application on a production network was a bit confusing to many customers, as well as to traditional Cisco engineers.

Many customers thought that they could simply buy network analytics software and install it onto the network as they would any other application—and they would have fully insightful analytics. This, of course, is not the case. Analytics packages integrate into the very same networks for which you build models to run. We can use this situation to introduce the concept of an overlay, which is a very important concept for understanding network data (covered in Chapter 3, “Understanding Networking Data Sources”). Analytics packages installed on computers that sit on networks can build the models as discussed earlier, but when it is time to deploy the models that include data feeds from network environments, the analytics packages often have tendrils that reach deep into the network and IT systems. Further, these solutions can interface with business and customer data systems that exist elsewhere in the network. Designing such a system can be daunting because most applications on a network do not interact with the underlying hardware. A second important term you should understand is the underlay.

Analytics as an Overlay

So how do data and analytics applications fit within network architectures? In this context, you need to know the systems and software that consume the data, and you need to use data science to provide solutions as general applications. If you are using some data science packages or platforms today, then this idea should be familiar to you. These applications take data from the infrastructure (perhaps through a central data store) and combine it with other applications data from systems that reside within the IT infrastructure.

This means the solution is analyzing the very same infrastructure in which it resides, along with a whole host of other applications. In networking, an overlay is a solution that is abstracted from the underlying physical infrastructure in some way. Networking purists may not use the term overlay for applications, but it is used here because it is an important distinction needed to set up the data discussion in the next chapter. Your model, when implemented in production on a live network, is just an overlay instance of an application, much like other overlay application instances riding on the same network.

This concept of network layers and overlay/underlay is why networking is often blamed for fault or outage—because the network underlays all applications (and other network instances, as discussed in the next chapter). Most applications, if looked at from an application-centric view, are simply overlays onto the underlying network infrastructure. New networking solutions such as Cisco Application Centric Infrastructure (ACI) and common software-defined wide area networks (SD-WANs) such as Cisco iWAN+Viptela take overlay networking to a completely new level by adding additional layers of policy and network segmentation. In case you have not yet surmised, you probably should have a rock-solid underlay network if you want to run all these overlay applications, virtual private networks (VPNs), and analytics solutions on it.

Let’s look at an example here to explain overlays. Consider your very own driving patterns (or walking patterns, if you are urban) and the roads or infrastructure that you use to get around. You are one overlay on the world around you. Your neighbor traveling is another overlay. Perhaps your overlay is “going to work,” and your neighbor’s overlay for the day is “going shopping.” You are both using the same infrastructure but doing your own things, based on your interactions with the underlay (walkways, roads, bridges, home, offices, stores, and anything else that you interact with). Each of us is an individual “instance” using the underlay, much as applications are instances on networks. There could be hundreds or even thousands of these applications—or millions of people using the roadway system. The underlay itself has lots of possible “layers,” such as the physical roads and intersections and the controls such as signs and lights. Unseen to you, and therefore “virtual,” is probably some satellite layer where GPS is making decisions about how another application overlay (a delivery truck) should be using the underlay (roads).

This concept of overlays and layers, both physical and virtual, for applications as well as networks, was a big epiphany for me when I finally got it. The very networks themselves have layers and planes of operations. I recall it just clicking one day that the packets (routing protocol packets) that were being used to “set up” packet forwarding for a path in my network were using the same infrastructure that they were actually setting up. That is like me controlling the stoplights and walk signs as I go to work, while I am trying to get there. We’ll talk more about this “control plane” later. For now, let’s focus on what is involved with an analytics infrastructure overlay model.

By now, I hope that I have convinced you that this concept of some virtual overlay of functionality on a physical set of gear is very common in networking today. Let’s now look at an analytics infrastructure overlay diagram to illustrate that the data and data science come together to form the use cases of always-on models running in your IT environment. Note in Figure 2-5 how other data, such as customer, business, or operations data, is exported from other application overlays and imported into yours.

Figure 2-5 Analytics Solution Overlay

In today’s digital environment, consider that all the data you need for analysis is produced by some system that is reachable through a network. Since everyone is connected, this is the very same network where you will use some system to collect and store this data. You will most likely deploy your favorite data science tools on this network as well. Your role as the analytics expert here is to make sure you identify how this is set up, such that you successfully set up the data sources that you need to build your analytics use case. You must ensure these data sources are available to the proper layer—your layer—of the network.

The concept of customer, business, and operations data may be new, so let’s get right to the key value. If you used analytics in your customer space, you know who your valuable customers are (and, conversely, which customers are more costly than they are worth). This adds context to findings from the network, as does the business context (which network components have the greatest impact) and operations (where you are spending excessive time and money in the network). Bringing all these data together allows you to develop use cases with relevant context that will be noticed by business sponsors and stakeholders at higher levels in your company.

As mentioned earlier in this chapter, you can build a model with batches of data, but deploying an active model into your environment requires planning and setup of the data sources needed to “feed” your model as it runs every day in the environment. This may also include context data from other customer or business applications in the network environment. Once you have built a model and wish to operationalize it, making sure that everything properly feeds into your data pipelines is crucial—including the customer, business, operations, and other applications data.

Analytics Infrastructure Model

This section moves away from the overlays and network data to focus entirely on building an analytics solution. (We revisit the concepts of layers and overlays in the next chapter, when we dive deeper into the data sources in the networking domain.) In the case of IT networking, there are many types of deep technical data sources coming up from the environment, and you may need to combine them with data coming from business or operations systems in a common environment in order to provide relevance to the business. You use this data in the data science space with maturity levels of usage, as discussed in Chapter 1. So how can you think about data that is just “out there in the ether” in such a way that you can get to actual analytics use cases? All this is data that you define or create. This is just one component of a model that looks at the required data and components of the analytics use cases.

Figure 2-6 is a simple model for thinking about the flow of data for building deployable, operationalized models that provide analytics solutions. We can call this a simple model for analytics infrastructure, and, as shown in the figure, we can contrast this model with a problem-centric approach used by a traditional business analyst.

Figure 2-6 Traditional Analyst Thinking Versus Analytics Infrastructure Model

No, analytics infrastructure is not artificial intelligence. Due to the focus on the lower levels of infrastructure data for analytics usage, this analytics infrastructure name fits best. The goal is to identify how to build analytics solutions much the same way you have built LAN, WAN, wireless, and data center network infrastructures for years. Assembling a full architecture to extract value from data to solve a business problem is an infrastructure in itself. This is very much like an end-to-end application design or an end-to-end networking design, but with a focus on analytics solutions only.

The analytics infrastructure model used in IT networking differs from traditional analyst thinking in that it involves always looking to build repeatable, reusable, flexible solutions and not just find a data requirement for a single problem. This means that once you set up a data source—perhaps from routers, switches, databases, third-party systems, network collectors, or network management systems—you want to use that data source for multiple applications. You may want to replicate that data pipeline across other components and devices so others in the company can use it. This is the “build once, use many” paradigm that is common in Cisco Services and in Cisco products. Solutions built on standard interfaces are connected together to form new solutions. These solutions are reused as many times as needed. Analytics infrastructure model components can be used as many times as needed.

It is important to use standards-based data acquisition technologies and perhaps secure the transport and access around the central data cleansing, sharing, and storage of any networking data. This further ensures the reusability of your work for other solutions. Many such standard data acquisition techniques for the network layer are discussed in Chapter 4, “Accessing Data from Network Components.”

At the far right of the model in Figure 2-6, you want to use any data science tool or package you can to access and analyze your data to create new use cases. Perhaps one package builds a model that is implemented in code, and another package produces the data visualization to show what is happening. The components in the various parts of the model are pluggable so that parts (for example, a transport or a database) could be swapped out with suitable replacements. The role and functionality of a component, not the vendor or type, is what is important.
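
As a rough illustration of what "pluggable" can mean in practice, the sketch below defines each part of the model by its role and lets any component that fills the role be swapped in. All names here (`Transport`, `Store`, `ListTransport`, `DictStore`, `run_pipeline`) are hypothetical and chosen only for this example; they do not come from any product or library.

```python
from typing import Iterable, Protocol


class Transport(Protocol):
    """Role: hand records from a data source to the pipeline."""

    def receive(self) -> Iterable[dict]: ...


class Store(Protocol):
    """Role: keep records so analytics tools can read them later."""

    def save(self, record: dict) -> None: ...


class ListTransport:
    """Stand-in transport that replays records held in memory."""

    def __init__(self, records: list[dict]) -> None:
        self._records = records

    def receive(self) -> Iterable[dict]:
        return iter(self._records)


class DictStore:
    """Stand-in store that groups records by device name."""

    def __init__(self) -> None:
        self.by_device: dict[str, list[dict]] = {}

    def save(self, record: dict) -> None:
        self.by_device.setdefault(record["device"], []).append(record)


def run_pipeline(transport: Transport, store: Store) -> int:
    """Move every record from the transport into the store; return the count."""
    count = 0
    for record in transport.receive():
        store.save(record)
        count += 1
    return count


# Any object with the right methods can be plugged in for either role.
store = DictStore()
moved = run_pipeline(ListTransport([{"device": "rtr1", "mem_used": 512}]), store)
```

Because `run_pipeline` depends only on the two roles, replacing the in-memory stand-ins with a real transport or database would not change the pipeline logic.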

Finally, you want to be able to work this in an Agile manner and not depend on the top-down Waterfall methods used in traditional solution design. You can work in parallel in any sections of this analytics infrastructure model to help build out the components you need to enable in order to operationalize any analytics model onto any network infrastructure. When you have a team with different areas of expertise along the analytics infrastructure model components, the process is accelerated.

Later in the book, this model is referenced as an aid to solution building. The analytics infrastructure model is very much a generalized model, but it is open, flexible, and usable across many different job roles, both technical and nontechnical, and allows for discussion across silos of people with whom you need to interface. All components are equally important and should be used to aid in the design of analytics solutions.

The analytics infrastructure model (shown enlarged in Figure 2-7) also differs from many traditional development models in that it segments functions by job roles, which allows for the aforementioned Agile parallel development work. Each of these job roles may still use specialized models within its own functions. For example, a data scientist might use a preferred methodology and analytics tools to explore the data that you provided in the data storage location. As a networking professional, defining and creating data (far left) in your domain of expertise is where you play, and it is just as important as the setup of the big data infrastructure (center of the model) or the analysis of the data using specialized tools and algorithms (far right).

Figure 2-7 Analytics Infrastructure Model for Developing Analytics Solutions

Here is a simple elevator pitch for the analytics infrastructure model: “Data is defined, created, or produced in some system from which it is moved into a place where it is stored, shared, or streamed to interested users and data science consumers. Domain-specific solutions using data science tools, techniques, and methodologies provide the analysis and use cases from this data. A fully realized solution crosses all of the data, data storage, and data science components to deliver a use case that is relevant to the business.”

As mentioned in Chapter 1, this book spends little time on “the engine,” which is the center of this model, identified as the big data layer shown in Figure 2-8. When I refer to anything in this engine space, I call out the function, such as “store the data in a database” or “stream the data from the Kafka bus.” Due to the number of open source and commercial components and options in this space, there is an almost infinite combination of options and instructions readily available to build the capabilities that you need.

Figure 2-8 Roles and the Analytics Infrastructure Model

It is not important that you understand how “the engine” in this car works; rather, it is important to ensure that you can use it to drive toward analytics solutions. Whether using open source big data infrastructure or packages from vendors in this space, you can readily find instructions on the Internet for transporting, storing, sharing, streaming, and providing access to the data. Run a web search on “data engineering pipelines” and “big data architecture,” and you will find a vast array of information and literature in the data engineering space.

The book aims to help you understand the job roles around the common big data infrastructure, along with data, data science, and use cases. The following are some of the key roles you need to understand:

Data domain experts—These experts are familiar with the data and data sources.
Analytics or business domain experts—These experts are familiar with the problems that need to be solved (or questions that need to be answered).
Data scientists—These experts have knowledge of the tools and techniques available to find the answers or insights desired by the business or technical experts in the company.

The analytics infrastructure model is location agnostic, which is why you see callouts for data transport and data access. This overall model approach applies regardless of technology or location. Analytics systems can be on-premises, in the cloud, or hybrid solutions, as long as all the parts are available for use. Regardless of where the analytics is used, the networking team is usually involved in ensuring that the data is in the right place for the analysis. Recall from the overlay discussion earlier in the chapter that the underlay is necessary for the overlay to work. Parts of this analysis may exist in the cloud, other parts on your laptop, and other parts on captive customer relationship management (CRM) systems on your corporate networks. You can use the analytics infrastructure model to diagram a solution flow that results in a fully realized analytics use case.

Depending on your primary role, you may be involved in gathering the data, moving the data, storing the data, sharing the data, streaming the data, archiving the data, or providing the analysis. You may be ready to build the entire use case. There are many perspectives when discussing analytics solutions. Sometimes you will wear multiple hats. Sometimes you will work with many people; sometimes you will work alone if you have learned to fill all the required roles. If you decide to work alone, make sure you have access to resources or expertise to validate findings in areas that are new to you. You don’t want to spend a significant amount of time uncovering something that is already general knowledge and therefore not very useful to your stakeholders.

Building your components using the analytics infrastructure model ensures that you have reusable assets in each of the major parts of the model. Sometimes you will spend many hours, days, or weeks developing an analysis, only to find that there are no interesting insights. This is common in data science work. By using the analytics infrastructure model, you can maintain some parts of your work to build other solutions in the future.

The Analytics Infrastructure Model In Depth

So what are the “reusable and repeatable components” touted in the analytics infrastructure model? This section digs into the details of what needs to happen in each part of the model. Let’s start by digging into the lower-left data component of the model, looking at the data that is commonly available in an IT environment. Data pipelines are big business and well covered in the “for fee” and free literature.

Building analytics models usually involves getting and modeling some data from the infrastructure, which includes spending a lot of time on research, data munging, data wrangling, data cleansing, ETL (Extract, Transform, Load), and other tasks. The true power of what you build is realized when you deploy your model into an environment and turn it on. As the analytics infrastructure model indicates, this involves acquiring useful data and transporting it into an accessible place. What are some examples of the data that you may need to acquire? Expanding on the data and transport sections of the model in Figure 2-9, you will find many familiar terms related to the combination of networking and data.

Figure 2-9 Analytics Infrastructure Model Data and Transport Examples

Implementing a model involves setting up a full pipeline of new data (or reusing a part of a previous pipeline) to run through your newly modeled use cases, and this involves “turning on” the right data and transporting it to where you need it to be. Sometimes this is kept local (as in the case of many Internet of Things [IoT] solutions), and sometimes data needs to be transported. This is all part of setting up the full data pipeline. If you need to examine data in flight for some real-time analysis, you may need to have full data streaming capabilities built from the data source to the place where the analysis happens.

Do not let the number of words in Figure 2-9 scare you; not all of these things are used. This diagram simply shares some possibilities and is in no way a complete set of everything that could be at each layer.

To illustrate how this model works, let’s return to the earlier example of the router problem. If latency and sometimes router crashes are associated with a memory leak in some software versions of a network router, you can use a telemetry data source to access memory statistics in a router. Telemetry data, covered in Chapter 4, is a push model whereby network devices send periodic or triggered updates to a specified location in the analytics solution overlay. Telemetry is like a hospital heart monitor that gets constant updates from probes on a patient. Getting router memory–related telemetry data to the analytics layer involves using the components identified in white in Figure 2-10—for just a single stream. By setting this up for use, you create a reusable data pipeline with telemetry-supplied data. A new instance of this full pipeline must be set up for each device in the network that you want to analyze for this problem. The hard part—the “feature engineering” of building a pipeline—needs to happen only once. You can easily replicate and reuse that pipeline, as you now have your memory “heart rate monitor” set up for all devices that support telemetry. The left side of Figure 2-10 shows many ways data can originate, including methods and local data manipulations, and the arrow on the right side of the figure shows potential transport methods. There are many types of data sources and access methods.

Figure 2-10 Analytics Infrastructure Model Telemetry Data Example

In this example, you are taking in telemetry data at the data layer, and you may also do some local processing of the data and store it in a localized database. In order to send the memory data upstream, you may standardize it to a megabyte or gigabyte number, standardize it to a “z” value, or perform some other transformation. This design work must happen once for each source. Does this data transformation and standardization stuff sound tedious to you? Consider that in 1999, NASA lost a $125 million Mars orbiter due to a mismatch of metric to English units in the software. Standardization, transformation, and data design are important.
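
The two standardizations mentioned above, converting to a megabyte number and computing a “z” value, could look something like the following minimal sketch. The sample byte counts are made up, and treating a megabyte as 1024 × 1024 bytes is an assumption; whichever definition you pick, the point is to apply it consistently at every source.

```python
from statistics import mean, pstdev

BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes


def to_megabytes(byte_count: int) -> float:
    """Convert a raw byte counter into megabytes."""
    return byte_count / BYTES_PER_MB


def z_scores(values: list[float]) -> list[float]:
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All samples identical: no spread, so every z value is 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


raw_bytes = [104857600, 209715200, 314572800]  # made-up memory samples
used_mb = [to_megabytes(b) for b in raw_bytes]
standardized = z_scores(used_mb)
```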

Now, assuming that you have the telemetry data you want, how do you send it to a storage location? You need to choose transport options. For this example, say that you choose to send a steady stream to a Kafka publisher/subscriber location by using Google Protocol Buffers (GPB) encoding. There are lots of capabilities, and lots of options, but after a one-time design, learning, and setup process, you can document it and use it over and over again. What happens when you need to check another router for this same memory leak? You call up the specification that you designed here and retrofit it for the new requirement.
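
One way to picture the “design once, reuse many times” specification is as a small immutable record that is copied for each new device. This is only a sketch: the `PipelineSpec` class, its field names, and all values (device names, topic, interval) are hypothetical placeholders, not a real telemetry or Kafka configuration format.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PipelineSpec:
    """Hypothetical description of one telemetry data pipeline."""

    device: str
    source: str       # how the data originates, e.g. "telemetry"
    metric: str       # what is being collected
    encoding: str     # e.g. "gpb"
    transport: str    # e.g. "kafka"
    topic: str        # destination channel on the bus
    interval_s: int   # seconds between periodic updates


# The one-time design work: a documented spec for the first router.
base = PipelineSpec(
    device="router-a",
    source="telemetry",
    metric="memory-used",
    encoding="gpb",
    transport="kafka",
    topic="memory-stats",
    interval_s=30,
)


def for_devices(spec: PipelineSpec, devices: list[str]) -> list[PipelineSpec]:
    """Copy the spec for each additional device, changing only the device name."""
    return [replace(spec, device=name) for name in devices]


specs = for_devices(base, ["router-b", "router-c"])
```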

While data platforms and data movement are not covered in detail in this book, it is important that you have a basic understanding of what is happening inside the engine, all around “the data platform.”

The Analytics Engine

Unless you have a dedicated team to do this, much of this data storage work and setup may fall in your lap during model building. You can find a wealth of instruction for building your own data environments by doing a simple Internet search. Figure 2-11 shows many of the activities related to this layer. Note how the transport and data access relate to the configuration of this centralized engine. You need a destination for your prepared data, and you need to know the central location configuration so you can send it there. On the access side, the central data location will have access methods and security, which you must know or design in order to consume data from this layer.

Figure 2-11 The Analytics Infrastructure Model Data Engine

Once you have defined the data parameters, and you understand where to send the data, you can move the data into the engine for storage, analysis, and streaming. From each individual source perspective, the choice comes down to push or pull mechanisms, as per the component capabilities available to you in your data-producing entities. This may include pull methods using polling protocols such as Simple Network Management Protocol (SNMP) or push methods such as the telemetry used in this example.

This centralized data-engineering environment is where the Hadoop, Spark, or commercial big data platform lives. Such platforms are often set up with receivers for each individual type of data. The pipeline definition for each of these types of data includes the type and configuration of this receiver at the central data environment. Very common within analytics engines today is something called a publisher/subscriber environment, or “pub/sub” bus. Apache Kafka is a very common bus used in these engines today.

A good analogy for the pub/sub bus is broadcast TV channels with a DVR. Data feeds (through analytics infrastructure model transports) are sent to specific channels from data producers, and subscribers (data consumers) can choose to listen to these data feeds and subscribe (using some analytics infrastructure model access method, such as a Kafka consumer) to receive them. In this telemetry example, the telemetry receiver takes interesting data and copies or publishes it to this bus environment. Any package requiring data for doing analytics subscribes to a stream and has it copied to its location for analysis in the case of streaming data. This separation of the data producers and consumers makes for very flexible application development. It also means that your single data feed could be simultaneously used by multiple consumers.
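
To make the “broadcast channels with a DVR” idea concrete, here is a toy in-memory bus. It is a sketch for illustration only and is not the Kafka API; the `MiniBus` class, the topic name, and the sample messages are all assumptions. Producers publish to a named topic, any number of consumers subscribe, and a late subscriber can ask to replay what was already recorded.

```python
from collections import defaultdict
from typing import Callable


class MiniBus:
    """Toy in-memory publish/subscribe bus (not the Kafka API)."""

    def __init__(self) -> None:
        # Retained messages per topic: the "DVR" part of the analogy.
        self._log: dict[str, list[dict]] = defaultdict(list)
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(
        self, topic: str, handler: Callable[[dict], None], replay: bool = False
    ) -> None:
        """Register a consumer; optionally deliver retained messages first."""
        if replay:
            for message in self._log[topic]:
                handler(message)
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        """Record the message and copy it to every current subscriber."""
        self._log[topic].append(message)
        for handler in self._subscribers[topic]:
            handler(message)


bus = MiniBus()
dashboard: list[dict] = []  # consumer 1: live visualization
archive: list[dict] = []    # consumer 2: joins late and replays

bus.subscribe("memory-stats", dashboard.append)
bus.publish("memory-stats", {"device": "rtr1", "used_mb": 410})
bus.subscribe("memory-stats", archive.append, replay=True)
bus.publish("memory-stats", {"device": "rtr1", "used_mb": 415})
```

The producer never needs to know who is listening, which is the separation of producers and consumers described above.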

What else happens here at the central environment? There are receivers for just about any data type. You can both stream into the centralized data environment and out of the centralized environment in real time. While this is happening, processing functions decode the stream, extract interesting data, and put the data into relational databases or raw storage. It is also common to copy items from the data into some type of “object” storage environment for future processing. During the transform process, you may standardize, summarize, normalize, and store data. You transform data to something that is usable and standardized to fit into some existing analytics use case. This centralized environment, often called the “data warehouse” or “data lake,” is accessed through a variety of methods, such as Structured Query Language (SQL), application programming interface (API) calls, Kafka consumers, or even simple file access, just to name a few.

Before the data is stored at the central location, you may need to adjust these data, including doing the following:

Data cleansing to make sure the data matches known types that your storage expects
Data reconciliation, including filling missing data, cleaning up formats, removing duplicates, or bounding values to known ranges
Deriving or generating any new values that you want included in the records
Splitting or combining data into meaningful values for the domain
Standardizing the data ingress or splitting a stream to keep standardized and raw data
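
Several of the adjustments listed above can be sketched in a few lines. The record layout (`device`, `ts`, `used_mb`), the choice to fill a missing value with the last known value for that device, and the 1024 MB capacity are assumptions made only for this example.

```python
def clean_records(records: list[dict], max_mb: float) -> list[dict]:
    """Deduplicate, fill gaps, coerce types, bound values, and derive a percentage."""
    seen: set[tuple] = set()
    last_value: dict[str, float] = {}
    cleaned: list[dict] = []
    for rec in records:
        key = (rec["device"], rec["ts"])
        if key in seen:
            continue  # remove duplicates
        seen.add(key)

        value = rec.get("used_mb")
        if value is None:
            # Fill missing data with the last known value for this device.
            value = last_value.get(rec["device"])
            if value is None:
                continue  # nothing to fill from, so drop the record
        value = float(value)                  # match the type storage expects
        value = min(max(value, 0.0), max_mb)  # bound to the known range
        last_value[rec["device"]] = value

        cleaned.append(
            {
                "device": rec["device"],
                "ts": rec["ts"],
                "used_mb": value,
                "used_pct": round(100 * value / max_mb, 1),  # derived value
            }
        )
    return cleaned


raw = [
    {"device": "rtr1", "ts": 1, "used_mb": "400"},  # string instead of number
    {"device": "rtr1", "ts": 1, "used_mb": "400"},  # duplicate
    {"device": "rtr1", "ts": 2, "used_mb": None},   # missing value
    {"device": "rtr1", "ts": 3, "used_mb": 5000},   # out of range
]
cleaned = clean_records(raw, max_mb=1024.0)
```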

Now let’s return to the memory example: These telemetry data streams (subject: memory leak) from the network infrastructure must now be made available to the analytics tools and data scientists for analysis or application of the models. This availability must happen through the analytics engine part of the analytics infrastructure model. Figure 2-12 shows what types of activities are involved if there is a query or request for this data stream from analytics tools or packages. This query is requesting that a live feed of the stream be passed through the publisher/subscriber bus architecture and a normalized feed of the same stream be copied to a database for batch analysis. This is all set up in the software at the central data location.
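
The request described here, a live feed plus a normalized copy in a database for batch analysis, might be sketched as follows. An in-memory SQLite database and a bounded queue stand in for the real central data store and streaming bus; the table name, field names, and sample values are hypothetical.

```python
import sqlite3
from collections import deque

live_feed: deque = deque(maxlen=1000)  # stand-in for a streaming subscriber
db = sqlite3.connect(":memory:")       # stand-in for the batch database
db.execute("CREATE TABLE memory_stats (device TEXT, ts INTEGER, used_mb REAL)")


def handle(message: dict) -> None:
    """Pass the raw message to the live feed and store a normalized copy."""
    live_feed.append(message)
    db.execute(
        "INSERT INTO memory_stats VALUES (?, ?, ?)",
        (message["device"], message["ts"], message["used_bytes"] / (1024 * 1024)),
    )


# Made-up telemetry samples for one router.
for ts, used in enumerate([419430400, 429916160, 440401920]):
    handle({"device": "rtr1", "ts": ts, "used_bytes": used})

# Batch-style summary query against the normalized copy.
summary = db.execute(
    "SELECT device, COUNT(*), MIN(used_mb), MAX(used_mb) "
    "FROM memory_stats GROUP BY device"
).fetchall()
```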

Figure 2-12 Analytics Infrastructure Model Streaming Data Example

Data Science

Data science is the sexy part of analytics. Data science includes the data mining, statistics, visualization, and modeling activities performed on readily available data. People often forget about the requirements to get the proper data to solve the individual use cases. The focus for most analysts is to start with the business problem first and then determine which type of data is required to solve or provide insights from the particular use case. Do not underestimate the time and effort required to set up the data for these use cases. Research shows that analysts spend 80% or more of their time on acquiring, cleaning, normalizing, transforming, or otherwise manipulating the data. I’ve spent upward of 90% on some problems.

Analysts must spend so much time because analytics algorithms require specific representations or encodings of the data. In some cases, encoding is required because the raw stream appears to be gibberish. You can commonly do the transformations, standardizations, and normalizations of data in the data pipeline, depending on the use case. First you need to figure out the required data manipulations through your model building phases; you will ultimately add them inline to the model deployment phases, as shown in the previous diagrams, such that your data arrives at the data science tools ready to use in the models.

The analytics infrastructure model is valuable from the data science tools perspective because you can assume that the data is ready, and you can focus clearly on the data access and the tools you need to work on that data. Now you do the data science part. As shown in Figure 2-13, the data science part of the model highlights tools, processes, and capabilities that are required to build and deploy models.

Figure 2-13 Analytics Infrastructure Model Analytics Tools and Processes

Going back to the streaming telemetry memory leak example, what should you do here? As highlighted in Figure 2-14, you use a SQL query to an API to set up the storage of the summary data. You also request full stream access to provide data visualization. Data visualization then easily shows both your technical and nontechnical stakeholders the obvious untamed growth of memory on certain platforms, which ultimately provides some “diagnostic analytics.” Insight: This platform, as you have it deployed, leaks memory with the current network conditions. You clearly show this with a data visualization, and now that you have diagnosed it, you can even build a predictive model for catching it before it becomes a problem in your network.
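
The “untamed growth” that a visualization makes obvious can also be checked numerically. The sketch below fits a least-squares line through evenly spaced memory samples and flags a steady upward slope. This is one simple diagnostic chosen for illustration, not the method used in the book, and the sample values and the 5 MB-per-sample threshold are made up.

```python
from statistics import mean


def slope_per_sample(values: list[float]) -> float:
    """Least-squares slope of values against their sample index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(values)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


def looks_like_leak(values: list[float], min_slope: float) -> bool:
    """Flag a series whose memory use grows by at least min_slope per sample."""
    return slope_per_sample(values) >= min_slope


leaking = [400, 410, 421, 429, 440]  # MB, made-up samples
steady = [400, 402, 399, 401, 400]   # MB, made-up samples

leak_flag = looks_like_leak(leaking, min_slope=5.0)
steady_flag = looks_like_leak(steady, min_slope=5.0)
```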

Figure 2-14 Analytics Infrastructure Model Streaming Analytics Example

Analytics Use Cases

The final section of the analytics infrastructure model is the use cases built on all this work that you performed: the “analytics solution.” Figure 2-15 shows some examples of generalized use cases that are supported with this example. You can build a predictive application for your memory case and use survival analysis techniques to determine which routers will hit this memory leak in the future. You can also use your analytics for decision support to management in order to prioritize activities required to correct the memory issue. Survival analysis here is an example of how to use common industry intuition to develop use cases for your own space. Survival analysis is about recognizing that something will not survive, such as a part in an industrial machine. You can use the very same techniques to recognize that a router will not survive a memory leak.
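
As one hedged illustration of survival analysis applied to routers, the sketch below hand-computes a Kaplan-Meier survival curve, a standard estimator in this field. The uptime data is invented: each router has a number of days observed and a flag saying whether it crashed from the leak (`True`) or was still running when observation ended (`False`, a censored observation).

```python
def kaplan_meier(durations: list[int], observed: list[bool]) -> list[tuple[int, float]]:
    """Return (time, survival probability) at each time with at least one event."""
    pairs = sorted(zip(durations, observed))
    at_risk = len(pairs)
    survival = 1.0
    curve: list[tuple[int, float]] = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        events = 0   # routers that crashed at time t
        leaving = 0  # routers leaving the risk set at time t (crashed or censored)
        while i < len(pairs) and pairs[i][0] == t:
            if pairs[i][1]:
                events += 1
            leaving += 1
            i += 1
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Made-up data: days of uptime, and whether the router crashed.
uptime_days = [10, 20, 20, 30, 40]
crashed = [True, True, False, True, False]

curve = kaplan_meier(uptime_days, crashed)
```

Reading the result, the estimated probability that a router survives past a given day drops at each observed crash; routers still running at the end contribute to the at-risk counts without counting as failures.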

Figure 2-15 Analytics Infrastructure Model Analytics Use Cases Example

As you go through the analytics use cases in later chapters, it is up to you and your context bias to determine how far to take each of the use cases. Often simple descriptive analytics or a picture of what is in the environment is enough to provide a solution. Working toward wisdom from the data for predictive, prescriptive, and preemptive analytics solutions is well worth the effort in many cases. The determination of whether it is worth the effort is highly dependent on the capabilities of the systems, people, process, and tools available in your organization (including you).

Figure 2-16 shows where fully automated service assurance is added to the analytics infrastructure model. When you combine the analytics solution with fully automated remediation, you build a full-service assurance layer. Cisco builds full-service assurance layers into many architectures today, in solutions such as Digital Network Architecture (DNA), Application Centric Infrastructure (ACI), Crosswork Network Automation, and more that are coming in the near future. Automation is beyond the scope of this book, but rest assured that your analytics solutions are a valuable source for the automated systems to realize full-service assurance.

Figure 2-16 Analytics Infrastructure Model with Service Assurance Attachment
