Key strategies for managing federated data science environments
Large enterprises often have disparate data science platforms that make collaboration across different companies, regions and business units very difficult. This is frequently the result of a fragmented underlying data lake. However, deploying a single, fully uniform data lake for the entire enterprise can be complex, costly or even prohibited for strategic, technological or legal reasons.
When the data lake remains fragmented, deploying the right strategy for the entire data science program becomes vital to enabling advanced solutions, properly using resources and avoiding duplicated effort. This white paper explores four strategies for an enterprise to structure itself to maximize the effectiveness of data science by facilitating collaboration and integration among separate units.
Well-structured and ready-to-use data, combined with informed and technologically enabled data science talent, is “the new oil.”
Today’s data landscape
To keep up with a fast-paced economy and Silicon Valley-style entrepreneurship, enterprises worldwide are constantly changing. New businesses are formed, grow quickly and then are often acquired by larger entities. Additionally, established enterprises are often stuck in a prolonged state of migration of legacy IT solutions into big data technologies. These are only a few of the possible reasons why large organizations often have a fragmented analytics platform that lacks coherence and ease of use.
“Data is the new oil”
Data holds tremendous value. Almost everyone has heard the statement “Data is the new oil,” highlighting the potential hidden value of the vast amounts of data that organizations are keeping. However, this statement relies on a hidden assumption and, to be complete, should actually be, “Well-structured and ready-to-use data, combined with informed and technologically enabled data science talent, is ‘the new oil’.”
Today, nearly every company is hoarding data, and thanks to recent technology breakthroughs and the wide adoption of cloud computing, storing data has become relatively inexpensive. Yet, due to the complex and disparate structure of data systems, companies often struggle with monetizing the data they own. This is because the value is unlocked by applying data science, which requires clear and structured access to the data. That is difficult to achieve when data lakes are fragmented and siloed.
However, creating a single data lake is sometimes not a practical solution. In large organizations, this one-size-fits-all approach could backfire by creating huge overhead and process bloat. Moreover, enterprises may have strategic and legal reasons why they cannot fully unify a data lake.
To succeed, a measured approach should be taken that embraces the situation and takes multiple solutions into consideration.
The enterprise approach
The most important aspect of managing disparate and fragmented analytics platforms is to begin by understanding the end goal and assessing possible solutions, while avoiding wishful thinking.
DXC Technology has worked with many enterprise customers that have multiple separate analytics platforms that need to be managed independently. This is successfully achieved through DXC tools and frameworks such as its DXC Hybrid Data Management Architecture Services, which enables a structured approach to technology choice based on usage patterns. Also, managed machine learning helps structure data science projects across different departments.
Managed machine learning
An important element for navigating this fractured environment is unifying the platform at the data-science level rather than at a data-lake level. DXC does this through managed machine learning.
Managed machine learning is a platform composed of several approved and tested open source technologies. These are combined with additional DXC components enabling easy, unified and managed deployment of data science tools across different locations. It was originally developed for the award-winning DXC Robotic Drive, used by BMW to support autonomous driving development and now being applied in other industries. This lightweight system ensures a degree of uniformity across deployments, yet retains flexibility and openness that enable significant local customization. The platform runs in the cloud and consists of tools for model training, experiment tracking, model packaging, storage, containerization and deployment, as well as for a model pipeline setup.
Four strategies and scenarios
Enterprises essentially have four strategies to facilitate integration in data science development across the various departments or companies within them, depending on the specifics of the case:
- Best practices and know-how sharing
- Internal service provider
- Internal algorithm provider
- Federated learning
1. Best practices and know-how sharing scenario
Highlights:
- In this scenario, data science units cannot directly share data, algorithms or predictions due to insufficient overlap or an explicit need to keep them separate.
- A large part of the effort in deploying data science is in the application of business know-how rather than code. Know-how means tried and tested ways of data preprocessing, feature engineering, setting up the experiments, model training for the given use cases, best practices for model tuning, etc.
- This approach must provide a structured and normalized system for content sharing and incentivize it throughout the enterprise.
- Technology: The approach uses documentation and code-versioning tools (Jira/Git) and a model management repository.
- Actions: Create documentation on data used, data examples, data preprocessing/feature engineering steps; offer model examples as well as code examples; and provide incentives to employees who participate in the program.
This scenario addresses the case where the data science units in an enterprise cannot directly share data, algorithms or predictions due to insufficient overlap or an explicit need to keep them separate. The basic solution is that they can collaborate by sharing knowledge, know-how and code examples (Figure 1). The elements that are key in any algorithm design are the type and internal structure of the algorithm itself, as well as the data preprocessing and feature engineering. As the bulk of the effort needed to create an algorithm is related to generating the know-how, effective knowledge sharing delivers significant cost savings.
Example: The key to successful prediction of market trends is transforming the data by extracting some features from textual content. Moreover, an optimal model in this case can be a support vector machine that provides the highest prediction accuracy. As this knowledge is critical for success, the creator of this know-how creates detailed documentation of the process along with code examples and shares it with others in the group to reuse. Moreover, in the managed machine learning platform, the model is stored in the model management repository.
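To make this concrete, the following is a minimal sketch of the kind of shareable "know-how" artifact the example describes: a documented text feature-engineering step that other units can reuse verbatim before training their own models (such as the support vector machine mentioned above). The function, feature names and lexicons are purely illustrative, not part of any DXC product.

```python
# Illustrative know-how artifact: a documented, reusable feature-engineering
# step for market-trend text data. Lexicons and feature names are hypothetical.
import re
from collections import Counter

# Hypothetical sentiment lexicons; a real project would document its own.
POSITIVE = {"growth", "gain", "rally", "surge", "record"}
NEGATIVE = {"loss", "drop", "decline", "risk", "crash"}

def extract_features(text: str) -> dict:
    """Turn a raw news snippet into model-ready numeric features.

    Documented transformations like this are the know-how worth sharing:
    downstream units can apply them unchanged before training their own
    models (e.g., an SVM, as in the example above).
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    n = max(len(tokens), 1)
    return {
        "n_tokens": len(tokens),
        "positive_ratio": sum(counts[w] for w in POSITIVE) / n,
        "negative_ratio": sum(counts[w] for w in NEGATIVE) / n,
    }

feats = extract_features("Markets rally to a record as growth beats risk fears.")
```

Sharing the function together with its documentation (and storing the trained model in the model management repository) lets other units skip the costly trial-and-error phase.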
Figure 1. Recommended structure for the enterprise for the know-how sharing strategy. All regions contribute to a common knowledge base.
In this scenario, one unit contributes to creating an algorithm-based product, acting as a central service provider, and the desired result is the prediction.
2. Internal service provider scenario
Highlights:
- This approach is applicable if a single unit has all the necessary data for training and managing the algorithm.
- The owner becomes the internal service provider (prediction as a service).
- Technology: Representational state transfer (REST) APIs hide an internal data science toolset.
- Actions: Deploy the managed machine learning platform in the unit and provide the predictions via a service.
This scenario addresses the case where only a single unit can contribute to creating an algorithm-based product, and the desired result is the predictions themselves (not the algorithm). Thus, it is recommended that you structure the relationship as if this single unit were a service provider delivering the product to external customers via an API (Figure 2). If the prediction volumes are very large, the results can be delivered in a shared environment instead of via an API.
Example: One unit in the enterprise has a unique market trends dataset that no other part of the group has. This part of the company trains a machine learning algorithm on its data and uses its internal data to perform predictions. Other groups gain access to its predictions via a secure REST API.
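A toy sketch of the "prediction as a service" pattern follows. The model and request format are hypothetical; in a real deployment the JSON-in/JSON-out handler would sit behind a secured REST framework and load a trained model from the model management repository rather than use the placeholder rule shown here.

```python
# Minimal sketch of prediction-as-a-service: a JSON handler shaped the way a
# REST endpoint would expose it. Model logic and payload schema are illustrative.
import json

def predict_trend(features: dict) -> str:
    # Placeholder for the unit's trained model; a real service would load a
    # serialized model from the managed machine learning model repository.
    pos = features.get("positive_ratio", 0.0)
    neg = features.get("negative_ratio", 0.0)
    return "up" if pos > neg else "down"

def handle_request(body: str) -> str:
    """JSON-in/JSON-out handler, as a secured REST API would wrap it."""
    payload = json.loads(body)
    predictions = [predict_trend(f) for f in payload["instances"]]
    return json.dumps({"predictions": predictions})

response = handle_request(json.dumps({
    "instances": [{"positive_ratio": 0.3, "negative_ratio": 0.1},
                  {"positive_ratio": 0.0, "negative_ratio": 0.2}]
}))
```

Keeping the interface to plain JSON means consuming units never need access to the provider's data or toolset, only to the endpoint.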
Figure 2. Recommended structure for the enterprise for internal service provider scenario. A single region becomes a service provider for the rest of the enterprise.
In this scenario, one unit creates an algorithm-based product, but the desired result is distributing the algorithm to other units, which apply it to their own data.
3. Internal algorithm provider scenario
Highlights:
- This approach applies if a single unit has most of the necessary data for training and serving the algorithm, but other units have data that can augment the algorithm and run the predictions locally.
- The owner becomes the internal algorithm provider and provides the trained algorithm as a service, rather than predictions as a service.
- Technology: Managed machine learning enables reuse of the same analytical model across all regions.
- Actions: Deploy managed machine learning to enable the reuse of binary models in all regions and serve the algorithm via a REST service from the central unit to other units.
In this scenario, there is only one unit of the enterprise that can contribute to creating an algorithm-based product, but the desired result is the algorithm itself, which needs to be used inside the other units on their data. In this case, managed machine learning enables seamless creation and serialization of the model that then can be transferred to other entities in the enterprise, thereby enabling reuse of the model (Figure 3).
Example 1: One unit in an enterprise has a market trends dataset that is vastly superior and complete compared to the datasets that other units in the enterprise have. This unit trains a machine learning algorithm on the data, converts the model with its container into a standardized format using managed machine learning, and distributes it across the enterprise. Other companies or regions load the algorithm using managed machine learning and run predictions on their own infrastructure and data.
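The hand-off in Example 1 can be sketched as follows. Here `pickle` stands in for whatever standardized, containerized packaging format the managed machine learning platform actually uses, and the `TrendModel` class is a deliberately trivial stand-in for a trained model.

```python
# Hedged sketch of the algorithm-provider hand-off: the owning unit serializes
# its trained model into a portable artifact; another unit deserializes it and
# predicts on local data. Pickle is an illustrative stand-in for the platform's
# real packaging format.
import pickle

class TrendModel:
    """Stand-in for a trained market-trends model (illustrative only)."""
    def __init__(self, threshold: float):
        self.threshold = threshold  # parameter learned during central training

    def predict(self, score: float) -> str:
        return "up" if score > self.threshold else "down"

# Central unit: train (omitted) and export the model as a portable artifact.
central_model = TrendModel(threshold=0.25)
artifact = pickle.dumps(central_model)  # this is what ships across the enterprise

# Receiving unit: load the artifact and run predictions on its own data.
local_model = pickle.loads(artifact)
local_predictions = [local_model.predict(s) for s in (0.1, 0.4)]
```

Note that only the serialized model travels; the training data never leaves the owning unit.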
A slight variation of this scenario is the transfer learning algorithm provider scenario, in which the unit receiving the algorithm does not use it exactly as delivered. Instead, the received model becomes the basis for a new model created locally; the knowledge embedded in the received model is transferred to the new one. This locally created model, with the central model's knowledge embedded in it, is then used to make predictions, but it is not shared with any other unit.
Example 2: A central system creates a model for demand prediction, trained with access to generalized shopping data. However, a specific retail unit wants to adapt this generalized model to its niche. In that case, it receives the main model from the central location, performs additional local training and then uses the local version of the model for predictions. This local version of the algorithm is not used in other units of the enterprise.
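A toy sketch of this transfer-learning variation is shown below: a unit starts from the central model's parameters and refines them on its own data instead of training from scratch. The model (a single-feature linear regressor fitted by gradient descent) and all the numbers are purely illustrative.

```python
# Illustrative transfer-learning flow: fine-tune central parameters locally.
# The tiny gradient-descent regressor is a stand-in for a real model.
def train(data, w=0.0, b=0.0, lr=0.1, epochs=200):
    """Fit y ~ w*x + b by per-sample gradient descent, starting from (w, b)."""
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

# Central model: trained on generalized demand data (roughly y = 2x).
central_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w0, b0 = train(central_data)

# A retail niche sees systematically higher demand (roughly y = 2x + 1):
# it fine-tunes from the central weights rather than starting at zero.
local_data = [(1.0, 3.0), (2.0, 5.0)]
w1, b1 = train(local_data, w=w0, b=b0, epochs=300)
```

The fine-tuned parameters `(w1, b1)` capture the local shift while inheriting the central model's structure, and they stay inside the local unit.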
Figure 3. Recommended structure for the enterprise for internal algorithm provider scenario. A single region becomes a provider of a machine learning model that can be used for predictions in other regions of the enterprise.
In this scenario, several units have a valuable data set that can contribute to the accuracy of the model training but are unable to have a single environment.
4. Federated learning scenario
Highlights:
- This approach is applicable if all units have data and it is critical to collaborate, yet it is impossible to share the data.
- All units collaborate on algorithm creation by training the model locally, and then sending all the model updates to the central location to be merged and distributed back to all units.
- Technology: Managed machine learning is deployed in all units; a central location houses the model unification module.
- Actions: Deploy managed machine learning in all the units, train the model locally and send the results of training to the central region, which merges the models into one and distributes it back.
This scenario addresses the case where several units in the enterprise have a valuable data set that can contribute to the accuracy of the model training but are unable to have a single environment for a variety of reasons:
- Legal requirements, such as the General Data Protection Regulation (GDPR), prevent sending private information outside Europe
- Practical reasons, such as lack of integration between analytics platforms
- Strategic reasons, such as company policies to manage data independently
In each case, it is still possible to train and use algorithms as an enterprise. The methodology for accomplishing this is called federated learning, which is applicable to healthcare and automotive, among many other industries. In autonomous driving, for example, privacy regulations prevent companies from moving the video data recorded by cars out of the region where it was collected. However, algorithms need to be trained on all the data to make sure the features will work worldwide. For example, a vehicle manufactured in the United States should be able to drive on roads in Europe and elsewhere.
The system works by extending the managed machine learning platform to all units, which train locally and then send only their model updates to the aggregation region, where the models are merged. The combined model is then distributed back to all the regions so they can use it locally to make predictions (Figure 4). This approach means that all contributing units act independently. Since no training data needs to be transferred, distributed data lakes do not have to be uniform or share data. Still, the entire enterprise collaborates in training and enjoys the benefits of a better algorithm in its own use cases.
Example: All units in an enterprise have their own marketing data set and want to collaborate in training of a single superior algorithm to make the predictions. All of the units have managed machine learning deployed and train the algorithm locally. Then the model update package is sent from all the units to a single aggregation region, where the model is merged and distributed back to all units. All units load the algorithm using the managed machine learning application and can run predictions on their own infrastructure, and yet no raw data is ever shared.
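One round of the federated flow described above can be sketched as follows. Each unit trains locally and ships only a weight update; the aggregation region merges the updates by federated averaging (FedAvg) and redistributes the result. Real systems weight the average by local dataset size and add secure aggregation; this toy version, with hypothetical weight vectors, omits both.

```python
# Hedged sketch of one federated learning round (plain FedAvg, unweighted).
# Weight vectors and gradients are illustrative placeholders.
def local_update(weights, local_grad, lr=0.1):
    """One unit's local training step; returns its updated model weights."""
    return [w - lr * g for w, g in zip(weights, local_grad)]

def federated_average(unit_weights):
    """Central merge: element-wise mean of all units' weight vectors."""
    n = len(unit_weights)
    return [sum(ws) / n for ws in zip(*unit_weights)]

global_model = [1.0, 1.0]
# Each unit computes gradients on its private data, which never leaves it.
unit_grads = [[0.2, -0.4], [0.4, 0.0], [0.0, 0.4]]
updates = [local_update(global_model, g) for g in unit_grads]
# Only these model updates travel to the aggregation region.
global_model = federated_average(updates)
```

The merged `global_model` is then pushed back to every unit for local prediction, so the enterprise trains collectively without any raw data crossing unit boundaries.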
Figure 4. Recommended structure for the enterprise for the federated learning scenario. All regions contribute to the model, by doing local training and sending their model updates to the central location for merging. The merged model is then distributed back to all the sub-regions to do the predictions locally.
Conclusion
Data, like oil, needs to be refined and properly applied: just as an engine delivers value only on the correct type of fuel, data adds tremendous value only to organizations that know how to integrate and use it. When applied effectively, the power of machine learning is immense. In healthcare, the successful application of data analytics can do much, from fighting the spread of a novel virus to helping patients survive cancer. In other industries, such as automotive and transportation, machine learning can speed up the development of autonomous vehicles.
Even though the value of information is unquestioned, enterprises must simultaneously address legal, organizational and technical challenges in order to unlock data. The tried-and-tested strategies outlined in this paper enable better monetization of data and include organizational and process blueprints, technical solutions, and ready-made, deployable managed platforms. These strategies and technologies can maximize the success of data science and unlock the true potential of data.
About the author
Josef Habdank is a solution principal and big data architect at DXC Technology. He is a passionate technologist with extensive experience working on large industrialized artificial intelligence (AI) and data lake solutions for the automotive, retail, travel and healthcare industries. He has pioneered cutting-edge open source technologies for customers worldwide and is a recognized contributor and conference speaker on issues related to big data and AI.