SAP and Hadoop, better together
The rising cost of traditional enterprise data warehouse (EDW) solutions, the fast-growing volume of data (much of it unstructured), and the need to store and analyze that data are creating a new IT challenge. It’s one that many organizations are attempting to solve by marrying the EDW with open-source big data solutions.
In some cases, these marriages have created partnerships that are peaceful, productive and mutually beneficial. But far too many have ended in expensive divorce. This paper describes the steps DXC Technology recommends to create data partnerships that are productive and enduring.
Optimizing your environment
Many organizations have selected an EDW solution based on SAP BW/HANA. While it provides a highly structured and organized data warehouse, SAP BW/HANA can be limited in flexibility and expensive to scale. To overcome this, smart organizations undertake an initial assessment to identify potential bottlenecks and suggest candidates for offloading to cheaper, more efficient platforms (Figure 1).
DXC, in cooperation with our partners, supports an assessment process using both manual and semi-automated tools and techniques (Figure 2).
SAP BW/HANA optimization patterns
Once the initial assessment is completed, it’s time to decide which data will be offloaded and how. EDW workload optimization can have different scopes, often depending on the organization’s strategy and degree of cloud readiness. To help organize the process of adopting big data analytics, DXC offers codified optimization patterns (Figure 3). Organizations can use these optimization patterns to either proceed on their own or let DXC guide them on where to begin and how to proceed.
Augmentation
The most popular starting point is the augmentation pattern, illustrated in Figure 4. In this pattern, the existing analytics fabric stays unchanged but is surrounded by new technology that extends its functionality. This is done by:
- Adding new data sources such as semi-structured or unstructured data
- Introducing additional data-transformation approaches, such as stream processing
- Providing new data mining and machine learning algorithms, such as deep learning, as well as other possibilities such as online scoring and training models on the full data set
Augmentation of an existing EDW solution assumes that the analytics platform adds new features; no changes are needed to the data sources, ETL, data warehouse or front end. The DXC Analytics Platform, combined with less costly storage, provides a wide range of modern analytics functionality, including:
- New data sources (e.g., semi-structured)
- New transformation approaches (e.g., stream, schema on read)
- New analytics possibilities (e.g., deep learning, online scoring)
Under the augmentation approach, the fastest way to provide data for users is to connect reporting applications to the analytics platform via JDBC/ODBC drivers. This can be done long before setting up connections between SAP and Hadoop. Users can still work in SAP BusinessObjects (BO), SAP Lumira or other reporting tools, but now they’ll also see external data collected in the analytics platform.
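As a simple illustration of this kind of direct access, the hedged Python sketch below queries a table on the analytics platform through Hive's Thrift interface using the open-source PyHive library, much as a reporting tool would over a JDBC/ODBC connection. The host, port, database and table names are placeholders, not details from any actual DXC deployment.

```python
# Minimal sketch: querying data on the analytics platform directly,
# much as a reporting tool would over a JDBC/ODBC connection.
# Host, port, database and table names are hypothetical placeholders.
from pyhive import hive

conn = hive.Connection(
    host="analytics-platform.example.com",  # hypothetical Hive endpoint
    port=10000,
    username="report_user",
    database="external_data",
)

cursor = conn.cursor()
cursor.execute(
    "SELECT region, SUM(revenue) AS total_revenue "
    "FROM market_research "
    "GROUP BY region"
)
for region, total_revenue in cursor.fetchall():
    print(region, total_revenue)

cursor.close()
conn.close()
```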
It’s also possible as part of this scenario to leverage SAP-to-Hadoop integration options. These include SAP HANA Smart Data Access (SDA), used first to access data collected in the analytics platform through HANA Views and then to provide it to existing reporting tools, and SAP Vora, an in-memory query engine that optimizes data transformation through its Apache Spark integration.
Partial migration
The next-most popular pattern is partial migration. This pattern includes levels of migration that range from the simple offloading of “cold data” (archival data that is rarely or never accessed by users) to the complex migration of extract, transform and load (ETL) processing and the staging layer to the DXC Analytics Platform. The analytics platform then loads data that is ready for reporting or analysis back into the existing EDW.
The partial migration pattern not only provides the same features offered in augmentation, but also gives organizations an opportunity to reduce their analytics costs. It does this by reducing the solution’s size and complexity. Partial migration options include:
- Cold data migration, with an option to load back or access, if needed
- ETL and stage migration, where data ready for reporting goes to SAP, making this a good option for non-SAP data sources
- Heavy area migration, which is best for big tables, resource-consuming aggregation or other transformation; summary data or aggregates may be loaded back to the EDW
- Minimizing SAP BW/HANA, keeping it only for regulatory or standard reporting and using the analytics platform for other activities such as ad-hoc querying, data discovery, machine learning and dashboards
Partial migration — cold data
The first step in this scenario is to assess which part of the EDW is “cold” — that is, containing data that’s either never or only rarely used. This assessment can be done either manually by skilled database administrators or by using automated assessment tools.
After cold data is identified, it can be offloaded to the analytics platform (Figure 5) using SAP near-line storage (NLS) or Apache Sqoop, an open-source tool for transferring bulk data between structured databases and Hadoop. This offloading should be repeated on a regular basis (daily, weekly or monthly), depending on the organization’s needs and resources; a sketch of the offload step follows the list below. Access to the offloaded cold data can be handled in any of several ways:
- Leverage SDA through HANA Views (and Vora to optimize data transformation)
- Use Sqoop to load back the data needed for analysis
- Use direct integration with the analytics platform via JDBC connectors to SAP BO or other reporting tools
- Use the analytics platform’s reporting tools (e.g., Ambari Views, Zeppelin, Spotfire, Jupyter)
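The offload step itself can be sketched as follows. This hedged PySpark example reads rows older than a cutoff date from the EDW over JDBC and appends them to a date-partitioned Hive table on the analytics platform; the HANA endpoint, credentials, table names and cutoff date are illustrative assumptions, and in practice SAP NLS or a scheduled Sqoop job can perform the same movement.

```python
# Illustrative PySpark job for periodic cold-data offloading.
# Connection details, table names and the cutoff date are placeholders;
# SAP NLS or a scheduled Sqoop job can perform the same movement.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cold-data-offload")
         .enableHiveSupport()
         .getOrCreate())

# Read only the "cold" rows (older than the retention cutoff) from the EDW.
cold_rows = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sap://hana-host:30015")          # hypothetical HANA endpoint
    .option("driver", "com.sap.db.jdbc.Driver")
    .option("dbtable", "(SELECT * FROM SALES_HISTORY "
                       "WHERE POSTING_DATE < '2015-01-01') AS cold")
    .option("user", "offload_user")
    .option("password", "***")
    .load()
)

# Append to a date-partitioned Hive table on the analytics platform.
(cold_rows.write
    .mode("append")
    .partitionBy("POSTING_DATE")
    .saveAsTable("archive.sales_history_cold"))
```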
Partial migration — ETL and stage
ETL and stage offloading is usually (though not exclusively) applied when most of the data transformation takes place in the EDW’s staging area. In fact, the transformation workload can consume up to 50 percent of all EDW resources. In this scenario, the analytics platform keeps all stage data in Apache Hive (a Hadoop-based data warehouse system) and then transforms the data into the final schema using the power of Spark. Finally, data ready for reporting can be loaded to the EDW using Sqoop or SAP HANA Smart Data Integration (SDI), as illustrated in Figure 6 and sketched below. This is an especially good approach for non-SAP data sources.
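To make the flow concrete, the hedged PySpark sketch below keeps raw stage data in Hive, transforms it into the reporting schema with Spark, and writes only the reporting-ready result back to the EDW over JDBC; Sqoop export or SAP HANA SDI could be used for that last step instead. The table names, join logic and connection settings are illustrative assumptions.

```python
# Illustrative ETL-and-stage offload: stage data lives in Hive, Spark does
# the heavy transformation, and only reporting-ready data is loaded back
# into the EDW. All table names and connection settings are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("stage-etl-offload")
         .enableHiveSupport()
         .getOrCreate())

# Raw stage data ingested from (possibly non-SAP) source systems into Hive.
orders = spark.table("stage.orders_raw")
customers = spark.table("stage.customers_raw")

# Transform to the final reporting schema using Spark.
report_ready = (
    orders.join(customers, "customer_id")
          .where(F.col("order_status") == "COMPLETED")
          .groupBy("customer_segment", "order_month")
          .agg(F.sum("net_value").alias("net_revenue"),
               F.countDistinct("order_id").alias("order_count"))
)

# Load only the reporting-ready result back to the EDW (Sqoop export or
# SAP HANA SDI could be used instead of the Spark JDBC writer).
(report_ready.write
    .format("jdbc")
    .option("url", "jdbc:sap://hana-host:30015")   # hypothetical HANA endpoint
    .option("driver", "com.sap.db.jdbc.Driver")
    .option("dbtable", "REPORTING.SALES_SUMMARY")
    .option("user", "etl_user")
    .option("password", "***")
    .mode("append")
    .save())
```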
In this type of partial migration, the DXC Analytics Platform can transform essentially all types of data, including:
- Structured data
- Semi-structured data (e.g., logs and clickstreams)
- Unstructured data (e.g., text and pictures)
- Batch and stream data
- Small records arriving at high velocity, as well as big files and tables
Partial migration — heavy area
In some cases, the EDW is used to transform and store detailed data from operational systems that generate huge amounts of data. Examples include telco service usage, bank operations, retail sales, e-commerce clickstreams, online advertising conversions and Internet of Things (IoT) data. In these cases, the analytics platform is the best place to reroute such data, either by using existing ETL tools or by connecting the analytics platform directly to the source (see Figure 7).
Once this is done, the analytics platform efficiently transforms and aggregates the data into a format ready for final reporting. The aggregated (and much smaller) data can be loaded back to the EDW as in the previous scenarios, but it is often better to leave it in place on the platform (a sketch of the aggregation step follows the list below). Access for users can then be provided by:
- Leveraging SDA through HANA Views (also Vora to optimize data transformation)
- Using direct integration to the analytics platform via JDBC connectors to SAP BO or other tools
- Using the analytics platform’s reporting tools (e.g., Ambari Views, Zeppelin, Spotfire)
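As a hedged illustration of the aggregation step described above, the PySpark sketch below reduces raw clickstream detail to a much smaller daily summary that can stay on the platform and be queried through any of the options listed; the table and column names are assumptions for illustration only.

```python
# Illustrative heavy-area aggregation: raw clickstream detail stays on the
# analytics platform; only a compact daily summary is exposed for reporting.
# Table and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("clickstream-aggregation")
         .enableHiveSupport()
         .getOrCreate())

clicks = spark.table("raw.clickstream_events")

daily_summary = (
    clicks.withColumn("event_date", F.to_date("event_timestamp"))
          .groupBy("event_date", "campaign_id")
          .agg(F.count("*").alias("clicks"),
               F.countDistinct("session_id").alias("sessions"),
               F.sum("conversion_flag").alias("conversions"))
)

# Keep the aggregate on the platform; reporting tools can reach it via
# SDA, JDBC connectors or the platform's own tools listed above.
daily_summary.write.mode("overwrite").saveAsTable("mart.clickstream_daily")
```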
Partial migration — minimizing SAP BW/HANA
This approach to optimizing SAP BW/HANA’s size and cost focuses on the business perspective. The big question is: What needs to stay on SAP BW/HANA? These areas might include:
- Regulatory reporting
- Financial reporting
- Internal group or branch reporting
- Predefined strict or structured reporting
Everything else, including advanced analytics, machine learning, process optimization, data wrangling and discovery, ad hoc reporting and dashboards, can be moved to the cheaper and faster DXC Analytics Platform (as shown in Figure 8), which can also leverage Spark-based ETL.
Other less-popular patterns
Although most organizations start with the augmentation pattern and then progress through several variants of partial migration, some organizations need to go further. For them, the best option is the full replacement pattern, in which the traditional analytics solution is ultimately turned off. The migration project still needs to proceed in phases, migrating step by step, but the final architecture must be optimized for full replacement, with the focus shifting to the needs of the non-EDW components, including ETL, business intelligence (BI) and machine learning.
Full replacement is the most challenging pattern from a change-adoption perspective, but it also provides the biggest opportunity for reducing cost and complexity, mainly by removing the need for expensive enterprise software licenses. In addition, full replacement frees the organization to focus on the business outcomes of analytics rather than on their technical complexities and maintenance requirements.
Organizations that opt for full replacement have several options:
- For ETL, it’s possible to keep the existing source, and either reroute its data flows to the analytics platform or connect the analytics platform to every data source and use native ETL tools such as Sqoop, Spark and NiFi.
- Existing front-end reporting tools may either be integrated with the analytics platform via JDBC/ODBC connectors or replaced by analytics platform reporting tools.
- Full replacement programs can be rolled out in phases. In this way, the EDW architecture will temporarily pass through partial migration scenarios for improved safety and reliability.
Organizations may also want to consider an EDW integration pattern following a major merger or corporate reorganization. In these situations, the organization may find itself with two or more data warehouse, business intelligence or analytics solutions. The EDW integration pattern hides the complexity of these multiple systems behind a modern access layer that uses approaches such as schema on read and document databases. As a result, the end user sees a single system view, even if the underlying data is not integrated.
This pattern provides a quick improvement and gives the organization additional time to analyze and develop the final architecture, for example an eventual partial migration or full replacement. However, because it also adds an extra layer and more complexity to the architecture, it should be treated as a temporary solution only.
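As a hedged illustration of how a schema-on-read layer can present one logical view over systems that have not been physically integrated, the PySpark sketch below exposes a structured table from one warehouse and a semi-structured JSON export from another as a single unified view; the endpoints, tables and columns are illustrative assumptions.

```python
# Illustrative schema-on-read integration: two separate analytics systems
# are exposed to end users as one logical view, without physically merging
# the data. Sources, columns and connection settings are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("edw-integration-view").getOrCreate()

# Warehouse A: structured data read over JDBC from the first EDW.
edw_a = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sap://hana-a-host:30015")   # hypothetical endpoint
    .option("driver", "com.sap.db.jdbc.Driver")
    .option("dbtable", "SALES.REVENUE_BY_REGION")
    .option("user", "integration_user")
    .option("password", "***")
    .load()
    .select("REGION", "REVENUE")
)

# Warehouse B: semi-structured export read with schema on read (no upfront
# schema definition; Spark infers it from the JSON files).
edw_b = (
    spark.read.json("/landing/company_b/revenue/*.json")
    .select(F.col("region").alias("REGION"),
            F.col("revenue_eur").alias("REVENUE"))
)

# One unified, read-only view for reporting users.
edw_a.unionByName(edw_b).createOrReplaceTempView("group_revenue_by_region")
```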
Migration patterns as a step from augmentation to full replacement
Choosing a migration pattern is not a “one size fits all” decision; no single solution will prove sufficient for every organization (see Figure 9). That said, one recommended best practice is to start small while creating enough agility to grow quickly in many directions. The larger goal is to use the right methodology and approach for the unique needs of the business.
Customer examples
A global consumer packaged-goods supplier recently implemented the augmentation pattern with help from DXC. The company’s end users received quick and easy access to external data located on the DXC Analytics Platform, including information related to macroeconomics, market research, brand health, surveys, media and shoppers. They also received large amounts of operational data. All of this data was analyzed with SAP Lumira data visualization and alternative data-discovery tools.
As a result of its augmentation implementation, the consumer goods company:
- Derived business value from the ability to analyze across different functional data sets
- Gained new analytical insights by analyzing ad-hoc and operationalized data sets
- Extended the retention period of large-scale data sets in an economically viable way
- Used a platform that supports the modification of the data schema in a way that doesn’t require the repopulation of the database or the reorganization of data at the data-lake storage layer
Similarly, a major energy company, also with help from DXC, recently completed the partial migration pattern. As part of the project, DXC offloaded cold data from the company’s SAP HANA implementation to a cloud-based solution. As a result, the energy company:
- Reduced its costs and improved its CAPEX-to-OPEX ratio
- Optimized the cost of cloud services through the selection of appropriate services and instances (such as auto-scaling for compute)
- Optimized data models for better processing and consumption
- Improved processing on the platform using Spark SQL and HQL
- Achieved an iterative deployment to the production database
- Improved ongoing performance tuning in the production database
Another company, this one in the telecommunications industry, turned to DXC for help with several EDW-related challenges. The company sought to lower its operational costs for SAP HANA and non-SAP business applications, deploy advanced analytics on its structured and unstructured data, improve information sharing with internal and external stakeholders, and improve access to non-SAP data.
Using the DXC Analytics Platform, the telecom company cut its operational costs by 20 percent from baseline. It also gained scalability and agility in the EDW analytics area, enabling the company to set up predictive analytics on both hot and cold data. In addition, the company has democratized access to non-SAP data by storing it on a new landing zone.
Finally, a manufacturing company worked with DXC and our partner Datavard to achieve meaningful reductions in both the resources and costs for SAP HANA. The assessment step yielded a 20 percent reduction in required memory (in the InfoCubes area) and a 50 percent reduction in the resources needed in the DataStore objects (DSO) layer. As a result of our refinements, the company was also able to safely offload 3 terabytes of data to NLS, further lowering its operational costs.
DXC, your trusted partner
DXC specializes in developing workload-specific transformation and migration strategies aligned with your business objectives. We offer a broad set of migration capabilities to support a diverse array of technologies, geographies, regulatory requirements, operating models and target environments. Because the DXC Analytics Platform is infrastructure-agnostic, it is easy and efficient to follow the most suitable migration path for your enterprise data warehouse.
DXC has experience in managing enterprise hybrid environments, balanced knowledge of traditional and next-generation infrastructure, and a comprehensive services portfolio. All this enables us to successfully deliver numerous migration projects to all types of cloud platforms and on-premises solutions.
Visit DXC Analytics and AI Platform and contact us to learn more. We’re eager to help your business create a happy marriage between your EDW and big data.