LEARN MORE

Zones



What You Need to Know About Data Warehousing

A data warehouse is an information technology strategy that takes production data, scrubs it up, organizes it, and puts it in a place appropriate for browsing, data analysis, and decision making. The result: Practical access to critical data and decision support.

"We've had decision support for years, but we always had to go out and find the data each time we wanted to write a new decision support application," explains Ron Shelby, general director of Data Management and Warehousing for General Motors Corp. (Detroit, MI). This extra effort was because companies like GM installed centralized database management systems (DBMS) that were "best-in-class" in their day, but limited by physical space and costs. Too often, the high volumes and the complexities of operational data brought these systems to their virtual knees. Moreover, the data access tools that came with these DBMS were either too unsophisticated for detailed data analysis or they were too difficult for the average end user to exploit.

Then, in the early 1990s, GM began aggregating all of its data regarding vehicles and vehicle orders into data warehouses. These information systems segregated the company's operational data and data query activities from the production databases. Data access became easier and less costly, regardless of the location of the source data. Plus, decision support solutions, such as analyzing how warranty claims varied by car model, came faster and were more accurate.

What Is a Data Warehouse?

A data warehouse is, well, a warehouse of data. It is a centralized directory of metadata—information about data—that points to data contained in multiple, and sometimes incompatible, transactional systems. In short, it is a database containing snapshots of different databases. The result is a common, comprehensive view of all the data in the databases. From this, users can surf through, aggregate, and manipulate the data as they see fit to extract the relevant information they need.

In operation, according to Computer Sciences Corp (CSC; El Segundo, CA), "active, tactical, and current events flow from the operational system to the warehouse to become static, strategic, and historical data." That flow is more a periodic operation as data are extracted from the operational (transaction) system to populate the data warehouse. Users then issue queries against the data warehouse and perform decision support activities, such as create management reports and perform data analysis.

The transactional database and the data warehouse are kept separate; the data warehouse is not a shadow or replicated database. Consequently, users' queries do not adversely affect the performance of the organization's operational system. Continues CSC, "In a perfect world, simply retaining the expired operational data would support decision support queries from the operational database. This arrangement sounds good in theory, but has proven to be impractical: the real-world performance, capacity, and technology limitations cannot be ignored. Controlled and practical redundancy is preferable to out-of-control theoretical data purity."

What this requires in terms of information technology, according to CSC, are two main components. There is (1) the "information store of historical events (the data warehouse)" and (2) the "tools to accomplish strategic analysis of that information (a decision support system)." Information from PeopleSoft, Inc. (Pleasanton, CA) adds some details about what comprises a data warehouse:

Transactional applications. Source data can be stored in any format from modern relational databases to traditional legacy sources, including IMS databases, VSAM files, IDMS databases, flat files, personal computer files, and spreadsheets.

Data extraction and transformation tools to read data from transactional systems, transform the data for data consistency, and write it to an intermediate file.

Data scrubbing tools to further "cleanse" raw data.

Data movement software to move data from the intermediate files to the data warehouse while automatically managing data volume and cross-platform issues.
 

Data warehouse, which is typically a relational database optimized for analysis, not for transaction processing. According to the SAS Institute Inc. (Cary, NC), "a DBMS is typically a transaction-oriented system designed to handle multiple reads and writes to the same database by multiple users, such as in order entry or inventory control systems. Data warehouses, on the other hand, are designed to be read from, rather than written to." End users are not entering transactions; instead they are extracting data to analyze it. Both DBMS and data warehouses typically support SQL, data compression and indexing, and rigorous access security; however, data warehouses do not require sophisticated DBMS capabilities, such as rollback and recovery, which is required to recover what was lost in a transaction system. Instead, a data warehouse gets refreshed with the most current data.
 

Data marts are "specialized" data warehouses that are optimized to give users, such as sales, finance, and marketing, access to the data they need out of the data warehousing environment for data analysis needs.
 

Data access tools to retrieve, view, manipulate, analyze, and present data. On the desktop, these tools include spreadsheets, query engines, report writers, and even web browsers.
 

Repository tools to maintain the metadata that points to the data in the data warehouse. Repository tools also monitor transactional applications so that if a data record in the transactional system changes, the data extraction and transformation tools will be updated to follow suit.
Administrative tools for implementing the actual data warehouse.

GM currently has data warehouses containing data about vehicle orders, warranties, and dealers. These data warehouses are typically divided by sector, such as Delphi, GMAC, and the automotive sectors in North America and Europe. GM's main data warehouses run in MVS mainframe-based DB2 database environments; newer data warehouses and data marts run on Unix-based Informix DBMS. Data extraction and data cleaning tools from Prism Solutions Inc. (Sunnyvale, CA) help maintain data integrity throughout the information system.

By the way, these data warehouses are huge. GM had several 30- to 50-gigabyte databases in its data marts three to five years ago. The benefits from these data warehouses were so great that GM is now talking about creating data warehouses that are "beyond what anybody else has done," says Shelby. "One of the data warehouse environments we need to put together will contain more than 10 terabytes of data."

Easy Access to Benefits

Data warehouses are already at work in a variety of industries and applications—from utilities to lotteries. One durable goods manufacturer has an equipment maintenance data warehouse with data regarding the condition of equipment, work performed, work crews, and the time, material, and costs involved. A production data warehouse, such as what GM is developing, could include information about both the vehicles and the conditions of the manufacturing process at the time the vehicles were made. Example queries against this production data warehouse could include:

What was the yield on a given production line over the last six months?
How many of a given car model were produced on each shift over the last four weeks?
What were the mean and standard deviation for a key production variable for a particular car model over a specified time interval?
What are the material costs to build a specified car model and who are the suppliers?
Which production lines had the lowest problem rate in terms of warranty claims?
"Manufacturers today typically don't have that kind of data close at hand," says Shelby. To do so takes a lot of time and money. "Increasingly it's the time, more than the money, that's important. The value of a data warehouse environment is that it helps reduce the development time in getting information, which means faster cost benefit to the company."

From an MIS perspective, a data warehouse focused on and maintaining historical data over time makes the operational system leaner, letting it more directly address the daily operational needs of production. The data warehouse also improves the effectiveness and productivity of manufacturing engineering support and it helps integrate manufacturing functions by enabling the sharing of data across operating areas. Similarly, data warehousing supports supply chain management initiatives by collecting data from various platforms and sources for decision-making.

A data warehouse can also be an essential element in a company's ISO 9000 efforts. It can provide an organized infrastructure to track, monitor, and enforce parameters or operations to comply with specifications. It provides people at all levels of the organization an opportunity to view, review, analyze, and potentially input contributing quality efforts, thus giving the company more direct control of its total quality achievements.

For GM, says Shelby, one of the greatest benefits will be "in using data warehousing to `close the loop' from the manufacturing floor all the way to our warranty system, creating a foundation for decision support solutions to improve vehicle quality. By improving vehicle quality, we will be able to lower costs and increase margins. The manufacturer that gets there first and does the best job is going to reap a pretty significant reward."

Data Cleanliness Is Next to Enterprise Efficiency

General Motors Corporation, General Motors Corp., GM Corp., and GM are not all the same company—to a computer. Therein lies the problem. "When building a data warehouse, data mart, customer information system, or a data store, you need to look at the quality of your data. If those data are migrated in their current state, they're probably fraught with inconsistencies, juxtapositions, spelling errors, and mixed domains," warns Tony Jacques, senior sales representative for Vality Technology Inc. (Boston, MA).

There are many reasons for such losses in data integrity, including multiple data entry operators, lack of corporate standards, operational data distributed across disparate sources and legacy systems, data redundancy between different applications, and simple data entry errors. For example, one Vality customer, a first-tier automotive supplier, has seven divisions that share some of the same vendors, but not the same vendor codes. Each division not only assigned a unique identification number to each of those vendors, each division also had its own descriptions for the components supplied by those vendors. On top of that, the seven divisions had a hodgepodge of legacy hardware platforms, software applications, and financial systems. To say the least, monthly roll-ups for divisional product sales, as well as component and supplier costs, were difficult for the parent company to obtain.

To extract order out of this informational chaos, the supplier decided to implement a single enterprise resource planning (ERP) system across all of its divisions. Before doing so, the supplier contracted Vality for "data reengineering"—Vality's terminology for cleaning and reconditioning operational data, reconciling conflicting data structures and data variations (e.g., multiple spellings and misspellings for the same data entities), and consolidating data from one or more data sources before migrating those data to new and existing target databases.

"Failure to identify complex relationships buried within text fields and hidden across millions of records will undoubtedly cause inaccurate responses to end-user queries and greatly diminish the effectiveness of data-mining trend analysis," says Jacques. In terms of costs, Vality's customer had three key justifications for data reengineering:

Garbage in/garbage out was un-acceptable for the company's new, multi-million dollar ERP system.
Data cleansing reduced duplicated vendor listings, which in turn reduced operating costs, estimated at $30 for each check cut to a vendor, whether the check was for $5 or $500,000.
Consolidating vendor information enabled the supplier to negotiate larger volume discounts from its vendors, further cutting the company's operating costs.
Points out Jacques, "Any organization that is not now looking at the quality of its data and cleansing those data is probably a year or two behind the times."