
Data Warehousing Essay, Research Paper

Contents

1. Introduction

2. What is a Data Warehouse

3. Past, Present and Future

4. Data Warehouses and Business Organisations

5. Conclusion

6. Bibliography

1.0 Introduction

In recent years, data warehousing has emerged as the primary method of analysing sales and marketing data for a competitive advantage. As the number of knowledge workers using the data warehouse/data mart grows and the amount of data increases daily, performance problems have become a major concern of both the Information Systems staff and the users.

Many options have been tried in an attempt to solve the performance problems – from bigger hardware to different software or database tuning and redesign using star schemas or snowflake data structures. However, all have limitations – either in functionality or in terms of cost – and their strengths are almost inevitably outstripped by users’ demands.

During the past three years, data warehousing has emerged as one of the hottest trends in information technology for corporations seeking to utilise the massive amounts of data they are accumulating.

Managers from all business disciplines want enterprise wide information access, as well as the ability to manipulate and analyse information that the company has gathered for a single purpose, to make more intelligent business decisions. Whether to increase customer value, identify new markets or improve the management of the firm’s assets, the data warehouse promises to deliver the information necessary to accomplish these tasks quickly and efficiently.

This report covers various aspects of Data Warehousing, ranging from a clear and concise definition of its working system through to its operational environment. It discusses its implications and effects on internal and external interaction. I have presented my findings with the backing of some actual case studies and elaborated upon the evolution, the current state and what the future holds for Data Warehousing.

The report is summarised by a final conclusion.

2.0 What is a Data Warehouse

The data warehouse is the centre of the architecture for information systems in the 1990s. It supports informational processing by providing a solid platform of integrated, historical data from which to do analysis. It provides a facility for integration in a world of unintegrated application systems. It is built in a step-at-a-time fashion, and it organises and stores the data needed for informational, analytical processing over a long historical time perspective.

There is indeed a tremendous advantage in building and maintaining a data warehouse.

So now the question arises: what is a data warehouse?

A data warehouse is a

+ subject-oriented

+ integrated

+ time-variant

+ non-volatile

collection of data in support of management's decision-making process.

The data entering the data warehouse comes from the operational environment in almost every case.

The data warehouse is always a physically separate store of data transformed from the data found in the operational environment.

To understand the data warehouse in more detail, I shall now elaborate upon its main characteristics.

2.1 Subject-Oriented

The first feature of the data warehouse is that it is oriented around the major subjects of the enterprise. This data-driven, subject orientation contrasts with the more classical process/functional orientation of applications, around which most older operational systems are organised.

For example, the operational world of a financial institution might be designed around applications and functions such as loans, savings, bank cards and trusts. The data warehouse world, by contrast, would be organised around major subjects such as customer, vendor, product and activity. This alignment around subject areas affects the design and implementation of the data found in the data warehouse. Most importantly, the major subject areas influence the most important part of the key structure.
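To make the contrast concrete, below is a minimal Python sketch showing how the same facts, first held application by application, can be regrouped around the subject customer. The record layouts and identifiers are invented purely for the illustration:

```python
# Hypothetical records, grouped the operational way: by application.
operational = {
    "loans":     [{"loan_id": "L-1", "customer_id": "C-42", "balance": 9000}],
    "savings":   [{"acct_id": "S-7", "customer_id": "C-42", "balance": 1200}],
    "bank_card": [{"card_id": "B-3", "customer_id": "C-42", "limit": 2500}],
}

# Subject-oriented view: the same facts regrouped around the major
# subject "customer", which now drives the key structure.
by_customer = {}
for application, records in operational.items():
    for record in records:
        entry = dict(record, source_application=application)
        by_customer.setdefault(record["customer_id"], []).append(entry)

print(by_customer["C-42"])  # all of customer C-42's activity in one place
```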

The application world is concerned both with database design and process design. The data warehouse world focuses on data modelling and database design exclusively. Process design is not part of the data warehouse environment.

The differences between process/function application orientation and subject orientation show up as a difference in the content of data at the detailed level as well. Data warehouse data excludes data that will not be used for Decision Support System (DSS) processing, while operational application-oriented data contains data to satisfy immediate functional/processing requirements that may or may not be of use to the DSS analyst.

2.2 Integration

Easily the most important aspect of the data warehouse environment is that data found within the data warehouse is integrated. Always, with no exceptions.

The integration shows up in many different ways – in consistent naming conventions, in consistent measurement of variables, in consistent encoding structures, in consistent physical attributes of data, and much more.

When data is moved to the data warehouse from the application-oriented operational environment, the data is integrated before entering the warehouse.

Over the years, different application designers have made numerous individual decisions as to how an application should be built. The style and the individualised design decisions of the application designer show up in a hundred ways: in differences in encoding, in key structures, in physical characteristics, in naming conventions, and so forth.

The collective ability of many application designers to create inconsistent applications is legendary.

I have shown two examples below to illustrate the point:

Encoding – application designers have chosen to encode the field GENDER in different ways. One designer represents GENDER as an “M” and an “F”. Another application designer represents GENDER as a “1” and a “0”. Whilst another represents GENDER as an “x” and a “y”. And yet another represents it as “male” and “female”. It does not matter much how GENDER arrives in the data warehouse – “M” and “F” are probably as good as any representation. What matters is that whatever source GENDER comes from, it must arrive in the data warehouse in a consistent, integrated state. Therefore, when GENDER is loaded into the data warehouse from an application where it has been represented in other than an “M” and “F” format, the data must be converted to the data warehouse format.
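As an illustration of this kind of conversion, here is a minimal Python sketch. The source encodings mirror the examples above; the polarity chosen for the “1”/“0” and “x”/“y” encodings, and the function itself, are assumptions made for the example:

```python
# Mapping of known source encodings onto the warehouse's M/F format.
GENDER_MAP = {
    "M": "M", "F": "F",          # already in warehouse format
    "1": "M", "0": "F",          # numeric encoding (polarity assumed)
    "X": "M", "Y": "F",          # single-letter encoding (polarity assumed)
    "MALE": "M", "FEMALE": "F",  # spelled-out encoding
}

def to_warehouse_gender(raw):
    """Convert a source-system GENDER value to the warehouse's M/F format."""
    key = raw.strip().upper()
    if key not in GENDER_MAP:
        raise ValueError("Unrecognised GENDER encoding: %r" % raw)
    return GENDER_MAP[key]

assert to_warehouse_gender("female") == "F"
assert to_warehouse_gender("1") == "M"
```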

Measurement of attributes – application designers have chosen to measure pipeline data in a variety of ways over the years. One application designer stores pipeline data in centimetres. Another stores pipeline data in inches. Another stores the data in million cubic feet per second. And another designer stores pipeline information in yards. Whatever the source, when the pipeline information arrives in the data warehouse it needs to be measured in the same way.
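A minimal Python sketch of such unit normalisation follows. For simplicity it covers only the length units from the example, and it assumes, purely for illustration, that the warehouse standardises on metres:

```python
# Standard conversion factors from each source unit to metres.
TO_METRES = {
    "centimetres": 0.01,    # 1 cm = 0.01 m
    "inches":      0.0254,  # 1 in = 0.0254 m
    "yards":       0.9144,  # 1 yd = 0.9144 m
    "metres":      1.0,
}

def to_warehouse_length(value, unit):
    """Convert a source-system pipeline measurement into metres."""
    return value * TO_METRES[unit]

print(to_warehouse_length(36.0, "inches"))  # ~0.9144, the same as one yard
```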

The issues of integration affect almost every aspect of design – the physical characteristics of data, the dilemma of having more than one source of data, the issue of inconsistent naming standards, inconsistent date formats, the list is endless.

Whatever the design issue, the result is the same – the data needs to be stored in the data warehouse in a singular, globally-acceptable fashion even when the underlying operational systems store the data differently.

When the DSS analyst looks at the data warehouse, the focus of the analyst should be on using the data that is in the warehouse, rather than on wondering about the credibility or consistency of the data.

2.3 Time Variancy

All data in the data warehouse is accurate as of some moment in time. This basic characteristic of data in the warehouse is very different from data found in the operational environment. In the operational environment when you access a unit of data, you expect that it will reflect accurate values as of the moment of access.

Because data in the data warehouse is accurate as of some moment in time (i.e., not “right now”), data found in the warehouse is said to be “time variant.”

The time variancy of data warehouse data shows up in several ways. The simplest way is that data warehouse data represents data over a long time horizon – from five to ten years. The time horizon represented for the operational environment is much shorter – from the current values of today up to sixty to ninety days. Applications that must perform well and must be available for transaction processing must carry the minimum amount of data if they are to have any degree of flexibility at all. Therefore operational applications have a short time horizon, as a matter of sound application design.

Another way that time variancy appears is that data warehouse data, once correctly recorded, cannot be updated. In some cases it may be unethical or even illegal for data in the data warehouse to be altered. Operational data, being accurate as of the moment of access, can be updated as the need arises.
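The following minimal Python sketch illustrates the idea: each record carries the moment at which it was accurate, new snapshots are appended rather than overwriting old ones, and a query asks for a value “as of” some date rather than “right now”. The account data and the helper function are invented for the example:

```python
from datetime import date

# Each entry is (snapshot_date, balance); rows are appended, never changed.
balance_history = [
    (date(1996, 1, 31), 1200),
    (date(1996, 2, 29), 1350),
    (date(1996, 3, 31), 1100),
]

def balance_as_of(history, when):
    """Return the most recent balance recorded on or before `when`."""
    eligible = [bal for snap, bal in history if snap <= when]
    if not eligible:
        raise LookupError("No snapshot on or before %s" % when)
    return eligible[-1]

print(balance_as_of(balance_history, date(1996, 3, 15)))  # 1350
```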

2.4 Non-volatile

The fourth defining characteristic of the data warehouse is that it is non-volatile. This refers to the fact that information in the operational environment needs to be changed, deleted and updated, and other data inserted, whereas the data in the data warehouse supports only two operations: the initial loading of the data, and the access of the data. This seemed very simple to me, but after extensive research I understood that its implications were very powerful.

For example, at the design level, there is no need to be cautious about the update function, since data in the warehouse is not updated. Therefore, at the physical level of design, liberties can be taken to optimise the access of data, particularly in dealing with the issues of normalisation and physical de-normalisation.
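To illustrate the two-operation model, here is a minimal Python sketch of a warehouse table that offers only loading and access. The class and record shape are invented for the illustration, not taken from any product:

```python
class WarehouseTable:
    """Append-only store: data is loaded once and thereafter only read."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        # Loading is the only way data enters the table.
        self._rows.extend(rows)

    def access(self, predicate=lambda row: True):
        # Read-only access; no update() or delete() is offered, so the
        # physical design is free to optimise purely for reads.
        return [row for row in self._rows if predicate(row)]

sales = WarehouseTable()
sales.load([{"product": "A", "qty": 3}, {"product": "B", "qty": 5}])
print(sales.access(lambda r: r["qty"] > 4))  # [{'product': 'B', 'qty': 5}]
```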

3.0 Past, Present and Future

Data warehouses represent the latest great paradigm of database management. The earliest data management systems were hierarchical, ran on massive mainframes and were used primarily for archival purposes. The first big change came in the early 1980s, with the adoption of relational database systems, which have primarily operational applications. These systems, typically run on minicomputers, are used for online transaction processing, or O.L.T.P. – to operate networks of automated teller machines, for example. Now come data warehouses, commonly run on client/server networks of personal computers and more powerful server machines. These latest systems are used for online analytical processing, or O.L.A.P., an essentially strategic application.

Put another way, traditional database systems are good at recording and reporting what happened. In the 1991-93 time frame, this industry segment tackled the delicate work of getting companies to look at how they had organised information and what they needed to do to be more systematic. This drove them to look at how disorganised, or dirty, the data was in a variety of operational environments, and to examine data modelling requirements for organising multiple databases into one file structure. Data scrubbing and cleansing technologies and modelling tools began to evolve.

From 1993 to 1995, the need for an automated process for extraction and transformation became obvious. And during 1995-96, customers pushed vendors for more sophisticated transformation capabilities. Early on, a simple summary was as much as people could comprehend; now transformations involve many-to-many joins and if-then-else logic.

This period also saw the rise of the data replication business, particularly for companies that only wanted to copy information, integrated and summarised in some way, over from transactional files.

In 1997-98, it was predicted that the strongest trend would be the movement to operational data stores (ODS), integrating ‘near real-time’ transactional data and making it available for query, analysis and reporting. The telephone companies are moving 100% to ODS because, in the process of creating data warehouses, they have learned how to clean up operational systems.

Also, because of the amount of data flowing from operational to information systems, customers are demanding scheduling and monitoring capabilities to make sure that metadata is collected about whether data arrived at the data warehouse, and how many times a job was run and run successfully.

Another forthcoming development is that performance monitoring tools will proliferate between now and 1999, providing information on how useful or useless the data is. Because of the size of data warehouses, companies are paying close attention to the cleansing of data.

Web-enabling tools will appear, using Java applets accessible on the net to construct and maintain the warehouse. This will be an ease-of-use model for building the warehouse that will succeed all existing models.

In the current period, the issue of metadata standards is coming to the forefront, although there are competing efforts from different groups of vendors. The various tools, warehouse and end user alike, all produce their own metadata, each for different functions. It is critical to open metadata stores and make them available for other tools to view. But the biggest issue is to synchronise metadata so that you get a logical view of all the versions of information you are storing throughout your company.

4.0 Data Warehouses and Business Organisations

The information provided here is from real companies. My research led me to many case studies, and I have chosen two from the long list to show how the companies benefited from the implementation of a data warehouse.

The first success story is that of Longs Drug Stores, a retail chain based in Walnut Creek, California.

Prior to installing a data warehouse, category management was extremely difficult for Longs. With a decentralised business model, in which individual store managers served as buyers, the chain as a whole lacked data on specific product sales, making it impossible to measure the success of a promotion in a timely way. Now, all Longs stores feed information nightly to a corporate data warehouse that runs on software supplied by Red Brick Systems Inc., a pioneering software maker in the field. That means promotions can be measured, and altered, on a daily basis.

Before this, Longs did not know what was sold at the corporate level, although its store managers knew, and it was not getting any economies of scale. But since the warehouse opened, its stores have been able to see the impact of their merchandising decisions within a day. Longs is now working on an expanded warehouse that will combine its internal data with syndicated information from suppliers like A.C. Nielsen for a broader perspective.

Another success story is that of Nationwide Building Society, which used a data warehouse to help launch its life and pensions business.

Nationwide, now the world's largest building society, opened a life assurance and pensions business in January 1996, entering a highly competitive market. IT support was vital to the new £55 million operation and, according to Kevin Bounds, the financial director, much work was needed before opening for business.

Work began around 15 months before the company made its debut. It chose CAPSIL, a life assurance industry package, for policy processing on an IBM ES9000 platform. Other database applications were developed using Microsoft's SQL Server. Eventually a data warehouse was considered.

“We wanted an integrated management reporting application that could be fed by a whole variety of different sources. We required a single data repository,” said Bounds.

The case study goes on to describe how their first step was to design a SQL Server database. The next step was to decide what sort of query tool was to be used. After a complete evaluation, it was decided that the most appropriate tool was Andeyne's GQL (already being used by Barclays and NatWest). Apart from an excellent graphical interface and pre-defined report facilities, it was easy to build client queries, and performance was also outstanding. As the final piece in a complex £500,000 IT jigsaw, the SQL Server based warehouse went live in June 1996. It was loaded with the first five months of trading data. This meant writing interfaces to extract data from six operational and management systems, some of which are themselves fed from several others. Nationwide drew on a relational database structure of 90 tables with up to 50 data items per table.

Before June 1996, much of the management data available was created manually from separate reports. Since implementing the data warehouse, that has changed. All the data is now in one place, drawn together from the different operational systems. It has enabled Nationwide to replace a lot of its manual data collection. It was then able to use GQL for ad hoc analysis, producing reports to control operational activities and to monitor workflow, resourcing and management. Managers can monitor data quality with exception reports, and even produce daily new business summaries.

Given that the prime purpose of the data warehouse is to help run the company, one of the most complex reports covers ‘Key Performance Indicators’. Drawing on 1.5GB of stored data taken from all feeder systems, these KPIs track seller performance and help comply with the regulatory requirements of the Personal Investment Authority. Some data is also exported to a Microsoft Access based actuarial system. Nationwide Life is also using the warehouse to help senior directors slice and dice management information in different ways; for example, new analyses are available by product, customer, premium level or region. Drilling down into the detail helps identify good and bad practices, helping the sales force to improve its overall performance.