What is an OLAP cube in Excel. Creating an SSAS project. What is analysis and why is it needed?

Information systems of a serious enterprise, as a rule, contain applications designed for complex analysis of data, its dynamics, trends, and so on. The main consumers of the analysis results are top managers, and the analysis is ultimately intended to support decision making. To make any management decision you need the relevant information, usually quantitative. To obtain it, the data must be collected from all of the enterprise's information systems, brought to a common format, and only then analyzed. Data warehouses are created for this purpose.

What is a data warehouse?

Usually - the place where all information of analytical value is collected. The requirements for such stores correspond to the classic definition of OLAP and will be explained below.

Sometimes the warehouse has another goal: integrating all enterprise data in order to maintain the integrity and relevance of information across all information systems. In that case the warehouse accumulates not only analytical but almost all information, and can serve it back to other systems in the form of reference directories.

A typical data warehouse usually differs from a typical relational database. First, regular databases are designed to help users perform day-to-day work, while data warehouses are designed for decision making. For example, the sale of goods and the issuance of invoices are carried out using a database designed for transaction processing, while the analysis of sales dynamics over several years, which makes it possible to plan work with suppliers, is carried out using a data warehouse.

Second, while traditional databases are subject to constant change as users work, the data warehouse is relatively stable: the data in it is usually updated according to a schedule (for example, weekly, daily, or hourly, depending on needs). Ideally, the enrichment process is simply the addition of new data over a period of time without changing the previous information already in the warehouse.

Third, regular databases are most often the source of the data that ends up in the warehouse. In addition, the warehouse can be replenished from external sources, for example statistical reports.

How is a data warehouse built?

ETL is the basic concept. It has three stages:
  • Extraction – extracting data from external sources in an understandable format;
  • Transformation – converting the structure of the source data into structures convenient for building an analytical system;
  • Loading – loading the transformed data into the warehouse.
Let's add one more stage, data cleaning (Cleaning): the process of filtering out irrelevant data or correcting erroneous data based on statistical or expert methods, so as not to generate reports like “Sales for 20011” later.
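
The stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the raw rows, the field names and the in-memory `warehouse` list are all hypothetical, and the cleaning rule (drop rows with an impossible date) is the simplest possible one.

```python
from datetime import date

# Hypothetical raw rows as they might arrive from source systems.
raw_rows = [
    {"order_date": "2007-03-15", "country": "germany", "amount": "120.50"},
    {"order_date": "20011-01-01", "country": "France", "amount": "80.00"},  # bad year
    {"order_date": "2008-11-02", "country": "France", "amount": "45.25"},
]

def extract(rows):
    """Extraction: read rows from the source (here, an in-memory list)."""
    return list(rows)

def transform(row):
    """Transformation: convert strings into typed, uniformly formatted values."""
    year, month, day = (int(part) for part in row["order_date"].split("-"))
    return {
        "order_date": date(year, month, day),  # raises ValueError for year 20011
        "country": row["country"].title(),
        "amount": float(row["amount"]),
    }

def clean(rows):
    """Cleaning: keep only the rows that can be transformed into valid values."""
    good = []
    for row in rows:
        try:
            good.append(transform(row))
        except ValueError:
            continue  # drop rows such as "Sales for 20011"
    return good

warehouse = []  # stand-in for the warehouse fact table

def load(rows):
    """Loading: append the prepared rows to the warehouse."""
    warehouse.extend(rows)

load(clean(extract(raw_rows)))
```

After the run, `warehouse` holds the two valid rows; the row dated in the year 20011 has been filtered out.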

Let's return to the analysis.

What is analysis and why is it needed?

Analysis is the study of data for the purpose of making decisions. Analytical systems are called decision support systems (DSS).

Here it is worth pointing out the difference between working with a DSS and a simple set of regulated and unregulated reports. Analysis in a DSS is almost always interactive and iterative. That is, the analyst digs into the data, composing and adjusting analytical queries, and receives reports whose structure may not be known in advance. We will return to this in more detail below when we discuss the MDX query language.

OLAP

Decision support systems usually have the means to provide the user with aggregate data for various samples from the original set in a form convenient for perception and analysis (tables, charts, etc.). The traditional approach to segmenting source data involves extracting from the source data one or more multidimensional data sets (often called a hypercube or metacube), the axes of which contain attributes, and the cells contain aggregated quantitative data. (Such data can also be stored in relational tables, but in this case we are talking about the logical organization of data, and not about the physical implementation of their storage.) Along each axis, attributes can be organized in the form of hierarchies, representing different levels of their detail. Thanks to this data model, users can formulate complex queries, generate reports, and obtain subsets of data.

The technology for complex multidimensional data analysis is called OLAP (On-Line Analytical Processing). OLAP is a key component of traditional data warehousing. The concept of OLAP was described in 1993 by Edgar Codd, a renowned database researcher and author of the relational data model. In 1995, based on the requirements set out by Codd, the so-called FASMI test (Fast Analysis of Shared Multidimensional Information) was formulated, including the following requirements for applications for multidimensional analysis:

  • providing the user with analysis results in an acceptable time (usually no more than 5 s), even at the cost of a less detailed analysis;
  • the ability to carry out any logical and statistical analysis characteristic of this application, and saving it in a form accessible to the end user;
  • multi-user access to data with support for appropriate locking mechanisms and authorized access means;
  • multidimensional conceptual representation of data, including full support for hierarchies and multiple hierarchies (this is a key requirement of OLAP);
  • the ability to access any necessary information, regardless of its volume and storage location.
It should be noted that OLAP functionality can be implemented in different ways, from the simplest data analysis tools in office applications to distributed analytical systems based on server products. That is, OLAP is not a technology but an ideology.

Before we talk about the various OLAP implementations, let's take a closer look at what cubes are from a logical point of view.

Multidimensional concepts

To illustrate OLAP principles we will use the Northwind database included with Microsoft SQL Server. It is a typical database storing information about the trading operations of a company engaged in the wholesale supply of food. The data includes information about suppliers, clients, a list of supplied goods and their categories, data about orders and ordered goods, and a list of company employees.

Cube

Let's take for example the Invoices1 table, which contains the company's orders. The fields in this table will be as follows:
  • Order date
  • Country
  • City
  • Customer name
  • Delivery company
  • Product Name
  • Quantity of goods
  • Order price
What aggregate data can we get from this view? Typically these are answers to questions like:
  • What is the total value of orders placed by customers from a particular country?
  • What is the total value of orders placed by customers in a certain country and delivered by a certain company?
  • What is the total value of orders placed by customers in a particular country in a given year and delivered by a particular company?
All this data can be obtained from this table using quite obvious SQL queries with grouping.

The result of this query will always be a column of numbers and a list of attributes describing it (for example, country) - this is a one-dimensional data set or, in mathematical language, a vector.
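
As a rough illustration of such grouping queries, here is a sketch that uses Python's built-in `sqlite3` module. The table name `invoices`, its column names and the sample rows are assumptions made for the example; they are not the real Northwind schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (order_year INTEGER, country TEXT, shipper TEXT, price REAL)"
)
# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?, ?)",
    [
        (2007, "Germany", "Speedy Express", 100.0),
        (2007, "Germany", "United Package", 50.0),
        (2008, "France", "Speedy Express", 70.0),
        (2008, "Germany", "Speedy Express", 30.0),
    ],
)

# Total value of orders per country: a one-dimensional result (a vector).
by_country = conn.execute(
    "SELECT country, SUM(price) FROM invoices GROUP BY country ORDER BY country"
).fetchall()

# Total value of orders per country and delivery company.
by_country_shipper = conn.execute(
    "SELECT country, shipper, SUM(price) FROM invoices "
    "GROUP BY country, shipper ORDER BY country, shipper"
).fetchall()
```

Each extra attribute in the `GROUP BY` clause adds one more descriptive column to the result, while the aggregated number stays a single column.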

Let's imagine that we need the total cost of orders from all countries and their distribution among delivery companies. We will get a table (matrix) of numbers, where delivery companies are listed in the column headings, countries in the row headings, and the cells contain the order amounts. This is a two-dimensional data array. Such a data set is called a pivot table or crosstab.
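
A crosstab of this kind can be built by hand in a few lines. The sketch below uses plain Python and made-up order rows; the names `orders`, `cells` and `pivot` are chosen for the example only.

```python
from collections import defaultdict

# (country, delivery company, order price): made-up sample data.
orders = [
    ("Germany", "Speedy Express", 100.0),
    ("Germany", "United Package", 50.0),
    ("France", "Speedy Express", 70.0),
    ("Germany", "Speedy Express", 30.0),
]

# Sum the order price for every (country, delivery company) pair.
cells = defaultdict(float)
for country, shipper, price in orders:
    cells[(country, shipper)] += price

countries = sorted({country for country, _, _ in orders})  # row headings
shippers = sorted({shipper for _, shipper, _ in orders})   # column headings

# The matrix itself: one row per country, one column per delivery company.
pivot = [[cells.get((c, s), 0.0) for s in shippers] for c in countries]
```

Here a missing combination (France with United Package) is shown as 0.0; a real OLAP tool would usually show it as an empty cell.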

If we want the same data broken down by year as well, one more dimension appears, i.e. the data set becomes three-dimensional (loosely speaking, a 3rd-order tensor or a 3-dimensional “cube”).

Obviously, the maximum number of dimensions is the number of all attributes (Date, Country, Customer, etc.) that describe our aggregated data (amount of orders, number of products, etc.).

This is how we arrive at the concept of multidimensionality and its embodiment, the multidimensional cube. We will call the table the cube is built from the “fact table”. Dimensions, or cube axes (dimensions), are attributes whose coordinates are the individual values of these attributes present in the fact table. For example, if order information was kept in the system from 2003 to 2010, the year axis will consist of 8 corresponding points. If orders come from three countries, the country axis will contain 3 points, regardless of how many countries are listed in the Country directory. Points on an axis are called its “members” (Members).

The aggregated data themselves are called “measures” (Measure). To avoid confusion with “dimensions”, the latter are preferably called “axes”. The set of measures forms one more axis, “Measures”. It has as many members (points) as there are measures (aggregated columns) in the fact table.
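
The point that axis members come from the fact table, not from the directories, can be shown with a small sketch. The rows, the `country_directory` list and the helper `members` are hypothetical names introduced for the illustration.

```python
# Made-up fact table: each row is one order.
fact_table = [
    {"year": 2003, "country": "Germany", "price": 100.0},
    {"year": 2004, "country": "France", "price": 70.0},
    {"year": 2004, "country": "Germany", "price": 30.0},
]

# A directory may list more countries than actually occur in the facts.
country_directory = ["France", "Germany", "Italy", "Spain"]

def members(rows, attribute):
    """Return the members of an axis: the distinct values found in the facts."""
    return sorted({row[attribute] for row in rows})

year_axis = members(fact_table, "year")
country_axis = members(fact_table, "country")
```

Italy and Spain are in the directory but not on the country axis, because no order refers to them.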

Members of dimensions or axes can be combined into one or more hierarchies (hierarchy). Let us explain what a hierarchy is with an example: cities from orders can be grouped into districts, districts into regions, regions into countries, and countries into continents or other entities. That is, there is a hierarchical structure continent-country-region-district-city with 5 levels (Level). For a district, data is aggregated over all cities included in it; for a region, over all its districts and therefore all their cities, and so on. Why do we need multiple hierarchies? For example, on the order date axis we may want to group points (i.e. days) into a Year-Month-Day hierarchy or into a Year-Week-Day hierarchy: in both cases there are three levels, but Week and Month obviously group the days differently. There are also hierarchies whose number of levels is not fixed and depends on the data, for example folders on a computer disk.

Data aggregation can occur using several standard functions: sum, minimum, maximum, average, count.
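
To see how two hierarchies over the same date axis group days differently, consider the following sketch. It uses made-up facts and a hypothetical helper `roll_up`; weeks are ISO weeks as returned by `date.isocalendar()`, which is an assumption of the example.

```python
from collections import defaultdict
from datetime import date

# (order date, order price): made-up facts at the Day level.
facts = [
    (date(2007, 1, 31), 10.0),
    (date(2007, 2, 1), 20.0),
    (date(2007, 2, 2), 5.0),
]

def roll_up(rows, level_key):
    """Sum prices over all days that share the same parent member."""
    totals = defaultdict(float)
    for day, price in rows:
        totals[level_key(day)] += price
    return dict(totals)

# Year-Month-Day hierarchy: the parent of a day is (year, month).
by_month = roll_up(facts, lambda d: (d.year, d.month))

# Year-Week-Day hierarchy: the parent of a day is (ISO year, ISO week).
by_week = roll_up(facts, lambda d: tuple(d.isocalendar())[:2])
```

The three days fall into two different months but into the same week, so the two hierarchies produce different intermediate totals from identical facts.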

MDX

Let's move on to the query language for multidimensional data.
SQL was originally designed not for programmers but for analysts (which is why its syntax resembles natural language). Over time it became more and more complicated, and now few analysts can use it well, if at all; it has become a tool for programmers. The MDX query language, reportedly developed by our former compatriot Mosha Pasumansky in the depths of Microsoft, was also initially aimed at analysts, but its concepts and syntax (which vaguely resembles SQL, to no benefit, because the resemblance only confuses) are even more complicated than SQL's. However, its basics are still easy to understand.

We will look at it in detail, first because it is the only such language that has received standard status, as part of the general XMLA protocol standard, and second because there is an open-source implementation of it in the form of the Mondrian project from Pentaho. Other OLAP systems (for example, Oracle OLAP Option) usually use their own extensions of SQL syntax, although they also declare support for MDX.

Working with analytical data sets means only reading them, not writing them. Thus MDX has no clauses for changing data, only a single selection clause: select.

In OLAP you can take slices of multidimensional cubes, where data is filtered along one or more axes, or projections, where the cube “collapses” along one or more axes, aggregating data. For example, our first example with the amount of orders by country is a projection of the cube onto the Country axis. The MDX query for this case will look like this:

Select ...Children on rows from
What's what here?

Select – the keyword; it is included in the syntax solely for beauty.
The first name in the expression is the name of the axis. All proper names in MDX are written in square brackets.
The second name is the name of the hierarchy. In our case, this is the Country-City hierarchy.
The third name is the name of the axis member at the first level of the hierarchy (i.e. country). All is a meta-member that unites all members of the axis. Every axis has such a meta-member; for example, the year axis has “All years”.
Children is a member function. Each member has several functions available, such as Parent, Level and Hierarchy, which return respectively the parent, the level in the hierarchy, and the hierarchy to which the member belongs. Children returns the set of child members of this member; in our case, the countries.
on rows – specifies how to arrange this data in the resulting table, in this case in the row headers. Other possible values are on columns, on pages, on sections and on chapters. An axis can also be specified simply by index, starting from 0.
from – indicates the cube from which the selection is made.

What if we don't need all countries, but only a couple specific ones? To do this, we can explicitly specify in the request the countries that we need, rather than selecting everything using the Children function.

Select { ..., ... } on rows from
The curly braces in this case declare a set (Set). A set is a list, an enumeration of members from one axis.

Now let’s write a query for our second example, the breakdown by delivery company:

Select ...Children on rows .Members on columns from
Added here:
the name of a second axis, the delivery company axis;
.Members – an axis function that returns all members on it. Hierarchies and levels have the same function. Because there is only one hierarchy in this axis, it can be omitted; and because the level and the hierarchy also coincide, all members can be displayed in one list.

I think it’s already obvious how we could continue this with our third example, adding detail by year. But instead of drilling down by year, let’s filter, i.e. build a slice. To do this, we will write the following query:

Select ..Children on rows .Members on columns from where (.)
Where is the filtering here?

where – the keyword.
The name in parentheses is one member of the hierarchy. Its full name would include every intermediate member, but because the name of this member is unique within the axis, all intermediate qualifications can be omitted.

Why is the date member in parentheses? The parentheses denote a tuple (tuple). A tuple is one or more coordinates along different axes. For example, to filter along two axes at once, we list two members from different dimensions in parentheses, separated by commas. That is, the tuple defines a “slice” of the cube (or a “filter”, if that terminology is closer to you).

The tuple is used for more than just filtering. Tuples can also be in the headers of rows/columns/pages, etc.

This is necessary, for example, in order to display the result of a three-dimensional query in a two-dimensional table.

Select crossjoin(...Children, ..Children) on rows .Members on columns from where (.)
Crossjoin is a function. It returns a set of tuples (yes, a set can contain tuples!) resulting from the Cartesian product of two sets. That is, the resulting set will contain all possible combinations of Countries and Years. The row headers will thus contain a pair of values: Country-Year.
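
The Cartesian product behind Crossjoin is easy to picture outside MDX. The sketch below is a plain Python analogy using `itertools.product`, not MDX itself; the member lists are made up.

```python
from itertools import product

# Two sets of members from different axes (made-up values).
countries = ["France", "Germany"]
years = [2007, 2008]

# Every combination of a country and a year: a set of tuples,
# which is what ends up in the row headers.
row_headers = list(product(countries, years))
```

With 2 countries and 2 years there are 4 row headers; in general the number of rows is the product of the sizes of the two sets, which is why crossjoins over large axes grow quickly.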

The question is: where do we specify which numerical characteristics should be displayed? In this case the default measure defined for the cube is used, i.e. Order price. If we want to display another measure, we recall that measures are members of the Measures dimension, and we act in exactly the same way as with the other axes. That is, filtering the query by one of the measures will display exactly that measure in the cells.

Question: what is the difference between filtering in where and filtering by specifying axis members in on rows? Answer: practically none. The where clause simply specifies the slice for those axes that do not take part in forming the headings. That is, the same axis cannot appear both in on rows and in where at the same time.

Computed Members

For more complex queries you can declare calculated members, both on the attribute axes and on the measures axis. For example, you can declare a new measure that displays the contribution of each country to the total amount of orders:

With member. as '.CurrentMember / ..', FORMAT_STRING='0.00%' select ...Children on rows from where .
The calculation occurs in the context of a cell, in which all coordinate attributes are known. The corresponding coordinates (members) can be obtained with the CurrentMember function for each of the cube axes. Here we must understand that the division in the expression does not divide one member by another, but divides the aggregated data of the corresponding cube slices! That is, the slice for the current territory is divided by the slice for all territories, i.e. by the total value of all orders. FORMAT_STRING sets the format for displaying values, here a percentage.
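
The arithmetic that such a calculated measure performs can be sketched in Python. This is an analogy for the computation, not MDX; the country totals are made up, and the `.2%` format specifier plays the role of the FORMAT_STRING.

```python
# Aggregated order amounts per country (made-up values).
order_totals = {"France": 70.0, "Germany": 180.0, "Italy": 250.0}

# The slice for "All" countries: the total value of all orders.
all_total = sum(order_totals.values())

# For each cell, divide the slice for the current member by the "All" slice.
share = {country: amount / all_total for country, amount in order_totals.items()}

# Display the ratio as a percentage with two decimal places.
formatted = {country: f"{value:.2%}" for country, value in share.items()}
```

Each value is computed per cell: the "current member" is simply the key being processed, and the denominator is the same aggregated total for every row.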

Another example of a calculated member, but on the years axis:

With member. as '. - .’
Obviously, the report will contain not the difference between the members themselves but the difference between the corresponding slices, i.e. the difference in the amount of orders between these two years.

Display in ROLAP

OLAP systems are, one way or another, based on some data storage and organization system. When that system is an RDBMS, we speak of ROLAP (we’ll leave MOLAP and HOLAP for self-study). ROLAP is OLAP on a relational database, i.e. the data is described in the form of ordinary two-dimensional tables. ROLAP systems convert MDX queries into SQL. The main computing problem for the database is fast aggregation. To aggregate faster, the data in the database is usually highly denormalized, i.e. it is not stored very efficiently in terms of disk space and integrity control. In addition, the database contains auxiliary tables that store partially aggregated data. Therefore a separate database schema is usually created for OLAP, one that only partially replicates the structure of the original transactional databases in terms of directories.

Navigation

Many OLAP systems offer interactive navigation tools for an already generated query (and, accordingly, the selected data). This is the so-called “drilling”. A more adequate word might be “deepening”, but this is a matter of taste; in some environments the word “drilling” has stuck.

Drill is report detailing achieved by reducing the degree of data aggregation, combined with filtering along some other axis (or several axes). There are several types of drilling:

  • drill-down – filtering along one of the source axes of the report and displaying detailed information on the descendants of the selected member within the hierarchy. For example, if there is a report on the distribution of orders broken down by Countries and Years, then clicking on the year 2007 will display a report broken down by the same Countries and the months of 2007.
  • drill-side – filtering along one or more selected axes and removing aggregation along one or more other axes. For example, if there is a report on the distribution of orders broken down by Countries and Years, then clicking on the year 2007 will display another report broken down, for example, by Countries and Suppliers, filtered by 2007.
  • drill-through – removing aggregation along all axes while filtering along them; it lets you see the source data from the fact table from which the value in the report was obtained. That is, when you click on a cell value, a report is displayed with all the orders that make up this amount. A kind of instant drilling into the very “depths” of the cube.
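
Drill-down and drill-through can be imitated over a small in-memory fact table. The sketch below is an illustration only: the order rows and the helper `aggregate` are hypothetical, and a real OLAP tool would do the same through its query engine.

```python
# Made-up fact table of orders.
orders = [
    {"id": 1, "country": "Germany", "year": 2007, "month": 1, "price": 100.0},
    {"id": 2, "country": "Germany", "year": 2007, "month": 3, "price": 50.0},
    {"id": 3, "country": "France", "year": 2007, "month": 3, "price": 70.0},
    {"id": 4, "country": "Germany", "year": 2008, "month": 5, "price": 30.0},
]

def aggregate(rows, *keys):
    """Sum the order price for every combination of the given attributes."""
    totals = {}
    for row in rows:
        coord = tuple(row[key] for key in keys)
        totals[coord] = totals.get(coord, 0.0) + row["price"]
    return totals

# The initial report: Countries by Years.
report = aggregate(orders, "country", "year")

# Drill-down on 2007: filter by that year, then detail it by month.
only_2007 = [row for row in orders if row["year"] == 2007]
drill_down = aggregate(only_2007, "country", "month")

# Drill-through on the (Germany, 2007) cell: the orders behind the number.
drill_through = [
    row["id"] for row in orders
    if row["country"] == "Germany" and row["year"] == 2007
]
```

The (Germany, 2007) cell of the report holds 150.0, and drill-through shows that it is made up of orders 1 and 2.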
That's all. Now, if you decide to devote yourself to Business Intelligence and OLAP, it’s time to start reading serious literature.


I have been a resident of Habr for quite some time, but I have never read articles on the topic of multidimensional cubes, OLAP and MDX, although the topic is very interesting and is becoming more and more relevant every day.
It is no secret that over the short history of databases, electronic accounting and online systems, a great deal of data has accumulated. A full analysis of these archives, and perhaps an attempt to predict how similar situations will develop in the future, is now of interest in its own right.
On the other hand, large companies can accumulate, within a few years, months or even weeks, such volumes of data that even basic analysis requires non-trivial approaches and demanding hardware. Examples are banking transaction processing systems, exchange agents, telephone operators, and so on.
I think everyone is well aware of the two different approaches to database design: OLTP and OLAP. The first (Online Transaction Processing) is designed for efficient data collection in real time, while the second (Online Analytical Processing) is aimed specifically at selecting and processing data in the most efficient way.

Let's look at the main capabilities of modern OLAP cubes and what problems they solve (Analysis Services 2005/2008 are taken as a basis):

  • fast access to data
  • preaggregation
  • hierarchy
  • working with time
  • multidimensional data access language
  • KPI (Key Performance Indicators)
  • data mining
  • multi-level caching
  • multilingual support
So, let's look at the capabilities of OLAP cubes in a little more detail.

A little more about the possibilities

Quick access to data
Actually, fast access to data, regardless of the size of the array, is the basis of OLAP systems. Since this is the main focus, a data warehouse is usually built on principles different from those of relational databases.
Here, the time to fetch simple data is measured in fractions of a second, and a query exceeding a few seconds most likely requires optimization.

Preaggregation
In addition to quickly retrieving existing data, an OLAP system can also preaggregate the values that are “most likely to be used”. For example, if we have daily records of sales of a certain product, the system can preaggregate monthly and quarterly sales amounts as well, which means that if we request monthly or quarterly data, the system will return the result instantly. Why does preaggregation not always occur? Because the number of theoretically possible combinations of goods, time, and so on can be huge, so you need clear rules for which elements aggregations will be built and for which they will not. In general, the topic of defining these rules and designing the aggregations themselves is quite extensive and deserves a separate article.
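
A toy version of preaggregation: monthly and quarterly totals are computed once, ahead of time, so that later requests are simple lookups. The daily figures below are made up, and the dictionaries `monthly` and `quarterly` are names chosen for the sketch.

```python
from collections import defaultdict
from datetime import date

# Made-up daily sales of one product.
daily_sales = {
    date(2023, 1, 15): 10.0,
    date(2023, 2, 1): 20.0,
    date(2023, 4, 3): 5.0,
}

monthly = defaultdict(float)    # key: (year, month)
quarterly = defaultdict(float)  # key: (year, quarter)

# One pass over the detailed data fills both aggregate levels.
for day, amount in daily_sales.items():
    monthly[(day.year, day.month)] += amount
    quarterly[(day.year, (day.month - 1) // 3 + 1)] += amount

# A later request for a quarter no longer touches the daily records.
first_quarter = quarterly[(2023, 1)]
```

The trade-off described above is visible even here: every additional aggregate level costs storage and processing time, so only the combinations that are likely to be queried are worth precomputing.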

Hierarchies
Naturally, when analyzing data and building final reports, you need to take into account that months consist of days and themselves form quarters, and that cities belong to areas, which in turn are part of regions or countries. The good news is that OLAP cubes treat data from the start in terms of hierarchies and relationships with other parameters of the same entity, so building and using hierarchies in cubes is very simple.

Working with time
Since data analysis mostly takes place along time, time is given special importance in OLAP systems. Once you have simply told the system which dimension is time, you can easily use functions such as Year To Date and Month To Date (the period from the beginning of the year or month to the current date), Parallel Period (the same day or month, but in the previous year), and so on.
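
What these time functions compute can be shown with two small Python helpers. These are simplified stand-ins written for the illustration, not the Analysis Services functions themselves; the sales figures are made up, and `parallel_period` assumes the date is not February 29.

```python
from datetime import date

# Made-up daily sales.
daily_sales = {
    date(2022, 3, 1): 7.0,
    date(2023, 1, 10): 10.0,
    date(2023, 3, 1): 20.0,
    date(2023, 6, 1): 99.0,
}

def year_to_date(sales, today):
    """Sum of sales from the beginning of today's year up to and including today."""
    return sum(
        amount for day, amount in sales.items()
        if day.year == today.year and day <= today
    )

def parallel_period(sales, day):
    """Sales on the same day one year earlier (assumes day is not Feb 29)."""
    return sales.get(day.replace(year=day.year - 1), 0.0)

ytd = year_to_date(daily_sales, date(2023, 3, 1))
last_year_same_day = parallel_period(daily_sales, date(2023, 3, 1))
```

In a cube these calculations work at any level of the time hierarchy (day, month, quarter), which is why the system needs to know which dimension is time.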

Multidimensional Data Access Language
MDX (Multidimensional Expressions) is a query language for simple and efficient access to multidimensional data structures. And that says it all; there will be a few examples below.

Key Performance Indicators (KPI)
Key Performance Indicators is a financial and non-financial measurement system that helps an organization determine the achievement of strategic goals. Key performance indicators can be quite simply defined in OLAP systems and used in reports.

Data mining
Data mining is, in essence, the identification of hidden patterns or relationships between variables in large data sets.
The English term “Data Mining” does not have an unambiguous translation into Russian (data extraction, information mining and similar variants are all in use), so in most cases it is used in the original. The most successful indirect translation is “intelligent data analysis”. However, this is a separate, no less interesting topic.

Multi-level caching
To ensure the highest speed of data access, OLAP systems support multi-level caching in addition to clever data structures and preaggregations. Besides simple queries, the system also caches parts of the data read from the store, aggregated values, and calculated values. Thus, the longer you work with an OLAP cube, the faster it actually works. There is also the concept of “warming up the cache”: an operation that prepares the OLAP system for working with specific reports, queries, or all of them together.

Multilingual support
Yes, yes, yes. At a minimum, Analysis Services 2005/2008 (Enterprise Edition only, though) natively supports multiple languages. It is enough to provide translations of the string parameters of your data, and a client who has specified a language will receive localized data.

Multidimensional cubes

So what exactly are these multidimensional cubes?
Let's imagine a 3-dimensional space whose axes are Time, Products and Customers.
A point in such a space will indicate the fact that one of the buyers bought a specific product in a particular month.

In fact, the set of all such points is the cube, and Time, Products and Customers are, accordingly, its dimensions.
It is a little more difficult to imagine (and draw) a cube with four or more dimensions, but the essence does not change, and most importantly, for OLAP systems it does not matter at all how many dimensions you work in (within reasonable limits, of course).

A little MDX

So, what’s the beauty of MDX? Most likely, it is that we need to describe not how we want to select the data, but what exactly we want.
For example,
SELECT
( . ) ON COLUMNS,
( ., . ) ON ROWS
FROM
WHERE (., .)

Which means: I want the number of iPhones sold in June and July in Mozambique.
At the same time I describe which data I want and how I want to see it in the report.
Beautiful, isn't it?

Here's a little more complicated:

WITH MEMBER AverageSpend AS
. / .
SELECT
( AverageSpend ) ON COLUMNS,
( .., .. ) ON ROWS
FROM
WHERE (.)


In fact, first we determine the formula for calculating the “average purchase size” and try to compare who (what gender) spends more money in one visit to the Apple store.

The language itself is extremely interesting both to study and to use, and perhaps deserves a lot of discussion.

Conclusion

In fact, this article covers very little of even the basic concepts; I would call it an “appetizer”, an opportunity to interest the Habr community in this topic and develop it further. As for development, there is a huge unplowed field here, and I will be happy to answer all your questions.

P.S. This is my first post about OLAP and my first publication on Habr; I would be very grateful for constructive feedback.
Update: I transferred it to SQL, I will transfer it to OLAP as soon as they allow me to create new blogs.


As part of this work, the following issues will be considered:

  • What are OLAP cubes?
  • What are measures, dimensions, hierarchies?
  • What types of operations can be performed on OLAP cubes?

The concept of an OLAP cube

The main postulate of OLAP is multidimensionality in data presentation. In OLAP terminology, the concept of a cube, or hypercube, is used to describe a multidimensional discrete data space.

A cube is a multidimensional data structure from which a user-analyst can query information. Cubes are created from facts and dimensions.

Facts are data about objects and events in the company that will be subject to analysis. Facts of the same type form measures. A measure is the type of value in a cube cell.

Dimensions are the data elements by which the facts are analyzed. A collection of such elements forms a dimension attribute (for example, days of the week can form an attribute of the time dimension). In business analysis tasks for commercial enterprises, the dimensions often include categories such as “time”, “sales”, “products”, “customers”, “employees” and “geographic location”. Dimensions are most often hierarchical structures: logical categories by which the user can analyze the facts. Each hierarchy can have one or more levels. Thus, the hierarchy of the “geographic location” dimension may include the levels “country - region - city”. A similar sequence of levels can be distinguished in the time hierarchy. A dimension can have several hierarchies (every hierarchy of one dimension must share the same key attribute of the dimension table).

A cube can contain actual data from one or more fact tables and most often contains multiple dimensions. Any given cube usually has a specific focus for analysis.

Figure 1 shows an example of a cube designed to analyze a company's sales of petroleum products by region. This cube has three dimensions (time, product and region) and one measure (sales volume in monetary terms). Measure values are stored in the corresponding cells of the cube. Each cell is uniquely identified by a set of members, one from each dimension, called a tuple. For example, the cell in the lower left corner of the cube (containing the value $98,399) is specified by the tuple [July 2005, Far East, Diesel]. Here the value $98,399 is the sales volume (in monetary terms) of diesel in the Far East for July 2005.

It is also worth noting that some cells do not contain any values: these cells are empty because the fact table does not contain data for them.

Fig. 1. Cube with information on sales of petroleum products in various regions
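
One way to picture cell addressing is a dictionary keyed by tuples. In the sketch below only the [July 2005, Far East, Diesel] cell with the value 98399 comes from the text; the second cell and the helper `cell` are made up for the illustration.

```python
# Cube cells keyed by a (month, region, product) tuple.
cube = {
    ("July 2005", "Far East", "Diesel"): 98399,     # the cell described in the text
    ("August 2005", "Far East", "Diesel"): 101000,  # made-up value
}

def cell(cube, month, region, product):
    """Return the measure value for a tuple, or None for an empty cell."""
    return cube.get((month, region, product))

known = cell(cube, "July 2005", "Far East", "Diesel")
empty = cell(cube, "July 2005", "Ural", "Diesel")
```

Storing only the tuples that actually occur in the fact table is what makes empty cells cheap: they simply have no entry.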

The ultimate goal of creating such cubes is to minimize the processing time of queries that extract the required information from the facts. To accomplish this, cubes typically contain precomputed totals called aggregations. That is, the cube covers a data space larger than the actual one: it also contains logical, calculated points. Aggregation functions allow you to calculate the values of points in the logical space from the actual values. The simplest aggregation functions are SUM, MAX, MIN and COUNT. For example, using the MAX function on the cube from the example, you can identify when diesel sales in the Far East peaked.

Another specific feature of multidimensional cubes is the difficulty of determining the origin. For example, how do you define point 0 for the Product or Region dimension? The solution is to introduce a special attribute that combines all the elements of the dimension. This attribute (created automatically) contains only one element, All. For simple aggregation functions such as sum, the All element is equivalent to the sum of the values of all elements in the actual space of the given dimension.
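
For a sum-type measure, the All element can be sketched as one extra, computed entry on the axis. The region names and values below are made up for the example.

```python
# Sales by region for one product and period (made-up values).
sales_by_region = {"Far East": 100.0, "Ural": 40.0, "Siberia": 60.0}

# The All element is a logical point: the sum over every real member.
with_all = dict(sales_by_region)
with_all["All"] = sum(sales_by_region.values())
```

For other aggregation functions the All element is computed with that function instead, for example as the maximum over all members for MAX.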

An important concept in the multidimensional data model is the subspace, or subcube. A subcube is a part of the full space of a cube in the form of some multidimensional figure inside the cube. Since the multidimensional space of a cube is discrete and limited, the subcube is also discrete and limited.

Operations on OLAP cubes

The following operations can be performed on an OLAP cube:

  • slice;
  • rotation;
  • consolidation;
  • detailing.

Slice (Figure 2) is a special case of a subcube. It is the procedure of forming a subset of a multidimensional data array that corresponds to a single value of one or more dimension members not included in this subset. For example, to find out how sales of petroleum products progressed over time in one region only, namely the Urals, you fix the “Region” dimension on the “Ural” element and extract the corresponding subset (subcube) from the cube.

Fig. 2. OLAP cube slice
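
The slice for the Urals can be sketched as a filter over cube cells. The cell values and the function name `slice_by_region` are hypothetical.

```python
# Hypothetical cube cells: (month, region, product) -> sales.
cube = {
    ("2005-06", "Ural", "Diesel"): 110000,
    ("2005-07", "Ural", "Diesel"): 120500,
    ("2005-07", "Ural", "Gasoline"): 60000,
    ("2005-07", "Far East", "Diesel"): 98399,
}

def slice_by_region(cube, region):
    """Fix the region dimension; the result is keyed by (month, product) only."""
    return {(m, p): v for (m, r, p), v in cube.items() if r == region}

ural = slice_by_region(cube, "Ural")
print(len(ural))                   # 3
print(ural[("2005-07", "Diesel")])  # 120500
```

The result has one dimension fewer than the cube, because the fixed region no longer needs to appear in the key.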

    Rotation (Figure 3) is the operation of changing the arrangement of the dimensions presented in a report or on the displayed page. For example, a rotation may swap the rows and columns of a table. Rotating a data cube can also move dimensions that are outside the table into the place of dimensions shown on the displayed page, and vice versa.
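
The simplest rotation, swapping rows and columns, can be sketched on a nested dictionary. The table contents and the function name `rotate` are invented for illustration.

```python
# Hypothetical report: rows are products, columns are regions.
table = {
    "Diesel": {"Ural": 120500, "Far East": 98399},
    "Gasoline": {"Ural": 60000, "Far East": 45000},
}

def rotate(table):
    """Swap the row and column dimensions of a nested-dict table."""
    rotated = {}
    for row, cols in table.items():
        for col, v in cols.items():
            rotated.setdefault(col, {})[row] = v
    return rotated

by_region = rotate(table)
print(by_region["Ural"]["Diesel"])  # 120500
```

Rotating twice returns the original table: the data does not change, only the placement of the dimensions on the axes.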

    In general, every specialist knows what OLAP is today. At least, the concepts of “OLAP” and “multidimensional data” are firmly connected in our minds. Nevertheless, the fact that this topic is being raised again, I hope, will be approved by the majority of readers, because in order for the idea of ​​​​something not to become outdated over time, you need to periodically communicate with smart people or read articles in a good publication...

    Data warehouses (place of OLAP in the information structure of the enterprise)

    The term "OLAP" is inextricably linked with the term "data warehouse" (Data Warehouse).

    Here is the definition formulated by the “founding father” of data warehousing, Bill Inmon: “A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management decision-making.”

    The data in the warehouse comes from operational systems (OLTP systems), which are designed to automate business processes. In addition, the repository can be replenished from external sources, such as statistical reports.

    Why build data warehouses, given that they contain obviously redundant information that already “lives” in the databases or files of operational systems? The answer can be brief: it is impossible or very difficult to analyze data from operational systems directly. There are various reasons for this, including the fragmentation of data and its storage in different DBMS formats and in different “corners” of the corporate network. But even if an enterprise stores all its data on a central database server (which is extremely rare), an analyst will almost certainly not make sense of its complex, sometimes confusing structures. The author has the rather sad experience of trying to “feed” hungry analysts with “raw” data from operational systems: it turned out to be “too much for them”.

    Thus, the purpose of the repository is to provide the “raw materials” for analysis in one place and in a simple, understandable structure. Ralph Kimball, in the preface to his book "The Data Warehouse Toolkit", writes that if, after reading the entire book, the reader understands only one thing - namely, that the structure of the warehouse should be simple - the author will consider his task completed.

    There is another reason that justifies a separate repository: complex analytical queries against operational data slow down the company's day-to-day work, locking tables for a long time and consuming server resources.

    In my opinion, a repository does not necessarily mean a gigantic accumulation of data - the main thing is that it is convenient for analysis. Generally speaking, there is a separate term for small storage facilities - Data Marts (data kiosks), but in our Russian practice you don’t often hear it.

    OLAP - a convenient analysis tool

    Centralization and convenient structuring are not all that an analyst needs. He also needs a tool for viewing and visualizing information. Traditional reports, even those built on a single repository, lack one thing: flexibility. They cannot be “twisted”, “expanded” or “collapsed” to get the desired view of the data. Of course, you can call a programmer (if he wants to come), and he (if he is not busy) will produce a new report quickly enough, say, within an hour (I’m writing this and don’t believe it myself, since it never happens that fast in real life; let’s give him three hours). It turns out that an analyst can test no more than two ideas per day. And he (if he is a good analyst) can come up with several such ideas per hour. The more “slices” and “sections” of data the analyst sees, the more ideas he has, which in turn require more and more “slices” to verify. If only he had a tool that would let him expand and collapse data simply and conveniently! OLAP is such a tool.

    Although OLAP is not a necessary attribute of a data warehouse, it is increasingly being used to analyze the information accumulated in the warehouse.

    The components included in a typical repository are shown in Fig. 1.

    Fig. 1. Data warehouse structure

    Operational data is collected from various sources, cleansed, integrated and stored in a relational store. At this point it is already available for analysis using various reporting tools. Then the data (in whole or in part) is prepared for OLAP analysis. It can be loaded into a special OLAP database or left in the relational store. The most important element of the warehouse is metadata, i.e. information about the structure, placement and transformation of the data. Metadata is what allows the various components of the warehouse to interact effectively.

    To summarize, we can define OLAP as a set of tools for multidimensional analysis of data accumulated in a warehouse. Theoretically, OLAP tools can be applied directly to operational data or to exact copies of it (so as not to interfere with operational users). But then we risk repeating the mistake described above, that is, analyzing operational data that is not directly suitable for analysis.

    Definition and basic concepts of OLAP

    First, let's expand the abbreviation: OLAP is Online Analytical Processing, i.e. interactive (online) analysis of data. The 12 defining principles of OLAP were formulated in 1993 by E. F. Codd, the “inventor” of relational databases. Later, his definition was reworked into the so-called FASMI test, which requires that an OLAP application provide Fast Analysis of Shared Multidimensional Information.

    FASMI test

    Fast - analysis should be carried out equally quickly on all aspects of the information. An acceptable response time is 5 seconds or less.

    Analysis - it must be possible to carry out the basic types of numerical and statistical analysis, predefined by the application developer or freely defined by the user.

    Shared - many users must have access to the data, while access to confidential information must be controlled.

    Multidimensional - this is the main, most essential characteristic of OLAP.

    Information - the application must be able to access any necessary information, regardless of its volume and storage location.

    OLAP = Multidimensional View = Cube

    OLAP provides convenient, fast means of accessing, viewing and analyzing business information. The user gets a natural, intuitive data model organized as multidimensional cubes. The axes of the multidimensional coordinate system are the main attributes of the business process being analyzed. For sales, for example, these could be product, region and type of buyer. Time is used as one of the dimensions. At the intersections of the axes (the dimensions) lie the data that quantitatively characterize the process: the measures. These can be sales volumes in units or in monetary terms, stock balances, costs, etc. The user analyzing the information can “cut” the cube in different directions, obtain summary (for example, by year) or, conversely, detailed (by week) information, and perform whatever other manipulations come to mind during the analysis.

    The three-dimensional cube shown in Fig. 2 uses sales amounts as measures, and time, product and store as dimensions. The dimensions are presented at specific levels of grouping: products are grouped by category, stores by country, and transaction times by month. A little later we will look at the levels of grouping (hierarchies) in more detail.


    Fig. 2. Cube example

    "Cutting" a cube

    Even a three-dimensional cube is difficult to display on a computer screen so that the values ​​of the measures of interest are visible. What can we say about cubes with more than three dimensions? To visualize data stored in a cube, as a rule, familiar two-dimensional, i.e., tabular, views with complex hierarchical row and column headings are used.

    A two-dimensional representation of a cube can be obtained by “cutting” it across one or more axes (dimensions): we fix the values ​​of all dimensions except two, and we get a regular two-dimensional table. The horizontal axis of the table (column headers) represents one dimension, the vertical axis (row headers) represents another, and the table cells represent the values ​​of the measures. In this case, a set of measures is actually considered as one of the dimensions - we either select one measure to display (and then we can place two dimensions in the row and column headings), or show several measures (and then one of the table axes will be occupied by the names of the measures, and the other - values ​​of the only “uncut” dimension).
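
Both ways of flattening described above can be sketched on a tiny cube whose cells carry several measures. The store names, numbers and function names are hypothetical; the measure names follow the figures discussed below.

```python
# Hypothetical cube cells: (store, month) -> {measure name: value}.
cube = {
    ("Store 1", "Jan"): {"Unit Sales": 10, "Store Sales": 25.0},
    ("Store 2", "Jan"): {"Unit Sales": 7, "Store Sales": 18.0},
    ("Store 1", "Feb"): {"Unit Sales": 12, "Store Sales": 30.0},
}

def one_measure(cube, measure):
    """Rows are stores, columns are months, cells hold a single measure."""
    table = {}
    for (store, month), measures in cube.items():
        table.setdefault(store, {})[month] = measures[measure]
    return table

def many_measures(cube, month):
    """Fix the month; rows are stores, columns are the measure names."""
    return {store: dict(measures)
            for (store, m), measures in cube.items() if m == month}

print(one_measure(cube, "Unit Sales"))
print(many_measures(cube, "Jan"))
```

In the second function the measure names occupy the column axis, which is the sense in which the set of measures behaves like one more dimension.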

    Take a look at Fig. 3: it shows a two-dimensional slice of the cube for one measure, Unit Sales (units sold), and two “uncut” dimensions, Store and Time.


    Fig. 3. 2D cube slice for one measure

    Fig. 4 shows only one “uncut” dimension, Store, but displays the values of several measures: Unit Sales (units sold), Store Sales (sales amount) and Store Cost (store expenses).


    Fig. 4. 2D cube slice for multiple measures

    A two-dimensional representation of a cube is also possible when more than two dimensions remain “uncut”. In this case, two or more dimensions of the “cut” cube will be placed on the slice axes (rows and columns) - see Fig. 5.


    Fig. 5. 2D cube slice with multiple dimensions on one axis

    Labels

    The values laid out along the dimensions are called members or labels. Labels are used both to “cut” the cube and to limit (filter) the selected data, when in a dimension that remains “uncut” we are interested not in all the values but in a subset of them, for example three cities out of several dozen. Label values appear in the 2D cube view as row and column headings.

    Hierarchies and levels

    Labels can be combined into hierarchies consisting of one or more levels. For example, the labels of the Store dimension are naturally grouped into a hierarchy with levels:

    Country

    State

    City

    Store.

    Aggregate values ​​are calculated according to the hierarchy levels, for example sales volume for USA ("Country" level) or for California ("State" level). It is possible to implement more than one hierarchy in one dimension - say, for time: (Year, Quarter, Month, Day) and (Year, Week, Day).
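
A small sketch of aggregation along the Store hierarchy, assuming three invented stores. The names `stores`, `sales`, `LEVELS` and `roll_up` are hypothetical.

```python
# Hypothetical stores with their path in the hierarchy (Country, State, City).
stores = {
    "Store 1": ("USA", "CA", "Los Angeles"),
    "Store 2": ("USA", "CA", "San Francisco"),
    "Store 3": ("USA", "WA", "Seattle"),
}
sales = {"Store 1": 100, "Store 2": 150, "Store 3": 80}
LEVELS = ("Country", "State", "City")

def roll_up(level):
    """Total sales for every member of the given hierarchy level."""
    depth = LEVELS.index(level) + 1
    totals = {}
    for store, path in stores.items():
        key = path[:depth]  # e.g. ("USA", "CA") at the State level
        totals[key] = totals.get(key, 0) + sales[store]
    return totals

print(roll_up("State"))    # {('USA', 'CA'): 250, ('USA', 'WA'): 80}
print(roll_up("Country"))  # {('USA',): 330}
```

Each level is keyed by the full path from the top, so two cities with the same name in different states would not be merged.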

    Architecture of OLAP applications

    Everything that was said above about OLAP essentially related to the multidimensional presentation of data. How the data is stored, roughly speaking, does not concern either the end user or the developers of the tool the client uses.

    Multidimensionality in OLAP applications can be divided into three levels:

    • Multidimensional data representation - end-user tools that provide multidimensional visualization and manipulation of data; the multidimensional representation layer abstracts from the physical structure of the data and treats the data as multidimensional.
    • Multidimensional processing - a tool (language) for formulating multidimensional queries (the traditional relational SQL language turns out to be unsuitable here) and a processor capable of processing and executing such queries.
    • Multidimensional storage - a means of physically organizing data that ensures the efficient execution of multidimensional queries.

    The first two levels are mandatory in all OLAP tools. The third level, although widespread, is not necessary, since data for a multidimensional representation can also be extracted from ordinary relational structures; in this case the multidimensional query processor translates multidimensional queries into SQL queries that are executed by the relational DBMS.
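
The translation step can be sketched with Python's built-in sqlite3 module. The `sales` table, its columns and the function `to_sql` are assumptions for illustration, not the query processor of any real product; column names are interpolated directly into the SQL, so they must come from trusted code.

```python
import sqlite3

def to_sql(measure, rows, filters):
    """Build a GROUP BY query: one measure, grouped by `rows`, fixed by `filters`."""
    where = " AND ".join(f"{col} = ?" for col in filters) or "1 = 1"
    group = ", ".join(rows)
    sql = (f"SELECT {group}, SUM({measure}) FROM sales "
           f"WHERE {where} GROUP BY {group} ORDER BY {group}")
    return sql, list(filters.values())

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, month TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("Ural", "Diesel", "2005-07", 120500),
    ("Ural", "Gasoline", "2005-07", 60000),
    ("Ural", "Diesel", "2005-06", 110000),
    ("Far East", "Diesel", "2005-07", 98399),
])

# "Sales by product, sliced on region = Ural" expressed as a relational query.
sql, params = to_sql("amount", ["product"], {"region": "Ural"})
result = conn.execute(sql, params).fetchall()
print(result)  # [('Diesel', 230500), ('Gasoline', 60000)]
```

The dimensions left on the axes become GROUP BY columns and the fixed dimensions become WHERE conditions.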

    Specific OLAP products, as a rule, are either a multidimensional data representation tool, an OLAP client (for example, Pivot Tables in Excel 2000 from Microsoft or ProClarity from Knosys), or a multidimensional server DBMS, an OLAP server (for example, Oracle Express Server or Microsoft OLAP Services).

    The multidimensional processing layer is usually built into the OLAP client and/or OLAP server, but can be isolated in its pure form, such as Microsoft's Pivot Table Service component.

    Technical aspects of multidimensional data storage

    As mentioned above, OLAP analysis tools can also extract data directly from relational systems. This approach was more attractive in the days when OLAP servers were not on the price lists of the leading DBMS vendors. But today Oracle, Informix and Microsoft offer full-fledged OLAP servers, and even those IT managers who dislike creating a “zoo” of software from different vendors on their networks can buy (or rather, submit a request to company management for) an OLAP server of the same brand as the main database server.

    OLAP servers, or multidimensional database servers, can store their multidimensional data in different ways. Before considering these methods, we need to discuss an important aspect: the storage of aggregates. The fact is that any data warehouse, whether ordinary or multidimensional, stores not only the detailed data extracted from operational systems but also summary indicators (aggregated indicators, or aggregates), such as total sales by month, by product category, etc. Aggregates are stored explicitly for the sole purpose of speeding up queries. After all, on the one hand, a very large amount of data usually accumulates in the warehouse, and on the other hand, analysts are in most cases interested not in detailed but in generalized indicators. If millions of individual sales had to be added up every time to calculate total sales for the year, the speed would most likely be unacceptable. Therefore, when data is loaded into a multidimensional database, all or some of the summary indicators are calculated and stored.

    But, as you know, you have to pay for everything. The price of fast queries on summary data is an increase in data volume and in loading time. Moreover, the increase in volume can become literally catastrophic: in one of the published standardized tests, a full calculation of aggregates for 10 MB of original data required 2.4 GB, i.e. the data grew 240 times! The degree of data “swelling” when calculating aggregates depends on the number of dimensions of the cube and on the structure of these dimensions, i.e. the ratio of the number of “parents” to “children” at the different dimension levels. To solve the problem of storing aggregates, sometimes complex schemes are used that achieve a significant increase in query performance while calculating only some of the possible aggregates.
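
How the structure of the dimensions drives the number of aggregates can be sketched by counting addressable cells. The level sizes below are invented, and the count covers potential cells only; it does not model the sparsity of real data or reproduce the test figure quoted above.

```python
from math import prod

# Hypothetical member counts per level, from the most detailed level up to All.
dimensions = {
    "Time": [12, 4, 1],     # months, quarters, All
    "Product": [50, 5, 1],  # products, categories, All
    "Store": [20, 4, 1],    # stores, regions, All
}

# Cells addressed by detailed members only.
detail_cells = prod(levels[0] for levels in dimensions.values())
# Cells addressed when a member of any level may be used in each dimension.
total_cells = prod(sum(levels) for levels in dimensions.values())
aggregate_cells = total_cells - detail_cells

print(detail_cells, aggregate_cells)  # 12000 11800
```

Adding a dimension or a hierarchy level multiplies `total_cells`, which is why the count of possible aggregates grows so quickly with cube complexity.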

    Now about the various options for storing information. Both granular data and aggregates can be stored in either relational or multidimensional structures. Multidimensional storage allows you to treat data as a multidimensional array, which ensures equally fast calculations of total indicators and various multidimensional transformations along any of the dimensions. Some time ago, OLAP products supported either relational or multidimensional storage. Today, as a rule, the same product provides both of these types of storage, as well as a third type - mixed. The following terms apply:

    • MOLAP (Multidimensional OLAP) - both detailed data and aggregates are stored in a multidimensional database. In this case, the greatest redundancy is obtained, since the multidimensional data completely contains the relational data.
    • ROLAP (Relational OLAP) - detailed data remains where it originally “lived”, in the relational database; aggregates are stored in the same database in specially created service tables.
    • HOLAP (Hybrid OLAP) - detailed data remains in place (in the relational database), and aggregates are stored in a multidimensional database.

    Each of these methods has its own advantages and disadvantages and should be used depending on the conditions - the volume of data, the power of the relational DBMS, etc.

    When storing data in multidimensional structures, there is a potential problem of "bloat" due to storing empty values. After all, if in a multidimensional array space is reserved for all possible combinations of dimension labels, but only a small part is actually filled (for example, a number of products are sold only in a small number of regions), then most of the cube will be empty, although the space will be occupied. Modern OLAP products can cope with this problem.
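
One common way around the problem, shown here only as a sketch and not as the method of any particular product, is to store just the non-empty cells. The member lists and values are hypothetical.

```python
# Hypothetical dimension members.
months = [f"2005-{m:02d}" for m in range(1, 13)]
regions = ["Ural", "Far East", "Siberia", "Center"]
products = ["Diesel", "Gasoline", "Kerosene"]

# Sparse storage: only cells that actually hold a value are kept.
sparse = {
    ("2005-07", "Far East", "Diesel"): 98399,
    ("2005-07", "Ural", "Diesel"): 120500,
}

# A dense array would reserve a slot for every combination of members.
dense_slots = len(months) * len(regions) * len(products)
fill_ratio = len(sparse) / dense_slots

print(dense_slots, len(sparse))  # 144 2
```

With only 2 of 144 slots filled, a dense array would consist almost entirely of empty space, while the sparse structure grows with the number of facts.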

    To be continued. In the future, we will talk about specific OLAP products produced by leading manufacturers.

    04/07/2011 Derek Comingore

    If you've worked in any technology-related field, you've probably heard the term "cube"; however, most database administrators and developers have never worked with these objects. Cubes provide a powerful data architecture for quickly aggregating multidimensional information. If your organization needs to analyze large volumes of data, a cube is the ideal solution.

    What is a cube?

    Relational databases were designed to handle thousands of concurrent transactions while maintaining performance and data integrity. By design, relational databases are not efficient at aggregating and searching large volumes of data. To aggregate and return large volumes of data, a relational database must receive a set-based query, the information for which will be collected and aggregated on the fly. Such relational queries are very expensive because they rely on multiple joins and aggregate functions; Aggregate relational queries are especially ineffective when working with large amounts of data.

    Cubes are multidimensional entities designed to address this deficiency of relational databases. With a cube, you can give users a data structure that responds quickly to queries involving large amounts of aggregation. Cubes perform this “aggregation magic” by pre-aggregating data (measures) across multiple dimensions. The pre-aggregation usually takes place during cube processing. When you process a cube, you produce precomputed aggregations that are stored in binary form on disk.

    The cube is the central data structure of the SQL Server Analysis Services (SSAS) OLAP system. Cubes are typically built from an underlying relational database called a dimensional model, but they are separate technical entities. Logically, a cube is a data store made up of dimensions and measures. Dimensions contain descriptive attributes and hierarchies, while measures are the facts that you describe with the dimensions. Measures are grouped into logical combinations called measure groups. You link dimensions to measure groups based on a common characteristic, the granularity.

    In the file system, a cube is implemented as a set of linked binary files. The binary architecture of the cube facilitates the rapid retrieval of large volumes of multidimensional data.

    I mentioned that cubes are built from an underlying relational database called a dimensional model. The dimensional model contains relational tables (fact and dimension tables) that correspond to the cube's entities. Fact tables contain measures such as the quantity of a product sold. Dimension tables store descriptive attributes such as product names, dates, and employee names. Typically, fact tables and dimension tables are related through primary key-foreign key constraints, with the foreign keys located in the fact table (this relationship corresponds to the cube granularity attribute discussed above). When all dimension tables are linked directly to the fact table, a star schema is formed. When some dimension tables are not linked directly to the fact table, the result is a snowflake schema.
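
A minimal star schema can be sketched with Python's built-in sqlite3 module. The table and column names below are hypothetical and much simpler than the real AdventureWorksDW tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold descriptive attributes.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER);
-- The fact table holds the measure plus a foreign key to each dimension.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    quantity INTEGER
);
INSERT INTO dim_product VALUES (1, 'Bike'), (2, 'Helmet');
INSERT INTO dim_date VALUES (20040101, 2004), (20050101, 2005);
INSERT INTO fact_sales VALUES (1, 20040101, 3), (1, 20050101, 2), (2, 20050101, 5);
""")

# Aggregate the measure by attributes taken from the two dimension tables.
rows = conn.execute("""
    SELECT p.name, d.year, SUM(f.quantity)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY p.name, d.year
    ORDER BY p.name, d.year
""").fetchall()
print(rows)  # [('Bike', 2004, 3), ('Bike', 2005, 2), ('Helmet', 2005, 5)]
```

Both dimension tables join directly to the fact table, which is what makes this a star rather than a snowflake.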

    Please note that dimensional models are classified according to their scope. A data mart is a dimensional model designed for a single business process, such as sales or inventory management. A data warehouse is a dimensional model designed to cover multiple business processes so that it supports analysis across them.

    Software requirements

    Now that you have a basic understanding of what cubes are and why they're important, I'll switch gears and take you on a step-by-step tour of building your first cube using SSAS. You will need a few basic software components, so before you start building your first cube, make sure your system meets the requirements.

    My example Internet Sales cube will be built from the AdventureWorksDW 2005 test database. I will build the test cube from a subset of the tables found in the test database that will be useful for analyzing Internet sales data. Figure 1 shows the basic layout of the database tables. Since I'm using version 2005, you can follow my instructions using either SQL Server 2005 or SQL Server 2008.

    Figure 1. Subset of the Adventure Works Internet Sales data mart

    The AdventureWorksDW 2005 sample database can be found on the CodePlex website: msftdbprodsamples.codeplex.com. Find the link “SQL Server 2005 product sample databases are still available” (http://codeplex.com/MSFTDBProdSamples/Release/ProjectReleases.aspx?ReleaseId=4004). The sample database is contained in the file AdventureWorksBI.msi (http://msftdbprodsamples.codeplex.com/releases/view/4004#DownloadId=11755).

    As mentioned, you must have access to an instance of SQL Server 2008 or 2005, including SSAS and Business Intelligence Development Studio (BIDS) components. I'll be using SQL Server 2008, so you may see some subtle differences if you're using SQL Server 2005.

    Creating an SSAS Project

    The first thing you should do is create an SSAS project using BIDS. Find BIDS in the Start menu under Microsoft SQL Server 2008/2005, item SQL Server Business Intelligence Development Studio. Clicking this item launches BIDS with the default splash screen. Create a new SSAS project by selecting File, New, Project. You'll see the New Project dialog box, which Figure 1 shows. Select the Analysis Services Project template and set the project name to SQLMAG_MyFirstCube. Click OK.

    Once the project is created, right-click it in Solution Explorer and select Properties from the context menu. Now select the Deployment section on the left side of the SQLMAG_MyFirstCube: Property Pages dialog box and review the Target Server and Database settings, as Figure 2 shows. If you're working in a distributed SQL Server environment, you'll need to set the Target Server property to the name of the server to which you are going to deploy. Click OK when you are happy with the deployment settings for this SSAS project.

    Defining the data source

    The first object you need to create is the data source. A data source object provides the schema and data used to build the cube and the objects underlying it. To create a data source object in BIDS, use the Data Source Wizard.

    Start the Data Source Wizard by right-clicking the Data Sources folder in the Solution Explorer panel and selecting New Data Source. You will find that creating SSAS objects in BIDS follows a regular pattern: the wizard first walks you through creating the object and its general settings, and then you open the resulting SSAS object in the designer and customize it in detail if necessary. Once you get past the welcome screen, define a new data connection by clicking the New button. Create a new connection based on Native OLEDB\SQL Server Native Client 10 pointing to the SQL Server instance that hosts the desired database. You can use either Windows or SQL Server authentication, depending on your SQL Server environment settings. Click the Test Connection button to ensure that you have defined the database connection correctly, and then click OK.

    Next comes the Impersonation Information step, which, like the data connection, depends on how the SQL Server environment is set up. Impersonation is the security context that SSAS relies on when processing its objects. If you're running your deployment on a single server (or laptop), as I assume most readers are, you can simply select the Use the service account option. Click Next to complete the Data Source Wizard and set AWDW2005 as the Data Source Name. This option is convenient for testing purposes, but using the service account is not best practice in a real production environment. There it is better to specify domain accounts for SSAS to impersonate when connecting to the data source.

    Data Source View

    For the data source you have defined, the next step in the SSAS cube building process is to create a Data Source View (DSV). DSV provides the ability to separate the schema that your cube expects from that of the underlying database. As a result, DSV can be used to extend the underlying relational schema when building a cube. Some of the key features of DSV for extending data source schemas include named queries, logical relationships between tables, and named calculated columns.

    Right-click the Data Source Views folder and select New Data Source View to launch the Data Source View Wizard. At the Select a Data Source step, select the relational database connection and click Next. Select the FactInternetSales, DimProduct, DimTime and DimCustomer tables and click the single right arrow button to move these tables to the Included column. Finally, click Next and complete the wizard by accepting the default name and clicking Finish.

    At this point, you should have a DSV view located under the Data Source Views folder in Solution Explorer. Double click on the new DSV to launch the DSV designer. You should see all four tables for a given DSV, as shown in Figure 2.

    Creating Database Dimensions

    As I explained above, dimensions provide the descriptive attributes of measures and the hierarchies that enable aggregation above the level of detail. It is important to understand the difference between a database dimension and a cube dimension: database dimensions provide the base dimension objects from which the cube's dimensions, possibly several per database dimension, are built.

    Database and cube dimensions provide an elegant solution to a concept known as "role-playing dimensions." Role-playing dimensions are used when you need to use a single dimension in a cube multiple times. Date is a perfect example in this cube: you will build a single date dimension and reference it once for each date by which you want to analyze online sales. The calendar date will be the first dimension you create. Right-click the Dimensions folder in Solution Explorer and select New Dimension to launch the Dimension Wizard. Select Use an existing table and click Next in the Select Creation Method step. At the Specify Source Information step, specify the DimTime table in the Main table drop-down list and click Next. Now, at the Select Dimension Attributes step, you need to select the attributes of the time dimension. Select each attribute, as Figure 3 shows.

    Click Next. As a final step, enter Dim Date in the Name field and click Finish to complete the Dimension Wizard. You should now see the new Dim Date dimension located under the Dimensions folder in Solution Explorer.
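
The idea of one date dimension playing several roles can be sketched relationally with Python's built-in sqlite3 module. The tables and columns here are hypothetical stand-ins, not the actual AdventureWorksDW schema, and the sketch shows the concept rather than how SSAS implements it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One physical date dimension.
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER);
-- The fact table references it twice, in two different roles.
CREATE TABLE fact_sales (order_date_key INTEGER, ship_date_key INTEGER, amount INTEGER);
INSERT INTO dim_date VALUES (1, 2004), (2, 2005);
INSERT INTO fact_sales VALUES (1, 1, 100), (1, 2, 200), (2, 2, 50);
""")

def sales_by_year(role_column):
    """Aggregate sales by year, joining the date dimension through one role."""
    assert role_column in ("order_date_key", "ship_date_key")
    return conn.execute(f"""
        SELECT d.year, SUM(f.amount)
        FROM fact_sales f JOIN dim_date d ON d.date_key = f.{role_column}
        GROUP BY d.year ORDER BY d.year
    """).fetchall()

print(sales_by_year("order_date_key"))  # [(2004, 300), (2005, 50)]
print(sales_by_year("ship_date_key"))   # [(2004, 100), (2005, 250)]
```

The same dimension table yields different totals depending on which foreign key it is joined through, which is why a single database dimension can back several cube dimensions.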

    Then use the Dimension Wizard to create product and customer dimensions. Follow the same steps to create the base dimension as before. When working with the Dimension Wizard, make sure that you select all potential attributes in the Select Dimension Attributes step. The default values ​​for the other settings are fine for a test cube instance.

    Creating an Internet Sales Cube

    Now that you have prepared the database dimensions, you can begin building the cube. In Solution Explorer, right-click the Cubes folder and select New Cube to launch the Cube Wizard. In the Select Creation Method window, select the Use existing tables option. Select the FactInternetSales table for Measure Group in the Select Measure Group Tables step. Uncheck the boxes next to the Promotion Key, Currency Key, Sales Territory Key, and Revision Number measures in the Select Measures step and click Next.

    On the Select Existing Dimensions screen, ensure that all existing database dimensions are selected to be used as cube dimensions. Because I would like to keep this cube as simple as possible, deselect the FactInternetSales dimension in the Select New Dimensions step. By leaving the FactInternetSales dimension selected, you would create what is called a fact dimension or degenerate dimension. Fact dimensions are dimensions that were created using a basic fact table as opposed to a traditional dimension table.

    Click Next to go to the Completing the Wizard step and enter "My First Cube" in the Cube Name field. Click the Finish button to complete the Cube Wizard.

    Deploying and Processing a Cube

    Now you're ready to deploy and process your first cube. Right-click the new cube icon in Solution Explorer and select Process. You will see a message box stating that the content appears to be out of date. Click Yes to deploy the new cube to the target SSAS server. When you deploy a cube, you send an XML for Analysis (XMLA) file to the target SSAS server, which creates the cube on the server itself. As mentioned, processing a cube populates its binary files on disk with data from the underlying source, as well as the additional metadata you've added (cube dimensions, measures, and settings).

    Once the deployment process is complete, a new Process Cube dialog box appears. Click the Run button to begin processing the cube, which opens the Process Progress window. When processing is complete, click Close (twice, to close both dialog boxes) to finish deploying and processing the cube.

    You have now built, deployed and processed your first cube. You can view this new cube by right-clicking it in the Solution Explorer window and selecting Browse. Drag measures to the center of the pivot table and dimension attributes onto rows and columns to explore your new cube. Notice how quickly the cube answers various aggregation queries. Now you can appreciate the power, and therefore the business value, of the OLAP cube.

    Derek Comingore is a senior architect at B.I. Voyage, a Microsoft Partner in the field of business analytics. He holds the SQL Server MVP title and several Microsoft certifications.




