Data warehousing withPostgreSQLGabriele Bartolini gabriele.bartolini at PostgreSQL Day 20096 November, ParisTech Telecom, Paris,

Audience Case of one PostgreSQL node data warehouse– This talk does not directly address multi-node distribution ofdata Limitations on disk usage and concurrent access– No rule of thumb– Depends on a careful analysis of data flows and requirements Small/medium size

Summary Data warehousing introductory conceptsPostgreSQL strengths for data warehousingData loading on PostgreSQLAnalysis and reporting of a PostgreSQL DWExtending PostgreSQL for data warehousingPostgreSQL current

Part one: Data warehousing basics Business intelligenceData warehouseDimensional modelStar schemaGeneral

Business intelligence & Data warehouse Business intelligence: “skills, technologies,applications and practices used to help a businessacquire a better understanding of its commercialcontext” Data warehouse: “A data warehouse houses astandardized, consistent, clean and integrated formof data sourced from various operational systems inuse in the organization, structured in a way tospecifically address the reporting and analyticrequirements”– Data warehousing is a broader

A simple

PostgreSQL RDBMS for DW? The typical storage system for a data warehouse is aRelational DBMS Key aspects:– Standards compliance (e.g. SQL)– Integration with external tools for loading and analysis PostgreSQL 8.4 is an ideal

Example of dimensional model Subject: commerce Process: sales Dimensions: customer, product– Analyse sales by customer and product over


General concepts Keep the model simple (star schema is fine) Denormalise tables Keep track of changes that occur over time ondimension attributes Use calendar tables (static, read-only)

Example of calendar table-- Days (calendar date)CREATE TABLE calendar (-- days since January 1, 4712 BCid day INTEGER NOT NULL PRIMARY KEY,sql date DATE NOT NULL UNIQUE,month day INTEGER NOT NULL,month INTEGER NOT NULL,year INTEGER NOT NULL,week day str CHAR(3) NOT NULL,month str CHAR(3) NOT NULL,year day INTEGER NOT NULL,year week INTEGER NOT NULL,week day INTEGER NOT NULL,year quarter INTEGER NOT NULL,work day INTEGER NOT NULL DEFAULT '1'.);SOURCE:

Part two: PostgreSQL and DW General featuresStored proceduresTablespacesTable partitioningSchemas / namespacesViewsWindowing functions and WITH

General features Connectivity:– PostgreSQL perfectly integrates with external tools orapplications for data mining, OLAP and reporting Extensibility:– User defined data types and domains– User defined functions Stored

Stored Procedures Key aspects in terms of data warehousing Make the data warehouse:– flexible– intelligent Allow to analyse, transform, model and deliver datawithin the database

Tablespaces Internal label for a physical directory in the filesystem Can be created or removed at anytime Allow to store objects such as tables and indexes ondifferent locations Good for scalability Good for

Horizontal table partitioning1/2 A physical design concept Basic support in PostgreSQL through

Views and schemas Views:– Can be seen as “placeholders” for queries– PostgreSQL supports read-only views– Handy for summary navigation of fact tables Schemas:– Similar to the “namespace” concept in OOA– Allows to organise database objects in logical

Window functions and WITH queries Both added in PostgreSQL 8.4 Window functions:– perform aggregate/rank calculations over partitions of theresult set– more powerful than traditional “GROUP BY” WITH queries:– label a subquery block, execute it once– allow to reference it in a query– can be

Part three: Optimisation techniques Surrogate keysLimited constraintsSummary navigationHorizontal table partitioningVertical table partitioning“Bridge tables” /

Use surrogate keys Record identifier within the database Usually a sequence:– serial (INT sequence, 4 bytes)– bigserial (BIGINT sequence, 8 bytes) Compact primary and foreign keys Allow to keep track of changes on

Limit the usage of constraints Data is already consistent No need for:– referential integrity (foreign keys)– check constraints– not-null

Implement summary navigation Analysing data through hierarchies in dimensions isvery time-consuming Sometimes caching these summaries is necessary:– real-time applications (e.g. web analytics)– can be achieved by simulating materialised views– requires careful management on latest appended data Skytools' PgQ can be used to manage it Can be totally delegated to OLAP

Horizontal (table) partitioning Partition tables based on record characteristics ( range, customer ID, etc.) Allows to split fact tables (or dimensions) in smallerchunks Great results when combined with

Vertical (table) partitioning Partition tables based on columns Split a table with many columns in more tables Useful when there are fields that are accessed morefrequently than others Generates:– Redundancy– Management headaches (careful planning)

Bridge hierarchy tables Defined by Kimball and RossVariable depth hierarchies (flattened trees)Avoid recursive queries in parent/child relationshipsGenerates:– Redundancy– Management headaches (careful planning)

Example of bridge hierarchy tableid bridge category integer not nullcategory key integer not nullcategory parent key integer not nulldistance level integer not nullbottom flag integer not null default 0top flag integer not null default 0id bridge category category key category parent key distance level bottom flag top flag-------------------- -------------- --------------------- ---------------- ------------- ---------1 1 1 0 0 12 586 1 1 1 03 587 1 1 1 04 588 1 1 1 05 589 1 1 1 06 590 1 1 1 07 591 1 1 1 08 2 2 0 0 19 3 2 1 1 0SOURCE:

Part four: Data loading ExtractionTransformationLoadingETL or ELT?Connecting to external sourcesExternal loadersExploration data

Extraction Data may be originally stored:– in different locations– on different systems– in different formats (e.g. database tables, flat files) Data is extracted from source systems Data may be

Transformation Data previously extracted is transformed–––––Selected, filtered, sortedTranslatedIntegratedAnalysed. Goal: prepare the data for the

Loading Data is loaded in the warehouse database Which frequency? Facts are usually appended– Issue: aggregate facts need to be


Connecting to external sources PostgreSQL allows to connect to external sources,through some of its extensions:– dblink– PL/Proxy– DBI-Link (any database type supported by Perl's DBI) External sources can be seen as database tables Practical for ETL/ELT operations:– INSERT . SELECT

External tools External tools for ETL/ELT can be used withPostgreSQL Many applications exist– Commercial– Open-source Kettle (part of Pentaho Data Integration) Generally use ODBC or JDBC (with Java)

Exploration data marts Business requirements change, continuously The data warehouse must offer ways:– to explore the historical data– to create/destroy/modify data marts in a staging area connected to the production warehouse totally independent, safe– this environment is commonly known as

Part five: Beyond PostgreSQL Data analysis and reporting Scaling a PostgreSQL warehouse with PL/

Data Analysis and reporting Ad-hoc applications External BI applications– Integrate your PostgreSQL warehouse with third-partyapplications for: OLAP Data mining Reporting– Open-source examples: Pentaho Data

Scaling with PL/Proxy PL/Proxy can be directly used for querying data from asingle remote database PL/Proxy can be used to speed up queries from a localdatabase in case of multi-core server and partitionedtable PL/Proxy can also be used:– to distribute work on several servers, each with their ownpart of data (known as shards)– to develop map/reduce type analysis over sets of

Part six: PostgreSQL's weaknesses Native support for data distribution and parallelprocessing On-disk bitmap indexes Transparent support for data partitioning Transparent support for materialised views Better support for “temporal”

Data distribution & parallel processing Shared nothing architectureAllow for (massive) parallel processingData is partitioned over servers, in shardsPostgreSQL also lacks a DISTRIBUTED BY clausePL/Proxy could potentially solve this

On-disk bitmap indexes Ideal for data warehouses Use bitmaps (vectors of bits) Would perfectly integrate with PostgreSQL in-memorybitmaps for bitwise logical

Transparent table partitioning Native transparent support for table partitioning isneeded– PARTITION BY clause is needed– Partition daily

Materialised views Currently can be simulated through stored proceduresand views A transparent native mechanism for the creation andmanagement of materialised views would be helpful– Automatic Summary Tables generation and managementwould be cool too!

Temporal extensions Some of TSQL2 features could be useful:– Period data type– Comparison functions on two periods, such as

Conclusions PostgreSQL is a suitable RDBMS technology for a singlenode data bilityLimitations apply For open-source multi-node data warehouse, useSkyTools (pgQ, Londiste and PL/Proxy) If Massive Parallel Processing is required:– Custom solutions can be developed using PL/Proxy– Easy to move up to commercial products based on PostgreSQLlike Greenplum, if data volumes and business requirementsneed

Recap Data warehousing introductory conceptsPostgreSQL strengths for data warehousingData loading on PostgreSQLAnalysis and reporting of a PostgreSQL DWExtending PostgreSQL for data warehousingPostgreSQL current

Thanks to 2ndQuadrant team:– Simon Riggs– Hannu Krosing– Gianni


License Creative Commons:– Attribution-Non-Commercial-Share Alike 2.5 Italy– You are free: to copy, distribute, display, and perform the work to make derivative works– Under the following conditions: Attribution. You must give the original author credit. Non-Commercial. You may not use this work for commercialpurposes. Share Alike. If you alter, transform, or build upon this work, youmay distribute the resulting work only under a licence identicalto this one. t/

PostgreSQL is a suitable RDBMS technology for a single node data warehouse: – FLEXIBILITY – Performances – Reliability – Limitations apply For open-source multi-node data warehouse, use SkyTools (pgQ, Londiste and PL/Proxy) If Massive Parallel Processing is required: – Custom solutions can be developed using PL/ProxyFile Size: 1MB