Transcription

PostgreSQLas a Schemaless Database.Christophe PettusPostgreSQL Experts, Inc.OSCON 2013

Welcome! I’m Christophe. PostgreSQL person since 1997. Consultant with PostgreSQL Experts, Inc. [email protected] thebuild.com @xof on Twitter.

What’s on the menu? What is a schemaless database? How can you use PostgreSQL to storeschemaless data? How does do the various schemalessoptions perform?

A note on NoSQL. Worst. Term. Ever. It’s true that all modern schemalessdatabases do not use SQL, but Neither did Postgres before it becamePostgreSQL. (Remember QUEL?) The defining characteristic is the lack of afixed schema.

Schematic. A schema is a fixed (although mutableover time) definition of the data. Database to schema (unfortunate term) totable to field/column/attribute. Individual fields can be optional (NULL). Adding new columns requires a schemachange.

Rock-n-Roll! Schemaless databases store “documents”rather than rows. They have internal structure, but that structure is per document. No fields! No schemas! Make up whateveryou like!

We are not amused. Culturally, very different from the glasshouse data warehouse model. Grew out of the need for persistent objectstorage and impatience with the (perceived)limitations of relational databases andobject-relational managers.

Let us never speak of thisagain. There’s a lot to talk about in schemaless vstraditional relational databases. But let’s not. Today’s topic: If you want to storeschemaless data in PostgreSQL, how canyou? And what can you expect?

What is schemaless data? Schemaless does not mean unstructured. Each “document” ( record/row) is ahierarchical structure of arrays and keyvalue pairs. The application knows what to expect inone of these and how to react if it doesn’t get it.

PostgreSQL has youcovered. Not one, not two, but three differentdocument types: XML hstore JSON Let’s see what they’ve got.

XMLIt seemed like a good idea at the time.

XML Been around since the mid-1990s. Hierarchical structured data based onSGML. Underlying technology for SOAP and a lotof other stuff that was really popular for awhile. Still super-popular in the Java world.

XML, your dad’s documentlanguage. Can specify XML schemas using DTDs. No one does this. Can do automatic transformations of XMLinto other markups using XSLT. Only the masochistic do this. Let’s not forget the most important use ofXML!

Tomcat Configuration Files. Server port "8005" shutdown "SHUTDOWN" debug "0" Service name "Tomcat-Standalone" Connector className port "8080" minProcessors "5" maxProcessors "75"enableLookups "true" redirectPort "8443"acceptCount "10" debug "0" connectionTimeout "60000"/ Engine name "Standalone" defaultHost "localhost" debug "0" Logger className "org.apache.catalina.logger.FileLogger"prefix "catalina log." suffix ".txt"timestamp "true"/ Realm className "org.apache.catalina.realm.MemoryRealm" / Host name "localhost" debug "0" appBase "webapps" unpackWARs "true" Valve className ry "logs" prefix "localhost access log." suffix ".txt"pattern "common"/ Logger className "org.apache.catalina.logger.FileLogger"directory "logs" prefix "localhost log." suffix ".txt"timestamp "true"/ Context path "/examples" docBase "examples" debug "0"reloadable "true" Logger className "org.apache.catalina.logger.FileLogger"prefix "localhost examples log." suffix ".txt"timestamp "true"/ /Context /Host /Engine /Service /Server

XML Support inPostgreSQL. Built-in type. Can handle documents up to 2 gigabytes. A healthy selection of XML operators. xpath in particular. Very convenient XML export functions. Great for external XML requirements.

XML Indexing. There isn’t any. Unless you build it yourself with anexpression index. Functionality is great. Performance is we’ll talk about this later.

hstoreThe hidden gem of contrib/

hstore A key/value storage type specific toPostgreSQL. Maps string keys to string values. Contrib module; not part of thePostgreSQL core.

hstore functions Lots and lots and lots of hstore functions. h- ”a” (get value for key a). h?”a” (does h contain key a?). [email protected] ”a- 2” (does key a contain 2?). Many others.

hstore indexing. Can create GiST and GIN indexes overhstore values. Indexes the whole hierarchy, not just onekey. Accelerates @ , ?, ?& and ? operators. Can also build expression indexes.

JSONAll the cool kids are doing it.

JSON JavaScript Object Notation. JavaScript’s data structure declarationformat, turned into a protocol. Dictionaries, arrays, primitive types. Originally designed to just be passed intoeval() in JavaScript. Please don’t do this.

JSON, the new hotness The de facto standard API data format forREST web services. Very comfortable for Python and Rubyprogrammers. MongoDB’s native data storage type.

JSON? Yeah, we got that. JSON type in core as of 9.2. Validates JSON going in. And not much else right now. array to json, row to json. Lots more coming in 9.3 (offer subject tocommitter approval).

JSON Indexing. Expression indexing. Can also treat as a text string for strictcomparison which is kind of a weird idea and I’mnot sure why you’d do that. But the coolest part of JSON in core is!

PL/V8! The V8 JavaScript engine from Google isavailable as an embedded language. JavaScript deals with JSON very well, asyou’d expect. Not part of core or contrib; needs to bebuilt and installed separately.

PL/V8 ProTips Use the static V8 engine that comes withPL/V8. Function is compiled by V8 on first use. Now that we got rid of SQL injectionattacks, we now have JSON injectionattacks. PL invocation overhead is non-trivial.

Schemaless Strategies Create single-field tables with only ahierarchical type. Wrap up the (very simple) SQL to providean object API. Create indexes to taste Maybe extract fields if you need to JOIN. Profit!

CREATE OR REPLACE FUNCTIONget json key(structure JSON, key TEXT) RETURNS TEXTAS get json key var js object structure;if (typeof ej ! 'object')return NULL;return JSON.stringify(js object[key]); get json key IMMUTABLE STRICT LANGUAGE plv8;

CREATE TABLE blog {post json}CREATE INDEX post pk idx ONblog((get json key(post, ‘post id’)::BIGINT));CREATE INDEX post date idx ONblog((get json key(post, ‘post date’)::TIMESTAMPTZ));

But but but PostgreSQL was not designed to be aschemaless database. Wouldn’t it be better to use a bespokedatabase designed for this kind of data? Well, let’s find out!

Some Numbers.When all else fails, measure.

Schemaless Shootout! A very basic document structure: id, name, company, address1, address2,city, state, postal code. address2 and company are optional(NULL in relational version). id 64-bit integer, all others text. 1,780,000 records, average 63 bytes each.

The Competitors! Traditional relational schema. hstore (GiST and GIN indexes). XML JSON One column per table for these. MongoDB

Timing Harness. Scripts written in Python. psycopg2 2.4.6 for PostgreSQL interface. pymongo 2.4.2 for MongoDB interface.

The Test Track. This laptop. OS X 10.7.5. 2.8GHz Intel Core i7. 7200 RPM disk. 8GB (never comes close to using a fractionof it).

Indexing Philosophy For relational, index on primary key. For hstore, index using GiST and GIN (andnone). For JSON and XML, expression index onprimary key. For MongoDB, index on primary key. Indexes created before records loaded.

Your Methodology Sucks. Documents are not particularly large. No deep hierarchies. Hot cache. Only one index. No joins. No updates.

The Sophisticated DatabaseTuning Philosophy. None. Stock PostgreSQL 9.2.2, from source. No changes to postgresql.conf Stock MongoDB 2.2, from MacPorts. Fire it up, let it go.

First Test: Bulk Load Scripts read a CSV file, parse it into theappropriate format, INSERT it into thedatabase. We measure total load time, includingparsing time. (COPY will be much much much faster.) mongoimport too, most likely.

ore (GiST) hstore (GIN)XMLJSONMongoDB

Observations. No attempt made to speed up PostgreSQL. Synchronous commit, checkpoint tuning,etc. GIN indexes are really slow to build. The XML xpath function is probably theculprit for its load time.

Next Test: Disk Footprint. Final disk footprint once data is loaded. For PostgreSQL, reported database sizesfrom the pg * size functions. For MongoDB, reported by db.stats().

Disk Footprint in re (GiST) hstore (GIN)XMLJSONMongoDB

Observations. GIN indexes are really big on disk. PostgreSQL’s relational data storage is veryefficient. None of these records are TOAST-able. MongoDB certain likes its disk space. padding factor was 1, so it wasn’t that.

Next Test: Query onPrimary Key For a sample of 100 documents, query asingle document based on the primary key. Results not fetched. For PostgreSQL, time of .execute()method from Python. For MongoDB, time of .fetch()method.

Fetch Time in Milliseconds4003002001000Relationalhstorehstore (GiST) hstore (GIN)XMLJSONMongoDB

Fetch Time in Milliseconds ( 100ms)139.756.53.250RelationalXMLJSONMongoDB

Fetch Time in Milliseconds ( 100ms)4003002001000hstorehstore (GiST)hstore (GIN)

Observations. B-tree indexes kick ass. GiST and GIN not even in same leaguefor simple key retrieval. Difference between relational, XML andJSON is not statistically significant. Wait, I thought MongoDB was supposed tobe super-performant. Huh.

Next Test: Query on Name For a sample of 100 names, query alldocuments with that name. Results not fetched. Required a full-table scan (except forhstore with GiST and GIN indexes). Same timing methodology.

Fetch Time in store (GiST) hstore (GIN)XMLJSONMongoDB

Fetch Time in Milliseconds ( 500ms)5003752501250Relationalhstorehstore (GiST)hstore (GIN)MongoDB

Fetch Time in Milliseconds ( 500ms)500003750025000125000XMLJSON

Observations. GiST and GIN accelerate every field, notjust the “primary” key. Wow, executing the accessor function oneach XML and JSON field is slow. MongoDB’s grotesquely bloated diskfootprint hurts it here. Not that there’s anything wrong with that.

Now that we knowthis, what do weknow?

Some Conclusions. PostgreSQL does pretty well as aschemaless database. Build indexes using expressions oncommonly-queried fields or use GiST and hstore if you wantfull flexibility. GIN might well be worth it for other cases.

Some Conclusions, 2. Avoid doing full-table scans if you need touse an accessor function. Although hstore’s are not bad comparedto xpath or a PL. Seriously consider hstore if you have theflexibility. It’s really fast.

Flame Bait! MongoDB doesn’t seem to be moreperformant than PostgreSQL. And you still get all of PostgreSQL’sgoodies. Larger documents will probably continue tofavor PostgreSQL. As will larger tables.

Fire Extinguisher. You can find workloads that “prove” anydata storage technology is the right answer. dBase II included. Be very realistic about your workload anddata model, now and in the future. Test, and test fairly with real-world data inreal-world volumes.

The Future is Now! Many more JSON functions in PostgreSQL9.3, now in beta. Better, faster support for indexing. No GIST or GIN indexes yet, sadly.

mongres! Direct Mongo wire-protocol support inPostgreSQL. Uses PostgresSQL’s custom backgroundprocess feature. Prototype, but check it out.

Thank [email protected]

house data warehouse model. . Final disk footprint once data is loaded. For PostgreSQL