Melding Monads

2016 October 26

Announcing Configurator-ng 0.0

Filed under: Uncategorized — lpsmith @ 10:29 pm

I’m pleased to announce a preliminary release of configurator-ng, after spending time writing documentation for it at Hac Phi this past weekend. This release is for the slightly adventurous, but I think that many, especially those who currently use configurator, will find this worthwhile.

This is a massively breaking fork of Bryan O’Sullivan’s configurator package. The configuration file syntax is almost entirely backwards compatible, and mostly forwards compatible as well. However, the application interface used to read configuration files is drastically different. The focus so far has been on a more expressive interface, an interface that is safer in the face of concurrency, and improved error messages. The README offers an overview of the goals and motivations behind this fork, and how it is attempting to satisfy those goals.

I consider this an alpha release, but I am using it in some of my projects and it should be of reasonable quality. The new interfaces I’ve created in the Data.Configurator.Parser and Data.Configurator.FromValue modules should be pretty stable, but I am planning on major breaking changes to the file (re)loading and change notification interfaces inherited from configurator.

Documentation is currently a little sparse, but the README and the preliminary haddocks I hope will be enough to get started. Please don’t hesitate to contact me, via email, IRC (lpsmith on freenode), or GitHub if you have any questions, comments, ideas, needs or problems.

2015 November 21

Announcing linux-inotify 0.3

Filed under: Uncategorized — lpsmith @ 12:34 am

linux-inotify is a thin binding to the Linux kernel’s file system notification functionality for the Glasgow Haskell Compiler. The latest version does not break the interface in any way, but the implementation eliminates use-after-close file descriptor faults, which are now manifested as IOExceptions instead. It also adds thread safety, and is much safer in the presence of asynchronous exceptions.

In short, the latest version is far more industrial grade. It is the first version I would not hesitate to recommend to others, and the first version I have talked much about publicly. The only downside is that it does require GHC 7.8 or later, due to the use of threadWaitReadSTM to implement blocking reads.
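For a sense of the interface, here is a minimal usage sketch (watching /tmp for modifications; error handling and close are omitted, and the exact mask names are best double-checked against the haddocks):

import qualified System.Linux.Inotify as IN
import Control.Monad (forever)

-- Minimal sketch: watch a directory and print events as they arrive.
main :: IO ()
main = do
    ind <- IN.init
    _   <- IN.addWatch ind "/tmp" IN.in_MODIFY
    forever $ do
        evt <- IN.getEvent ind   -- blocks until an event is available
        print evt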

File Descriptor Allocation

These changes were motivated by lpsmith/postgresql-simple#117, one of the more interesting bugs I’ve been peripherally involved with. The upshot is that it was very probably a use-after-close descriptor fault: a descriptor associated with a websocket would be closed, that descriptor would be reused for a new description, this time a database connection, and then a still-active websocket thread would ping the database connection, causing the database server to hang up due to a protocol error. Thus, a bug in snap and/or websockets-snap manifested itself in the logs of a database server.

This issue did end up exposing a significant deficiency in my understanding of the allocation of file descriptors: for historical reasons, the POSIX standard specifies that a new description must be indexed by the smallest unused descriptor.

This is an unfortunate design from several standpoints. It necessitates more serialization than would otherwise be required, and a scheme that seeks to delay reusing descriptors as long as practical would greatly mitigate use-after-close faults, as they would almost certainly manifest as an EBADF errno instead of a potentially catastrophic interaction with an unintended description.

In any case, one of the consequences of the POSIX standard is that a use-after-close descriptor fault is a pretty big deal, especially in a server with a high descriptor churn rate. In that situation, the chances of an inappropriate interaction with the wrong descriptor are actually pretty high.

Thus, in the face of new evidence, I revised my opinion regarding what the linux-inotify binding should provide; in particular protection against use-after-close. And in the process I made the binding safer in the presence of concurrency and asynchronous exceptions.

The Implementation

Due to the delivery of multiple inotify messages in a single system call, linux-inotify provides a buffer behind the scenes. The new implementation has two MVar locks: one associated with the buffer, and another associated with the inotify descriptor itself.

Four functions take only the descriptor lock: addWatch, rmWatch, close, and isClosed. Four functions take the buffer lock, and may or may not take the descriptor lock: getEvent, peekEvent, getEventNonBlocking, and peekEventNonBlocking. Two functions only ever take the buffer lock: getEventFromBuffer and peekEventFromBuffer.

The buffer is implemented in an imperative fashion, and the lock protects all the buffer manipulations. The existence of the descriptor lock is slightly frustrating, as the only reason it’s there is to correctly support the use-after-close protection; the kernel would otherwise be able to deal with concurrent calls itself.

In this particular design, it’s important that neither of these locks are held for any significant length of time; in particular we don’t want to block while holding either one. This is so that the nonblocking functions are actually nonblocking: otherwise it would be difficult for getEventNonBlocking to not block while another call to getEvent is blocked on the descriptor.

So looking at the source of getEvent, the first thing it does is take the buffer lock, then check whether the buffer is empty. If the buffer is not empty, it returns the first message without ever taking the descriptor lock. If the buffer is empty, it also takes the descriptor lock (via fillBuffer) and attempts to read from the descriptor. If the descriptor has been closed, it throws an exception. If the read would block, it drops both locks and waits for the descriptor to become readable before trying again. And of course, if the read succeeds, we grab the first message, drop the locks, and return the message.

getEvent :: Inotify -> IO Event
getEvent inotify@Inotify{..} = loop
  where
    funcName = "System.Linux.Inotify.getEvent"
    loop = join $ withLock bufferLock $ do
               start <- readIORef startRef
               end   <- readIORef endRef
               if start < end
               then (do
                        evt <- getMessage inotify start True
                        return (return evt)                  )
               else fillBuffer funcName inotify
                    -- file descriptor closed:
                       (throwIO $! fdClosed funcName)
                    -- reading the descriptor would block:
                       (\fd -> do
                            (waitRead,_) <- threadWaitReadSTM fd
                            return (atomically waitRead >> loop) )
                    -- read succeeded:
                       (do evt <- getMessage inotify 0 True
                           return (return evt)                  )

So there are two points of subtlety worth pointing out in this implementation. First is the idiom used in loop: the main body returns an IO (IO Event), which is turned into IO Event via join. The outer IO is done inside a lock, which computes an (IO Event) to perform after the lock(s) are dropped.
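This idiom can be distilled into a self-contained sketch; withLock below is a stand-in for the library’s internal helper, implemented here with withMVar:

import Control.Concurrent.MVar
import Control.Monad (join)

-- Stand-in for the library's internal withLock helper.
withLock :: MVar () -> IO a -> IO a
withLock lock body = withMVar lock (\_ -> body)

-- The body runs while holding the lock and computes a continuation;
-- join then runs that continuation after withLock has dropped the lock.
example :: MVar () -> IO String
example lock = join $ withLock lock $ do
    -- inspect or update the shared state here, under the lock
    return $ do
        -- this part runs after the lock has been released
        return "done"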

The second point of subtlety is the case where the read would block. You may want to think for a minute why I didn’t write this instead:

\fd -> do
   return (threadWaitRead fd >> loop)

Because we don’t want to block other operations that need the buffer and/or file descriptor locks, we need to drop those locks before we wait on the descriptor. But because we are dropping those locks, anything could happen before we actually start waiting on the descriptor.

In particular, another thread could close the descriptor, and the descriptor could be reused to refer to an entirely new description, all before we start waiting. Essentially, that file descriptor is truly valid only while we hold the descriptor lock; thus we need to go through the process of registering our interest in the file descriptor while we have the lock, and then actually blocking on the descriptor after we have released the lock.

Now, this race condition is a relatively benign one as long as the new description either becomes readable relatively quickly, or the descriptor is closed and GHC’s IO manager is made aware of the fact. In the first case, threadWaitRead fd will unblock, and the thread will loop, realize that the descriptor has been closed, and throw an exception. In the second case, GHC’s IO manager will cause threadWaitRead fd to throw an exception directly.

But if you get very unlucky, this might not happen for a very long time, if ever, causing deadlock. This might be the case if, say, the new descriptor is a unix socket connected to /dev/log. Or, if the reallocated descriptor is managed by foreign code and is closed without GHC’s IO manager being made aware, this could quite possibly deadlock.

2015 February 12

Announcing blaze-builder-0.4

Filed under: Uncategorized — lpsmith @ 1:42 pm

After a recent chat with Simon Meier, we decided that I would take over the maintenance of the exceedingly popular blaze-builder package.

Of course, this package has been largely superseded by the new builder shipped inside bytestring itself. The point of this new release is to offer a smooth migration path from the old to the new.

If you have a package that only uses the public interface of the old blaze-builder, all you should have to do is compile it against blaze-builder-0.4 and you will in fact be using the new builder. If your program fails to compile against the old public interface, or there’s any change in the semantics of your program, then please file a bug against my blaze-builder repository.

If you are looking for a function to convert Blaze.ByteString.Builder.Builder to Data.ByteString.Builder.Builder or back, it is id. These two types are exactly the same, as the former is just a re-export of the latter. Thus inter-operation between code that uses the old interface and the new should be efficient and painless.
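To illustrate, both modules can be imported side by side, and the “conversion” really is just id:

import qualified Blaze.ByteString.Builder as Blaze
import qualified Data.ByteString.Builder  as BS

-- With blaze-builder >= 0.4 the two Builder types are identical,
-- so these functions typecheck as plain id:
toNew :: Blaze.Builder -> BS.Builder
toNew = id

toOld :: BS.Builder -> Blaze.Builder
toOld = id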

The one caveat is that the old implementation has all but disappeared, and programs and libraries that touch the old internal modules will need to be updated.

This compatibility shim is especially important for those libraries that have the old blaze-builder as part of their public interface, as now you can move to the new builder without breaking your interface.

There are a few things to consider in order to make this transition as painless as possible, however: libraries that touch the old internals should probably move to the new bytestring builder as soon as possible, while those libraries who depend only on the public interface should probably hold off for a bit and continue to use this shim.

For example, blaze-builder is part of the public interface of both the Snap Framework and postgresql-simple. Snap touches the old internals, while postgresql-simple uses only the public interface. Both libraries are commonly used together in the same projects.

There would be some benefit to postgresql-simple to move to the new interface. However, let’s consider the hypothetical situation where postgresql-simple has transitioned, and Snap has not. This would cause problems for any project that 1.) depends on this compatibility shim for interacting with postgresql-simple, and 2.) uses Snap.

Any such project would have to put off upgrading postgresql-simple until Snap is updated, or interact with postgresql-simple through the new bytestring builder interface while continuing to use the old blaze-builder interface for Snap. The latter option could range anywhere from trivial to extremely painful, depending on how entangled the usage of Builders is between postgresql-simple and Snap.

By comparison, as long as postgresql-simple continues to use the public blaze-builder interface, it can easily use either the old or new implementation. If postgresql-simple holds off until after Snap makes the transition, then there’s little opportunity for these sorts of problems to arise.

Announcing snaplet-postgresql-simple-0.6

Filed under: Uncategorized — lpsmith @ 12:16 pm

In the past, I’ve said some negative things1 about Doug Beardsley’s snaplet-postgresql-simple, and in this long overdue post, I retract my criticism.

The issue was that a connection from the pool wasn’t reserved for the duration of the transaction. This meant that the individual queries of a transaction could be issued on different connections, and that queries from other requests could be issued on the connection that’s in a transaction. Setting the maximum size of the pool to a single connection fixes the first problem, but not the second.

At Hac Phi 2014, Doug and I finally sat down and got serious about fixing this issue. The fix did require breaking the interface in a fairly minimal fashion. Snaplet-postgresql-simple now offers the withPG and liftPG operators that will exclusively reserve a single connection for a duration, and in turn uses withPG to implement withTransaction.

We were both amused by the fact that apparently a fair number of people have been using snaplet-postgresql-simple, even using transactions in some cases, without obviously noticing the issue. One could speculate about the reasons why, but Doug did mention that he pretty much never uses transactions. So in response, I came up with a list of five common use cases: the first three involve changing the database, while the last two are useful even in a read-only context.

  1. All-or-nothing changes

    Transactions allow one to make a group of logically connected changes so that they are either all reflected in the resulting state of the database, or none of them are. So if anything fails before the commit, say due to a coding error or even something outside the control of the software, the database isn’t polluted with partially applied changes.

  2. Bulk inserts

    Databases that provide durability, like PostgreSQL, are limited in the number of transactions per second by the rotational speed of the disk they are writing to. Thus individual DML statements are rather slow, as each PostgreSQL statement that isn’t run in an explicit transaction is run in its own individual, implicit transaction. Batching multiple insert statements into a single transaction is much faster.

    This use case is relatively less important when writing to a solid state disk, which is becoming increasingly common. Alternatively, postgresql allows a client program to turn synchronous_commit off for the connection or even just a single transaction, if sacrificing a small amount of durability is acceptable for the task at hand.

  3. Avoiding Race Conditions

    Transactional databases, like Software Transactional Memory, do not automatically eliminate all race conditions, they only provide a toolbox for avoiding and managing them. Transactions are the primary tool in both toolboxes, though there are considerable differences around the edges.

  4. Using Cursors

    Cursors are one of several methods to stream data out of PostgreSQL, and you’ll almost always want to use them inside a single transaction.2 One advantage that cursors have over the other streaming methods is that one can interleave the cursor with other queries, updates, and cursors over the same connection, and within the same transaction.

  5. Running multiple queries against a single snapshot

    If you use the REPEATABLE READ or higher isolation level, then every query in the transaction will be executed on a single snapshot of the database.
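To make the bulk-insert case (use case 2) concrete, here is a sketch using postgresql-simple; the items table is hypothetical:

{-# LANGUAGE OverloadedStrings #-}

import Database.PostgreSQL.Simple
import Control.Monad (forM_)
import Data.Text (Text)

-- One commit (and one disk sync) for the whole batch, instead of one
-- implicit transaction per statement.
-- Assumes a hypothetical table: CREATE TABLE items (name TEXT NOT NULL)
bulkInsert :: Connection -> [Text] -> IO ()
bulkInsert conn names =
    withTransaction conn $
        forM_ names $ \name ->
            execute conn "INSERT INTO items (name) VALUES (?)" (Only name)

For larger batches, executeMany compresses this further into a single multi-row INSERT, which is faster still.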

So I no longer have any reservations about using snaplet-postgresql-simple if it is a good fit for your application, and I do recommend that you learn to use transactions effectively if you are using Postgres. Perhaps in a future post, I’ll write a bit about picking an isolation level for your postgres transactions.

  1. See for example, some of my comments in the github issue thread on this topic, and the reddit thread which is referenced in the issue.

  2. There is the WITH HOLD option for keeping a cursor open after a transaction commits, but this just runs the cursor to completion, storing the data in a temporary table. Which might occasionally be acceptable in some contexts, but is definitely not streaming.

2014 November 5

A Brief History of Timekeeping, Part 1: Dates on the Western Calendars

Filed under: time — Tags: , , — lpsmith @ 8:13 pm

Time is simple and well understood, right? Well, in a physical sense, it’s fairly well understood, but not at all simple. Cutting edge research into atomic clock technology promises an uncertainty of about a second over an interval of several billion years. This is plenty accurate enough to expose the minuscule relativistic effects of simply repositioning the clock. In fairly short order, differences of fractions of a picosecond can be detected, caused by the fact that time passes at imperceptibly different speeds depending on motion and the small changes in the local gravity field.

But the topic of these posts is the political vagaries of time, not the physical. The basic concept of time (modulo relativity) is something intuitively understood by almost all human beings, and human cultures have been accurately counting the days for thousands of years. So if understanding fractions of a picosecond is difficult, at least dates should be easy, right?

Unfortunately, not so much. Although the Western Calendar has become the de facto and de jure standard over most of the world, mapping a Western date to a precise day has some complicated and even ambiguous edge cases.

Julius Caesar introduced the Julian Calendar to the Roman Republic around January 1, 45 BC, in part to help resolve some political problems created by the ancient Roman Calendar. A bit over a year later, on March 15, 44 BC, Julius Caesar was assassinated, motivated in part by some of the political problems the older calendar contributed to.

One might ask: exactly how many days passed between Caesar’s assassination and the time you are reading this? This question can be answered, but arriving at the correct answer is a bit more subtle than most people realize.

The Julian Calendar was essentially version 1.0 of the Western Calendar, and its initial deployment had a few teething problems. There was a bit of confusion surrounding what it meant to have a leap day "every fourth year", and as a result leap days occurred every third year until the error was noticed and corrected by Augustus Caesar and his advisers nearly 40 years later. All leap years were suspended for 12 years, to be resumed at the proper interval once the error was corrected.

We don’t know exactly when those earliest leap years occurred. It does seem as though there is a single probable answer in the case of Caesar’s assassination, as both of the unfalsified candidate solutions for the early leap years lead to the same answer. However, there are many dates over the subsequent decades that could easily refer to one of two days. In any case, sometime around 1 BC to 4 AD, the Julian calendar stabilized, and for nearly 1600 years afterwards, there is an unambiguous mapping of historical dates to days.

However, the average length of a Julian year was longer than the length of the solar year. Over the next 1500 years the dates on the Julian Calendar drifted noticeably out of sync with the solar year, with the spring equinox (as determined by actual astronomical observations) coming some 10 days too early.

In February 1582, Pope Gregory XIII instituted the Gregorian Calendar. Leap years would come every fourth year, as before, but now leap years would be skipped every 100 years, except every 400 years. Also, the new calendar would skip 10 days in order to bring it back in sync with the solar year, with 1582-10-04 followed by 1582-10-15.

And, if that was the entire story, it wouldn’t be so bad, but unfortunately this reform ushered in several hundred years of ambiguity and confusion due to its uneven adoption. This reform was immediately adopted by the Catholic Church, the Papal States and many Catholic nations, with other Catholic nations moving to the Gregorian Calendar soon afterwards. However, the Gregorian Calendar was resisted by most Protestant and Orthodox nations. For example, Great Britain and her colonies held out until 1752,1 and Russia and Greece held out until 1918 and 1923 respectively, shortly after armed revolutions forced a change of government in those countries.

Then you have the case of the Swedish Calendar, which is notable for its particularly convoluted, mishandled, and then aborted transition from the Julian to the Gregorian calendar, which resulted in a February 30, 1712. Also, in a few locales, including Switzerland and the Czech Republic, some people were using the Julian calendar at the same time others were using the Gregorian calendar.

Thus, accurately identifying a date after 1582 coming from original sources, or any historical date coming from modern sources, can be complicated or even impossible, depending on the precise context (or lack thereof) surrounding the date and its source.

Most dates in modern history books remain as they were originally observed, but there are a few exceptions. For example, the date of George Washington’s birth is observed as 1732-02-22 (Gregorian), even though the American colonies were using the Julian calendar at the time. On the day that Washington was born, the date was actually 1732-02-11 (Julian).

As encouraged by the ISO 8601:2004 standard, computer software often uses the proleptic Gregorian calendar for historical dates, extending the Gregorian calendar backwards in time before its introduction and using it for all date calculations. This includes PostgreSQL’s built-in time types. Glasgow Haskell’s time package supports both the proleptic Julian and Gregorian calendars, though defaults to the Gregorian Calendar.
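For example, with the time package the Washington date above is easy to check: fromGregorian and fromJulian both produce the same Day type, so dates from the two calendars can be compared directly:

import Data.Time.Calendar        (Day, diffDays, fromGregorian)
import Data.Time.Calendar.Julian (fromJulian)

-- Washington's birth date written on each calendar; both denote the
-- same day, so the difference is zero.
birthGregorian, birthJulian :: Day
birthGregorian = fromGregorian 1732 2 22
birthJulian    = fromJulian    1732 2 11

main :: IO ()
main = print (diffDays birthGregorian birthJulian)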

The good news is that as time marched on, things have gotten better. Driven largely by faster methods of communication and transportation, we’ve gone from the typical context-dependence of dates down to the typical context-dependence of a dozen seconds or so. Even ignoring context, we are often more limited by the accuracy of a typical clock, and technologies such as GPS and NTP are often available to help keep them in reasonable agreement with atomic time standards.

In the upcoming posts of this two or three part series, we’ll look at the process of reducing this ambiguity, as well as local time, time zones, and how PostgreSQL’s nearly universally misunderstood timestamp with time zone type actually works.

  1. Users with access to a unix machine may be amused by the output of cal 9 1752, and the differences in various man pages for cal and ncal can be informative; take, for example, these two.

2014 March 27

Announcing postgresql-simple-0.4.2

Filed under: Uncategorized — lpsmith @ 7:46 pm

With 19 new releases, postgresql-simple has seen substantial development since my announcement of version 0.3.1 nearly a year ago. Compared to my last update, the changes are almost exclusively bugfixes and new features. Little code should break as a result of these changes, and most if not all of the code that does break should fail at compile time.

As it stands today, postgresql-simple certainly has the best support for postgres-specific functionality of anything on hackage. So, for a few highlights:

Parametrized Identifiers

Thanks to contributions from Tobias Florek, perhaps the most exciting change for many is that postgresql-simple now properly supports parametrized identifiers, including column, table, and type names. These are properly escaped via libpq. So for example, you can now write:

query conn "SELECT a_column FROM ?"
              (Only ("schema.table" :: QualifiedIdentifier))

Of course, this particular example is still potentially very insecure, but it shouldn’t be possible to create a SQL injection vulnerability this way.

The downside of this change is that postgresql-simple now requires libpq version 9.0 or later. However, community support for 8.4 is coming to an end this July. Also, it is possible to use newer versions of libpq to connect to older versions of postgres, so you don’t have to upgrade your server. (In fact, in one particular situation I’m still using postgresql-simple to connect to postgresql 8.1, although some features such as non-builtin types don’t work.)

Copy In and Copy Out support

While the postgresql-libpq binding has supported COPY FROM STDIN and COPY TO STDOUT for some time, postgresql-simple now supports these directly without having to muck around with postgresql-libpq calls via the Internal module.

If you are interested in streaming data to and from postgres, you may also be interested in higher-level COPY bindings for pipes and io-streams. This is available in Oliver Charles’s pipes-postgresql-simple, which also supports cursors, and my own unreleased postgresql-simple-streams, which also supports cursors and large objects.

Out-of-box support for JSON, UUID, and Scientific

Thanks to contributions from Bas van Dijk, postgresql-simple now provides FromField and ToField instances for aeson‘s value type, Antoine Latter’s uuid, as well as scientific.

Savepoints and nestable folds

Thanks to contributions from Joey Adams, postgresql-simple now has higher-level support for savepoints in the Transaction module, as well as the ability to nest the fold operator. Previously, postgresql-simple had assigned a static name to the cursor that underlies fold, and now every connection has a counter used to generate temporary names.
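As a sketch of the savepoint support (the events table is hypothetical; note that withSavepoint rolls back to the savepoint and re-throws the inner exception, so it has to be caught if the outer transaction is to proceed):

{-# LANGUAGE OverloadedStrings #-}

import Database.PostgreSQL.Simple
import Database.PostgreSQL.Simple.Transaction (withSavepoint)
import Control.Exception (SomeException, try)
import Data.Int (Int64)

-- If the inner insert fails, only the work inside the savepoint is
-- rolled back; the outer transaction can still commit the first insert.
logEvents :: Connection -> IO ()
logEvents conn =
    withTransaction conn $ do
        _ <- execute_ conn "INSERT INTO events (kind) VALUES ('start')"
        r <- try $ withSavepoint conn $
                 execute_ conn "INSERT INTO events (kind) VALUES ('detail')"
        case (r :: Either SomeException Int64) of
            Left  _ -> return ()  -- rolled back to the savepoint
            Right _ -> return ()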

Parametrized VALUES expressions

While executeMany and returning already support common use cases of this new feature, the new Values type allows you to parameterize more than just a single VALUES table literal.

For example, these situations commonly arise when dealing with writable common table expressions. Let’s say we have a table of things with an associated table of attributes:

CREATE TABLE things (
   id   SERIAL NOT NULL,
   name TEXT   NOT NULL,
   PRIMARY KEY (id)
);

CREATE TABLE attributes (
   id    INT  NOT NULL REFERENCES things,
   key   TEXT NOT NULL,
   value INT  NOT NULL
);

Then we can populate both a thing and its attributes with a single query, returning the id generated by the database:

query conn [sql|
    WITH thing AS (
        INSERT INTO things (name) VALUES (?) RETURNING id
    ), newattrs AS (
        INSERT INTO attributes
            SELECT thing.id, a.*
              FROM thing CROSS JOIN ? a
    )
    SELECT id FROM thing;
  |]  ("bob", Values   [ "text" ,"int8" ]
                     [ ( "foo"  , 42    )
                     , ( "bar"  , 60    ) ])

The empty case is also dealt with correctly; see the documentation for details.

Improved support for outer joins

A long standing challenge with postgresql-simple is dealing with outer joins in a sane way. For example, with the schema above, let’s say we want to fetch all the things along with their attributes, whether or not any given thing has any attributes at all. Previously, we could write:

getAllTheThings :: Connection -> IO [(Text, Maybe Text, Maybe Int64)]
getAllTheThings conn = do
    query_ conn [sql|
        SELECT name, key, value
          FROM things LEFT OUTER JOIN attributes
            ON things.id = attributes.id
      |]

Now, the columns from the attributes table are not nullable, so normally we could avoid the Maybe constructors; however, the outer join changes that. Since both of these columns are not nullable, they are either both null or both not null, which is an invariant not captured by the type. And a separate Maybe for each column gets awkward to deal with, especially when more columns are involved.

What we would really like to do is change the type signature to:

getAllTheThings :: Connection -> IO [Only Text :. Maybe (Text, Int64)]

And now we can, and it will just work! Well, almost. The caveat is that there is a separate instance FromRow (Maybe ...) for most of the provided FromRow instances. This won’t work with your own FromRow instances unless you also declare a second instance. What’s really desired is a generic instance:

instance FromRow a => FromRow (Maybe a)

This instance would return Nothing if all the columns that would normally be consumed are null, and attempt a full conversion otherwise. This would reduce code bloat and repetition, and improve polymorphism and compositionality.

But alas, it’s not possible to define such an instance without changing the FromRow interface, and quite probably breaking everybody’s FromRow instances. Which I’m totally willing to do, once somebody comes up with a way to do it.

2013 April 27

Announcing postgresql-simple 0.3.1

Filed under: Uncategorized — lpsmith @ 10:49 am

Array types

Postgresql-simple has been progressing since my last announcement of version 0.1 nearly a year ago. Since then there have been many changes by myself and contributors, some of which will break your code with or without compilation errors. So this post will attempt to highlight some of the bigger changes. Probably the most exciting recent change is the new support for PostgreSQL’s array types, largely due to work from Jason Dusek, Bas van Dijk, and myself. Here are two examples of tables and queries that make use of this functionality:

CREATE TABLE assocs
   ( id   INT NOT NULL
   , link INT NOT NULL
   );

CREATE TABLE matrices
   ( id     INT NOT NULL
   , matrix FLOAT8[][]
   );

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Vector as V
import           Database.PostgreSQL.Simple
import           Data.Int(Int64)

getAssocs :: Connection -> IO [(Int,V.Vector Int)]
getAssocs conn = do
    query_ conn "SELECT id, ARRAY_AGG(link) FROM assocs GROUP BY id"

insertMatrix :: Connection -> Int -> V.Vector (V.Vector Double) -> IO Int64
insertMatrix conn id matrix = do
    execute conn "INSERT INTO matrices (id, matrix) VALUES (?,?)" (id, matrix)

TypeInfo Changes

In order to properly support the FromField a => FromField (Vector a) instance, the TypeInfo system was overhauled. Previously, the only information it tracked was a mapping of type OIDs to type names, first by consulting a static table, then a per-connection cache, and finally by querying the pg_type metatable. This was useful for writing FromField instances for postgresql types that do not have a stable OID, such as those provided by an extension, like the period type from Temporal Postgres or the hstore type from the contributed modules bundled with PostgreSQL. However, proper array support required more information, especially the type of the array’s elements. This information is now available in an easily extended data structure, exposed in the new Database.PostgreSQL.Simple.TypeInfo module. This was introduced in 0.3; 0.3.1 added support for range and composite types, although there is not yet any FromField instance that makes use of this information or deals with these types.

IO at Conversion Time

Version 0.3 also stopped pre-computing the type name of every column and storing these in a vector before converting the results, by allowing a restricted set of IO actions at conversion time. This is a win, because in the common case the type name is never consulted; for example, almost all of the out-of-box FromField instances examine the type OID alone. Also, since this information no longer has to be computed before conversion takes place, it becomes practical to use IO information that would rarely be needed in normal circumstances, such as turning table OIDs into table names when errors are encountered. It’s possible to extend the IO actions available to FromField and FromRow instances by accessing the data constructor of the Conversion monad via the Database.PostgreSQL.Simple.Internal module.

This required changing the type of the FromField.typename operator, which will break your FromField instances that use it. It also required small changes to the FromField and FromRow interface, which has a chance of breaking some of your FromField and FromRow instances if they don’t strictly use the abstract Applicative and/or Monad interfaces. However, all of this breakage should be obvious; if your code compiles, it should work with the new interface.

HStore support

Version 0.3.1 also introduced out-of-box support for hstore. The hstore type provides key-value maps from textual strings to textual strings. Conversions to and from lists and Data.Map are provided, while conversions from other Haskell types can easily be implemented via the HStoreBuilder interface (similar to my own json-builder package), and conversions to other Haskell types can easily be implemented via the conversion to lists.

CREATE TABLE hstore_example
    ( id  INT NOT NULL
    , map hstore
    );

insertHStore :: Connection -> Int -> [(Text,Text)] -> IO Int64
insertHStore conn id map =
    execute conn "INSERT INTO hstore_example (id,map) VALUES (?,?)" (id, HStoreList map)

retrieveHStore :: Connection -> Int -> IO (Maybe [(Text,Text)])
retrieveHStore conn id = do
    xs <- query conn "SELECT map FROM hstore_example WHERE id = ?" (Only id)
    case xs of
      []                          -> return Nothing
      (Only (HStoreList val) : _) -> return (Just val)
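
Since conversions to and from Data.Map are also provided, the same table can be written to using a Map instead of an association list. A sketch along the same lines as the code above, assuming a ToField instance for HStoreMap:

```haskell
import Data.Int (Int64)
import Data.Map (Map)
import Data.Text (Text)
import Database.PostgreSQL.Simple
import Database.PostgreSQL.Simple.HStore

-- Same insert as before, but taking a Map rather than an association list.
insertHStoreMap :: Connection -> Int -> Map Text Text -> IO Int64
insertHStoreMap conn id map =
    execute conn "INSERT INTO hstore_example (id,map) VALUES (?,?)"
                 (id, HStoreMap map)
```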

Better Error Messages

Jeff Chu and Leonid Onokhov have improved both error messages and error handling options in the latest version. Thanks to Jeff, the ResultError exception now includes the column name and associated table OID (if any) from which the column was taken. And Leonid has contributed a new Errors module that can be used to dissect SqlError values in greater detail.
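
For example, the new Errors module makes it possible to recover from specific constraint violations. A sketch, assuming the constraintViolation function and ConstraintViolation type that the module provides:

```haskell
import Control.Exception (catch, throwIO)
import Database.PostgreSQL.Simple
import Database.PostgreSQL.Simple.Errors

-- Run an action, translating a unique-constraint violation into Nothing
-- and re-throwing every other SqlError.
tryUnique :: IO a -> IO (Maybe a)
tryUnique act =
    fmap Just act `catch` \e ->
        case constraintViolation e of
          Just (UniqueViolation _) -> return Nothing
          _                        -> throwIO e
```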

Better Time Conversions

And in older news, version 0.1.4 debuted brand new time parsers and printers for the ISO-8601 syntax flavor that PostgreSQL emits, included FromField instances for LocalTime, and introduced new datatypes for dealing with PostgreSQL’s time infinities. Among other things, the new parsers correctly handle timestamps with UTC offsets of a whole number of minutes, which means (for example) that postgresql-simple now works in India. Version 0.2 removed the conversion from timestamp (without time zone) to the UTCTime and ZonedTime types, due to the inherent ambiguity that conversion represents; LocalTime is now the preferred way of handling timestamps (without time zones).
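
As a concrete illustration, reading a timestamp (without time zone) column now yields a LocalTime directly. A sketch; the events table and its occurred_at column are hypothetical:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Time (LocalTime)
import Database.PostgreSQL.Simple

-- Fetch the most recent timestamp-without-time-zone value, if any.
latestEvent :: Connection -> IO (Maybe LocalTime)
latestEvent conn = do
    xs <- query_ conn "SELECT occurred_at FROM events ORDER BY occurred_at DESC LIMIT 1"
    case xs of
      []           -> return Nothing
      (Only t : _) -> return (Just t)
```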

2012 July 10

Announcing split-channel

Filed under: Uncategorized — lpsmith @ 11:44 pm

The split-channel package is a new library that is a small variation on Control.Concurrent.Chan. The most obvious change is that it splits the channel into sending and receiving ports. This has at least two advantages: first, it enables the type system to more finely constrain program behavior; and second, a SendPort can have zero ReceivePorts associated with it, and messages written to such a channel can be garbage collected.
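
A minimal usage sketch, assuming the new, send, and receive operations exported by split-channel’s Control.Concurrent.Chan.Split module:

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan.Split (new, send, receive)

main :: IO ()
main = do
    (sendPort, recvPort) <- new      -- the two halves are created together
    _ <- forkIO (send sendPort "hello")
    msg <- receive recvPort          -- blocks until a message arrives
    putStrLn msg
```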

This library started life last fall as part of my experiments in adding support for PostgreSQL’s asynchronous notifications to Chris Done’s native pgsql-simple library. The initial motivation was that if a notification arrived and nobody was listening, I wanted to be able to garbage collect it. However, the type advantages are what keep me coming back.

Beyond the primary change, this library has a number of other small improvements over Control.Concurrent.Chan: the deprecated thread-unsafe functions aren’t there, and several operators have been added or improved, most notably listen, sendMany, fold, and split.

  1. listen attaches a new ReceivePort to an existing SendPort. By contrast, Chan only provides the ability to duplicate an existing ReceivePort.

    Edit: I was mistaken: listen is essentially equivalent to dupChan, whereas duplicate is new.

  2. sendMany sends a list of messages atomically. It’s a better name than writeList2Chan, which is not atomic and is only a convenience function written in terms of send. However, writeList2Chan does work on infinite streams, whereas sendMany does not.

  3. fold is a generalization of getChanContents, potentially avoiding some intermediate data structures.

  4. split cuts an existing channel into two channels. It gives you back a new ReceivePort associated with the existing SendPort, and a new SendPort associated with the existing ReceivePorts. This is a more general operator than one I’ve used in a few places to transparently swap out backend services.

    Chan does not provide the split operator, though one could be added. However I am skeptical that this is a good idea: it’s just a little too effect-ful for comfort. I think that putting a SendPort in an MVar tends to be a better idea than using split, even though it does introduce another layer of indirection.

Finally, a few acknowledgements are in order: primarily, Control.Concurrent.Chan and its authors and contributors, and secondarily, Joey Adams for GHC Bug #5870, the fix of which has been incorporated into split-channel.

2012 May 5

Announcing postgresql-simple 0.1

Filed under: Uncategorized — lpsmith @ 3:02 pm

A new release of postgresql-simple is now available. Version 0.1 introduces some breaking changes as well as some significant improvements to the interface. I’ll cover the biggest and best change in this post; for the other changes, you should read the changelog.

A few months ago, Ozgun Ataman suggested that I add some convenience code to simplify writing instances of QueryResults. I ended up liking his suggestion so much I decided to dig deeper into postgresql-simple and rebase the implementation around his interface. This allowed me to eliminate some intermediate data structures, and has some other advantages I’ll get to later. I’m rather pleased with the end result; here is a contrast of the old and the new:

class QueryResults a where
    convertResults :: [Field] -> [Maybe ByteString] -> Either SomeException a

instance (Result a, Result b) => QueryResults (a,b) where
    convertResults [fa,fb] [va,vb] = do
        !a <- convert fa va
        !b <- convert fb vb
        return (a,b)
    convertResults fs vs  = convertError fs vs 2

-- RowParser has instances for Applicative, Alternative, and Monad
class FromRow a where
    fromRow :: RowParser a

-- field is exported from Database.PostgreSQL.Simple.FromRow
field :: FromField a => RowParser a

instance (FromField a, FromField b) => FromRow (a,b) where
    fromRow = (,) <$> field <*> field

The new interface not only eliminates a significant amount of syntactic overhead, it also allowed me to automatically force the converted values to WHNF, which eliminates the old caveats about writing QueryResults instances. (Note that there are (still) caveats to writing FromField instances, even if they weren’t documented before.) And perhaps more interestingly, it enables new forms of composition. For example, the Types module defines a pair type for composing FromRow instances as follows:

data a :. b = a :. b 

infixr 3 :.

instance (FromRow a, FromRow b) => FromRow (a :. b) where
   fromRow = (:.) <$> fromRow <*> fromRow

Here’s an example of how you might use this. Let’s pretend we have a simple schema that might be used for a (small) website where people can post stories and rate them from 0-9:

CREATE TABLE users                     (uid int not null, username text   not null);
CREATE TABLE posts   (pid int not null, uid int not null, content  text   not null);
CREATE TABLE ratings (pid int not null, uid int not null, score    float4 not null);

Then, here’s some example boilerplate that could be used to represent (part of) the schema in Haskell:

data User = User { u_uid :: Int, username :: Text }

instance FromRow User where
  fromRow = User <$> field <*> field

data Post = Post { p_pid :: Int, p_uid :: Int,  content :: Text } 

instance FromRow Post where
  fromRow = Post <$> field <*> field <*> field

type Score = Float

And finally, here is an example of how one could get the data out of the database in a form almost ready to generate a web page:

    rows :: [User :. Post :. Only Score] <- 
       query conn [sql|  SELECT u.*, p.*, avg(r.score) as avg_score
                           FROM users   as u
                           JOIN posts   as p ON u.uid = p.uid
                           JOIN ratings as r ON p.pid = r.pid
                          GROUP BY u.uid, u.username, p.pid, p.uid, p.content
                          ORDER BY avg_score DESC                        |] ()
    forM rows $ \(user :. post :. Only score) -> do
        -- generate the HTML for a single post here

Now of course, this isn’t meant to be a serious sketch of such a post/rating system; there are plenty of performance problems that are perfectly obvious to me, and maybe to you too. This is just to give you some ideas about how this functionality might be used: for example, to deal with (a limited class of) joins, or to deal with a computed column that’s been affixed to the table.

Some unresolved issues at this point would be dealing with natural joins and outer joins. I suspect that an adequate solution to these problems may require more breaking changes to this interface. In any case, multiple people have been asking for this kind of functionality, and I do think this interface is a step in the right direction.

For questions, comments, or suggestions relating to postgresql-simple, I suggest joining database-devel, a new mailing list pertaining not only to postgresql-simple, but also database development in Haskell in general. Also I’m often willing to answer questions if you happen to catch me on Freenode.

2012 February 24

Implementing json-builder (or, how the CommaBuilder got its spots)

Filed under: Uncategorized — lpsmith @ 6:42 am

Json-builder is capable of serializing aeson’s mandatory data structure at almost exactly the same speed as aeson itself. This is no accident, but rather it was a conscious implementation goal; in fact I managed to very substantially improve on aeson’s performance in a few minor corner cases (such as generating character escapes) and then shared what I discovered with Bryan O’Sullivan, who then improved aeson in turn.

One of the more interesting tricks in json-builder is the Monoid instances for Array and Object types. Those instances have been through 4 major revisions, with an interesting pattern of mistakes.

Now, a value of each type is conceptually represented by a Json string without the opening and closing delimiters. So for example, the expression row "x" 3 is represented by the string "x":3. The outer braces are added by the ‘toJson’ function when it converts the Object to a value of the ‘Json’ type.

The delimiters are left off because a client can both append and prepend additional rows to the object, and we cannot efficiently remove the opening and closing brace from a Builder. It may help to pretend that an adversary is trying to break and abuse your interface. However, once the object is converted to a Json value, no such adversary can append or prepend rows; after all, a Json value isn’t necessarily even an Object.

However, when two objects are appended, we need to insert a comma between them. One might be tempted to write:

newtype Object = Object Builder

-- (++) denotes Builder concatenation in these examples
instance Monoid Object where
    mempty                        = Object mempty
    mappend (Object a) (Object b) = Object (a ++ "," ++ b)

However, this is incorrect; in fact, it isn’t even a monoid. An adversary could produce syntactically incorrect Json simply by appending or prepending a mempty value. For example, here’s some sample expressions and the syntactically incorrect fragments they would generate:

   toJson  (row "x" 3 ++ mempty)               {"x":3,}
   mconcat [ row "x" 3, mempty, row "y" 4 ]     "x":3,,"y":4

Now, this adversary wouldn’t need to be very sophisticated, in fact these problems would almost certainly be discovered by a casual programmer. And a friendly programmer would find it excruciatingly painful to work around this broken implementation. But thinking like an adversary tends to lead to the best code; within reason, you want to minimize an adversary’s best move.

So, we need a way to treat the empty Object specially. Also, we cannot simply append or prepend a comma to every singleton object, as that would leave a superfluous comma at either the beginning or the end of the object.

My first solution was to use the strict state monad to track whether or not we’ve emitted a singleton. Then row knows whether it needs to prepend a comma or not. In essence, my implementation was:

newtype Object = Object (State Bool Builder)

instance Monoid Object where
    mempty                        = Object (return mempty)
    mappend (Object a) (Object b) = Object (liftM2 mappend a b)

row k v = Object $ do
   first <- get
   put False
   let str = toBuilder k ++ ":" ++ toBuilder v
   return $ if first then str else "," ++ str

instance Value Object where
   toJson (Object m) = Json ( "{" ++ evalState m True ++ "}" )

But some months later, as I was fleshing out other aspects of json-builder, it occurred to me that the first solution was overly strict; I couldn’t incrementally serialize something like toJson [1..10^9] for example. So, my second solution was conceptually the following code, albeit the actual code was special-cased and simplified slightly:

newtype Object = Object (ContT Builder (State Bool) ())

instance Monoid Object where
    mempty                        = Object (return ())
    mappend (Object a) (Object b) = Object (a >> b)

row k v = Object $ do
    first <- get
    put False
    if first
      then return ()
      else write ","
    write (toBuilder k ++ ":" ++ toBuilder v)
  where
    write b = mapContT (fmap (b ++)) (return ())

instance Value Object where
    toJson (Object m) = Json ( "{" ++ run m ++ "}" )
      where run m = evalState (runContT m (\_ -> return mempty)) True

But then I started benchmarking json-builder against aeson. Although I was in the right ballpark, I discovered a noticeable performance gap when serializing objects and vectors. So I started thinking about how to improve the Monoid implementation, and soon realized I really didn’t need a full blown monad. My third implementation was much prettier and closed the performance gap:

data Object = Empty | Comma Builder

instance Monoid Object where
    mempty = Empty
    mappend Empty y = y
    mappend x Empty = x
    mappend (Comma a) (Comma b) = Comma (a ++ "," ++ b)

row k v = Comma (toBuilder k ++ ":" ++ toBuilder v)

instance Value Object where
    toJson Empty     = Json "{}"
    toJson (Comma x) = Json ( "{" ++ x ++ "}" )

But, I soon realized that I had replicated the same mistake from the first solution, namely that this implementation was insufficiently lazy for incremental serialization. Fixing that was a simple matter of applying a standard and well-known code transformation to mappend:

instance Monoid Object where
    mempty = Empty
    mappend Empty     x = x
    mappend (Comma a) x = Comma ( a ++ case x of
                                         Empty   -> mempty
                                         Comma b -> "," ++ b )
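
The behavior of this final version can be checked with a tiny self-contained model that substitutes String for Builder. This is a toy sketch for illustration, not json-builder’s actual types:

```haskell
-- Toy model: String stands in for Builder.
data Object = Empty | Comma String

instance Semigroup Object where
    Empty   <> x = x
    Comma a <> x = Comma (a ++ case x of
                                 Empty   -> ""
                                 Comma b -> "," ++ b)

instance Monoid Object where
    mempty = Empty

-- A single key/value row; the value is assumed to be pre-serialized.
row :: String -> String -> Object
row k v = Comma (show k ++ ":" ++ v)

toJsonStr :: Object -> String
toJsonStr Empty     = "{}"
toJsonStr (Comma x) = "{" ++ x ++ "}"
```

With this model, mempty on either side of (<>) no longer produces a stray comma: toJsonStr (row "x" "3" <> mempty <> row "y" "4") yields {"x":3,"y":4}, and toJsonStr mempty yields {}.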

However, this transformation reintroduced a performance gap, which was closed by adding a strictness annotation on the data constructor:

data Object = Empty | Comma !Builder

And that my friends, is the story of how the CommaBuilder got its spots.
