Melding Monads

2016 October 26

Announcing Configurator-ng 0.0

Filed under: Uncategorized — lpsmith @ 10:29 pm

I’m pleased to announce a preliminary release of configurator-ng, after spending time writing documentation for it at Hac Phi this past weekend. This release is for the slightly adventurous, but I think that many, especially those who currently use configurator, will find this worthwhile.

This is a massively breaking fork of Bryan O’Sullivan’s configurator package. The configuration file syntax is almost entirely backwards compatible, and mostly forwards compatible as well. However, the application interface used to read configuration files is drastically different. The focus so far has been on a more expressive interface, an interface that is safer in the face of concurrency, and improved error messages. The README offers an overview of the goals and motivations behind this fork, and how it is attempting to satisfy those goals.

I consider this an alpha release, but I am using it in some of my projects and it should be of reasonable quality. The new interfaces I’ve created in the Data.Configurator.Parser and Data.Configurator.FromValue modules should be pretty stable, but I am planning on major breaking changes to the file (re)loading and change notification interfaces inherited from configurator.

Documentation is currently a little sparse, but I hope the README and the preliminary haddocks will be enough to get started. Please don’t hesitate to contact me, via email, IRC (lpsmith on freenode), or GitHub if you have any questions, comments, ideas, needs or problems.

2015 November 21

Announcing linux-inotify 0.3

Filed under: Uncategorized — lpsmith @ 12:34 am

linux-inotify is a thin binding to the Linux kernel’s file system notification functionality for the Glasgow Haskell Compiler. The latest version does not break the interface in any way, but the implementation eliminates use-after-close file descriptor faults, which are now manifested as IOExceptions instead. It also adds thread safety, and is much safer in the presence of asynchronous exceptions.

In short, the latest version is far more industrial grade. It is the first version I would not hesitate to recommend to others, and the first version I have talked much about publicly. The only downside is that it does require GHC 7.8 or later, due to the use of threadWaitReadSTM to implement blocking reads.

File Descriptor Allocation

These changes were motivated by lpsmith/postgresql-simple#117, one of the more interesting bugs I’ve been peripherally involved with. The upshot is that it was very probably a use-after-close descriptor fault: a descriptor associated with a websocket would be closed, the descriptor would then be associated with a new description, in this case a database connection, and a still-active websocket thread would then ping the database connection, causing the database server to hang up due to a protocol error. Thus, a bug in snap and/or websockets-snap manifested itself in the logs of a database server.

This issue did end up exposing a significant deficiency in my understanding of the allocation of file descriptors: for historical reasons, the POSIX standard specifies that a new description must be indexed by the smallest unused descriptor.

This is an unfortunate design from several standpoints: it necessitates more serialization than would otherwise be required, and a scheme that delays reusing descriptors as long as practical would greatly mitigate use-after-close faults, as they would almost certainly manifest as an EBADF errno instead of a potentially catastrophic interaction with an unintended description.

In any case, one of the consequences of the POSIX standard is that a use-after-close descriptor fault is a pretty big deal, especially in a server with a high descriptor churn rate. In that situation, an inappropriate interaction with the wrong descriptor is actually pretty likely.

Thus, in the face of new evidence, I revised my opinion regarding what the linux-inotify binding should provide; in particular, protection against use-after-close faults. And in the process I made the binding safer in the presence of concurrency and asynchronous exceptions.

The Implementation

Due to the delivery of multiple inotify messages in a single system call, linux-inotify provides a buffer behind the scenes. The new implementation has two MVar locks: one associated with the buffer, and another associated with the inotify descriptor itself.

Four functions take only the descriptor lock: addWatch, rmWatch, close, and isClosed. Four functions take the buffer lock, and may or may not take the descriptor lock: getEvent, peekEvent, getEventNonBlocking, and peekEventNonBlocking. Two functions only ever take the buffer lock: getEventFromBuffer and peekEventFromBuffer.

The buffer is implemented in an imperative fashion, and the lock protects all the buffer manipulations. The existence of the descriptor lock is slightly frustrating, as the only reason it’s there is to correctly support the use-after-close protection; the kernel would otherwise be able to deal with concurrent calls itself.

In this particular design, it’s important that neither of these locks are held for any significant length of time; in particular we don’t want to block while holding either one. This is so that the nonblocking functions are actually nonblocking: otherwise it would be difficult for getEventNonBlocking to not block while another call to getEvent is blocked on the descriptor.

So looking at the source of getEvent, the first thing it does is take the buffer lock, then check whether the buffer is empty. If the buffer is not empty, it returns the first message without ever taking the descriptor lock. If the buffer is empty, it then also takes the descriptor lock (via fillBuffer) and attempts to read from the descriptor. If the descriptor has been closed, it throws an exception. If the read would block, it drops both locks and waits for the descriptor to become readable before trying again. And of course, if the read succeeds, we grab the first message, drop the locks, and return the message.

getEvent :: Inotify -> IO Event
getEvent inotify@Inotify{..} = loop
  where
    funcName = "System.Linux.Inotify.getEvent"
    loop = join $ withLock bufferLock $ do
               start <- readIORef startRef
               end   <- readIORef endRef
               if start < end
               then (do
                        evt <- getMessage inotify start True
                        return (return evt)                  )
               else fillBuffer funcName inotify
                    -- file descriptor closed:
                       (throwIO $! fdClosed funcName)
                    -- reading the descriptor would block:
                       (\fd -> do
                            (waitRead,_) <- threadWaitReadSTM fd
                            return (atomically waitRead >> loop) )
                    -- read succeeded:
                       (do
                            evt <- getMessage inotify 0 True
                            return (return evt)                  )

So there are two points of subtlety worth pointing out in this implementation. First is the idiom used in loop: the main body returns an IO (IO Event), which is turned into IO Event via join. The outer IO is done inside a lock, which computes an (IO Event) to perform after the lock(s) are dropped.
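The join idiom described above can be sketched in isolation. This is a minimal, self-contained example with hypothetical names, not the library’s actual code; withLock in the real code is presumably just withMVar specialized to a unit lock:

```haskell
import Control.Concurrent.MVar
import Control.Monad (join)

-- Minimal sketch of the join idiom: the body runs while the lock is
-- held and returns a continuation; join then runs that continuation
-- after withMVar has released the lock.
lockedThen :: MVar () -> IO (IO a) -> IO a
lockedThen lock body = join $ withMVar lock $ \_ -> body

-- The work under the lock computes the value; the returned inner
-- action is what actually runs outside the critical section.
example :: MVar () -> IO Int
example lock = lockedThen lock $ do
    let n = 21 * 2           -- done while holding the lock
    return (return n)        -- performed after the lock is dropped
```

The payoff is that the critical section only ever computes *what to do next*, so nothing blocking ever runs while a lock is held.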

The second point of subtlety is the case where the read would block. You may want to think for a minute why I didn’t write this instead:

\fd -> do
   return (threadWaitRead fd >> loop)

Because we don’t want to block other operations that need the buffer and/or file descriptor locks, we need to drop those locks before we wait on the descriptor. But because we are dropping those locks, anything could happen before we actually start waiting on the descriptor.

In particular, another thread could close the descriptor, and the descriptor could be reused to refer to an entirely new description, all before we start waiting. Essentially, that file descriptor is truly valid only while we hold the descriptor lock; thus we need to go through the process of registering our interest in the file descriptor while we have the lock, and then actually blocking on the descriptor after we have released the lock.

Now, this race condition is a relatively benign one, as long as the new description either becomes readable relatively quickly, or the descriptor is closed and GHC’s IO manager is made aware of the fact. In the first case, the threadWaitRead fd will unblock, and the thread will loop, realize that the descriptor has been closed, and throw an exception. In the second case, GHC’s IO manager will cause threadWaitRead fd to throw an exception directly.

But if you get very unlucky, this might not happen for a very long time, if ever, causing deadlock. This might be the case if, say, the new descriptor is a unix socket to /dev/log. Or, if the reallocated descriptor is managed by foreign code and is closed without GHC’s IO manager being made aware, it is quite possible that this could deadlock.

2015 February 12

Announcing blaze-builder-0.4

Filed under: Uncategorized — lpsmith @ 1:42 pm

After a recent chat with Simon Meier, we decided that I would take over the maintenance of the exceedingly popular blaze-builder package.

Of course, this package has been largely superseded by the new builder shipped inside bytestring itself. The point of this new release is to offer a smooth migration path from the old to the new.

If you have a package that only uses the public interface of the old blaze-builder, all you should have to do is compile it against blaze-builder-0.4 and you will in fact be using the new builder. If your program fails to compile against the old public interface, or there’s any change in the semantics of your program, then please file a bug against my blaze-builder repository.

If you are looking for a function to convert Blaze.ByteString.Builder.Builder to Data.ByteString.Builder.Builder or back, it is id. These two types are exactly the same, as the former is just a re-export of the latter. Thus inter-operation between code that uses the old interface and the new should be efficient and painless.

The one caveat is that the old implementation has all but disappeared, and programs and libraries that touch the old internal modules will need to be updated.

This compatibility shim is especially important for those libraries that have the old blaze-builder as part of their public interface, as now you can move to the new builder without breaking your interface.

There are a few things to consider in order to make this transition as painless as possible, however: libraries that touch the old internals should probably move to the new bytestring builder as soon as possible, while those libraries that depend only on the public interface should probably hold off for a bit and continue to use this shim.

For example, blaze-builder is part of the public interface of both the Snap Framework and postgresql-simple. Snap touches the old internals, while postgresql-simple uses only the public interface. Both libraries are commonly used together in the same projects.

There would be some benefit to postgresql-simple to move to the new interface. However, let’s consider the hypothetical situation where postgresql-simple has transitioned, and Snap has not. This would cause problems for any project that 1.) depends on this compatibility shim for interacting with postgresql-simple, and 2.) uses Snap.

Any such project would have to put off upgrading postgresql-simple until Snap is updated, or interact with postgresql-simple through the new bytestring builder interface while continuing to use the old blaze-builder interface for Snap. The latter option could range anywhere from trivial to extremely painful, depending on how entangled the usage of Builders is between postgresql-simple and Snap.

By comparison, as long as postgresql-simple continues to use the public blaze-builder interface, it can easily use either the old or new implementation. If postgresql-simple holds off until after Snap makes the transition, then there’s little opportunity for these sorts of problems to arise.

Announcing snaplet-postgresql-simple-0.6

Filed under: Uncategorized — lpsmith @ 12:16 pm

In the past, I’ve said some negative things[1] about Doug Beardsley’s snaplet-postgresql-simple, and in this long overdue post, I retract my criticism.

The issue was that a connection from the pool wasn’t reserved for the duration of the transaction. This meant that the individual queries of a transaction could be issued on different connections, and that queries from other requests could be issued on the connection that’s in a transaction. Setting the maximum size of the pool to a single connection fixes the first problem, but not the second.

At Hac Phi 2014, Doug and I finally sat down and got serious about fixing this issue. The fix did require breaking the interface in a fairly minimal fashion. Snaplet-postgresql-simple now offers the withPG and liftPG operators that will exclusively reserve a single connection for a duration, and in turn uses withPG to implement withTransaction.

We were both amused by the fact that apparently a fair number of people have been using snaplet-postgresql-simple, in some cases even using transactions, without obviously noticing the issue. One could speculate about the reasons why, but Doug did mention that he pretty much never uses transactions. So in response, I came up with a list of five common use cases: the first three involve changing the database, and the last two are useful even in a read-only context.

  1. All-or-nothing changes

    Transactions allow one to make a group of logically connected changes so that either all of them are reflected in the resulting state of the database, or none of them are. So if anything fails before the commit, say due to a coding error or even something outside the control of the software, the database isn’t polluted with partially applied changes.

  2. Bulk inserts

    Databases that provide durability, like PostgreSQL, are limited in the number of transactions per second by the rotational speed of the disk they are writing to. Thus individual DML statements are rather slow, as each PostgreSQL statement that isn’t run in an explicit transaction is run in its own individual, implicit transaction. Batching multiple insert statements into a single transaction is much faster.

    This use case is relatively less important when writing to a solid state disk, which is becoming increasingly common. Alternatively, postgresql allows a client program to turn synchronous_commit off for the connection or even just a single transaction, if sacrificing a small amount of durability is acceptable for the task at hand.

  3. Avoiding Race Conditions

    Transactional databases, like Software Transactional Memory, do not automatically eliminate all race conditions, they only provide a toolbox for avoiding and managing them. Transactions are the primary tool in both toolboxes, though there are considerable differences around the edges.

  4. Using Cursors

    Cursors are one of several methods to stream data out of PostgreSQL, and you’ll almost always want to use them inside a single transaction.[2] One advantage that cursors have over the other streaming methods is that one can interleave the cursor with other queries, updates, and cursors over the same connection, and within the same transaction.

  5. Running multiple queries against a single snapshot

    If you use the REPEATABLE READ or higher isolation level, then every query in the transaction will be executed on a single snapshot of the database.
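To make use case 3 concrete, here is the analogous transaction in Software Transactional Memory, a sketch using GHC’s stm library rather than database code: without the atomically block, a concurrent reader could observe the withdrawal without the deposit.

```haskell
import Control.Concurrent.STM

-- An atomic transfer between two balances. The two writes commit
-- together or not at all; no other transaction can observe the
-- intermediate state where the money has left one balance but not
-- yet arrived at the other.
transfer :: TVar Int -> TVar Int -> Int -> IO ()
transfer from to amount = atomically $ do
    a <- readTVar from
    writeTVar from (a - amount)
    b <- readTVar to
    writeTVar to (b + amount)
```

A database transaction at a suitable isolation level provides the same guarantee for a group of UPDATE statements, which is why "never uses transactions" is a riskier habit than it might appear.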

So I no longer have any reservations about using snaplet-postgresql-simple if it is a good fit for your application, and I do recommend that you learn to use transactions effectively if you are using Postgres. Perhaps in a future post, I’ll write a bit about picking an isolation level for your postgres transactions.

  1. See for example, some of my comments in the github issue thread on this topic, and the reddit thread which is referenced in the issue.

  2. There is the WITH HOLD option for keeping a cursor open after a transaction commits, but this just runs the cursor to completion, storing the data in a temporary table. Which might occasionally be acceptable in some contexts, but is definitely not streaming.

2014 March 27

Announcing postgresql-simple-0.4.2

Filed under: Uncategorized — lpsmith @ 7:46 pm

With 19 new releases, postgresql-simple has seen substantial development since my announcement of version 0.3.1 nearly a year ago. Compared to my last update, the changes are almost exclusively bugfixes and new features. Little code should break as a result of these changes, and most if not all of the code that does break should fail at compile time.

As it stands today, postgresql-simple certainly has the best support for postgres-specific functionality of anything on hackage. So, for a few highlights:

Parametrized Identifiers

Thanks to contributions from Tobias Florek, perhaps the most exciting change for many is that postgresql-simple now properly supports parametrized identifiers, including column, table, and type names. These are properly escaped via libpq. So for example, you can now write:

query conn "SELECT * FROM ?"
              (Only ("schema.table" :: QualifiedIdentifier))

Of course, this particular example is still potentially very insecure, but it shouldn’t be possible to create a SQL injection vulnerability this way.
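For intuition, the core of the escaping that makes this safe is standard SQL double-quoting. What follows is only a toy sketch of that rule; the real work is done by libpq’s PQescapeIdentifier, which also accounts for the client encoding:

```haskell
-- Toy sketch of SQL identifier quoting: wrap the name in double
-- quotes and double any embedded double-quote characters. An
-- attacker-supplied name thus stays a (strange) identifier and
-- cannot terminate the quoting and inject SQL.
escapeIdentifier :: String -> String
escapeIdentifier name = '"' : concatMap esc name ++ "\""
  where
    esc '"' = "\"\""
    esc c   = [c]
```

So a malicious "table name" is rendered harmless, even though letting users pick table names remains a bad idea for other reasons.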

The downside of this change is that postgresql-simple now requires libpq version 9.0 or later. However, community support for 8.4 is coming to an end this July. Also, it is possible to use newer versions of libpq to connect to older versions of postgres, so you don’t have to upgrade your server. (In fact, in one particular situation I’m still using postgresql-simple to connect to postgresql 8.1, although some features such as non-builtin types don’t work.)

Copy In and Copy Out support

While the postgresql-libpq binding has supported COPY FROM STDIN and COPY TO STDOUT for some time, postgresql-simple now supports these directly without having to muck around with postgresql-libpq calls via the Internal module.

If you are interested in streaming data to and from postgres, you may also be interested in higher-level COPY bindings for pipes and io-streams. This is available in Oliver Charles’s pipes-postgresql-simple, which also supports cursors, and my own unreleased postgresql-simple-streams, which also supports cursors and large objects.

Out-of-box support for JSON, UUID, and Scientific

Thanks to contributions from Bas van Dijk, postgresql-simple now provides FromField and ToField instances for aeson’s Value type, Antoine Latter’s uuid, as well as scientific.

Savepoints and nestable folds

Thanks to contributions from Joey Adams, postgresql-simple now has higher-level support for savepoints in the Transaction module, as well as the ability to nest the fold operator. Previously, postgresql-simple had assigned a static name to the cursor that underlies fold, and now every connection has a counter used to generate temporary names.

Parametrized VALUES expressions

While executeMany and returning already support common use cases of this new feature, the new Values type allows you to parameterize more than just a single VALUES table literal.

For example, these situations commonly arise when dealing with writable common table expressions. Let’s say we have a table of things with an associated table of attributes:

CREATE TABLE things (
   id   SERIAL NOT NULL,
   name TEXT NOT NULL,
   PRIMARY KEY (id)
);

CREATE TABLE attributes (
   id    INT  NOT NULL REFERENCES things,
   key   TEXT NOT NULL,
   value INT8 NOT NULL
);

Then we can populate both a thing and its attributes with a single query, returning the id generated by the database:

query conn [sql|
    WITH thing AS (
        INSERT INTO things (name) VALUES (?) RETURNING id
    ), newattrs AS (
        INSERT INTO attributes
            SELECT, a.*
              FROM thing CROSS JOIN ? a
    )
    SELECT id FROM thing;
  |]  ("bob", Values [ "text", "int8" ]
                     [ ( "foo"  , 42    )
                     , ( "bar"  , 60    ) ])

The empty case is also dealt with correctly; see the documentation for details.

Improved support for outer joins

A long standing challenge with postgresql-simple is dealing with outer joins in a sane way. For example, with the schema above, let’s say we want to fetch all the things along with their attributes, whether or not any given thing has any attributes at all. Previously, we could write:

getAllTheThings :: Connection -> IO [(Text, Maybe Text, Maybe Int64)]
getAllTheThings conn = do
    query_ conn [sql|
        SELECT name, key, value
          FROM things LEFT OUTER JOIN attributes
            ON =
      |]

Now, the columns from the attributes table are not nullable, so normally we could avoid the Maybe constructors; however, the outer join changes that. Since both of these columns are not nullable, they are always either both null or both not null, which is an invariant not captured by the type. And a separate Maybe for each column gets awkward to deal with, especially when more columns are involved.

What we would really like to do is change the type signature to:

getAllTheThings :: Connection -> IO [Only Text :. Maybe (Text, Int64)]

And now we can, and it will just work! Well, almost. The caveat is that there is a separate instance FromRow (Maybe ...) for most of the provided FromRow instances. This won’t work with your own FromRow instances unless you also declare a second instance. What’s really desired is a generic instance:

instance FromRow a => FromRow (Maybe a)

This instance would return Nothing if all the columns that would normally be consumed are null, and attempt a full conversion otherwise. This would reduce code bloat and repetition, and improve polymorphism and compositionality.
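The intended semantics can be sketched over an already-fetched row of nullable columns. This is a hypothetical standalone function, much simpler than a real RowParser, just to pin down the behavior:

```haskell
-- Sketch of the desired FromRow (Maybe a) behavior: if every column
-- the inner conversion would consume is NULL, succeed with Nothing
-- (an outer-join miss); otherwise attempt the full conversion, which
-- may itself fail.
nullableGroup :: ([Maybe String] -> Maybe a)  -- the inner row conversion
              -> [Maybe String]               -- the raw column values
              -> Maybe (Maybe a)
nullableGroup convert cols
    | all (== Nothing) cols = Just Nothing
    | otherwise             = Just <$> convert cols
```

Note that a row with a mix of null and non-null columns falls through to the inner conversion, which then fails on the unexpected nulls, exactly the behavior you want when the all-or-nothing invariant is violated.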

But alas, it’s not possible to define such an instance without changing the FromRow interface, and quite probably breaking everybody’s FromRow instances. Which I’m totally willing to do, once somebody comes up with a way to do it.

2013 April 27

Announcing postgresql-simple 0.3.1

Filed under: Uncategorized — lpsmith @ 10:49 am

Array types

Postgresql-simple has been progressing since my last announcement of version 0.1 nearly a year ago. Since then there have been many changes by myself and contributors, some of which will break your code with or without compilation errors. So this post will attempt to highlight some of the bigger changes. Probably the most exciting recent change is the new support for PostgreSQL’s array types, largely due to work from Jason Dusek, Bas van Dijk, and myself. Here are two examples of tables and queries that make use of this functionality:

CREATE TABLE assocs
   ( id   INT NOT NULL
   , link INT NOT NULL
   );

CREATE TABLE matrices
   ( id     INT NOT NULL
   , matrix FLOAT8[][]
   );

import qualified Data.Vector as V
import           Database.PostgreSQL.Simple
import           Data.Int(Int64)

getAssocs :: Connection -> IO [(Int,V.Vector Int)]
getAssocs conn = do
    query_ conn "SELECT id, ARRAY_AGG(link) FROM assocs GROUP BY id"

insertMatrix :: Connection -> Int -> V.Vector (V.Vector Double) -> IO Int64
insertMatrix conn id matrix = do
    execute conn "INSERT INTO matrices (id, matrix) VALUES (?,?)" (id, matrix)

TypeInfo Changes

In order to properly support the FromField a => FromField (Vector a) instance, the TypeInfo system was overhauled. Previously, the only information it tracked was a mapping of type OIDs to type names: first by consulting a static table, then a per-connection cache, and finally by querying the pg_type metatable. This was useful for writing FromField instances for PostgreSQL types that do not have a stable OID, such as those provided by an extension, like the period type from Temporal Postgres or the hstore type from the contributed modules bundled with PostgreSQL. However, proper array support required more information, especially the type of the array’s elements. This information is now available in an easily extended data structure, exposed in the new Database.PostgreSQL.Simple.TypeInfo module. This was introduced in 0.3; version 0.3.1 added support for range and composite types, although there is not yet any FromField instance that makes use of this information or deals with these types.

IO at Conversion Time

Version 0.3 also stopped pre-computing the type name of every column and storing them in a vector before converting the results, by allowing a restricted set of IO actions at conversion time. This is a win because, in the common case, the type name is never consulted; almost all of the out-of-box FromField instances examine the type OID alone. Also, since this information no longer has to be computed before conversion takes place, it becomes practical to use information that would rarely be needed in normal circumstances, such as turning table OIDs into table names when errors are encountered. It’s possible to extend the IO actions available to FromField and FromRow instances by accessing the data constructor of the Conversion monad via the Database.PostgreSQL.Simple.Internal module.

This required changing the type of the FromField.typename operator, which will break your FromField instances that use it. It also required small changes to the FromField and FromRow interface, which has a chance of breaking some of your FromField and FromRow instances if they don’t strictly use the abstract Applicative and/or Monad interfaces. However, all of this breakage should be obvious; if your code compiles, it should work with the new interface.

HStore support

Version 0.3.1 also introduced out-of-box support for hstore. The hstore type provides key-value maps from textual strings to textual strings. Conversions to/from lists and Data.Map are provided, while conversions from other Haskell types can easily be implemented via the HStoreBuilder interface (similar to my own json-builder package), and conversions to other Haskell types can easily be implemented via the conversion to lists.

CREATE TABLE hstore_example
    ( id  INT NOT NULL
    , map hstore
    );

insertHStore :: Connection -> Int -> [(Text,Text)] -> IO Int64
insertHStore conn id map = do
    execute conn "INSERT INTO hstore_example (id,map) VALUES (?,?)" (id, HStoreList map)

retrieveHStore :: Connection -> Int -> IO (Maybe [(Text,Text)])
retrieveHStore conn id = do
    xs <- query conn "SELECT map FROM hstore_example WHERE id = ?" (Only id)
    case xs of
      [] -> return Nothing
      (Only (HStoreList val):_) -> return (Just val)

Better Error Messages

Jeff Chu and Leonid Onokhov have improved both error messages and error handling options in the latest version. Thanks to Jeff, the ResultError exception now includes the column name and associated table OID (if any) from which the column was taken. And Leonid has contributed a new Errors module that can be used to dissect SqlError values in greater detail.

Better Time Conversions

And in older news, version 0.1.4 debuted brand new time parsers and printers for the ISO-8601 syntax flavor that PostgreSQL emits, included FromField instances for LocalTime, and introduced new datatypes for dealing with PostgreSQL’s time infinities. Among other things, the new parsers correctly handle timestamps with UTC offsets of a whole number of minutes, which means (for example) that postgresql-simple now works in India. Version 0.2 removed the conversion from timestamp (without time zone) to the UTCTime and ZonedTime types, due to the inherent ambiguity that conversion represents; LocalTime is now the preferred way of handling timestamps (without time zones).

2012 July 10

Announcing split-channel

Filed under: Uncategorized — lpsmith @ 11:44 pm

The split-channel package is a new library that is a small variation on Control.Concurrent.Chan. The most obvious change is that it splits the channel into sending and receiving ports. This has at least two advantages: first, it enables the type system to more finely constrain program behavior; and second, a SendPort can have zero ReceivePorts associated with it, and messages written to such a channel can be garbage collected.

This library started life last fall as part of my experiments in adding support for PostgreSQL’s asynchronous notifications to Chris Done’s native pgsql-simple library. The initial motivation was that if a notification arrived and nobody was listening, I wanted to be able to garbage collect it. However, the type advantages are what keep me coming back.

Beyond the primary change, this library has a number of other small improvements over Control.Concurrent.Chan: the deprecated thread-unsafe functions aren’t there, and several operators have been added or improved, most notably listen, sendMany, fold, and split.

  1. listen attaches a new ReceivePort to an existing SendPort. By contrast, Chan only provides the ability to duplicate an existing ReceivePort.

    Edit: I was mistaken: listen is essentially equivalent to dupChan, whereas duplicate is new.

  2. sendMany sends a list of messages atomically. It’s a better name than writeList2Chan, which is not atomic and is only a convenience function written in terms of send. However, writeList2Chan does work on infinite streams, whereas sendMany does not.

  3. fold is a generalization of getChanContents, potentially avoiding some data structures.

  4. split cuts an existing channel into two channels. It gives you back a new ReceivePort associated with the existing SendPort, and a new SendPort associated with the existing ReceivePorts. This is a more general operator than one I’ve used in a few places to transparently swap out backend services.

    Chan does not provide the split operator, though one could be added. However I am skeptical that this is a good idea: it’s just a little too effect-ful for comfort. I think that putting a SendPort in an MVar tends to be a better idea than using split, even though it does introduce another layer of indirection.
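The SendPort-in-an-MVar pattern mentioned above can be sketched with a plain Chan standing in for the port types; the helper names here are hypothetical, and split-channel itself is not required:

```haskell
import Control.Concurrent.Chan
import Control.Concurrent.MVar

-- Senders go through an MVar holding the current channel, so the
-- backend behind it can be swapped without the senders noticing.
send :: MVar (Chan a) -> a -> IO ()
send ref msg = readMVar ref >>= \ch -> writeChan ch msg

-- Install a new backend channel, returning the old one so any
-- messages still queued on it can be drained or forwarded.
swapBackend :: MVar (Chan a) -> Chan a -> IO (Chan a)
swapBackend ref newCh = modifyMVar ref (\oldCh -> return (newCh, oldCh))
```

The indirection costs one MVar read per send, but keeps the swap an ordinary, local effect rather than something baked into the channel itself.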

Finally, a few acknowledgements are in order: primarily, Control.Concurrent.Chan and its authors and contributors, and secondarily, Joey Adams for GHC Bug #5870, the fix of which has been incorporated into split-channel.

2012 May 5

Announcing postgresql-simple 0.1

Filed under: Uncategorized — lpsmith @ 3:02 pm

A new release of postgresql-simple is now available. Version 0.1 introduces some breaking changes as well as some significant improvements to the interface. I’ll cover the biggest and best change in this post; for the other changes you should read the changelog.

A few months ago, Ozgun Ataman suggested that I add some convenience code to simplify writing instances of QueryResults. I ended up liking his suggestion so much I decided to dig deeper into postgresql-simple and rebase the implementation around his interface. This allowed me to eliminate some intermediate data structures, and has some other advantages I’ll get to later. I’m rather pleased with the end result; here is a contrast of the old and the new:

class QueryResults a where
    convertResults :: [Field] -> [Maybe ByteString] -> Either SomeException a
instance (Result a, Result b) => QueryResults (a,b) where
    convertResults [fa,fb] [va,vb] = do
        !a <- convert fa va
        !b <- convert fb vb
        return (a,b)
    convertResults fs vs  = convertError fs vs 2

-- RowParser has instances for Applicative, Alternative, and Monad
class FromRow a where
    fromRow :: RowParser a

-- field is exported from Database.PostgreSQL.Simple.FromRow
field :: FromField a => RowParser a

instance (FromField a, FromField b) => FromRow (a,b) where
    fromRow = (,) <$> field <*> field

The new interface not only eliminates a significant amount of syntactic overhead, it also allowed me to automatically force the converted values to WHNF, which eliminates the old caveats about writing QueryResults instances. (Note that there are (still) caveats to writing FromField instances, even if they weren’t documented before.) And perhaps more interestingly, it enables new forms of composition. For example, the Types module defines a pair type for composing FromRow instances as follows:

data a :. b = a :. b 

infixr 3 :.

instance (FromRow a, FromRow b) => FromRow (a :. b) where
   fromRow = (:.) <$> fromRow <*> fromRow

Here’s an example of how you might use this. Let’s pretend we have a simple schema that might be used for a (small) website where people can post stories and rate them from 0-9:

CREATE TABLE users                     (uid int not null, username text   not null);
CREATE TABLE posts   (pid int not null, uid int not null, content  text   not null);
CREATE TABLE ratings (pid int not null, uid int not null, score    float4 not null);

Then, here’s some example boilerplate that could be used to represent (part of) the schema in Haskell:

data User = User { u_uid :: Int, username :: Text }

instance FromRow User where
  fromRow = User <$> field <*> field

data Post = Post { p_pid :: Int, p_uid :: Int,  content :: Text } 

instance FromRow Post where
  fromRow = Post <$> field <*> field <*> field

type Score = Float

And finally, here is an example of how one could get the data out of the database in a form almost ready to generate a web page:

    rows :: [User :. Post :. Only Score] <- 
       query conn [sql|  SELECT u.*, p.*, avg(r.score) as avg_score
                           FROM users as u JOIN posts as p ON u.uid = p.uid
                                JOIN ratings as r ON p.pid = r.pid
                          GROUP BY u.uid, u.username, p.pid, p.uid, p.content
                          ORDER BY avg_score DESC                             |] ()
    forM rows $ \(user :. post :. Only score) -> do
        -- generate the HTML for a single post here

Now of course, this isn’t meant to be a serious sketch of such a post/rating system; there are plenty of performance problems that are perfectly obvious to me, and maybe to you too. This is just to give you some ideas about how this functionality might be used: for example, to deal with (a limited class of) joins, or to deal with a computed column that’s been affixed to the table.

Some unresolved issues at this point would be dealing with natural joins and outer joins. I suspect that an adequate solution to these problems may require more breaking changes to this interface. In any case, multiple people have been asking for this kind of functionality, and I do think this interface is a step in the right direction.

For questions, comments, or suggestions relating to postgresql-simple, I suggest joining database-devel, a new mailing list pertaining not only to postgresql-simple, but also database development in Haskell in general. Also I’m often willing to answer questions if you happen to catch me on Freenode.

2012 February 24

Implementing json-builder (or, how the CommaBuilder got its spots)

Filed under: Uncategorized — lpsmith @ 6:42 am

Json-builder is capable of serializing aeson’s mandatory data structure at almost exactly the same speed as aeson itself. This is no accident, but rather it was a conscious implementation goal; in fact I managed to very substantially improve on aeson’s performance in a few minor corner cases (such as generating character escapes) and then shared what I discovered with Bryan O’Sullivan, who then improved aeson in turn.

One of the more interesting tricks in json-builder is the Monoid instances for Array and Object types. Those instances have been through 4 major revisions, with an interesting pattern of mistakes.

Now, a value of each type is conceptually represented by a Json string without the opening and closing delimiters. So for example, the object row "x" 3 is represented by the string "x":3. The outer braces are added by the ‘toJson’ function when it converts the Object to a value of the ‘Json’ type.

The delimiters are left off because a client can both append and prepend additional rows to the object, and we cannot efficiently remove the opening and closing brace from a Builder. It may help to pretend that an adversary is trying to break and abuse your interface. However, once the object is converted to a Json value, no such adversary can append or prepend rows; after all, a Json value isn’t necessarily even an Object.

However, when two objects are appended, we need to insert a comma between them. One might be tempted to write:

newtype Object = Object Builder

instance Monoid Object where
    mempty                        = Object mempty
    mappend (Object a) (Object b) = Object (a ++ "," ++ b)

However, this is incorrect; in fact, it isn’t even a monoid. An adversary could produce syntactically incorrect Json simply by appending or prepending a mempty value. For example, here’s some sample expressions and the syntactically incorrect fragments they would generate:

   toJson  (row "x" 3 ++ mempty)               {"x":3,}
   mconcat [ row "x" 3, mempty, row "y" 4 ]     "x":3,,"y":4

Now, this adversary wouldn’t need to be very sophisticated, in fact these problems would almost certainly be discovered by a casual programmer. And a friendly programmer would find it excruciatingly painful to work around this broken implementation. But thinking like an adversary tends to lead to the best code; within reason, you want to minimize an adversary’s best move.
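The identity-law failure is easy to check concretely. The following self-contained sketch uses String in place of blaze-builder’s Builder (an assumption made purely so the example runs standalone), and names the broken type BadObject to keep it distinct from the real one:

```haskell
-- A standalone demonstration that unconditional comma insertion
-- violates the monoid identity laws.  String stands in for Builder.
newtype BadObject = BadObject String deriving (Eq, Show)

instance Semigroup BadObject where
    BadObject a <> BadObject b = BadObject (a ++ "," ++ b)

instance Monoid BadObject where
    mempty = BadObject ""

row :: String -> String -> BadObject
row k v = BadObject (show k ++ ":" ++ v)

main :: IO ()
main = do
    -- The identity law requires (x <> mempty) == x, but here a
    -- trailing comma appears:
    print (row "x" "3" <> mempty)
    -- And the double comma from the post's second example:
    print (mconcat [row "x" "3", mempty, row "y" "4"])
```

The first print shows a dangling trailing comma, and the second shows the double comma from the fragments above, confirming that this “Monoid” instance satisfies neither identity law.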

So, we need a way to treat the empty Object specially. Also, we cannot simply append or prepend a comma to every singleton object, as that would leave a superfluous comma at either the beginning or the end of the object.

My first solution was to use the strict state monad to track whether or not we’ve emitted a singleton. Then row knows whether it needs to prepend a comma or not. In essence, my implementation was:

newtype Object = Object (State Bool Builder)

instance Monoid Object where
    mempty                        = Object (return mempty)
    mappend (Object a) (Object b) = Object (liftM2 mappend a b)

row k v = Object $ do
   first <- get
   put False
   let str = toBuilder k ++ ":" ++ toBuilder v
   return $ if first then str else "," ++ str

instance Value Object where
   toJson (Object m) = Json ( "{" ++ evalState m True ++ "}" )

But some months later, as I was fleshing out other aspects of json-builder, it occurred to me that the first solution was overly strict; I couldn’t incrementally serialize something like toJson [1..10^9] for example. So, my second solution was conceptually the following code, albeit the actual code was special-cased and simplified slightly:

newtype Object = Object (ContT Builder (State Bool) ())

instance Monoid Object where
    mempty                        = Object (return ())
    mappend (Object a) (Object b) = Object (a >> b)

row k v = Object $ do
    first <- get
    put False
    if first
      then return ()
      else write ","
    write (toBuilder k ++ ":" ++ toBuilder v)
  where
    write b = mapContT (b ++) (return ())

instance Value Object where
    toJson (Object m) = Json ( "{" ++ run m ++ "}" )
      where run m = evalState (runContT m (\_ -> return mempty)) True

But then I started benchmarking json-builder against aeson. Although I was in the right ballpark, I discovered a noticeable performance gap when serializing objects and vectors. So I started thinking about how to improve the Monoid implementation, and soon realized I really didn’t need a full blown monad. My third implementation was much prettier and closed the performance gap:

data Object = Empty | Comma Builder

instance Monoid Object where
    mempty = Empty
    mappend Empty y = y
    mappend x Empty = x
    mappend (Comma a) (Comma b) = Comma (a ++ "," ++ b)

row k v = Comma (toBuilder k ++ ":" ++ toBuilder v)

instance Value Object where
    toJson Empty     = Json "{}"
    toJson (Comma x) = Json ( "{" ++ x ++ "}" )

But, I soon realized that I had replicated the same mistake from the first solution, namely that this implementation was insufficiently lazy for incremental serialization. Fixing that was a simple matter of applying a standard and well-known code transformation to mappend:

instance Monoid Object where
    mempty = Empty
    mappend Empty     x = x
    mappend (Comma a) x = Comma ( a ++ case x of
                                         Empty   -> mempty
                                         Comma b -> "," ++ b )

However, this transformation reintroduced a performance gap, which was closed by adding a strictness annotation on the data constructor:

data Object = Empty | Comma !Builder
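Putting the whole journey together, here is a self-contained sketch of the final design, with String standing in for Builder (an assumption for the sake of a runnable example) and a Semigroup instance added as modern GHCs require; it lets you verify both the monoid laws and the output format directly:

```haskell
-- The final CommaBuilder, sketched with String in place of Builder.
-- The strictness annotation from the post is kept on the constructor.
data Object = Empty | Comma !String

instance Semigroup Object where
    Empty   <> x = x
    Comma a <> x = Comma ( a ++ case x of
                                  Empty   -> ""
                                  Comma b -> "," ++ b )

instance Monoid Object where
    mempty = Empty

row :: String -> String -> Object
row k v = Comma (show k ++ ":" ++ v)

toJson :: Object -> String
toJson Empty     = "{}"
toJson (Comma x) = "{" ++ x ++ "}"

main :: IO ()
main = do
    putStrLn (toJson (row "x" "3" <> mempty <> row "y" "4"))  -- {"x":3,"y":4}
    putStrLn (toJson (mempty :: Object))                      -- {}
```

Note that mempty now vanishes without a trace on either side of an append, and the lone comma is only ever emitted between two non-empty halves.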

And that my friends, is the story of how the CommaBuilder got its spots.

Data Structure Agnostic JSON Serialization

Filed under: Uncategorized — lpsmith @ 2:19 am

Recently, Johan Tibell wrote a post on serialization APIs in Haskell, and I thought it might be good to mention the approach used in my own json-builder, which I hadn’t previously promoted very widely.

In the post, Johan highlighted the Value data structure mandated by the popular aeson package, and had a little aside:

Aside: even though we’ll serialize Haskell values to and from this type, it would have been reasonable, although perhaps more cumbersome for the users of our API, to skip the Value type entirely and convert our Haskell values directly to and from JSON-encoded ByteStrings.

type Object = HashMap Text Value

type Array = Vector Value

data Value = Object !Object
           | Array !Array
           | String !Text
           | Number !Number
           | Bool !Bool
           | Null

data Number = I !Integer
            | D !Double

Skipping this data structure is exactly what json-builder does. It takes arbitrary data directly to a json string. It’s also efficient, capable of serializing aeson’s data structure with performance identical to aeson itself. It’s also a robust abstraction, meaning that all uses of the basic interface will result in syntactically correct json strings. And json-builder is just as easy to use as aeson’s ToJSON typeclass.

Unfortunately my library does not solve the problem of parsing and processing Json values; there is no analog of the FromJSON typeclass, though I am interested in how one might implement similarly data structure agnostic json parsing.

The basic idea looks exactly the same as the ToJSON class, though it’s currently named Value instead:

class Value a where
    toJson :: a -> Json

Now, Json is an opaque type, which is why this is a robust abstraction. All you can do with it is turn it into a string, or use it to build bigger Json values. (In fact, if you look inside the Internal module, you’ll learn that Json is just a newtype wrapper around blaze-builder.)

Now, to take the example that Johan uses, let’s say you want to provide a Value instance that serializes a Person record into a Json Object. There are serialization instances for Data.Map and Data.HashMap, so you could take Aeson’s approach and build one of those first. Or you could circumvent the abstraction and produce Builders yourself. But what you really want to do is use the JsObject type class:

data Person = Person { name :: !Text, born :: !Int }

instance JsObject Person where
    toObject p = mconcat [
                   "name" `row` name p
                 , "born" `row` born p
                 ]

This code is identical (modulo renaming) to the code that Johan gave to turn a Person into an Aeson structure. What it does is construct an Object value, which is an opaque type that builds a Json Object. Object values have a very simple API: a singleton constructor and a Monoid instance. And you can turn an Object value into a Json value, of course:

instance Value Person where
    toJson p = toJson (toObject p)

Unlike aeson, you have full control over the order in which the object’s fields appear. Unfortunately, json-builder will also happily produce JSON objects with duplicate field names, whereas aeson ensures that field names are unique. Neither issue is likely to be a very big deal in practice.

Now, json-builder has a couple of potentially interesting advantages over aeson. Let’s look at the serialization code for Haskell lists:

instance Value a => JsArray [a] where
    toArray = foldr (\x xs -> element x `mappend` xs) mempty

instance Value a => Value [a] where
    toJson = toJson . toArray

Unlike Aeson’s list serialization, this is a good consumer, and thus can fuse with good producers. So for example, when compiled with optimization, toJson [1..10^9] shouldn’t create a list at all, but rather directly produce a Json list of integers.

Also, this code is incremental even if it doesn’t fuse. It doesn’t need the entire list to start producing the Json string. Aeson, by contrast, marshals the entire list into a Vector before it produces anything.
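The incrementality claim is easy to demonstrate in miniature. In the sketch below, String again stands in for Builder, and commas are handled by a simpler trick than json-builder’s actual Comma machinery (both are assumptions for the sake of a short standalone example). Because the foldr-based toArray is lazy in its tail, we can take a finite prefix of the serialization of an infinite list, which would be impossible if the whole list were marshaled into a Vector first:

```haskell
-- A miniature, self-contained model of foldr-based array
-- serialization.  Each element contributes ",<value>" and the
-- leading comma of the whole result is dropped at the end.
element :: Show a => a -> String
element = show

toArray :: Show a => [a] -> String
toArray = drop 1 . foldr (\x xs -> "," ++ element x ++ xs) ""

toJsonList :: Show a => [a] -> String
toJsonList xs = "[" ++ toArray xs ++ "]"

main :: IO ()
main = do
    putStrLn (toJsonList [1, 2, 3 :: Int])        -- [1,2,3]
    -- Incremental: only the demanded prefix of the infinite
    -- list is ever serialized, so this terminates.
    putStrLn (take 12 (toJsonList [1 :: Int ..])) -- [1,2,3,4,5,6
```

The second putStrLn terminates despite the infinite input, which is exactly the property a strict, marshal-everything-first serializer cannot have.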

Whether either of these advantages means much to real-world applications remains to be seen. I would guess that for most such applications, the structure-agnostic aspects are a bigger win.

This generality doesn’t cost anything over aeson in either serialization speed or ease of use; for example, here’s an instance for Aeson’s data structure:

instance Value Aeson.Value where
    toJson (Object v) = toJson v
    toJson (Array  v) = toJson v
    toJson (String v) = toJson v
    toJson (Number v) = toJson v
    toJson (Bool   v) = toJson v
    toJson  Null      = toJson ()

instance Value Number where
    toJson (I x) = toJson x
    toJson (D x) = toJson x

instance Value a => JsArray (Vector a) where
    toArray = Vector.foldr (\x xs -> element x `mappend` xs) mempty

instance Value a => Value (Vector a) where
    toJson = toJson . toArray

This turns out to be almost exactly equal in performance to the serialization code in aeson. (And perhaps I should add an instance for vector to json-builder.) Take note that you don’t need to use such a simple recursion in either aeson or json-builder. You can easily tweak the serialization of any part of a data structure by calling something other than toJson. For example, say you have a map of Maybe values, and you don’t want to include keys associated with Nothing. (These would normally be rendered as null.) Then you can use this code:

noNothings :: (JsString k, Value v) => Map k (Maybe v) -> Object
noNothings = Map.foldrWithKey f mempty
   where f k mv xs = case mv of
                       Nothing -> xs
                       Just v  -> row k v `mappend` xs

Json-builder only solves half of the problem that aeson solves, but it solves that half in a more flexible and potentially more efficient way without sacrificing ease of use in common cases.
