Melding Monads

2014 November 5

A Brief History of Timekeeping, Part 1: Dates on the Western Calendars

Filed under: time — lpsmith @ 8:13 pm

Time is simple and well understood, right? Well, in a physical sense it's fairly well understood, but not at all simple. Cutting-edge research into atomic clock technology promises an uncertainty of about a second over an interval of several billion years. That is more than accurate enough to expose the minuscule relativistic effects of simply repositioning the clock: in fairly short order, such a clock can detect differences of fractions of a picosecond, caused by the fact that time passes at imperceptibly different rates depending on motion and on small changes in the local gravitational field.

But the topic of these posts is the political vagaries of time, not the physical. The basic concept of time (modulo relativity) is something intuitively understood by almost all human beings, and human cultures have been accurately counting the days for thousands of years. So if understanding fractions of a picosecond is difficult, at least dates should be easy, right?

Unfortunately, not so much. Although the Western Calendar has become the de facto and de jure standard over most of the world, mapping a Western date to a precise day has some complicated and even ambiguous edge cases.

Julius Caesar introduced the Julian Calendar to the Roman Republic around January 1, 45 BC, in part to help resolve some political problems created by the ancient Roman Calendar. A bit over a year later, on March 15, 44 BC, Julius Caesar was assassinated, an act motivated in part by some of the political problems the older calendar had contributed to.

One might ask: exactly how many days have passed from the day Caesar was assassinated until the day you are reading this? While this question probably has a single correct answer, arriving at it is a bit more subtle than most people realize.

The Julian Calendar was essentially version 1.0 of the Western Calendar, and its initial deployment had a few teething problems. There was a bit of confusion surrounding what it meant to have a leap day "every fourth year", and as a result leap days occurred every third year until the error was noticed and corrected by Augustus Caesar and his advisers nearly 40 years later. All leap years were then suspended for 12 years, to be resumed at the proper interval once the accumulated error had been corrected.

We don’t know exactly when those earliest leap years occurred. It does seem as though there is a single probable answer in the case of Caesar’s assassination, as both of the unfalsified candidate solutions for the early leap years lead to the same answer. However, there are many dates over the subsequent decades that could easily refer to one of two days. In any case, sometime around 1 BC to 4 AD, the Julian calendar stabilized, and for nearly 1600 years afterwards, there is an unambiguous mapping of historical dates to days.

However, the average length of a Julian year was longer than the length of the solar year. Over the next 1500 years the dates on the Julian Calendar drifted noticeably out of sync with the solar year, with the spring equinox (as determined by actual astronomical observations) coming some 10 days too early.

In February 1582, Pope Gregory XIII instituted the Gregorian Calendar. Leap years would come every fourth year, as before, but years divisible by 100 would no longer be leap years unless they were also divisible by 400. Also, the new calendar would skip 10 days in order to bring it back in sync with the solar year, with 1582-10-04 followed by 1582-10-15.
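
For the programmatically inclined, the new rule is easy to state as a predicate. Here is a minimal sketch in Haskell (the function name is mine; the time package exports an equivalent isLeapYear from Data.Time.Calendar):

isGregorianLeapYear :: Integer -> Bool
isGregorianLeapYear y
  | y `mod` 400 == 0 = True   -- the century exception is itself skipped every 400 years
  | y `mod` 100 == 0 = False  -- century years are normally not leap years
  | y `mod` 4   == 0 = True   -- otherwise, every fourth year, as before
  | otherwise        = False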

And, if that was the entire story, it wouldn’t be so bad, but unfortunately this reform ushered in several hundred years of ambiguity and confusion due to its uneven adoption. This reform was immediately adopted by the Catholic Church, the Papal States and many Catholic nations, with other Catholic nations moving to the Gregorian Calendar soon afterwards. However, the Gregorian Calendar was resisted by most Protestant and Orthodox nations. For example, Great Britain and her colonies held out until 1752,1 and Russia and Greece held out until 1918 and 1923 respectively, shortly after armed revolutions forced a change of government in those countries.

Then you have the case of the Swedish Calendar, which is notable for its particularly convoluted, mishandled, and then aborted transition from the Julian to the Gregorian calendar, which resulted in a February 30, 1712. Also, in a few locales, including Switzerland and the Czech Republic, some people were using the Julian calendar at the same time others were using the Gregorian calendar.

Thus, accurately identifying a date after 1582 coming from original sources, or any historical date coming from modern sources, can be complicated or even impossible, depending on the precise context (or lack thereof) surrounding the date and its source.

Most dates in modern history books remain as they were originally observed, but there are a few exceptions. For example, the date of George Washington’s birth is observed as 1732-02-22 (Gregorian), even though the American colonies were using the Julian calendar at the time. On the day that Washington was born, the date was actually 1732-02-11 (Julian).

As encouraged by the ISO 8601:2004 standard, computer software often uses the proleptic Gregorian calendar for historical dates, extending the Gregorian calendar backwards in time before its introduction and using it for all date calculations. This includes PostgreSQL’s built-in time types. Glasgow Haskell’s time package supports both the proleptic Julian and Gregorian calendars, though it defaults to the Gregorian Calendar.
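
As a small illustration, here is a sketch using the time package (assuming a version recent enough to provide Data.Time.Calendar.Julian) that checks that the Julian and Gregorian renderings of Washington's birthday mentioned above denote the same underlying day:

import Data.Time.Calendar        (Day, fromGregorian, diffDays)
import Data.Time.Calendar.Julian (fromJulian)

washingtonJulian, washingtonGregorian :: Day
washingtonJulian    = fromJulian    1732 2 11
washingtonGregorian = fromGregorian 1732 2 22

-- Both spellings name the same day, so the difference is zero:
sameDay :: Bool
sameDay = diffDays washingtonGregorian washingtonJulian == 0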

The good news is that as time marched on, things have gotten better. Driven largely by faster methods of communication and transportation, we’ve gone from dates whose interpretation could vary by whole days depending on context, down to a typical context-dependence of a dozen seconds or so. Even ignoring context, we are often more limited by the accuracy of a typical clock, and technologies such as GPS and NTP are often available to help keep clocks in reasonable agreement with atomic time standards.

In the upcoming posts of this two or three part series, we’ll look at the process of reducing this ambiguity, as well as local time, time zones, and how PostgreSQL’s nearly universally misunderstood timestamp with time zone type actually works.


  1. Users with access to a unix machine may be amused by the output of cal 9 1752, and the differences in various man pages for cal and ncal can be informative; take, for example, these two.

2014 March 27

Announcing postgresql-simple-0.4.2

Filed under: Uncategorized — lpsmith @ 7:46 pm

With 19 new releases, postgresql-simple has seen substantial development since my announcement of version 0.3.1 nearly a year ago. Compared to my last update, the changes are almost exclusively bugfixes and new features. Little code should break as a result of these changes, and most if not all of the code that does break should fail at compile time.

As it stands today, postgresql-simple certainly has the best support for postgres-specific functionality of anything on hackage. So, for a few highlights:

Parametrized Identifiers

Thanks to contributions from Tobias Florek, perhaps the most exciting change for many is that postgresql-simple now properly supports parametrized identifiers, including column, table, and type names. These are properly escaped via libpq. So for example, you can now write:

execute conn "DROP TABLE IF EXISTS ? CASCADE"
              (Only ("schema.table" :: QualifiedIdentifier))

Of course, this particular example is still potentially very insecure, but it shouldn’t be possible to create a SQL injection vulnerability this way.

The downside of this change is that postgresql-simple now requires libpq version 9.0 or later. However, community support for 8.4 is coming to an end this July. Also, it is possible to use newer versions of libpq to connect to older versions of postgres, so you don’t have to upgrade your server. (In fact, in one particular situation I’m still using postgresql-simple to connect to postgresql 8.1, although some features such as non-builtin types don’t work.)

Copy In and Copy Out support

While the postgresql-libpq binding has supported COPY FROM STDIN and COPY TO STDOUT for some time, postgresql-simple now supports these directly without having to muck around with postgresql-libpq calls via the Internal module.
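
For the curious, here is a rough sketch of the COPY FROM STDIN side; the table and rows are hypothetical, error handling is omitted, and (as with the other snippets in this post) OverloadedStrings is assumed:

import qualified Data.ByteString.Char8 as B
import           Data.Int (Int64)
import           Database.PostgreSQL.Simple
import           Database.PostgreSQL.Simple.Copy

-- Stream a couple of CSV rows to the server and return the number of
-- rows it reports as copied.
copyAttributes :: Connection -> IO Int64
copyAttributes conn = do
    copy_ conn "COPY attributes (id, key, value) FROM STDIN (FORMAT csv)"
    mapM_ (putCopyData conn) [ B.pack "1,foo,42\n"
                             , B.pack "1,bar,60\n" ]
    putCopyEnd conn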

If you are interested in streaming data to and from postgres, you may also be interested in higher-level COPY bindings for pipes and io-streams. This is available in Oliver Charles’s pipes-postgresql-simple, which also supports cursors, and my own unreleased postgresql-simple-streams, which also supports cursors and large objects.

Out-of-box support for JSON, UUID, and Scientific

Thanks to contributions from Bas van Dijk, postgresql-simple now provides FromField and ToField instances for aeson's Value type, Antoine Latter's uuid, as well as scientific.

Savepoints and nestable folds

Thanks to contributions from Joey Adams, postgresql-simple now has higher-level support for savepoints in the Transaction module, as well as the ability to nest the fold operator. Previously, postgresql-simple had assigned a static name to the cursor that underlies fold, and now every connection has a counter used to generate temporary names.
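
Here is a rough sketch of how withSavepoint can be used; the table names are hypothetical, and the Transaction module's documentation describes the exact error-handling semantics:

import Control.Exception (try)
import Data.Int (Int64)
import Database.PostgreSQL.Simple
import Database.PostgreSQL.Simple.Transaction

-- Record an event, and attempt an optional statistics update inside a
-- savepoint so that a failure there doesn't doom the whole transaction.
recordEvent :: Connection -> IO ()
recordEvent conn =
    withTransaction conn $ do
        _ <- execute_ conn "INSERT INTO events (kind) VALUES ('ping')"
        r <- try $ withSavepoint conn $
               execute_ conn "UPDATE stats SET pings = pings + 1"
        case (r :: Either SqlError Int64) of
          Left  _ -> return ()   -- rolled back to the savepoint; carry on
          Right _ -> return ()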

Parametrized VALUES expressions

While executeMany and returning already support common use cases of this new feature, the new Values type allows you to parameterize more than just a single VALUES table literal.

For example, these situations commonly arise when dealing with writable common table expressions. Let’s say we have a table of things with an associated table of attributes:

CREATE TABLE things (
   id   SERIAL NOT NULL,
   name TEXT NOT NULL,
   PRIMARY KEY (id)
  );

CREATE TABLE attributes (
   id    INT NOT NULL REFERENCES things,
   key   TEXT   NOT NULL,
   value BIGINT NOT NULL
  );

Then we can populate both a thing and its attributes with a single query, returning the id generated by the database:

query conn [sql|
    WITH thing AS (
        INSERT INTO things (name) VALUES (?) RETURNING id
    ), newattrs AS (
        INSERT INTO attributes
            SELECT thing.id, a.*
              FROM thing CROSS JOIN ? a
    )
      SELECT id FROM thing;
  |]  ("bob", Values [ "text", "int8" ]
                     [ ( "foo", 42 )
                     , ( "bar", 60 ) ])

The empty case is also dealt with correctly; see the documentation for details.

Improved support for outer joins

A long standing challenge with postgresql-simple is dealing with outer joins in a sane way. For example, with the schema above, let’s say we want to fetch all the things along with their attributes, whether or not any given thing has any attributes at all. Previously, we could write:

getAllTheThings :: Connection -> IO [(Text, Maybe Text, Maybe Int64)]
getAllTheThings conn = do
    query_ conn [sql|
        SELECT name, key, value
          FROM things LEFT OUTER JOIN attributes
            ON things.id = attributes.id
      |]

Now, the columns from the attributes table are not nullable, so normally we could avoid the Maybe constructors; however, the outer join changes that. And because both columns are not nullable, in the result of the join they are either both null or both not null, which is an invariant not captured by the type. A separate Maybe for each column also gets awkward to deal with, especially when more columns are involved.

What we would really like to do is change the type signature to:

getAllTheThings :: Connection -> IO [Only Text :. Maybe (Text, Int64)]

And now we can, and it will just work! Well, almost. The caveat is that there is a separate instance FromRow (Maybe ...) for most of the provided FromRow instances. This won’t work with your own FromRow instances unless you also declare a second instance. What’s really desired is a generic instance:

instance FromRow a => FromRow (Maybe a)

This instance would return Nothing if all the columns that would normally be consumed are null, and attempt a full conversion otherwise. This would reduce code bloat and repetition, and improve polymorphism and compositionality.

But alas, it’s not possible to define such an instance without changing the FromRow interface, and quite probably breaking everybody’s FromRow instances. Which I’m totally willing to do, once somebody comes up with a way to do it.
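
In the meantime, here is roughly what that hand-written second instance looks like for a user-defined row type. Note that this sketch returns Nothing if any column is null, which only approximates the convention the provided instances follow (Nothing only when every column is null):

import Control.Applicative ((<$>), (<*>))
import Data.Int  (Int64)
import Data.Text (Text)
import Database.PostgreSQL.Simple.FromRow

data Attribute = Attribute Text Int64

instance FromRow Attribute where
    fromRow = Attribute <$> field <*> field

-- The extra instance needed to use Attribute on the nullable side of an
-- outer join:
instance FromRow (Maybe Attribute) where
    fromRow = do
        mkey   <- field
        mvalue <- field
        return (Attribute <$> mkey <*> mvalue)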

2013 April 27

Announcing postgresql-simple 0.3.1

Filed under: Uncategorized — lpsmith @ 10:49 am

Postgresql-simple has been progressing since my last announcement of version 0.1 nearly a year ago. Since then there have been many changes by myself and contributors, some of which will break your code with or without compilation errors. So this post will attempt to highlight some of the bigger changes.

Array types

Probably the most exciting recent change is the new support for PostgreSQL’s array types, largely due to work from Jason Dusek, Bas van Dijk, and myself. Here are two examples of tables and queries that make use of this functionality:

CREATE TABLE assocs
   ( id   INT NOT NULL
   , link INT NOT NULL
   );

CREATE TABLE matrices
   ( id     INT NOT NULL
   , matrix FLOAT8[][]
   );

import qualified Data.Vector as V
import           Database.PostgreSQL.Simple
import           Data.Int(Int64)

getAssocs :: Connection -> IO [(Int,V.Vector Int)]
getAssocs conn = do
    query_ conn "SELECT id, ARRAY_AGG(link) FROM assocs GROUP BY id"

insertMatrix :: Connection -> Int -> V.Vector (V.Vector Double) -> IO Int64
insertMatrix conn id matrix = do
    execute conn "INSERT INTO matrices (id, matrix) VALUES (?,?)" (id, matrix)

TypeInfo Changes

In order to properly support the FromField a => FromField (Vector a) instance, the TypeInfo system was overhauled. Previously, the only information it tracked was a mapping of type OIDs to type names, first by consulting a static table, then a per-connection cache, and finally by querying the pg_type metatable. This was useful for writing FromField instances for postgresql types that do not have a stable OID, such as those provided by an extension, like the period type from Temporal Postgres or the hstore type from the contributed modules bundled with PostgreSQL. However, proper array support required more information, especially the type of the array’s elements. This information is now available in an easily extended data structure, available in the new Database.PostgreSQL.Simple.TypeInfo module. This was introduced in 0.3; 0.3.1 added support for range and composite types, though there is not yet any FromField instance that makes use of this information or deals with these types.
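
To make that concrete, here is a sketch of a FromField instance that consults the type name; the "mood" enum is hypothetical, and a real instance would want more informative error messages:

{-# LANGUAGE DeriveDataTypeable, OverloadedStrings #-}

import Data.Typeable (Typeable)
import Database.PostgreSQL.Simple.FromField

data Mood = Sad | Ok | Happy deriving (Show, Typeable)

instance FromField Mood where
    fromField f mdata = do
        name <- typename f
        if name /= "mood"
          then returnError Incompatible f ""
          else case mdata of
                 Nothing      -> returnError UnexpectedNull f ""
                 Just "sad"   -> return Sad
                 Just "ok"    -> return Ok
                 Just "happy" -> return Happy
                 Just other   -> returnError ConversionFailed f (show other)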

IO at Conversion Time

Version 0.3 also stopped pre-computing the type name of every column and storing these in a vector before converting the results, by allowing a restricted set of IO actions at conversion time. This is a win because the common case is that the typename is never consulted; for example, almost all of the out-of-box FromField instances examine the type OID alone. Also, since this information no longer has to be computed before conversion takes place, it becomes practical to use IO-derived information that would rarely be needed in normal circumstances, such as turning table OIDs into table names when errors are encountered. It’s possible to extend the IO actions available to FromField and FromRow instances by accessing the data constructor of the Conversion monad via the Database.PostgreSQL.Simple.Internal module.

This required changing the type of the FromField.typename operator, which will break your FromField instances that use it. It also required small changes to the FromField and FromRow interface, which has a chance of breaking some of your FromField and FromRow instances if they don’t strictly use the abstract Applicative and/or Monad interfaces. However, all of this breakage should be obvious; if your code compiles, it should work with the new interface.

HStore support

Version 0.3.1 also introduced out-of-box support for hstore. The hstore type provides key-value maps from textual strings to textual strings. Conversions to/from lists and Data.Map are provided, while conversions from other Haskell types can easily be implemented via the HStoreBuilder interface (similar to my own json-builder package), and conversions to other Haskell types can easily be implemented via the conversion to lists.

CREATE TABLE hstore_example
    ( id  INT NOT NULL
    , map hstore
    );

insertHStore :: Connection -> Int -> [(Text,Text)] -> IO Int64
insertHStore conn id map = do
    execute conn "INSERT INTO hstore_example (id,map) VALUES (?,?)" (id, HStoreList map)

retrieveHStore :: Connection -> Int -> IO (Maybe [(Text,Text)])
retrieveHStore conn id = do
    xs <- query conn "SELECT map FROM hstore_example WHERE id = ?" (Only id)
    case xs of
      [] -> return Nothing
      (Only (HStoreList val):_) -> return (Just val)

Better Error Messages

Jeff Chu and Leonid Onokhov have improved both error messages and error handling options in the latest version. Thanks to Jeff, the ResultError exception now includes the column name and the OID of the table (if any) from which the column was taken. And Leonid has contributed a new Errors module that can be used to dissect SqlError values in greater detail.

Better Time Conversions

And in older news, version 0.1.4 debuted brand new time parsers and printers for the ISO-8601 syntax flavor that PostgreSQL emits, included FromField instances for LocalTime, and introduced new datatypes for dealing with PostgreSQL’s time infinities. Among other things, the new parsers correctly handle timestamps with UTC offsets of a whole number of minutes, which means (for example) that postgresql-simple now works in India. Version 0.2 removed the conversion from timestamp (without time zone) to the UTCTime and ZonedTime types, due to the inherent ambiguity that conversion represents; LocalTime is now the preferred way of handling timestamps (without time zones).

2012 July 10

Announcing split-channel

Filed under: Uncategorized — lpsmith @ 11:44 pm

The split-channel package is a new library that is a small variation on Control.Concurrent.Chan. The most obvious change is that it splits the channel into sending and receiving ports. This has at least two advantages: first, it enables the type system to more finely constrain program behavior; and second, a SendPort can have zero ReceivePorts associated with it, in which case messages written to the channel can be garbage collected.
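
Here is a tiny sketch of the intended usage, written from memory, so the module and operation names (Control.Concurrent.Chan.Split, new, send, receive) should be checked against the haddocks:

import Control.Concurrent (forkIO)
import Control.Concurrent.Chan.Split
import Control.Monad (forM_, replicateM_)

-- The producer only ever sees the SendPort; the consumer only the ReceivePort.
main :: IO ()
main = do
    (sp, rp) <- new
    _ <- forkIO $ forM_ [1 .. 10 :: Int] (send sp)
    replicateM_ 10 (receive rp >>= print)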

This library started life last fall as part of my experiments in adding support for PostgreSQL’s asynchronous notifications to Chris Done’s native pgsql-simple library. The initial motivation was that if a notification arrived and nobody was listening, I wanted to be able to garbage collect it. However, the type advantages are what keep me coming back.

Beyond the primary change, this library has a number of other small improvements over Control.Concurrent.Chan: the deprecated thread-unsafe functions aren’t there, and several operators have been added or improved, most notably listen, sendMany, fold, and split.

  1. listen attaches a new ReceivePort to an existing SendPort. By contrast, Chan only provides the ability to duplicate an existing ReceivePort.

    Edit: I was mistaken: listen is essentially equivalent to dupChan, whereas duplicate is new.

  2. sendMany sends a list of messages atomically. It’s a better name than writeList2Chan, which is not atomic and is only a convenience function written in terms of send. However, writeList2Chan does work on infinite streams, whereas sendMany does not.

  3. fold is a generalization of getChanContents, potentially avoiding some data structures.

  4. split cuts an existing channel into two channels. It gives you back a new ReceivePort associated with the existing SendPort, and a new SendPort associated with the existing ReceivePorts. This is a more general operator than one I’ve used in a few places to transparently swap out backend services.

    Chan does not provide the split operator, though one could be added. However I am skeptical that this is a good idea: it’s just a little too effect-ful for comfort. I think that putting a SendPort in an MVar tends to be a better idea than using split, even though it does introduce another layer of indirection.

Finally, a few acknowledgements are in order: primarily, Control.Concurrent.Chan and its authors and contributors, and secondarily, Joey Adams for GHC Bug #5870, the fix of which has been incorporated into split-channel.

2012 May 5

Announcing postgresql-simple 0.1

Filed under: Uncategorized — lpsmith @ 3:02 pm

A new release of postgresql-simple is now available. Version 0.1 introduces some breaking changes as well as some significant improvements to the interface. I’ll cover the biggest and best change in this post; for other changes you should read the changelog.

A few months ago, Ozgun Ataman suggested that I add some convenience code to simplify writing instances of QueryResults. I ended up liking his suggestion so much I decided to dig deeper into postgresql-simple and rebase the implementation around his interface. This allowed me to eliminate some intermediate data structures, and has some other advantages I’ll get to later. I’m rather pleased with the end result; here is a contrast of the old and the new:

class QueryResults a where
    convertResults :: [Field] -> [Maybe ByteString] -> Either SomeException a
 
instance (Result a, Result b) => QueryResults (a,b) where
    convertResults [fa,fb] [va,vb] = do
        !a <- convert fa va
        !b <- convert fb vb
        return (a,b)
    convertResults fs vs  = convertError fs vs 2



-- RowParser has instances for Applicative, Alternative, and Monad
class FromRow a where
    fromRow :: RowParser a

-- field is exported from Database.PostgreSQL.Simple.FromRow
field :: FromField a => RowParser a

instance (FromField a, FromField b) => FromRow (a,b) where
    fromRow = (,) <$> field <*> field

The new interface not only eliminates a significant amount of syntactic overhead, it also allowed me to automatically force the converted values to WHNF, which eliminates the old caveats about writing QueryResults instances. (Note that there are (still) caveats to writing FromField instances, even if they weren’t documented before.) And perhaps more interestingly, it enables new forms of composition. For example, the Types module defines a pair type for composing FromRow instances as follows:

data a :. b = a :. b 

infixr 3 :.

instance (FromRow a, FromRow b) => FromRow (a :. b) where
   fromRow = (:.) <$> fromRow <*> fromRow

Here’s an example of how you might use this. Let’s pretend we have a simple schema that might be used for a (small) website where people can post stories and rate them from 0-9:

CREATE TABLE users                     (uid int not null, username text   not null);
CREATE TABLE posts   (pid int not null, uid int not null, content  text   not null);
CREATE TABLE ratings (pid int not null, uid int not null, score    float4 not null);

Then, here’s some example boilerplate that could be used to represent (part of) the schema in Haskell:

data User = User { u_uid :: Int, username :: Text }

instance FromRow User where
  fromRow = User <$> field <*> field

data Post = Post { p_pid :: Int, p_uid :: Int,  content :: Text } 

instance FromRow Post where
  fromRow = Post <$> field <*> field <*> field

type Score = Float

And finally, here is an example of how one could get the data out of the database in a form almost ready to generate a web page:

    rows :: [User :. Post :. Only Score] <-
       query_ conn [sql|  SELECT u.*, p.*, avg(r.score) as avg_score
                            FROM users as u
                            JOIN posts   as p ON u.uid = p.uid
                            JOIN ratings as r ON p.pid = r.pid
                           GROUP BY u.uid, u.username, p.pid, p.uid, p.content
                           ORDER BY avg_score DESC                            |]
    forM rows $ \(user :. post :. Only score) -> do
        -- generate the HTML for a single post here

Now of course, this isn’t meant to be a serious sketch of such a post/rating system; there are plenty of performance problems that are perfectly obvious to me, and maybe to you too. This is just to give you some ideas about how this functionality might be used: for example, to deal with (a limited class of) joins, or to deal with a computed column that’s been affixed to the table.

Some unresolved issues at this point would be dealing with natural joins and outer joins. I suspect that an adequate solution to these problems may require more breaking changes to this interface. In any case, multiple people have been asking for this kind of functionality, and I do think this interface is a step in the right direction.

For questions, comments, or suggestions relating to postgresql-simple, I suggest joining database-devel, a new mailing list pertaining not only to postgresql-simple, but also database development in Haskell in general. Also I’m often willing to answer questions if you happen to catch me on Freenode.

2012 February 24

Implementing json-builder (or, how the CommaBuilder got its spots)

Filed under: Uncategorized — lpsmith @ 6:42 am

Json-builder is capable of serializing aeson’s mandatory data structure at almost exactly the same speed as aeson itself. This is no accident, but rather it was a conscious implementation goal; in fact I managed to very substantially improve on aeson’s performance in a few minor corner cases (such as generating character escapes) and then shared what I discovered with Bryan O’Sullivan, who then improved aeson in turn.

One of the more interesting tricks in json-builder is the Monoid instances for Array and Object types. Those instances have been through 4 major revisions, with an interesting pattern of mistakes.

Now, a value of each type is conceptually represented by a Json string without the opening and closing delimiters. So for example, the expression row "x" 3 is represented by the string "x":3. The outer braces are added by the toJson function when it converts the Object to a value of the Json type.

The delimiters are left off because a client can both append and prepend additional rows to the object, and we cannot efficiently remove the opening and closing brace from a Builder. It may help to pretend that an adversary is trying to break and abuse your interface. However, once the object is converted to a Json value, no such adversary can append and prepend rows; after all, a Json value isn’t necessarily even an Object.

However, when two objects are appended, we need to insert a comma between them. One might be tempted to write:

newtype Object = Object Builder

instance Monoid Object where
    mempty                        = Object mempty
    mappend (Object a) (Object b) = Object (a ++ "," ++ b)

However, this is incorrect; in fact, it isn’t even a monoid. An adversary could produce syntactically incorrect Json simply by appending or prepending a mempty value. For example, here’s some sample expressions and the syntactically incorrect fragments they would generate:

   toJson  (row "x" 3 ++ mempty)               {"x":3,}
   mconcat [ row "x" 3, mempty, row "y" 4 ]     "x":3,,"y":4

Now, this adversary wouldn’t need to be very sophisticated, in fact these problems would almost certainly be discovered by a casual programmer. And a friendly programmer would find it excruciatingly painful to work around this broken implementation. But thinking like an adversary tends to lead to the best code; within reason, you want to minimize an adversary’s best move.

So, we need a way to treat the empty Object specially. Also, we cannot simply append or prepend a comma to every singleton object, as that would leave a superfluous comma either at the beginning or the end of the object.

My first solution was to use the strict state monad to track whether or not we’ve emitted a singleton. Then row knows whether it needs to prepend a comma or not. In essence, my implementation was:

newtype Object = Object (State Bool Builder)

instance Monoid Object where
    mempty                        = Object (return mempty)
    mappend (Object a) (Object b) = Object (liftM2 (++) a b)

row k v = Object $ do
   first <- get
   put False
   let str = toBuilder k ++ ":" ++ toBuilder v
   return $ if first then str else "," ++ str

instance Value Object where
   toJson (Object m) = Json ( "{" ++ evalState m True ++ "}" )

But some months later, as I was fleshing out other aspects of json-builder, it occurred to me that the first solution was overly strict; I couldn’t incrementally serialize something like toJson [1..10^9] for example. So, my second solution was conceptually the following code, albeit the actual code was special-cased and simplified slightly:

newtype Object = Object (ContT Builder (State Bool) ())

instance Monoid Object where
    mempty                        = Object (return ())
    mappend (Object a) (Object b) = Object (a >> b)

row k v = Object $ do
    first <- get
    put False
    if first
      then return ()
      else write ","
    write (toBuilder k ++ ":" ++ toBuilder v)
  where
    write b = mapContT (fmap (b ++)) (return ())

instance Value Object where
    toJson (Object m) = Json ( "{" ++ run m ++ "}" )
      where run m = evalState (runContT m (\_ -> return mempty)) True

But then I started benchmarking json-builder against aeson. Although I was in the right ballpark, I discovered a noticeable performance gap when serializing objects and vectors. So I started thinking about how to improve the Monoid implementation, and soon realized I really didn’t need a full blown monad. My third implementation was much prettier and closed the performance gap:

data Object = Empty | Comma Builder

instance Monoid Object where
    mempty = Empty
    mappend Empty y = y
    mappend x Empty = x
    mappend (Comma a) (Comma b) = Comma (a ++ "," ++ b)

row k v = Comma (toBuilder k ++ ":" ++ toBuilder v)

instance Value Object where
    toJson Empty     = Json "{}"
    toJson (Comma x) = Json ( "{" ++ x ++ "}" )

But, I soon realized that I had replicated the same mistake from the first solution, namely that this implementation was insufficiently lazy for incremental serialization. Fixing that was a simple matter of applying a standard and well-known code transformation to mappend:

instance Monoid Object where
    mempty = Empty
    mappend Empty     x = x
    mappend (Comma a) x = Comma ( a ++ case x of
                                         Empty   -> mempty
                                         Comma b -> "," ++ b )

However, this transformation reintroduced a performance gap, which was closed by adding a strictness annotation on the data constructor:

data Object = Empty | Comma !Builder

And that my friends, is the story of how the CommaBuilder got its spots.

Data Structure Agnostic JSON Serialization

Filed under: Uncategorized — lpsmith @ 2:19 am

Recently, Johan Tibell wrote a post on serialization APIs in Haskell, and I thought it might be good to mention the approach used in my own json-builder, which I hadn’t previously promoted very widely.

In the post, Johan highlighted the Value data structure mandated by the popular aeson package, and had a little aside:

Aside: even though we’ll serialize Haskell values to and from this type, it would have been reasonable, although perhaps more cumbersome for the users of our API, to skip the Value type entirely and convert our Haskell values directly to and from JSON-encoded ByteStrings.

type Object = HashMap Text Value

type Array = Vector Value

data Value = Object !Object
           | Array !Array
           | String !Text
           | Number !Number
           | Bool !Bool
           | Null

data Number = I !Integer
            | D !Double

Skipping this data structure is exactly what json-builder does. It takes arbitrary data directly to a json string. It’s also efficient, capable of serializing aeson’s data structure with performance identical to aeson itself. It’s also a robust abstraction, meaning that all uses of the basic interface will result in syntactically correct json strings. And json-builder is just as easy to use as aeson’s ToJSON typeclass.

Unfortunately my library does not solve the problem of parsing and processing Json values; there is no analog of the FromJSON typeclass, though I am interested in how one might implement similarly data structure agnostic json parsing.

The basic idea looks exactly the same as the ToJSON class, though it’s currently named Value instead:

class Value a where
    toJson :: a -> Json

Now, Json is an opaque type, which is why this is a robust abstraction. All you can do with it is turn it into a string, or use it to build bigger Json values. (In fact, if you look inside the Internal module, you’ll learn that Json is just a newtype wrapper around blaze-builder.)

Now, to take the example that Johan uses, let’s say you want to provide Value instance that serializes a Person record into a Json Object. Now, there are serialization instances for Data.Map and Data.HashMap, so you could take Aeson’s approach and build one of those first. Or you could circumvent the abstraction and produce Builders yourself. But what you really want to do is use the JsObject type class:

data Person = Person { name :: !Text, born :: !Int }

instance JsObject Person where
    toObject p = mconcat [
                   "name" `row` name p
                 , "born" `row` born p
                 ]

This code is identical (modulo renaming) to the code that Johan gave to turn a Person into an Aeson structure. What it does is construct an Object value, which is an opaque type that builds a Json Object. Object values have a very simple API: a singleton constructor (row) and a Monoid instance. And you can turn an Object value into a Json value, of course:

instance Value Person where
    toJson p = toJson (toObject p)

Unlike aeson, you have full control over the order in which the object’s fields appear. Unfortunately, json-builder will also happily produce JSON objects with duplicate field names, whereas aeson ensures that field names are unique. Neither issue is likely to be a very big deal in practice.

Now, json-builder has a couple of potentially interesting advantages over aeson. Let’s look at the serialization code for Haskell lists:

instance Value a => JsArray [a] where
    toArray = foldr (\x xs -> element x `mappend` xs) mempty

instance Value a => Value [a] where
    toJson = toJson . toArray

Unlike Aeson’s list serialization, this is a good consumer, and thus can fuse with good producers. So for example, when compiled with optimization, toJson [1..10^9] shouldn’t create a list at all, but rather directly produce a Json list of integers.

Also, this code is incremental even if it doesn’t fuse. It doesn’t need the entire list to start producing the Json string. Aeson, by contrast, marshals the entire list into a Vector before it produces anything.

Whether either of these advantages means much to real-world applications remains to be seen. I would guess that for most such applications, the structure-agnostic aspects are a bigger win.

This generality doesn’t cost anything over aeson in either serialization speed or ease of use; for example, here’s an instance for Aeson’s data structure:

instance Value Aeson.Value where
    toJson (Object v) = toJson v
    toJson (Array  v) = toJson v
    toJson (String v) = toJson v
    toJson (Number v) = toJson v
    toJson (Bool   v) = toJson v
    toJson  Null      = toJson ()

instance Value Number where
    toJson (I x) = toJson x
    toJson (D x) = toJson x

instance Value a => JsArray (Vector a) where
    toArray = Vector.foldr (\x xs -> element x `mappend` xs) mempty

instance Value a => Value (Vector a) where
    toJson = toJson . toArray

This turns out to be almost exactly equal in performance to the serialization code in aeson. (Perhaps I should add an instance for vector to json-builder.) Take note that you don’t need to use such a simple recursion in either aeson or json-builder; you can easily tweak the serialization of any part of a data structure by calling something other than toJson. For example, say you have a map of Maybe values, and you don’t want to include keys associated with Nothing. (These would normally be rendered as null.) Then you can use this code:

noNothings :: (JsString k, Value v) => Map k (Maybe v) -> Object
noNothings = Map.foldrWithKey f mempty
   where f k mv xs = case mv of
                       Nothing -> xs
                       Just v  -> row k v `mappend` xs

Json-builder only solves half of the problem that aeson solves, but it solves that half in a more flexible and potentially more efficient way without sacrificing ease of use in common cases.

2011 December 30

Announcing postgresql-simple

Filed under: Uncategorized — lpsmith @ 5:04 am

I’ve done a bit of database programming off and on over the years, and when I started into a larger project and decided to use Haskell and PostgreSQL, I didn’t understand exactly how bad of a choice Haskell was for PostgreSQL development at the time. So postgresql-simple and postgresql-libpq are hopefully a small step towards remedying the situation. This work is a fork of Bryan O’Sullivan’s mysql-simple library and Grant Monroe’s libpq binding, with some code informed by Chris Done’s pgsql-simple library, and I’d like to thank them all for their terrific work.

So, there are five things that, taken together, distinguish postgresql-simple from the other PostgreSQL options on hackage:

1. postgresql-simple uses libpq, PostgreSQL’s official client library for C

Chris Done’s pgsql-simple is a pure Haskell implementation of a small subset of the PostgreSQL frontend/backend protocol. I do think there is technical merit to a native Haskell implementation, and in particular I’ve identified a potentially significant advantage that cannot be achieved via libpq. However writing a native implementation is a significant undertaking, and such efforts invariably lag behind in terms of features. And though pgsql-simple is experimental software that hasn’t been officially released and I don’t want to be unfair to it, pgsql-simple performs badly. I sped up a program I had been using to test pgsql-simple by over a factor of 20 simply by switching to postgresql-simple.

Among other things using libpq means that postgresql-simple gets the full range of connection and authentication options out of box, including support for Unix Domain Sockets, SSL, and GSSAPI.

2. postgresql-simple uses libpq-based string escaping.

HSQL doesn’t provide parameterized SQL statements at all, requiring programmers to dynamically generate sql and handle escaping issues themselves. And experience has shown punting on this issue is unacceptable, as dynamically generating SQL is risky and forcing programmers to handle escaping issues is an order of magnitude worse.

HDBC and pgsql-simple support parameterized SQL, but they attempt to do the escaping themselves. The problem is that this is usually a bit more subtle than it first appears, leading to bugs and potential vulnerabilities. For example, in PostgreSQL, escape syntax depends on the value of the standard_conforming_strings server setting, which libpq will detect and accommodate accordingly.

3. postgresql-simple makes no pretense of supporting prepared statements

In PostgreSQL, a prepared statement allows one to send off a parameterized sql statement to a backend, and then re-use that statement as many times as you like by just filling in the parameters. This means the text of the prepared statement does not repeatedly traverse the wire, the backend does not parse the sql query multiple times, the backend re-uses the query plan generated when the statement was prepared, and you get to use protocol-level parameters which can eliminate the need for escaping strings and converting numbers to base-10 and back.

Note that prepared statements are very often, but not universally, advantageous. Starting from nothing, they require two round trips to get any information out, so preparation can be disadvantageous for one-shot queries. Also, occasionally re-planning a query can be advantageous, so infrequently executed queries can be harmed by preparation as well.

HDBC nominally supports prepared statements, but in fact all HDBC-postgresql does is cache some of the preprocessing of the query. No preparation in the database sense ever occurs. So while the lack of support for prepared statements is a definite disadvantage, neither does postgresql-simple pretend to prepare statements when it doesn’t.

4. postgresql-simple provides a simple API that most programmers would find familiar

Takusen appears to be the one PostgreSQL database access library for Haskell that gets basic implementation details more-or-less correct. Unfortunately it exports an esoteric API that is not applicable in all situations. In particular, web applications often use various forms of connection caching or pooling, which is fundamentally incompatible with the deterministic connection resource guarantees provided by Takusen.

Also, the *-simple libraries have a relatively nice interface compared to HDBC. Ultimately that was the breaking point that caused me to spend the time to create postgresql-simple; no one database library did everything I needed, and while working on some new code where my mangled fork of HDBC made the most sense, I realized I really wished I was using pgsql-simple instead. Also, I paid attention to ensure that application code could add support for user-defined PostgreSQL types with a minimum amount of fuss and without modifying the library, something that neither HDBC nor pgsql-simple could really do.

5. Support for Listen/Notify

Listen/Notify is the perfect solution when you want to write a program that responds to changes made to a PostgreSQL database. Informing these programs when changes are available consumes less resources and provides lower latencies than periodic polling. And since you can use a rule or trigger to send the notification, these notifications can be robust; you don’t have to assume that the program making the changes even knows that anybody wants to be informed of the changes. Listen/notify is aware of and consistent with transactions; notifications don’t get sent until the transaction commits. Even if you are willing to put that kind of logic in the database clients, using listen/notify solves the otherwise sticky problem of finding and coordinating with the other clients.

Nothing else on hackage supports asynchronous notifications, unless you use a low-level libpq binding directly.
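
As a rough sketch of what the client side looks like (the channel name is hypothetical; see the Notification module for the details):

import Control.Monad (forever)
import Database.PostgreSQL.Simple
import Database.PostgreSQL.Simple.Notification

-- Subscribe to a channel and block until notifications arrive.
waitForChanges :: Connection -> IO ()
waitForChanges conn = do
    _ <- execute_ conn "LISTEN table_changed"
    forever $ do
        n <- getNotification conn
        print (notificationChannel n, notificationData n)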

Looking Forward

Creating a mid-level database access library based on libpq is a significant undertaking, and there is an awful lot that isn’t supported, including prepared statements, binary data formats, copy in/copy out support, more and better support for PostgreSQL data types, and PostGIS support. Many other things could be improved, such as using libpq asynchronously, better end-to-end typechecking that discovers more errors at compile time, and other interface improvements. For example, SqlQQ is a simplistic quasiquoter to improve the literal syntax for multi-line SQL queries; one could certainly imagine extending the syntax to support including a $haskellvariable as a SQL parameter instead of going through the syntactic indirection of ? in the SQL string with haskellvariable in a separate Haskell parameter.

Funnily enough, I ran across this thread in which PostgreSQL luminaries were complaining about the quality of Python’s PostgreSQL libraries. And Python is considerably ahead of Haskell in this regard.

2011 October 24

Concurrency and Foreign Functions in the Glasgow Haskell Compiler

Filed under: haskell — lpsmith @ 12:59 pm

I had an informative chat with Simon Marlow and Jules Bean a little while ago about concurrency in the Glasgow Haskell Compiler and its interaction with foreign code, specifically the difference between safe and unsafe calls. It turns out that the implementation is both simpler and more powerful than my vague misconceptions about how it worked. And since there seems to be quite a few myths out there, I thought I should write up an overview of GHC’s implementation.

Haskell’s concurrency model is relatively simple. Conceptually, Haskell threads are OS threads, and in most cases the programmer can treat them as such. This is known as a 1-1 threading model. In actuality, GHC’s implementation uses an M-N model, but provides an illusion of a 1-1 model. As one would expect, there are a few caveats to this illusion, but they tend to be minor. First, we’ll cover three fundamental components of GHC’s concurrency:

Haskell Threads

Haskell threads are implemented in GHC via cooperative multitasking. They are scheduled by GHC’s run-time system, making them the M of the M-N model. Yield points are generated automatically by the compiler, which provides an illusion of preemptive multitasking to the programmer. Also, the stack for each thread starts small but grows as needed. Cooperative multitasking and growable stacks make Haskell threads cheap and efficient, typically scaling to millions of threads on contemporary computer systems.

The downside is that GHC’s threads cannot by themselves take advantage of multiple CPUs. Also, foreign functions don’t cooperatively multitask. Notably, since GHC implements its arbitrary-precision Integer datatype via the GNU Multiple Precision Library, the illusion of preemptive multitasking can be less than perfect when executing Integer-heavy code.

To make use of Haskell threads, all you have to do is create them by calling Control.Concurrent.forkIO. There are no special compile-time, link-time, or run-time options needed to enable them.

Operating System Threads

OS threads offer true preemptive multitasking and are scheduled by the kernel. They are also more expensive, typically scaling to thousands of threads on contemporary systems. However, they can execute on multiple CPUs, and they provide a way to deal with foreign calls that either block or take a long time to execute.

To use OS threads, all you need to do is link your executable with GHC’s threaded runtime system using the -threaded option. There is nothing particularly special you need to do inside Haskell code to utilize operating system threads; GHC’s threaded runtime will endeavor to maintain the illusion that Haskell threads are OS threads as best it can.

Capabilities

In the context of GHC, a capability is an operating system thread that is able to execute Haskell code. GHC schedules Haskell threads among the capabilities, making them the N of the M-N model. Previous versions of GHC only supported one capability, but GHC now supports as many capabilities as you want. While the total number of capabilities remains fixed, the collection of OS threads that are capabilities changes over time.

You can set the number of capabilities “x” by using the -N[x] run-time option. This can be done either by passing the executable +RTS -N[x] on the command line, or by setting the default RTS options by linking the executable with -with-rtsopts="-N[x]". Note that you will need the -threaded and may need the -rtsopts link-time options.

The Foreign Function Interface

The FFI supports two kinds of calls: safe and unsafe. A “safe” call blocks the calling OS thread and Haskell thread until the call finishes. Blocking these is unavoidable. However, a safe call does not block the capability. When GHC performs a safe call, it performs the call inside the current OS thread, which is a capability. The capability then moves to another OS thread. If no other threads are available, GHC will transparently create a new OS thread. OS threads are pooled to try to avoid creating or destroying them too often.

An unsafe call blocks the capability in addition to blocking the OS and Haskell threads. This was a pretty big deal when GHC only supported a single capability. In effect, an unsafe call would block not only the Haskell thread that made the call, but every Haskell thread. This gives rise to the myth that unsafe calls block the entire Haskell runtime, which is no longer true. An unsafe foreign call that blocks is still undesirable, but depending on configuration, may not be as bad as it used to be.

The advantage of unsafe calls is that they entail significantly less overhead. Semantically, a foreign function that might call back into Haskell must be declared as a “safe” call, whereas a foreign function that will never call back into Haskell may be declared as “unsafe”. As a result, safe calls must perform extra bookkeeping.

(As an aside, this vocabulary is not particularly consistent with Haskell’s other use of “safe” and “unsafe”. In this other sense, foreign calls are unsafe because they can be used to break the abstractions that GHC provides.)

In practice, most foreign functions will never call back into Haskell. Thus choosing whether to mark a foreign function as safe or unsafe usually revolves around the trade-off between concurrency and overhead. If needed, you can declare both a safe and an unsafe binding to the same foreign function.
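
For example, the two declarations might look like this (a minimal sketch using sin from the C math library):

{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.Types (CDouble(..))

-- Two bindings to the same C function. In a real program the safe variant
-- would be reserved for calls that can block or run for a long time.
foreign import ccall safe   "math.h sin" c_sin_safe   :: CDouble -> CDouble
foreign import ccall unsafe "math.h sin" c_sin_unsafe :: CDouble -> CDouble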

Bound threads

Finally, the most significant caveat to GHC’s 1-1 threading illusion is that some foreign code (notably OpenGL) expects to be run in a single OS thread, due to the use of thread-local storage or the like.

GHC’s answer is “bound threads”. GHC provides forkOS, which creates a Haskell thread with a special property: every foreign call that the Haskell thread makes is guaranteed to happen inside the same OS thread. It is perhaps a bad choice of name, as there is no need to call forkOS to utilize operating system threads. The only time you need to call forkOS is when you need an honest-to-goodness 1-1 correspondence between a Haskell thread and a kernel thread.
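
A quick way to see the distinction is Control.Concurrent.isCurrentThreadBound, which reports whether the calling Haskell thread is bound to a particular OS thread; a small sketch (compiled with -threaded):

import Control.Concurrent

main :: IO ()
main = do
    done <- newEmptyMVar
    _ <- forkIO $ do
        b <- isCurrentThreadBound
        putStrLn ("forkIO thread bound? " ++ show b)   -- False
        putMVar done ()
    takeMVar done
    _ <- forkOS $ do
        b <- isCurrentThreadBound
        putStrLn ("forkOS thread bound? " ++ show b)   -- True
        putMVar done ()
    takeMVar done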

Finishing up

These kinds of illusions are a common theme in GHC: what appears to be synchronous blocking IO is in fact asynchronous non-blocking IO, what appears to be preemptive multitasking is in fact cooperative multitasking, what appears to be 1-1 threading is actually M-N threading. GHC endeavors to present the user with a nice interface, while using more efficient techniques behind the scenes. And, for the most part, these illusions hold up pretty well.

I hope the community finds this overview helpful. Most of what I’ve talked about is covered in “Extending the Haskell Foreign Function Interface with Concurrency” by Simon Marlow, Simon Peyton Jones, and Wolfgang Thaller. I hope this post is a useful introduction to the paper.

2010 August 25

Fun with the Lazy State Monad, Part 2

Filed under: Uncategorized — lpsmith @ 5:52 am

In a previous post, I demonstrated how to use the lazy state monad to turn Geraint Jones and Jeremy Gibbon’s breadth-first labeling algorithm into a reusable abstraction, and mentioned that I did not know how to combine the result with continuations. In this post, I will do exactly that.

A hat tip goes to Twan van Laarhoven for suggesting that I split my fresh operator into two, which I adopt in this post. In the comments, down is defined as:

type Fresh n a = State [n] a

down :: Fresh n a -> Fresh n a
down m = do
    (n:ns) <- get
    put ns
    a <- m
    ns' <- get
    put (n:ns')
    return a

Now, if we naïvely apply the continuation transformer in the Monad Transformer Library, the get and put operators are lifted as follows:

type FreshCC r n a = ContT r (State [n]) a

lift :: Monad m => m a -> ContT r m a
lift m = ContT (m >>=)

down :: FreshCC r n a -> FreshCC r n a
down m = do
    (n:ns) <- lift get
    lift (put ns)
    a <- m
    ns' <- lift get
    lift (put (n:ns'))
    return a

The problem is that this replaces the >>= of the lazy state monad with the >>= of the continuation transformer, and these two operators have different strictness properties. This, in turn, leads to non-termination. The trick is not to lift get and put, but rather conjugate down:

lower :: Monad m => ContT a m a -> m a
lower m = runContT m return

down :: FreshCC a n a -> FreshCC r n a
down m = lift $ do
    (n:ns) <- get
    put ns
    a <- lower m
    ns' <- get
    put (n:ns')
    return a

-- *or*

down = lift . original_down . lower

Now, the conjugation preserves the use of the lazy state monad’s >>= in the definition of down, however it changes the type of the argument from FreshCC r n a to FreshCC a n a! The other definitions contained in the previous post stay much the same.

Feel free to download freshcc.hs, a full working demonstration of this post. One can even use callCC, fresh, and down in the same computation and terminate! Sadly, I don’t have any demonstrations of this combination, nor do I have any applications in mind. My intuition about callCC is currently quite limited in this context.

I have implemented Dan Friedman’s angel-devil-milestone control operators in conjunction with fresh and down, and have used it to duplicate labels and change the shape of a spirited tree; but I’m definitely cheating at this point, as all I have done is observe that the definitions compute something. I have no control over what’s going on, nor do I know what the definitions are supposed to compute. (yet)
