Tuesday, May 3, 2016

Abusing -XOverloadedStrings to Implement Type Directed Parsing

The -XOverloadedStrings GHC extension is a wonderful hack to fix Haskell's somewhat inefficient default string type. The idea is simple. When we turn on -XOverloadedStrings all the string literals go from type [Char] to type IsString a => a. Then, when type inference demands that the literal be specified to a given type, like say, Data.Text.Text or just regular old [Char], GHC invokes the fromString function defined in the IsString typeclass. It can feel a little ugly when you wish all your strings were just Text to begin with, but it is a pretty elegant solution to having multiple representations of strings.

Haskeller's love implementing DSLs. We have lightweight monadic DSLs and heavy duty embedded DSLs. You can write a DSL with combinators or with Template Haskell (and its cousin QuasiQuotes), but both approaches have problems. If you decide on a combinator library, your language will be constrained by Haskell's (admittedly flexible) syntax, and if you decide to go with TH or QuasiQuotes you have to deal with the Q monad. Template Haskell is powerful, but it is a bit like a chainsaw; all that power comes with a lot of potential to cut yourself.

The Wyvern language improves on this state of affairs by providing type directed parsing (there is also some stuff about white space, but the marriage between the type checker and the parser is the juicy bit). You can read the paper for a more in depth discussion of what is going on, but the basic idea is that you can leave holes in your program (denoted with a ~ in Wyvern) and then provide some DSL code to fill in those holes. Type inference figures out what type the hole should have via the usual method, and finds a DSL which produces a thing of the appropriate type. Then it applies the parser for that DSL and gets back an object of the appropriate type which it splices into the hole.

What does all of this have to do with -XOverloadedStrings? Well type directed parsing is just a way to use the type system to select a specific type of operation (in this case parsing). That sounds a whole lot like what Haskell's typeclasses do. IsString is a typeclass. You see where this is going.

Let's write a regex DSL using type directed parsing. First we need our handy dandy GHC extension.

 {-# LANGUAGE OverloadedStrings #-}  

We are also going to need access to the IsString typeclass.

 import Data.String  

You know how I said we were going to write a regex DSL? That sounds hard. Let's let someone else do the work.

 import Text.Regex  

Now comes the meat of the post. You ready?

 instance IsString Regex where  
   fromString = mkRegex  

That's it. We've just defined a type directed parser for regular expressions. If we fire up ghci we get the following.

 > :set -XOverloadedStrings  
 > matchRegex "h(el)lo" "helloworld"  
 Just ["el"]  

Remember to tell ghci that we are no longer using vanilla Haskell or it will whine about expecting a Regex when it could only find a [Char]. This is great! By abusing one common language feature we can get the convenience and guarantees of type directed parsing. Assuming GHC recognized the opportunity for optimization "h(el)lo" will be pre-compiled to a Regex; no more mucking about with Strings at runtime. Type directed parsing in this manner is less powerful than a full blown code gen system like TH, but it is much safer. We maintain hygiene, while getting most of the benefits of a TH DSL.

This is fun. Let's do URIs.

 {-# LANGUAGE FlexibleInstances #-}  
 import Network.URI  
 instance IsString (Maybe URI) where  
   fromString = parseURI  

We need the -XFlexibleInstances extension in order to define an instance for (Maybe URI). This solution is a little less neat than the regular expression one. You can always parse a string as a regex, so there is no need to worry about any error conditions. On the other hand URIs have plenty of error conditions. The only thing to do is to ask the user to handle parse errors at runtime. Still, it's always nice to get as far away from stringly typed programming as possible.

It would be really nice if support for these sort of custom literals was a first class use case rather than a nice side effect of the elegant solution to multiple string representations. We can imagine a class with the following definition.

 class IsLiteral a where  
   fromStringRep :: a -> Either String a  

Then when a hypothetical -XCustomLiterals GHC extension was turned on GHC could behave as it normally does when it comes to -XOverloadedStrings, only it would display an error message (stored in the Left of the Either) if any fromStringRep function failed. This would bring the full benefits of type checking to these little DSLs. Anything to close the OOTA loop. One confusing aspect of abusing -XOverloadedStrings is that our custom literals are delimited by quotes, which is a bit counter intuitive. It would be better if custom literals were delimited by some other symbol like #, $, % or ~.

Next time you sit down to design a DSL think about abusing -XOverloadedStrings. If you don't need the full power of Template Haskell, why take the risk of cutting yourself? You can have your compile-time DSL parsing cake and your type safety cake!

All the code for this post (with a dependencies and everything) can be found here.