JSON Parsing from Scratch in Haskell: Error Reporting

In the previous post we wrote a simple but correct JSON parser in Haskell. The parser was written very naively: if it failed, it returned nothing. You couldn’t tell what the failure was or where it happened. That’s OK for a toy parser but error reporting is an absolute must requirement for all good parsers. So in this post, we’ll add simple but useful error reporting capability to our JSON parser.

This is the second post in a series of posts:

  1. JSON Parsing from Scratch in Haskell
  2. JSON Parsing from Scratch in Haskell: Error Reporting

Introduction

The JSON parser we wrote in the previous post works correctly and passes all tests. However, if we run it with an invalid input, it returns Nothing:

parseJSON :: String -> Maybe JValue

This is a not a very user-friendly parser. In the real-world, we often have to run potentially invalid input through parsers and we expect the parsers to help us out in figuring what is wrong, and sometimes even to keep going by letting us handle or ignore the errors. Different parsers support error handling and reporting to different degrees. Megaparsec is a Haskell parser library which has good support for it. Here’s how a parsing error looks when working with Megaparsec:

1:4:
  |
1 | aaacc
  |    ^
unexpected 'c'
expecting 'a' or 'b'
in foo, in bar

The error report tells us what the parser was expecting and what it got. It also tells us where the error happened. It even tells us the context of the error by telling us that it happened “in foo, in bar”. Such an error report is definitely quite useful to track down the problems with the inputs or even with the parsers1.

Setup

We want to implement similar error reporting for our JSON parser. To be specific, we want the error reporting to tell us:

  • The nature of errors: what is expected and what is wrong.
  • The position of errors in terms of line and column numbers in the input.
  • The context of errors in terms of the JSON syntax.

Here’s how it will look like when we are done:

> printResult $ parseJSON "[{\"c\"\t:\n  \n  \t[\r\"\\g\"]}]"
ERROR:
Invalid escaped character: 'g' at line 3, column 8: ·\t[\r"\g"]}]

→ Expected a string at line 3, column 6: ··\t[\r"\g"]}

→ Expected a JSON value at line 3, column 6: ··\t[\r"\g"]}

→ Expected an array at line 3, column 4: ··\t[\r"\g"

→ Expected an object value at line 1, column 8: {"c"\t:\n

→ Expected an object key-value pair at line 1, column 3: [{"c"\t:\n

→ Expected an object at line 1, column 2: [{"c"\t:

→ Expected a JSON value at line 1, column 2: [{"c"\t:

→ Expected an array at line 1, column 1: [{"c"\t

Adding support for error reporting will be a major code change. We will rely on the property-based tests which we wrote in the previous post to make sure that nothing breaks2. A lot of code though will stay the same. Instead of showing such parts in this post, I’ll link to the relevant sections in the previous post.

To start with, the imports are below.

{-# LANGUAGE DeriveGeneric, TupleSections, LambdaCase, MultiWayIf #-}
module JSONParser where

import Control.Applicative (Alternative(..))
import Control.Monad (replicateM)
import Data.Bits (shiftL)
import Data.Char (isDigit, isHexDigit, isSpace, chr, ord, digitToInt)
import Data.Functor (($>))
import Data.List (intercalate)
import Data.List.Split (split, dropFinalBlank, keepDelimsR, onSublist)
import qualified Data.List.NonEmpty as NEL
import GHC.Generics (Generic)
import Numeric (showHex)
import Prelude hiding (lines)
import Text.Printf (printf)
import Test.QuickCheck hiding (Positive, Negative)

Here’s the JValue data type for a refresher:

data JValue = JNull
            | JBool Bool
            | JString String
            | JNumber { int :: Integer, frac :: [Int], exponent :: Integer }
            | JArray [JValue]
            | JObject [(String, JValue)]
            deriving (Eq, Generic)

The instances for JValue and the JSON generators remain the same.

Adding Error

The old Parser type we defined in the previous post returned Nothing in case of failures:

newtype Parser i o = Parser { runParser :: i -> Maybe (i, o) }

To be able to return errors, we start with creating a new type to capture the possible results of parsing:

data ParseResult a = Error [String] | Result a

ParseResult—the result of parsing—is now either an Error with a list of error messages, or a Result with the result of successful parsing.

Let’s quickly write the various typeclass instances for it:

instance Show a => Show (ParseResult a) where
  show (Result res) = show res
  show (Error errs) = formatErrors (reverse errs)
    where
      formatErrors []         = error "No errors to format"
      formatErrors [err]      = err
      formatErrors (err:errs) =
        err <> delim <> intercalate delim (map (concatMap padNewline) errs)

      delim = "\n→ "
      padNewline '\n' = '\n':replicate (length delim - 1) ' '
      padNewline c    = [c]

instance Functor ParseResult where
  fmap _ (Error errs) = Error errs
  fmap f (Result res) = Result (f res)

instance Applicative ParseResult where
  pure = Result
  Error  errs <*> _   = Error errs
  Result f <*> result = fmap f result

The Show instance shows each error message in the list on its own line, starting with the last message. The results are shown verbatim. The Functor and Applicative instances propagate errors while operating on results as expected.

Let’s print an error in GHCi to get a feel of it:

> print $ Error ["something went wrong", "and we know"]
and we know
→ something went wrong

Now we write a new Parser type which returns ParseResult instead of Maybe3:

newtype Parser1 i o = Parser1 { runParser1 :: i -> ParseResult (i, o) }

And the instances for the new Parser1 type:

instance Functor (Parser1 i) where
  fmap f parser = Parser1 $ fmap (fmap f) . runParser1 parser

instance Applicative (Parser1 i) where
  pure x    = Parser1 $ pure . (, x)
  pf <*> pa = Parser1 $ \input -> case runParser1 pf input of
    Error err        -> Error err
    Result (rest, f) -> fmap f <$> runParser1 pa rest

instance Alternative (Parser1 i) where
  empty = Parser1 $ const $ Error ["Unknown error."]
  parser1 <|> parser2 = Parser1 $ \input ->
    case runParser1 parser1 input of
      Result res -> Result res
      Error _    -> runParser1 parser2 input

instance Monad (Parser1 i) where
  parser >>= f = Parser1 $ \input -> case runParser1 parser input of
    Error errs            -> Error errs
    Result (rest, output) -> runParser1 (f output) rest

The instances are similar to the ones from the previous post, with additional Error propagation. Some work is delegated to the instances of ParseResult.

Next, we write some helper functions to return errors while parsing:

parseError1 :: String -> ParseResult a
parseError1 err = Error [err]

throw1 :: String -> Parser1 String o
throw1 = Parser1 . const . parseError1

And finally, we rewrite our old parsers to return errors on failures instead of returning Nothing:

satisfy1 :: (Char -> Bool) -> (Char -> String) -> Parser1 String Char
satisfy1 predicate mkError = Parser1 $ \case
  (c:cs) | predicate c -> Result (cs, c)
  (c:_)                -> parseError1 (mkError c)
  _                    -> parseError1 "Empty input"

char1 :: Char -> Parser1 String Char
char1 c = satisfy1 (== c) $ printf "Expected '%v', got '%v'" c

string1 :: String -> Parser1 String String
string1 ""     = pure ""
string1 (c:cs) = (:) <$> char1 c <*> string1 cs

satisfy1 is a “higher-order” parser, it takes a function to create the error message—which char1 and other parsers which call satisfy1 pass—to create contextual error messages. The string1 parser stays the same.

At this point, we can play with these parsers in GHCi:

> runParser1 (string1 "abc") "abc"
("","abc")
> runParser1 (string1 "abc") "abx"
Expected 'c', got 'x'
> runParser1 (string1 "abc") ""
Empty input

Great, it works! Let’s try it out by rewriting the JSON bool parser from the previous post:

jBool1 :: Parser1 String JValue
jBool1 =   string1 "true"  $> JBool True
       <|> string1 "false" $> JBool False

Over to GHCi:

> runParser1 jBool1 "trux"
Expected 'f', got 't'

Oops. That error message is wrong. It should have said Expected 'e', got 'x'. Somehow, the message is about t instead of x. What’s going on here?

The Backtracking Problem

The problem is with the backtracking done in the jBool1 parser. When the first parser branch of true is unable to parse the input, it backtracks to the start of the input and tries to parse it with the second branch of false. And then it fails while trying to match f with the first character of the input t, hence the error message. The solution to this is to abandon backtracking and write a Predictive Parser with lookahead.

lookahead1 :: Parser1 String Char
lookahead1 = Parser1 $ \case
  input@(c:_) -> Result (input, c)
  _           -> parseError1 "Empty input"

jBool2 :: Parser1 String JValue
jBool2 = do
  c <- lookahead1
  JBool <$> case c of
    't' -> string1 "true"  $> True
    'f' -> string1 "false" $> False
    _   -> throw1 $
      printf "Expected: 't' for true or 'f' for false; got '%v'" c

The lookahead1 function lets us peek at the first character of the input without consuming it. We use it in the jBool2 function to choose one of the true or false branches categorically, without any backtracking. If the lookahead is neither t or f then we throw an error.

Let’s see if it works:

> runParser1 jBool2 "trux"
Expected 'e', got 'x'
> runParser1 jBool2 "falze"
Expected 's', got 'z'
> runParser1 jBool2 "null"
Expected: 't' for true or 'f' for false; got 'n'

It works as expected. But this signals a big change for all our previous JSON parsers. We’ll need to switch from backtracking to lookahead everywhere. Fortunately, the JSON syntax is such that any JSON input can be parsed unambiguously with lookahead of only one character and we’ll not require any radical changes. But first, let’s figure out how to add position tracking to our parser.

Tracking Position

We need to track the position where our parser is currently at—in terms of line and column numbers in the input text—so that we can include that information in our error messages. One obvious way of doing this is to make the parser stateful. We can do this by layering the StateT monad transformer over the basic Parser monad. Then we can have the current line and column numbers in the state and update them while processing the input in the parsers we write.

But we choose to be more adventurous! We’ll instead use an often talked about but seldom used technique: Zippers4.

Zippers

Quoting the Wikipedia article on Zippers:

A zipper is a technique of representing an aggregate data structure so that it is convenient for writing programs that traverse the structure arbitrarily and update its contents, especially in purely functional programming languages.

Basically, zippers are a special view of data structures, which allow one to navigate and update them easily. A zipper always has a focus or cursor which is the current element of the data structure we are “at”. Alongside, it also captures the rest of the data structure in a way that makes it easy to move around it. We can update the data structure by updating the element at the focus. Let’s take the example of a non-empty list to understand zippers.

List zipper

For the above list, when are “at” or interested in the element 4, the focus of the list zipper is 4. It also contains two lists named left and right to capture the elements of the list left and right of the focus respectively. To move the focus from 4 to 3 on its left, we just need to uncons 3 from the left list, make it the focus and cons 4 to the right list. Here’s the code for list zipper:

data ListZipper a =
  ListZipper {
    lzLeft  :: [a]
  , lzFocus :: a
  , lzRight :: [a]
  } deriving (Show)

list2zipper :: NEL.NonEmpty a -> ListZipper a
list2zipper list = ListZipper [] (NEL.head list) (NEL.tail list)

lzMoveRight :: ListZipper a -> ListZipper a
lzMoveRight (ListZipper l f [])     = ListZipper l f []
lzMoveRight (ListZipper l f (x:xs)) = ListZipper (f:l) x xs

lzMoveLeft :: ListZipper a -> ListZipper a
lzMoveLeft (ListZipper [] f r)     = ListZipper [] f r
lzMoveLeft (ListZipper (x:xs) f r) = ListZipper xs x (f:r)

zipper2list :: ListZipper a -> NEL.NonEmpty a
zipper2list (ListZipper l f r) = NEL.fromList $ reverse l ++ f:r

Let’s see it in action in GHCi:

> lz = list2zipper $ NEL.fromList [1..9]
> lz
ListZipper {lzLeft = [], lzFocus = 1, lzRight = [2,3,4,5,6,7,8,9]}
> lzMoveLeft lz
ListZipper {lzLeft = [], lzFocus = 1, lzRight = [2,3,4,5,6,7,8,9]}
> lzMoveRight lz
ListZipper {lzLeft = [1], lzFocus = 2, lzRight = [3,4,5,6,7,8,9]}
> lzMoveRight $ lzMoveRight lz
ListZipper {lzLeft = [2,1], lzFocus = 3, lzRight = [4,5,6,7,8,9]}
> lzMoveRight $ lzMoveRight $ lzMoveRight lz
ListZipper {lzLeft = [3,2,1], lzFocus = 4, lzRight = [5,6,7,8,9]}
> lz' = lzMoveRight $ lzMoveRight $ lzMoveRight lz
> lzMoveLeft lz'
ListZipper {lzLeft = [2,1], lzFocus = 3, lzRight = [4,5,6,7,8,9]}
> lzMoveLeft $ lzMoveLeft lz'
ListZipper {lzLeft = [1], lzFocus = 2, lzRight = [3,4,5,6,7,8,9]}
> lzMoveLeft $ lzMoveLeft $ lzMoveLeft lz'
ListZipper {lzLeft = [], lzFocus = 1, lzRight = [2,3,4,5,6,7,8,9]}
> NEL.toList $ zipper2list lz'
[1,2,3,4,5,6,7,8,9]

With the understanding of the list zipper, let’s figure out a zipper for our parser input5.

Text Zipper

Though the input to our parser is a String, for the purpose of error reporting, we should think of it as two-dimensional text with rows of lines from top-to-bottom and columns of characters from left-to-right. For this representation, we can devise a zipper as shown in the diagram below:

Text zipper

If we think of our parser moving through this 2D text one character at a time—as a cursor moving through a text document—this zipper structure makes sense. The character just right of the cursor is the current character that the parser is going to consume next. There are some characters to the left of the cursor in the same line which have already been consumed and there are some to the right which are yet to be consumed. Similarly, there are some line above the current line which the parser has already seen and there are some yet unseen lines below the current line. With this view in mind, we can write the code for TextZipper:

data TextZipper a =
  TextZipper {
    tzLeft  :: a
  , tzRight :: a
  , tzAbove :: [a]
  , tzBelow :: [a]
  }

instance Show a => Show (TextZipper a) where
  show (TextZipper left right above below) =
    "TextZipper{left=" <> show left
      <> ", right=" <> show right
      <> ", above=" <> show above
      <> ", below=" <> show below
      <> "}"

textZipper :: [String] -> TextZipper String
textZipper []           = TextZipper "" "" [] []
textZipper (first:rest) = TextZipper "" first [] rest

currentPosition :: TextZipper String -> (Int, Int)
currentPosition zipper =
  (length (tzAbove zipper) + 1, length (tzLeft zipper) + 1)

currentChar :: TextZipper String -> Maybe Char
currentChar zipper = case tzRight zipper of
  []    -> Nothing
  (c:_) -> Just c

lines :: String -> [String]
lines = (split . dropFinalBlank . keepDelimsR . onSublist) "\n"

Finding the current position of the cursor in TextZipper is trivially easy. The current row number is just the count of lines above the current line plus one. Similarly, the current column number is the count of characters left of the cursor plus one. The currentChar function returns the character just right of the cursor, if there’s one present.

The lines function is a slightly modified version of Prelude’s lines function which leaves the newlines (\n) in the output. We do this so that we can report an error if there is a newline at any wrong position like in middle of a JSON string.

Quick trial in GHCi:

> Prelude.lines "some\nnext\n\nmore"
["some","next","","more"]
> lines "some\nnext\n\nmore"
["some\n","next\n","\n","more"]

> text = lines "some text\nnext line\nmore lines"
> text
["some text\n","next line\n","more lines"]
> tz = textZipper text
> tz
TextZipper{left="", right="some text\n", above=[], below=["next line\n","more lines"]}
> currentPosition tz
(1,1)
> currentChar tz
Just 's'

Next, we write functions to move forward and backward in the text zipper:

moveByOne :: TextZipper String -> TextZipper String
moveByOne zipper
  -- not at end of line
  | not $ null (tzRight zipper) =
      zipper { tzLeft  = head (tzRight zipper) : tzLeft zipper
             , tzRight = tail $ tzRight zipper
             }
  -- at end of line but not at end of input
  | not $ null (tzBelow zipper) =
      zipper { tzAbove = tzLeft zipper : tzAbove zipper
             , tzBelow = tail $ tzBelow zipper
             , tzLeft  = ""
             , tzRight = head $ tzBelow zipper
             }
  -- at end of input
  | otherwise = zipper

move :: TextZipper String -> TextZipper String
move zipper = let zipper' = moveByOne zipper
  in case currentChar zipper' of
       Just _  -> zipper'
       Nothing -> moveByOne zipper'

The moveByOne function moves forward in the text zipper by one character. It considers three cases:

  1. When not at the end of line indicated by tzRight not being empty, it moves the cursor by one character in the same line.
  2. When at the end of line but not at the end of input indicated by tzBelow not being empty, it moves the cursor to the beginning of the next line below.
  3. When at the end of the input, it does nothing.

The move function calls moveByOne one or two times to move past end of lines.

The moveBackByOne function is similar to moveByOne except it moves backwards in the zipper:

moveBackByOne :: TextZipper String -> TextZipper String
moveBackByOne zipper
  -- not at start of line
  | not $ null (tzLeft zipper) =
      zipper { tzLeft  = tail $ tzLeft zipper
             , tzRight = head (tzLeft zipper) : tzRight zipper
             }
  -- at start of line but not at start of input
  | not $ null (tzAbove zipper) =
      zipper { tzAbove = tail $ tzAbove zipper
             , tzBelow = tzRight zipper : tzBelow zipper
             , tzLeft  = head $ tzAbove zipper
             , tzRight = ""
             }
  -- at start of input
  | otherwise = zipper

Phew. That was a lot. Let’s try them out in GHCi to build our understanding:

> text = lines "some\nnext\n\nmore"
> text
["some\n","next\n","\n","more"]
> tz = textZipper text
> tz
TextZipper{left="", right="some\n", above=[], below=["next\n","\n","more"]}
> -- demostrating moveByOne
> moveByOne tz -- moves by one char
TextZipper{left="s", right="ome\n", above=[], below=["next\n","\n","more"]}
> f `times` n = (!! n) . iterate f
> moveByOne `times` 1 $ tz -- moves by one char
TextZipper{left="s", right="ome\n", above=[], below=["next\n","\n","more"]}
> moveByOne `times` 4 $ tz -- moves by four chars
TextZipper{left="emos", right="\n", above=[], below=["next\n","\n","more"]}
> moveByOne `times` 5 $ tz -- moves by five chars
TextZipper{left="\nemos", right="", above=[], below=["next\n","\n","more"]}
> moveByOne `times` 6 $ tz -- moves by six chars
TextZipper{left="", right="next\n", above=["\nemos"], below=["\n","more"]}
> -- demostrating move
> tz' = moveByOne `times` 6 $ tz
> tz'
TextZipper{left="", right="next\n", above=["\nemos"], below=["\n","more"]}
> move tz' -- moves by one char
TextZipper{left="n", right="ext\n", above=["\nemos"], below=["\n","more"]}
> move `times` 1 $ tz' -- moves by one char
TextZipper{left="n", right="ext\n", above=["\nemos"], below=["\n","more"]}
> move `times` 4 $ tz' -- moves by four chars
TextZipper{left="txen", right="\n", above=["\nemos"], below=["\n","more"]}
> move `times` 5 $ tz' -- moves by six chars, moving past end of line
TextZipper{left="", right="\n", above=["\ntxen","\nemos"], below=["more"]}
> -- demonstrating moveBackByOne
> tz' = moveByOne `times` 10 $ tz
> tz'
TextZipper{left="txen", right="\n", above=["\nemos"], below=["\n","more"]}
> moveBackByOne tz' -- moves back by one char
TextZipper{left="xen", right="t\n", above=["\nemos"], below=["\n","more"]}
> moveBackByOne `times` 1 $ tz' -- moves back by one char
TextZipper{left="xen", right="t\n", above=["\nemos"], below=["\n","more"]}
> moveBackByOne `times` 3 $ tz' -- moves back by three chars
TextZipper{left="n", right="ext\n", above=["\nemos"], below=["\n","more"]}
> moveBackByOne `times` 4 $ tz' -- moves back by four chars
TextZipper{left="", right="next\n", above=["\nemos"], below=["\n","more"]}
> moveBackByOne `times` 5 $ tz' -- moves back by five chars
TextZipper{left="\nemos", right="", above=[], below=["next\n","\n","more"]}

That works as expected. We are now ready to add position tracking to our error reporting parser.

Zippered Parser

Adding TextZipper to the parser is simple. We just change the input to be of type TextZipper i.

newtype Parser i o = Parser {
    runParser_ :: TextZipper i -> ParseResult (TextZipper i, o)
  }

runParser :: Parser String o -> String -> ParseResult (String, o)
runParser parser input =
  case runParser_ parser (textZipper $ lines input) of
    Error errs             -> Error errs
    Result (restZ, output) -> Result (leftOver restZ, output)
  where
    leftOver tz = concat (tzRight tz : tzBelow tz)

We also change the runParser function to convert the input string into a text zipper and to convert the text zipper for the leftover input back into a string at the end of parsing.

Finally, we rewrite the instances for Parser without any change in logic:

instance Functor (Parser i) where
  fmap f parser = Parser $ fmap (fmap f) . runParser_ parser

instance Applicative (Parser i) where
  pure x    = Parser $ pure . (, x)
  pf <*> pa = Parser $ \input -> case runParser_ pf input of
    Error err        -> Error err
    Result (rest, f) -> fmap f <$> runParser_ pa rest

instance Monad (Parser i) where
  parser >>= f = Parser $ \input -> case runParser_ parser input of
    Error err        -> Error err
    Result (rest, o) -> runParser_ (f o) rest

Notice that there is no Alternative instance of Parser anymore unlike the previous post. This is because we are eschewing the backtracking functionality provided by the Alternative instance for our current parser. This also means that we cannot use any convenience functions provided by Alternative like many, some and optional. But that’s okay because we will not need them when using lookahead.

Now that we have the parser with position tracking and error reporting separately, let’s integrate them together.

Errors with Position

We want to add positions of the errors to the error messages along with a sample text around the error position. The addPosition function does that:

addPosition :: String -> TextZipper String -> String
addPosition err zipper =
  let (ln, cn) = currentPosition zipper
      err'     = printf (err <> " at line %d, column %d: ") ln cn
      left     = reverse $ tzLeft zipper
      right    = tzRight zipper
      left'    = showStr $ drop (length left - ctxLen) left
      right'   = showStr $ take ctxLen right
      line     = left' <> right'
  in printf (err' <> "%s\n%s↑")
            line
            (replicate (length err' + length left') ' ')
  where
    ctxLen = 6
    showStr = concatMap showCharForErrorMsg

showCharForErrorMsg :: Char -> String
showCharForErrorMsg c = case c of
  '\b' -> "\\b"
  '\f' -> "\\f"
  '\n' -> "\\n"
  '\r' -> "\\r"
  '\t' -> "\\t"
  ' '  -> "·"
  _ | isControl c -> "\\" <> show (ord c)
  _ -> [c]

addPosition takes an error message and a text zipper. It finds the current position in the zipper and gets the text around the current position. Then it adds this text and the position in the error message and returns it. It also takes care of replacing characters in a way that makes the error messages more readable. Let’s see it at work in GHCi:

> text = lines "some\nnext\n\nmore"
> text
["some\n","next\n","\n","more"]
> tz = textZipper text
> tz
TextZipper{left="", right="some\n", above=[], below=["next\n","\n","more"]}
> f `times` n = (!! n) . iterate f
> tz' = move `times` 7 $ tz
> tz'
TextZipper{left="en", right="xt\n", above=["\nemos"], below=["\n","more"]}
> putStrLn $ addPosition "Something went wrong" tz'
Something went wrong at line 2, column 3: next\n

It works perfectly. Now we can enhance our error related helper functions to add position info in errors.

parseError :: String -> TextZipper String -> ParseResult a
parseError err zipper = Error [addPosition err zipper]

throw :: String -> Parser String o
throw = Parser . parseError

elseThrow :: Parser String o -> String -> Parser String o
elseThrow parser err = Parser $ \input ->
  case runParser_ parser input of
    Result (rest, a) -> Result (rest, a)
    Error errs       -> Error (addPosition err input : errs)

We also define a parser combinator elseThrow which tries to run the parser given to it and returns an error with position info in case the parser fails. We’ll see it in action soon.

Our new parser with position-ful error reporting is complete now. Next, we rewrite all our parsers to use lookahead as we mentioned before, starting with the basic parsers.

Basic Parsers

We rewrite the parsers lookahead and satisfy to use our new TextZipper based parser.

lookahead :: Parser String Char
lookahead = Parser $ \input -> case currentChar input of
  Just c  -> Result (input, c)
  Nothing -> parseError "Empty input" input

safeLookahead :: Parser String (Maybe Char)
safeLookahead = Parser $ \input -> case currentChar input of
  Just c  -> Result (input, Just c)
  Nothing -> Result (input, Nothing)

satisfy :: (Char -> Bool) -> String -> Parser String Char
satisfy predicate expectation = Parser $ \input -> case currentChar input of
  Just c | predicate c -> Result (move input, c)
  Just c               -> flip parseError input $
    expectation <> ", got '" <> showCharForErrorMsg c <> "'"
  _                    -> flip parseError input $
    expectation <> ", but the input is empty"

char :: Char -> Parser String Char
char c = satisfy (== c) $ printf "Expected '%v'" $ showCharForErrorMsg c

digit :: Parser String Int
digit = digitToInt <$> satisfy isDigit "Expected a digit"

string :: String -> Parser String String
string ""     = pure ""
string (c:cs) = (:) <$> char c <*> string cs

We have an additional function safeLookahead which is like lookahead but instead of throwing a parser error on failure, it returns Nothing. Notice that lookahead and safeLookahead only call currentChar but not move, whereas satisfy calls both of them. This means the two lookahead functions do not consume from the input stream but satisfy does. Other parsers are barely changed. We can exercise them in GHCi to see the new functionalities:

> runParser lookahead "abc"
("abc",'a')
> runParser lookahead ""
Empty input at line 1, column 1:

> runParser safeLookahead "abc"
("abc",Just 'a')
> runParser safeLookahead ""
("",Nothing)
> runParser (string "abh") "abhinav"
("inav","abh")
> runParser (string "abc") "abhinav"
Expected 'c', got 'h' at line 1, column 3: abhinav

> runParser digit "12s"
("2s",1)
> runParser digit "abhinav"
Expected a digit, got 'a' at line 1, column 1: abhina

We get correct results and correct error messages with right error positions. Next, we rewrite all JSON parsers using our new basic parsers to use lookahead instead of backtracking.

JNull and JBool Parsers

The jNull and jBool parsers stay the same as seen before.

jNull :: Parser String JValue
jNull = string "null" $> JNull

jBool :: Parser String JValue
jBool = do
  c <- lookahead
  JBool <$> case c of
    't' -> string "true"  $> True
    'f' -> string "false" $> False
    _   -> throw $
      errorMsgForChar "Expected: 't' for true or 'f' for false; got '%v'" c

errorMsgForChar :: String -> Char -> String
errorMsgForChar err c = printf err $ showCharForErrorMsg c

JString Parser

JSON strings have a complex syntax and the parser for them is going to be the most complex one. First, we write the jsonChar parser to parse a JSON character:

jsonChar :: Parser String (Char, Int)
jsonChar = lookahead >>= \case
  '\\' -> char '\\' *> escapedChar
  _    -> (,1) <$> otherChar
  where
    escapedChar = lookahead >>= \case
      '"'  -> ('"',  2) <$ char '"'
      '\\' -> ('\\', 2) <$ char '\\'
      '/'  -> ( '/', 2) <$ char  '/'
      'b'  -> ('\b', 2) <$ char 'b'
      'f'  -> ('\f', 2) <$ char 'f'
      'n'  -> ('\n', 2) <$ char 'n'
      'r'  -> ('\r', 2) <$ char 'r'
      't'  -> ('\t', 2) <$ char 't'
      'u'  -> (,6) <$> (char 'u' *> unicodeChar)
      c    -> throw $ errorMsgForChar "Invalid escaped character: '%v'" c

    unicodeChar =
      chr . fromIntegral . digitsToNumber 16 0 <$> replicateM 4 hexDigit

    hexDigit = digitToInt <$>
      satisfy isHexDigit "Expected a hex digit"

    otherChar = satisfy (not . isQuoteEscapeOrControl)
      "Did not except '\"', '\\' or control characters"

    isQuoteEscapeOrControl c = c == '\"' || c == '\\' || isControl c

digitsToNumber :: Int -> Integer -> [Int] -> Integer
digitsToNumber base =
  foldl (\num digit -> num * fromIntegral base + fromIntegral digit)

We do two lookaheads in jsonChar6. The first is for checking if the first character is backslash (\). If so, we consume it and parse the rest of the input for a JSON escaped character. Otherwise, we parse the input for a non escaped and non-control character.

When parsing for escaped characters, we do another lookahead and based on it, we consume one of the known escaped characters. Or if the lookahead gives u, we parse the rest of the input for a unicode character represented with four hex digits.

jsonChar also returns the number of characters consumed from the input for the purpose that’ll become clear when we go over the jString parser. We also make sure to throw appropriate errors when parsing fails. Let’s check out jsonChar in GHCi:

> runParser jsonChar "a"
("",('a',1))
> runParser jsonChar "\\b"
("",('\b',2))
> runParser jsonChar "\\u0040"
("",('@',6))
> runParser jsonChar "\\g"
Invalid escaped character: 'g' at line 1, column 2: \g

> runParser jsonChar "\\u0x40"
Expected a hex digit, got 'x' at line 1, column 4: \u0x40

> runParser jsonChar "\0"
Did not except '"', '\' or control characters, got '\0' at line 1, column 1: \0

Both positive and negative tests work fine and the error messages are correct too. Let’s move on to writing the JSON string parser.

jString :: Parser String JValue
jString = JString <$> (char '"' *> jString')

The jString parser simply consumes the leading double quote character of a JSON string and invokes the ancillary parser jString' that does the rest of the parsing.

jString' :: Parser String String
jString' = do
  c <- lookahead `elseThrow` "Expected rest of a string"
  if c == '"'
  then "" <$ char '"'
  else jFirstChar

The jString' parser first does a lookahead, failing which it throws an error. If the lookahead returns the double quote character, then it consumes the character and returns an empty string. Otherwise, it calls the jFirstChar parser to parse the first JSON character.

jFirstChar :: Parser String String
jFirstChar = do
  (first, count) <- jsonChar
  if | not (isSurrogate first) -> (first:) <$> jString'
     | isHighSurrogate first   -> jSecondChar first
     | otherwise               -> do
         pushback count
         throw
           . errorMsgForChar "Expected a high surrogate character, got '%v'"
           $ first

pushback :: Int -> Parser String ()
pushback count = Parser $ \input ->
  Result (iterate moveBackByOne input !! count, ())

jFirstChar calls the jsonChar parser to get the first JSON character and the count of the characters consumed from the input. If the first JSON character is not a unicode surrogate character then it just calls jString' to parse the rest of the string and returns the first character consed with it7.

If the first character is a high surrogate character then it calls jSecondChar to parse the rest of the string. Else, the first character is a low surrogate character, which is an error case as Unicode surrogate pairs must start with a high surrogate character8. We can throw an error to report the same but there is a catch here. Since we have already consumed the first JSON character from the input, throwing an error at this point will report the wrong position. To fix this, we need to move back in the input by the number of characters consumed by the first JSON character. For this purpose, we call a special parser pushback which rewinds the cursor in the input text zipper by calling moveBackByOne the right number of times. After invoking the pushback parser, we throw the error.

jSecondChar :: Char -> Parser String String
jSecondChar first = do
  (second, count) <- jsonChar `elseThrow`
    "Expected a second character of a surrogate pair"
  if isLowSurrogate second
  then (combineSurrogates first second :) <$> jString'
  else do
    pushback count
    throw
      . errorMsgForChar "Expected a low surrogate character, got '%v'"
      $ second

The jSecondChar parser is similar to the jFirstChar parser. If it finds a low surrogate JSON character then it combines the high and the low surrogate characters and calls jString' to parse the rest of the string. Else it pushes back the input by the correct number of characters and throw an error9.

That completes our most complex parser. Let’s test it out in GHCi:

> runParser jString "\"abc\""
("","abc")
> runParser jString "\"abc"
Empty input at line 1, column 5: "abc

→ Expected rest of a string at line 1, column 5: "abc

> runParser jString "\"\\uD834\\uDD1E\""
("","𝄞")
> runParser jString "\"\\uD834\""
Did not except '"', '\' or control characters, got '"' at line 1, column 8: \uD834"

→ Expected a second character of a surrogate pair at line 1, column 8: \uD834"

> runParser jString "\"\\uD834\\u0040\""
Expected a low surrogate character, got '@' at line 1, column 8: \uD834\u0040

> runParser jString "\"\\uDD1E\\uDD1E\""
Expected a high surrogate character, got '?' at line 1, column 2: "\uDD1E

We try out all the success and failure cases. Everything works out right.

JNumber Parser

Numbers in JSON can be in different formats as shown by this syntax. Let’s start by writing the parser for a JSON integer.

jUInt :: Parser String Integer
jUInt = (`elseThrow` "Expected an unsigned integer") $
  lookahead >>= \case
    '0'           -> fromIntegral <$> digit
    c | isDigit c -> digitsToNumber 10 0 <$> digits
    c             -> throw $ printf "Expected a digit, got '%v'" c

jInt :: Parser String Integer
jInt = (`elseThrow` "Expected a signed integer") $
  lookahead >>= \case
    '-' -> negate <$> (char '-' *> jUInt)
    _   -> jUInt

digits :: Parser String [Int]
digits = ((:) <$> digit <*> digits') `elseThrow` "Expected digits"
  where
    digits' = safeLookahead >>= \case
      Just c | isDigit c -> (:) <$> digit <*> digits'
      _                  -> return []

jUint is a parser for an unsigned JSON integer. It starts with a lookahead and matches the character to see if it is a digit. If so it parses one-or-more characters as digits using the digits parser and converts the list of digits to a number. It handles the case of the first digit being zero specially because only JSON number which can begin with zero is zero itself.

jInt adds support for parsing optionally negative signed integers over jUint. Note that there are no positive signed integers in JSON.

The digits parser parses one or more characters into a list of digits as integers. It uses safeLookahead so it stops parsing when it encounters a non-digit character.

Moving on to parsing fractions and exponents:

jFrac :: Parser String [Int]
jFrac = (char '.' *> digits) `elseThrow` "Expected a fraction"

jExp :: Char -> Parser String Integer
jExp c = (char c *> jExp') `elseThrow` "Expected an exponent"
  where
    jExp' = lookahead >>= \case
      '-' -> negate <$> (char '-' *> jUInt)
      '+' -> char '+' *> jUInt
      _   -> jUInt

The jFrac parser parses simply for a dot (.) followed by one or more digits.

The jExp parser parses for a exponent symbol character (e or E) followed by a positive or negative or no sign unsigned integer. It uses jUint to do that. The exponent character is provided by the jNumber parser below.

Finally, the number parser jNumber brings all these together to parse any JSON number:

jNumber :: Parser String JValue
jNumber = do
  i <- jInt
  safeLookahead >>= \case
    Just '.' -> do
      f <- jFrac
      safeLookahead >>= \case
        Just c' | isExpSym c' -> JNumber i f <$> jExp c' -- int+frac+exp
        _                     -> pure $ JNumber i f 0    -- int+frac
    Just c | isExpSym c       -> JNumber i [] <$> jExp c -- int+exp
    _                         -> pure $ JNumber i [] 0   -- int
  where
    isExpSym c = c == 'e' || c == 'E'

The first part of any JSON number is a signed integer, which is followed by an optional fraction part, followed by an optional exponent part. We do two lookaheads to determine if there are following fractional and/or exponent parts and accordingly use the previously defined parsers to parse them. Then we put the parts together depending on the four cases10.

Let’s take jNumber for a spin in GHCi:

> runParser jNumber "0"
("",0)
> runParser jNumber "123"
("",123)
> runParser jNumber "123.1"
("",123.1)
> runParser jNumber "123e-22"
("",123e-22)
> runParser jNumber "123.11E+3"
("",123.11e3)
> runParser jNumber "01"
("1",0)
> runParser jNumber "-a"
Expected a digit, got 'a' at line 1, column 2: -a

→ Expected an unsigned integer at line 1, column 2: -a

→ Expected a signed integer at line 1, column 1: -a

> runParser jNumber "1.a"
Expected a digit, got 'a' at line 1, column 3: 1.a

→ Expected digits at line 1, column 3: 1.a

→ Expected a fraction at line 1, column 2: 1.a

> runParser jNumber "1ex"
Expected a digit, got 'x' at line 1, column 3: 1ex

→ Expected an unsigned integer at line 1, column 3: 1ex

→ Expected an exponent at line 1, column 2: 1ex

> runParser jNumber "[email protected]"
Expected a digit, got '@' at line 1, column 6: [email protected]

→ Expected an unsigned integer at line 1, column 6: [email protected]

→ Expected an exponent at line 1, column 4: [email protected]

Great! We were able to write small parsers for the parts of JSON numbers and combine them together using the Applicative and Monad capabilities to create a parser for any JSON number. This covers all the scalar types in JSON. Coming up are the parser for the two composite types.

JArray Parser

We start with rewriting the helper functions:

surroundedBy :: Parser i a -> Parser i b -> Parser i a
surroundedBy parser1 parser2 = parser2 *> parser1 <* parser2

separatedBy :: Parser String v -> Char -> String -> Parser String [v]
separatedBy parser sepChar errMsg = do
  res <- parser  `elseThrow` errMsg
  safeLookahead >>= \case
    Just c | c == sepChar ->
      (res:) <$> (char sepChar *> separatedBy parser sepChar errMsg)
    _ -> return [res]

spaces :: Parser String String
spaces = safeLookahead >>= \case
  Just c | isWhitespace c -> (:) <$> char c <*> spaces
  _                       -> return ""
  where
    isWhitespace c = c == ' ' || c == '\n' || c == '\r' || c == '\t'

surroundedBy stays the same. separatedBy now takes a separator character instead of a separator parser because we need to match the character with the lookahead. It also takes an error message to throw in case the given parser errors out. Rest of the changes are to convert separatedBy and spaces from Alternative based parsing to lookahead based parsing.

Now, we can rewrite the jArray parser using these helpers:

jArray :: Parser String JValue
jArray = JArray <$> do
  _ <- char '[' <* spaces
  c <- lookahead `elseThrow` "Expected a JSON value or ']'"
  case c of
    ']' -> [] <$ char ']'
    _   -> separatedBy jValue ',' "Expected a JSON value" <* 
              satisfy (== ']') "Expected ',' or ']'"

jArray is now written in a monadic style instead of the earlier applicative style because we need to use lookahead11. First, it consumes the opening bracket ([) and any spaces that follow. Then it does a lookahead and checks to see if it is the closing bracket (]). If so, it consumes the character and returns an empty array.

Otherwise, it recursively parses for one-or-more JSON values separated by comma using the yet to be defined jValue parser, followed by the closing bracket (]). It also takes care of throwing appropriate errors. Let’s check it out in GHCi:

> runParser jArray "" -- empty input
Expected '[', but the input is empty at line 1, column 1:

> runParser jArray "[" -- no closing bracket
Empty input at line 1, column 2: [

→ Expected a JSON value or ']' at line 1, column 2: [

> runParser jArray "[]" -- empty array
("",[])
> runParser jArray "[\t \r \n]" -- empty array with whitespace
("",[])
> runParser jArray "[\t\t  0]" -- one element array with whitespace
("",[0])
> runParser jArray "[0]" -- one element array without whitespace
("",[0])
> runParser jArray "[0,]" -- closing bracket after comma
Unexpected character: ']' at line 1, column 4: [0,]

→ Expected a JSON value at line 1, column 4: [0,]

> runParser jArray "[0,1]" -- two element array
("",[0, 1])
> runParser jArray "[0," -- no element after comma
Empty input at line 1, column 4: [0,

→ Expected a JSON value at line 1, column 4: [0,

> runParser jArray "[      " -- no closing bracket after whitespace
Empty input at line 1, column 8: ······

→ Expected a JSON value or ']' at line 1, column 8: ······

> runParser jArray "[0;1]" -- semicolon instead of comma
Expected ',' or ']', got ';' at line 1, column 3: [0;1]

It works as expected for all positive and negative tests. Moving on to the last composite parser for JSON objects.

JObject Parser

The jObject parser is very similar to the jArray parser: written in a monadic style using lookahead. Only difference is that the surrounding characters are braces ({}) and the recursive parsing is for one-or-more pairs of JSON string and JSON value12.

jObject :: Parser String JValue
jObject = JObject <$> do
  _ <- char '{' <* spaces
  c <- lookahead `elseThrow` "Expected a JSON value or '}'"
  case c of
    '}' -> [] <$ char '}'
    _   -> separatedBy pair ',' "Expected an object key-value pair" <*
             satisfy (== '}') "Expected ',' or '}'"
  where
    pair = (\ ~(JString s) j -> (s, j)) <$> key <*  char ':' <*> value
    key = (jString `surroundedBy` spaces) `elseThrow` "Expected an object key"
    value = jValue `elseThrow` "Expected an object value"

Testing in GHCi:

> runParser jObject "" -- empty input
Expected '{', but the input is empty at line 1, column 1:

> runParser jObject "{" -- no closing brace
Empty input at line 1, column 2: {

→ Expected a JSON value or '}' at line 1, column 2: {

> runParser jObject "{}" -- empty object
("",{})
> runParser jObject "{\t  \n}" -- empty object with whitespace
("",{})
> runParser jObject "{\t  \n \"a\": 1}" -- one element object with whitespace
("",{"a": 1})
> runParser jObject "{\"a\": 1}" -- one element object without whitespace
("",{"a": 1})
> runParser jObject "{\"a\": 1,}" -- closing brace after comma
Expected '"', got '}' at line 1, column 9: a":·1,}

→ Expected an object key at line 1, column 9: a":·1,}

→ Expected an object key-value pair at line 1, column 9: a":·1,}

> runParser jObject "{\"a\":}" -- no value in key-value pair
Unexpected character: '}' at line 1, column 6: {"a":}

→ Expected an object value at line 1, column 6: {"a":}

→ Expected an object key-value pair at line 1, column 2: {"a":}

> runParser jObject "{\"a\": 1," -- no key after comma
Expected '"', but the input is empty at line 1, column 9: a":·1,

→ Expected an object key at line 1, column 9: a":·1,

→ Expected an object key-value pair at line 1, column 9: a":·1,

> runParser jObject "{1}" -- wrong key type
Expected '"', got '1' at line 1, column 2: {1}

→ Expected an object key at line 1, column 2: {1}

→ Expected an object key-value pair at line 1, column 2: {1}

> runParser jObject "{\"a\"}" -- no colon after key
Expected ':', got '}' at line 1, column 5: {"a"}

→ Expected an object key-value pair at line 1, column 2: {"a"}

> runParser jObject "{      " -- no closing brace after whitespace
Empty input at line 1, column 8: ······

→ Expected a JSON value or '}' at line 1, column 8: ······

> runParser jObject "{\"a\": 1;\"b\": 2}" -- semicolon instead of comma
Expected ',' or '}', got ';' at line 1, column 8: "a":·1;"b":·

> runParser jObject "{\"a\"# 1}" -- hash instead of semicolon
Expected ':', got '#' at line 1, column 5: {"a"#·1}

→ Expected an object key-value pair at line 1, column 2: {"a"#·1

That’s a lot of tests! But everything works out. On to the final parser which brings everything together.

JValue Parser

The earlier version of the jValue parser was based on backtracking: it tried to parse the input with parsers for each JSON type one-by-one, trying the next parser in case of failure. Since we have eschewed backtracking in this post for the purpose of having correct error message, we need to replace backtracking with lookahead in the jValue parser too.

jValue :: Parser String JValue
jValue = jValue' `surroundedBy` spaces
  where
    jValue' = lookahead >>= \case
      'n'  -> jNull   `elseThrow` "Expected null"
      't'  -> jBool   `elseThrow` "Expected true"
      'f'  -> jBool   `elseThrow` "Expected false"
      '\"' -> jString `elseThrow` "Expected a string"
      '['  -> jArray  `elseThrow` "Expected an array"
      '{'  -> jObject `elseThrow` "Expected an object"
      c | c == '-' || isDigit c ->
        jNumber `elseThrow` "Expected a number"
      c    -> throw $ printf "Unexpected character: '%v'"
                    $ showCharForErrorMsg c

That turns out to be easier than expected as the JSON syntax is unambiguous. We can always choose the right sub-parser to use by looking at the character lookahead returns. We also sprinkle right error messages for each sub-parser and for the default case. And we are done!

We write the top-level parseJSON function now which runs the jValue parser over the input and return the result as an Either. The printResult function just pretty-prints the result of parseJSON.

parseJSON :: String -> Either String JValue
parseJSON s = case runParser jValue s of
  Result ("", j) -> Right j
  Result (i, _)  -> Left $ "Leftover input: " <> i
  err@(Error _)  -> Left $ show err

printResult :: Either String JValue -> IO ()
printResult = putStrLn . either ("ERROR:\n" <>) (("RESULT:\n" <>) . show)

Finally, the test case promised at the beginning of the post:

> printResult $ parseJSON "[{\"c\"\t:\n  \n  \t[\r\"\\g\"]}]"
ERROR:
Invalid escaped character: 'g' at line 3, column 8: ·\t[\r"\g"]}]

→ Expected a string at line 3, column 6: ··\t[\r"\g"]}

→ Expected a JSON value at line 3, column 6: ··\t[\r"\g"]}

→ Expected an array at line 3, column 4: ··\t[\r"\g"

→ Expected an object value at line 1, column 8: {"c"\t:\n

→ Expected an object key-value pair at line 1, column 3: [{"c"\t:\n

→ Expected an object at line 1, column 2: [{"c"\t:

→ Expected a JSON value at line 1, column 2: [{"c"\t:

→ Expected an array at line 1, column 1: [{"c"\t

> printResult $ parseJSON "[{\"c\"\t:\n  \n  \t[\r\"\\n\"]}]"
RESULT:
[{"c": ["\n"]}]

We also run the property-based tests to make sure that nothing is broken:

> runTests
== prop_genParseJString ==
+++ OK, passed 100 tests.
== prop_genParseJNumber ==
+++ OK, passed 100 tests.
== prop_genParseJArray ==
+++ OK, passed 100 tests.
== prop_genParseJObject ==
+++ OK, passed 100 tests.
== prop_genParseJSON ==
+++ OK, passed 100 tests.

We have successfully added useful error-reporting capabilities to our JSON parser.

Conclusion

We rewrote the JSON parser we wrote in the earlier post to add support for error reporting. In the course of doing that, we learned about Zippers and how to use then to move around within a data structure. We also learned about lookahead based predictive parsing and how it is different from backtracking parsing. The full code for the JSON parser can be found here. You can discuss this post on lobsters, r/haskell, twitter or in the comments below.


  1. Megaparsec provides a lot of other facilities for working with errors. You can define custom errors with custom messages which seamlessly work with the built-in pretty printing. You can catch errors in your parsers and choose to do another thing. You can even report multiple errors in a single run, which is useful when you are writing some sort of validation/inspection tool. This tutorial goes in full depth about all the capabilities of Megaparsec.↩︎

  2. Having unit tests before refactoring your code is always a good idea. Refactoring should be a behavior-preserving code change. Running the tests continuously ensures that the refactoring steps have not changed any behavior of the code. Property-based tests go a step further ahead by literally capturing the behaviors of your code as tests, no hand-written test data required. Though it is not shown in this post, the tests written in the previous post were a huge help to me and caught many edge-cases while adding the error reporting capabilities.↩︎

  3. The suffix 1 has been added to the Parser type and the functions in the following code because this is not the final form of the parser we are going to write. The final ones will not have suffixes.↩︎

  4. Let me reiterate that the parser implementation in this series of posts is for illustrative and learning/teaching purposes only. So it’s okay to have some fun using an interesting technique. A production grade parser will certainly not use Zippers like the way we do.↩︎

  5. The Zippers chapter from the Learn you a Haskell book is a great resource to learn about zippers in detail.↩︎

  6. For comparison: the jsonChar parser which uses backtracking from the previous post.↩︎

  7. We are using the Multi-way If GHC extension here.↩︎

  8. To learn about Unicode surrogate characters and what they have to do with JSON parsing, see the jString parser section in the previous post.↩︎

  9. For comparison: the jString parser from the previous post.↩︎

  10. For comparison: the jNumber parser which uses backtracking from the previous post.↩︎

  11. For comparison: the jArray parser from the previous post.↩︎

  12. For comparison: the jObject parser from the previous post.↩︎

Posted by

Like this post? Subscribe to get future posts by email.

Twitter

Like or Retweet this post on Twitter
Cancel Reply

Got suggestions, corrections, or thoughts? Post a comment!

Markdown is allowed
Email is used just to show an avatar image and is not displayed or stored
Comments are moderated. They will appear below after they are approved.

0 comments

11 Mentions

4 Reposts abhin4vArun Raghavanabhin4vTiago Oliveira