JSON Parsing from Scratch in Haskell
- A forty-one minute read
- 23 comments
- 11 🗣️ 56 ❤️ 23 🔁
JSON is probably the most used standard file format for storing and transmitting data on the internet in recent times. Though it was historically derived from JavaScript, it is a programming language independent format and is now supported by almost all languages. JSON has a simple syntax specification with only four scalar data types and two composite data types. So, writing a parser for JSON is a great exercise for learning the basics of parsing. Let’s write one from scratch in Haskell.
This is the first post in a series of posts:
- JSON Parsing from Scratch in Haskell
- JSON Parsing from Scratch in Haskell: Error Reporting—Part 1
- JSON Parsing from Scratch in Haskell: Error Reporting—Part 2
Introduction
JSON (JavaScript Object Notation) grew out of JavaScript as a way to exchange data between browsers and servers. Douglas Crockford, an American computer programmer and author of the popular book JavaScript: The Good Parts, wrote the first specification of the JSON format. After seeing wide adoption as a general data-exchange format, IETF standardized JSON as an Internet Standard with RFC 7159 and later with RFC 8259. Now JSON is the lingua franca of the Web Service world, both RPC and REST varieties. It has also become a commonly used configuration file format and database storage format. With such broad uses, it is no wonder that almost all programming languages support JSON in some form or another. As of the time of writing this post, the json.org website lists 167 JSON libraries across 60 languages.
Since JSON came out of JavaScript—a dynamically typed, prototype-based object-oriented language with a curly bracket syntax—it heavily borrows the data types and syntax from JavaScript. It has only four scalar data types:
- Null: a null value.
- Boolean: a boolean value.
- String: a string value, a sequence of zero or more Unicode characters.
- Number: a numeric value representing integral and real numbers with support for scientific notation.
Along with these four scalar data types, JSON supports only two composite data types:
- Array: an ordered list of values.
- Object: a collection of name-value pairs.
JSON Syntax
Let’s see how these types are represented syntactically in JSON.
Null and Boolean
The Null value is represented simply by the exact string null
. Boolean data is either truthy or falsey, represented by the exact strings true
and false
respectively.
String
A String in JSON is sequence of zero or more Unicode characters (except Control characters), wrapped in double quotes ("
). Some special characters can be escaped using backslashes. Additionally, all characters can also be represented with their four hex-digit codes prefixed with \u
. This transition diagram depicts the String syntax:
Number
A Number in JSON is represented as a combination of an integral part, a fractional part and an exponent. All these parts are optional but they must follow some rules. For example, these numbers are invalid in JSON:
- 012
- integral part cannot start with 0
- 1.
- fractional part cannot be empty
- .123
- integral part cannot be empty
- 1.23e
- exponent part cannot be empty
And these numbers are valid: 0, 1234, 1.23, 0.222, 1e5, 5E-45, 1.23e9, 1.77E-9.
Whitespace
In JSON, whitespace is a string of zero or more valid whitespace characters which are space ( ), newline (\n
), return (\r
) and tab (\t
).
Array
A JSON Array is an ordered list of zero or more JSON values separated by commas (,
). An array begins with a left bracket ([
) and ends with a right bracket (]
) and may contain whitespace between them if empty.
Object
A JSON Object is a collection of zero or more name-value pairs separated by commas (,
). An object begins with a left brace ({
) and ends with a right brace (}
) and may contain whitespace between them if empty. Names and values are separated by colons (:
) and are optionally surrounded by whitespace.
Value
Finally, a JSON value is a string, or a number, or a boolean, or null, or an object, or an array, surrounded by optional whitespace. As you may have noticed, we referred to JSON values in the Array and Object sections above. Hence, the definition of the JSON syntax is recursive.
Parsing
Parsing1 is the process of taking textual input data and converting it to a data structure—often a hierarchal structure like a parse tree—while checking the input for correct syntax. Parsing is an important first step2 in compilers and interpreters for programming languages to check and convert source text files into internal representations which are used by later steps in the processes. However, parsing is also used for other purposes like converting data from one format to other, for linting, and for pretty-printing. Our use-case here is converting textual JSON data into Haskell’s internal data structures.
{"a": 1, "b": [false, null]}
The syntax of a Language is defined by a set of rules. This set of rules is called the Grammar of the language. If an input does not adhere to a language’s grammar, it is considered incorrect. JSON’s grammar is a Deterministic context-free grammar which is a subset of Context-free grammar3. Many programming languages, markup languages and data-definition languages can be described by context-free grammars4. Being deterministic, JSON grammar will never allow multiple parse trees for the same input.
Parsing is a widely studied field and as such, multiple parsing algorithms have been invented over time. These algorithms differ in the kind of grammars they can parse and their performances. In general, there are three broad categories of parsing algorithms for context-free grammars:
- Universal parsing algorithms like Earley’s algorithm which can parse any grammar. However, these algorithms are generally too slow to be used in real-world settings.
- Top-down parsing algorithms which start at the root level of parse tree and work down to leaf nodes.
- Bottom-up parsing algorithms which start at the leaf nodes of parse tree and work up to the root node.
Both top-down and bottom-up parsers are widely used. In this post, we will implement a Recursive descent parser which is a top-down parser which executes a set of mutually recursive functions to process its input.
Lastly, there are different ways of writing parsers. We can write the entire parser by hand: read the input character-by-character and call different functions depending on the characters read, parsing the input. Or we can use a Parser generator program to generate the code for a parser by providing the language’s grammar.
Alternatively, we can use a Parser combinator system. A parser combinator is a way of combining smaller parsers using higher-order functions to create larger parsers. Let’s say we start with a simple parser to parse one digit. We can then combining it with itself to create a parser to parse a natural number. In the same way, we can start with parsers for constituent parts of a language’s grammar and combine them by following the grammar to create a parser for the whole language. Haskell, with its support for higher-order functional programming, has many good parser combinator libraries but we are going to write one from scratch here5. Here we go!
Setup
We are going to write a simple but correct JSON parser from scratch in Haskell, as a Parser combinator. This parser will be for illustrative and learning/teaching purposes only and will not be for production usage6. As such, we will not care about error handling and reporting, performance or ease of use. Our purpose here is to learn about some basics of parsing, nuances of the JSON syntax, and parser combinators and property-based testing in Haskell. We will use the RFC 8259 document as the reference for the JSON language specification.
To test our parser for correctness, we will use the QuickCheck library. QuickCheck is a Property-based Testing framework. The key idea of property-based testing is to write properties of our code that hold true for any input and then, to automatically generate arbitrary inputs and make sure that that the properties are indeed true for them7. Since we are writing a JSON parser—or rather, several of them for small parts of the JSON syntax—we will generate arbitrary textual data which are valid JSON and we will throw them at our parsers and assert that they work correctly. We will use GHCi, the interactive Haskell REPL, to run the tests. Since we will be writing the parser from scratch, we will not use any libraries other than the base library.
Let’s start by writing the required imports:
{-# LANGUAGE DeriveGeneric, TupleSections, LambdaCase #-}
module JSONParser where
import Control.Applicative (Alternative(..), optional)
import Control.Monad (replicateM)
import Data.Bits (shiftL)
import Data.Char (isDigit, isHexDigit, isSpace, chr, ord, digitToInt)
import Data.Functor (($>))
import Data.List (intercalate)
import GHC.Generics (Generic)
import Numeric (showHex)
import Test.QuickCheck hiding (Positive, Negative)
JSON Data Type
The data type for JSON in Haskell JValue
directly reflects the JSON data types:
data JValue = JNull
| JBool Bool
| JString String
| JNumber { int :: Integer, frac :: [Int], exponent :: Integer }
| JArray [JValue]
| JObject [(String, JValue)]
deriving (Eq, Generic)
The JSON null
type is represented by a singleton value type with the constructor JNull
with no parameters. The JSON boolean type is just a wrapper constructor JBool
over the Haskell Bool
type. Similarly, the JSON string type is a wrapper over the Haskell String
type. The JSON number type is represented as a collection of its integer, fraction and exponent parts. The integer and exponent parts are Haskell Integer
s which are signed integers of unbounded size. Whereas the fraction part is a list of digits represented as Haskell Int
s. A list of digits is needed here because the fraction part can have leading zeros.
The JSON array type is a wrapped Haskell list, with its elements being of any JSON data types. Likewise, the JSON object type is an Association list of Haskell String
and any JSON data type.
Let’s write a Show
instance for the JSON type so that we can easily inspect its values:
instance Show JValue where
show value = case value of
JNull -> "null"
JBool True -> "true"
JBool False -> "false"
JString s -> showJSONString s
JNumber s [] 0 -> show s
JNumber s f 0 -> show s ++ "." ++ concatMap show f
JNumber s [] e -> show s ++ "e" ++ show e
JNumber s f e -> show s ++ "." ++ concatMap show f ++ "e" ++ show e
JArray a -> "[" ++ intercalate ", " (map show a) ++ "]"
JObject o -> "{" ++ intercalate ", " (map showKV o) ++ "}"
where
= showJSONString k ++ ": " ++ show v
showKV (k, v)
showJSONString :: String -> String
= "\"" ++ concatMap showJSONChar s ++ "\""
showJSONString s
isControl :: Char -> Bool
isControl c = c `elem` ['\0' .. '\31']
showJSONChar :: Char -> String
= case c of
showJSONChar c '\'' -> "'"
'\"' -> "\\\""
'\\' -> "\\\\"
'/' -> "\\/"
'\b' -> "\\b"
'\f' -> "\\f"
'\n' -> "\\n"
'\r' -> "\\r"
'\t' -> "\\t"
| isControl c -> "\\u" ++ showJSONNonASCIIChar c
_ -> [c]
_ where
=
showJSONNonASCIIChar c let a = "0000" ++ showHex (ord c) "" in drop (length a - 4) a
We want this Show
instance to show the JSON values as they appear in JSON text data. We do this so that we can reuse this instance to convert JSON values to text, to test our parsers later. Most of the cases are straightforward. For numbers, we handle the empty fraction and zero exponent cases separately, omitting those in the text form. JSON strings however, require some special handing for possible escape sequences and control characters. Note that we do not use Data.Char.isControl
function here to detect the control characters, instead we write our own. This is because the JSON definition of control characters is different from the Haskell one. We show control characters as their four hex-digit representations prefixed by \u
8. Also note that JSON strings are shown with surrounding double-quotes ("
).
A quick test in GHCi confirms that it works fine:
> :{
> json =
> JObject [("a", JNumber 1 [] 0), ("b", JArray [JBool False, JString "\20A"])]
> :}
> print json
{"a": 1, "b": [false, "\u0014A"]}
JSON Generators
As mentioned earlier, let’s write some QuickCheck generators to generate arbitrary JSON text data to use with property-based testing later. The plan is to generate arbitrary values of type JValue
and convert them to text using the Show
instance we wrote earlier. QuickCheck has the typeclass Arbitrary
for the types for which it can generate random values. We can implement this typeclass for the JValue
type but the problem with that is, we can’t have different generators for JSON numbers and strings and other cases. So instead, we write functions to directly create generators for different JSON value types.
You may skip this section and jump to the Parser section if you wish. You can come back here and read it when we start implementing tests for our parsers.
Scalar Generators
The Gen
monad lets us write generators by combining the built-in generators. We use the existing generators of Bool
, Integer
and list types in QuickCheck to write the generators for JNull
, JBool
and JNumber
values.
jNullGen :: Gen JValue
= pure JNull
jNullGen
jBoolGen :: Gen JValue
= JBool <$> arbitrary
jBoolGen
jNumberGen :: Gen JValue
= JNumber <$> arbitrary <*> listOf (choose (0, 9)) <*> arbitrary jNumberGen
Here, the <$>
operator is the infix symbolic form of the fmap
function, and the <*>
operator is the applicative apply function.
The JSON string generator is bit more complicated because we need to generate strings with both unescaped and escaped characters.
jsonStringGen :: Gen String
=
jsonStringGen concat <$> listOf (oneof [ vectorOf 1 arbitraryUnicodeChar
, escapedUnicodeChar ])where
= ("\\u" ++) <$> vectorOf 4 (elements hexDigitLetters)
escapedUnicodeChar = ['0'..'9'] ++ ['a'..'f'] ++ ['A'..'F']
hexDigitLetters
jStringGen :: Gen JValue
= JString <$> jsonStringGen jStringGen
Let’s test them with the generate
function from QuickCheck in GHCi:
> generate jNullGen
null> generate jBoolGen
true> generate jNumberGen
2.76e-2> generate jStringGen
"\\uB6dC\\u365E\\u5085\\uCF54\\u47d8\\u17CA\\u10Fd𣲙\\uE62a𡅪"
Note that jStringGen
may generate strings with any Unicode character so the generated string may not be renderable entirely on terminals or on browsers.
Composite Generators
The generators for composite values—Arrays and Object—take an Int
parameter to control the size of the generated values. They invoke jValueGen
which we are yet to define, to generate the component values recursively.
jArrayGen :: Int -> Gen JValue
= fmap JArray . scale (`div` 2) . listOf . jValueGen . (`div` 2)
jArrayGen
jObjectGen :: Int -> Gen JValue
= fmap JObject . scale (`div` 2) . listOf . objKV . (`div` 2)
jObjectGen where
= (,) <$> jsonStringGen <*> jValueGen n objKV n
`div` 2
used twice is to reduce the size of generated values which are otherwise too large and take a long time to generate. Trial in GHCi:
> generate $ jArrayGen 6
[true, "\\u8d78\\uC2C0", null, {}, null, ["\\uD28b", null, null, null], "\\uaC63\\u3Fec\\u55Fa\\uaB47\\uEea0\\u3BB5", false, null]> generate $ jObjectGen 6
{"\\uB2c5\\uB6f4\\udee6\\u3E6F𨔂\\u6037": [[[true]]], "\\uf57b\\ua499\\uE936": null, "\\u9D5a": -7.3310625010e-10}
And finally, we have the generator for any JValue
. It also takes a parameter to control the size. For small values of the parameter it tends towards generating more scalar values and does the opposite for larger values. It does so by calling the generators we defined earlier.
jValueGen :: Int -> Gen JValue
= if n < 5
jValueGen n then frequency [(4, oneof scalarGens), (1, oneof (compositeGens n))]
else frequency [(1, oneof scalarGens), (4, oneof (compositeGens n))]
where
= [jNullGen , jBoolGen , jNumberGen , jStringGen]
scalarGens = [jArrayGen n, jObjectGen n] compositeGens n
Quick trial again:
> generate $ jValueGen 2
9.99546402304186496400162205e-13> generate $ jValueGen 6
{"\\u8F1D\\ua32E\\u8d8D𰜢𱽹\\u1b21": false, "\\u56dd\\uCEbb\\uED43": 13e3, "\\u0de3\\uFFB6颮\\ufb8A\\uFCBa\\u03fa": 5.546567497889e3, "\\u631e\\u9d95\\u2Bb8": {"\\u3a0B𧭯\\ue05E𱋫쀦": -5.397e-1, "\\u9BcD\\u3dbd": false, "": "\\uD65b"}, "\\u0BDb\\ufdEB\\u0749\\ucc92\\u9da3\\u9079\\uDCF1\\udcF3": null, "憝\\udB70\\u8E9a\\ud3a4": true, "\\ubF82\\uf8bD\\u29E0\\uC60A": "\\ub5D7\\u98Ea\\uec7E\\uB27A\\u6bb2\\uFc4C\\uB9cC\\uDEC9", "\\u2fde𛃀\\uF490\\uaC02": true}
We use jValueGen
to write the Arbitrary
instance for JValue
:
instance Arbitrary JValue where
= sized jValueGen
arbitrary = genericShrink shrink
The Generic
instance derivation for JValue
lets us use the genericShrink
function from QuickCheck to automatically shrink test input on test failure.
And finally, one last missing piece:
Adding Whitespace
The JSON grammar allows whitespaces around many of its parts as depicted in the transition diagrams in the JSON syntax section. But our current implementation of the Show
instance for JValue
does not add any extra whitespace around anything. This is because the show
function is pure, and hence cannot generate arbitrary amount of whitespaces. But the Gen
monad can! So let’s write a function to “stringify” JValue
with arbitrary whitespace:
jsonWhitespaceGen :: Gen String
=
jsonWhitespaceGen round . sqrt . fromIntegral)
scale (. listOf
. elements
$ [' ' , '\n' , '\r' , '\t']
stringify :: JValue -> Gen String
= pad . go
stringify where
= l ++ j ++ r
surround l r j = surround <$> jsonWhitespaceGen <*> jsonWhitespaceGen <*> gen
pad gen = pad . pure . intercalate ","
commaSeparated
= case value of
go value JArray elements ->
mapM (pad . stringify) elements
>>= fmap (surround "[" "]") . commaSeparated
JObject kvs ->
mapM stringifyKV kvs >>= fmap (surround "{" "}") . commaSeparated
-> return $ show value
_
=
stringifyKV (k, v) <$> pad (pure $ showJSONString k) <*> stringify v <*> pure ":" surround
jsonWhitespaceGen
is a generator for valid JSON whitespace only strings. We use it in the stringify
function to traverse over the JValue
structure and recursively show parts of it and pad them with arbitrary whitespace.
With everything in place now, we can generate arbitrary JSON text in GHCi:
> generate $ jValueGen 6 >>= stringify
"\t\n\t\t\r[\t \tnull \t\n\t\r, \r \rtrue,\r\t \n\r\t\"\22981\93569\34480\873701\689193\476209\\\\ubacc\\\\u794A\\\\u1C30\"\t \n ,\r \n \t [ \n\n\n \t\t\n\t\ntrue\n\r\t\n\n\n\n,\t\r\t\r \r{\r\r\n \t \n\"\\\\u8a1F\\\\uCcDc\895076\"\r\r \r:\r\t\t\r\tfalse \n\t\r , \"\248461\"\r: \t\t\n {\r\n} \t\r\t\t},\t \n \t\t\r\t{\"\"\t:\r 3e-3\t,\n\"\\\\u5F81\\\\uc031\"\t\n:\"\803844\"\t\t\t ,\"\\\\u29b1\"\n:\r null\t\r\t\r\t \t\t\t\t}\r\n ,\n\t \rtrue\r\t \t,\n \t\t\r{\r\"\\\\u2fA6\759074\"\r\t\t:\t\n[\n \r\r\rnull]\n\t,\n\r\n\n \"\\\\uEee3\\\\u5Dab\61593\" : \n\tnull\n\n \t\r\n\r\r}\n \n \r\r\r,\r \r \n\"\951294\\\\u9dd3\\\\u0B39\"\t\n\t,\n\n \t\n \nfalse\r\r\n\r\n \t \r]\n \n\t\r, \r\t \n\t \"\16324\\\\uf6DE\733261\\\\u8b38\\\\ueBa2\382636\474586\\\\uCDDc\\\\u49ee\\\\ua989\"\n ,\n\t\r\rnull \r\r\n\n\t ]\r\r"
You can go over the output and verify that it indeed is a valid JSON text data.
Parser
With the generators set up, let’s write the parsers now. So what exactly is a Parser? A parser takes some input, reads some part of it and maybe “parses” it into some relevant data structure. And it leaves the rest of the input to be potentially parsed later. That sounds like a function! Let’s write it down:
newtype Parser i o =
Parser { runParser :: i -> Maybe (i, o) }
As per our definition, a parser is just a wrapper over the function type i -> Maybe (i, o)
9. If a parser succeeds in parsing then it returns the rest of the input and the output it parsed the input to, else it returns nothing. This definition is simple but it will do for our purpose.
Let’s write our first parser to illustrate this type.
Char Parser
We are starting simple. We are going to write a parser to match the first character of the input with a given character.
char1 :: Char -> Parser String Char
= Parser $ \case
char1 c :xs) | x == c -> Just (xs, x)
(x-> Nothing _
char1
parser matches the given character with the input string and succeeds only if the input starts with the given character10. It returns the rest of the input and the matched character on success. Let’s exercise this on GHCi:
> runParser (char1 'a') "abhinav"
Just ("bhinav",'a')> runParser (char1 'a') "sarkar"
Nothing
Great! We just wrote and ran our first parser. We can generalized this parser by extracting the predicate satisfaction clause out:
satisfy :: (a -> Bool) -> Parser [a] a
= Parser $ \case
satisfy predicate :xs) | predicate x -> Just (xs, x)
(x-> Nothing
_
char :: Char -> Parser String Char
= satisfy (== c) char c
Digit Parser
Moving on, let’s write a parser to parse a digit:
digit1 :: Parser String Int
= Parser $ \i -> case runParser (satisfy isDigit) i of
digit1 Nothing -> Nothing
Just (i', o) -> Just (i', digitToInt o)
We simply use the satisfy
parser with the isDigit
function to parse a character which is a digit (0–9) and then run the digitToInt
function on the output character to convert it to an Int
. Trial run:
> runParser digit1 "123"
Just ("23",1)> runParser digit1 "abc"
Nothing
However, we can do some refactoring:
digit2 :: Parser String Int
= Parser $ \i -> case runParser (satisfy isDigit) i of
digit2 Nothing -> Nothing
Just (i', o) -> Just . fmap digitToInt $ (i', o)
digit3 :: Parser String Int
= Parser $ \i -> fmap (fmap digitToInt) . runParser (satisfy isDigit) $ i digit3
Hmm, it is staring to look like …
Parser
is a Functor
instance Functor (Parser i) where
fmap f parser = Parser $ fmap (fmap f) . runParser parser
Now we can rewrite the digit parser as:
digit :: Parser String Int
= digitToInt <$> satisfy isDigit digit
String Parser
Let’s write a parser to parse out a given string from the input:
string1 :: String -> Parser String String
= case s of
string1 s "" -> Parser $ \i -> Just (i, "")
:cs) -> Parser $ \i -> case runParser (char c) i of
(cNothing -> Nothing
Just (rest, _) -> case runParser (string1 cs) rest of
Nothing -> Nothing
Just (rest', _) -> Just (rest', c:cs)
The string1
parser is written recursively. As the base case, if the given string is empty, we simply return the input and an empty string as the result. Otherwise, we match the first character of the given string with the input by parsing it with the char
parser. If it fails, the string1
parser fails. Else, we recursively run the string1
parser with the rest of the given string against the rest of the input. If the parsing succeeds, we cons the first parsed character with the rest of the parsed characters. Trying in GHCi:
> runParser (string1 "hello") "hello world"
Just (" world","hello")> runParser (string1 "hello") "help world"
Nothing
Let’s refactor this a bit:
string2 :: String -> Parser String String
= case s of
string2 s "" -> Parser $ pure . (, "")
:cs) -> Parser $ \i -> case runParser (char c) i of
(cNothing -> Nothing
Just (rest, c) -> fmap (c:) <$> runParser (string2 cs) rest
If you squint a little bit, what do you think that looks like? Yes, you are right …
Parser
is an Applicative
instance Applicative (Parser i) where
pure x = Parser $ pure . (, x)
<*> po = Parser $ \input -> case runParser pf input of
pf Nothing -> Nothing
Just (rest, f) -> fmap f <$> runParser po rest
Take a minute to read and digest this. With the Applicative
instance, we can now rewrite the string1
parser as:
string :: String -> Parser String String
"" = pure ""
string :cs) = (:) <$> char c <*> string cs string (c
The Functor
and Applicative
instances for Parser
make it really powerful. With the Functor
instance, we can lift pure functions to do operations on parsers. With the Applicative
instance, we can combine multiple parsers together with Applicative
functions (like <*>
) to create new parsers. Now we are ready to write our first JSON parsers.
JNull and JBool Parsers
The parser for JNull
is merely a string
parser for the string null
:
jNull :: Parser String JValue
= string "null" $> JNull jNull
We use the $>
operator to discard the parsed string and return the JNull
value.
The parser for JSON boolean values needs to parse for the string true
, falling back on parsing for the string false
if failed. This is called Backtracking in parsing parlance. To achieve this easily in Haskell, we have to make it so that …
Parser is an Alternative
instance Alternative (Parser i) where
= Parser $ const empty
empty <|> p2 = Parser $ \input -> runParser p1 input <|> runParser p2 input p1
The Alternative
typeclass does exactly what is sounds like. The <|>
function lets you choose a different alternative if the first option fails, hence allowing backtracking. With this, we can write the JBool
parser simply as:
jBool :: Parser String JValue
= string "true" $> JBool True
jBool <|> string "false" $> JBool False
Over to GHCi:
> runParser jNull "null"
Just ("",null)> runParser jNull "dull"
Nothing> runParser jBool "true"
Just ("",true)> runParser jBool "false"
Just ("",false)> runParser jBool "truth"
Nothing> runParser jBool "falsities"
Nothing
These two parsers were pretty simple. The next one is going to be a tad more complicated.
JString Parser
Before writing the JSON string parser, we need a parser to parse JSON characters.
As explained in the String part of the JSON syntax section, the JSON spec allows characters in JSON strings to escaped with some special sequences or with a \u
prefix and characters’ hex-digit codes. Also, JSON control characters cannot be written directly in JSON strings. So we write the JSON character parser as a combination of all these alternatives:
jsonChar :: Parser String Char
= string "\\\"" $> '"'
jsonChar <|> string "\\\\" $> '\\'
<|> string "\\/" $> '/'
<|> string "\\b" $> '\b'
<|> string "\\f" $> '\f'
<|> string "\\n" $> '\n'
<|> string "\\r" $> '\r'
<|> string "\\t" $> '\t'
<|> unicodeChar
<|> satisfy (\c -> not (c == '\"' || c == '\\' || isControl c))
where
=
unicodeChar chr . fromIntegral . digitsToNumber 16 0
<$> (string "\\u" *> replicateM 4 hexDigit)
= digitToInt <$> satisfy isHexDigit
hexDigit
digitsToNumber :: Int -> Integer -> [Int] -> Integer
=
digitsToNumber base foldl (\num d -> num * fromIntegral base + fromIntegral d)
Note that the order of the alternative clauses is important here. The most eager clause is the last one.
The *>
function from the Applicative
typeclass lets us run the parser on its left, discard the parsed value on success and run the parser on its right. replicateM
runs the given parser n
times, gathering the results in a list. The digitsToNumber
function takes a list of digits as Int
s and combines them to create a number with the given base. We use these functions to write the unicodeChar
parser which parses the \u
prefix character representations.
Now that we have the JSON character parser, it should be really easy to parse a JSON string, right? After all, a string is just a list of characters. Wrong! Quoting from the String section of RFC 8259:
To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a 12-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as
"\\uD834\\uDD1E"
.
Now it’s time for short detour to the Unicode Land.
Unicode Planes and Surrogate Characters
In the Unicode standard, a plane is a contiguous group of 216 code points. The first of these planes which covers most of the commonly used characters is called the Basic Multilingual Plane (BMP).
The characters which are not in the BMP can be encoded into it using the code points from the High Surrogate (U+D800–U+DBFF) and Low Surrogate (U+DC00–U+DFFF) blocks of the BMP. A pair of a High Surrogate and a Low Surrogate code points can be used to encode a non-BMP character. A lone surrogate code point cannot be a valid character. G clef 𝄞 residing in the Plane 1, is an example character with code point U+1D11E and surrogate representation (U+D834, U+DD1E).
So, to parse a JSON string, we need to work by character pairs and not just one character at a time. Our current abstractions of Functor
and Applicative
are not powerful enough for this because they work with only one element at a time. We need something more powerful. We need to learn that …
Parser is a Monad
instance Monad (Parser i) where
>>= f = Parser $ \input -> case runParser p input of
p Nothing -> Nothing
Just (rest, o) -> runParser (f o) rest
The Monad
typeclass lets us sequence operations in a context so that the second operation can depend on the result of the first operation. Let’s use it to write the JSON string parser:
jString :: Parser String JValue
= JString <$> (char '"' *> jString') -- 1
jString where
= do
jString' <- optional jsonChar -- 2
optFirst case optFirst of
Nothing -> "" <$ char '"' -- 3
Just first | not (isSurrogate first) -> -- 4
:) <$> jString' -- 5
(firstJust first -> do -- 6
<- jsonChar -- 7
second if isHighSurrogate first && isLowSurrogate second -- 8
then (combineSurrogates first second :) <$> jString' -- 9
else empty -- 10
This code is quite dense so let’s look at it line-by-line. Match the number cues in the code comments with the step numbers below:
- Parse the starting double-quote (
"
) and run the rest of the string through the ancillary parserjString'
. Also wrap the returned result with theJString
constructor at the end. - Parse and get the optional first character using the
optional
function. - If there is no first character, the input is empty. Try to match the ending double-quote (
"
) and return an empty string as output11. - If there is a first character and it is not a surrogate then:
- Run the
jString'
parser recursively on the rest of the input and return this character consed with the rest of the output just as it was done in the String parser. - Else, that is, if the first character is a surrogate then:
- Parse and get the second character. Note that this is not an optional operation like step 2 because there can be no lone surrogates.
- If the first character is a High Surrogate and the second character is a Low Surrogate, that is, if we have a valid surrogate pair:
- Combine the two surrogates, parse the rest of the string with the
jString'
parser recursively, cons the combined character with the rest of the output and return it. - Else fail because the surrogate pair is invalid.
In summary, we read two characters from the input instead of one and see if we can find a valid surrogate pair.
The helper functions are:
highSurrogateUpperBound :: Int
highSurrogateLowerBound,= 0xD800
highSurrogateLowerBound = 0xDBFF
highSurrogateUpperBound
lowSurrogateUpperBound :: Int
lowSurrogateLowerBound,= 0xDC00
lowSurrogateLowerBound = 0xDFFF
lowSurrogateUpperBound
isSurrogate :: Char -> Bool
isHighSurrogate, isLowSurrogate,=
isHighSurrogate a ord a >= highSurrogateLowerBound && ord a <= highSurrogateUpperBound
=
isLowSurrogate a ord a >= lowSurrogateLowerBound && ord a <= lowSurrogateUpperBound
= isHighSurrogate a || isLowSurrogate a
isSurrogate a
combineSurrogates :: Char -> Char -> Char
= chr $
combineSurrogates a b ord a - highSurrogateLowerBound) `shiftL` 10)
((+ (ord b - lowSurrogateLowerBound) + 0x10000
The do
syntax is a syntactic-sugar on top of the monadic bind operation >>=
which allows us to sequence monadic operations. That’s how we are able to read the first character and choose to do different things depending on whether it is a surrogate or not12. This cannot be done without the Monad
instance of Parser
.
Let’s give jString
a try in GHCi:
> runParser jString "\"abhinav\""
Just ("","abhinav")> runParser jString "\"\\u1234\""
Just ("","ሴ")> runParser jString "\"\\uD834\\uDD1E\""
Just ("","𝄞")> runParser jString "\"\\uD834\"" -- lone surrogate is invalid
Nothing> runParser jString "\"\\uD834\\uE000\"" -- \uEOOO is not a surrogate
Nothing
It seems to work but we can’t be sure yet. Let’s write our first QuickCheck property to test it throughly:
prop_genParseJString :: Property
=
prop_genParseJString $ \js ->
forAllShrink jStringGen shrink case runParser jString (show js) of
Nothing -> False
Just (_, o) -> o == js
We use the JSON string generator jStringGen
which we wrote in the Scalar Generators section to generate arbitrary JSON strings. Then we parse them using the jString
parser and equate the parsed result with the generated value for confirming that the parser works. The forAllShrink
function from QuickCheck takes care of input generation and input shrinking in case of failures, automatically. We test this property in GHCi using the quickCheck
function:
> quickCheck prop_genParseJString
+++ OK, passed 100 tests.
Brilliant! That was a complicated parser to write. Let’s move on to parsing numbers.
JNumber Parser
Numbers in JSON can be in different formats. They can be an integer, or a real number with a integral and fractional part, or in scientific notation with or without a fractional part. We will write separate parsers for each of these formats and then combine them to create the number parser. We start with the integer parser:
jUInt :: Parser String Integer
= (\d ds -> digitsToNumber 10 0 (d:ds)) <$> digit19 <*> digits
jUInt <|> fromIntegral <$> digit
digit19 :: Parser String Int
= digitToInt <$> satisfy (\x -> isDigit x && x /= '0')
digit19
digits :: Parser String [Int]
= some digit
digits
jInt' :: Parser String Integer
= signInt <$> optional (char '-') <*> jUInt
jInt'
signInt :: Maybe Char -> Integer -> Integer
Just '-') i = negate i
signInt (= i signInt _ i
jUInt
is a parser for unsigned integers. Integers in JSON cannot start with leading zeros. So if there are multiple digits, jUInt
makes sure that the first digit is 1–9. Alternatively, there can be one digit in range 0–9. digitsToNumber
is used to combine parsed digits into an Integer
. jInt'
add support for an optional -
sign over jUInt
.
We use the some
function here for writing the digits
function. some
runs the given parser one or more times and returns gathering the results in a list.
Parsers for the fractional and exponent parts are simple:
jFrac :: Parser String [Int]
= char '.' *> digits
jFrac
jExp :: Parser String Integer
= (char 'e' <|> char 'E')
jExp *> (signInt <$> optional (char '+' <|> char '-') <*> jUInt)
Now we can combine these parsers to create a parser for various number formats:
jInt :: Parser String JValue
= JNumber <$> jInt' <*> pure [] <*> pure 0
jInt
jIntExp :: Parser String JValue
= JNumber <$> jInt' <*> pure [] <*> jExp
jIntExp
jIntFrac :: Parser String JValue
= (\i f -> JNumber i f 0) <$> jInt' <*> jFrac
jIntFrac
jIntFracExp :: Parser String JValue
= (\ ~(JNumber i f _) e -> JNumber i f e) <$> jIntFrac <*> jExp jIntFracExp
And finally, the jNumber
parser is a combination of all the format parser alternatives, ordered from most eager to least eager:
jNumber :: Parser String JValue
= jIntFracExp <|> jIntExp <|> jIntFrac <|> jInt jNumber
We can verify it in GHCi:
> runParser jNumber "01"
Just ("1",0)> runParser jNumber "4"
Just ("",4)> runParser jNumber "44"
Just ("",44)> runParser jNumber "44.0"
Just ("",44.0)> runParser jNumber "44.45"
Just ("",44.45)> runParser jNumber "0.444"
Just ("",0.444)> runParser jNumber "44E4"
Just ("",44e4)> runParser jNumber "44.3e-7"
Just ("",44.3e-7)> runParser jNumber "0.35e+34"
Just ("",0.35e34)
Nice. But to be sure, let’s write a QuickCheck property for jNumber
:
prop_genParseJNumber :: Property
=
prop_genParseJNumber $ \jn ->
forAllShrink jNumberGen shrink case runParser jNumber (show jn) of
Nothing -> False
Just (_, o) -> o == jn
And run it:
> quickCheck prop_genParseJNumber
+++ OK, passed 100 tests.
That concludes the parsers for the scalar JSON types.
JArray Parser
A JSON array can contain zero or more items of any JSON types separated by commas (,
). So it’s natural that the array parser will be recursive in nature. Arrays can also contain any amount of JSON whitespace around the items. First we write some helper functions to ease our parser implementation:
surroundedBy ::
Parser String a -> Parser String b -> Parser String a
= p2 *> p1 <* p2
surroundedBy p1 p2
separatedBy :: Parser i v -> Parser i s -> Parser i [v]
= (:) <$> v <*> many (s *> v)
separatedBy v s <|> pure []
spaces :: Parser String String
= many (char ' ' <|> char '\n' <|> char '\r' <|> char '\t') spaces
We are using the previously introduced operators <$>
from the Functor
typeclass, *>
, <*>
and <*
from the Applicative
typeclass, and <|>
from the Alternative
typeclass. The function many
is like some
except it runs the given parser zero or more times. Let’s see these in action:
jArray :: Parser String JValue
= JArray <$>
jArray '['
(char *> (jValue `separatedBy` char ',' `surroundedBy` spaces)
<* char ']')
It’s amazing how this definition almost reads like the spec for JSON array itself. We use the yet undefined jValue
parser here to recursively parse any JSON value. Let’s try this out:
> :{
> runParser jArray
> "[1, \"hello\", \n3.5, null, [false,true]]"
> :}
Just ("",[1, "hello", 3.5, null, [false, true]])> runParser jArray "[ [ [ true ] ] ]"
Just ("",[[[true]]])> runParser jArray "[123"
Nothing
Let’s write the QuickCheck property for testing the parser:
prop_genParseJArray :: Property
=
prop_genParseJArray $ \ja -> do
forAllShrink (sized jArrayGen) shrink <- dropWhile isSpace <$> stringify ja
jas return . counterexample (show jas) $ case runParser jArray jas of
Nothing -> False
Just (_, o) -> o == ja
We generate arbitrary JSON arrays using the sized jArrayGen
generator, we stringify
the arrays and parse the text data to equate them with the original generated arrays. Since the jArray
parser does not deal with leading whitespace, we need to discard it before parsing the text. sized
lets QuickCheck control the size of generated values. We also add additional info to the QuickCheck error reports using the counterexample
function.
Running the test13:
> quickCheck prop_genParseJArray
+++ OK, passed 100 tests.
JObject Parser
On to the final piece, the JSON object parser almost writes itself after all we have learned till now:
jObject :: Parser String JValue
= JObject <$>
jObject '{' *> pair `separatedBy` char ',' `surroundedBy` spaces <* char '}')
(char where
= (\ ~(JString s) j -> (s, j))
pair <$> (jString `surroundedBy` spaces)
<* char ':'
<*> jValue
The property for testing is quite similar to that of the jArray
parser:
prop_genParseJObject :: Property
=
prop_genParseJObject $ \jo -> do
forAllShrink (sized jObjectGen) shrink <- dropWhile isSpace <$> stringify jo
jos return . counterexample (show jos) $ case runParser jObject jos of
Nothing -> False
Just (_, o) -> o == jo
And the test:
> quickCheck prop_genParseJObject
+++ OK, passed 100 tests.
JSON Parser
Finally, it’s time to put all the puzzle pieces together to write the JSON parser:
jValue :: Parser String JValue
= jValue' `surroundedBy` spaces
jValue where
= jNull
jValue' <|> jBool
<|> jString
<|> jNumber
<|> jArray
<|> jObject
That was easier than expected14! Now we can write the parseJSON
function to … parse JSON:
parseJSON :: String -> Maybe JValue
= case runParser jValue s of
parseJSON s Just ("", j) -> Just j
-> Nothing _
And now we write the final property which just straight-up generates arbitrary JSON values, stringify
s them and matches the parsed values with the original generated values:
prop_genParseJSON :: Property
= forAllShrink (sized jValueGen) shrink $ \value -> do
prop_genParseJSON <- stringify value
json return . counterexample (show json) . (== Just value) . parseJSON $ json
Let’s skip testing this property and instead write a test to test all of them:
runTests :: IO ()
= do
runTests putStrLn "== prop_genParseJString =="
quickCheck prop_genParseJString
putStrLn "== prop_genParseJNumber =="
quickCheck prop_genParseJNumber
putStrLn "== prop_genParseJArray =="
quickCheck prop_genParseJArray
putStrLn "== prop_genParseJObject =="
quickCheck prop_genParseJObject
putStrLn "== prop_genParseJSON =="
quickCheck prop_genParseJSON
As the tradition goes, let’s do a final run for all the tests15:
> runTests
== prop_genParseJString ==
+++ OK, passed 100 tests.
== prop_genParseJNumber ==
+++ OK, passed 100 tests.
== prop_genParseJArray ==
+++ OK, passed 100 tests.
== prop_genParseJObject ==
+++ OK, passed 100 tests.
== prop_genParseJSON == +++ OK, passed 100 tests.
Hurray! We have written a simple but correct JSON parser from scratch.
Conclusion
In the short span of forty-ish minutes, we have learned how to write a JSON parser from scratch in Haskell. We have also learned some basics of parsing and a great deal of details about the JSON syntax. We also gained some understanding of how to write Property-based tests with QuickCheck. I hope all these things will be useful to you. The full code for the JSON parser can be seen here.
Acknowledgements
Many thanks to Ankur Sethi and Nirbheek Chauhan for helping me understand the intricacies of Unicode, and to Steven Deobald for reviewing a draft of this article.
parse: verb [with object]
resolve (a sentence) into its component parts and describe their syntactic roles
Origin: mid 16th century: perhaps from Middle English pars ‘parts of speech’, from Old French pars ‘parts’ (influenced by Latin pars ‘part’).↩︎
Parsing is often preceded by a Lexing step which converts raw text input into a sequence of tokens which may be annotated with some extra syntactic information. Alternatively, Lexing can be skipped as a separate step and combined with the Parsing step.↩︎
Context-free grammars are a part of the Chomsky hierarchy of formal grammars.↩︎
C++ famously has a non-context-free grammar, aka a Context-sensitive grammar.↩︎
For comparison, here is the grammar of JSON to be used by the popular parser generator framework ANTLR. And this post shows how to implement a JSON parser by hand in Javascript.↩︎
For real production usage, I recommend using the Aeson library.↩︎
Actually, QuickCheck does not generate entirely arbitrary inputs. It generates arbitrary inputs with increasing complexity—where the complexity is defined by the user—and asserts the properties on these inputs. When a test fails for a particular input, QuickCheck also tries to simplify the culprit input and tries to find the simplest input for which the test fails. This process is called Shrinking in QuickCheck parlance. QuickCheck then shows this simplest input to the user for them to use it to debug their code.↩︎
If you are confused by so many backslashes (
\
) in theshowJSONChar
function, they are there because to write a backslash in Haskell code or in JSON text data, you need to escape it with … another backslash.↩︎Real-world parsers are of course more complicated than our definition. They need to report proper error messages and often need to be stateful. For example, the parser type in the popular Attoparsec parser library looks like this:
↩︎newtype Parser i o = Parser { runParser :: forall r. State i -> Pos -> More -> Failure i (State i) r -> Success i (State i) o r -> IResult i r }
In case you are confused by the dangling
\case
, we are using the LambdaCase GHC extension here.↩︎As you may have guessed,
<$
is a flipped version of$>
. It discards the parser result on left and returns the value on right.↩︎Essentially, what the
Monad
abstraction gives our parsers is Lookahead, the capability to “look ahead” in the input and make decisions accordingly. With the lookahead capability, we can write a Predictive parser, which is a Recursive decent parser without the need to do backtracking. Predictive parsing is faster than backtracking because we don’t need to go over the input multiple times. With the lookahead of one character, we can write an alternative version of thejBool
parser like this:↩︎lookahead :: Parser String Char = Parser $ \case lookahead @(x:_) -> Just (i, x) i-> Nothing _ jBoolAlt :: Parser String JValue = do jBoolAlt <- lookahead c JBool <$> case c of 't' -> string "true" $> True 'f' -> string "false" $> False -> empty _
You can also run the tests with the
verboseCheck
function to see all the values being generated by QuickCheck.↩︎Alternatively, we can write a predictive parser for JSON:
↩︎jValueAlt :: Parser String JValue = jValueAlt `surroundedBy` spaces jValue' where = do jValue' <- lookahead c case c of 'n' -> jNull 't' -> jBool 'f' -> jBool '\"' -> jString '[' -> jArray '{' -> jObject -> jNumber _
Our astute readers may have noticed that all the QuickCheck properties we wrote are for Positive Tests, that is, we only generate valid JSON text to test our parsers. They do not guarantee that the parsers fail for invalid inputs. Since the point of this post is not to learn testing in details, I’ll leave the Negative Testing as an exercise for the readers. However, I’ve ascertained the correctness of this JSON parser by running it against this test suite written by the author of the famous Parsing JSON is a Minefield article.↩︎
Got suggestions, corrections, or thoughts? Post a comment!
23 comments
longkaht
John Weathers
Abhinav Sarkar
Hugo Sousa
Bibek Pandey
hk_hooda
Abhinav Sarkar
Linus
Vilem might have
Abhinav Sarkar
taylorfausak
NihilistDandy
ranjitjhala
Abhinav Sarkar
Iceland_jack
Iceland_jack
przemo_li
crygnus
sondr3_
sfvisser
Abhinav Sarkar
[deleted]
Abhinav Sarkar