Moje zdjęcie
Software Craftsman's Blog by Marcin Pieciukiewicz
Java and Scala development

Monday, August 26, 2013

Writing a simplified JSON parser in Clojure

Today I want to write a step by step example of creating a parser for JSON. What is extraordinary, I'll use a Clojure to do that. This is because I have to learn Clojure for my current job and I think parsing problem is very good for this, due it's complexity and usefulness.

What is Clojure? Basically it is a programming language that is a dialect of another language called Lisp. And it is targeted for Java Virtual Machine. Here is some minor description from clojure.org:

Clojure is predominantly a functional programming language, and features a rich set of immutable, persistent data structures.

For those that aren't familiar with Clojure this article might be too difficult, because Clojure has a tremendously different syntax than ie. Java or Scala. In that case it would be better to start with some other tutorial that covers Clojure basics or with some book, ie. "Practical Clojure". I found that book to be a very good, easy to read and a comprehensive introduction to the language.

Enough introduction, let's go to our parser problem.

Define parser requirements
We want to write a JSON parser. Why JSON? I've chosen it because of a couple of reasons. First it is very popular in many commercial usages (ie. RESTful services or NoSQL databases), on the second hand it has very simple syntax so it will be good for training.

JSON syntax is defined on a page www.json.org. So basically in JSON we could have 6 different value types:
  1. Object - wrapped in curly brackets, properties separated by commas and values separated from properties names separated with colon. ie. {name:"Marcin", surname:"Pieciukiewicz"}
  2. Array - wrapped in square brackets, values separated with commas, ie. ["dog","cat","cow"]
  3. String - sequence of characters wrapped in double quotes, ie. "dog"
  4. Number
  5. Boolean - true or false
  6. Null - null
For our purpose I would like to make some simplifications to the syntax, so we could focus on general parsing problems:
  1. No white characters are allowed (outside a string)
  2. Number can be non negative integer only
  3. String cannot contain double quotes and escape character won't be supported, so "\" will be treated as normal character
  4. Parsed JSON is always correct
All those simplifications are introduced so our parser could be simple enough to well fit in this article.

Parsing based on sequence traversing
Clojure provides us with a powerful sequence mechanism and we will use it to traverse through a JSON string. You can imagine that we want to have a cursor over a JSON string that will be used to read subsequent characters, and it will only move forward (from the beginning of the json string, to the end).
To do so we'll have to utilize two commands:
(first some-string)
and
(rest some-string)
first command simply returns first element from the given sequence, and rest command returns a sequence of the elements after the first element. We'll use this commands extensively mainly in recursive operations. It's important to notice that first and rest functions work on sequence passed as parameter, in our case this functions implicitly converts passed string into a sequence.

To understand better how we'll utilize those functions this is a simple example of printing every character from stream in separate line:
(defn print-in-lines [text]
  (if (not (empty? text))
    (do
      (println (first text))
      (recur (rest text)))))
This function first checks if given text is not empty and, if so, it prints out the first character of text and then calls itself recursively with text parameter stripped of the first character.

Detect element type
Because our parser will traverse json string only once we need to detect what kind of element is going to be next. To do that we'll define a set of functions that will check the first character from json sequence and compare it with a character from JSON syntax. Ie. we want to be able to detect beginning of an object. To do so we'll define is-object-opening? function:
(defn is-object-opening? [json] (= \{ (first json)))
As you can see (which might be hard in Clojure ;)) this function takes a json sequence and checks if it's first character is an opening curly bracket “{“. And in the end it returns boolean value. Also we would like to know when this object ends, so let's define function to detect closing curly bracket:
(defn is-object-ending? [json] (= \} (first json)))
This is very simple, so let's define this kind of functions for array and string detection:
(defn is-array-opening? [json] (= \[ (first json)))
(defn is-array-ending? [json] (= \] (first json)))

(defn is-string-opening? [json] (= \" (first json)))
(defn is-string-ending? [json] (= \" (first json)))
Also we'll need to detect integers, boolean values and nulls. To do that we'll assume that integer value starts with a digit from 0 to 9, boolean value starts with lowercase 't' (for true) or lowercase 'f' (for false), and null starts with lowercase 'n'. Now we can define functions that will detect those values:
(defn is-boolean-opening? [json] (or (= \t (first json)) (= \f (first json))))

(defn is-null-opening? [json] (= \n (first json)))

(defn is-digit? [json]
  (and
    (>= (int (first json)) (int \0))
    (<= (int (first json)) (int \9))))

(defn non-digit? [json]
  (not (is-digit? json)))
As you can see I haven't defined functions that detects the end of boolean or null value. That is because we exactly know what values can start with 't', 'f' or 'n' and in particular how long they are. And we'll use this knowledge later.
For integers instead of creating is-integer-opening? and is-integer-ending? I've created more generally named functions that do exactly the same thing.

Parsing primitive values - number, string, boolean and null
When we know what is going to be next in the sequence we want to parse it. We'll start with a basic types, object and array we'll leave for later.
So we need to create a function for each type that will return a parsed value, and, evenly important, a new sequence that is pointing just after parsed value. You can think about it as returning new cursor position in json text.

Parsing string.
String parsing is about reading our json stream until we occur double quote symbol. We'll do that recursively (tail-recursive). What is important, we assume that json sequence passed to this function doesn't start with opening double quote, because it was "consumed" when we checked value type. This is parse-string function implementation (in first call string is initialized to an empty string ""):
(defn parse-string [json string]
  (if (is-string-ending? json)
    [string (rest json)]
    (recur (rest json) (str string (first json)))))
As you can see we check first character of json to see if we should finish parsing. If so we return string (which is our accumulator of read value) and the remaining json sequence. We've used rest function to skip closing double quote character. Otherwise, if we haven't found double quote, we remove first character of passed sequence and concatenate current string value with first character in current sequence and make a recursive call to parse-string function.

Parsing number.
Number parsing is very similar to string parsing. We simply read subsequent character until it's not a digit, and meanwhile we increase an accumulated value (in first call number have to be initialized to zero):
(defn char-to-number [character]
  (- (int character) (int \0)))

(defn parse-number [json number]
  (if (non-digit? json)
    [number json]
    (recur (rest json) (+ (char-to-number (first json)) (* number 10)))))
To convert a digit character into corresponding numerical value I've created char-to-number function that simply casts character into ASCII integer value and calculate difference between that value and integer value of '0' character.

Parsing boolean.
To parse a boolean we'll just have to check the first letter. If it is 't' the value is 'true', if the first letter is 'f' the boolean value is 'false'. Now implementation of parser function is very simple:
(defn parse-boolean [json]
  (if (= \t (first json))
     [true (drop 4 json)]
     [false (drop 5 json)]
    )
  )
We don't have to detect where the value string ends because we know that true is 4 characters long, and false is 5 characters long, so we'll just drop corresponding number of characters from json sequence.
To be clear, we could do that because we assumed that json we are parsing is correct.

Parsing null.
Parsing null value is even simpler than parsing a boolean value. Because we'll call our parse-null function after we detected null value, we'll only need to skip 4 characters from json sequence and return nil:
(defn parse-null [json]
    [nil (drop 4 json)]
  )
Call a parse function for a detected type
Now we have to combine our detection of element type with proper parse function. To do that we'll create parse-value function (because everything is a value ;)):
(declare parse-object parse-array)

(defn parse-value [json]
  (cond
    (is-digit? json) (parse-number json 0)
    (is-boolean-opening? json) (parse-boolean json)
    (is-null-opening? json) (parse-null json)
    (is-string-opening? json) (parse-string (rest json) "")
    (is-object-opening? json) (parse-object (rest json) {})
    (is-array-opening? json) (parse-array (rest json) [])
    :else (println "Incorrect character?")))
Also we've declared parse-object and parse-array functions, so parse-value function would compile. We'll implement those functions later.

Parse array
Array parsing will be a little more complex. Basically we need to read elements separated by comma character, until we occur closing square bracket “]” that is closing the array. We'll do that by recursively calling our array-parse function and in each run it will read one element of the array. This is the implementation:
(defn drop-comma-if-first [json]
  (if (= \, (first json))
    (rest json)
    json))

(defn parse-array [json array]
  (if (is-array-ending? json)
    [array (rest json)]
    (let
      [value-result (parse-value (drop-comma-if-first json))
       value (nth value-result 0)
       rest-json (nth value-result 1)]
      (recur rest-json (conj array value)))))
As you can see we simply stop the recursion when the first element of json stream is “]” character (is-array-ending? function). If not, we have to read element value and to do that we'll use parse-value function. There is also drop-colon-if-first function used - it simply skips the first colon in json sequence if it is present, this is the case when we want to read second or subsequent element from array and we need to skip separator.

Also in parse-array function we use let statement so we can acquire result of parse-value function that is a vector containing parsed value, and also the new json stream that points just after parsed value. After that we simply call parse-array function recursively with new json stream and array with added value we've just parsed (we've used conj function to do that).

Parse object
Object parsing is very similar to array parsing, with the difference that we also need to parse the key name of the value. To do that we have to introduce a method that will be responsible for parsing of that key:
(defn parse-property-name [json name]
  (if (= \: (first json))
    [name json]
    (recur (rest json)
           (str name (first json)))))
It works alike parse-string function, but instead of looking for double quote it is looking after colon to check if the name has ended. Next, let's look at the implementation of parse-object function:
(defn parse-object [json object]
  (if (is-object-ending? json)
    [object (rest json)]; rest to skip closing '}'
    (let
      [property-name-result (parse-property-name (drop-comma-if-first json) "")
       property-name (nth property-name-result 0)
       value-json (rest (nth property-name-result 1))
       value-result (parse-value value-json)
       value (nth value-result 0)
       rest-json (nth value-result 1)]
      (recur rest-json (assoc object property-name value)))))
As you can see it has very similar construction to parse-array function. In let statement it calls parse-property-name function, remembers it's result, then calls parse-value function. Also it uses assoc function to add new property to constructed object.

Main parser function
The last thing to do is to create a main function that can be access point to our parser:
(defn parse-json [json]
  (nth (parse-value json) 0))
It simply calls parse-value function and from it's result extracts first element which is a parsed element. That's all.

Summary
I hope that this example showed that Clojure is a good way to solve more complex problems, such as string parsing. It might not be a best language to do so, but it is not a great struggle to work with it either. For sure Clojure will force you to think differently than Java and this might be it's greatest advantage.

Source code with full solution is available on gist.github.com: json_parser.clj.

2 comments:

  1. Interesting way of handling the stream and "stack" functionally one character at a time. Perhaps some combination of take-while + drop would produce a bit cleaner code, but also a bit slower.

    A few rough edges:

    The assumption that parse-string has the first quote missing is weird - maybe it should be called with the whole string, including the first character? That way you can (parse-string "\"string\""), (parse-object "{\"foo\": \"bar\"}") etc.

    The recur with (str) needlessly creates a new string for every character - perhaps it should be passing a sequence all the time until the return?

    (nth x 0) is (first x), (nth x 1) is (second x). You also could use some destructuring:

    (let [[property-name value-json] (parse-property-name (drop-comma-if-first json) "")
    [value rest-json] (parse-value value-json)])

    ReplyDelete
  2. Hi Konrad, I have to admit that those are valid remarks :) I especially like this about not skipping first character when ie. string parsing, this change would be instant thanks to the character of clojure's sequence. I think that I've done this that way because I've created similar parser in Scala, but there I've created a sequence for String that allowed each character to be read only once. And now I see that this can be easily improved. Thanks for this comment.

    ReplyDelete