Today I want to
write a step by step example of creating a parser for JSON. What is
extraordinary, I'll use a Clojure to
do that. This is because I have to learn Clojure for my current job
and I think parsing problem is very good for this, due it's
complexity and usefulness.
What is Clojure?
Basically it is a programming language that is a dialect of another
language called Lisp. And it is targeted for Java Virtual Machine.
Here is some minor description from clojure.org:
Clojure is
predominantly a functional programming language, and features a rich
set of immutable, persistent data structures.
For those that aren't
familiar with Clojure this article might be too difficult, because
Clojure has a tremendously different syntax than ie. Java or Scala.
In that case it would be better to start with some other tutorial
that covers Clojure basics or with some book, ie. "Practical
Clojure". I found that book to be a very good, easy to read and
a comprehensive introduction to the language.
Enough introduction, let's
go to our parser problem.
Define parser
requirements
We want to write a JSON
parser. Why JSON? I've chosen it because of a couple of reasons.
First it is very popular in many commercial usages (ie. RESTful
services or NoSQL databases), on the second hand it has very simple
syntax so it will be good for training.
JSON syntax is
defined on a page www.json.org.
So basically in JSON we could have 6 different value types:
- Object - wrapped in curly brackets, properties separated by commas and values separated from properties names separated with colon. ie. {name:"Marcin", surname:"Pieciukiewicz"}
- Array - wrapped in square brackets, values separated with commas, ie. ["dog","cat","cow"]
- String - sequence of characters wrapped in double quotes, ie. "dog"
- Number
- Boolean - true or false
- Null - null
For our purpose I would
like to make some simplifications to the syntax, so we
could focus on general parsing problems:
- No white characters are allowed (outside a string)
- Number can be non negative integer only
- String cannot contain double quotes and escape character won't be supported, so "\" will be treated as normal character
- Parsed JSON is always correct
All those simplifications
are introduced so our parser could be simple enough to well fit in
this article.
Parsing based on
sequence traversing
Clojure provides us with a
powerful sequence mechanism and we will use it to traverse through a
JSON string. You can imagine that we want to have a cursor over a
JSON string that will be used to read subsequent characters, and it
will only move forward (from the beginning of the json string, to the
end).
To do so we'll have to
utilize two commands:
(first some-string)
and
(rest some-string)
first command simply returns first element from the given sequence,
and rest command returns a sequence of the elements after the first
element. We'll use this commands extensively mainly in recursive
operations. It's important to notice that first and rest functions
work on sequence passed as parameter, in our case this functions
implicitly converts passed string into a sequence.
To understand better how
we'll utilize those functions this is a simple example of printing
every character from stream in separate line:
(defn print-in-lines [text] (if (not (empty? text)) (do (println (first text)) (recur (rest text)))))This function first checks if given text is not empty and, if so, it prints out the first character of text and then calls itself recursively with text parameter stripped of the first character.
Detect element type
Because our parser will
traverse json string only once we need to detect what kind of element
is going to be next. To do that we'll define a set of functions that
will check the first character from json sequence and compare it with
a character from JSON syntax. Ie. we want to be able to detect
beginning of an object. To do so we'll define is-object-opening?
function:
(defn is-object-opening? [json] (= \{ (first json)))
As you can see (which might be hard in Clojure ;)) this function
takes a json sequence and checks if it's first character is an
opening curly bracket “{“. And in the end it returns boolean
value. Also we would like to know when this object ends, so let's
define function to detect closing curly bracket:
(defn is-object-ending? [json] (= \} (first json)))
This is very simple, so
let's define this kind of functions for array and string detection:
(defn is-array-opening? [json] (= \[ (first json))) (defn is-array-ending? [json] (= \] (first json))) (defn is-string-opening? [json] (= \" (first json))) (defn is-string-ending? [json] (= \" (first json)))
Also we'll need to detect integers, boolean values and nulls. To do
that we'll assume that integer value starts with a digit from 0 to 9,
boolean value starts with lowercase 't' (for true) or lowercase 'f'
(for false), and null starts with lowercase 'n'. Now we can define
functions that will detect those values:
(defn is-boolean-opening? [json] (or (= \t (first json)) (= \f (first json)))) (defn is-null-opening? [json] (= \n (first json))) (defn is-digit? [json] (and (>= (int (first json)) (int \0)) (<= (int (first json)) (int \9)))) (defn non-digit? [json] (not (is-digit? json)))
As you can see I haven't
defined functions that detects the end of boolean or null value. That
is because we exactly know what values can start with 't', 'f' or 'n'
and in particular how long they are. And we'll use this knowledge
later.
For integers instead of
creating is-integer-opening? and is-integer-ending? I've created more
generally named functions that do exactly the same thing.
Parsing
primitive values - number, string, boolean and null
When
we know what is going to be next in the sequence we want to parse it.
We'll start with a basic types, object and array we'll leave for
later.
So we
need to create a function for each type that will return a parsed
value, and, evenly important, a new sequence that is pointing just
after parsed value. You can think about it as returning new cursor
position in json text.
Parsing
string.
String
parsing is about reading our json stream until we occur double quote
symbol. We'll do that recursively (tail-recursive). What is
important, we assume that json sequence passed to this function
doesn't start with opening double quote, because it was "consumed"
when we checked value type. This is parse-string function
implementation (in first call string is initialized to an empty
string ""):
(defn parse-string [json string] (if (is-string-ending? json) [string (rest json)] (recur (rest json) (str string (first json)))))
As
you can see we check first character of json to see if we should
finish parsing. If so we return string (which is our accumulator of
read value) and the remaining json sequence. We've used rest function
to skip closing double quote character. Otherwise, if we haven't
found double quote, we remove first character of passed sequence and
concatenate current string value with first character in current
sequence and make a recursive call to parse-string function.
Parsing
number.
Number
parsing is very similar to string parsing. We simply read subsequent
character until it's not a digit, and meanwhile we increase an
accumulated value (in first call number have to be initialized to
zero):
(defn char-to-number [character] (- (int character) (int \0))) (defn parse-number [json number] (if (non-digit? json) [number json] (recur (rest json) (+ (char-to-number (first json)) (* number 10)))))
To
convert a digit character into corresponding numerical value I've
created char-to-number function that simply casts character into
ASCII integer value and calculate difference between that value and
integer value of '0' character.
Parsing
boolean.
To
parse a boolean we'll just have to check the first letter. If it is
't' the value is 'true', if the first letter is 'f' the boolean value
is 'false'. Now implementation of parser function is very simple:
(defn parse-boolean [json] (if (= \t (first json)) [true (drop 4 json)] [false (drop 5 json)] ) )
We
don't have to detect where the value string ends because we know that
true is 4 characters long, and false is 5
characters long, so we'll just drop corresponding number of
characters from json sequence.
To be
clear, we could do that because we assumed that json we are parsing
is correct.
Parsing
null.
Parsing
null value is even simpler than parsing a boolean value. Because
we'll call our parse-null function after we detected null value,
we'll only need to skip 4 characters from json sequence and return
nil:
(defn parse-null [json] [nil (drop 4 json)] )
Call
a parse function for a detected type
Now
we have to combine our detection of element type with proper parse
function. To do that we'll create parse-value function (because
everything is a value ;)):
(declare parse-object parse-array) (defn parse-value [json] (cond (is-digit? json) (parse-number json 0) (is-boolean-opening? json) (parse-boolean json) (is-null-opening? json) (parse-null json) (is-string-opening? json) (parse-string (rest json) "") (is-object-opening? json) (parse-object (rest json) {}) (is-array-opening? json) (parse-array (rest json) []) :else (println "Incorrect character?")))
Also
we've declared parse-object and parse-array functions, so parse-value
function would compile. We'll implement those functions later.
Parse
array
Array
parsing will be a little more complex. Basically we need to read
elements separated by comma character, until we occur closing square
bracket “]” that is closing the array. We'll do that by
recursively calling our array-parse function and in each run it will
read one element of the array. This is the implementation:
(defn drop-comma-if-first [json] (if (= \, (first json)) (rest json) json)) (defn parse-array [json array] (if (is-array-ending? json) [array (rest json)] (let [value-result (parse-value (drop-comma-if-first json)) value (nth value-result 0) rest-json (nth value-result 1)] (recur rest-json (conj array value)))))
As you can see we simply stop the recursion when the first element of
json stream is “]” character (is-array-ending? function). If not, we have
to read element value and to do that we'll use parse-value function.
There is also drop-colon-if-first function used - it simply skips the
first colon in json sequence if it is present, this is the case when
we want to read second or subsequent element from array and we need
to skip separator.
Also in parse-array function we use let statement so
we can acquire result of parse-value function that is a vector
containing parsed value, and also the new json stream that points
just after parsed value. After that we simply call parse-array
function recursively with new json stream and array with added value
we've just parsed (we've used conj function to do that).
Parse object
Object parsing is
very similar to array parsing, with the difference that we also need
to parse the key name of the value. To do that we have to introduce a
method that will be responsible for parsing of that key:
(defn parse-property-name [json name] (if (= \: (first json)) [name json] (recur (rest json) (str name (first json)))))It works alike parse-string function, but instead of looking for double quote it is looking after colon to check if the name has ended. Next, let's look at the implementation of parse-object function:
(defn parse-object [json object] (if (is-object-ending? json) [object (rest json)]; rest to skip closing '}' (let [property-name-result (parse-property-name (drop-comma-if-first json) "") property-name (nth property-name-result 0) value-json (rest (nth property-name-result 1)) value-result (parse-value value-json) value (nth value-result 0) rest-json (nth value-result 1)] (recur rest-json (assoc object property-name value)))))As you can see it has very similar construction to parse-array function. In let statement it calls parse-property-name function, remembers it's result, then calls parse-value function. Also it uses assoc function to add new property to constructed object.
Main parser function
The last thing to do is to create a main function that can be access point to our parser:
(defn parse-json [json] (nth (parse-value json) 0))It simply calls parse-value function and from it's result extracts first element which is a parsed element. That's all.
Summary
I hope that this example showed that Clojure is a good way to solve more complex problems, such as string parsing. It might not be a best language to do so, but it is not a great struggle to work with it either. For sure Clojure will force you to think differently than Java and this might be it's greatest advantage.
Source code with full solution is available on gist.github.com: json_parser.clj.