Monadic Parser Combinators - Haskell 風パーサー・コンビネーターの実装

Abstract for English readers:

This article describes an implementation of a purely functional, monadic parser combinator library in PLT Scheme.

With this library, one can easily build non-ambiguous, recursive-descent style parsers for string of characters, input port, or even list of tokens generated by a separate lexer process.

To accomodate parse failure, i.e., to perform backtracking, this library makes use of srfi-40's stream. One of the benefits of this approach is that backtracking will be enabled automatically on input ports, where one cannot always move the current port position backwards.

Sources:
parser.ss
json.ss (example parser)

ここ最近ジェネレータによる HTML パーサーの開発に取り組んでいたんですが、オプショナルな閉じタグの扱いが非常に困難であることが分かり、暗礁に乗り上げました。

そこで目先を変えまして、Haskell のテキストを読んで別なアプローチの解析技術を学んでみました。

これまで私が作ろうとしていたのは終端項から調べてツリーを組み上げていく、ボトムアップ式のパーサーであったのですが、それとは対照的な、トップダウン式の「パーサー・コンビネーター」と呼ばれるテクニックです。

解析したい言語の文法 (生成規則) を、小さなパーサーの組み合わせによって記述するというもので、非常に簡単かつエレガントにパーサーを作ることができます。

以下ライブラリの実装を示していきますが、先に感じを掴んでいただくためにパーサーの定義例をお見せしましょう。

;; Preview of how you'll be able to use our monadic parser combinator.
;; This is an example parser for HTML's <p> element.  Note the
;; smilarities with Haskell's "do" notation.

(define paragraph
  (doP (tag <- (open-tag 'p))
       (body <- (many inline))
       (end-tag/opt 'p)
       (return (append tag body))))

P タグのパーサーの例です。最初に開きタグがあり、次に0個以上のインライン要素があり、閉じタグは無いかもしれない、という文法をそのまま直感的に記述しています。

Haskell のモナドの >>= と return 関数を作り、その上に doP 構文を被せることで、ややこしい部分は地下に埋め込まれ、このようなシンプルな記述ができるようになっています。

;;; parser.ss

(require scheme/control  ; reset shift
         scheme/function ; negate curry
         srfi/8          ; receive
         srfi/40         ; stream
         mzlib/etc       ; identity
         )


(define (sequence->stream seq)
  (cond ((stream? seq) seq)
        ((list? seq) (list->stream seq))
        ((string? seq) (string->stream seq))
        ((input-port? seq) (port->stream seq))
        (else
         (error 'sequence->stream "Invalid type: ~s" seq))))

(define (list->stream l)
  (let loop ((l l))
    (if (null? l)
        stream-null
        (stream-cons (car l)
                     (loop (cdr l))))))

(define (string->stream s)
  (let ((len (string-length s)))
    (let loop ((i 0))
      (if (>= i len)
          stream-null
          (stream-cons (string-ref s i)
                       (loop (add1 i)))))))

(define (port->stream in)
  (let loop ((c (read-char in)))
    (if (eof-object? c)
        stream-null
        (stream-cons c
                     (loop (read-char in))))))

まず、入力形式を統一する関数です。

文字列や入力ポートなど、様々な入力に対応したいので、入力データをストリーム (遅延リスト) 化することで、入力の型によらず統一的にパーサーを定義できるようにしています。~~これに関しては PLT Scheme の for 構文と部分継続 (delimited continuation) を活用しています。~~

ストリーム化の別の利点として、入力ポート上でのバックトラックが可能になるということも挙げられるでしょう。

次にパーサーを実行する関数です。

(define (parse p input)
  (receive (x rest) (p (sequence->stream input))
    (unless (stream-null? rest)
      (printf "Unparsed: ~s~%" (stream-car rest)))
    x))

パーサー (p) の実行結果は常に (values parse-result rest-of-input) の形式で返されます。これは Haskell ではタプルでパース結果と残りの入力を返すことに対応しています。

さて、多値を使うということで、パーサを次々と繋げていく時にその値の受け渡しが少々煩雑になってしまいます。上の P タグのパーサーは、最初は次のように定義されていました:

;; - Intermission -
;; Definition of paragraph parser above was actually like this before
;; introducing monadic operators.
;; As you see, manually passing around multiple values and propagating
;; parse failure through a series of parsers gets quite cumbersome.
(define (paragraph input)
  (receive (tag rest) ((open-tag 'p) input)
    (if tag
        (let*-values (((body rest) ((many inline) rest))
                      ((/tag rest) ((end-tag/opt 'p) rest)))
          (values (append tag body) rest))
        (values #f input))))

ここでモナドを導入します。

;; Monadic operators

;; Return x, without consuming input
(define ((return x) input)
  (values x input))

;; Binding operator.  Note that only #f is treated as parse failure.
;; I.e., empty list should denote success without successful parsing.
(define ((>>= p f) input)
  (receive (x rest) (p input)
    (if x
        ((f x) rest)
        (fail input))))

(define-syntax >>
  (syntax-rules ()
    ((>> p q)
     (>>= p (lambda (_) q)))
    ((>> p q r ...)
     (>> p (>> q r ...)))))

(define-syntax doP
  (syntax-rules (<-)
    ((do e) e)
    ((do (v <- p) e ...)
     (>>= p (lambda (v) (doP e ...))))
    ((do p e ...)
     (doP (_ <- p) e ...))))

(define guard return)

次の item というパーサーがすべてのパーサー作りの基本となります。

;; Other monadic infrastructure functions

(define (peek input)
  (and (not (stream-null? input))
       (stream-car input)))

(define (next input)
  (and (not (stream-null? input))
       (stream-cdr input)))

;; `item' simply consumes one input.  This is the fundamental parser
;; upon which most other parsers will be built
(define (item input)
  (values (peek input) (next input)))

;; Special parser for signaling parse failure, optionally notifying
;; where the parse failed.
(define (fail [context stream-null])
  (values #f context))

次に OR 演算子のパーサー版を定義します。曖昧なパーサー (複数の解析候補を返すパーサー) であれば、それぞれのパーサーの結果の和集合を返すところなんですが、今回、曖昧でないパーサーを実装していますので、最初に成功したパース結果を返します。

;; Parsec's <|> (choice) combinator.  Version 1

(define ((|| . pp) input)
  (let loop ((pp pp))
    (if (null? pp)
        (fail input)
        (receive (x rest) ((car pp) input)
          (if x
              (values x rest)
              (loop (cdr pp)))))))

縦棒2つは空白文字 "" をシンボル化した時の表示形式なんですが、それを OR 演算子に見立ててみました。中置演算子ではないので2つ以上のパーサーを取れます。

上は比較的簡単な定義ですが、パース結果が空リストだった場合に別の候補を探すバージョンも作ってみました:

;; Version 2

(define fail? (curry eq? #f))

(define empty? null?)

;; Translation of `fun' (function union) from "On Lisp" by Paul Graham
(define (disjoin f . fs)
  (if (null? fs)
      f
      (let ((chain (apply disjoin fs)))
        (lambda xs
          (or (apply f xs) (apply chain xs))))))

(define succ? (negate (disjoin fail? empty?)))

(define ((|| . pp) input)
  (define (check x pp)
    (cond ((fail? x) #f)
          ((empty? x)         ;try alternative choices
           (receive (x rest) (loop (cdr pp))
             (when (succ? x)
               (shift k
                 (values x rest))))
           #t)
          (else #t)))
  (define (loop pp)
    (if (null? pp)
        (fail input)
        (receive (x rest) ((car pp) input)
          (if (check x pp)
              (values x rest)
              (loop (cdr pp))))))
  (reset (loop pp)))

以上に加えて、その他便利なパーサー群です:

;; Some utility parsers

(define (satisfy pred?)
  (doP (x <- item)
       (return (and (pred? x) x))))

;; (From Parsec) To avoid stack overflow, we accumulate parse
;; results tail-recursively with left-fold.
(define (many p)
  (doP (xs <- (many-fold cons p))
       (return (reverse xs))))

(define ((many-fold accum p) input)
  (let loop ((input input) (xs '()))
    (receive (x rest) (p input)
      (cond ((empty? x)
             (loop rest xs))
            (x
             (loop rest (accum x xs)))
            (else
             (values xs input))))))

(define (many1 p)
  (doP (x <- p)
       (xs <- (many p))
       (return (cons x xs))))

(define (skip-many p)
  (doP (many-fold void p) (return '())))

(define (skip-many1 p)
  (doP p (skip-many p)))

(define ((between open close) p)
  (doP open (x <- p) close (return x)))

(define (option default p)
  (|| p (return default)))

(define (one-of l)
  (satisfy (curryr memv l)))

(define (none-of l) (satisfy (negate (curryr memv l))))

(define digit (satisfy char-numeric?))

(define letter (satisfy char-alphabetic?))

(define alphanum (|| digit letter))

(define space (satisfy char-whitespace?))

(define spaces (many space))

(define (p:char c)
  (satisfy (curry char=? c)))

(define (p:string str)
  (if (string=? str "")
      (return str)
      (doP (p:char (string-ref str 0))
           (p:string (substring str 1))
           (return str))))

(define (many-till p end)
  (letrec ((scan
            (lambda (input)
              (let loop ((input input) (xs '()))
                (receive (x rest) (end input)
                  (if x
                      (values (reverse xs) rest)
                      (receive (x rest) (p input)
                        (if x
                            (loop rest (cons x xs))
                            (values #f input)))))))))
    scan))

;; EOF

ライブラリのコードはこんなところです。あとは適宜エクスポートする名前を provide してモジュールとして体裁を整えれば完成となります。

最後にテストとして、中置記法の計算式をパースして実行する例を示します:

;; Example.
;; Arithmetic expression parser almost directly translated from
;; http://www.cs.nott.ac.uk/~gmh/pearl.hs

(define (chainl p op a)
  (|| (p . chainl1 . op)
      (return a)))

(define (chainl1 p op)
  (define (rest a)
    (|| (doP (f <- op) (b <- p) (rest (f a b)))
        (return a)))
  (>>= p rest))

(define (symb cs)
  (token (p:string cs)))

(define (token p)
  (doP (a <- p) spaces (return a)))

(define natural
  ((doP (c <- digit)
        (return (- (char->integer c)
                   (char->integer #\0))))
   . chainl1 . (return (lambda (m n)
                         (+ (* 10 m) n)))))

(define nat (token natural))

(define integer
  (|| (doP (p:char #\-)
           (n <- nat)
           (return (- n)))
      nat))

(define addop
  (|| (>> (symb "+") (return +))
      (>> (symb "-") (return -))))

(define mulop
  (|| (>> (symb "*") (return *))
      (>> (symb "/") (return /))))

(define factor
  (|| integer
      (doP (symb "(")
           (n <- expr)
           (symb ")")
           (return n))))

(define term (factor . chainl1 . mulop))
(define expr (term . chainl1 . addop))

(define (calc arith)
  (parse (>> spaces expr) arith))

独立した字句解析の機構が特に存在しないのがお分かりでしょうか。空白を読み飛ばしながら有効な字句を読み取る字句パーサーがパーサー・コンビネーターで容易に書けてしまうため、構文解析と字句解析の区別がほとんど必要無いんですね。

では REPL で試してみましょう。

> (calc "(1 + 2 * (3 + 4)) / 5")
3

合ってます。よね？

ついでに、

> (calc (open-input-string "(1 + 2 * (3 + 4)) / 5"))
3
> (calc (string->list "(1 + 2 * (3 + 4)) / 5"))
3

参考文献:
ふつうのHaskellプログラミング 13 章
Functional Pearls: Monadic Parsing in Haskell
Parsec: Direct Style Monadic Parser Combinators For The Real World
Write Yourself a Scheme in 48 Hours/Parsing