goparsify [![CircleCI](https://circleci.com/gh/Vektah/goparsify/tree/master.svg?style=shield)](https://circleci.com/gh/Vektah/goparsify/tree/master) [![godoc](http://b.repl.ca/v1/godoc-reference-blue.png)](https://godoc.org/github.com/Vektah/goparsify) [![Go Report Card](https://goreportcard.com/badge/github.com/vektah/goparsify)](https://goreportcard.com/report/github.com/vektah/goparsify)
=========
A parser-combinator library for building parsers that are easy to test, read and maintain, using functional composition.

Everything should be Unicode safe by default, but you can opt out of Unicode whitespace handling for a decent ~20% performance boost:
```go
Run(parser, input, ASCIIWhitespace)
```
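To give a feel for where that difference can come from, here is a minimal stdlib-only sketch. It is not goparsify's implementation; `skipASCII` and `skipUnicode` are hypothetical helpers written for this illustration. The ASCII version compares single bytes, while the Unicode version has to decode a rune and consult the Unicode tables on every step.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// skipASCII returns the number of leading bytes that are ASCII whitespace.
func skipASCII(s string) int {
	i := 0
	for i < len(s) {
		switch s[i] {
		case ' ', '\t', '\n', '\v', '\f', '\r':
			i++
		default:
			return i
		}
	}
	return i
}

// skipUnicode returns the number of leading bytes that make up Unicode
// whitespace, decoding one rune at a time.
func skipUnicode(s string) int {
	i := 0
	for i < len(s) {
		r, size := utf8.DecodeRuneInString(s[i:])
		if !unicode.IsSpace(r) {
			return i
		}
		i += size
	}
	return i
}

func main() {
	input := " \u00a0x" // space, no-break space (2 bytes), then a letter
	fmt.Println(skipASCII(input), skipUnicode(input)) // prints: 1 3
}
```

The trade-off is visible in the output: the ASCII-only version stops at the no-break space, so it is only appropriate when the input is known not to contain such characters.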
### benchmarks
I don't have many benchmarks set up yet, but the JSON parser runs at roughly 1.5x the speed of the stdlib (32656 ns/op versus 48023 ns/op in the run below).
```
$ go test -bench=. -benchmem -benchtime=5s ./json -run=none
BenchmarkUnmarshalParsec-8 100000 65682 ns/op 50464 B/op 1318 allocs/op
BenchmarkUnmarshalParsify-8 200000 32656 ns/op 42094 B/op 220 allocs/op
BenchmarkUnmarshalStdlib-8 200000 48023 ns/op 13952 B/op 262 allocs/op
PASS
ok github.com/vektah/goparsify/json 24.314s
```
Most of the remaining small allocs are from putting things in `interface{}` and are pretty unavoidable. https://www.darkcoding.net/software/go-the-price-of-interface/ is a good read.
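The cost of putting a value in an `interface{}` can be observed with the standard library alone. The sketch below is independent of goparsify; `boxedAllocs`, `sink` and `value` are names made up for this illustration. It uses `testing.AllocsPerRun` to count heap allocations when a `float64` is stored in an interface value.

```go
package main

import (
	"fmt"
	"testing"
)

// sink is package-level so the compiler cannot prove the boxed value stays
// on the stack and optimize the allocation away.
var sink interface{}

// value is a variable rather than a constant so its boxed form cannot be
// prepared at compile time.
var value = 3.14159

// boxedAllocs reports the average number of heap allocations made when a
// float64 is stored in an interface{}.
func boxedAllocs() float64 {
	return testing.AllocsPerRun(1000, func() {
		sink = value
	})
}

func main() {
	fmt.Printf("allocs per boxed float64: %.0f\n", boxedAllocs())
}
```

On current Go toolchains this is expected to report one allocation per conversion, which is the kind of small allocation that adds up in the benchmark above.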
### debugging parsers
When a parser isn't working as you intended, you can build with debugging and enable logging to get a detailed log of exactly what the parser is doing.

1. Build with debug using `-tags debug`.
2. Enable logging by calling `EnableLogging(os.Stdout)` in your code.

This works great with tests, e.g. in the goparsify source tree:
```
adam:goparsify(master)$ go test -tags debug ./html -v
=== RUN TestParse
html.go:48 | <body>hello <p | tag {
html.go:43 | <body>hello <p | tstart {
html.go:43 | body>hello <p c | < found <
html.go:20 | >hello <p color | identifier found body
html.go:33 | >hello <p color | attrs {
html.go:32 | >hello <p color | attr {
html.go:20 | >hello <p color | identifier did not find [a-zA-Z][a-zA-Z0-9]*
html.go:32 | >hello <p color | } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:33 | >hello <p color | } found
html.go:43 | hello <p color= | > found >
html.go:43 | hello <p color= | } found [<,body,,map[string]string{},>]
html.go:24 | hello <p color= | elements {
html.go:23 | hello <p color= | element {
html.go:21 | <p color="blue" | text found hello
html.go:23 | <p color="blue" | } found "hello "
html.go:23 | <p color="blue" | element {
html.go:21 | <p color="blue" | text did not find <>
html.go:48 | <p color="blue" | tag {
html.go:43 | <p color="blue" | tstart {
html.go:43 | p color="blue"> | < found <
html.go:20 | color="blue">w | identifier found p
html.go:33 | color="blue">w | attrs {
html.go:32 | color="blue">w | attr {
html.go:20 | ="blue">world</ | identifier found color
html.go:32 | "blue">world</p | = found =
html.go:32 | >world</p></bod | string literal found "blue"
html.go:32 | >world</p></bod | } found [color,=,"blue"]
html.go:32 | >world</p></bod | attr {
html.go:20 | >world</p></bod | identifier did not find [a-zA-Z][a-zA-Z0-9]*
html.go:32 | >world</p></bod | } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:33 | >world</p></bod | } found [[color,=,"blue"]]
html.go:43 | world</p></body | > found >
html.go:43 | world</p></body | } found [<,p,,map[string]string{"color":"blue"},>]
html.go:24 | world</p></body | elements {
html.go:23 | world</p></body | element {
html.go:21 | </p></body> | text found world
html.go:23 | </p></body> | } found "world"
html.go:23 | </p></body> | element {
html.go:21 | </p></body> | text did not find <>
html.go:48 | </p></body> | tag {
html.go:43 | </p></body> | tstart {
html.go:43 | /p></body> | < found <
html.go:20 | /p></body> | identifier did not find [a-zA-Z][a-zA-Z0-9]*
html.go:43 | </p></body> | } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:48 | </p></body> | } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:23 | </p></body> | } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:24 | </p></body> | } found ["world"]
html.go:44 | </p></body> | tend {
html.go:44 | p></body> | </ found </
html.go:20 | ></body> | identifier found p
html.go:44 | </body> | > found >
html.go:44 | </body> | } found [</,,p,>]
html.go:48 | </body> | } found "hello "
html.go:23 | </body> | } found html.htmlTag{Name:"p", Attributes:map[string]string{"color":"blue"}, Body:[]interface {}{"world"}}
html.go:23 | </body> | element {
html.go:48 | </body> | tag {
html.go:43 | </body> | tstart {
html.go:43 | /body> | < found <
html.go:20 | /body> | identifier did not find [a-zA-Z][a-zA-Z0-9]*
html.go:43 | </body> | } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:48 | </body> | } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:21 | </body> | text did not find <>
html.go:23 | </body> | } did not find [a-zA-Z][a-zA-Z0-9]*
html.go:24 | </body> | } found ["hello ",html.htmlTag{Name:"p", Attributes:map[string]string{"color":"blue"}, Body:[]interface {}{"world"}}]
html.go:44 | </body> | tend {
html.go:44 | body> | </ found </
html.go:20 | > | identifier found body
html.go:44 | | > found >
html.go:44 | | } found [</,,body,>]
html.go:48 | | } found [[<,body,,map[string]string{},>],,[]interface {}{"hello ", html.htmlTag{Name:"p", Attributes:map[string]string{"color":"blue"}, Body:[]interface {}{"world"}}},[</,,body,>]]
--- PASS: TestParse (0.00s)
PASS
ok github.com/vektah/goparsify/html 0.117s
```
### debugging performance
If you build the parser with `-tags debug`, it will instrument each parser, and a call to `DumpDebugStats()` will show stats:

| var name | matches | total time | self time | calls | errors | location
| -------------------- | -------------------- | --------------- | --------------- | ---------- | ---------- | ----------
| _value | Any() | 5.0685431s | 34.0131ms | 878801 | 0 | json.go:36
| _object | Seq() | 3.7513821s | 10.5038ms | 161616 | 40403 | json.go:24
| _properties | Some() | 3.6863512s | 5.5028ms | 121213 | 0 | json.go:14
| _properties | Seq() | 3.4912614s | 46.0229ms | 818185 | 0 | json.go:14
| _array | Seq() | 931.4679ms | 3.5014ms | 65660 | 55558 | json.go:16
| _array | Some() | 911.4597ms | 0s | 10102 | 0 | json.go:16
| _properties | string literal | 126.0662ms | 44.5201ms | 818185 | 0 | json.go:14
| _string | string literal | 67.033ms | 26.0126ms | 671723 | 136369 | json.go:12
| _properties | : | 50.0238ms | 45.0205ms | 818185 | 0 | json.go:14
| _properties | , | 48.5189ms | 36.0146ms | 818185 | 121213 | json.go:14
| _number | number literal | 28.5159ms | 10.5062ms | 287886 | 106066 | json.go:13
| _true | true | 17.5086ms | 12.5069ms | 252537 | 232332 | json.go:10
| _null | null | 14.5082ms | 11.007ms | 252538 | 252535 | json.go:9
| _object | } | 10.5051ms | 10.5033ms | 121213 | 0 | json.go:24
| _false | false | 10.5049ms | 5.0019ms | 232333 | 222229 | json.go:11
| _object | { | 10.0046ms | 5.0052ms | 161616 | 40403 | json.go:24
| _array | , | 4.5024ms | 4.0018ms | 50509 | 10102 | json.go:16
| _array | [ | 4.5014ms | 2.0006ms | 65660 | 55558 | json.go:16
| _array | ] | 0s | 0s | 10102 | 0 | json.go:16

All times are cumulative. It would be nice to break this down into a parse tree with relative times. This is a nice addition to pprof, as it breaks down the parsers based on where they are used instead of grouping them all by type.

This is **free** when the debug tag isn't used.
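As a rough idea of what per-call-site instrumentation looks like, here is a minimal stdlib-only sketch. It is not how goparsify implements its debug build; `parser`, `stats`, `instrument`, `literal` and `run` are hypothetical names, and only calls, errors and total time are tracked (no self time, no source location).

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// parser is a hypothetical stand-in for a parser function: it consumes a
// prefix of the input and reports whether it matched.
type parser func(input string) (rest string, ok bool)

// stats holds the counters kept for one instrumented call site.
type stats struct {
	calls, errors int
	total         time.Duration
}

// instrument wraps p so that every call updates the counters in s.
func instrument(s *stats, p parser) parser {
	return func(input string) (string, bool) {
		start := time.Now()
		rest, ok := p(input)
		s.total += time.Since(start)
		s.calls++
		if !ok {
			s.errors++
		}
		return rest, ok
	}
}

// literal matches a fixed string at the start of the input.
func literal(want string) parser {
	return func(input string) (string, bool) {
		if strings.HasPrefix(input, want) {
			return input[len(want):], true
		}
		return input, false
	}
}

// run tries "true" and then "null" on each input and returns the counters
// for the two call sites.
func run(inputs []string) (trueStats, nullStats stats) {
	isTrue := instrument(&trueStats, literal("true"))
	isNull := instrument(&nullStats, literal("null"))
	for _, in := range inputs {
		if _, ok := isTrue(in); !ok {
			isNull(in)
		}
	}
	return trueStats, nullStats
}

func main() {
	t, n := run([]string{"true", "null", "null"})
	fmt.Printf("true: calls=%d errors=%d\n", t.calls, t.errors) // calls=3 errors=2
	fmt.Printf("null: calls=%d errors=%d\n", n.calls, n.errors) // calls=2 errors=0
}
```

Both parsers are of the same kind (`literal`), yet each gets its own row of counters, which is the property that makes this view complementary to pprof.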
### example calculator
Let's say we want to build a calculator that takes an expression and calculates the result.

Let's start with a test:
```go
func TestNumbers(t *testing.T) {
	result, err := Calc(`1`)
	require.NoError(t, err)
	require.EqualValues(t, 1, result)
}
```
Then define a parser for numbers:
```go
var number = NumberLit().Map(func(n Result) Result {
	switch i := n.Result.(type) {
	case int64:
		return Result{Result: float64(i)}
	case float64:
		return Result{Result: i}
	default:
		panic(fmt.Errorf("unknown value %#v", i))
	}
})

func Calc(input string) (float64, error) {
	// number is the root parser for now
	result, err := Run(number, input)
	if err != nil {
		return 0, err
	}

	return result.(float64), nil
}
```
This parser returns numbers as either float64 or int64 depending on the literal. For this calculator we only want floats, so we Map the results and convert them.

Run the tests and make sure everything is OK.

Time to add addition:
```go
func TestAddition(t *testing.T) {
	result, err := Calc(`1+1`)
	require.NoError(t, err)
	require.EqualValues(t, 2, result)
}

var sumOp = Chars("+-", 1, 1)

var sum = Seq(number, Some(And(sumOp, number))).Map(func(n Result) Result {
	i := n.Child[0].Result.(float64)

	for _, op := range n.Child[1].Child {
		switch op.Child[0].Token {
		case "+":
			i += op.Child[1].Result.(float64)
		case "-":
			i -= op.Child[1].Result.(float64)
		}
	}

	return Result{Result: i}
})

// and update Calc to point to the new root parser -> `result, err := Run(sum, input)`
```
This parser will match number ([+-] number)+, then map its result to be the sum. See how the `Child` entries map directly to the positions in the parsers? `n` is the result of the outer `Seq`, `n.Child[0]` is its first argument, `n.Child[1]` is the result of the `Some` parser, `n.Child[1].Child[0]` is the result of the first `And`, and so forth. Given how closely tied the parser and the Map are, it is good to keep the two together.
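To make the shape of that tree concrete, the sketch below builds it by hand for the input `1+2-4` and folds it the same way the Map function does. It does not use goparsify: `node` is a hypothetical stand-in that keeps only the three fields the example touches (`Token`, `Child`, `Result`), and `num`, `term`, `exampleTree` and `evalSum` are helpers made up for this illustration.

```go
package main

import "fmt"

// node is a hypothetical stand-in for the library's Result type.
type node struct {
	Token  string      // raw text matched by a leaf parser
	Child  []node      // results of sub-parsers, in parser order
	Result interface{} // value produced by a Map function
}

// num builds the result of the number parser.
func num(v float64) node { return node{Result: v} }

// term builds the result of one And(sumOp, number) match.
func term(op string, v float64) node {
	return node{Child: []node{{Token: op}, num(v)}}
}

// exampleTree is the hand-built result tree for the input "1+2-4".
func exampleTree() node {
	return node{Child: []node{
		num(1), // n.Child[0]: the first number
		{Child: []node{term("+", 2), term("-", 4)}}, // n.Child[1]: the repeated part
	}}
}

// evalSum mirrors the Map function from the example above.
func evalSum(n node) float64 {
	total := n.Child[0].Result.(float64)
	for _, op := range n.Child[1].Child {
		switch op.Child[0].Token {
		case "+":
			total += op.Child[1].Result.(float64)
		case "-":
			total -= op.Child[1].Result.(float64)
		}
	}
	return total
}

func main() {
	fmt.Println(evalSum(exampleTree())) // prints -1
}
```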
You can continue like this and add multiplication and parentheses fairly easily. Eventually, if you keep adding parsers, you will end up with a loop, and Go will give you a handy error message like:
```
typechecking loop involving value = goparsify.Any(number, groupExpr)
```
We need to break the loop using a pointer, then set its value in `init`:
```go
var (
	value Parser
	prod  = Seq(&value, Some(And(prodOp, &value)))
)

func init() {
	value = Any(number, groupExpr)
}
```
Take a look at [calc](calc/calc.go) for a full example.
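The pointer trick is a general Go pattern and can be tried without the library. In the sketch below, which assumes nothing about goparsify's internals, `parser`, `lit`, `ref`, `seq`, `anyOf` and `matches` are hypothetical helpers. `group` only takes the address of `value`, so the package-level initializers no longer depend on each other, and `init` fills in `value` once everything else exists.

```go
package main

import (
	"fmt"
	"strings"
)

// parser returns the remaining input and whether it matched.
type parser func(input string) (rest string, ok bool)

// lit matches a fixed string.
func lit(s string) parser {
	return func(input string) (string, bool) {
		if strings.HasPrefix(input, s) {
			return input[len(s):], true
		}
		return input, false
	}
}

// ref defers the lookup of *p until parse time, so *p may still be nil
// while package-level variables are being initialized.
func ref(p *parser) parser {
	return func(input string) (string, bool) { return (*p)(input) }
}

// seq matches all parsers one after another.
func seq(ps ...parser) parser {
	return func(input string) (string, bool) {
		rest := input
		for _, p := range ps {
			var ok bool
			if rest, ok = p(rest); !ok {
				return input, false
			}
		}
		return rest, true
	}
}

// anyOf returns the first alternative that matches.
func anyOf(ps ...parser) parser {
	return func(input string) (string, bool) {
		for _, p := range ps {
			if rest, ok := p(input); ok {
				return rest, true
			}
		}
		return input, false
	}
}

var (
	value parser // assigned in init to break the initialization cycle
	group = seq(lit("("), ref(&value), lit(")"))
)

func init() {
	value = anyOf(lit("x"), group)
}

// matches reports whether the whole input is a value.
func matches(input string) bool {
	rest, ok := value(input)
	return ok && rest == ""
}

func main() {
	fmt.Println(matches("((x))"), matches("((x)")) // prints: true false
}
```

Writing `var value = anyOf(lit("x"), group)` directly would make `value` and `group` depend on each other, and the compiler would reject the program with an initialization cycle error.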
### preventing backtracking with cuts
A cut is a marker that prevents backtracking past the point it was set. This greatly improves error messages when used correctly:
```go
alpha := Chars("a-z")

// without a cut, if the close tag is left out the parser will backtrack and ignore the rest of the string
nocut := Many(Any(Seq("<", alpha, ">"), alpha))
_, err := Run(nocut, "asdf <foo")
fmt.Println(err.Error())
// Outputs: left unparsed: <foo

// with a cut, once we see the open tag we know there must be a close tag that matches it, so the parser will error
cut := Many(Any(Seq("<", Cut(), alpha, ">"), alpha))
_, err = Run(cut, "asdf <foo")
fmt.Println(err.Error())
// Outputs: offset 9: expected >
```
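One way to picture the mechanism is a parse state that remembers the offset of the last cut, with the choice combinator refusing to try further alternatives once a cut lies past its starting offset. The stdlib-only sketch below follows that idea. It is an assumption about how such a feature can be built, not a copy of goparsify's code; all names (`state`, `lit`, `alpha`, `cut`, `seq`, `anyOf`, `parseMany`) are hypothetical, whitespace skipping is left out, and `parseMany` stands in for the repetition wrapper plus `Run`.

```go
package main

import (
	"fmt"
	"strings"
)

// state is a hypothetical, stripped-down parse state.
type state struct {
	input string
	pos   int    // current offset into input
	cut   int    // offsets before this one can no longer be backtracked to
	err   string // most recent failure; only meaningful after a parser fails
}

type parser func(s *state) bool

// lit matches a fixed string at the current offset.
func lit(want string) parser {
	return func(s *state) bool {
		if !strings.HasPrefix(s.input[s.pos:], want) {
			s.err = fmt.Sprintf("offset %d: expected %s", s.pos, want)
			return false
		}
		s.pos += len(want)
		return true
	}
}

// alpha matches one or more lowercase ASCII letters.
func alpha(s *state) bool {
	start := s.pos
	for s.pos < len(s.input) && 'a' <= s.input[s.pos] && s.input[s.pos] <= 'z' {
		s.pos++
	}
	if s.pos == start {
		s.err = fmt.Sprintf("offset %d: expected a-z", s.pos)
	}
	return s.pos > start
}

// cut commits to the current branch by recording the current offset.
func cut(s *state) bool {
	s.cut = s.pos
	return true
}

// seq runs each parser in order and fails as soon as one fails.
func seq(ps ...parser) parser {
	return func(s *state) bool {
		for _, p := range ps {
			if !p(s) {
				return false
			}
		}
		return true
	}
}

// anyOf tries each alternative from the same offset. A failure after a cut
// set past that offset is final: the remaining alternatives are skipped.
func anyOf(ps ...parser) parser {
	return func(s *state) bool {
		start := s.pos
		for _, p := range ps {
			if p(s) {
				return true
			}
			if s.cut > start {
				return false
			}
			s.pos = start
		}
		return false
	}
}

// parseMany applies item repeatedly and describes the outcome as text.
func parseMany(item parser, input string) string {
	s := &state{input: input}
	for s.pos < len(input) {
		start := s.pos
		if !item(s) {
			if s.cut > start {
				return s.err // committed by a cut: report the real failure
			}
			return "left unparsed: " + input[start:]
		}
	}
	return "ok"
}

func main() {
	nocut := anyOf(seq(lit("<"), alpha, lit(">")), alpha)
	withCut := anyOf(seq(lit("<"), cut, alpha, lit(">")), alpha)
	fmt.Println(parseMany(nocut, "asdf<foo"))   // left unparsed: <foo
	fmt.Println(parseMany(withCut, "asdf<foo")) // offset 8: expected >
}
```

The offset differs from the example above only because this sketch uses an input without the space.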
### prior art
Inspired by https://github.com/prataprc/goparsec