# Creating a simple text parser/compiler

## Main Question or Discussion Point

I need to create a simple text parser that replaces {tokens} with text from a table.
The one complication is the {?text?} tag: the text must be included only if the preceding AND the following tokens exist and are non-empty. You can also write {?text} (or {text?}), which looks for a non-empty neighbouring tag only on that side.
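For concreteness, the plain {token} part of this can be sketched in a few lines (a hypothetical Python illustration, not the code discussed later in the thread; the name `substitute` is invented):

```python
import re

# Minimal sketch of plain {token} substitution; missing tokens become "".
# The conditional {?text?} tags are the harder part the thread is about.
def substitute(template, table):
    return re.sub(r"\{([^{}]+)\}",
                  lambda m: table.get(m.group(1), ""),
                  template)

substitute("Hello {name}!", {"name": "world"})  # "Hello world!"
```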

I have already bodged it based on common sense and an hour with the debugger.

What I don't know is how to approach this in a formal manner.
Like starting with a formal EBNF specification, perhaps creating the correct Abstract Syntax Tree, or however one goes about (what I assume can be classified as) creating a new language.

As you can probably guess I have no formal CS education, but I am interested in learning about this in a more structured manner (whatever 'this' is - language/compiler design?).

QuantumQuest
Gold Member
> I have already bodged it based on common sense and an hour with the debugger.
Do you mean you have created a complete working program on your own or some working pieces?

> What I don't know is how to approach this in a formal manner.
> Like starting with a formal EBNF specification, perhaps creating the correct Abstract Syntax Tree, or however one goes about (what I assume can be classified as) creating a new language.
>
> As you can probably guess I have no formal CS education, but I am interested in learning about this in a more structured manner (whatever 'this' is - language/compiler design?).
In order to construct a compiler there are various interdisciplinary things to learn. If you already have a decent formal knowledge of programming (by "formal" I mean knowing all the needed algorithmic material and having good knowledge / expertise in at least one programming language), what you need is some automata theory - I strongly recommend taking a look at Stanford's free online course taught by Jeff Ullman, along with a good textbook on the subject. You have to devote enough time to learn it. You must also - this is important - have a good knowledge of undergraduate math (for details see the prerequisites of the course I recommend). Then you can go on to a course on compilers; I strongly recommend the free one from Stanford. Now, while you can go directly to a compilers course (the one I recommend, for instance, has no prerequisites), I think - based on my personal experience (I have a formal CS education) - that it is better to learn automata first, along with the other "peripheral" things it requires, provided that you want deeper knowledge. If that is not the case, go directly to a compilers course.

I don't recommend reading just some textbooks on your own because the subjects involved are fairly complex; it is much better to learn in the interactive environment of an online course and complement it with your own reading(s).

> Do you mean you have created a complete working program on your own or some working pieces?
Insomuch as you can call a single function "a program", yes. Here it is. The flavour is technically PHP, but there's nothing special in there. Anyone who can read C-like code should understand what's going on easily. Would welcome improvements.

> In order to construct a compiler there are various interdisciplinary things to learn. If you already have a decent formal knowledge of programming, what you need is some automata theory - I strongly recommend taking a look at Stanford's free online course taught by Jeff Ullman, along with a good textbook on the subject. You must also have a good knowledge of undergraduate math. Then you can go on to a course on compilers; I strongly recommend the free one from Stanford. While you can go directly to a compilers course, it is better to learn automata first, provided that you want deeper knowledge.
>
> I don't recommend reading just some textbooks on your own because the subjects involved are fairly complex; it is much better to learn in the interactive environment of an online course and complement it with your own reading(s).
I'd like to call myself a programmer, but these days I'm really more of a code monkey. I do however have enough knowledge in plenty of C-like languages, and some understanding of algorithms - I can write you a basic DFS or construct you a tree recursively without much effort. At some point I may have been okay with undergrad math, but these days it's rusty (a recommendation here might also be good). Thanks for the course suggestions; those were a large part of what I was looking for from this thread.

QuantumQuest
Gold Member
> At some point I may have been okay with undergrad math, but these days it's rusty (a recommendation here might also be good).
For automata you must be comfortable with proofs. As for specific math topics, you must know graphs, trees, etc., and also logic. You may also want to refresh / expand your knowledge of algorithms and data structures; I would recommend the excellent Stanford online course "Algorithms: Design and Analysis" taught by Tim Roughgarden.

I must note that I personally attended all the online courses I recommend some years ago on Coursera, as refreshers, so I know about their topics / quality. As for logic and proofs, there are also good online courses, but I think that if you have the basic knowledge you can do fine with some reading.

Sounds like you want something called a regular expression. There are lots of libraries that use various types of Regex formats, so depending on the library, you may have different syntax, but it'll certainly do what you need it to. You have Boost for C++, PCRE in PHP...

Stephen Tashi
> What I don't know is how to approach this in a formal manner.
> Like starting with a formal EBNF specification, perhaps creating the correct Abstract Syntax Tree, or however one goes about (what I assume can be classified as) creating a new language.
A practical manner (as opposed to a formal manner), is to use software tools that generate parsers and compilers from sets of rules given as input - rules that must be expressed in the language used by that software. Unfortunately, such software has a "steep learning curve".

It's an interesting question what level of generality describes the formal structure of programs like yacc, lex, and bison.

jim mcnamara
Mentor
What you describe seems to be something a stream editor like UNIX/Linux sed was designed to handle. What platform are you on and, more importantly, what platform does this code have to run on? I am assuming it is not an ad hoc job.

newjerseyrunner
jbriggs444
Homework Helper
2019 Award
I am not clear on the problem statement.

Suppose that we have the text lookup table:

EMPTY : ""
FULL : "full"
RECURSE : "{RECURSE}"

and you have the input string:

{EMPTY} {?FULL}

Does {?FULL} check that the token {EMPTY} exists in the table, or does it check the empty "" result after replacement?

{FULL?} {?FULL}

Does that parse out as "full full" or does it parse out as "empty empty"? Or both!!

FactChecker
Gold Member
On a Unix/Linux machine, you should look at things like Yacc, awk, sed, Perl, and Python. Some may be appropriate, or they may be overkill.
IMO Perl is better than awk, sed, and Python for regular expressions / text manipulation.
Consider Yacc if you are really doing heavy-duty parsing of a language. Perl and Python have extensions that are 'Yacc-like'.
Perl is often already somewhere on Windows machines and the others are fairly easy to install.

jack action
Gold Member
I'm suggesting regular expressions as well. To learn about regular expressions and do some debugging, I like regex101.com. This is my solution to your problem (the conditions might not give the output you want, but you'll get the idea):
```php
<?php
$format = "This is {token1} and this is {token2} and {token3}{? text1} and {text2 ?}{token4} and {token5}{? text3 ?}{token6}.";
$tokens = [
    "token1" => "token #1",
    "token2" => "token #2",
    "token3" => "token #3",
    "token4" => "token #4",
    "token5" => "token #5",
    "token6" => "token #6"
];

function format($format, $tokens){
    $pattern = "/{([^}]+)}{\?([^}]+)\?}{([^}]+)} (?# finds {token1}{?text?}{token2})
                |{([^}]+)(\?}){([^}]+)}          (?# finds {text?}{token})
                |{([^}]+)}({\?)([^}]+)}          (?# finds {token}{?text})
                |{([^}]+)}                       (?# finds {token})
                |[^{]+                           (?# finds anything else)
                /x";
    preg_match_all($pattern, $format, $matches);

    $result = "";
    // $matches[0] contains the entire string with tokens separated:
    /*
    array (size=11)
      0 => string 'This is ' (length=8)
      1 => string '{token1}' (length=8)
      2 => string ' and this is ' (length=13)
      3 => string '{token2}' (length=8)
      4 => string ' and ' (length=5)
      5 => string '{token3}{? text1}' (length=17)
      6 => string ' and ' (length=5)
      7 => string '{text2 ?}{token4}' (length=17)
      8 => string ' and ' (length=5)
      9 => string '{token5}{? text3 ?}{token6}' (length=27)
      10 => string '.' (length=1)
    */
    foreach($matches[0] as $key => $match){
        if(substr($match, 0, 1) != "{"){ // It is not a token, write it to the result directly
            $result .= $match;
        }
        else{ // It is a token

            // $tags will contain the info about the token tag(s). They are stored
            // in $matches[X][$key], where X is not 0 and $matches[X][$key] is not empty.
            $tags = [];
            foreach($matches as $key2 => $tag){
                if($key2 != 0 && $tag[$key] != ""){
                    $tags[] = $tag[$key];
                }
            }

            $temp = "";
            /* 4 possibilities for $tags:

            array (size=1)
              0 => string 'token1' (length=6)

            array (size=3)
              0 => string 'token3' (length=6)
              1 => string '{?' (length=2)
              2 => string ' text1' (length=6)

            array (size=3)
              0 => string 'text2 ' (length=6)
              1 => string '?}' (length=2)
              2 => string 'token4' (length=6)

            array (size=3)
              0 => string 'token5' (length=6)
              1 => string ' text3 ' (length=7)
              2 => string 'token6' (length=6)
            */

            if(count($tags) === 1){ // there is only one tag, thus no conditional text
                $temp = $tokens[$tags[0]] ?? "";
            }
            elseif($tags[1] == "?}"){ // left conditional text
                $temp = $tokens[$tags[2]] ?? "";
                if($temp != ""){
                    $temp = $tags[0].$temp;
                }
            }
            elseif($tags[1] == "{?"){ // right conditional text
                $temp = $tokens[$tags[0]] ?? "";
                if($temp != ""){
                    $temp .= $tags[2];
                }
            }
            else{ // left and right conditional text
                $temp  = $tokens[$tags[0]] ?? "";
                $temp2 = $tokens[$tags[2]] ?? "";
                if($temp != "" && $temp2 != ""){
                    $temp = $temp.$tags[1].$temp2;
                }
                else{ // at least one token wasn't found
                    $temp = "";
                }
            }
            if($temp == ""){
                $result .= "{token missing}";
            }
            else{ // add text replacing token(s) to result
                $result .= $temp;
            }
        }
    }
    return $result;
}

var_dump(format($format, $tokens));
```

FactChecker
Activity!

Picking up from when I was last here:
1. Regexes. I am familiar with those. I just don't see how that would work. Given all the intricacies, it gets too hairy too fast. Actually it starts looking like parsing HTML with regex. And you know what they say about t̩̘̹̯̽̿ͦ̆̇ͮh̬̦̜͙̦͇̱͘ã̪̦̜̗̖͉͔͠t̞̱̳̲̓.

@jack action: What about multiple lookbehind or lookahead tags? Or normal text in between? Or malformed tags?
@jbriggs444: Given your set of data:
"{EMPTY} {?FULL}" = " " (if empty is empty/doesn't contain a string, we skip conditionals that rely on it)
"{FULL?} {?FULL}" = " " (no basic tags to match against)
"{FULL?} {FULL} {?FULL}" = "FULL full FULL" (matched both conditional tags, so put the text that was in there + the basic substitution tag)

2. YACC - sounds like overkill. Also above my head. The intent is for this to be an exceptionally simple syntax, though depending on the details you could define it to do a number of more complex things (like allowing recursion).

3. sed and platform-specific tools: This is mostly my own academic interest, but a production version of this would run on Linux and PHP (which is why I started there). Also, sed is neat on general principle.

In the end I took a two-fold approach: 1. tokenize (i.e. this is a tag, this is plain text, this may look like a tag but ended up being malformed and is thus plain text, etc.); 2. compile (process the tokens). It also allows me to make recursive tags work easily.
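That two-step structure can be sketched roughly like this (a hypothetical Python illustration of the tokenize-then-compile idea, not martix's actual function; conditional tags are left out to keep the sketch short):

```python
def tokenize(s):
    """Phase 1: classify the input into ("text", ...) and ("tag", ...)
    pieces. Anything that looks like a tag but is malformed (e.g. an
    unclosed '{') falls back to plain text."""
    tokens, i = [], 0
    while i < len(s):
        if s[i] == "{":
            end = s.find("}", i)
            if end == -1:                    # malformed: no closing brace
                tokens.append(("text", s[i:]))
                break
            tokens.append(("tag", s[i + 1:end]))
            i = end + 1
        else:
            nxt = s.find("{", i)
            if nxt == -1:
                nxt = len(s)
            tokens.append(("text", s[i:nxt]))
            i = nxt
    return tokens

def compile_tokens(tokens, table, depth=8):
    """Phase 2: substitute tag tokens from the table. Re-tokenizing the
    replacement (bounded by `depth`) is what makes recursive tags work."""
    out = []
    for kind, value in tokens:
        if kind == "text":
            out.append(value)
        else:
            repl = table.get(value, "")
            if "{" in repl and depth > 0:    # replacement contains a tag
                repl = compile_tokens(tokenize(repl), table, depth - 1)
            out.append(repl)
    return "".join(out)
```

The depth bound also guards against a self-referential table entry like `RECURSE: "{RECURSE}"` from jbriggs444's example looping forever.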

FactChecker
Gold Member
Regular expressions usually work fine if you do not have nested expressions to parse. If you have that, Yacc starts to be the better choice.

jack action
Gold Member
> What about multiple lookbehind or lookahead tags? Or normal text in between? Or malformed tags?
Because your problem asks to check whether a token exists before doing anything, the solution has to be separated into two distinct components:
1. You need to parse the text to identify the tokens;
2. You need some logic to decide how to replace the text according to the tokens found.
The second part will be much the same no matter how you parse the text. The problem you asked about in this thread is about the parser in the first part.

You can do it by reading one character at a time, which is the safe way. The more complicated your grammar rules are (allowing escaped characters and nested tags, for example) and/or the longer your input strings, the better that choice is.

If you keep the grammar rules simple, then they can easily be identified with regexes. Since your thread title asks for a «simple text parser», it seems to fit that criterion.

For multiple lookbehind or lookahead tags, it is just a matter of adding a '+' here and there. The logic to replace the text properly is done in part 2, whatever type of parser you use.
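As a rough illustration of the '+' point (in Python rather than the thread's PHP; both patterns are invented for the example):

```python
import re

# Matching exactly one tag vs. a RUN of adjacent tags: quantifying the
# tag group with '+' makes the whole run one match.
one  = re.compile(r"\{[^{}]+\}")        # exactly one {tag}
many = re.compile(r"(?:\{[^{}]+\})+")   # one or more adjacent tags

s = "{a}{b}{c} and {d}"
one.findall(s)   # ['{a}', '{b}', '{c}', '{d}']
many.findall(s)  # ['{a}{b}{c}', '{d}']
```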

About «normal text in between», I'm not following. There is one tag alone or a succession of tags; anything else is «normal text in between», isn't it?

About malformed tags, well, isn't that a user problem? The grammar rules (whatever they may be) must be respected by the user. Malformed tags will be malformed with any parser. Garbage in = garbage out.

This did seem the most logical approach to me, and in the end a lot of people suggested it (you included).

A normal-text example is "{cond?} text {tag}". For malformed tags, yes, but it matters how you process them - for example "{thing}}": how do you tokenize that? Greedily or conservatively?
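The "{thing}}" question can be made concrete (a hypothetical Python illustration; neither pattern is from the thread's code):

```python
import re

# A conservative pattern stops at the first closing brace, leaving the
# stray "}" as plain text; a greedy pattern swallows up to the LAST "}".
conservative = re.compile(r"\{[^{}]*\}")   # stops at the first "}"
greedy       = re.compile(r"\{.*\}")       # .* is greedy: runs to the last "}"

conservative.search("{thing}}").group()  # '{thing}'
greedy.search("{thing}}").group()        # '{thing}}'
```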

And while I did say simple, both requirements and my own knowledge change, so...

I think I'll be starting that automata theory course. Maybe at some point it will warm me up to reading that terse YACC documentation. :)

jbriggs444
Homework Helper
2019 Award
"{EMPTY} {?FULL}" = " " (if empty is empty/doesn't contain a string, we skip conditionals that rely on it)
Good. I was probing there to see whether it was the token name or the token value that mattered. The answer is that it is the token value.

"{FULL?} {?FULL}" = " " (no basic tags to match against)
Ahhh, so you distinguish between a "token" in braces and a conditional text string, also in braces. The syntax is similar, but the semantics are different.
