Tokenizing a String: Regular Expression & Boost Solutions

FrostScYthe
Hi everyone,

I'm seeking advice on how to approach this problem: I have a query string that I need to tokenize. For example, I want to turn

(animal cat & food meat) | (animal cow)

into the following array:

[ ( ][ animal ][ cat ][ & ][ food ][ meat ][ ) ][ | ][ ( ][ animal ][ cow ][ ) ]

- I want to treat any run of consecutive whitespace (one or more characters) as a delimiter.

- However, the parentheses, even when not separated by whitespace, should each be treated as a separate token.

I'm wondering whether this can be done with regular expressions, using Boost. I'm using this for now, but it only works with a single space as the delimiter:

Code:
boost::regex re(" ");
boost::sregex_token_iterator
    p(query.begin( ), query.end( ), re, -1);

Or maybe there's an easier approach I don't know about; feel free to throw in ideas :)
 
I am not familiar with the exact capabilities of the Boost regex library. However, if I had to do this in Perl or Python, I would write a regexp that matches the non-whitespace tokens you are trying to collect, and then tell the regexp engine to return a list of all the matches. (There should be some way to do this in any regexp library.)

For example the regexp:

"(\w+|\||\&|\(|\))"

The |s ensure that a token will consist of "a string of letters" or "a symbol" but that the two will not be mixed in a single match. If you applied that regexp to your string, you would receive the list of tokens you desire above.
 
Cool, that has worked great so far. But what if I want to make sure that the string I am tokenizing has a matching set of parentheses?

Code:
boost::regex re("\\&|\\||\\(|\\)|\\w+");
boost::sregex_token_iterator
    p(s.begin( ), s.end( ), re);   // default: iterate over the matches themselves
boost::sregex_token_iterator end;  // default-constructed iterator marks the end
while (p != end)
    std::cout << *p++ << std::endl;
 
I suggest just iterating over the tokens the regex gives you and counting the parentheses. Set a counter to 0 and look at each token: every time you see a "(", increment the counter; every time you see a ")", decrement it. If the counter ever goes below 0, return failure. If you get to the end of the tokens and the counter is more than zero, return failure. If you reach the end and neither of these things happens, you have a balanced set of parentheses.

Sometimes it is possible to try to do too much using regexes only...
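A minimal sketch of that counting check, assuming the tokens have already been collected into a std::vector<std::string> (the function name is illustrative):

```cpp
#include <string>
#include <vector>

// Returns true if every ")" closes an earlier "(" and none are left open.
bool parens_balanced(const std::vector<std::string>& tokens)
{
    int depth = 0;
    for (const std::string& t : tokens) {
        if (t == "(") ++depth;
        else if (t == ")") --depth;
        if (depth < 0) return false;   // a ")" with no matching "("
    }
    return depth == 0;                 // a leftover "(" also means failure
}
```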
 
The classical way to make sure that parentheses match is to use a "stack" data structure. Each time you encounter an open paren, push it onto the stack, and then when you encounter the closing paren, pop the stack. That way, the stack should be empty by the time you parse to the end of the string.
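For a single bracket type the stack behaves exactly like a counter, but it generalizes to mixed bracket kinds such as "(" and "[". A sketch with std::stack (the function name is illustrative):

```cpp
#include <stack>
#include <string>

// Returns true if the '(', ')', '[', ']' characters in s nest correctly.
// All other characters are ignored.
bool brackets_match(const std::string& s)
{
    std::stack<char> open;
    for (char c : s) {
        if (c == '(' || c == '[') {
            open.push(c);                   // remember the open bracket
        } else if (c == ')' || c == ']') {
            if (open.empty()) return false; // close with nothing open
            char expected = (c == ')') ? '(' : '[';
            if (open.top() != expected) return false; // wrong kind, e.g. "([)]"
            open.pop();
        }
    }
    return open.empty();                    // everything opened was closed
}
```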

What language are you working with? Modern languages such as Java have the string "split" function. You could first split the main string on spaces, then check each token to see if it begins/ends with ( or ). Might work.

While regexes can always be made to work, when you come back to them six months from now you won't have the slightest idea what they do. So if you do use regex, document well!
 
So if you do use regex, document well!

Thanks for the advice.

I was also wondering: how can I check that a statement is actually valid? Something says "parser", but I don't know how to implement one. Does someone know of a simple parser for adding and subtracting implemented in C++, or an easy way to do this?

Something to determine that:

3 + 5 is a valid statement

and

3 5 is an invalid statement.
 
FrostScYthe said:
How can I check that a statement is actually valid? ... Something to determine that 3 + 5 is a valid statement and 3 5 is an invalid statement.

As you are reading in each token, you'll want to compare it to the previous token. If an operator follows another operator simply throw an error. Likewise with operands. However, this won't work for implicit multiplication.
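That previous-token check for the simple +/- case can be sketched like this, assuming the input has already been tokenized into a std::vector<std::string> (names are illustrative):

```cpp
#include <cctype>
#include <string>
#include <vector>

// True if t consists entirely of digits (a simple unsigned integer operand).
static bool is_number(const std::string& t)
{
    if (t.empty()) return false;
    for (char c : t)
        if (!std::isdigit(static_cast<unsigned char>(c))) return false;
    return true;
}

// Valid iff tokens alternate number, operator, number, ...,
// starting and ending with a number, e.g. {"3", "+", "5"}.
bool valid_expression(const std::vector<std::string>& tokens)
{
    bool expect_number = true;
    for (const std::string& t : tokens) {
        if (expect_number) {
            if (!is_number(t)) return false;        // operator where an operand belongs
        } else {
            if (t != "+" && t != "-") return false; // two operands in a row, e.g. "3 5"
        }
        expect_number = !expect_number;
    }
    // must be non-empty and end right after a number, not after an operator
    return !tokens.empty() && !expect_number;
}
```

This is effectively a two-state finite state machine: the expect_number flag is the state, and each token either advances it or rejects the input.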
 
One way to implement a parser is to model it as a finite state machine.
 
FrostScYthe said:
How can I check that a statement is actually valid? ... Something to determine that 3 + 5 is a valid statement and 3 5 is an invalid statement.

I believe the canonical way to create a parser in C++ is to just use Yacc. (I have not used it myself, though I have used an equivalent in Java, so I am not sure how easy it is or whether there is a better alternative.) Yacc basically lets you specify a context-free grammar and generates a parser that builds a parse tree from that grammar.
 
FrostScYthe said:
I have a query that I need to tokenize... I'm wondering whether this can be done with regular expressions, using Boost. Or maybe there's an easier approach I don't know about; feel free to throw in ideas :)

If you're going to parse something using a specific grammar, you're probably better off using something already developed, like Yacc.

If you want to write one yourself, research BNF and EBNF grammars and compiler construction. If you want to hard-code a fairly simple tokenizer, you could probably modify an existing open-source string library and write the routines yourself in under an hour.
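As a middle ground between Yacc and ad-hoc checks, a grammar like expr := NUMBER (('+' | '-') NUMBER)* can be hand-coded as a tiny recursive-descent parser. A sketch under that grammar, operating on an already-tokenized input (all names are illustrative):

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Recognizes the grammar:  expr := NUMBER (('+' | '-') NUMBER)*
// Returns true only if the whole token list matches, so "3 5" is rejected.
class Parser {
public:
    explicit Parser(std::vector<std::string> tokens)
        : toks_(std::move(tokens)) {}

    bool parse_expr()
    {
        if (!number()) return false;          // expr must start with a number
        while (peek() == "+" || peek() == "-") {
            ++pos_;                           // consume the operator
            if (!number()) return false;      // operator must be followed by a number
        }
        return pos_ == toks_.size();          // no trailing tokens, e.g. "3 5"
    }

private:
    std::string peek() const
    {
        return pos_ < toks_.size() ? toks_[pos_] : std::string();
    }

    bool number()                             // consume one all-digit token
    {
        const std::string t = peek();
        if (t.empty()) return false;
        for (char c : t)
            if (!std::isdigit(static_cast<unsigned char>(c))) return false;
        ++pos_;
        return true;
    }

    std::vector<std::string> toks_;
    std::size_t pos_ = 0;
};
```

Each grammar rule becomes one member function, which is why this style scales naturally when the grammar grows (parenthesized sub-expressions become a recursive call).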
 