Tokenizing a String: Regular Expression & Boost Solutions

FrostScYthe
Hi everyone,

I'm seeking advice on how to approach this problem: I have a query string that I need to tokenize. For example, I want to turn

(animal cat & food meat) | (animal cow)

into the following array:

[ ( ][ animal ][ cat ][ & ][ food ][ meat ][ ) ][ | ][ ( ][ animal ][ cow ][ ) ]

- I want to treat any run of consecutive whitespace (one or more characters) as a delimiter.

- However, the parentheses, even when not separated by whitespace, should each be treated as a separate token.

I'm wondering whether this can be done with regular expressions, using Boost. I'm using this for now, but it only works with a single space as the delimiter:

Code:
boost::regex re(" ");
boost::sregex_token_iterator
    p(query.begin( ), query.end( ), re, -1);

Or maybe there's an easier approach I don't know about; feel free to throw in ideas :)
 
I am not familiar with the exact capabilities of the Boost regex library. However, if I had to do this in Perl or Python, I would write a regexp that matches the non-whitespace tokens you are trying to collect, and then tell the regexp engine to return a list of all the matches. (There should be some way to do this in any regexp library.)

For example the regexp:

"(\w+|\||\&|\(|\))"

The |s ensure that a token will consist of "a string of letters" or "a symbol" but that the two will not be mixed in a single match. If you applied that regexp to your string, you would receive the list of tokens you desire above.
 
Cool, that has worked great so far. But what if I want to make sure that the string I am tokenizing has a matching set of parentheses?

Code:
boost::regex re("\\&|\\||\\(|\\)|\\w+");
boost::sregex_token_iterator
    p(s.begin( ), s.end( ), re);   // default: iterate over the matches themselves
boost::sregex_token_iterator end;  // default-constructed iterator marks the end
while (p != end)
    std::cout << *p++ << std::endl;
 
I suggest just iterating over the tokens the regex gives you and counting the parentheses. Set a counter to 0 and look at each token: every time you see a "(", increment the counter; every time you see a ")", decrement it. If the counter ever goes below 0, return failure. If you get to the end of the tokens and the counter is more than zero, return failure. If you reach the end and neither of these things happens, you have a balanced set of parentheses.

Sometimes it is possible to try to do too much using regexes only...
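A minimal sketch of that counting check, assuming the tokens have already been collected into a std::vector<std::string> (the function name is illustrative):

```cpp
#include <string>
#include <vector>

// Returns true if every ")" closes an earlier "(" and none are left open.
bool parens_balanced(const std::vector<std::string>& tokens)
{
    int depth = 0;
    for (const std::string& t : tokens) {
        if (t == "(") ++depth;
        else if (t == ")") --depth;
        if (depth < 0) return false;   // a ")" with no matching "("
    }
    return depth == 0;                 // a leftover "(" also means failure
}
```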
 
The classical way to make sure that parentheses match is to use a "stack" data structure. Each time you encounter an open paren, push it onto the stack, and then when you encounter the closing paren, pop the stack. That way, the stack should be empty by the time you parse to the end of the string.
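For a single bracket type the stack behaves exactly like a counter, but it generalizes to mixed bracket kinds such as "(" and "[". A sketch with std::stack (the function name is illustrative):

```cpp
#include <stack>
#include <string>

// Returns true if the '(', ')', '[', ']' characters in s nest correctly.
// All other characters are ignored.
bool brackets_match(const std::string& s)
{
    std::stack<char> open;
    for (char c : s) {
        if (c == '(' || c == '[') {
            open.push(c);                   // remember the open bracket
        } else if (c == ')' || c == ']') {
            if (open.empty()) return false; // close with nothing open
            char expected = (c == ')') ? '(' : '[';
            if (open.top() != expected) return false; // wrong kind, e.g. "([)]"
            open.pop();
        }
    }
    return open.empty();                    // everything opened was closed
}
```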

What language are you working with? Modern languages such as Java have the string "split" function. You could first split the main string on spaces, then check each token to see if it begins/ends with ( or ). Might work.

While regexes can always be made to work, when you come back to them six months from now you won't have the slightest idea what they do. So if you do use regex, document well!
 
So if you do use regex, document well!

Thanks for the advice.

I was also wondering: how can I check that a statement is actually valid? Something says "parser", but I don't know how to implement one. Does someone know of a simple parser for adding and subtracting implemented in C++, or an easy way to do this?

Something to determine that:

3 + 5 is a valid statement

and

3 5 is an invalid statement.
 
FrostScYthe said:
How can I check that a statement is actually valid? ... Something to determine that 3 + 5 is a valid statement and 3 5 is an invalid statement.

As you are reading in each token, you'll want to compare it to the previous token. If an operator follows another operator simply throw an error. Likewise with operands. However, this won't work for implicit multiplication.
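That previous-token check for the simple +/- case can be sketched like this, assuming the input has already been tokenized into a std::vector<std::string> (names are illustrative):

```cpp
#include <cctype>
#include <string>
#include <vector>

// True if t consists entirely of digits (a simple unsigned integer operand).
static bool is_number(const std::string& t)
{
    if (t.empty()) return false;
    for (char c : t)
        if (!std::isdigit(static_cast<unsigned char>(c))) return false;
    return true;
}

// Valid iff tokens alternate number, operator, number, ...,
// starting and ending with a number, e.g. {"3", "+", "5"}.
bool valid_expression(const std::vector<std::string>& tokens)
{
    bool expect_number = true;
    for (const std::string& t : tokens) {
        if (expect_number) {
            if (!is_number(t)) return false;        // operator where an operand belongs
        } else {
            if (t != "+" && t != "-") return false; // two operands in a row, e.g. "3 5"
        }
        expect_number = !expect_number;
    }
    // must be non-empty and end right after a number, not after an operator
    return !tokens.empty() && !expect_number;
}
```

This is effectively a two-state finite state machine: the expect_number flag is the state, and each token either advances it or rejects the input.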
 
One way to implement a parser is to model it as a finite state machine.
 
FrostScYthe said:
How can I check that a statement is actually valid? ... Something to determine that 3 + 5 is a valid statement and 3 5 is an invalid statement.

I believe the canonical way to create a parser in C++ is to just use Yacc. (I have not used it myself, though I have used an equivalent in Java, so I am not sure how easy it is or whether there is a better alternative.) Yacc basically lets you specify a context-free grammar and generates a parser that builds a parse tree from that grammar.
 
FrostScYthe said:
I have a query that I need to tokenize... I'm wondering whether this can be done with regular expressions, using Boost. Or maybe there's an easier approach I don't know about; feel free to throw in ideas :)

If you're going to parse something using a specific grammar, you're probably better off using something already developed, like Yacc.

If you want to write one yourself, research BNF and EBNF grammars and compiler construction. If you want to hard-code a fairly simple tokenizer, you could probably modify an existing open-source string library and write the routines yourself in under an hour.
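As a middle ground between Yacc and ad-hoc checks, a grammar like expr := NUMBER (('+' | '-') NUMBER)* can be hand-coded as a tiny recursive-descent parser. A sketch under that grammar, operating on an already-tokenized input (all names are illustrative):

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Recognizes the grammar:  expr := NUMBER (('+' | '-') NUMBER)*
// Returns true only if the whole token list matches, so "3 5" is rejected.
class Parser {
public:
    explicit Parser(std::vector<std::string> tokens)
        : toks_(std::move(tokens)) {}

    bool parse_expr()
    {
        if (!number()) return false;          // expr must start with a number
        while (peek() == "+" || peek() == "-") {
            ++pos_;                           // consume the operator
            if (!number()) return false;      // operator must be followed by a number
        }
        return pos_ == toks_.size();          // no trailing tokens, e.g. "3 5"
    }

private:
    std::string peek() const
    {
        return pos_ < toks_.size() ? toks_[pos_] : std::string();
    }

    bool number()                             // consume one all-digit token
    {
        const std::string t = peek();
        if (t.empty()) return false;
        for (char c : t)
            if (!std::isdigit(static_cast<unsigned char>(c))) return false;
        ++pos_;
        return true;
    }

    std::vector<std::string> toks_;
    std::size_t pos_ = 0;
};
```

Each grammar rule becomes one member function, which is why this style scales naturally when the grammar grows (parenthesized sub-expressions become a recursive call).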
 