Tokenizing a String: Regular Expression & Boost Solutions

  • Thread starter Thread starter FrostScYthe
  • Start date Start date
  • Tags Tags
    String
Click For Summary

Discussion Overview

The discussion revolves around the problem of tokenizing a string using regular expressions, particularly with the Boost library in C++. Participants explore various methods to achieve tokenization while ensuring that parentheses are treated as separate tokens and that whitespace is handled correctly. The conversation also touches on validating expressions and implementing a parser for arithmetic operations.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant seeks advice on tokenizing a string with specific requirements regarding whitespace and parentheses.
  • Another participant suggests using a regular expression that captures non-whitespace tokens and symbols, providing an example regex.
  • A later reply discusses ensuring matching parentheses by counting tokens and suggests a method to validate the balance of parentheses.
  • Another participant proposes using a stack data structure to manage parentheses matching, mentioning the use of a "split" function in modern languages.
  • One participant expresses interest in checking the validity of arithmetic statements and seeks advice on implementing a simple parser.
  • Another participant mentions modeling a parser as a finite state machine as a potential approach.
  • One participant recommends using Yacc for creating a parser, discussing its ability to specify context-free grammar.
  • A participant reiterates the initial tokenization problem, suggesting that using an existing library or researching grammar specifications may be beneficial.

Areas of Agreement / Disagreement

Participants present multiple competing views on how to approach tokenization and validation of expressions. There is no consensus on a single method, and various strategies are discussed, indicating a range of opinions on the best approach.

Contextual Notes

Participants mention limitations related to regex complexity and the potential difficulty of understanding regex patterns after some time. There are also references to the need for clear documentation if regex is used.

Who May Find This Useful

This discussion may be useful for programmers and developers interested in string manipulation, tokenization, and parsing techniques, particularly in C++ and using regular expressions.

FrostScYthe
Messages
80
Reaction score
0
Hi everyone,

I'm seeking some advice on how to approach this problem, I have a query that I need to tokenize. Here are a couple of examples

(animal cat & food meat) | (animal cow) into the following array:

[ ( ][ animal ][ cat ][ & ][ food ][ meat ][ ) ][ | ][ ( ][ animal ][ cow ][ ) ]

- I want to treat all consecutive whitespace (one or more) as a delimiter.

- However the parenthesis, even though not sepparated by whitespace, should be treated as a sepparate token.

I'm wondering can this be done with regular expressions, using boost. I am using this for now, but it only works for 1 whitespace as a delimiter.

Code:
boost::regex re(" ");                         			
boost::sregex_token_iterator                  			
    p(query.begin( ), query.end( ), re, -1);

Or maybe there's an easier approach I don't know, feel free to throw in ideas :)
 
Technology news on Phys.org
I am not familiar with the exact capabilities of the Boost RE library. However if I had to do this in perl or python what I would do is make a regexp that fits the non-whitespace tokens you are trying to collect, and then tell the regexp to return a list of the matches to the regexp. (There should be some way to do this in any regexp library.)

For example the regexp:

"(\w+|\||\&|\(|\))"

The |s ensure that a token will consist of "a string of letters" or "a symbol" but that the two will not be mixed in a single match. If you applied that regexp to your string, you would receive the list of tokens you desire above.
 
Cool, that has worked great so far, but what If I want to make sure that the string I am tokenizing has a matching set of parenthesis.

Code:
boost::regex re("\\&|\\||\\(|\\)|\\w+");
boost::sregex_token_iterator            
    p(s.begin( ), s.end( ), re);        
boost::sregex_token_iterator end;
 
I suggest just iterating over the tokens the regex gives you and counting the parenthesis. Set some counter to 0. Look at each token, every time you see a "(", increment the counter, every time you see a ")", decrement the counter. If the counter ever goes below 0, return failure. If you get to the end of the tokens and the counter is MORE than zero, return failure. If you reach the end and neither of these things happen then you have a balanced set of parenthesis.

Sometimes it is possible to try to do too much using regexes only...
 
The classical way to make sure that parentheses match is to use a "stack" data structure. Each time you encounter an open paren, push it onto the stack, and then when you encounter the closing paren, pop the stack. That way, the stack should be empty by the time you parse to the end of the string.

What language are you working with? Modern languages such as Java have the string "split" function. You could first split the main string on spaces, then check each token to see if it begins/ends with ( or ). Might work.

While regex's can always be made to work, when you come back to them 6 months from now, you won't have the slightest idea what they do. So if you do use regex, document well!
 
So if you do use regex, document well!

Thanks for the advice.

I was also wondering: how can I check that a statement is actually valid.. something yells parser.. but i don't know how to implement this. I wonder does someone know of a simple parser for adding and substracting implemented in C++, or a way to do this easilly.

Something to determine that:

3 + 5 is a valid statement

and

3 5 is an invalid statement.
 
FrostScYthe said:
Thanks for the advice.

I was also wondering: how can I check that a statement is actually valid.. something yells parser.. but i don't know how to implement this. I wonder does someone know of a simple parser for adding and substracting implemented in C++, or a way to do this easilly.

Something to determine that:

3 + 5 is a valid statement

and

3 5 is an invalid statement.

As you are reading in each token, you'll want to compare it to the previous token. If an operator follows another operator simply throw an error. Likewise with operands. However, this won't work for implicit multiplication.
 
one way to implement a parser is to model it as a finite state machine
 
FrostScYthe said:
Thanks for the advice.

I was also wondering: how can I check that a statement is actually valid.. something yells parser.. but i don't know how to implement this. I wonder does someone know of a simple parser for adding and substracting implemented in C++, or a way to do this easilly.

Something to determine that:

3 + 5 is a valid statement

and

3 5 is an invalid statement.

I believe the canonical way for creating a parser in C++ is to just use Yacc. (I have not used it myself, though I have used an equivalent in Java, so I am not sure how easy this is or if there is a better equivalent.) Yacc basically allows you to specify a context-free grammar and match a parse tree out of that grammar.
 
  • #10
FrostScYthe said:
Hi everyone,

I'm seeking some advice on how to approach this problem, I have a query that I need to tokenize. Here are a couple of examples

(animal cat & food meat) | (animal cow) into the following array:

[ ( ][ animal ][ cat ][ & ][ food ][ meat ][ ) ][ | ][ ( ][ animal ][ cow ][ ) ]

- I want to treat all consecutive whitespace (one or more) as a delimiter.

- However the parenthesis, even though not sepparated by whitespace, should be treated as a sepparate token.

I'm wondering can this be done with regular expressions, using boost. I am using this for now, but it only works for 1 whitespace as a delimiter.

Code:
boost::regex re(" ");                         			
boost::sregex_token_iterator                  			
    p(query.begin( ), query.end( ), re, -1);

Or maybe there's an easier approach I don't know, feel free to throw in ideas :)

If your going to parse something using a specific grammar you're probably better off using something already developed like Yacc.

If you want to write one yourself research BNF and EBNF grammars and compilers. If you want to hard-code a fairly simple tokenizer you could probably modify an existing open source string library and do the routines yourself in under an hour.
 

Similar threads

  • · Replies 2 ·
Replies
2
Views
625
Replies
1
Views
2K
Replies
27
Views
6K