Tokenizing a String: Regular Expression & Boost Solutions

In summary, the conversation discusses different approaches to tokenizing a query while treating whitespace and parentheses as delimiters. It is suggested to use a regular expression that matches the non-whitespace tokens and to manually check for a balanced set of parentheses. A "stack" data structure is also mentioned as a way to ensure matching parentheses. The conversation also touches on the importance of documenting regexes and on parsing a statement for validity, with suggestions to use a finite state machine or a tool like Yacc.
  • #1
FrostScYthe
Hi everyone,

I'm seeking some advice on how to approach this problem: I have a query that I need to tokenize. Here is an example:

I want to turn (animal cat & food meat) | (animal cow) into the following array:

[ ( ][ animal ][ cat ][ & ][ food ][ meat ][ ) ][ | ][ ( ][ animal ][ cow ][ ) ]

- I want to treat any run of whitespace (one or more characters) as a single delimiter.

- However, the parentheses, even though not separated by whitespace, should each be treated as a separate token.

I'm wondering whether this can be done with regular expressions using Boost. I am using the following for now, but it only treats a single space as the delimiter.

Code:
boost::regex re(" ");                         			
boost::sregex_token_iterator                  			
    p(query.begin( ), query.end( ), re, -1);

Or maybe there's an easier approach I don't know about; feel free to throw in ideas. :)
 
  • #2
I am not familiar with the exact capabilities of the Boost RE library. However, if I had to do this in Perl or Python, I would write a regexp that matches the non-whitespace tokens you are trying to collect and then ask the regexp engine to return a list of all the matches. (There should be some way to do this in any regexp library.)

For example the regexp:

"(\w+|\||\&|\(|\))"

The |s ensure that a token will consist of "a string of letters" or "a symbol" but that the two will not be mixed in a single match. If you applied that regexp to your string, you would receive the list of tokens you desire above.
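
I don't know the exact Boost calls off-hand, but a rough, untested sketch of collecting every match into a list might look something like this (the query string and variable names are just for illustration):

Code:
#include <boost/regex.hpp>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::string query = "(animal cat & food meat) | (animal cow)";
    boost::regex re("\\w+|\\||&|\\(|\\)");    // a token is a word OR a single symbol

    std::vector<std::string> tokens;
    boost::sregex_iterator it(query.begin(), query.end(), re), end;
    for (; it != end; ++it)
        tokens.push_back(it->str());          // collect each whole match

    for (std::size_t i = 0; i < tokens.size(); ++i)
        std::cout << "[ " << tokens[i] << " ]";
    std::cout << "\n";
}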
 
  • #3
Cool, that has worked great so far, but what if I want to make sure that the string I am tokenizing has a matching set of parentheses?

Code:
boost::regex re("\\&|\\||\\(|\\)|\\w+");
boost::sregex_token_iterator            
    p(s.begin( ), s.end( ), re);        
boost::sregex_token_iterator end;
 
  • #4
I suggest just iterating over the tokens the regex gives you and counting the parentheses. Set a counter to 0 and look at each token: every time you see a "(", increment the counter; every time you see a ")", decrement it. If the counter ever goes below 0, return failure. If you get to the end of the tokens and the counter is more than zero, return failure. If you reach the end and neither of these things happens, you have a balanced set of parentheses.
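
Something along those lines, as a rough untested sketch (it assumes the tokens have already been collected into a std::vector<std::string>):

Code:
#include <cstddef>
#include <string>
#include <vector>

// Returns true if the "(" and ")" tokens are balanced.
bool parensBalanced(const std::vector<std::string>& tokens)
{
    int depth = 0;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        if (tokens[i] == "(") {
            ++depth;
        } else if (tokens[i] == ")") {
            --depth;
            if (depth < 0)        // a ")" appeared before its matching "("
                return false;
        }
    }
    return depth == 0;            // anything still open means failure
}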

Sometimes it is possible to try to do too much using regexes only...
 
  • #5
The classical way to make sure that parentheses match is to use a "stack" data structure. Each time you encounter an open paren, push it onto the stack; each time you encounter a closing paren, pop the stack (if the stack is already empty at that point, the parentheses don't match). The stack should then be empty by the time you reach the end of the string.
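
A minimal untested sketch of the stack version, again assuming the tokens are already in a std::vector<std::string> (the helper name is just for illustration):

Code:
#include <cstddef>
#include <stack>
#include <string>
#include <vector>

// Returns true if every "(" has a matching ")".
bool parensMatch(const std::vector<std::string>& tokens)
{
    std::stack<char> opened;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        if (tokens[i] == "(") {
            opened.push('(');                   // remember the open paren
        } else if (tokens[i] == ")") {
            if (opened.empty()) return false;   // closing paren with nothing open
            opened.pop();
        }
    }
    return opened.empty();                      // stack must be empty at the end
}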

What language are you working with? Modern languages such as Java have the string "split" function. You could first split the main string on spaces, then check each token to see if it begins/ends with ( or ). Might work.

While regexes can always be made to work, when you come back to them 6 months from now, you won't have the slightest idea what they do. So if you do use regex, document well!
 
  • #6
So if you do use regex, document well!

Thanks for the advice.

I was also wondering: how can I check that a statement is actually valid? Something yells "parser", but I don't know how to implement this. Does someone know of a simple parser for adding and subtracting implemented in C++, or an easy way to do this?

Something to determine that:

3 + 5 is a valid statement

and

3 5 is an invalid statement.
 
  • #7
FrostScYthe said:
I was also wondering: how can I check that a statement is actually valid? ... Something to determine that 3 + 5 is a valid statement and 3 5 is an invalid statement.

As you are reading in each token, you'll want to compare it to the previous token. If an operator follows another operator, simply throw an error; likewise with operands. However, this won't work for implicit multiplication.
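
A rough, untested sketch of that check for the + / - case (the helper names are just for illustration, and every token that isn't "+" or "-" is treated as an operand):

Code:
#include <cstddef>
#include <string>
#include <vector>

bool isOperator(const std::string& t)
{
    return t == "+" || t == "-";
}

// Returns true if operators and operands strictly alternate,
// starting and ending with an operand: "3 + 5" passes, "3 5" fails.
bool statementValid(const std::vector<std::string>& tokens)
{
    if (tokens.empty() || isOperator(tokens.front()) || isOperator(tokens.back()))
        return false;
    for (std::size_t i = 1; i < tokens.size(); ++i) {
        // two operators in a row, or two operands in a row, is an error
        if (isOperator(tokens[i]) == isOperator(tokens[i - 1]))
            return false;
    }
    return true;
}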
 
  • #8
One way to implement a parser is to model it as a finite state machine.
 
  • #9
FrostScYthe said:
I was also wondering: how can I check that a statement is actually valid? ... Something to determine that 3 + 5 is a valid statement and 3 5 is an invalid statement.

I believe the canonical way to create a parser in C++ is to use Yacc. (I have not used it myself, though I have used an equivalent in Java, so I am not sure how easy it is or whether there is a better alternative.) Yacc basically lets you specify a context-free grammar and builds a parse tree from input that matches that grammar.
 
  • #10
FrostScYthe said:
Hi everyone,

I'm seeking some advice on how to approach this problem: I have a query that I need to tokenize. ... I'm wondering whether this can be done with regular expressions using Boost.

If you're going to parse something using a specific grammar, you're probably better off using something already developed, like Yacc.

If you want to write one yourself, research BNF and EBNF grammars and compilers. If you want to hard-code a fairly simple tokenizer, you could probably modify an existing open-source string library and write the routines yourself in under an hour.
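
As a rough, untested illustration of the hand-written route, a recursive-descent-style check for the grammar expr ::= NUMBER (("+" | "-") NUMBER)* might look like this (helper names are just for illustration, and only unsigned integers are recognized):

Code:
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

static bool isNumber(const std::string& t)
{
    if (t.empty()) return false;
    for (std::size_t i = 0; i < t.size(); ++i)
        if (!std::isdigit(static_cast<unsigned char>(t[i])))
            return false;
    return true;
}

// Returns true if the whole token sequence matches
//   expr ::= NUMBER (("+" | "-") NUMBER)*
// so "3 + 5" is accepted and "3 5" is rejected.
bool parseExpr(const std::vector<std::string>& tokens)
{
    std::size_t pos = 0;
    if (pos >= tokens.size() || !isNumber(tokens[pos])) return false;     // expect NUMBER
    ++pos;
    while (pos < tokens.size()) {
        if (tokens[pos] != "+" && tokens[pos] != "-") return false;       // expect "+" or "-"
        ++pos;
        if (pos >= tokens.size() || !isNumber(tokens[pos])) return false; // then NUMBER
        ++pos;
    }
    return true;
}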
 

1. What is tokenizing a string?

Tokenizing a string is the process of breaking down a string of characters into smaller parts, called tokens, based on a specific set of rules or patterns.

2. Why do we need to tokenize a string?

Tokenizing a string can be useful for many reasons, such as parsing data, input validation, or text analysis. It allows us to break down a string into smaller, more manageable parts for further processing.

3. What is a regular expression and how does it relate to tokenizing a string?

A regular expression, also known as regex, is a sequence of characters that define a search pattern. It can be used to match and extract specific patterns from a string, which is useful for tokenizing a string based on a specific set of rules.

4. What is Boost and how does it help with tokenizing a string?

Boost is a collection of libraries for the C++ programming language that provides useful tools and functions for developers. It includes a library called "regex," which provides powerful tools for working with regular expressions and can greatly assist in tokenizing a string.

5. Can tokenizing a string be done with other programming languages besides C++?

Yes, tokenizing a string can be done with other programming languages such as Java, Python, and JavaScript. Each language may have its own methods and libraries for handling regular expressions and string tokenization.
