C++ function to split a string by whitespace, ignoring any whitespace in quotes

Jamin2112 · Oct 29, 2014

Could someone comment on, or perhaps help me fix/simplify/optimize/elegantify, a function I'm trying to make to split a string by whitespace into a vector of strings, discounting any whitespace in quotes?

Intended behavior:

name="someName" class="class1 class2 class3" id="someId"

------------>

name="someName" , class="class1 class2 class3" , id="someId"

not

name="someName" , class="class1 , class2 , class3" , id="someId"

I can't figure out how to post code on this new forum platform, so it would be helpful if someone could post http://codepad.org/R5uun6uG for me.

phinds · Oct 29, 2014

Code:

I don't think the code tags have changed. 
Just put the code in between "code" & "/code" 
with square brackets delimiting the tag words
as has been done with these lines

Jamin2112 · Oct 29, 2014

phinds said:

Code:

I don't think the code tags have changed.
Just put the code in between "code" & "/code"
with square brackets delimiting the tag words
as has been done with these lines

It deformats as soon as I paste it, is the problem

jtbell · Oct 29, 2014

I tried copying the code from the page you gave, and pasting it between "code" tags. It didn't work for me, either. I don't think the page uses actual spaces to format the code, so none get copied. Use the "raw code" link to export the code as a text file with spaces, then open it in Notepad or something similar, and copy and paste from that.

Jamin2112 · Oct 29, 2014

Code:

std::vector<std::string> split_no_quotes (const std::string & str)
{
   std::vector<std::string> wordVec;
   bool inQuotes = false;
   std::string::const_iterator it(str.cbegin()), offend(str.cend());
   std::string thisWord;
   while (it != offend)
   {
      switch (*it)
      {
          case ' ': // whitespace
          {
                if (inQuotes)
                {
                      thisWord.push_back(*it++);
                }
               else
               {
                if (!wordVec.empty()) wordVec.push_back(thisWord);
                while (*++it == ' ' && it != offend);
                }
                break;
         }
        case '\"':
        {
            inQuotes = !inQuotes;
            ++it;
            break;
        }
       default:
       {
            thisWord.push_back(*it++)
       }
     }
   }
   return wordVec;
}

Medicol · Oct 29, 2014

Based on your description of the problem, the code below works but it should fail in the tests at several or many places :D . You cover them up yourself. Hope this helps.

PHP:

typedef struct strData
{
   int start;
   int end;
   std::string data;
}DATA;
typedef std::vector<DATA> DATAVEC;

bool isStringEnquoted(const std::string str)
{
   int nQuote = 0;
   for (int i = 0; i < str.length(); i++)
   {
     if ('\"' == str.at(i))
     {
       nQuote ++;
     }
   }
   return nQuote == 2;
}
DATAVEC tokenizeStr(const std::string &str)
{
   DATAVEC dv;
  
   const char delim = ' ';
   DATA data = {};
   int start = 0, end = 0;
   int i = 0;
   for (; i < str.length(); i++)
   {
     if (str.at(i) == delim)
     {  
       end = i;
       std::string tmp = str.substr(start, end-start);
       if (isStringEnquoted(tmp))
       {
         data.start = start;
         data.end = end;

         data.data = tmp;
         start = end+1;
         dv.push_back(data);
       }      
     }
   }

   end = i;
   data.start = start+1;
   data.end = end;

   data.data = str.substr(start, end-start);  
   dv.push_back(data);
   return dv;
}

D H · Nov 2, 2014

Medicol said:
Based on your description of the problem, the code below works but it should fail in the tests at several or many places :D . You cover them up yourself. Hope this helps.
Code:
typedef struct strData
{
   int start;
   int end;
   std::string data;
}DATA;
// Rest elided

typedef struct? Seriously? This is C++, not C plus or minus. I had to force myself to continue reading your code. The rest confirmed my suspicion: This is not a solution.

Jamin2112 said:

I can't figure out how to post code on this new forum platform, so it would be helpful if someone could post http://codepad.org/R5uun6uG for me.

Your code (unnecessarily) uses cbegin and cend. Those are C++11 concepts. The site codepad.org has not stepped up to the plate to using C++11. The site ideone.com has.

So let's look at your code.

Jamin2112 said:

Code:

std::vector<std::string> split_no_quotes (const std::string & str)
{
   std::vector<std::string> wordVec;
   bool inQuotes = false;
   std::string::const_iterator it(str.cbegin()), offend(str.cend());
   std::string thisWord;
   while (it != offend)
   {
      switch (*it)
      {
          case ' ': // whitespace
          {
                if (inQuotes)
                {
                      thisWord.push_back(*it++);
                }
               else
               {
                if (!wordVec.empty()) wordVec.push_back(thisWord);
                while (*++it == ' ' && it != offend);
                }
                break;
         }
        case '\"':
        {
            inQuotes = !inQuotes;
            ++it;
            break;
        }
       default:
       {
            thisWord.push_back(*it++)
       }
     }
   }
   return wordVec;
}

Nit-picky issue:

You used '\"'. There's no need for that backslash. In this case, '"' works just fine. Only escape the things that need to be escaped. Sometimes the backslash is very nice, for example '\n'. Other times, the backslash is evil in that it doesn't do anything close to what you think it does (for example, '\e'). It turns out that backslash quote evaluates to quote. You got lucky. Never rely on getting lucky.
This may be nit-picky, but it's a coffee stain on the flip-down trays. It means you do the engine maintenance wrong. (Google "coffee stain on the flip-down trays" for more.) As a code reviewer, seeing this would force me to pay extra attention to your code. It tells me you don't know what you are doing.

Minor problems:

You are using cbegin and cend. There's no need for that. You declared the input string const. The begin and end iterators on a const string are const iterators. Your code would compile as C++03 (e.g. codepad.com) if you change cbegin and cend to begin and end. (But see the major issues below.)
You have a missing semicolon on the statement in your default case. A missing semicolon is a major problem to a compiler. I classified this as a minor problem because (a) it's an easy mistake to make, and (b) it's an easy mistake to fix.

Major problems:

You are using cbegin and cend. That means you are using C++11. Step up to the plate and use the C++11 range-based for loop.
You are using a switch statement. This puts the cart before the horse. You have a finite state machine. Treat it as such. You should be using an if/else if/.../else structure.

Massive (well beyond major) problems:

You aren't adding the double quotes to the captured word.
You aren't adding the captured words to the vector.

In pseudocode, here's how I would write your function:

Code:

    // For each character in the input string :
        // If the character is '"' :
            // Add the character to the end of the current word
            // Toggle the "in quote" indicator.

        // Otherwise, if the character is a true "space" (spaces inside a quoted string are not "spaces") :
            // Add the character to the end of the current word.

        // Otherwise we have a "true" space:
            // Add the current word (if it's not empty) to the output vector and clear the current word.

    // Finally, add the current word (if it's not empty) to the output vector.

I'll leave it up to you to translate the above to C++11 code.

Jamin2112 · Nov 3, 2014

D H said:
typedef struct? Seriously? This is C++, not C plus or minus. I had to force myself to continue reading your code. The rest confirmed my suspicion: This is not a solution.
Your code (unnecessarily) uses cbegin and cend. Those are C++11 concepts. The site codepad.org has not stepped up to the plate to using C++11. The site ideone.com has.

So let's look at your code.
Nit-picky issue:

You used '\"'. There's no need for that backslash. In this case, '"' works just fine. Only escape the things that need to be escaped. Sometimes the backslash is very nice, for example '\n'. Other times, the backslash is evil in that it doesn't do anything close to what you think it does (for example, '\e'). It turns out that backslash quote evaluates to quote. You got lucky. Never rely on getting lucky.

This may be nit-picky, but it's a coffee stain on the flip-down trays. It means you do the engine maintenance wrong. (Google "coffee stain on the flip-down trays" for more.) As a code reviewer, seeing this would force me to pay extra attention to your code. It tells me you don't know what you are doing.

Minor problems:

You are using cbegin and cend. There's no need for that. You declared the input string const. The begin and end iterators on a const string are const iterators. Your code would compile as C++03 (e.g. codepad.com) if you change cbegin and cend to begin and end. (But see the major issues below.)

You have a missing semicolon on the statement in your default case. A missing semicolon is a major problem to a compiler. I classified this as a minor problem because (a) it's an easy mistake to make, and (b) it's an easy mistake to fix.

Major problems:

You are using cbegin and cend. That means you are using C++11. Step up to the plate and use the C++11 range-based for loop.

You are using a switch statement. This puts the cart before the horse. You have a finite state machine. Treat it as such. You should be using an if/else if/.../else structure.

Massive (well beyond major) problems:

You aren't adding the double quotes to the captured word.

You aren't adding the captured words to the vector.

In pseudocode, here's how I would write your function:
Code:
    // For each character in the input string :
        // If the character is '"' :
            // Add the character to the end of the current word
            // Toggle the "in quote" indicator.

        // Otherwise, if the character is a true "space" (spaces inside a quoted string are not "spaces") :
            // Add the character to the end of the current word.

        // Otherwise we have a "true" space:
            // Add the current word (if it's not empty) to the output vector and clear the current word.

    // Finally, add the current word (if it's not empty) to the output vector.
I'll leave it up to you to translate the above to C++11 code.

Thank you for your reply, DH.

A few things:

I'm trying to become reasonably competent at C++ < 11 before using C++ >= 11.
If cbegin() and cend() are C++ 11 functions, then I used them by accident, because I was using cplusplus.com as a reference and it classifies those functions as C++ 98 (http://www.cplusplus.com/reference/vector/vector/)
[*]Missing semi-colon was either a type-o or was something lost in translation as I had to manually format my code onto here
[*]Yesterday I redid that function and had it pass a few tests. Here's what it looks like now:

I still do not understand why a switch statement is inappropriate here. Most programming learning materials I've reading (including cplusplus.com) teach that switch statements are nothing more than better looking if-else statements.
[*]I always like your answers to programming threads.

jtbell · Nov 3, 2014

D H said:

// Otherwise, if the character is a true "space"

Should this read 'is not' instead of 'is'?

Jamin2112 said:

If cbegin() and cend() are C++ 11 functions, then I used them by accident, because I was using cplusplus.com as a reference and it classifies those functions as C++ 98 (http://www.cplusplus.com/reference/vector/vector/)

The table of member functions on the page that you linked to shows cbegin and cend with icons that say 'C++11'.

Jamin2112 · Nov 3, 2014

jtbell said:

Should this read 'is not' instead of 'is'?
The table of member functions on the page that you linked to shows cbegin and cend with icons that say 'C++11'.

That was sort of confusing because above there are separate tabs for C++98 and C++11 and I was on the C++98 tab. I know I don't need to use cbegin(), but I'm trying to get into the C++ spirit of using the most specific variable type possible and avoiding implicit casts.

jtbell · Nov 3, 2014

Jamin2112 said:

there are separate tabs for C++98 and C++11

It looks like those tabs apply only to the table of member types which is directly underneath them. It would be less confusing if the tabs actually covered the rest of the page also.

jtbell · Nov 3, 2014

Jamin2112 said:

Intended behavior:

name="someName" class="class1 class2 class3" id="someId"

------------>

name="someName" , class="class1 class2 class3" , id="someId"

Now to open up another can of worms. :p Note that attribute name-value pairs in HTML can contain extra whitespace e.g.

name = "someName"

See e.g. http://stackoverflow.com/questions/7064095/spaces-between-html-attributes-and-values

D H · Nov 3, 2014

jtbell said:

Should this read 'is not' instead of 'is'?

You're right. That was a typo. Jamin obviously saw the intent since the consequence is to add the character to the current word.

Jamin2112 said:

I'm trying to become reasonably competent at C++ < 11 before using C++ >= 11.

C++11 is a better (and in some ways, considerably different) language than is C++98/03. Learn C++98/03 first and you will have taught yourself some bad habits with regard to C++11. BTW, C++11 is now in the same group as C++98, C90, and FORTRAN IV. It is an obsolete standard. The current standard is C++14.

Personal opinion: The only reason to learn C++98/03 nowadays is because a number of employers have not yet made the switch to C++11.

Yesterday I redid that function and had it pass a few tests. Here's what it looks like now:

Here's how I would write that version of your function:

Code:

/**
* @brief Collect words in the range [@c first, @c last).
*
* @details Words are separated by one or more unquoted spaces.
*   Spaces between double quotes are not separators.
*   They are instead part of a word.
*
* @return Collected words as a vector of strings.
*
* @warning @c first and @c last must form a valid range.
*   Behavior is undefined if this is not the case.
*
* @warning Quotes are assumed to be paired.
*   A warning is issued if the range contains an odd number of quotes.
*/
std::vector<std::string>
split_no_quotes (
   const std::string::const_iterator first, //!< Start of range.
   const std::string::const_iterator last)  //!< End of range.
{
    std::vector<std::string> result;
    bool in_quotes = false;
    std::string curr_word;

    // Walk over the range, collecting words along the way.
    for (std::string::const_iterator it = first; it < last; ++it) {
        char c = *it;

        // Non-separator character: Add it to the current word, and toggle flag if needed.
        if (in_quotes || (c != ' ')) {
            curr_word.push_back (c);
            in_quotes = (c == '"') != in_quotes;  // OK. I'll admit this is a bit hackish.
        }

        // First unquoted space after a word: Add word to result and reset word.
        else if (! curr_word.empty()) {
            result.push_back (curr_word);
            curr_word.clear ();
        }

        // No else: This represents consecutive unquotes spaces. There's nothing to do.

    }

    // The last word has not been added to the result. Do so.
    if (! curr_word.empty()) {
        result.push_back (curr_word);
    }

    // Check for and report a violation of the paired quotes assumption.
    if (in_quotes) {
        std::cerr << "Warning: In split_no_quotes():\n"
                  << "Input string contains non-terminated quote.\n"
                  << "Input string: " << std::string(first, last) << std::endl;
    }

    return result;
}

The differences between your version and mine:

I'm using doxygen style comments.
That is quickly becoming the de facto standard for documenting C++ code across a wide number of different software development projects.
I'm checking for unmatched quotes.
It's a good practice to make explicit the assumptions you are making and then think of how a user of your function can violate those assumptions.
I'm passing the range as two parameters as opposed to one parameter as a
std :: pair.
(Aside: without the spaces around the double colon, the new forum software keeps changing my writing to smileys. Ignore the spaces.)
There are a number of reasons for this change:
- You probably thought passing a single parameter is more efficient. It's not.
- Passing first and last as separate parameters is the canonical way to represent a range.
- Using separate parameters enables automatic conversion from std :: string::iterator to std :: string::const_iterator. Constructing a pair of const_iterators from a non-const std:: string is a pain in the rear. Constructing a const_iterator argument from a non-const string is easy. Just use the non-const iterator. The conversion to const_iterator is automatic.
[*]I'm keeping the provided iterators intact.
The primary reason is so I can report the input string in the case of a violation of the paired quotes assumption.[*]I'm using an if/else if ... construct rather than a switch statement.
That's a bit of personal preference. I prefer the if/else construct in this case as being a bit clearer with regard to intent -- even if a switch my hackish in_quotes = (c == '"') != in_quotes to an if statement.[*]Naming conventions.
This is pure personal preference (unless some organization has rules that mandate a style). Like many, I have followed some variation of the StudLyCaps convention. My personal preference has moved from that to staying as far away as possible from the Java StudLyCaps convention. Note that the C++ library never uses StudLyCaps. They use underscores to separate words. Moreover, I saw a marked improvement in readability when some parts of python started a switch from StudLyCaps to all_lower_case convention.

I still do not understand why a switch statement is inappropriate here. Most programming learning materials I've reading (including cplusplus.com) teach that switch statements are nothing more than better looking if-else statements.

If every if statement is of the form if (var == some_number)and every test is against the same variable, then I would agree that a switch statement is "better-looking". On the other hand, if the different if tests are against different variables, or use boolean logic, or aren't integers, you can't do that in a switch statement.

Jamin2112 · Nov 4, 2014

jtbell said:

Now to open up another can of worms. :p Note that attribute name-value pairs in HTML can contain extra whitespace e.g.

name = "someName"

See e.g. http://stackoverflow.com/questions/7064095/spaces-between-html-attributes-and-values

F---

I'm going to have to re-do everything

Jamin2112 · Nov 5, 2014

So my new way of parsing an HTML tag is a monstrosity. However, it takes care of the fact that there can be whitespace, i.e. something like <p id = " someId" >Yo</p> should have the id parsed.

boolHtmlProcessor::_processTag(std::string::const_iterator it1, const std::string::const_iterator it2, node & nd)
{

/*

[it1, it2): iterators for the range of the string

nd: node in which classes and ids of the tage are stored

Returns true or false depending on whether a problem was encountered during the processing.

*/

/* Get the element type, at the beginning of the tag: */

std::string elementType("");

while (_elementTypeChars.find(*it1) != std::string::npos && it1 != it2) elementType.push_back(*it1++);

if (elementType.empty()) return false;

nd.element_type = elementType;

/* Get any attributes: */

std::vector<std::pair<std::string, std::string>> attributes;

const std::pair<std::string, std::string> thisAttribute;

while (_hasNextAttribute(it1, it2, thisAttribute)) attributes.push_back(thisAttribute);

if (!_processAttributes(attributes, nd.class_list, nd.iden)) return false;

returntrue;

}

where the function _getNextAttribute is

bool HtmlProcessor::_hasNextAttribute(std::string::iterator & it1, const std::string::iterator & it2, std::pair<std::string, std::string> & attrHolder)
{

/* Parses the first HTML attributes in the iterator range [it1, it2), adding them to attrHolder; eg.

class="myClass1 myClass2" id="myId" onsubmit = "myFunction()"

---------- _hasNextAttribute -------->

attrHolder = (class, myClass1 myClass2)

When the function terminates, it1 will be the iterator to the last character parsed, will be equal to

it2 if no characters were parsed.

*/

while (*it1 == ' ' && it1 != it2) ++it1; /* Skip through left whitespace padding */

if (it1 == it2) returntrue; /* No attributes in tag; only whitespace after the element name. Such is valid HTML. */ std::string attr(""); /* String to hold the attribute type, expected after any whitespace. Should be non-empty. */

while (_attributeTypeChars.find(*it1) == std::string::npos && it1 != it2) attr.push_back(*it1++);

if (attr.empty()) return false; while (*it1 == ' ' && it1 != it2) ++it1; /* Skip through whitespace padding between the attribute name and equals sign */

if (*it1 != '=' || it1++ == it2) returnfalse; /* Current character should be an equals sign */

while (*it1 == ' ' && it1 != it2) ++it1; /* Skip through whitespace between the equals sign and quotation mark */

if (*it1 != '"' || it1++ == it2) returnfalse; /* Current character should be a quotation mark */

std::string val(""); /* String to hold the attribute's value, exepcted after the first quotation mark. */

while (_attributeValChars.find(*it1) != std::string::npos) val.push_back(*it1++);

if (attr.empty()) return false;

if (*it1 != '"' || it1++ != it2) returnfalse; /* Current character should be a quotation mark */

/* If we're here, it1 should point to the character after the quotation mark that closes off the attribute's value */

attrHolder = std::make_pair(attr, val);
}

and the function _processAttributes is

bool HtmlProcessor::_processAttributes(const std::vector<std::pair<std::string, std::string>> & attrs, std::set<std::string> &classesTarget, std::string & identifierTarget)

{

for (std::vector<std::pair<std::string, std::string>>::const_iterator it(attrs.cbegin()), offend(attrs.end()); it != offend; ++it)

{

std::string thisAttr(it->first), thisVal(it->second);

std::transform(thisAttr.begin(), thisAttr.end(), thisAttr.begin(), ::tolower);

if (thisAttr == "id")

identifierTarget = thisVal;

else if (thisAttr == "class")

{

/* Since the value for a class attribute can be several classes separated by whitespace,

add all of them to set of classes for the node.

*/

std::stringstream ss(thisAttr);

std::string thisClass;

while (std::getline(ss, thisClass, ' ')) classesTarget.insert(thisClass);

}

}

returntrue;

}

and the related data structures are

const std::stringHtmlProcessor::_elementTypeChars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

const std::stringHtmlProcessor::_attributeTypeChars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ-_";

const std::stringHtmlProcessor::_attributeValChars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ-_ ";

struct node

{

std::string element_type;

std::set<std::string> class_list;

std::string iden;

std::vector<node*> children;

};

That node structure is how I'm building my document tree.

D H · Nov 5, 2014

That's pretty much unreadable, Jamin. Please put your blocks of code in code blocks rather than changing the font.

C++ function to split a string by whitespace, ignoring any whitespace in quotes

Is A.I. more than the sum of its parts?

AI vs. Humans as Processors in an Environment

Sweetspot of data compression

Other than just FizzBuzz to test programmer candidates

How to show RS(U+TRS)* is equivalent to (R+SUT)SU?

Insights Revisiting the Velocity-Time Function

Insights Remote Operated Gate Control System

Insights AI Enriched Problem Solving

Insights Thinking Outside The Box Versus Knowing What’s In The Box

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect