Use the find_first_of and find_first_not_of member functions on basic_string to iterate through the string and alternately locate the next tokens and non-tokens. Example 4-12 presents a simple StringTokenizer class that does just that.
Example 4-12. A string tokenizer
#include <cstddef>
#include <string>
#include <iostream>

using namespace std;

// String tokenizer class.
class StringTokenizer {

public:

   StringTokenizer(const string& s, const char* delim = NULL) :
      str_(s), count_(-1), begin_(0), end_(0) {

      if (!delim)
         delim_ = " \f\n\r\t\v";  // default to whitespace
      else
         delim_ = delim;

      // Point to the first token
      begin_ = str_.find_first_not_of(delim_);
      end_ = str_.find_first_of(delim_, begin_);
   }

   size_t countTokens() {
      if (count_ >= 0) // return if we've already counted
         return(count_);

      string::size_type n = 0;
      string::size_type i = 0;

      for (;;) {
         // advance to the first token
         if ((i = str_.find_first_not_of(delim_, i)) == string::npos)
            break;
         // advance to the next delimiter
         i = str_.find_first_of(delim_, i+1);
         n++;
         if (i == string::npos)
            break;
      }
      return (count_ = n);
   }

   bool hasMoreTokens() {return(begin_ != end_);}

   void nextToken(string& s) {
      if (begin_ != string::npos && end_ != string::npos) {
         // Token in the middle of the string: copy it, then find the next one
         s = str_.substr(begin_, end_-begin_);
         begin_ = str_.find_first_not_of(delim_, end_);
         end_ = str_.find_first_of(delim_, begin_);
      }
      else if (begin_ != string::npos && end_ == string::npos) {
         // Last token runs to the end of the string
         s = str_.substr(begin_, str_.length()-begin_);
         begin_ = str_.find_first_not_of(delim_, end_);
      }
   }

private:
   StringTokenizer() {};
   string delim_;
   string str_;
   int count_;
   // size_type rather than int, so comparisons with string::npos are exact
   string::size_type begin_;
   string::size_type end_;
};

int main() {
   string s = " razzle dazzle giddyup ";
   string tmp;

   StringTokenizer st(s);

   cout << "there are " << st.countTokens() << " tokens.\n";
   while (st.hasMoreTokens()) {
      st.nextToken(tmp);
      cout << "token = " << tmp << '\n';
   }
}
Splitting a string with well-defined structure, as in Example 4-10, is nice, but it’s not always that easy. Suppose that you have to tokenize a string instead of simply breaking it into pieces based on a single delimiter. The most common incarnation of this is tokenizing while ignoring whitespace. Example 4-12 gives an implementation of a StringTokenizer class for C++ (like the standard Java class of the same name) that accepts delimiter characters, but defaults to whitespace.
The most important lines in StringTokenizer use basic_string’s find_first_of and find_first_not_of member functions. I describe how they work and when to use them in Recipe 4.9. Example 4-12 produces this output:
there are 3 tokens.
token = razzle
token = dazzle
token = giddyup
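To make the alternation between the two searches concrete, here is a minimal sketch of a single step. The helper name nextTokenRange is hypothetical and is not part of Example 4-12; it only assumes the standard find_first_not_of and find_first_of member functions.

```cpp
#include <string>
#include <utility>

// Hypothetical helper (not part of Example 4-12): returns the half-open
// range [first, last) of the next token at or after pos, or a pair of
// npos values when no token remains.
std::pair<std::string::size_type, std::string::size_type>
nextTokenRange(const std::string& s,
               const std::string& delims,
               std::string::size_type pos = 0) {
    // Skip leading delimiters: find the first character NOT in delims.
    std::string::size_type first = s.find_first_not_of(delims, pos);
    if (first == std::string::npos)
        return {std::string::npos, std::string::npos};

    // The token ends at the next delimiter, or at the end of the string.
    std::string::size_type last = s.find_first_of(delims, first);
    if (last == std::string::npos)
        last = s.size();
    return {first, last};
}
```

Calling the helper repeatedly, each time passing the previous range's end as pos, walks the string one token at a time, which is essentially what nextToken does with its begin_ and end_ members.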
StringTokenizer is a more flexible form of the split function in Example 4-10. It maintains state, so you can advance from one token to the next instead of parsing the input string all at once. You can also count the number of tokens.
There are a couple of improvements you can make on StringTokenizer. First, for simplicity, I wrote StringTokenizer to only work with strings, or in other words, narrow character strings. If you want the same class to work for both narrow and wide characters, you can parameterize the character type as I have done in previous recipes. The other thing you may want to do is extend StringTokenizer to allow more friendly interaction with sequences and more extensibility. You can always write all of this yourself, or you can use an existing tokenizer class instead. The Boost project has a class named tokenizer that does this. See www.boost.org for more details.
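As a rough illustration of both improvements, the sketch below parameterizes the character type and returns the tokens in a standard sequence. It is a free function rather than a rewrite of StringTokenizer, the name tokenize is hypothetical, and it assumes a C++11 compiler; treat it as one possible starting point, not as the Boost tokenizer interface.

```cpp
#include <string>
#include <vector>

// Hypothetical generalization of the search loop in Example 4-12: CharT is
// the character type, so the same code serves std::string and std::wstring.
template <typename CharT>
std::vector<std::basic_string<CharT>>
tokenize(const std::basic_string<CharT>& s,
         const std::basic_string<CharT>& delims) {
    typedef typename std::basic_string<CharT>::size_type size_type;
    const size_type npos = std::basic_string<CharT>::npos;

    std::vector<std::basic_string<CharT>> tokens;
    size_type begin = s.find_first_not_of(delims);
    while (begin != npos) {
        size_type end = s.find_first_of(delims, begin);
        // With end == npos, substr returns everything from begin onward.
        tokens.push_back(s.substr(begin, end == npos ? npos : end - begin));
        // Skip the delimiters that follow this token, if any.
        begin = (end == npos) ? npos : s.find_first_not_of(delims, end);
    }
    return tokens;
}
```

Both arguments must be basic_string objects of the same character type for template deduction to succeed, for example tokenize(std::wstring(L"a,b;;c"), std::wstring(L",;")). Returning a vector gives up the one-token-at-a-time behavior of StringTokenizer in exchange for easy use with standard algorithms.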