Parsing and formatting text is a large, open-ended topic. So far in this chapter, we’ve looked at only primitive operations on strings—creation, basic editing, searching, and turning simple values into strings. Now we’d like to move on to more structured forms of text. Java has a rich set of APIs for parsing and printing formatted strings, including numbers, dates, times, and currency values. We’ll cover most of these topics in this chapter, but we’ll wait to discuss date and time formatting until Chapter 11.
We’ll start with parsing, which means reading primitive numbers and values from strings and chopping long strings into tokens. Then we’ll go the other way and look at formatting strings and the java.text package. We’ll revisit the topic of internationalization to see how Java can localize parsing and formatting of text, numbers, and dates for particular locales. Finally, we’ll take a detailed look at regular expressions, the most powerful text-parsing tool Java offers. Regular expressions let you define your own patterns of arbitrary complexity, search for them, and parse them from text.
We should mention that you’re going to see a great deal of overlap between the new formatting and parsing APIs (printf and Scanner) introduced in Java 5.0 and the older APIs of the java.text package. The new APIs effectively replace much of the old ones and in some ways are easier to use. Nonetheless, it’s good to know about both because so much existing code uses the older APIs.
In Java, numbers and Booleans are primitive types, not objects. But for each primitive type, Java also defines a primitive wrapper class. Specifically, the java.lang package includes the following classes: Byte, Short, Integer, Long, Float, Double, and Boolean. We talked about these in Chapter 1, but we bring them up now because these classes hold static utility methods that know how to parse their respective types from strings. Each of these primitive wrapper classes has a static “parse” method that reads a String and returns the corresponding primitive type. For example:
    byte b = Byte.parseByte("16");
    int n = Integer.parseInt("42");
    long l = Long.parseLong("99999999999");
    float f = Float.parseFloat("4.2");
    double d = Double.parseDouble("99.99999999");
    boolean flag = Boolean.parseBoolean("true");
    // Prior to Java 5.0 use:
    // boolean flag = new Boolean("true").booleanValue();
Alternatively, java.util.Scanner provides a single API not only for parsing individual primitive types from strings, but also for reading them from a stream of tokens. This example shows how to use it in place of the preceding wrapper classes:
    byte b = new Scanner("16").nextByte();
    int n = new Scanner("42").nextInt();
    long l = new Scanner("99999999999").nextLong();
    float f = new Scanner("4.2").nextFloat();
    double d = new Scanner("99.99999999").nextDouble();
    boolean flag = new Scanner("true").nextBoolean();
We’ll see Scanner used to parse multiple values from a String or stream when we discuss tokenizing text later in this chapter.
It’s easy to parse integer type numbers (byte, short, int, long) in alternate numeric bases. You can use the parse methods of the primitive wrapper classes by simply specifying the base as a second parameter:
    long l = Long.parseLong("CAFEBABE", 16);  // l = 3405691582
    byte b = Byte.parseByte("12", 8);         // b = 10
The integer-parsing methods of the Java 5.0 Scanner class described earlier (nextByte(), nextShort(), nextInt(), and nextLong()) also accept a base as an optional argument:
    long l = new Scanner("CAFEBABE").nextLong(16);  // l = 3405691582
    byte b = new Scanner("12").nextByte(8);         // b = 10
You can go the other way and convert a long or integer value to a string value in a specified base using special static toString() methods of the Integer and Long classes:
    String s = Long.toString(3405691582L, 16);  // s = "cafebabe"
For convenience, each class also has a static toHexString() method for working with base 16:
    String s = Integer.toHexString(255).toUpperCase();  // s = "FF"
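The two directions described above can be combined into a small base converter. The following is only a sketch: the class name RadixDemo and the helper convert() are hypothetical names chosen for illustration, and it relies solely on Long.parseLong(String, int), Long.toString(long, int), and Integer.toBinaryString(int).

```java
class RadixDemo {
    // Hypothetical helper: parse the digits in one base, render the value in another.
    static String convert(String digits, int fromBase, int toBase) {
        long value = Long.parseLong(digits, fromBase);
        return Long.toString(value, toBase);
    }

    public static void main(String[] args) {
        System.out.println(convert("CAFEBABE", 16, 10)); // 3405691582
        System.out.println(convert("255", 10, 2));       // 11111111
        System.out.println(Integer.toBinaryString(10));  // 1010
    }
}
```

Note that Long.toString() produces lowercase letters for bases above 10, so converting to base 16 yields "cafebabe" rather than "CAFEBABE".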
The preceding wrapper class parser methods handle the case of numbers formatted using only the simplest English conventions with no frills. If these parse methods do not understand the string, either because it’s simply not a valid number or because the number is formatted in the convention of another language, they throw a NumberFormatException:
    // Italian formatting
    double d = Double.parseDouble("1.234,56");  // NumberFormatException
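Because NumberFormatException is an unchecked exception, code that parses untrusted input usually needs to catch it explicitly. One possible way to do that is sketched below; the class SafeParse and the method tryParseDouble() are hypothetical names, and returning an OptionalDouble is just one design choice among several.

```java
import java.util.OptionalDouble;

class SafeParse {
    // Hypothetical helper: returns an empty result instead of throwing on bad input.
    // A null argument is not handled here and would still raise NullPointerException.
    static OptionalDouble tryParseDouble(String text) {
        try {
            return OptionalDouble.of(Double.parseDouble(text));
        } catch (NumberFormatException e) {
            return OptionalDouble.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParseDouble("4.2").isPresent());      // true
        System.out.println(tryParseDouble("1.234,56").isPresent()); // false
    }
}
```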
The Scanner API is smarter and can use Locales to parse numbers in specific languages with more elaborate conventions. For example, when the default locale uses the comma as a grouping separator (as U.S. English does), the Scanner can handle comma-formatted numbers:
    int n = new Scanner("99,999,999").nextInt();
You can specify a Locale other than the default with the useLocale() method. Let’s parse that value in Italian now:
    double d = new Scanner("1.234,56").useLocale(Locale.ITALIAN).nextDouble();
If the Scanner cannot parse a string, it throws a runtime InputMismatchException:
    double d = new Scanner("garbage").nextDouble();  // InputMismatchException
Prior to Java 5.0, this kind of parsing was accomplished using the java.text package with the NumberFormat class. The classes of the java.text package also allow you to parse additional types, such as dates, times, and localized currency values, that aren’t handled by the Scanner. We’ll look at these later in this chapter.
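For comparison, the older java.text approach to the Italian-formatted number might look roughly like the following. This is a minimal sketch, not the book’s own listing: the class LegacyNumberParse and the method parseLocalized() are hypothetical names, while NumberFormat.getInstance(Locale) and NumberFormat.parse(String) are the standard calls.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

class LegacyNumberParse {
    // Hypothetical helper: parse a number written in the conventions of the given locale.
    static double parseLocalized(String text, Locale locale) throws ParseException {
        NumberFormat format = NumberFormat.getInstance(locale);
        // parse() returns a java.lang.Number; it throws the checked ParseException
        // when the beginning of the text cannot be read as a number.
        return format.parse(text).doubleValue();
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(parseLocalized("1.234,56", Locale.ITALIAN)); // 1234.56
        System.out.println(parseLocalized("1,234.56", Locale.US));      // 1234.56
    }
}
```

Unlike the Scanner, NumberFormat.parse() reports failure with a checked exception, and it may stop at the first character it cannot interpret rather than rejecting the whole string.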
A common programming task involves parsing a string of text into words or “tokens” that are separated by some set of delimiter characters, such as spaces or commas. Consider the following two examples. The first contains words separated by single spaces. The second, more realistic problem involves comma-delimited fields.
    Now is the time for all good men (and women)...

    Check Number, Description, Amount
    4231, Java Programming, 1000.00
Java has several (unfortunately overlapping) APIs for handling situations like this. The most powerful and useful are the String split() and Scanner APIs. Both utilize regular expressions to allow you to break the string on arbitrary patterns. We haven’t talked about regular expressions yet, but in order to show you how this works we’ll just give you the necessary magic and explain in detail later in this chapter. We’ll also mention a legacy utility, java.util.StringTokenizer, which uses simple character sets to split a string. StringTokenizer is not as powerful, but doesn’t require an understanding of regular expressions.
The String split() method accepts a regular expression that describes a delimiter and uses it to chop the string into an array of Strings:
    String text = "Now is the time for all good men";
    String[] words = text.split("\\s");
    // words = "Now", "is", "the", "time", ...

    String row = "4231, Java Programming, 1000.00";
    String[] fields = row.split("\\s*,\\s*");
    // fields = "4231", "Java Programming", "1000.00"
In the first example, we used the regular expression \\s, which matches a single whitespace character (such as a space, tab, newline, or carriage return). The split() method returned an array of eight strings. In the second example, we used a more complicated regular expression, \\s*,\\s*, which matches a comma surrounded by any number of contiguous whitespace characters (possibly zero). This reduced our text to three nice, tidy fields.
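One detail worth seeing in action is that \\s matches exactly one whitespace character, so runs of spaces produce empty strings in the result. The sketch below contrasts it with \\s+, which treats a whole run as one delimiter. The class SplitDemo and its two helper methods are hypothetical names used only for this illustration.

```java
import java.util.Arrays;

class SplitDemo {
    // Splits on each single whitespace character; adjacent spaces yield empty tokens.
    static String[] splitSingle(String text) {
        return text.split("\\s");
    }

    // Splits on runs of one or more whitespace characters.
    static String[] splitRuns(String text) {
        return text.split("\\s+");
    }

    public static void main(String[] args) {
        String text = "Now  is the time"; // note the two spaces after "Now"
        System.out.println(Arrays.toString(splitSingle(text))); // [Now, , is, the, time]
        System.out.println(Arrays.toString(splitRuns(text)));   // [Now, is, the, time]
    }
}
```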
With the new Scanner API, we could go a step further and parse the numbers of our second example as we extract them:
    String text = "4231, Java Programming, 1000.00";
    Scanner scanner = new Scanner(text).useDelimiter("\\s*,\\s*");
    int checkNumber = scanner.nextInt();  // 4231
    String description = scanner.next();  // "Java Programming"
    float amount = scanner.nextFloat();   // 1000.00
Here, we’ve told the Scanner to use our regular expression as the delimiter and then called it repeatedly to parse each field as its corresponding type. The Scanner is convenient because it can read not only from Strings but directly from stream sources, such as InputStreams, Files, and Channels:
    Scanner fileScanner = new Scanner(new File("spreadsheet.csv"));
    fileScanner.useDelimiter("\\s*,\\s*");
    // ...
Another thing that you can do with the Scanner is to look ahead with the “hasNext” methods to see if another item is coming:
    while (scanner.hasNextInt()) {
        int n = scanner.nextInt();
        ...
    }
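To make the look-ahead loop concrete, here is a small self-contained sketch that sums the integers at the start of a string and stops at the first token that is not an integer. The class ScannerSum and the method sumLeadingInts() are hypothetical names, and the input string is made up for the example.

```java
import java.util.Scanner;

class ScannerSum {
    // Hypothetical helper: adds up integer tokens until a non-integer token (or the end) is reached.
    static int sumLeadingInts(String text) {
        int sum = 0;
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNextInt()) {
                sum += scanner.nextInt();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // The loop ends at "done", so the trailing 40 is never read.
        System.out.println(sumLeadingInts("10 20 30 done 40")); // 60
    }
}
```

Because hasNextInt() only peeks at the next token, the token that stopped the loop is still available afterward, for example through next().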
Even though the StringTokenizer class that we mentioned is now a legacy item, it’s good to know that it’s there because it’s been around since the beginning of Java and is used in a lot of code. StringTokenizer allows you to specify a delimiter as a set of characters and matches any number or combination of those characters as a delimiter between tokens. The following snippet reads the words of our first example:
    String text = "Now is the time for all good men (and women)...";
    StringTokenizer st = new StringTokenizer(text);
    while (st.hasMoreTokens()) {
        String word = st.nextToken();
        ...
    }
We invoke the hasMoreTokens() and nextToken() methods to loop over the words of the text. By default, the StringTokenizer class uses standard whitespace characters (space, tab, newline, carriage return, and form feed) as delimiters. You can also specify your own set of delimiter characters in the StringTokenizer constructor. Any contiguous combination of the specified characters that appears in the target string is skipped between tokens:
    String text = "4231, Java Programming, 1000.00";
    StringTokenizer st = new StringTokenizer(text, ",");
    while (st.hasMoreTokens()) {
        String word = st.nextToken();
        // word = "4231", " Java Programming", " 1000.00"
    }
This isn’t as clean as our regular expression example. Here we used a comma as the delimiter, so we get extra leading whitespace in our description and amount fields. If we had added space to our delimiter string, the StringTokenizer would have broken our description into two words, “Java” and “Programming,” which is not what we wanted. A solution here would be to use trim() to remove the leading and trailing space on each element.
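The trim() fix might look like the following sketch. The class TrimTokens and the method fields() are hypothetical names introduced for this example; collecting the tokens into a List is simply a convenient way to show the result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TrimTokens {
    // Hypothetical helper: split on commas, then strip surrounding whitespace from each token.
    static List<String> fields(String text) {
        List<String> result = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, ",");
        while (st.hasMoreTokens()) {
            result.add(st.nextToken().trim());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fields("4231, Java Programming, 1000.00"));
        // [4231, Java Programming, 1000.00]
    }
}
```

One caveat with this approach: because StringTokenizer skips contiguous delimiters, an empty field such as the middle one in "a,,b" is silently dropped, which split() with the same delimiter would preserve.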