Chapter 4. Head Aches
Stand on your own head for a change / Give me some skin to call my own
They Might Be Giants, “Stand on Your Own Head” (1988)
The challenge in this chapter is to implement the head program, which will print the first few lines or bytes of one or more files.
This is a good way to peek at the contents of a regular text file and is often a much better choice than cat.
When faced with a directory of something like output files from some process, using head can help you quickly scan for potential problems.
It’s particularly useful when dealing with extremely large files, as it will only read the first few bytes or lines of a file (as opposed to cat, which will always read the entire file).
In this chapter, you will learn how to do the following:
- Create optional command-line arguments that accept numeric values
- Convert between types using as
- Use take on an iterator or a filehandle
- Preserve line endings while reading a filehandle
- Read bytes versus characters from a filehandle
- Use the turbofish operator
How head Works
I’ll start with an overview of head so you know what’s expected of your program.
There are many implementations of the original AT&T Unix operating system, such as Berkeley Software Distribution (BSD), SunOS/Solaris, HP-UX, and Linux.
Most of these operating systems have some version of a head program that will default to showing the first 10 lines of 1 or more files.
Most will probably have options -n to control the number of lines shown and -c to instead show some number of bytes.
The BSD version has only these two options, which I can see via man head:
HEAD(1)                   BSD General Commands Manual                  HEAD(1)

NAME
     head -- display first lines of a file

SYNOPSIS
     head [-n count | -c bytes] [file ...]

DESCRIPTION
     This filter displays the first count lines or bytes of each of the speci-
     fied files, or of the standard input if no files are specified.  If count
     is omitted it defaults to 10.

     If more than a single file is specified, each file is preceded by a
     header consisting of the string ''==> XXX <=='' where ''XXX'' is the name
     of the file.

EXIT STATUS
     The head utility exits 0 on success, and >0 if an error occurs.

SEE ALSO
     tail(1)

HISTORY
     The head command appeared in PWB UNIX.

BSD                              June 6, 1993                              BSD
With the GNU version, I can run head --help to read the usage:
Usage: head [OPTION]... [FILE]...
Print the first 10 lines of each FILE to standard output.
With more than one FILE, precede each with a header giving the file name.
With no FILE, or when FILE is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -c, --bytes=[-]K         print the first K bytes of each file;
                             with the leading '-', print all but the last
                             K bytes of each file
  -n, --lines=[-]K         print the first K lines instead of the first 10;
                             with the leading '-', print all but the last
                             K lines of each file
  -q, --quiet, --silent    never print headers giving file names
  -v, --verbose            always print headers giving file names
      --help     display this help and exit
      --version  output version information and exit

K may have a multiplier suffix:
b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024,
GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.
Note that the GNU version accepts negative numbers for -n and -c as well as suffixes like K, M, etc., none of which the challenge program will implement.
In both the BSD and GNU versions, the files are optional positional arguments, and the program reads STDIN by default or when a filename is a dash.
To demonstrate how head works, I’ll use the files found in 04_headr/tests/inputs:
- empty.txt: an empty file
- one.txt: a file with one line of text
- two.txt: a file with two lines of text
- three.txt: a file with three lines of text and Windows line endings
- twelve.txt: a file with 12 lines of text
Given an empty file, there is no output, which you can verify with head tests/inputs/empty.txt.
As mentioned, head will print the first 10 lines of a file by default:
$ head tests/inputs/twelve.txt
one
two
three
four
five
six
seven
eight
nine
ten
The -n option allows you to control the number of lines that are shown.
For instance, I can choose to show only the first two lines with the following command:
$ head -n 2 tests/inputs/twelve.txt
one
two
The -c option shows only the given number of bytes from a file.
For instance, I can show just the first two bytes:
$ head -c 2 tests/inputs/twelve.txt
on
Oddly, the GNU version will allow you to provide both -n and -c and defaults to showing bytes.
The BSD version will reject both arguments:
$ head -n 1 -c 2 tests/inputs/one.txt
head: can't combine line and byte counts
Any value for -n or -c that is not a positive integer will generate an error that will halt the program, and the error message will include the illegal value:
$ head -n 0 tests/inputs/one.txt
head: illegal line count -- 0
$ head -c foo tests/inputs/one.txt
head: illegal byte count -- foo
When there are multiple arguments, head adds a header and inserts a blank line between each file.
Notice in the following output that the first character in tests/inputs/one.txt is an Ö, a silly multibyte character I inserted to force the program to discern between bytes and characters:
$ head -n 1 tests/inputs/*.txt
==> tests/inputs/empty.txt <==

==> tests/inputs/one.txt <==
Öne line, four words.

==> tests/inputs/three.txt <==
Three

==> tests/inputs/twelve.txt <==
one

==> tests/inputs/two.txt <==
Two lines.
With no file arguments, head will read from STDIN:
$ cat tests/inputs/twelve.txt | head -n 2
one
two
As with cat in Chapter 3, any nonexistent or unreadable file is skipped and a warning is printed to STDERR.
In the following command, I will use blargh as a nonexistent file and will create an unreadable file called cant-touch-this:
$ touch cant-touch-this && chmod 000 cant-touch-this
$ head blargh cant-touch-this tests/inputs/one.txt
head: blargh: No such file or directory
head: cant-touch-this: Permission denied
==> tests/inputs/one.txt <==
Öne line, four words.
This is as much as this chapter’s challenge program will need to implement.
Getting Started
You might have anticipated that the program I want you to write will be called headr (pronounced head-er).
Start by running cargo new headr, then add the following dependencies to your Cargo.toml:
[dependencies]
anyhow = "1.0.79"
clap = { version = "4.5.0", features = ["derive"] }

[dev-dependencies]
assert_cmd = "2.0.13"
predicates = "3.0.4"
pretty_assertions = "1.4.0"
rand = "0.8.5"
Copy my 04_headr/tests directory into your project directory, and then run cargo test.
All the tests should fail.
Your mission, should you choose to accept it, is to write a program that will pass these tests.
I propose you begin src/main.rs with the following code to represent the program’s three parameters with an Args struct:
#[derive(Debug)]
struct Args {
    files: Vec<String>,
    lines: u64,
    bytes: Option<u64>,
}
The number of lines to print will be of the type u64.
Tip
All the command-line arguments for this program are optional because files should default to a dash (-), lines will default to 10, and bytes can be left out.
The primitive u64 is an unsigned integer that uses 8 bytes of memory and is similar to a usize, which is a pointer-sized unsigned integer type with a size that varies from 4 bytes on a 32-bit operating system to 8 bytes on a 64-bit system.
Rust also has an isize type, which is a pointer-sized signed integer that you would need to represent negative numbers as the GNU version does.
Since you only want to store positive numbers à la the BSD version, you can stick with an unsigned type.
Note the other Rust types of u32/i32 (unsigned/signed 32-bit integer) and u64/i64 (unsigned/signed 64-bit integer) if you want finer control over how large these values can be.
The lines and bytes parameters will be used in functions that expect the types usize and u64, so later we’ll discuss how to convert between these types.
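As a preview of that conversion, here is a minimal sketch using only the standard library. The helper name to_usize_checked is hypothetical and is not part of the chapter's solution. It contrasts the as keyword, which always succeeds but can silently truncate, with usize::try_from, which reports values that do not fit.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: convert a u64 count to usize, returning None
/// if the value does not fit on the current platform.
fn to_usize_checked(n: u64) -> Option<usize> {
    usize::try_from(n).ok()
}

fn main() {
    let lines: u64 = 10;

    // `as` never fails, but it truncates when the value is too large
    // for the target type.
    let n = lines as usize;
    println!("{n}");

    // Truncation in action: 300 does not fit in a u8, so the result
    // is 300 modulo 256.
    let small = 300u64 as u8;
    println!("{small}");

    // The checked conversion makes the failure case explicit.
    println!("{:?}", to_usize_checked(lines));
}
```

For a line count like 10 the two approaches agree; the checked form only matters for values larger than the platform's usize can hold.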
Your program should use 10 as the default value for lines, but bytes will be an Option, which I first introduced in Chapter 2.
This means that bytes will be either Some(n), where n is a u64, if the user provides a valid value or None if they do not.
I challenge you to parse the command-line arguments into this struct however you like.
To use the derive pattern, annotate the preceding Args accordingly.
If you prefer to follow the builder pattern, consider writing a get_args function with the following outline:
fn get_args() -> Args {
    let matches = Command::new("headr")
        .version("0.1.0")
        .author("Ken Youens-Clark <kyclark@gmail.com>")
        .about("Rust version of `head`")
        // What goes here?
        .get_matches();

    Args {
        files: ...
        lines: ...
        bytes: ...
    }
}
Update main to parse and pretty-print the arguments:
fn main() {
    // With the builder pattern, call get_args() here instead.
    let args = Args::parse();
    println!("{:#?}", args);
}
See if you can get your program to print a usage like the following. Note that I use the short and long names from the GNU version:
$ cargo run -- -h
Rust version of `head`

Usage: headr [OPTIONS] [FILE]...

Arguments:
  [FILE]...  Input file(s) [default: -]

Options:
  -n, --lines <LINES>  Number of lines [default: 10]
  -c, --bytes <BYTES>  Number of bytes
  -h, --help           Print help
  -V, --version        Print version
Run the program with no inputs and verify the defaults are correctly set:
$ cargo run
Args {
    files: [
        "-",
    ],
    lines: 10,
    bytes: None,
}
files should default to a dash (-) as the filename.
The number of lines should default to 10.
bytes should be None.
Now run the program with arguments and ensure they are correctly parsed:
$ cargo run -- -n 3 tests/inputs/one.txt
Args {
    files: [
        "tests/inputs/one.txt",
    ],
    lines: 3,
    bytes: None,
}
The positional argument tests/inputs/one.txt is parsed as one of the files.
The -n option for lines sets this to 3.
The -c option for bytes defaults to None.
If I provide more than one positional argument, they will all go into files, and the -c argument will go into bytes.
In the following command, I’m again relying on the bash shell to expand the file glob *.txt into all the files ending in .txt.
PowerShell users should refer to the equivalent use of Get-ChildItem shown in the section “Iterating Through the File Arguments”:
$ cargo run -- -c 4 tests/inputs/*.txt
Args {
    files: [
        "tests/inputs/empty.txt",
        "tests/inputs/one.txt",
        "tests/inputs/three.txt",
        "tests/inputs/twelve.txt",
        "tests/inputs/two.txt",
    ],
    lines: 10,
    bytes: Some(
        4,
    ),
}
There are five files ending in .txt.
lines is still set to the default value of 10.
The -c 4 results in the bytes now being Some(4).
Any value for -n or -c that cannot be parsed into a positive integer should cause the program to halt with an error.
Use clap::value_parser to ensure that the integer arguments are valid and convert them to numbers:
$ cargo run -- -n blargh tests/inputs/one.txt
error: invalid value 'blargh' for '--lines <LINES>': invalid digit found in string
$ cargo run -- -c 0 tests/inputs/one.txt
error: invalid value '0' for '--bytes <BYTES>': 0 is not in 1..18446744073709551615
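To make the validation rule concrete, the following is a rough standard-library sketch of the check that the value parser performs for you. The function name parse_positive_int and its error text are assumptions for illustration; they are not clap's API or its messages.

```rust
/// Hypothetical helper: parse a string into a positive (nonzero) u64,
/// or return an error message that includes the offending input.
fn parse_positive_int(val: &str) -> Result<u64, String> {
    match val.parse::<u64>() {
        // Accept only values greater than zero.
        Ok(n) if n > 0 => Ok(n),
        // Zero, negative numbers, and non-numeric text all land here.
        _ => Err(format!("illegal value -- {val}")),
    }
}

fn main() {
    for input in ["3", "0", "blargh"] {
        match parse_positive_int(input) {
            Ok(n) => println!("{input:?} parsed as {n}"),
            Err(e) => println!("{input:?} rejected: {e}"),
        }
    }
}
```

Letting clap do this work keeps the error reporting consistent with the rest of the generated usage, which is why the chapter relies on clap::value_parser instead.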
The program should disallow the use of both -n and -c:
$ cargo run -- -n 1 -c 1 tests/inputs/one.txt
error: the argument '--lines <LINES>' cannot be used with '--bytes <BYTES>'

Usage: headr --lines <LINES> <FILE>...
Note
Just parsing and validating the arguments is a challenge, but I know you can do it. Stop reading here and get your program to pass all the tests included with cargo test dies:
running 3 tests
test dies_bad_lines ... ok
test dies_bad_bytes ... ok
test dies_bytes_and_lines ... ok
Defining the Arguments
Welcome back.
I will first show the builder pattern with a get_args function as in the previous chapter.
Note that the two optional arguments, lines and bytes, accept numeric values.
This is different from the optional arguments implemented in Chapter 3 that are used as Boolean flags.
Note that the following code requires use clap::{Arg, Command}:
fn get_args() -> Args {
    let matches = Command::new("headr")
        .version("0.1.0")
        .author("Ken Youens-Clark <kyclark@gmail.com>")
        .about("Rust version of `head`")
        .arg(
            Arg::new("lines")
                .short('n')
                .long("lines")
                .value_name("LINES")
                .help("Number of lines")
                .value_parser(clap::value_parser!(u64).range(1..))
                .default_value("10"),
        )
        .arg(
            Arg::new("bytes")
                .short('c')
                .long("bytes")
                .value_name("BYTES")
                .conflicts_with("lines")
                .value_parser(clap::value_parser!(u64).range(1..))
                .help("Number of bytes"),
        )
        .arg(
            Arg::new("files")
                .value_name("FILE")
                .help("Input file(s)")
                .num_args(0..)
                .default_value("-"),
        )
        .get_matches();

    Args {
        files: matches.get_many("files").unwrap().cloned().collect(),
        lines: matches.get_one("lines").cloned().unwrap(),
        bytes: matches.get_one("bytes").cloned(),
    }
}
The lines option takes a value and defaults to 10.
The bytes option takes a value, and it conflicts with the lines parameter so that they are mutually exclusive.
The files parameter is positional, takes zero or more values, and defaults to a dash (-).
Alternatively, the clap derive pattern requires annotating the Args struct:
#[derive(Parser, Debug)]
#[command(author, version, about)]
/// Rust version of `head`
struct Args {
    /// Input file(s)
    #[arg(default_value = "-", value_name = "FILE")]
    files: Vec<String>,

    /// Number of lines
    #[arg(
        short('n'),
        long,
        default_value = "10",
        value_name = "LINES",
        value_parser = clap::value_parser!(u64).range(1..)
    )]
    lines: u64,

    /// Number of bytes
    #[arg(
        short('c'),
        long,
        value_name = "BYTES",
        conflicts_with("lines"),
        value_parser = clap::value_parser!(u64).range(1..)
    )]
    bytes: Option<u64>,
}
Tip
In the derive pattern, the default Arg::long value will be the name of the struct field, for example, lines and bytes. The default value for Arg::short will be the first letter of the struct field, so l or b. I specify the short names n and c, respectively, to match the original tool.
It’s quite a bit of work to validate all the user input, but now I have some assurance that I can proceed with good data.
Processing the Input Files
I recommend that you have your main call a run function.
Be sure to add use anyhow::Result for the following:
fn main() {
    if let Err(e) = run(Args::parse()) {
        eprintln!("{e}");
        std::process::exit(1);
    }
}

fn run(_args: Args) -> Result<()> {
    Ok(())
}
This challenge program should handle the input files as in Chapter 3, so I suggest you add the same open function:
fn open(filename: &str) -> Result<Box<dyn BufRead>> {
    match filename {
        "-" => Ok(Box::new(BufReader::new(io::stdin()))),
        _ => Ok(Box::new(BufReader::new(File::open(filename)?))),
    }
}
Be sure to add all these additional imports:
use std::fs::File;
use std::io::{self, BufRead, BufReader};
Expand your run function to try opening the files, printing errors as you encounter them:
fn run(args: Args) -> Result<()> {
    for filename in args.files {
        match open(&filename) {
            Err(err) => eprintln!("{filename}: {err}"),
            Ok(_) => println!("Opened {filename}"),
        }
    }
    Ok(())
}
Iterate through each of the filenames.
Attempt to open the given file.
Print errors to STDERR.
Print a message that the file was successfully opened.
Run your program with a good file and a bad file to ensure it seems to work. In the following command, blargh represents a nonexistent file:
$ cargo run -- blargh tests/inputs/one.txt
blargh: No such file or directory (os error 2)
Opened tests/inputs/one.txt
Without looking ahead to my solution, figure out how to read the lines and then the bytes of a given file.
Next, add the headers separating multiple file arguments.
Look closely at the error output from the original head program when handling invalid files, noticing that readable files have a header first and then the file output, but invalid files only print an error.
Additionally, there is an extra blank line separating the output for the valid files:
$ head -n 1 tests/inputs/one.txt blargh tests/inputs/two.txt
==> tests/inputs/one.txt <==
Öne line, four words.
head: blargh: No such file or directory

==> tests/inputs/two.txt <==
Two lines.
I’ve specifically designed some challenging inputs for you to consider.
To see what you face, use the file command to report file type information:
$ file tests/inputs/*.txt
tests/inputs/empty.txt:  empty
tests/inputs/one.txt:    UTF-8 Unicode text
tests/inputs/three.txt:  ASCII text, with CRLF, LF line terminators
tests/inputs/twelve.txt: ASCII text
tests/inputs/two.txt:    ASCII text
empty.txt is an empty file just to ensure your program doesn’t fall over.
one.txt contains Unicode, as I put an umlaut over the O in Öne to force you to consider the differences between bytes and characters.
three.txt has Windows-style line endings.
twelve.txt has 12 lines to ensure the default of 10 lines is shown.
two.txt has Unix-style line endings.
Tip
On Windows, the newline is the combination of the carriage return and the line feed, often shown as CRLF or \r\n. On Unix platforms, only the line feed is used, so LF or \n. These line endings must be preserved in the output from your program, so you will have to find a way to read the lines in a file without removing the line endings.
Reading Bytes Versus Characters
Before continuing, you should understand the difference between reading bytes and characters from a file. In the early 1960s, the American Standard Code for Information Interchange (ASCII, pronounced as-key) table of 128 characters represented all possible text elements in computing. It takes only seven bits (2^7 = 128) to represent this many characters. Usually a byte consists of eight bits, so the notions of byte and character were interchangeable.
Since the creation of Unicode (Universal Coded Character Set) to represent all the writing systems of the world (and even emojis), some characters may require up to four bytes.
The Unicode standard defines several ways to encode characters, including UTF-8 (Unicode Transformation Format using eight bits).
As noted, the file tests/inputs/one.txt begins with the character Ö, which is two bytes long in UTF-8.
If you want head to show you this one character, you must request two bytes:
$ head -c 2 tests/inputs/one.txt
Ö
If you ask head to select just the first byte from this file, you get the byte value 195, which is not a valid UTF-8 string.
The output is a special character that indicates a problem converting a character into Unicode:
$ head -c 1 tests/inputs/one.txt
�
The challenge program is expected to re-create this behavior.
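The gap between bytes and characters can be checked directly in Rust. The following sketch assumes the first line of one.txt is the text shown earlier and writes the Ö as the escape \u{D6} so the encoding is unambiguous. The helper byte_and_char_counts is a hypothetical name used only for this illustration.

```rust
/// Hypothetical helper: report (byte count, character count) for a
/// string slice.
fn byte_and_char_counts(text: &str) -> (usize, usize) {
    // `len` counts bytes; `chars().count()` counts Unicode scalar values.
    (text.len(), text.chars().count())
}

fn main() {
    // "\u{D6}" is the precomposed character Ö (assumed to match one.txt).
    let text = "\u{D6}ne line, four words.";
    let (bytes, chars) = byte_and_char_counts(text);
    println!("{bytes} bytes, {chars} characters");

    // The two bytes that encode Ö in UTF-8: [195, 150].
    println!("{:?}", &text.as_bytes()[..2]);
}
```

The first of those two bytes is the value 195 mentioned above, which is why asking for a single byte yields something that cannot be displayed as a character.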
This is not an easy program to write, but you should be able to use std::io, std::fs::File, and std::io::BufReader to figure out how to read bytes and lines from each of the files.
Note that in Rust, a String must be a valid UTF-8-encoded string, and so the method String::from_utf8_lossy might prove useful.
I’ve included a full set of tests in tests/cli.rs that you should have copied into your source tree.
Note
Stop reading here and finish the program. Use cargo test frequently to check your progress. Do your best to pass all the tests before looking at my solution.
Solution
This challenge proved more interesting than I anticipated.
I thought it would be little more than a variation on cat, but it turned out to be quite a bit more difficult.
I’ll walk you through how I arrived at my solution.
Reading a File Line by Line
After opening the valid files, I started by reading lines from the filehandle. I decided to modify some code from Chapter 3:
fn run(args: Args) -> Result<()> {
    for filename in args.files {
        match open(&filename) {
            Err(err) => eprintln!("{filename}: {err}"),
            Ok(file) => {
                for line in file.lines().take(args.lines as usize) {
                    println!("{}", line?);
                }
            }
        }
    }
    Ok(())
}
Use Iterator::take to select the desired number of lines from the filehandle.
Print the line to the console.
Tip
The Iterator::take method expects its argument to be the type usize, but I have a u64. I cast or convert the value using the as keyword.
I think this is a fun solution because it uses the Iterator::take method to select the desired number of lines.
I can run the program to select one line from a file, and it appears to work well:
$ cargo run -- -n 1 tests/inputs/twelve.txt
one
If I run cargo test, the program passes almost half the tests, which seems pretty good for having implemented only a small portion of the specifications; however, it’s failing all the tests that use the Windows-encoded input file.
To fix this problem, I have a confession to make.
Preserving Line Endings While Reading a File
I hate to break it to you, dear reader, but the catr program in Chapter 3 does not completely replicate the original cat program because it uses BufRead::lines to read the input files.
The documentation for that function says, “Each string returned will not have a newline byte (the 0xA byte) or CRLF (0xD, 0xA bytes) at the end.”
I hope you’ll forgive me because I wanted to show you how easy it can be to read the lines of a file, but you should be aware that the catr program replaces Windows CRLF line endings with Unix-style newlines.
To fix this, I must instead use BufRead::read_line, which, according to the documentation, “will read bytes from the underlying stream until the newline delimiter (the 0xA byte) or EOF is found. Once found, all bytes up to, and including, the delimiter (if found) will be appended to buf.”1
Following is a version that will preserve the original line endings.
With these changes, the program will pass more tests than it fails:
fn run(args: Args) -> Result<()> {
    for filename in args.files {
        match open(&filename) {
            Err(err) => eprintln!("{filename}: {err}"),
            Ok(mut file) => {
                let mut line = String::new();
                for _ in 0..args.lines {
                    let bytes = file.read_line(&mut line)?;
                    if bytes == 0 {
                        break;
                    }
                    print!("{line}");
                    line.clear();
                }
            }
        };
    }
    Ok(())
}
Accept the filehandle as a mutable value.
Use String::new to create a new, empty mutable string buffer to hold each line.
Use for to iterate through a std::ops::Range to count up from zero to the requested number of lines. The variable name _ indicates I do not intend to use it.
Use BufRead::read_line to read the next line into the string buffer.
The filehandle will return zero bytes when it reaches the end of the file, so break out of the loop.
Use String::clear to empty the line buffer.
If I run cargo test at this point, the program will pass almost all the tests for reading lines and will fail all those for reading bytes and handling multiple files.
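The difference between the two reading strategies is easy to observe with an in-memory reader. This sketch uses std::io::Cursor in place of a real file, and the sample text with mixed CRLF and LF endings is an assumption modeled on the description of three.txt. The function names head_stripped and head_preserved are hypothetical.

```rust
use std::io::{BufRead, Cursor};

/// Collect up to `n` lines with BufRead::lines, which strips endings.
fn head_stripped(input: &str, n: usize) -> Vec<String> {
    Cursor::new(input)
        .lines()
        .take(n)
        .map(|line| line.expect("reading from memory should not fail"))
        .collect()
}

/// Collect up to `n` lines with BufRead::read_line, which keeps endings.
fn head_preserved(input: &str, n: usize) -> Vec<String> {
    let mut reader = Cursor::new(input);
    let mut out = Vec::new();
    for _ in 0..n {
        let mut line = String::new();
        // Zero bytes read means the end of the input was reached.
        if reader.read_line(&mut line).expect("read failed") == 0 {
            break;
        }
        out.push(line);
    }
    out
}

fn main() {
    // Assumed sample: one CRLF ending followed by LF endings.
    let text = "Three\r\nlines,\nfour words.\n";
    println!("{:?}", head_stripped(text, 2));
    println!("{:?}", head_preserved(text, 2));
}
```

Printing with the debug formatter makes the retained \r\n and \n visible in the second result, while the first result shows bare words.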
Reading Bytes from a File
Next, I’ll handle reading bytes from a file.
After I attempt to open the file, I check to see if args.bytes is Some number of bytes; otherwise, I’ll use the preceding code that reads lines.
For the following code, be sure to add use std::io::Read to your imports:
for filename in args.files {
    match open(&filename) {
        Err(err) => eprintln!("{filename}: {err}"),
        Ok(mut file) => {
            if let Some(num_bytes) = args.bytes {
                let mut buffer = vec![0; num_bytes as usize];
                let bytes_read = file.read(&mut buffer)?;
                print!(
                    "{}",
                    String::from_utf8_lossy(&buffer[..bytes_read])
                );
            } else {
                ... // Same as before
            }
        }
    };
}
Use pattern matching to check if args.bytes is Some number of bytes to read.
Create a mutable buffer of a fixed length num_bytes filled with zeros to hold the bytes read from the file.
Read bytes from the filehandle into the buffer. The value bytes_read will contain the number of bytes that were read, which may be fewer than the number requested.
Convert the selected bytes into a string, which may not be valid UTF-8. Note the range operation to select only the bytes actually read.
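Because a single read call is allowed to return fewer bytes than requested, one alternative worth knowing is Read::take combined with Read::read_to_end. This is a sketch of that variation and not the approach used in this chapter's solution; the function name read_first_bytes is hypothetical, and a Cursor stands in for an opened file.

```rust
use std::io::{self, Cursor, Read};

/// Hypothetical helper: read at most `num_bytes` bytes from any reader.
fn read_first_bytes<R: Read>(reader: R, num_bytes: u64) -> io::Result<Vec<u8>> {
    let mut buffer = Vec::new();
    // `take` caps the reader at `num_bytes`; `read_to_end` keeps reading
    // until that cap or the end of the input is reached.
    reader.take(num_bytes).read_to_end(&mut buffer)?;
    Ok(buffer)
}

fn main() -> io::Result<()> {
    let bytes = read_first_bytes(Cursor::new("one\ntwo\n"), 2)?;
    println!("{}", String::from_utf8_lossy(&bytes));

    // An empty input simply yields an empty buffer.
    let empty = read_first_bytes(Cursor::new(""), 5)?;
    println!("{}", empty.len());
    Ok(())
}
```

Note that Read::take accepts a u64, so no cast to usize is needed with this variation.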
As you saw in the case of selecting only part of a multibyte character, converting bytes to characters could fail because strings in Rust must be valid UTF-8.
The String::from_utf8 function will return an Ok only if the string is valid, but String::from_utf8_lossy will convert invalid UTF-8 sequences to the unknown or replacement character:
$ cargo run -- -c 1 tests/inputs/one.txt
�
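The two conversion functions can be compared side by side on the bytes of Ö. In this sketch, decode_strict and decode_lossy are hypothetical wrapper names around the standard library calls.

```rust
/// Strict decoding: Some(text) only if `bytes` is valid UTF-8.
fn decode_strict(bytes: &[u8]) -> Option<String> {
    String::from_utf8(bytes.to_vec()).ok()
}

/// Lossy decoding: invalid sequences become U+FFFD.
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    let whole = [0xC3u8, 0x96]; // both bytes of Ö
    let partial = [0xC3u8]; // only the first byte

    println!("{:?}", decode_strict(&whole));
    println!("{:?}", decode_strict(&partial));

    // The replacement character is itself three bytes in UTF-8.
    println!("{:?}", decode_lossy(&partial).as_bytes());
}
```

The strict version forces the caller to handle the failure, while the lossy version always produces something printable, which suits a tool that must echo arbitrary input.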
Let me show you another, much worse, way to read the bytes from a file.
You can read the entire file into a string, convert that into a slice of bytes, and then select the first num_bytes:
let mut contents = String::new();
file.read_to_string(&mut contents)?; // Danger here
let bytes = contents.as_bytes();
print!(
    "{}",
    String::from_utf8_lossy(&bytes[..num_bytes as usize]) // More danger
);
Create a new string buffer to hold the contents of the file.
Read the entire file contents into the string buffer.
Use str::as_bytes to convert the contents into bytes (u8 or unsigned 8-bit integers).
Use String::from_utf8_lossy to turn a slice of bytes into a string.
As I’ve noted before, this approach can crash your program or computer if the file’s size exceeds the amount of memory on your machine.
Another serious problem with the preceding code is that it assumes the slice operation bytes[..num_bytes] will succeed.
If you use this code with an empty file, for instance, you’ll be asking for bytes that don’t exist.
This will cause your program to panic and exit immediately with an error message:
$ cargo run -- -c 1 tests/inputs/empty.txt
thread 'main' panicked at src/main.rs:53:55:
range end index 1 out of range for slice of length 0
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
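If you did want to keep a slice-based approach, the panic itself can be avoided by clamping the end of the range to the slice length, or by using the slice get method, which returns None for an out-of-range request. This is only a sketch of the bounds check with a hypothetical helper named first_n; it does nothing about the cost of reading the whole file into memory.

```rust
/// Hypothetical helper: return at most the first `n` bytes of `bytes`
/// without panicking when fewer are available.
fn first_n(bytes: &[u8], n: usize) -> &[u8] {
    &bytes[..n.min(bytes.len())]
}

fn main() {
    let empty: &[u8] = b"";
    let text: &[u8] = b"one";

    // Clamping yields an empty slice instead of a panic.
    println!("{:?}", first_n(empty, 1));
    println!("{:?}", first_n(text, 2));

    // `get` reports the out-of-range request as None.
    println!("{:?}", empty.get(..1));
}
```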
Following is a safe—and perhaps the shortest—way to read the desired number of bytes from a file.
Be sure to add use std::io::Read to your imports, as the bytes method comes from the Read trait:
let bytes: Result<Vec<_>, _> = file.bytes().take(num_bytes as usize).collect();
print!("{}", String::from_utf8_lossy(&bytes?));
In the preceding code, the type annotation Result<Vec<_>, _> is necessary as the compiler infers the type of bytes as a slice, which has an unknown size.
I must indicate I want a Vec, which is a smart pointer to heap-allocated memory.
The underscores (_) indicate partial type annotation, causing the compiler to infer the types.
Without any type annotation for bytes, the compiler complains thusly:
error[E0277]: the size for values of type `[u8]` cannot be known at
compilation time
  --> src/main.rs:50:59
   |
95 |     print!("{}", String::from_utf8_lossy(&bytes?));
   |                                          ^^^^^^^ doesn't
   |                                          have a size known at compile-time
   |
   = help: the trait `Sized` is not implemented for `[u8]`
   = note: all local variables must have a statically known size
   = help: unsized locals are gated as an unstable feature
Note
You’ve now seen that the underscore (_) serves various functions. As the prefix or name of a variable, it shows the compiler you don’t want to use the value. In a match arm, it is the wildcard for handling any case. When used in a type annotation, it tells the compiler to infer the type.
You can also indicate the type information on the righthand side of the expression using the turbofish operator (::<>).
Often it’s a matter of style whether you indicate the type on the lefthand or righthand side, but later you will see examples where the turbofish is required for some expressions.
Here’s what the previous example would look like with the type indicated with the turbofish instead:
let bytes = file
    .bytes()
    .take(num_bytes as usize)
    .collect::<Result<Vec<_>, _>>();
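The turbofish is not specific to filehandles. As a small standalone illustration, the following sketch applies it to both str::parse and Iterator::collect; the function name parse_all is hypothetical.

```rust
use std::num::ParseIntError;

/// Hypothetical helper: parse every field as u64, stopping at the
/// first value that fails to parse.
fn parse_all(fields: &[&str]) -> Result<Vec<u64>, ParseIntError> {
    fields
        .iter()
        // Turbofish on `parse` names the target number type.
        .map(|s| s.parse::<u64>())
        // Turbofish on `collect` asks for a Result wrapping a Vec,
        // with the inner types left for the compiler to infer.
        .collect::<Result<Vec<_>, _>>()
}

fn main() {
    println!("{:?}", parse_all(&["1", "2", "3"]));
    println!("{}", parse_all(&["1", "x"]).is_err());
}
```

Collecting into a Result this way short-circuits: the first Err ends the iteration and becomes the overall result.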
The unknown character produced by String::from_utf8_lossy (b'\xef\xbf\xbd') is not exactly the same output produced by the BSD head (b'\xc3'), making this somewhat difficult to test.
If you look at the run helper function in tests/cli.rs, you’ll see that I read the expected value (the output from head) and used the same function to convert what could be invalid UTF-8 so that I can compare the two outputs.
The run_stdin function works similarly:
fn run(args: &[&str], expected_file: &str) -> Result<()> {
    // Extra work here due to lossy UTF
    let mut file = File::open(expected_file)?;
    let mut buffer = Vec::new();
    file.read_to_end(&mut buffer)?;
    let expected = String::from_utf8_lossy(&buffer);

    let output = Command::cargo_bin(PRG)?.args(args).output().expect("fail");
    assert!(output.status.success());
    assert_eq!(String::from_utf8_lossy(&output.stdout), expected);

    Ok(())
}
Printing the File Separators
The last piece to handle is the separators between multiple files.
As noted before, valid files have a header that puts the filename inside ==> and <== markers.
Files after the first have an additional newline at the beginning to visually separate the output.
This means I will need to know the file number that I’m handling, which I can get by using the Iterator::enumerate method.
Following is the final version of my run function that will pass all the tests:
fn run(args: Args) -> Result<()> {
    let num_files = args.files.len();

    for (file_num, filename) in args.files.iter().enumerate() {
        match open(filename) {
            Err(err) => eprintln!("{filename}: {err}"),
            Ok(mut file) => {
                if num_files > 1 {
                    println!(
                        "{}==> {filename} <==",
                        if file_num > 0 { "\n" } else { "" },
                    );
                }

                if let Some(num_bytes) = args.bytes {
                    let mut buffer = vec![0; num_bytes as usize];
                    let bytes_read = file.read(&mut buffer)?;
                    print!(
                        "{}",
                        String::from_utf8_lossy(&buffer[..bytes_read])
                    );
                } else {
                    let mut line = String::new();
                    for _ in 0..args.lines {
                        let bytes = file.read_line(&mut line)?;
                        if bytes == 0 {
                            break;
                        }
                        print!("{line}");
                        line.clear();
                    }
                }
            }
        }
    }

    Ok(())
}
Use the Vec::len method to get the number of files.
Use the Iterator::enumerate method to track the file number and filenames.
Only print headers when there are multiple files.
Print a newline when file_num is greater than 0, which indicates this is not the first file.
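The header logic can also be pulled out into a small function, which makes the rule easy to check on its own. This is a sketch of that idea with a hypothetical function named header; the run function above inlines the same logic instead.

```rust
/// Hypothetical helper: build the header for the file at zero-based
/// position `file_num`, or None when only one file was given.
fn header(file_num: usize, num_files: usize, filename: &str) -> Option<String> {
    if num_files > 1 {
        // Every header after the first starts with a blank line.
        let sep = if file_num > 0 { "\n" } else { "" };
        Some(format!("{sep}==> {filename} <=="))
    } else {
        None
    }
}

fn main() {
    let files = ["tests/inputs/one.txt", "tests/inputs/two.txt"];
    for (file_num, filename) in files.iter().enumerate() {
        if let Some(text) = header(file_num, files.len(), filename) {
            println!("{text}");
        }
    }
}
```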
Going Further
There’s no reason to stop this party now.
Consider implementing how the GNU head handles numeric values with suffixes and negative values.
For instance, -c 1K means print the first 1,024 bytes of the file, and -n -3 means print all but the last three lines of the file.
You’ll need to change lines and bytes to signed integer values to store both positive and negative numbers.
Be sure to run the GNU head with these arguments, capture the output to test files, and write tests to cover the new features you add.
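As a starting point for that exercise, here is one possible sketch of a parser for such values. The function name parse_count and its error messages are assumptions, and only the suffixes b, kB, K, MB, and M from the GNU usage shown earlier are handled; the larger suffixes are left out for brevity.

```rust
/// Hypothetical helper: parse a GNU-style count such as "10", "-3",
/// or "1K" into a signed number of lines or bytes.
fn parse_count(val: &str) -> Result<i64, String> {
    // A leading dash means "all but the last N".
    let (sign, rest) = match val.strip_prefix('-') {
        Some(rest) => (-1, rest),
        None => (1, val),
    };

    // Split the leading digits from an optional multiplier suffix.
    let split = rest
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(rest.len());
    let (digits, suffix) = rest.split_at(split);

    let num: i64 = digits
        .parse()
        .map_err(|_| format!("invalid count -- {val}"))?;

    let multiplier: i64 = match suffix {
        "" => 1,
        "b" => 512,
        "kB" => 1000,
        "K" => 1024,
        "MB" => 1000 * 1000,
        "M" => 1024 * 1024,
        _ => return Err(format!("invalid suffix -- {val}")),
    };

    // Guard against overflow before applying the sign.
    num.checked_mul(multiplier)
        .map(|n| sign * n)
        .ok_or_else(|| format!("count too large -- {val}"))
}

fn main() {
    for input in ["10", "-3", "1K", "2MB", "foo"] {
        println!("{input:?} => {:?}", parse_count(input));
    }
}
```

A function with this shape could be wired into clap as a custom value parser, though how you connect it is a design decision the exercise leaves to you.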
You could also add an option for selecting characters in addition to bytes.
You can use the str::chars method, which is also available on a String, to split a string into characters.
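A minimal sketch of character selection, assuming the input has already been read into a string, might look like the following. The function name first_chars is hypothetical.

```rust
/// Hypothetical helper: return the first `n` characters (Unicode
/// scalar values) of `text`.
fn first_chars(text: &str, n: usize) -> String {
    // `take` stops early if the text has fewer than `n` characters.
    text.chars().take(n).collect()
}

fn main() {
    // "\u{D6}" is the precomposed character Ö.
    let text = "\u{D6}ne line, four words.";

    // One character is selected even though it occupies two bytes.
    let first = first_chars(text, 1);
    println!("{first} ({} bytes)", first.len());
}
```

Unlike the byte-oriented option, this never splits a multibyte character, so the result is always valid UTF-8.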
Finally, copy the test input file with the Windows line endings (tests/inputs/three.txt) to the tests for Chapter 3.
Edit the mk-outs.sh for that program to incorporate this file, and then expand the tests and program to ensure that line endings are preserved.
Summary
This chapter dove into some fairly sticky subjects, such as converting types like string inputs to a u64 and then casting these to usize.
If you still feel confused, just know that you won’t always.
If you keep reading the docs and writing more code, it will eventually make sense.
Here are some things you accomplished in this chapter:
- You learned to create optional parameters that can take values. Previously, the options were flags.
- You saw that all command-line arguments are strings and used clap to attempt the conversion of a string like "3" into the number 3.
- You learned to convert types using the as keyword.
- You found that using _ as the name or prefix of a variable is a way to indicate to the compiler that you don’t intend to use the value. When used in a type annotation, it tells the compiler to infer the type.
- You learned how to use BufRead::read_line to preserve line endings while reading a filehandle.
- You found that the take method works on both iterators and filehandles to limit the number of elements you select.
- You learned to indicate type information on the lefthand side of an assignment or on the righthand side using the turbofish operator.
In the next chapter, you’ll learn more about Rust iterators and how to break input into lines, bytes, and characters.
1 EOF is an acronym for end of file.