Chapter 9. Getting Data
To write it, it took three months; to conceive it, three minutes; to collect the data in it, all my life.
F. Scott Fitzgerald
In order to be a data scientist you need data. In fact, as a data scientist you will spend an embarrassingly large fraction of your time acquiring, cleaning, and transforming data. In a pinch, you can always type the data in yourself (or if you have minions, make them do it), but usually this is not a good use of your time. In this chapter, we’ll look at different ways of getting data into Python and into the right formats.
stdin and stdout
If you run your Python scripts at the command line, you can pipe data through them using sys.stdin and sys.stdout. For example, here is a script that reads in lines of text and spits back out the ones that match a regular expression:
# egrep.py
import sys, re

# sys.argv is the list of command-line arguments
# sys.argv[0] is the name of the program itself
# sys.argv[1] will be the regex specified at the command line
regex = sys.argv[1]

# for every line passed into the script
for line in sys.stdin:
    # if it matches the regex, write it to stdout
    if re.search(regex, line):
        sys.stdout.write(line)
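The filtering step can also be tried without a shell. The following is a minimal sketch, assuming a hypothetical helper named matching_lines and a made-up list of sample lines standing in for sys.stdin; neither appears in the original script. It applies the same re.search test, but compiles the pattern once up front.

```python
import re
from typing import Iterable, Iterator


def matching_lines(pattern: str, lines: Iterable[str]) -> Iterator[str]:
    """Yield each line containing a match for pattern (hypothetical helper)."""
    compiled = re.compile(pattern)  # compile once, reuse for every line
    for line in lines:
        if compiled.search(line):
            yield line


# Made-up sample input standing in for sys.stdin
sample = ["alpha\n", "beta 42\n", "gamma\n", "delta 7\n"]
matched = list(matching_lines(r"[0-9]", sample))
print(matched)
```

Because the helper accepts any iterable of strings, passing sys.stdin in place of sample should behave like the script above.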
And here’s one that counts the lines it receives and then writes out the count:
# line_count.py
import sys

count = 0
for line in sys.stdin:
    count += 1

# print goes to sys.stdout
print(count)
You could then use these to count how many lines of a file contain numbers. In Windows, you’d use:
type SomeFile.txt ...
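The idea of chaining the two scripts can be explored inside a single Python process. This is a rough sketch rather than the shell command itself: io.StringIO objects stand in for the file and for the pipe between the two stages, the function names egrep and line_count merely mirror the script names, and both the sample text and the [0-9] pattern are assumptions chosen for illustration.

```python
import io
import re


def egrep(pattern, source, sink):
    """Copy lines from source to sink when they match pattern."""
    for line in source:
        if re.search(pattern, line):
            sink.write(line)


def line_count(source):
    """Return the number of lines read from source."""
    count = 0
    for _ in source:
        count += 1
    return count


# Made-up file contents; io.StringIO stands in for the real streams
fake_file = io.StringIO("no digits here\nroom 101\nstill none\n3 apples\n")
pipe = io.StringIO()

egrep(r"[0-9]", fake_file, pipe)
pipe.seek(0)  # rewind so the next stage reads from the start
lines_with_numbers = line_count(pipe)
print(lines_with_numbers)
```

Each stage only reads from one stream and writes to another, which is what lets the real scripts be combined with a shell pipe.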