Pattern Matching

We need a dataframe with a serious amount of text in it to make these exercises relevant:

wf <- read.table("c:\\temp\\worldfloras.txt", header = TRUE)
attach(wf)
names(wf)

[1] "Country" "Latitude" "Area" "Population" "Flora"
[6] "Endemism" "Continent"

Country

As you can see, there are 161 countries in this dataframe (strictly, 161 places, since some of the entries, such as Sicily and Balearic Islands, are not countries). The idea is that we want to be able to select subsets of countries on the basis of specified patterns within the character strings that make up the country names (factor levels). The function to do this is grep. It searches for matches to a pattern (specified in its first argument) within the character vector that forms its second argument, and returns a vector of indices (subscripts) identifying the elements of that vector in which the pattern was found, in whole or in part. The topic of pattern matching is very easy to master once the penny drops, but it is hard to grasp without simple, concrete examples. Perhaps the simplest task is to select all the countries containing a particular letter, for instance upper case R:

as.vector(Country[grep("R",as.character(Country))])

[1] "Central African Republic" "Costa Rica"
[3] "Dominican Republic"       "Puerto Rico"
[5] "Reunion"                  "Romania"
[7] "Rwanda"                   "USSR"
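
It may help to look at what grep returns on its own, before the result is used as a subscript. The following is a minimal sketch using a small made-up vector called places (a hypothetical stand-in, not the worldfloras data), so that it runs without the data file:

```r
# Hypothetical stand-in for the Country column (not the real worldfloras data)
places <- c("Costa Rica", "Romania", "France", "Puerto Rico", "Rwanda")

# grep returns the positions of the elements that contain an upper case R;
# "France" has only a lower case r, so position 3 is left out
idx <- grep("R", places)
idx
# [1] 1 2 4 5

# Using those positions as subscripts recovers the matching strings
places[idx]

# value = TRUE asks grep to return the matching strings directly
grep("R", places, value = TRUE)
```

The last two expressions give the same result; subscripting with the indices is the form used in this section, while value = TRUE is a shorter alternative when only the matching strings are needed.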

To restrict the search to countries whose names begin with R, use the ^ character, which anchors the pattern to the start of the string:

as.vector(Country[grep("^R",as.character(Country))]) ...
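
As a sketch of what the ^ anchor does, the same kind of search can be tried on a small made-up vector (places is a hypothetical example, not taken from the worldfloras data). Elements that contain an R only later in the string are no longer matched:

```r
# Hypothetical example vector (not the real worldfloras data)
places <- c("Costa Rica", "Romania", "France", "Puerto Rico", "Rwanda")

# Without the anchor, any element containing an upper case R matches
anywhere <- grep("R", places, value = TRUE)

# With ^, the R must be the first character of the string
at_start <- grep("^R", places, value = TRUE)
at_start
# [1] "Romania" "Rwanda"

# For a fixed prefix, base R's startsWith() gives the same selection
# without a regular expression
places[startsWith(places, "R")]
```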
