If you store your data in a database, this hack shows you how to fetch it from R or Excel.
In “Move Data from a Database to Excel” [Hack #19] , I show several ways to export data from a database program into an Excel spreadsheet. This hack shows you another way to access data using ODBC (which stands for Open Database Connectivity). ODBC provides a standard interface for different programs to connect to databases, and it allows you to use the right tool for each job. As shown in Figure 4-5, you can do data selection and manipulation using SQL and then analyze the data in a tool such as R or Excel.
To use ODBC for MySQL, first you must install the MySQL ODBC drivers on your computer. (MyODBC is available for Windows, Mac OS X, Linux, and other platforms.) You can download the files from http://dev.mysql.com/downloads/connector/odbc/3.51.html. The simplest way to install MyODBC is to download the precompiled binaries and then run the install program to install the drivers.
If you don’t want to use MySQL at all, you might not have to. If you have Microsoft Access installed on your Windows XP machine, you should already have the Microsoft Access ODBC drivers installed. Plus, the Baseball Archive database is available as a Microsoft Access database.
If you’re using Mac OS X, you might have some problems with the MyODBC drivers, depending on the versions of MySQL, MyODBC, and Mac OS X that you are using. I found that the drivers from Actual Technologies worked somewhat better, but you have to pay for the software. See http://www.actualtechnologies.com for more information.
Now, you need to configure an ODBC connection for your database. Under the Start menu, go to Programs → Administrative Tools → Data Sources (ODBC); you will also find this in the Administrative Tools section of the Control Panel. Once you’ve opened the configuration manager, you will see a list of user data sources. If you are using MySQL, you need to add a MySQL ODBC connection. Click the Add…button. Select MySQL ODBC 3.51 Driver from the list in the New Dialog box, and then click Finish.
Here’s an example of what you would enter to access a MySQL database of player and team statistics from the Baseball DataBank database [Hack #10] :
- Data source name
This is the name you will use to refer to the database—a completely separate name from the underlying database itself. I used bballdata.
- Description
The description of the data source, which can be whatever you like.
- Server
The name of the machine on which the MySQL server resides. In most cases, this will be your local machine, referred to as localhost.
- User
The username of the account that you use to log in. Use the name you created when you set up the database. (In “Get a MySQL Database of Player and Team Statistics” [Hack #10] , I chose jadler.)
- Password
The password for the account. (In “Get a MySQL Database of Player and Team Statistics” [Hack #10] , I chose P@ssw0rd.)
- Database
The name of the database to which you would like to connect. This is the name of the database in MySQL. (In “Get a MySQL Database of Player and Team Statistics” [Hack #10] , I used bbdatabank.)
Figure 4-6 shows the configuration screen.
If you are using Microsoft Access, you need to add an Access ODBC connection. This is just as simple. Click the Add…button, select Microsoft Access Driver from the list, and choose the Access database file.
R is a terrific language for calculations, models, and visualization, but SQL [Hack #8] is a better choice for data storage and retrieval. Once you’ve installed the ODBC drivers and configured a data source, you’re almost ready to use R. The last preparatory step is to install RODBC, the package [Hack #31] that “ODBC-enables” R. Start R, select Packages → Load Package…, and then select RODBC from the list of options. Once you’ve installed the package, you must load it using the command library(RODBC) each time you use R.
With the package installed, you’re ready to crunch data with R. The RODBC package includes many commands; for more information, see the RODBC manual or type help(RODBC). Here are a few key functions (and arguments to those functions; I’m omitting a few to keep this easy) that you may find useful:
Table 4-2.
Function name |
Arguments |
Description |
---|---|---|
|
|
Connects to the ODBC data source named in the first argument (using the name you assigned to it in the Data Sources administrative tool). User ID and password are optional. Returns an object that you can use with other ODBC functions. |
|
|
Returns information about the ODBC connection. |
|
|
Lists the tables available from the ODBC data source. |
|
|
Returns a data frame containing the data from the specified table name. Optionally, you can specify a limited number of rows. |
|
|
Returns a data frame containing the results of the SQL query. |
Here’s a sample run of this program, using the commands that we described earlier. In this example, we start by opening a “channel” in R for accessing the database using the odbcConnect function. Next, we look at some information about the database (using the odbcGetInfo function), and we get a set of tables available on the database. Finally, we copy a few observations into R using the sqlFetch function and display the results to make sure everything worked:
>library(RODBC) >channel <- odbcConnect('bballdata') >odbcGetInfo(channel) DBMS_Name DBMS_Ver Driver_ODBC_Ver "MySQL" " 4.0.21-standard" "03.51" Data_Source_Name Driver_Name Driver_Ver "bballdata" "myodbc3.dll" "03.51.09" ODBC_Ver Server_Name "03.52.0000" "192.168.0.3 via TCP/IP" >sqlTables(channel) TABLE_CAT TABLE_SCHEM TABLE_NAME TABLE_TYPE REMARKS 1 bballdata events TABLE MySQL table >events <- sqlFetch(channel, "events", max=9) >events gid AB Pitches Play 1 gid_2004_01_15_aasmlb_nasmlb_1 0 2 gid_2004_01_15_aasmlb_nasmlb_1 0 3 gid_2004_01_15_aasmlb_nasmlb_1 1 .BFE HR/7/L 4 gid_2004_01_15_aasmlb_nasmlb_1 2 .CFX 63/G 5 gid_2004_01_15_aasmlb_nasmlb_1 3 .SBX 8/F 6 gid_2004_01_15_aasmlb_nasmlb_1 4 7 gid_2004_01_15_aasmlb_nasmlb_1 4 ..BD S9/G 8 gid_2004_01_15_aasmlb_nasmlb_1 5 9 gid_2004_01_15_aasmlb_nasmlb_1 5 .B1.D D8/F.1-3 PlayByPlay 1 2 3 Carlos Febles homers (1) on a line drive to left field. 4 Kimera Bartee grounds out, shortstop Kevin Polcovich to first baseman Kevin Young. 5 Geronimo Berroa flies out to center fielder Rich Becker. 6 National Manager Walter Alston ejected by HP umpire Mark Carlson. 7 Mark Whiten singles on a ground ball to right fielder Rob Ducey. 8 American first baseman Larry Barnes left the game due to an injured chest. 9 Larry Barnes doubles (1) on a fly ball to center fielder Rich Becker. Mark Whiten to 3rd.
The cool thing about ODBC is that it gives you lots of options for how to work with the data. Here’s how to use the Baseball DataBank database in Excel. Open Excel and click Data → Import External Data → New Database Query. Select the bballdata ODBC source in the Choose Data Source window (Figure 4-7).
If you deselect the “Use the Query Wizard to create/edit queries” box, you can skip directly to the Microsoft Query application.
The first step of the Query Wizard allows you to select columns, as shown in Figure 4-8. I chose to select all columns from the Batting table and a few columns from the Master table (nameFirst, nameLast, birthYear, birthMonth, and birthDay). Next, the wizard allows you to choose to filter data, as shown in Figure 4-9. I chose to select player-seasons after the year 2000. After that, you can choose how you would like to sort columns (this is not pictured here). I chose not to sort the results. Finally, you can choose to save the query, return the results in Excel, create an OLAP cube, or edit the query using Microsoft Query (this step is also not shown). I chose to open the results in the Microsoft Query application.
Figure 4-10 shows a screenshot of the Microsoft Query application. If you are familiar with Microsoft Access, you’ll notice that Microsoft Query looks very similar. (See “Make Your Own Stats Book” [Hack #11] for more about Microsoft Access.) You can move around the order of the columns, add and subtract columns, join columns, add more tables, sort the results, filter the results, and do dozens of other things with this program. (Alternatively, you can click the SQL button and enter your SQL directly [Hack #16] .)
Once you’re satisfied with the query, go to the File menu in Microsoft Query and select Return Data to Microsoft Office Excel. Excel will then ask where you’d like to store the results (e.g., the current worksheet, a new worksheet, etc.). Once you’ve inserted the data, you can use Excel to slice, dice, or sort it in any way you want. Figure 4-11 shows the resulting spreadsheet.
If you’re not on Windows, RODBC can be tough to work with sometimes. (For example, I could not get RODBC to work at all on a Mac.) RMySQL, an alternative package, might be a better solution.
To install RMySQL from source, on Mac OS X, you can use the following set of R commands. You will need to have MySQL installed locally in /usr/local/mysql. The first command sets two environment variables so that the package builds correctly. The next two commands install the DBI package and the RMySQL package, respectively.
>print(Sys.putenv("PKG_CPPFLAGS"="-I/usr/local/mysql/include", + "PKG_LIBS"="-L/usr/local/mysql/lib -L/usr/lib -lmysqlclient")) >install.packages('DBI') >install.packages('RMySQL')
Tip
You can also use RMySQL on Windows. To do this, download the package from http://stat.bell-labs.com/RS-DBI/download/index.html and use the Packages → “Install Package(s) from Local Zip Files” menu item in R to load it.
To use this package in your R programs, you can use a set of commands, as shown here:
># load libraries >library(DBI) >library(RMySQL) ># load MySQL driver into R and start connection >drv <-dbDriver("MySQL") >con <- dbConnect(drv, username="jadler", password="P@ssw0rd", + dbname="bbdatabank", host="localhost")
Then, you can submit queries to a MySQL database directly from R and fetch the results into R tables. Here’s an example:
>res<-dbSendQuery(con, "select * from teams"); >teams<-fetch(res,n=-1); >summary(teams); idxTeams yearID lgID Min. : 1.0 Min. :1871 Length:2475 1st Qu.: 619.5 1st Qu.:1915 Class :character Median :1238.0 Median :1953 Mode :character Mean :1238.0 Mean :1948 3rd Qu.:1856.5 3rd Qu.:1982 Max. :2475.0 Max. :2004 idxTeamsFranchises divID Rank Min. : 1.00 Length:2475 Min. : 1.000 1st Qu.: 29.00 Class :character 1st Qu.: 2.000 Median : 57.00 Mode :character Median : 4.000 Mean : 54.63 Mean : 4.273 3rd Qu.: 82.00 3rd Qu.: 6.000 Max. :120.00 Max. :13.000 G Ghome W Min. : 6.0 Min. : 44.00 Min. : 0.00 1st Qu.:153.0 1st Qu.: 77.00 1st Qu.: 64.50 Median :155.0 Median : 79.00 Median : 76.00 Mean :148.8 Mean : 78.06 Mean : 73.92 3rd Qu.:162.0 3rd Qu.: 81.00 3rd Qu.: 87.00 Max. :165.0 Max. : 84.00 Max. :116.00 NA's :399.00 L DivWin WCWin Min. : 4.00 Length:2475 Length:2475 1st Qu.: 64.00 Class :character Class :character Median : 75.00 Mode :character Mode :character Mean : 73.92 3rd Qu.: 86.00 Max. :134.00 LgWin WSWin R Length:2475 Length :2475 Min. : 24.0 Class :character Class :character 1st Qu.:600.5 Mode :character Mode :character Median :684.0 Mean : 676.7 3rd Qu.: 760.0 Max. :1220.0 AB H 2B 3B Min. : 211 Min. : 33 Min. : 3.0 Min. : 0.0 1st Qu.:5074 1st Qu.:1279 1st Qu.:187.0 1st Qu.: 32.0 Median :5339 Median :1384 Median :223.0 Median : 45.0 Mean :5090 Mean :1334 Mean :219.7 Mean : 49.4 3rd Qu.:5496 3rd Qu.:1463 3rd Qu.:259.0 3rd Qu.: 63.0 Max. :5781 Max. :1783 Max. :373.0 Max. :150.0 HR BB SO Min. : 0.00 Min. : 0.0 Min. : 0.0 1st Qu.: 35.00 1st Qu.:416.5 1st Qu.: 482.0 Median : 96.00 Median :490.0 Median : 674.0 Mean : 93.15 Mean :468.8 Mean : 680.9 3rd Qu.:139.00 3rd Qu.:554.0 3rd Qu.: 906.5 Max. :264.00 Max. :835.0 Max. :1399.0 NA's :120.0 SB CS HBP Min. : 0.0 Min. : 0.00 Min. : 29.00 1st Qu.: 62.0 1st Qu.: 37.00 1st Qu.: 51.00 Median : 97.0 Median : 48.00 Median : 58.00 Mean :114.7 Mean : 51.31 Mean : 59.37 3rd Qu.:149.0 3rd Qu.: 61.00 3rd Qu.: 67.00 Max. :581.0 Max. :191.00 Max. : 95.00 NA's :144.0 NA's :859.00 NA's :2325.00 SF RA ER Min. : 25.00 Min. : 34.0 Min. : 25.0 1st Qu.: 40.00 1st Qu.: 599.0 1st Qu.: 484.0 Median : 46.50 Median : 680.0 Median : 579.0 Mean : 46.91 Mean : 676.7 Mean : 558.5 3rd Qu.: 52.00 3rd Qu.: 761.0 3rd Qu.: 655.0 Max. : 75.00 Max. :1252.0 Max. :1023.0 NA's :2325.00 ERA CG SHO Length:2475 Min. : 1.00 Min. : 0.000 Class :character 1st Qu.: 25.00 1st Qu.: 6.000 Mode :character Median : 53.00 Median : 9.000 Mean : 56.59 Mean : 9.622 3rd Qu.: 82.00 3rd Qu.:13.000 Max. :148.00 Max. :32.000 SV IPouts HA HRA Min. : 0.00 Min. : 162 Min. : 49 Min. : 0.00 1st Qu.: 7.00 1st Qu.:4050 1st Qu.:1277 1st Qu.: 38.00 Median :20.00 Median :4182 Median :1384 Median : 98.00 Mean :21.34 Mean :3980 Mean :1334 Mean : 93.15 3rd Qu.:34.00 3rd Qu.:4332 3rd Qu.:1467 3rd Qu.:136.00 Max. :68.00 Max. :4518 Max. :1993 Max. :241.00 BBA SOA E DP Min. : 0.0 Min. : 0.0 Min. : 47.0 Min. : 18.0 1st Qu.:418.0 1st Qu.: 481.0 1st Qu.:126.0 1st Qu.:124.0 Median :491.0 Median : 664.0 Median :155.0 Median :144.0 Mean :469.2 Mean : 676.1 Mean :198.0 Mean :139.0 3rd Qu.:556.0 3rd Qu.: 895.0 3rd Qu.:231.0 3rd Qu.:160.0 Max. :827.0 Max. :1404.0 Max. :639.0 Max. :217.0 NA's :317.0 FP name park Length:2475 Length:2475 Length:2475 Class :character Class :character Class :character Mode :character Mode :character Mode :character attendance BPF PPF Min. : 6088 Min. : 54.0 Min. : 54.0 1st Qu.: 468365 1st Qu.: 97.0 1st Qu.: 97.0 Median : 963895 Median :100.0 Median :100.0 Mean :1170516 Mean :100.2 Mean :100.2 3rd Qu.:1705375 3rd Qu.:103.0 3rd Qu.:103.0 Max. :4483350 Max. :131.0 Max. :129.0 NA's : 279
Get Baseball Hacks now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.