Chapter 1. An Overview of Geospatial Analytics
Geospatial data—that is, data with location information—is generated in huge volumes by billions of mobile phones, sensors, and other sources every day. Data begets data, constantly ratcheting up the unbounded streams of geospatial data (“geodata” for short) awaiting our analysis. Where people and their machines go, what our remote sensing detects, and how our devices are arrayed in space and perform across time all matter a great deal to the vibrancy of our economy, the health of our planet, and our general happiness and well-being. Geospatial analytics can provide us with the tools and methods we need to make sense of all that data and put it to use in solving problems we face at all scales.
Geospatial Analytics: Origins and Evolving Use Cases
Geospatial analytics has its contextual roots in print cartography and its contemporary development in defense. In the United States, the Department of Defense (DoD) has traditionally been the biggest consumer of geospatial analytics. Intelligence community satellites have been producing constant streams of telemetry for decades now, and defense vehicles, sensor nets, and many other sources of data have sprung up over time. With the rise of this type of data, the DoD has helped promote open source, open data, and data analysis companies such as Socrata, Databricks, and Uncharted Software. Much could be written (and indeed has) on the history of geographic information systems (GISs) and geospatial work for defense; rather than rehashing that here, we recommend referring to existing resources for more background, including the history of the Open Geospatial Consortium (OGC), which originated as a voluntary consensus standards organization thanks to efforts from the US Army Corps of Engineers and a great many others in public and private enterprises.
Geospatial data is used far beyond the defense realm, however. Just about any local, state, or federal government agency is responsible for a variety of geospatial analytics problems. For example, something as seemingly straightforward as a pothole in a road may require one road crew truck to drive by simply to confirm the need for repair, followed by another truck to actually perform the work. The locations of all those potholes and trucks need to be tracked. This example is multiplied many times over when you consider that there are similar geospatial requirements for crime reporting, traffic collision monitoring and assistance, building permitting, and a great many other governmental functions. Likewise, in geofocused industries such as real estate, cadastral maps, land records, and property exchange archives all inform land use and commercialization decisions that are grounded in geospatial information.
Beyond governmental use cases (satellite telemetry, road potholes, ships at sea, etc.), another driver for geoanalysis in Big Data has been telecom. Visit the Ericsson headquarters north of Stockholm, and you’ll find the Swedish headquarters for Esri, a supplier of GIS software and applications, directly next door. Mobile operators in particular generate massive amounts of streaming data—telemetry about whether their equipment in the field is working correctly, whether their subscribers will be filing service complaints, and so on—much of which requires geospatial analysis.
But it’s not just government and mobile operators that are generating and capturing valuable geospatial data. For starters, anyone who has a fleet of trucks faces a range of geospatial analytics problems: supply chain management, routing efficiency, and the resourcing of people, products, and vehicles are all shaped by geospatial factors. This opens new opportunities for leveraging geospatial Big Data in mainstream business.
From satellite and cell phone data, we can glean information about the areas hardest hit by a natural disaster as it occurs, and determine the places and people that most urgently need assistance. Similarly, satellite and remote sensing data can be used to analyze and ultimately fight the effects of climate change, shedding light on both short-term mitigation efforts and long-term solutions to emission problems.
From ambient noise sensors distributed around municipalities, and perhaps even from apps running on mobile phones and other personal devices, we might be able to monitor a cityscape’s noise levels and discover useful information about them. This information could be used to report potentially dangerous noise from air traffic, construction projects, or highways and busy streets, and these reports could in turn be used to enforce noise ordinances and to inform safer, more pleasant city zoning and design. We’ll present a use case on airport noise analysis later in this report.
Nonprofits can use data about their donors and communities to better plan and communicate the services they provide. (We’ll discuss this in greater detail later, too.) We can also proactively improve public health by analyzing an individual’s fitness tracking data, comparing it to norms for cohorts, and giving suggestions or “nudges” about behavioral tweaks and habits that could lead to large improvements in health and fitness down the road.
With the rise of the Internet of Things (IoT), sensor networks are pushing geospatial data rates even higher. There has been an explosion of sensor networks on the ground, mobile devices carried by people or mounted on vehicles, drones flying overhead, tethered aerostats, high-altitude balloons (such as Google’s Project Loon), atmosats at high altitude, and microsats in orbit.
These layered sensing networks—layers of data sources, ranging from sensors on the ground to vehicles and mobile devices to drones to satellites, all at complementary levels of detail and cost/performance—serve as a flywheel for a tremendous spike in geospatial data (see Figure 1-1). Ground sensors can be relatively expensive, but they provide detail, whereas satellites provide the broader perspective less expensively at scale over time.
The bottom line here is most often about control systems. Some “thing” needs to be automated, and that automation process needs to be optimized. If that thing moves within the world, it generates geodata and must leverage spatial analytics. That holds for drones, self-driving cars, and robots in general—and in the larger picture, even companies like Uber can be considered as using complex control systems.
New Solutions for More and Complex Geospatial Data
Whatever problems we face, whether at the local, regional, national, or even global scale, if they involve a “where” component, geospatial solutions can probably be brought to bear to improve the result. It may be unclear at the outset of tackling a geospatial analytics challenge, however, which solution would be best. For many tasks, the data is not truly “big,” and in fact mainstream solutions and commodity hardware may prove sufficient to address them. For others, Big Data techniques are required.
For a long while the mainstay of GIS data systems has been the ArcGIS product line at Esri. Visit just about any municipal, state, or federal agency tracking geotagged items, and you’ll find a number of ArcGIS subscriptions—which tends to make Esri seem like the proverbial 800-pound gorilla in the room. Historically much of that industry dominance involves compatibility with proprietary desktop software (Microsoft Windows) and relatively modest dataset sizes, perhaps in the megabyte or gigabyte range.
Whether using ArcGIS or other tools, geospatial work requires atypical data types and formats (e.g., points, shapefiles, map projections), potentially many layers of detail to process and visualize, and specialized algorithms. This is not your typical ETL (extract, transform, load) or reporting work. A sample of the complexities in many geospatial analyses might include the following:
- At their foundation, most geospatial applications require some kind of map. Tiling provides rectangles for a selected level of detail, generally as raster graphics.
- Analytics overlays, such as vector graphics, can be layered atop the tiling.
- Data sources may be relatively sparse and require statistical smoothing or interpolation (e.g., kriging to convert discrete data points into heatmaps, choropleths, and similar overlays that are more useful for visualizing the data geospatially).
- Some data sources (e.g., satellite images) have inherent needle-in-a-haystack problems that require sophisticated algorithms to identify points of interest, or locations that change dramatically over time (e.g., a building under construction).
- Other data sources, such as business addresses, provide metadata for maps but may have conflicting information to resolve (e.g., multiple addresses for a business).
- Data sources come in a bewildering number of formats. This is a hard problem. If you thought that JSON versus Thrift versus Avro versus Parquet versus ORCFile was complex, brace yourself for the complexities of geodata! You can see many examples by perusing the list of OGR vector formats or reading through documentation for libraries such as GDAL (the Geospatial Data Abstraction Library), which supports many raster and vector formats to abstract away format-related complexities for you.
- Meanwhile, map tiles, data sources, analytics, and the like may bring in a variety of licensing issues and conflicts.
- Once you have the tiles, the data sources, the metadata, and the analytics, you need an interactive platform for zooming, selecting points, selecting optional layers, and more.
- Then comes the part that requires real expertise: design, data visualization, interpretation, and storytelling.
We will discuss these complexities in more detail in the next chapter.
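To make the tiling idea in the list above a little more concrete, here is a minimal Python sketch. It assumes the widely used XYZ ("slippy map") tiling scheme over the Web Mercator projection, which this report does not itself specify, and the function name `lonlat_to_tile` is hypothetical. Given a longitude, a latitude, and a zoom level, it returns the indices of the tile that contains the point.

```python
import math


def lonlat_to_tile(lon_deg, lat_deg, zoom):
    """Return (x, y) indices of the tile containing a point.

    Assumes the XYZ tiling scheme on Web Mercator: at a given zoom
    level the world is a 2**zoom by 2**zoom grid, with tile (0, 0)
    in the top-left (northwest) corner.
    """
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    lat_rad = math.radians(lat_deg)

    # Longitude maps linearly onto the horizontal axis.
    x = int((lon_deg + 180.0) / 360.0 * n)
    # Latitude goes through the Mercator projection before scaling.
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)

    # Clamp so that lon = 180 and near-polar latitudes stay in the grid.
    x = min(max(x, 0), n - 1)
    y = min(max(y, 0), n - 1)
    return x, y


if __name__ == "__main__":
    # Each extra zoom level quadruples the number of tiles worldwide.
    for zoom in (0, 1, 2):
        print(zoom, lonlat_to_tile(0.0, 0.0, zoom))
```

Each additional zoom level quadruples the number of tiles, which is one reason the level of detail is selected up front rather than rendered everywhere at once.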
As a result of all of this complexity, geospatial analytics has historically not been amenable to SQL, because, for example, it often requires range queries to determine whether two regions intersect. Those can be quite expensive at scale.
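As a rough illustration of why such intersection checks become expensive, consider the simplest possible case: axis-aligned bounding boxes. The Python sketch below is an assumption-laden toy, not how any particular database implements spatial joins; the `BBox` class and `naive_spatial_join` function are hypothetical names introduced only for this example. Each intersection test is a pair of range comparisons per axis, and a naive join performs that test for every pair of regions.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class BBox:
    """Axis-aligned bounding box (a hypothetical helper type)."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def intersects(self, other: "BBox") -> bool:
        # Two boxes overlap only if their ranges overlap on BOTH axes.
        # Boxes that merely touch along an edge count as intersecting.
        return (self.min_x <= other.max_x and other.min_x <= self.max_x
                and self.min_y <= other.max_y and other.min_y <= self.max_y)


def naive_spatial_join(left, right):
    """Return index pairs (i, j) where left[i] intersects right[j].

    Every pair is compared, so the cost grows as len(left) * len(right).
    """
    return [
        (i, j)
        for (i, a), (j, b) in product(enumerate(left), enumerate(right))
        if a.intersects(b)
    ]


if __name__ == "__main__":
    parks = [BBox(0, 0, 2, 2), BBox(5, 5, 6, 6)]
    zones = [BBox(1, 1, 3, 3), BBox(10, 10, 11, 11)]
    print(naive_spatial_join(parks, zones))
```

With a million regions on each side, this naive approach would need on the order of a trillion comparisons, and real regions are usually polygons rather than boxes. Spatially aware systems typically rely on spatial indexes to prune most candidate pairs before running the exact geometric test.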
As the amount of geospatial data requiring analysis has increased beyond what could be reasonably managed in flat files, purpose-built tools such as PostGIS have been created to provide a more scalable backend. Many geospatial tools, including ArcGIS, can tie into PostGIS storage. PostGIS and similar storage systems have enabled people to prototype and build systems with geospatial datasets up into the terabyte range, but at some point as data size increases (say, into the tens of terabytes and beyond), even they begin to meet scaling problems. Enter geospatial Big Data solutions.
Initially, SQL built for Big Data tended to omit GIS support. For example, Hive added GIS support in 2013, but as a “bolt-on” feature rather than one core to the design. In general, the data independence required for data parallel systems (e.g., Hadoop MapReduce) didn’t fit well with geospatial workloads. However, that environment has been evolving. Esri appears to be the early thought leader here, investing currently in “Team Apache” (Spark, Kafka, and open source libraries on GitHub for working with geospatial data at scale).
What This Report Covers
In order to paint a complete picture of the geospatial options available to you, this report will discuss low-scale commercial desktop GIS tools, medium-scale options such as PostGIS and Lucene-based searching, and truly Big Data solutions built on technologies such as Hadoop. We’ll examine open data that you can use from governments, nongovernmental organizations (NGOs), and private enterprises. We’ll survey the state of play with open source geospatial tools, and we’ll take a look at how some of the biggest of the big (e.g., Google) make use of geospatial data in their various products and services.
As we examine when it makes sense to move from one type of solution to the next, we’ll illustrate the drivers that push an organization to move up the geospatial scale from low to medium to big, paying increasing penalties in complexity and (potentially) cost to get the benefits of doing so. As part of our discussion about the Big Data end of the solution scale, we’ll also review the current trends in geospatial innovation, including the state of the art in geospatial analytics platform stacks, and a bit about the mathematics underpinning them.
Near the end of the report we’ll review examples of how geospatial analytics are being brought to bear, and we’ll provide links to additional resources where you can learn more and join the discussion in the geospatial and Big Data communities.
By the end of this report, you should understand strategies for resolving geospatial data against the physical world across a wide range of scales, and how to benefit from doing so.