Module 7. Geospatial Analytics

Learning Objectives

  • Outline a design approach to project planning and implementation.

  • Design a geographic data analysis using a design framework.

  • Identify the different dimensions of data quality.

  • Explain the importance of reproducibility in relation to geospatial analytics.

  • Identify three types of data reduction techniques.

Lecture Slides

There are no lecture slides this week; the lab and discussion videos take their place.

Assignments

  • Lab

  • Quiz 7

Designing Your Geospatial Analysis

The Oxford dictionary defines analytics as the systematic computational analysis of data or statistics. Design is the planning and creation of a framework to guide a process; for our purposes, the process is the implementation of a geographic analysis. The design will give you a holistic view of your project's requirements and provide a means for communicating them to stakeholders and collaborators. A variety of design frameworks exist, and some trial and error will help you find one that best fits your typical work style.

Defining your goals, objectives, and constraints

The terms goals and objectives are used extensively in Geographic Information Science. A goal is a long-term and broad outcome. For example, a goal of geographic data science may be to improve our understanding of how uncertainty impacts decision-making in emergency situations. Objectives, on the other hand, are specific, measurable steps to meet a goal. One objective that we might define towards our goal of understanding is to develop and implement a map user experiment where different methods of communicating uncertainty are presented to emergency responders in order to determine if there is a best method for visualizing uncertainty in that context. Often, when designing geospatial analysis, you will have one goal but multiple objectives that lead to the satisfaction of that goal.

In addition to determining what the goals and objectives are for your project, it is important to understand other constraints on your analysis. For example, you will likely be performing your analysis for stakeholders. A stakeholder is any individual with an interest or concern about the project. Stakeholders in our emergency management case could be emergency responders or local government officials who make decisions about response budgets. Understanding who will use the results of your analysis will help you determine what questions to ask.

Other constraints that may impact your project include deadlines, specific hardware or software requirements, institutional review, collaboration expectations, funding, delivery methods, data management requirements, reproducibility standards, and even data availability.

Collecting the data

Once you understand who your stakeholders are and the goals and objectives of your analysis, you can begin to identify your data. Data collection is the process of acquiring, extracting, and storing data.

There are two types of data. Primary data refers to any data in its raw format from an original source. This data is collected using surveys, measurements, or experimental methods. One way that we gather primary data as geographers is through the Global Positioning System (GPS). Secondary data, on the other hand, are data that have already been collected and may be reused. These types of data can come from internal sources or external sources. Geographic data scientists often use secondary data sources that are acquired through web portals like the USGS Earth Explorer.

When collecting data for a project, you should consider the quality of the data. Geographic data quality includes accuracy, completeness, consistency, timeliness, and validity.

Data accuracy is the degree to which the data represents the real world. Geographic data accuracy refers to both geographic accuracy (horizontal and vertical) as well as conceptual accuracy. Returning to our emergency management example, if we were using road data, we would be interested in the topological accuracy of the road, the thematic definitions of the roads (highway versus rural road), and the geographic location of the road. You should determine how accurate your data must be by referring to your analytical goals and objectives.

Completeness refers to a dataset's coverage of the topic. In addition to missing values, data may be deemed incomplete when an attribute is missing or the data is truncated. Consistency is an indicator of how well your dataset aligns with other datasets or references. An example of consistency that is often encountered in GIS is the consistency of identification records for joining. In some cases, the data type can vary between two datasets (e.g., text versus numerals), making a join impossible.
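The ID-type mismatch described above can be sketched in plain Python. The parcel and road-count datasets here, and their field values, are hypothetical.

```python
# Sketch of a consistency check before joining two datasets on an ID field.
parcels = {"101": "residential", "102": "commercial"}   # IDs stored as text
road_counts = {101: 4, 102: 7}                          # IDs stored as integers

# A naive join fails because the text ID "101" never matches the integer 101.
naive = {pid: (use, road_counts.get(pid)) for pid, use in parcels.items()}
# Every lookup returns None.

# Casting both keys to a common type restores consistency and makes the join work.
joined = {int(pid): (use, road_counts[int(pid)]) for pid, use in parcels.items()}
```

The same issue arises in GIS software when joining a text-typed ID column to a numeric one; the fix is always to cast both sides to a common type first.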

Timeliness refers to the latency between data collection and availability. This concept is particularly important for analyses of continuously collected sensor data. Data collection from a sensor requires a number of intermediate steps before it can be used for analysis, and the delay caused by these intermediate steps can impact the timeliness and quality of the data. In our emergency management example, we might expect a seismic sensor to measure ground motion and report it rapidly to alert for an earthquake, but if too much time passes, the alert could come after we need it.

Finally, validity is the extent to which the data represents what it is intended to. For our emergency management example, if we were surveying households potentially affected by hurricane inundation, we would want to make sure that the survey was valid in its coverage of the population we survey. We could ask questions about whether the survey was representative of the diverse population, whether the survey sample size was large enough, and whether the survey asked the questions we are interested in.

Many times, we do not have control over the collection of geospatial data. Thus it is important to understand the data quality standards that the organization collecting the data has. This information may be available in the data’s metadata, but more general statements about data quality and standards are often written as separate documents. It is always good practice to review these documents when considering the use of geographic data for your analysis.

Some instances will require much higher data accuracy and precision than others. Once you have gathered your data and evaluated its quality, the next step is to begin preprocessing it and evaluating its usefulness for the goals and objectives that you outlined when scoping out your project.

Preprocessing data

Data preprocessing is any process performed on raw data to prepare it for analysis. The specific preprocessing steps you will carry out are both data- and context-dependent, but one way to conceptualize data preprocessing is as a three-step process of data cleaning, data transformation, and data reduction.

Data Cleaning

Data cleaning is the process of fixing data. This can include identifying and removing incorrect data, reformatting improperly formatted data, and removing duplicate or incomplete records within a dataset. First, we may clean data by eliminating non-essential information: you will often acquire data that contains far more attributes than are necessary for your analysis. Next, there may be a need to rename your attributes. Database field names may seem illogical, compound multiple attributes into a single field, or be longer than a specific tool allows; for example, the American Community Survey uses field names that include a number of exclamation points, which can make them hard to read. Finally, there is a need to assess the topological correctness of your data and repair any topological errors.
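The cleaning steps above can be sketched in Python on a small attribute table, modeled here as a list of dicts. The records and field names (including the exclamation-point style) are illustrative, not real ACS fields.

```python
raw = [
    {"GEOID": "001", "Estimate!!Total!!Population": 5200, "NOTES": "x"},
    {"GEOID": "002", "Estimate!!Total!!Population": 3100, "NOTES": ""},
    {"GEOID": "002", "Estimate!!Total!!Population": 3100, "NOTES": ""},  # duplicate
]

# 1. Drop a non-essential attribute.
trimmed = [{k: v for k, v in row.items() if k != "NOTES"} for row in raw]

# 2. Rename an unwieldy field.
renamed = [
    {("population" if k == "Estimate!!Total!!Population" else k): v
     for k, v in row.items()}
    for row in trimmed
]

# 3. Remove duplicate records while preserving order.
seen, cleaned = set(), []
for row in renamed:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        cleaned.append(row)
```

In practice you would do the same with a dataframe library or GIS field calculator; the logic (drop, rename, deduplicate) is the same.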

Data Transformation

Data transformation is the conversion and structuring of data into a usable format. The first way that we perform data transformation is by reprojecting data to a coordinate reference system that is most appropriate for the location of interest. Relatedly, it may be necessary to convert geospatial location information into new measurements, for example, converting the degree-minute-second format to decimal degrees. Another way that you may have to transform your data is between data types; some spatial analyses require data to be stored in specific formats, such as integers or floating-point numbers. Similarly, it may be necessary to transform between geometry types or data models (e.g., vectorization, the conversion of raster data to vector features). When performing these types of operations, it is important to keep a copy of the original data so you can return to it if you make an error.
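The degree-minute-second conversion mentioned above is straightforward to implement. This is a minimal sketch; the function name is chosen for illustration.

```python
def dms_to_decimal(degrees: float, minutes: float, seconds: float) -> float:
    """Convert a degree-minute-second coordinate to decimal degrees.

    The sign of `degrees` carries through, so -122 deg 30 min becomes -122.5.
    """
    sign = -1.0 if degrees < 0 else 1.0
    return sign * (abs(degrees) + minutes / 60.0 + seconds / 3600.0)
```

For example, `dms_to_decimal(-122, 30, 0)` gives -122.5, a longitude in the western hemisphere.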

Attribute Selection

Attribute selection, also called feature selection, is the process of selecting attributes based on filtering. Filtering uses the inherent characteristics of the data to identify the optimal attributes for modeling. We won’t go into detail about the types of attribute filtering methods you might use, but you should be aware that such methods exist to improve your attribute selection. Another method of transformation is discretization, whereby continuous numerical values are converted to discrete or categorical values. Finally, normalization is a process of rescaling data, commonly to the range of 0 to 1; this is particularly useful when comparing attributes that have very different data ranges.
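A minimal sketch of min-max normalization and a simple class-break discretization follows; the elevation values, class breaks, and labels are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values to the range 0-1 (min-max normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def discretize(value, breaks, labels):
    """Assign a continuous value to a category using upper class breaks."""
    for upper, label in zip(breaks, labels):
        if value <= upper:
            return label
    return labels[-1]   # anything above the last break gets the final label

elevations = [10.0, 55.0, 100.0]
scaled = min_max_normalize(elevations)                                   # [0.0, 0.5, 1.0]
zone = discretize(55.0, breaks=[25, 75], labels=["low", "mid", "high"])  # "mid"
```

Discretizing into classes like this is exactly what a choropleth map's classification scheme does.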

Data Reduction

Data reduction is the process of reducing the number of data records or the number of attribute values. Data aggregation is the process of summarizing data based on a common attribute, for instance, aggregating based on spatial or temporal properties. You could also summarize data using descriptive statistics, such as the mean or median of an attribute grouped by a common attribute. Another method of data reduction, data subsetting, is the process of selecting a portion of the data based on given criteria, such as time or space. Finally, dimensionality reduction is a process of reducing the number of attributes in a dataset while retaining as much information as possible. This is often done to improve a model’s performance or to reduce the correlation of attributes. While we will not go into detail about different methods of dimensionality reduction, you may come across Principal Components Analysis (PCA) if you take a remote sensing class, as it is a common method in that domain.
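Aggregation and subsetting can be sketched with standard-library Python; the county names and population values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (county, population of a place within it)
records = [("Ada", 1200), ("Ada", 1800), ("Canyon", 900), ("Canyon", 1100)]

# Aggregation: mean population by county (a common attribute).
groups = defaultdict(list)
for county, pop in records:
    groups[county].append(pop)
county_means = {county: mean(pops) for county, pops in groups.items()}

# Subsetting: keep only records meeting a criterion.
large = [r for r in records if r[1] >= 1100]
```

Both operations shrink the data handed to later analysis steps: aggregation reduces many records to one per group, and subsetting discards records outside the criteria.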

By this point of your analysis process, you should be familiar with your intended stakeholders, analytical goals and objectives, and data. You may even have some expectations about the type of analysis you will perform. In the next step, you will begin to compile a series of analytical processes to solve the geospatial questions you want to ask.

Analyzing the Data

There are numerous analytical methods within the domain of geospatial analysis. These methods can be organized in a variety of ways. Here we will categorize them using a common framework of four analytical methods: descriptive, diagnostic, predictive, and prescriptive.

Descriptive Analytics are processes that summarize a dataset’s main features and characteristics using statistical measures of distribution, central tendency, and variability. You have likely used descriptive analytics: the average, or mean, is a common descriptive measure.
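For example, Python's standard library computes common descriptive measures directly; the parcel areas below are hypothetical.

```python
from statistics import mean, median, stdev

# Hypothetical parcel areas, in hectares
areas = [2.0, 3.0, 3.0, 4.0, 8.0]

central = {"mean": mean(areas), "median": median(areas)}  # central tendency
spread = stdev(areas)                                     # variability
```

Note how the mean (4.0) sits above the median (3.0) here: the single large parcel pulls the mean upward, which is why reporting both is often informative.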

Diagnostic Analytics are processes that use data to determine the causes of trends and the relationships between variables. Correlation is a statistical technique for evaluating a relationship between two or more variables. It is important to remember that such relationships do not imply causation, a cause-and-effect relationship between the variables. There are many examples where two variables are correlated but not causally related.
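A minimal sketch of Pearson's correlation coefficient, computed by hand; the elevation and temperature values are hypothetical, chosen to show a strong negative relationship.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical: elevation (m) vs. mean annual temperature (deg C)
elev = [100, 500, 1000, 1500]
temp = [15.0, 13.0, 10.5, 8.0]
r = pearson_r(elev, temp)   # near -1: strong negative correlation
```

A value near -1 or +1 indicates a strong linear relationship; a value near 0 indicates none. None of these values, by themselves, establish causation.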

Predictive Analytics processes use data in conjunction with statistical algorithms and machine learning techniques to identify the likelihood of future events based on historical data. Regression is a common method of prediction. An example of predictive analytics in geography is predicting future real estate hotspots based on existing data concerning physical and human geography attributes.
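A minimal sketch of simple linear regression (ordinary least squares) used for prediction; the distance and price values are hypothetical and fall exactly on a line to keep the arithmetic transparent.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Hypothetical: median home price (thousands) vs. distance to city center (km)
dist = [1.0, 2.0, 3.0, 4.0]
price = [400.0, 350.0, 300.0, 250.0]
a, b = fit_line(dist, price)

predicted = a + b * 5.0   # predict the price 5 km from the center
```

The fitted model is price = 450 - 50 * distance, so extrapolating to 5 km predicts 200. Real predictive work would validate the model before trusting any extrapolation.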

Finally, Prescriptive Analytics uses data to determine an optimal course of action. Route selection or site selection are two optimization problems that fall within this category.

Evaluating Results

After performing the analysis, it is important to verify the reliability of the results. Cross-validation uses repeated independent samples of the data to test the strength of a model; it is a common method in classification analyses, such as supervised classification, where labeled data are available for testing. Another method of evaluation, sensitivity analysis, evaluates how the model behaves when its parameters are changed. Depending on the types of analysis you have completed, it may also be important to get feedback from stakeholders on the validity of your models. Oftentimes, a stakeholder or subject matter expert has a much more complete understanding of the real-life situation you are trying to analyze.
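K-fold cross-validation can be sketched with a trivial model that predicts the training mean; any real classifier or regressor would replace the "fit" and "predict" steps, and the data here are hypothetical.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

values = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
errors = []
for test_idx in k_fold_indices(len(values), k=3):
    train = [values[i] for i in range(len(values)) if i not in test_idx]
    model_mean = sum(train) / len(train)   # "fit" on the training folds
    for i in test_idx:                     # evaluate on the held-out fold
        errors.append(abs(values[i] - model_mean))
mean_abs_error = sum(errors) / len(errors)
```

Each record is held out exactly once, so the averaged error estimates how the model would perform on data it has not seen. In practice, libraries such as scikit-learn provide ready-made cross-validation utilities.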

Communicating the Results

Effective communication of geospatial analysis results is imperative to help your stakeholders make good decisions. First, customize your message to your audience, their needs, and their level of knowledge about the topic you are presenting. Second, focus on the topics that are most important to your audience; clear and direct communication sharpens your message and gives it more credibility. Third, tell a story. Storytelling is becoming a popular method for communicating science and has been shown to be effective at engaging audiences. Use appropriate visualizations, and remember that less is more. We have all sat through a presentation in which someone shows the results of a complex statistical analysis on a slide filled with dozens, if not hundreds, of data values, then asks us to focus on one or two values buried within the table. Avoid this. Not only is it annoying to your audience, but it also reduces the impact of what you are trying to communicate. Similarly, remember that not everything needs to be communicated as a map. Knowing when to use a map to show data trends versus when a chart or diagram serves better will make you look like a professional.

We have covered a lot of information this week. Hopefully, this week you have gained an appreciation for the design aspect of geospatial analysis. As you move into professional roles in GIS, being able to systematically design, implement, and communicate your geospatial analyses will become second nature. For now, think about how you can use your new knowledge to inform your final projects.

Readings

  • Goodspeed, R. and Grengs, J. (2017). GIS&T in Urban and Regional Planning. The Geographic Information Science & Technology Body of Knowledge (4th Quarter 2017 Edition), John P. Wilson (ed). DOI: 10.22224/gistbok/2017.4.2

  • Kedron, P., Li, W., Fotheringham, S., and Goodchild, M. (2021). Reproducibility and replicability: opportunities and challenges for geospatial research. International Journal of Geographical Information Science, 35(3), 427-445. DOI: 10.1080/13658816.2020.1802032
