# Module 2. Geospatial Data

## Learning Objectives

* List the two data models used by geospatial analysts and describe the benefits and drawbacks of each.
* Determine when to use various data types (int, string, boolean, etc.).
* Explain the relationships between geospatial data models, types, and formats.
* Describe the components of a geospatial database.

## Lecture Slides

{% embed url="<https://docs.google.com/presentation/d/1cRZ0l6F5YZpl9KOIQ9x3e0UBQgaqG2U5GZ8IziR6ka4/edit?usp=sharing>" %}
Lecture 2. Geospatial Data Model and Structures
{% endembed %}

## Assignments

* [ ] Lab Assignment
* [ ] Quiz 2
* [ ] Lecture Video

## Overview

Geospatial data are necessary for studying geography. **Geospatial data** are information describing objects, events, or other features with a location on or near the surface of the earth. It is important to understand where geospatial data come from and how they are structured and stored in the formats commonly used by geospatial analysts. This module introduces these topics.

### Data Collection Systems

**Data collection** is the process of gathering and measuring information about variables of interest in an established, systematic fashion. Primary data are collected directly from the source by the researcher. Secondary data were collected previously, often by someone else, and are available for further analysis. An example of primary geographic data is a set of locations recorded with a Global Positioning System (GPS) receiver. Examples of secondary data include digitizations created from original maps and census data collections.

#### Remote Sensing

The National Reconnaissance Office (NRO) manages satellite-based data collection for the United States (<https://www.nro.gov/>).  Both airborne and satellite-based systems are used to collect remote sensing data. Systems may be either National Systems, developed and deployed by the NRO, or Commercial/Civil systems developed in the private sector or by other federal agencies.

**Remote sensing** is a key method of obtaining geospatial data: it acquires information about an object or target using a device that is not in physical proximity to the object under study. Remote sensing occurs through an interaction between some form of electromagnetic energy and the features above, on, or below the earth’s surface. Remote sensing will be discussed in more detail in later modules.

### Data Models

A **data model** organizes data elements and standardizes how the data elements relate to one another. A data format is the way in which a data model is represented using bits and bytes. One data model used by geographers is the vector data model. The **vector data model** is a representation of the world using points, lines, and polygons. A common vector data format used by geographers is the shapefile (SHP) format.
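The three vector primitives can be sketched with plain Python tuples and lists. This is a minimal illustration only: the coordinates are made up, the units are arbitrary planar units, and the helper names `line_length` and `polygon_area` are hypothetical rather than part of any GIS library.

```python
import math

# Hypothetical features; each coordinate is an (x, y) pair in arbitrary planar units.
point = (2.0, 3.0)                                            # one location
line = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]                  # ordered vertices
polygon = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]    # ring, implicitly closed


def line_length(vertices):
    """Sum the straight-line distances between consecutive vertices."""
    return sum(math.dist(a, b) for a, b in zip(vertices, vertices[1:]))


def polygon_area(ring):
    """Planar area of a simple polygon, computed with the shoelace formula."""
    total = 0.0
    # Pair each vertex with the next one, wrapping back to the first vertex.
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


print(line_length(line))      # length of the line feature
print(polygon_area(polygon))  # area of the polygon feature
```

Because every vertex is stored explicitly, measurements such as length and area can be computed directly from the coordinates.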

Remote sensing data are often stored in a raster data model. The **raster data model** utilizes grids to organize data measurements. These grids are often rectangular but may take other forms, such as hexagons or lattices. A remote sensing image, for example, may be saved in a raster data format such as the Tagged Image File Format (TIFF).
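A rectangular raster can be sketched as a grid of values plus the information needed to tie cells to map coordinates. The example below is a simplified illustration: the elevation values, the upper-left origin, the cell size, and the function names `cell_for` and `value_at` are all assumptions made for the sketch, not a real file format.

```python
# Hypothetical 3 x 4 raster of elevation values; rows run from north to south.
grid = [
    [12, 14, 15, 15],
    [11, 13, 16, 18],
    [10, 12, 17, 20],
]
origin_x, origin_y = 500.0, 300.0   # assumed map coordinate of the upper-left corner
cell_size = 10.0                    # assumed width and height of each cell


def cell_for(x, y):
    """Return the (row, col) of the cell that contains map coordinate (x, y)."""
    col = int((x - origin_x) // cell_size)
    row = int((origin_y - y) // cell_size)   # y decreases as the row number increases
    if not (0 <= row < len(grid) and 0 <= col < len(grid[0])):
        raise ValueError("coordinate falls outside the raster")
    return row, col


def value_at(x, y):
    """Look up the stored measurement for the cell containing (x, y)."""
    row, col = cell_for(x, y)
    return grid[row][col]


# Two different locations inside the same cell return the same value.
print(value_at(521.0, 289.0))
print(value_at(529.0, 281.0))
```

Both lookups land in the same cell, which shows how a raster generalizes over the area each cell covers.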

Raster and vector data models have advantages and disadvantages depending on the data being collected.  The following summarizes some of the advantages and disadvantages of both models.

**Advantages of Vectors**

Vector data generally represent point, line, and polygon features more accurately than the cells of a raster grid can. Because vector features are stored as explicit coordinates (for example, latitude and longitude), their locations can be recorded precisely. In contrast, each raster pixel generalizes over the area it covers. Second, vector data are generally smaller in size than rasters, although this property is becoming less important as computer capabilities increase. Finally, vector data can support topology. Topology refers to the rules that model coincident relationships between the points, lines, and polygons in a data set. We will discuss topology further in a later module.

**Advantages of Raster Models**

Rasters are a common data model and one that most people have ready access to. We can create images with the cellular phone cameras we carry with us every day, and most computers today come with built-in software for editing such images. Second, the grid system of raster data models is simple and allows for straightforward data processing.

### Data Types

Another term you may encounter when working with geospatial data is the **data type**, the particular kind of data item as defined by the values it can take. There are six basic categories of data types: integer (int), floating point (float), character (char), string (str), boolean (bool), and array. Various temporal data types, such as timestamps, also exist.

Table 2.1. Examples of Data Types

| Data Type      | Definition                                                    | Example                |
| -------------- | ------------------------------------------------------------- | ---------------------- |
| Array          | List of elements in a specific order                          | \[(1,2), (2,3), (3,4)] |
| Boolean        | Represents a truth value                                      | TRUE or FALSE          |
| Character      | Single letter, digit, punctuation mark, symbol or blank space | A, !, 3                |
| Floating point | Numeric data including fractions                              | 1.75, 2.34, 8.876909   |
| Integer        | Numeric data excluding fractions                              | 1, 2, 3, 4, 506        |
| String         | Sequence of characters                                        | Hello World!           |
| Timestamp      | Date and time together                                        | 2022-01-01 12:00:00    |

Text data may be stored either as individual characters or as strings, which are sequences of characters. Examples of strings include the names of streets or cities. Numerical data are stored as either floating-point or integer values. Floating-point numbers are those that contain decimals/fractions, while integers are whole numbers. The term floating point refers to the fact that the decimal point can be located anywhere in relation to the number’s significant digits. An example of a floating-point value in geographic data is a distance measurement. An integer value could be used to represent the count of trees in a given region.
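As a rough illustration, the attributes of a single geographic feature might mix several of these data types. The record below is hypothetical (the field names and values are invented for this sketch). Note that Python has no separate character type, so a single character is simply a string of length one.

```python
from datetime import datetime

# Hypothetical attribute record for one street segment.
record = {
    "street_name": "Green Street",                   # string
    "grade": "A",                                    # single character (a length-1 string in Python)
    "tree_count": 42,                                # integer: a whole-number count
    "length_km": 1.75,                               # floating point: a measured distance
    "is_one_way": True,                              # boolean
    "vertices": [(1, 2), (2, 3), (3, 4)],            # array: ordered list of coordinate pairs
    "surveyed_at": datetime(2022, 1, 1, 12, 0, 0),   # timestamp: date and time together
}

# Print the Python type that holds each attribute.
for name, value in record.items():
    print(f"{name}: {type(value).__name__}")
```

Choosing the type follows from what the value means: counts are integers, measurements are floating point, names are strings, and yes/no properties are booleans.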

We can further describe integer data based on the number of bits used to store each value. Short integers are 16-bit values that can store numbers ranging from −32,768 to 32,767 (signed) or from 0 to 65,535 (unsigned). Long integers are 32-bit values ranging from −2,147,483,648 to 2,147,483,647 (signed) or from 0 to 4,294,967,295 (unsigned).
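The ranges quoted above follow directly from the number of bits. The short sketch below reproduces them, assuming the usual two's-complement representation for signed values; the function name `integer_range` is hypothetical.

```python
def integer_range(bits, signed=True):
    """Return the (minimum, maximum) value an integer of the given width can hold.

    Assumes two's-complement storage for signed integers.
    """
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


for bits in (16, 32):
    print(bits, "bit signed:  ", integer_range(bits, signed=True))
    print(bits, "bit unsigned:", integer_range(bits, signed=False))
```

A tree count for a city block fits comfortably in a short integer, while a population count for a large country would need a long integer.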

When discussing capacity, we can also differentiate between single-precision and double-precision floating-point values. In the widely used IEEE 754 standard, single-precision values are 32 bits long: 1 sign bit, 8 exponent bits, and 23 fraction bits, which gives roughly 7 significant decimal digits. Double-precision values are 64 bits long: 1 sign bit, 11 exponent bits, and 52 fraction bits, which gives roughly 15 to 16 significant decimal digits. The main advantage of floating point over its counterpart, fixed point, is that it can represent a much larger range of values (<http://wiki.gis.com/wiki/index.php/Floating_point>).
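The practical effect of precision can be seen by storing a coordinate in 32 bits and reading it back. This is a small sketch using Python's standard `struct` module; the longitude value is made up, and `as_single` is a hypothetical helper name. Python's own `float` is double precision.

```python
import struct


def as_single(value):
    """Round-trip a Python float (double precision) through a 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


longitude = -88.2272317          # hypothetical coordinate in decimal degrees
stored = as_single(longitude)    # what survives single-precision storage
error_deg = abs(stored - longitude)

print("bytes per single:", struct.calcsize("<f"))
print("bytes per double:", struct.calcsize("<d"))
print("original:", longitude)
print("stored as single:", stored)
print("difference in degrees:", error_deg)
```

The difference is small but not zero, which is one reason coordinates are often stored in double precision when positional accuracy matters.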

### Geodatabases

A **database** is an organized collection of data stored and accessed electronically. A **relational database** is a “type of database that stores and provides access to data points that are related to one another” (<https://www.oracle.com/database/what-is-a-relational-database/>). Geographers use **geospatial databases**, which are designed for “storing and querying data that represents objects defined in a geometric space” (<https://www.oracle.com/autonomous-database/what-is-geospatial-database/>). A database management system is used to manage the contents of a database. Those of you familiar with ArcGIS have likely worked with geodatabases (<https://pro.arcgis.com/en/pro-app/latest/help/data/geodatabases/overview/what-is-a-geodatabase-.htm>). Open-source database management systems also exist, such as PostgreSQL (<https://www.postgresql.org/>) and its geospatial extension PostGIS (<https://postgis.net/>).
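The idea of related tables can be sketched with Python's built-in `sqlite3` module. This is an aspatial illustration only: the table names, columns, and rows are invented for the example, and a true geospatial database would store geometry values rather than plain longitude and latitude columns.

```python
import sqlite3

# An in-memory database that disappears when the connection closes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE counties (
    id   INTEGER PRIMARY KEY,
    name TEXT
);
CREATE TABLE stations (
    id        INTEGER PRIMARY KEY,
    name      TEXT,
    lon       REAL,
    lat       REAL,
    county_id INTEGER REFERENCES counties(id)  -- relates each station to a county
);
INSERT INTO counties VALUES (1, 'Champaign'), (2, 'Cook');
INSERT INTO stations VALUES
    (1, 'Station A', -88.24, 40.11, 1),
    (2, 'Station B', -87.63, 41.88, 2),
    (3, 'Station C', -88.20, 40.10, 1);
""")

# Follow the relationship between the two tables with a JOIN.
rows = conn.execute("""
    SELECT s.name
    FROM stations AS s
    JOIN counties AS c ON c.id = s.county_id
    WHERE c.name = 'Champaign'
    ORDER BY s.name
""").fetchall()

print(rows)
conn.close()
```

The `county_id` column is what makes the data relational: each station row points at the county row it belongs to.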

**Spatial databases** use a spatial index to speed up database operations by providing a framework for locating records. In addition to typical aspatial queries such as SELECT statements, spatial databases can perform a wide variety of spatial operations. Spatial indices optimize spatial queries to answer questions such as “Do two points fall within a spatial area of interest?”
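To show why an index speeds up spatial queries, the sketch below buckets points into a uniform grid so that a bounding-box query only inspects nearby cells. This is a deliberately simplified stand-in: production spatial databases typically rely on tree-based structures such as R-trees, and the cell size, point data, and function names here are assumptions made for illustration.

```python
from collections import defaultdict

CELL = 10.0  # assumed size of each index cell, in the same units as the coordinates


def cell_key(x, y):
    """Return the grid cell (column, row) that contains the coordinate."""
    return (int(x // CELL), int(y // CELL))


def build_index(points):
    """Group point ids by the grid cell that each point falls in."""
    index = defaultdict(list)
    for pid, (x, y) in points.items():
        index[cell_key(x, y)].append(pid)
    return index


def query_box(points, index, xmin, ymin, xmax, ymax):
    """Return ids of points inside the box, checking only cells the box overlaps."""
    cx0, cy0 = cell_key(xmin, ymin)
    cx1, cy1 = cell_key(xmax, ymax)
    hits = []
    for cx in range(cx0, cx1 + 1):
        for cy in range(cy0, cy1 + 1):
            for pid in index.get((cx, cy), []):
                x, y = points[pid]
                # Exact test, since a cell may extend beyond the query box.
                if xmin <= x <= xmax and ymin <= y <= ymax:
                    hits.append(pid)
    return sorted(hits)


# Hypothetical point records keyed by id.
points = {"a": (1.0, 1.0), "b": (12.0, 3.0), "c": (25.0, 25.0), "d": (14.0, 8.0)}
index = build_index(points)
print(query_box(points, index, 10.0, 0.0, 20.0, 10.0))
```

Points in cells far from the area of interest are never examined, which is the basic benefit a spatial index provides.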

## Readings

You may need to obtain these from the [University of Illinois Library](https://www.library.illinois.edu/search-tools/).

* Diamond, L. (2019). Vector Formats and Sources. *The Geographic Information Science & Technology Body of Knowledge* (4th Quarter 2019 Edition), John P. Wilson (ed.). DOI: [10.22224/gistbok/2019.4.8](https://doi.org/10.22224/gistbok/2019.4.8)
* Williams, C. (2019). Raster Formats and Sources. *The Geographic Information Science & Technology Body of Knowledge* (4th Quarter 2019 Edition), John P. Wilson (ed.). DOI: [10.22224/gistbok/2019.4.11](https://doi.org/10.22224/gistbok/2019.4.11)
* Nyerges, T. (2017). Logical Data Models. *The Geographic Information Science & Technology Body of Knowledge* (1st Quarter 2017 Edition), John P. Wilson (ed.). DOI: [10.22224/gistbok/2017.1.2](https://doi.org/10.22224/gistbok/2017.1.2)
