04. Networks - CDP Mapping Systems

In this exercise, we will explore networks and distance. We will start by using an abstract network of points and lines to illustrate some concepts, and then move on to using real-world data to identify proximate objects based on network distance.

Import libraries¶

import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
from shapely.geometry import LineString, MultiLineString, Point, Polygon
import pandas as pd
import requests
import networkx as nx
import osmnx as ox
import h3
import libpysal as lps

# we'll begin by borrowing the plotting helper function we created in the geoprocessing notebook for use here.
# oftentimes when writing functions that we want to come back to, we can create a new python file (such as utils.py)
# and import functions from there.
def set_axis_off():
    """
    Set the default matplotlib settings to turn off axes and ticks.
    This function modifies the global matplotlib configuration to hide axes and ticks
    for all plots created after this function is called.
    """
    # set axis off by default
    plt.rcParams["axes.axisbelow"] = False
    plt.rcParams["axes.axisbelow"] = False
    plt.rcParams["axes.spines.left"] = False
    plt.rcParams["axes.spines.right"] = False
    plt.rcParams["axes.spines.top"] = False
    plt.rcParams["axes.spines.bottom"] = False

    # set tick params off by default
    plt.rcParams["xtick.bottom"] = False
    plt.rcParams["xtick.top"] = False
    plt.rcParams["xtick.labelbottom"] = False
    plt.rcParams["xtick.labeltop"] = False
    plt.rcParams["ytick.left"] = False
    plt.rcParams["ytick.right"] = False
    plt.rcParams["ytick.labelleft"] = False
    plt.rcParams["ytick.labelright"] = False

set_axis_off()

To begin, we will find the approximate bounding box of NYC.

# bounding box of nyc
bbox = (-74.3, 40.5, -73.7, 40.9)

Create random network¶

We’ll begin by creating a dataframe of 100 random points. As we can see below, we are using numpy to generate random numbers, along with geopandas’ points_from_xy() to create a geodataframe from these points. We are using the bounds of NYC to make sure that the network approximately covers the city.

# create a geodataframe of 100 random points
np.random.seed(0)
n = 100
df = pd.DataFrame(
    {
        "geometry": gpd.points_from_xy(
            np.random.uniform(-74.3, -73.7, n),
            np.random.uniform(40.5, 40.9, n),
        ),
    }
)

And here we cast the dataframe as a geodataframe and set its CRS.

gdf = gpd.GeoDataFrame(df, crs="EPSG:4326")

As we can see from our initial plot, we have a mass of 100 randomly distributed points. At this stage, this is meaningless!

gdf.plot()

Since we are creating a network (instead of analyzing an already existing one), let’s connect each node to its five nearest neighbors (measured based on straight-line or Euclidean distance). We’ll use geopandas’ distance method to calculate the distance between each pair of points, and then use that to create a list of edges. There are many ways to calculate distance like this, some of which are more appropriate for working with a greater volume of data (scipy’s KDTree operations chief among them).

We can also ignore the warning about the geographic CRS as well given our use case and scale, but do keep in mind that projection-related considerations are extremely important when calculating distances.

# find five nearest neighbors for each point, not including itself
k = 5
neighbors = gdf.geometry.apply(lambda x: gdf.geometry.distance(x)).values.argsort(
    axis=1
)[:, 1 : k + 1]

# join neighbors to original dataframe
gdf["neighbors"] = neighbors.tolist()

Now if we inspect our dataframe, we see that we have an additional column, neighbors, which contains the row index number of the five nearest points to that row.

gdf.head()

Based on that information, we can now draw lines (or network edges) between each point and its nearest neighbors. We can use the following function to create an array of lines between a point and it’s neighbors.

Let’s consider the component parts of the function:

lines = []: we create an empty array to store line geometry in
for i, neighbor in enumerage(r.neighbors): we set up a for loop to iterate through each row’s list of point indexes
lines.append(LineString([r.geometry, gdf.loc[neighbor].geometry])): we create a new LineString geometry with two points: the starting point at the input row r.geometry, and ending at the looked up neighbor’s geometry gdf.loc[neighbor].geometry. The .loc[neighbor] allows us to find the row that matches based on the neighbor id, and then we access its geometry property
return MultiLineString(lines): we combine each of the line geometries in the lines array into one complex MultiLineString, which is what it sounds like (a combination of LineStrings).

def create_lines(r):
    lines = []
    for i, neighbor in enumerate(r.neighbors):
        lines.append(LineString([r.geometry, gdf.loc[neighbor].geometry]))
    return MultiLineString(lines)

We can then apply the function to each row (axis=1). Keep in mind that when executing a function via apply(), the row (or column) input is implied and is the default input r in the function.

gdf["line_geometry"] = gdf.apply(create_lines, axis=1)

If we inspect our dataframe again, we’ll see a new column line_geometry which represents the edges between each point and its five nearest neighbors.

gdf.head()

We can create a new geodataframe based on this edges geometry and plot it- now we have a connected network of points and lines!

lines_gdf = gpd.GeoDataFrame(
    gdf[["line_geometry"]], geometry="line_geometry", crs="EPSG:4326"
)

ax = lines_gdf.plot(color="black", alpha=0.5, linewidth=0.2)
gdf.plot(ax=ax, color="black")

# optionally save the figure to file
# plt.savefig("lines.pdf", bbox_inches="tight", pad_inches=0)

Saving our datasets¶

You may optionally want to save these nodes and edges to file to inspect in QGIS or another software. To do so, we use the to_file() function to save out as GeoJSON files

gdf["geometry"].to_file("nodes.geojson", driver="GeoJSON")
gdf["line_geometry"].to_file("edges.geojson", driver="GeoJSON")

gdf.sample(10)

Let’s plot the nearest neighbors and connecting edges for a single point. Below, we take a random point s and then highlight it in blue, and connect it to its five nearest neighbors in orange. Keep in mind that every time you rerun the following cell, a new sampled point will be chosen.

# plot the lines for a single point
s = gdf.sample()
ax = lines_gdf.plot(color="black", alpha=0.5, linewidth=0.2)
gdf.plot(ax=ax, color="red")

s.plot(ax=ax, color="blue")
gpd.GeoSeries(s.line_geometry).plot(ax=plt.gca(), color="orange")

Apply to a real-world example¶

The toy problem above was a fine way to explore this concept, so now let’s apply it to real-world networks. We will use an API endpoint hosted through NYCOpenData to find restaurants in Morningside Heights.

There are a couple of things to keep in mind about the following API request. First is the request itself: we use the popular requests library to perform a GET request (there are many types of requests, read more here) on the API endpoint listed at the resource URL.

We can see on the open data portal’s website that each dataset has an Actions dropdown, and within that there’s an API option:

The API endpoint given there is the basis of the URL below. We are adding some additional query filters to focus the response items and reduce the total number of responses (as you can see in the screenshot above, there are over 280,000 results).

We add ?NTA=MN09&$limit=2000, which signifies that we only want resulst where the NTA field equals MN09 (i.e. Morningside Heights) and we limit the number of responses to 2000. You can learn more about formatting SODA queries here

get restaurants in Morningside Heights¶

If a query is successful, it will return a result that starts with 2**, such as 200. Error codes are reserved for 4** and 5**. We can make this most basic request first to confirm that our query is properly formatted:

requests.get(
    "https://data.cityofnewyork.us/resource/43nn-pn8j.json?NTA=MN09&$limit=2000"
)

What we really want, however, is the json associated with the response. We can create a new variable and set it to equal the response json payload:

mh_restaurants = requests.get(
    "https://data.cityofnewyork.us/resource/43nn-pn8j.json?NTA=MN09&$limit=2000"
).json()

If we print the results, we see we have an array of objects (up to 2000) that represents the restaurants in MN09. If you scan through, you’ll see attributes like street, building number, cuisine_description, and restaurant scores.

mh_restaurants

For these to be useful to us, let’s cast the array as a geodataframe:

mh_restaurants = gpd.GeoDataFrame(
    mh_restaurants,
    geometry=gpd.points_from_xy(
        [float(x["longitude"]) for x in mh_restaurants],
        [float(x["latitude"]) for x in mh_restaurants],
    ),
)

mh_restaurants.fillna("na", inplace=True)

Now we can see that we have a dataframe with a number of columns, including a geometry column, that we can use to perform calculations on.

mh_restaurants.columns

We can also drop duplicate records (based on the camis unique restaurant id):

mh_restaurants.drop_duplicates(subset="camis", inplace=True)

Let’s inspect the restaurant grades:

mh_restaurants.grade.value_counts()

Additionally, let’s check out the type of food served at each restaurant:

mh_restaurants.cuisine_description.value_counts()

Build a graph to allow us to search¶

Now that we have a geodataframe of restaurants, let’s create a network based on the street grid of the neighborhood. To do so, we’ll use the total geographic bounds of our restaurants dataframe to request a network from the OSMnx module. OSMnx is built on top of the Networkx library, and makes it easy for users to access information from OpenStreetMap for use in network analysis.

mh_bbox = mh_restaurants.total_bounds

mh_bbox

Like the name implies, we can build a network graph from the bbox above using the graph_from_bbox() function. We can specify that we are interested in a walk type network (i.e. pedestrian paths and sidewalks) using the network_type parameter (vs driving or all for example).

Note too that the order of coordinates returned from the total_bounds property does not match the order that OSMnx expects it, so we have to reorder them via the bbox parameter.

mh_network = ox.graph_from_bbox(
    bbox=[mh_bbox[3], mh_bbox[1], mh_bbox[2], mh_bbox[0]], network_type="walk"
)

We can confirm that the result is a networkx multidirectional (MultiDiGraph) graph object:

mh_network

...and we can extract the nodes and edges of that network out as geodataframes:

mh_net_nodes, mh_net_edges = ox.graph_to_gdfs(mh_network)

By inspecting the results, we can see that each row is a street network segment with a number of properties, all derived from OSM data.

mh_net_edges.head()

We can take a look at the distribution of restaurants relative to our network by plotting both:

ax = mh_restaurants.plot()
mh_net_edges.plot(ax=ax, color="black", linewidth=0.1)

Let’s take a look at the output- we can see a network of nodes and edges that represent the Morningside Heights area. Note that unlike our toy example above, network edges here can have corners and curves and are not just individual line segments.

ax = mh_net_edges.plot(color="black", linewidth=0.1, figsize=(10, 10))
mh_net_nodes.plot(ax=ax, color="orange", markersize=1).set_axis_off()

Great - now that we have a network and locations to traverse, let’s use a starting point based on where we are on campus:

fayerweather = Point(-73.9603477, 40.8082408)

Get directions to a random point¶

First we need to associate our starting point and target points to our networks. We begin by finding the nearest network nodes between an input coordinate and the network, for both our origin and a randomly selected restaurant.

orig = ox.distance.nearest_nodes(mh_network, fayerweather.x, fayerweather.y)
sample = mh_restaurants.sample()
dest = ox.distance.nearest_nodes(mh_network, sample.geometry.x, sample.geometry.y)[0]

The above functions return the indices of the nearest network nodes:

orig, dest

Now that we have network positions, we can traverse the network to find the shortest path between each point. To do so, we’ll use the shortest_path() function, and specify our weight as the length between start and end points. There are a number of options for weight, including the number of segments that need to be traversed, or a custom input you specify.

OSMnx also has special helper functions to plot routes, which we’ll use to plot the network and path between points.

# find the shortest path between nodes, minimizing travel time, then plot it
route = ox.shortest_path(mh_network, orig, dest, weight="length")
fig, ax = ox.plot_graph_route(mh_network, route, node_size=0)

If we inspect the route on its own, we’ll see an array of node ids that are ordered based on the shortest path that is generated.

route

That’s not super useful to us- thankfully OSMnx makes it easy to convert the route to a geodataframe for use elsewhere. We can also use some basic math to convert the path from meters to miles to put things in perspective.

edge_lengths = ox.routing.route_to_gdf(mh_network, route)["length"]

# convert meters to miles
sum(edge_lengths) / 1609.34

Find nearest restaurants of a given cuisine¶

Great, now let’s hone in on a specific use case: finding the n nearest restaurants of a given cuisine type based on network distances. For this example I’ll search for Pizza but you can explore using any cuisine_description value you like.

mh_pizza = mh_restaurants[mh_restaurants.cuisine_description == "Pizza"]

mh_pizza.plot()

Let’s assume we are hungry and want to find the five closest pizza restaurants based on network distance. We can use the following function to do so:

dest = ox.distance.nearest_nodes(mh_network, row.geometry.x, row.geometry.y): finds the nearest node for each input row (in this case, each pizza restaurant)
route = ox.shortest_path(mh_network, origin, dest, weight="length"): finds the shortest path between each restaurant and our input point
edge_lengths = ox.routing.route_to_gdf(mh_network, route)["length"]: finds the total length of the route in meters
route_geom...: finds and creates a unioned (combined) geometry of the route
return {"distance": sum(edge_lengths) / 1609.34, "route_geom": route_geom} returns a dictionary of distance (in miles) and the route geometry

def distance_from_point(row, origin):
    """
    Returns the distance in miles and the geometry of the shortest path between a point and a destination.

    Parameters:
    row (pd.Series): A row from a pandas DataFrame.
    origin (int): The origin node in the network.
    """

    dest = ox.distance.nearest_nodes(mh_network, row.geometry.x, row.geometry.y)
    route = ox.shortest_path(mh_network, origin, dest, weight="length")
    edge_lengths = ox.routing.route_to_gdf(mh_network, route)["length"]

    route_geom = ox.routing.route_to_gdf(mh_network, route)
    route_geom = route_geom.geometry.unary_union

    return {"distance": sum(edge_lengths) / 1609.34, "route_geom": route_geom}

mh_pizza.loc[:, "distance_from_class"] = mh_pizza.apply(
    distance_from_point, origin=orig, axis=1
)

If we observe our results, we can see that each row (e.g. each pizza restaurant) has a dictionary of distance in miles and the shortest path. Pretty neat, but not super ergonomic in its current form.

mh_pizza["distance_from_class"]

We can explode each dictionary key:value pair into a separate column using the below operation. We are combining three different steps in the below cell to achieve this:

converting the entries in the distance_from_class attribute into a pd.Series- this is what does the separating into separate columns
dropping the distance_from_class field, as we no longer need it
using the concat(...) function to concatenate the results of the other two operations; this allows us to combine things back together into our one dataframe

# explode the dictionary into separate columns
mh_pizza = pd.concat(
    [
        mh_pizza.drop(["distance_from_class"], axis=1),
        mh_pizza["distance_from_class"].apply(pd.Series),
    ],
    axis=1,
)

Now we can see we have each of the dictionary entries as new columns, and now we can operate over both in our dataframe.

mh_pizza.head()

Now let’s cast our results to a new geodataframe with the geometry set as the routes calculated above.

mh_pizza_routes = gpd.GeoDataFrame(mh_pizza, geometry="route_geom", crs="EPSG:4326")

Now we can plot each pizza restaurant, the shortest path between here and there, and can symbolize based on the distance in miles:

ax = gpd.GeoSeries(fayerweather).plot(color="black")
mh_net_edges.plot(ax=ax, color="black", linewidth=0.1)
mh_pizza_routes.plot(column="distance", cmap="magma", ax=ax, linewidth=3)
mh_pizza.plot(column="distance", cmap="magma", legend=True, ax=ax).set_axis_off()


# label each restaurant point
for x, y, label in zip(mh_pizza.geometry.x, mh_pizza.geometry.y, mh_pizza.dba):
    ax.text(x, y, label, fontsize=6)

# label the fayerweather point
ax.text(fayerweather.x, fayerweather.y, "Fayerweather", fontsize=6)

# title
plt.title("Distance from Fayerweather to \n Morningside Heights Pizza Restaurants")

Great- we can now visualize and intuit the closest pizza places. If we want to look at the actual numbers, we can sort by distance and observe. Note that the top two entries are tied, which suggest that both were snapped to the same network node (which is a consideration to keep in mind when we rely on third-party datasets like OSM)

mh_pizza[["dba", "distance"]].sort_values("distance")

Use grid cells to build and traverse a network¶

So far, we’ve been able to create networks from a random distribution of points, as well as from the street network of a real-world location. For our final example, let’s explore how we can use a grid of cells to create a network and traverse it. More specifically, we’ll create a grid of hexagons, and then connect each hexagon to its nearest neighbors using the h3 library. H3 was developed by Uber and represents a hierarchical spatial index that uses hexagonal grid cells to partition space. Hexagons are useful for this purpose because they have a number of desirable properties, including uniform adjacency (each hexagon has six neighbors), reduced distortion compared to square grids, and efficient coverage of large areas.

# use H3 grid to create a hexagon around the restaurants
# count the number of restaurants in each hexagon
# use pysal to get adjacency between hexagons
# build new network with hexagons as nodes and their adjacency as edges
# use networkx to find the shortest path between hexagons

# get the H3 hexagons

mh_net_nodes

74 - 76, 88 - 98

Here we are querying the H3 dataset at resolution 9 for each node in our network. You can read more about H3 resolutions here. This will return a unique hexagon id for each point in our network. We will deduplicate these momentarily and create a new dataframe of hexagons.

mh_net_nodes["h3"] = mh_net_nodes.apply(
    lambda x: h3.geo_to_h3(x.geometry.centroid.y, x.geometry.centroid.x, 9), axis=1
)

mh_net_nodes["h3"]

We can use the h3_to_geo_boundary() function to get the polygon geometry of each hexagon, and then populate a new field based on that geometry.

# create hexagons for AOI
mh_net_nodes["geometry"] = mh_net_nodes["h3"].apply(
    lambda x: (h3.h3_to_geo_boundary(x))
)

mh_net_nodes["geometry"]

Unfortunately H3 returns the polygon vertices in the opposite order that shapely expects them, so we have to reverse the order of the coordinates. We can do so using a list comprehension. While we’re at it, we can also cast the coordinates as a Polygon geometry.

# flip order of coordinates for each tuple
mh_net_nodes["geometry"] = mh_net_nodes["geometry"].apply(
    lambda x: Polygon([(y, x) for x, y in x])
)

We can assign the geometry a crs now that we know it is in lat/lon

mh_net_nodes.crs = "EPSG:4326"

mh_net_nodes.crs

Finally, we’ll create a new, deduplicated dataframe of hexagons.

h3_lvl_9_cells = mh_net_nodes.dissolve(by="h3").copy().reset_index()

h3_lvl_9_cells.boundary.plot()

Great, now let’s overlay the hex grid on top of our road network to get a sense of scale and how this new network compares.

ax = h3_lvl_9_cells.boundary.plot(color="black", figsize=(10, 10))
mh_net_edges.plot(
    ax=ax,
    color="black",
    linewidth=0.1,
)

As we can see, the hexagons are fairly small, reasonably appropriate for the scale of our analysis (i.e. traversing city blocks). Next, we will create a network by connecting each hexagon to its six nearest neighbors. While h3 has some native functions to calculate grid adjacency, we will use the more general purpose library libpysal to do so. libpysal is a spatial analysis library that has a number of useful functions, including the ability to calculate spatial weights and relationships between geometries.

We will caalculate a spatial weights matrix using the Queen contiguity method, which considers two polygons to be neighbors if they share a common edge or vertex. This is appropriate for our hexagonal grid, as each hexagon has six neighbors that share edges.

# create an adjacency matrix for the hexagons
w = lps.weights.Queen.from_dataframe(h3_lvl_9_cells)

Just like with the n-nearest neighbors example above, we can use the spatial weights matrix to create a list of neighboring hexagons for each row in our dataframe. We can use dictionary comprehension to create a mapping of each hexagon’s index to its list of neighbors, and then create a graph network from that mapping using networkx.

# get neighbors for each hexagon
neighbors = {k: v for k, v in w.neighbors.items()}

neighbors

Here we create a new graph G based on this adjacency matrix.

G = nx.Graph(neighbors)

Let’s take a look at the graph- each node ID is plotted on the centroid of each hexagon, and the edges represent the connections between each hexagon and its six neighbors. As before, we can ignore the geographic CRS warning for this example.

# plot the path
ax = h3_lvl_9_cells.plot()
h3_lvl_9_cells.boundary.plot(ax=ax, color="black")

# label cells with their index
for x, y, label in zip(
    h3_lvl_9_cells.geometry.centroid.x,
    h3_lvl_9_cells.geometry.centroid.y,
    neighbors.keys(),
):
    ax.text(x, y, label, fontsize=6)

Now we can inspect the adjacency dictionary above and understand how the adjacency is structured. Each key is a hexagon index, and the value is a list of adjacent hexagon indices.

With this in mind, we can use networkx and the shortest_path() function to find the shortest path between two hexagons by passing the graph G and two hexagon indices. Here we are finding the shortest path between hexagon 8 and hexagon 29.

# get the path between two nodes
path = nx.shortest_path(G, 8, 29)

The path is returned as a list of hexagon indices that represent the shortest path between the two input hexagons.

path

Finally, let’s visualize the path on top of the hexagon grid. We can use the iloc[...] function to look up the hexagons in path by their indices, and then plot the resulting geometries.

# plot the path
ax = h3_lvl_9_cells.plot()
h3_lvl_9_cells.boundary.plot(ax=ax, color="black")
h3_lvl_9_cells.iloc[path].plot(ax=ax, color="red").set_axis_off()

# label cells with their index
for x, y, label in zip(
    h3_lvl_9_cells.geometry.centroid.x,
    h3_lvl_9_cells.geometry.centroid.y,
    neighbors.keys(),
):
    ax.text(x, y, label, fontsize=6)

As we can see, the shortest path between hexagon 8 and hexagon 29 traverses through a number of intermediate hexagons, following the edges of the hexagonal grid. We can imagine cases where this type of grid-based network analysis could be useful.

One thing we didn’t explore as much here is the idea of cost or weight associated with traversing between nodes. In our street network example above, we used the length of each street segment as the weight for calculating the shortest path. We could just as easily assign other weights, such as travel time, elevation change, degree of sun exposure, etc. Think about the possibilities!

Syllabus

03. Making a web map with Maplibre