01. Describing and visualizing data

Getting Started¶

Download MapPLUTO data from NYC DCP.
Unzip the file and place contents in the Data directory at the root of this repo
Make sure you have installed all requisite libraries by running pip install -r requirements.txt with your virtual environment activated. For guidance on setting up or activating your virtual environment, refer to the notes in 00_Getting_started.

Goals¶

Download data from NYC DCP’s open data portal
Load data from file
Explore spatial and non-spatial elements of the dataset
Visualize spatial and non-spatial elements of the dataset

The following libraries will allow us to import and explore our data. Note that in some cases we import the entire library and even use a short-hand reference (e.g. import geopandas as gpd), while in other cases we only import submodules (e.g. from lonboard._layer import PolygonLayer). This is largely a matter of preference, but importing submodules can help reduce memory usage and improve performance, especially when working with large datasets or in an app you’ve developed.

# the bare minimum
import matplotlib.pyplot as plt  # for plotting
import geopandas as gpd  # for geospatial data handling
from matplotlib.lines import Line2D

# more advanced
from lonboard._map import Map
from lonboard._layer import PolygonLayer  # for mapping in 3D
from lonboard.colormap import (
    apply_categorical_cmap,
    apply_continuous_cmap,
)  # for assigning colors
from palettable.colorbrewer.sequential import PuRd_9  # for color palettes
from matplotlib.colors import LogNorm  # for logarithmic normalization
import pygwalker as pyg  # for creating interactive data visualizations

Load PLUTO data¶

MapPLUTO is New York City’s tax lot database, which contains detailed information about the city’s land parcels, including their size, zoning, and ownership. The dataset is updated quarterly and is a foundational resource for understanding the city’s built environment.

Here, we create a variable (pluto) and use geopandas to read the file into memory. We use a “relative” path to reference the file.

pluto = gpd.read_file("../Data/nyc_mappluto_25v1_1_shp/MapPLUTO.shp")

Basic exploration¶

A great way to familiarize yourself with a dataset is to look at the first few rows. Scroll to the right- you’ll see an ellipsis (...) indicating that there are more columns (there are 95!). The rightmost column is the geometry, which contains the spatial information for each row. That’s what we’ll use to map later, but all of the other information contains data that can help us gain a deeper understanding of the dataset.

pluto.head()

First, let’s get a sense of what columns are available in the dataset.

pluto.columns

We can also check the data types of each column- we can see there are a mix of numeric (int32, int64, float64) and string (O = object) data types, as well as some geometry data (geopandas.array.GeometryDtype).

list(pluto.dtypes)

Exploring a categorical column¶

A great way to get a sense of a field is to look at the frequency of values in a categorical field. here, we look at the LandUse field, which describes the type of land use for each property in the dataset.

pluto.LandUse.value_counts()

🧐 What do those numbers mean? Let’s look at the data dictionary

# now we can remap the numbers into something more meaningful

land_use_codes = {
    "01": "One & Two Family Buildings",
    "02": "Multi-Family Walk-Up Buildings",
    "03": "Multi-Family Elevator Buildings",
    "04": "Mixed Residential & Commercial Buildings",
    "05": "Commercial & Office Buildings",
    "06": "Industrial & Manufacturing",
    "07": "Transportation & Utility",
    "08": "Public Facilities & Institutions",
    "09": "Open Space & Outdoor Recreation",
    "10": "Parking Facilities",
    "11": "Vacant Land",
}

We can “map” the Land Use codes to more descriptive names. the map() function will replace instances of the “key” with the corresponding “value” in the dictionary.

pluto["LandUse"] = pluto.LandUse.map(land_use_codes)

# now when we perform operations on the LandUse field,
# we can use the more meaningful names.
# let's look at the frequency of values in the LandUse field again
pluto.LandUse.value_counts()

⚠️ Caution! So far we have been counting the number of rows, but not saying anything about the area of each land use type. To do so, we can sum across rows of the same type. We can use the groupby() function to group the data by a categorical column type and then sum the values in another column (in this case, the lot area) the area for each type.

Grouping by a categorical column¶

Here, we are grouping the data by the landuse column and then summing the values in the lotarea column for each group. The result is a new DataFrame that contains the total lot area for each land use type. We can sum or manipulate multiple columns at a time, and can chain together multiple operations as you can see below.

pluto.groupby("LandUse").LotArea.sum().sort_values(ascending=False)

# more complex grouping
# We can also use the `agg()` function to apply multiple aggregation functions to different columns at once.
# For example, we can calculate the total lot area and the average building area for each land use type:
landuse_summary = (
    pluto.groupby("LandUse")
    .agg({"LotArea": "sum", "BldgArea": "mean"})
    .reset_index()
    .rename(
        columns={"LotArea": "Total Lot Area", "BldgArea": "Average Building Area"},
    )
)
landuse_summary

Now let’s plot the total lot area for each land use type.

# there are many ways to visualize this data, and levels to refining the graphic style.
# This is the simplest possible example, using matplotlib to create a bar chart
pluto.groupby("LandUse").LotArea.sum().sort_values(ascending=False).plot.bar()
plt.title("Total lot area by land use type")

# now the same but for building area - note this is total building area, not average
pluto.groupby("LandUse").BldgArea.sum().sort_values(ascending=False).plot.bar()
plt.title("Total building area by land use type")

Below is a more complicated example where we can plot two variables against each other to compare.

Note how we are manipulating a “figure” (fig) and “axes” (ax) object to create a more complex plot. This is a common pattern in data visualization libraries like Matplotlib and Seaborn, where you can create a figure and then add multiple axes to it, each with its own plot. You can see several invocations of ax where we set various properties.

We also use two copies of our pluto dataset as inputs - note how each copy is aggregated over a different variable.

# plot both lot and building area on the same plot with a secondary y-axis
fig, ax = plt.subplots()
by_lot_area = pluto.groupby("LandUse").LotArea.sum().sort_values(ascending=False)

by_lot_area.plot.bar(ax=ax, color="orange")

# get order to apply below
order = {v: i for i, v in enumerate(by_lot_area.index)}

ax.set_ylabel("Lot Area")
ax.set_xlabel("Land Use Type")

ax2 = ax.twinx()
pluto.groupby("LandUse").BldgArea.sum().reindex(by_lot_area.index).plot.bar(
    ax=ax2, edgecolor="black", color="none"
)
ax2.set_ylabel("Building Area")

plt.title("Total lot and building area by land use type")

# add legends
ax.legend(["Lot Area"], loc="upper left")
ax2.legend(["Building Area"], loc="upper right")

numeric column¶

We can use the describe() function to get a summary of the numeric columns in the dataset.

pluto["NumFloors"].describe()

# We can see that the average number of floors is 2.35, but the maximum is 104! and that the majority of buildings have 1-3 floors.
# Let's visualize the distribution of the number of floors using a histogram
plt.figure(figsize=(10, 6))
pluto["NumFloors"].hist(bins=30, edgecolor="black")
plt.title("Distribution of Number of Floors")
plt.xlabel("Number of Floors")
plt.ylabel("Frequency")
plt.xlim(0, 20)  # limit x-axis to 20 for better visibility
plt.xticks(range(0, 21))  # set x-ticks to integers from 0 to 20
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.show()

Interactive plotting¶

We can use the pygwalker library to create an interactive visualization of the data. Especially as we are becoming familiar with the dataset, this can be a useful way to explore the data and see how different variables relate to each other.

# pygwalker doesn't suppert geospatial data directly, so we need to drop the geometry column.
# Be sure to keep a copy of the original data, we'll need it later!
pluto_non_spatial = pluto.drop(columns=["geometry"])

# Invoke pygwalker, begin exploring the data interactively
pyg.walk(pluto_non_spatial)

Your turn:¶
Take a few minutes to explore the dataset on your own.
Try creating graphs that show the relationship between different variables.
Let’s discuss what you find or what you’re exploring.

Mapping¶

Creating a static map¶

In this example, we are using matplotlib to map our pluto data. Under the hood, matplotlib is drawing each individual polygon (>800,000), which is resource intensive and hard to discern on a map! This is a good place to start, but we’ll soon move on to more advanced mapping techniques.

pluto.plot(figsize=(10, 10)).set_axis_off()
plt.title("NYC MapPLUTO")

cmap = {
    "One & Two Family Buildings": "#ffff00",
    "Multi-Family Walk-Up Buildings": "#fffb00",
    "Multi-Family Elevator Buildings": "#ffc800",
    "Mixed Residential & Commercial Buildings": "#ff4000",
    "Commercial & Office Buildings": "#ff0000",
    "Industrial & Manufacturing": "#7700ff",
    "Transportation & Utility": "#808080",
    "Public Facilities & Institutions": "#001580",
    "Open Space & Outdoor Recreation": "#219F21",
    "Parking Facilities": "#A6A6AB",
    "Vacant Land": "#222222",
    "Unknown": "#000000",
}

pluto.fillna({"LandUse": "Unknown"}, inplace=True)

pluto.LandUse.unique()

# here we make a new column called "color" that maps the LandUse values to colors
pluto["color"] = pluto["LandUse"].map(cmap)

pluto["color"].unique()

ax = pluto.plot(
    color=pluto["color"],
    figsize=(10, 10),
    legend=True,
).set_axis_off()
plt.title("NYC MapPLUTO")


# populate legend items based on dict from above
legend_colors = [
    Line2D([0], [0], marker="o", color="w", markerfacecolor=c, markersize=10)
    for c in cmap.values()
]
labels = cmap.keys()
plt.legend(legend_colors, labels, loc="upper left")

Your turn:¶

Map a numeric column using a continuous colormap for Queens. See here for a list and discussion of colormaps: https://matplotlib.org/stable/tutorials/colors/colormaps.html
What patterns emerge?

pluto.LotArea.plot.hist(bins=100, log=True)

Prep for interactive mapping¶

pluto_wgs = pluto.to_crs("epsg:4326")

Visualize a categorical variable¶

cmap_rgb = {k: list(int(v[i : i + 2], 16) for i in (1, 3, 5)) for k, v in cmap.items()}

cmap_rgb

len(pluto_wgs[pluto_wgs["LandUse"].isna()])

pluto.fillna({"LandUse": "Unknown"}, inplace=True)

now, we can plot the data using lonboard to create an interactive map¶

df = pluto_wgs[["LandUse", "geometry"]].copy()
df["LandUse"] = df["LandUse"].astype("category")

layer = PolygonLayer.from_geopandas(
    df[["LandUse", "geometry"]],
    get_fill_color=apply_categorical_cmap(
        df["LandUse"],
        cmap=cmap_rgb,
    ),
)
m = Map(layer)
m

visualize a continuous variable¶

df = pluto_wgs[["NumFloors", "geometry"]]

normalizer = LogNorm(1, df.NumFloors.max(), clip=True)
normalized_floors = normalizer(df.NumFloors)

layer = PolygonLayer.from_geopandas(
    df[["NumFloors", "geometry"]],
    get_fill_color=apply_continuous_cmap(normalized_floors, cmap=PuRd_9),
)
m = Map(layer)
m

df = pluto_wgs[["NumFloors", "geometry"]]

normalizer = LogNorm(1, df.NumFloors.max(), clip=True)
normalized_floors = normalizer(df.NumFloors)


layer = PolygonLayer.from_geopandas(
    df[["NumFloors", "geometry"]],
    get_fill_color=apply_continuous_cmap(normalized_floors, cmap=PuRd_9),
    extruded=True,
    get_elevation=pluto_wgs["NumFloors"] * 14,
)
m = Map(
    layer, view_state={"longitude": -73.97, "latitude": 40.73, "zoom": 10, "pitch": 45}
)
m

df = pluto_wgs[pluto_wgs.YearBuilt > 2010][["NumFloors", "geometry"]].copy()

normalizer = LogNorm(1, df.NumFloors.max(), clip=True)
normalized_floors = normalizer(df.NumFloors)

layer = PolygonLayer.from_geopandas(
    df[["NumFloors", "geometry"]],
    get_fill_color=apply_continuous_cmap(normalized_floors, cmap=PuRd_9),
    extruded=True,
    get_elevation=df["NumFloors"] * 14,
)
m = Map(layer)
m

pluto_wgs.YearBuilt.nunique()

def categorize_buildings(r):
    if r.YearBuilt < 1900:
        return "Pre-1900"
    elif r.YearBuilt < 1950:
        return "1900-1950"
    elif r.YearBuilt < 2000:
        return "1950-2000"
    else:
        return "Post-2000"

pluto_wgs["year_category"] = pluto_wgs.apply(categorize_buildings, axis=1)

pluto_wgs.year_category.value_counts()

year_built_ma = {
    "Pre-1900": "[255,0,0]",
    "1900-1950": "#00ff00",
    "1950-2000": "#0000ff",
    "Post-2000": "#ff00ff",
}

df = pluto_wgs[["year_category", "geometry"]]

layer = PolygonLayer.from_geopandas(
    df[["year_category", "geometry"]],
    get_fill_color=apply_categorical_cmap(df["year_category"], cmap=cmap_rgb),
)
m = Map(layer)
m

To read more about lonboard and mapping in 3d, see here for some tips: https://developmentseed.org/lonboard/latest/examples/overture-maps/#imports