Additional Tutorials

Clustering

Here we cluster our data in three ways: by attribute values, by spatial relationships, and then by both together.

from sklearn.cluster import KMeans  # for clustering
import geopandas as gpd  # for data manipulation

pluto = gpd.read_file("../Data/nyc_mappluto_24v1_1_shp/MapPLUTO.shp")

Cluster based on attribute data

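# fit k-means with 5 clusters on two attributes: number of floors and residential area
kmeans = KMeans(n_clusters=5)
kmeans.fit(pluto[["NumFloors", "ResArea"]])

# label each row with the cluster it was assigned to
pluto["NumFloorsCluster"] = kmeans.predict(pluto[["NumFloors", "ResArea"]])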
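k-means groups rows by Euclidean distance, so a column with a much larger numeric range (here ResArea compared with NumFloors) can dominate the result. The sketch below uses four made-up buildings, not PLUTO rows, to show how z-score standardization can change which building counts as the nearest neighbor. The building names, values, and helper functions are all hypothetical.

```python
import math
import statistics

# Hypothetical buildings as (NumFloors, ResArea); illustrative values only.
buildings = {
    "house": (2.0, 2_000.0),
    "slim_tower": (30.0, 2_500.0),
    "rowhouse": (3.0, 6_000.0),
    "large_tower": (35.0, 20_000.0),
}

names = list(buildings)
raw = [buildings[name] for name in names]


def standardize(rows):
    """Z-score each column: subtract its mean, divide by its population std."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]


def nearest_to_first(rows):
    """Return the name of the building closest to the first one."""
    distances = {
        names[i]: math.dist(rows[0], rows[i]) for i in range(1, len(rows))
    }
    return min(distances, key=distances.get)


raw_neighbor = nearest_to_first(raw)
scaled_neighbor = nearest_to_first(standardize(raw))

print("nearest to 'house' on raw values:   ", raw_neighbor)
print("nearest to 'house' on scaled values:", scaled_neighbor)
```

On raw values the 30-floor tower is the closest match to the 2-floor house, because a 500-unit difference in area is small next to the other area gaps and the 28-floor difference barely registers. After standardization the 3-floor rowhouse is closest. scikit-learn's `StandardScaler` applies this kind of standardization. Whether scaling suits this dataset is a judgment call.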

pluto.groupby("NumFloorsCluster").NumFloors.describe().sort_values("max")

NumFloorsCluster     count       mean        std   min   25%   50%   75%    max
2                      5.0  18.600000   8.264381  13.0  14.0  15.0  18.0   33.0
4                      1.0  33.000000        NaN  33.0  33.0  33.0  33.0   33.0
3                   5580.0  12.516174   9.462652   0.0   6.0   9.0  16.0   82.0
1                    469.0  24.384861  16.606995   0.0  13.0  20.0  33.0   98.0
0                 850764.0   2.278651   1.545119   0.0   2.0   2.0   2.5  104.0
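The `count` column shows that the five clusters are far from equal in size. As a quick worked check, the sketch below converts those counts into shares of the total. The numbers are copied from the table above. Treating them as the full set of clustered rows is an assumption, since `describe()` counts only non-missing `NumFloors` values.

```python
# Row counts per cluster, copied from the `count` column of the table above.
counts = {0: 850764, 1: 469, 2: 5, 3: 5580, 4: 1}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Print the clusters from largest to smallest.
for label, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"cluster {label}: {counts[label]:>7} rows ({share:.4%})")
```

One cluster holds more than 99% of the rows and another holds a single row, which is worth keeping in mind when reading the plots that follow.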

pluto.plot.scatter(x="NumFloors", y="ResArea", c="NumFloorsCluster", cmap="tab10")


# same as above, but with a log-scaled y axis so the clusters are easier to see
pluto.plot.scatter(
    x="NumFloors", y="ResArea", c="NumFloorsCluster", cmap="tab10", logy=True
)

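A log axis has no position for zero, so any rows with a ResArea of 0 cannot be placed meaningfully on the log-scaled plot. One common workaround, sketched here with made-up values, is to transform with log(1 + x), which maps 0 to 0 and still compresses large values. The list of areas is hypothetical and not taken from PLUTO.

```python
import math

# Hypothetical residential areas, including a zero; not taken from PLUTO.
res_area = [0.0, 900.0, 12_000.0, 1_500_000.0]

# log10(0) is undefined, so a plain log transform fails on the first value.
try:
    math.log10(res_area[0])
    plain_log_ok = True
except ValueError:
    plain_log_ok = False

# log1p(x) computes log(1 + x), which is defined at zero.
transformed = [math.log1p(value) for value in res_area]

print("plain log works on zero:", plain_log_ok)
print("log1p values:", [round(t, 2) for t in transformed])
```

With NumPy, `np.log1p` applies the same transform to a whole column, for example to a new column derived from `pluto["ResArea"]`.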

pluto["NumFloorsCluster"].plot.hist(bins=5, alpha=0.5)


Question:

🧐 What does the shape of the clusters tell us about the data (and the relationship between buildings in NYC, residential area, and number of floors)?

Cluster based on spatial relationships