How can I divide data into multiple clusters using agglomerative clustering in Python with NumPy and Matplotlib?
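To divide data into multiple clusters using agglomerative clustering in Python, you can combine NumPy (to hold the data), SciPy's scipy.cluster.hierarchy module (to do the clustering), and Matplotlib (to plot the results). Follow these steps: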
Step 1: Import the necessary libraries
python
import numpy as np
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
Step 2: Prepare the data
Create an array of the data points that you want to cluster. The data should be stored in a NumPy array in which each row represents a data point and each column represents a feature.
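As a minimal sketch, the small two-dimensional array below (the same six points used in the complete example at the end) can serve as data for the following steps. The values are purely illustrative.

```python
import numpy as np

# Illustrative data: six 2-D points forming two visually separate groups.
# Each row is one data point; each column is one feature.
data = np.array([[1, 2], [2, 3], [3, 3],
                 [6, 7], [7, 8], [8, 8]])

print(data.shape)  # (n_samples, n_features)
```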
Step 3: Perform hierarchical clustering
Use the linkage function from scipy.cluster.hierarchy to perform agglomerative clustering on the data. The linkage function takes the data array as input and returns a linkage matrix, which records which two clusters were merged at each step and at what distance.
python
Z = linkage(data, 'ward')
In this example, the 'ward' linkage criterion is used: at each step it merges the two clusters whose union gives the smallest increase in total within-cluster variance. You can also experiment with other linkage criteria such as 'single', 'complete', or 'average'.
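To get a feel for how the criterion changes the result, the sketch below runs linkage with several methods on an assumed six-point array and prints the distance of the final merge for each. It relies on SciPy's default Euclidean metric. Each row of the linkage matrix describes one merge: the indices of the two clusters joined, the distance between them, and the number of original points in the new cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Illustrative data (assumed for this sketch).
data = np.array([[1, 2], [2, 3], [3, 3],
                 [6, 7], [7, 8], [8, 8]])

results = {}
for method in ("single", "complete", "average", "ward"):
    Z = linkage(data, method=method)
    results[method] = Z
    # Z has n - 1 rows: [cluster_a, cluster_b, merge_distance, new_cluster_size]
    print(f"{method}: final merge at distance {Z[-1, 2]:.3f}")
```

In the first two columns, an index of n or greater (n being the number of data points) refers to a cluster created by an earlier row: row i creates cluster n + i.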
Step 4: Plot the dendrogram
Use the dendrogram function from scipy.cluster.hierarchy to visualize the hierarchical clustering result as a dendrogram.
python
plt.figure(figsize=(10, 5))
dendrogram(Z)
plt.xlabel('Data Points')
plt.ylabel('Distance')
plt.title('Dendrogram')
plt.show()
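If you only need the structure of the tree (for example, on a machine without a display), dendrogram also accepts no_plot=True and returns a dictionary describing the layout instead of drawing it. A minimal sketch, again on assumed illustrative data:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Illustrative data (assumed for this sketch).
data = np.array([[1, 2], [2, 3], [3, 3],
                 [6, 7], [7, 8], [8, 8]])
Z = linkage(data, "ward")

# no_plot=True skips drawing and only returns the computed layout.
info = dendrogram(Z, no_plot=True)

# 'leaves' lists the original row indices in left-to-right leaf order.
print(info["leaves"])
# The third column of Z holds the merge heights shown on the y-axis.
print(Z[:, 2])
```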
Step 5: Cut the dendrogram to form clusters
Based on the dendrogram, you can decide how many clusters you want and where to cut the tree to get them. Use the fcluster function from scipy.cluster.hierarchy to cut the dendrogram at a specific distance threshold; it returns an array with one cluster label per data point.
python
from scipy.cluster.hierarchy import fcluster
max_d = 10 # adjust this threshold based on the dendrogram
clusters = fcluster(Z, max_d, criterion='distance')
Here, max_d is the maximum distance threshold for forming clusters. Lowering it produces more, smaller clusters; raising it produces fewer, larger ones.
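If you would rather ask for a fixed number of clusters than pick a distance, fcluster also supports criterion='maxclust', where the second argument is the maximum number of clusters to form. The sketch below shows both approaches on assumed illustrative data, plus a simple heuristic (my own assumption, not a rule) that places the threshold in the largest gap between successive merge distances.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Illustrative data (assumed for this sketch).
data = np.array([[1, 2], [2, 3], [3, 3],
                 [6, 7], [7, 8], [8, 8]])
Z = linkage(data, "ward")

# Option 1: cut at a distance threshold.
labels_by_distance = fcluster(Z, t=4, criterion="distance")

# Option 2: request at most k flat clusters.
labels_by_count = fcluster(Z, t=2, criterion="maxclust")

# Heuristic (an assumption): cut halfway across the largest jump
# between consecutive merge distances.
heights = Z[:, 2]
i = int(np.argmax(np.diff(heights)))
suggested_max_d = (heights[i] + heights[i + 1]) / 2

print(labels_by_distance, labels_by_count, suggested_max_d)
```

The heuristic tends to work when the groups are well separated, as in this toy data; on real data it is worth checking the suggested value against the dendrogram.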
Step 6: Visualize the clusters
Finally, you can use Matplotlib to plot the data points with different colors representing different clusters.
python
plt.scatter(data[:, 0], data[:, 1], c=clusters, cmap='viridis')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Agglomerative Clustering')
plt.show()
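The scatter plot only shows the first two features, so a quick numeric summary of each cluster can be a useful complement. A minimal sketch, assuming the illustrative six-point data and a threshold of 4:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Illustrative data (assumed for this sketch).
data = np.array([[1, 2], [2, 3], [3, 3],
                 [6, 7], [7, 8], [8, 8]])
clusters = fcluster(linkage(data, "ward"), t=4, criterion="distance")

# fcluster labels start at 1; there is one label per row of data.
labels, counts = np.unique(clusters, return_counts=True)
for label, count in zip(labels, counts):
    members = data[clusters == label]
    print(f"cluster {label}: {count} points, centroid {members.mean(axis=0)}")
```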
Putting it all together, here's a complete example:
python
# Step 1: Import the necessary libraries
import numpy as np
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
# Step 2: Prepare the data
data = np.array([[1, 2], [2, 3], [3, 3], [6, 7], [7, 8], [8, 8]])
# Step 3: Perform hierarchical clustering
Z = linkage(data, 'ward')
# Step 4: Plot the dendrogram
plt.figure(figsize=(10, 5))
dendrogram(Z)
plt.xlabel('Data Points')
plt.ylabel('Distance')
plt.title('Dendrogram')
plt.show()
# Step 5: Cut the dendrogram to form clusters
max_d = 4
clusters = fcluster(Z, max_d, criterion='distance')
# Step 6: Visualize the clusters
plt.scatter(data[:, 0], data[:, 1], c=clusters, cmap='viridis')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Agglomerative Clustering')
plt.show()
You can adjust the data and the distance threshold to suit your own dataset.