K-Means 편집하기

'''K-Means''' is one of the most popular unsupervised machine learning algorithms used for clustering data into distinct groups. The algorithm partitions a dataset into '''k''' clusters, where each data point belongs to the cluster with the nearest mean. It is widely used for data analysis, pattern recognition, and feature engineering.
==How K-Means Works==
The K-Means algorithm follows an iterative process to assign data points to clusters and optimize the cluster centroids:
#Initialize '''k''' cluster centroids randomly or using a heuristic method (e.g., K-Means++ initialization).
#Assign each data point to the cluster whose centroid is nearest (based on a distance metric such as Euclidean distance).
#Recalculate the centroids of each cluster as the mean of all data points assigned to that cluster.
#Repeat steps 2 and 3 until the centroids converge or the maximum number of iterations is reached.
===Example===
Given a dataset, K-Means groups the points as follows:<syntaxhighlight lang="python">
from sklearn.cluster import KMeans
import numpy as np

# Example dataset
data = np.array([[1, 2], [1, 4], [1, 0],
                 [10, 2], [10, 4], [10, 0]])

# Apply K-Means
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(data)

# Results
print("Cluster centers:", kmeans.cluster_centers_)
print("Labels:", kmeans.labels_)
</syntaxhighlight>
==Parameters of K-Means==
*'''Number of Clusters (k):''' The number of clusters to create. This parameter must be chosen carefully, as it significantly impacts the algorithm's performance.
*'''Centroid Initialization:''' The initial positions of the centroids can affect convergence and results. K-Means++ is a commonly used initialization method to improve performance.
*'''Distance Metric:''' The measure used to compute the distance between points and centroids, typically Euclidean distance.
==Advantages==
*Simple and easy to implement.
*Scales well to large datasets.
*Works efficiently when clusters are well-separated.
==Limitations==
*Sensitive to the initial placement of centroids, which may lead to suboptimal results.
*Assumes spherical clusters and equal cluster sizes, which may not always reflect the data distribution.
*Requires the number of clusters (k) to be specified beforehand.
==Applications==
K-Means is used across various domains for clustering and grouping tasks:
*'''Image Segmentation:''' Partitioning an image into meaningful segments or regions.
*'''Customer Segmentation:''' Grouping customers based on purchasing behavior or demographics.
*'''Anomaly Detection:''' Identifying unusual patterns or outliers in datasets.
*'''Document Clustering:''' Grouping documents based on similarity in text content.
==Variants of K-Means==
Several variations of the K-Means algorithm have been developed to address its limitations:
*'''K-Means++:''' Improves the initialization process to reduce convergence time and improve results.
*'''Mini-Batch K-Means:''' Processes subsets of data to handle large-scale datasets efficiently.
*'''Fuzzy C-Means:''' Allows data points to belong to multiple clusters with varying degrees of membership.
*'''Bisecting K-Means:''' Iteratively splits clusters for hierarchical clustering.
==Choosing the Number of Clusters==
Selecting the appropriate value of '''k''' is crucial. Common methods include:
*'''Elbow Method:''' Plot the within-cluster sum of squares (WCSS) against '''k'''. The optimal '''k''' is at the "elbow point," where the rate of decrease slows significantly.
*'''Silhouette Analysis:''' Measures how similar data points are within a cluster compared to other clusters.
*'''Gap Statistic:''' Compares the total within-cluster variation for different values of '''k''' with that expected under null reference data.
==Related Concepts and See Also==
*[[Clustering]]
*[[K-Means++]]
*[[Silhouette Analysis]]
*[[Hierarchical Clustering]]
*[[Fuzzy C-Means]]
*[[Unsupervised Learning]]