Entropy (Data Science)

In Data Science, Entropy is a measure of randomness or uncertainty in a dataset. Often used in Decision Trees and other machine learning algorithms, entropy quantifies the impurity or unpredictability of the information in a set of data. In classification tasks, entropy helps determine the best way to split data so as to reduce uncertainty and increase homogeneity in the resulting subsets.

==How Entropy Works==

Entropy, denoted as H, is calculated based on the probabilities of different classes within a dataset. For a binary classification, entropy is given by:

H = - p<sub>1</sub> log<sub>2</sub>(p<sub>1</sub>) - p<sub>2</sub> log<sub>2</sub>(p<sub>2</sub>)

where:
*'''p<sub>1</sub>''' and '''p<sub>2</sub>''' are the probabilities of the two classes.

If the dataset contains multiple classes, the formula generalizes to a sum over every class, H = - Σ<sub>i</sub> p<sub>i</sub> log<sub>2</sub>(p<sub>i</sub>). Higher entropy values indicate greater disorder, while lower values indicate a more homogeneous (purer) distribution. In Decision Trees, splits that reduce entropy are preferred because they create more "pure" nodes.
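
As an illustrative sketch (not part of the original article), the multi-class formula can be computed directly from a list of class labels; the function name <code>entropy</code> and the use of Python's <code>collections.Counter</code> are choices made for this example:

<syntaxhighlight lang="python">
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    total = len(labels)
    probs = [count / total for count in Counter(labels).values()]
    # Sum -p * log2(p) over the classes that actually occur; p = 0 terms never appear here.
    return sum(-p * log2(p) for p in probs)

print(entropy(["A", "B", "A", "B"]))  # 1.0 -> evenly mixed, maximal disorder
print(entropy(["A", "A", "A", "A"]))  # 0.0 -> perfectly pure node
</syntaxhighlight>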

==Applications in Decision Trees==

Entropy is a key concept in building Decision Trees, where it guides the splitting of nodes. The process is as follows:

#'''Calculate Entropy''': Entropy is calculated for the parent node based on the distribution of classes.
#'''Evaluate Potential Splits''': Each possible feature split is evaluated to see how much it decreases entropy (i.e., increases homogeneity).
#'''Select the Best Split''': The split with the maximum reduction in entropy (known as information gain) is chosen.

This approach leads to more structured and informative splits, ultimately improving the accuracy of the Decision Tree.
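
To make the split-selection step concrete, the following sketch (the names <code>entropy</code> and <code>information_gain</code> are assumptions for this example, not taken from the article) scores a candidate split by how much it lowers the parent node's entropy:

<syntaxhighlight lang="python">
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels (same helper as sketched above)."""
    total = len(labels)
    return sum(-(count / total) * log2(count / total) for count in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent node minus the size-weighted entropy of its child nodes."""
    total = len(parent)
    weighted_children = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted_children

# Parent node: 5 instances of class A and 5 of class B (entropy = 1).
parent = ["A"] * 5 + ["B"] * 5
# Candidate split producing one mostly-A child and one mostly-B child.
left = ["A", "A", "A", "A", "B"]
right = ["B", "B", "B", "B", "A"]
print(information_gain(parent, [left, right]))  # ~0.278: this split removes about 0.278 bits of entropy
</syntaxhighlight>

At each node, the candidate split with the largest such value would be the one selected.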

==Key Characteristics==

*'''Higher Entropy''': Indicates a mixed distribution of classes, suggesting greater disorder and higher impurity.
*'''Lower Entropy''': Indicates a distribution dominated by a single class, suggesting lower disorder and greater homogeneity.
*'''Range''': For binary classification, entropy ranges from 0 (perfectly homogeneous) to 1 (an even 50/50 mix); with k classes the maximum is log<sub>2</sub>(k), reached when every class is equally likely.

==Example Calculation==

Consider a binary dataset where a target feature can have two possible classes, A and B. If the dataset contains 50% of each class, entropy will be maximal:

H = - (0.5 * log<sub>2</sub>(0.5)) - (0.5 * log<sub>2</sub>(0.5)) = 1

Conversely, if all instances belong to class A, entropy will be minimal (0), indicating perfect homogeneity.
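
As a quick check of these two boundary cases (a minimal sketch; the helper name <code>entropy_from_probs</code> is hypothetical), the same numbers fall out when working directly from class probabilities:

<syntaxhighlight lang="python">
from math import log2

def entropy_from_probs(probs):
    """Shannon entropy (base 2) from a list of class probabilities; zero probabilities are skipped."""
    return sum(-p * log2(p) for p in probs if p > 0)

print(entropy_from_probs([0.5, 0.5]))  # 1.0 -> maximal: the two classes are evenly mixed
print(entropy_from_probs([1.0, 0.0]))  # 0.0 -> minimal: every instance belongs to class A
</syntaxhighlight>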

==See Also==