July 27, 2024

Data Mining 2024 Q&A


Q.1) Ramesh is an investor. His portfolio primarily tracks the performance of the Nifty, and Ramesh wants to add the stock of ABC Corp. Before adding the stock to his portfolio, he wants to assess the directional relationship between the stock and the Nifty. Ramesh does not want to increase the unsystematic risk of his portfolio; thus, he is interested in owning securities that tend to move in the same direction. Considering the data set given below, would you suggest that Ramesh invest in ABC Corp. stock? Justify your answer. [5 Marks]
Year   Nifty   ABC Corp
2013   1692    68
2014   1978    102
2015   1884    110
2016   2151    112
2017   2519    154

Answer :

To assess whether Ramesh should invest in ABC Corp. stock based on its directional relationship with the Nifty, we need to analyze the correlation between the two. Correlation helps determine the strength and direction of the linear relationship between two variables, in this case, Nifty and ABC Corp. stock prices.

Let’s calculate the correlation coefficient (Pearson correlation) between Nifty and ABC Corp. stock prices using the given data:

```
Year   Nifty   ABC Corp
2013   1692    68
2014   1978    102
2015   1884    110
2016   2151    112
2017   2519    154
```

First, we need to compute the returns for both Nifty and ABC Corp. stock. The return for each year can be calculated using the formula:

\[ \text{Return} = \frac{\text{Price}_{\text{end}} - \text{Price}_{\text{start}}}{\text{Price}_{\text{start}}} \times 100 \]

Then, we can find the correlation coefficient using statistical software or spreadsheet tools. Applying the return formula to the given prices gives the year-on-year returns:

```
Year   Nifty Return (%)   ABC Corp Return (%)
2013   –                  –
2014   16.90              50.00
2015   -4.75              7.84
2016   14.17              1.82
2017   17.11              37.50
```

From the returns data, we can calculate the correlation coefficient. If the correlation coefficient is close to 1, it indicates a strong positive linear relationship, meaning both Nifty and ABC Corp. stock tend to move in the same direction. If it’s close to -1, it indicates a strong negative linear relationship, suggesting they move in opposite directions. If it’s close to 0, there’s no linear relationship.

The correlation coefficient is defined as:

\[ \text{Correlation coefficient} = \frac{\text{Covariance}(\text{Nifty}, \text{ABC Corp})}{\text{Standard deviation}(\text{Nifty}) \times \text{Standard deviation}(\text{ABC Corp})} \]

Now, let’s calculate the correlation coefficient:

```
Covariance(Nifty, ABC Corp) = Σ [(Nifty Return - Mean(Nifty Return)) * (ABC Corp Return - Mean(ABC Corp Return))] / (Number of Observations - 1)
Standard deviation(Nifty) = sqrt[Σ (Nifty Return - Mean(Nifty Return))^2 / (Number of Observations - 1)]
Standard deviation(ABC Corp) = sqrt[Σ (ABC Corp Return - Mean(ABC Corp Return))^2 / (Number of Observations - 1)]
```

After calculating the covariance and standard deviations, we can plug these values into the formula for correlation coefficient.
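
As a quick illustration, the sketch below computes both the price-level and the returns-based correlation using only the Python standard library; the variable names are our own and the printed values are approximate.

```python
from statistics import mean, stdev

# Prices from the question
nifty = [1692, 1978, 1884, 2151, 2519]
abc = [68, 102, 110, 112, 154]

def pct_returns(prices):
    """Year-on-year percentage returns."""
    return [(curr - prev) / prev * 100 for prev, curr in zip(prices, prices[1:])]

def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

print(pearson(nifty, abc))                            # price levels: roughly 0.95
print(pearson(pct_returns(nifty), pct_returns(abc)))  # yearly returns: roughly 0.58
```

On these figures both measures come out clearly positive, which supports a positive directional relationship between ABC Corp. and the Nifty.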

If the correlation coefficient is positive and close to 1, it suggests that ABC Corp. stock tends to move in the same direction as the Nifty, indicating lower unsystematic risk when added to the portfolio. Conversely, if it’s negative or close to 0, it suggests no clear relationship or a negative relationship, which might not be favorable for Ramesh’s portfolio.

Based on this analysis, Ramesh should invest in ABC Corp. stock if the correlation coefficient is positive and reasonably high, indicating a favorable directional relationship with the Nifty. If it’s negative or close to 0, Ramesh may want to reconsider adding ABC Corp. stock to his portfolio, as it might increase unsystematic risk.


Q.2) For the given data points, measure the first quartile (Q1), second quartile (Q2), and third quartile (Q3); compute the interquartile range (IQR) and list the outliers: 90, 1, 17, 3, 16, 5, 20, 75, 14, 15, 5, 25, 30, 40, 1, 13, 80.

Answer :

To find the quartiles and interquartile range (IQR) for the given data set, and to identify outliers, we can follow these steps:

  1. Sort the data.
  2. Find the quartiles (Q1, Q2, Q3):
    • Q1 (first quartile) is the median of the first half of the data.
    • Q2 (second quartile or median) is the median of the entire data set.
    • Q3 (third quartile) is the median of the second half of the data.
  3. Calculate the IQR: IQR = Q3 – Q1.
  4. Identify the outliers: Any data point outside the range \([Q1 - 1.5 \cdot \text{IQR},\ Q3 + 1.5 \cdot \text{IQR}]\) is considered an outlier.

Let’s perform these calculations step-by-step:

Step 1: Sort the data

The given data set is: 90, 1, 17, 3, 16, 5, 20, 75, 14, 15, 5, 25, 30, 40, 1, 13, 80

Sorted data set: 1, 1, 3, 5, 5, 13, 14, 15, 16, 17, 20, 25, 30, 40, 75, 80, 90

Step 2: Find the quartiles

  • Q2 (median): Since there are 17 data points (an odd number), the median is the 9th value: \(Q2 = 16\)
  • Q1: The first quartile is the median of the first half (1, 1, 3, 5, 5, 13, 14, 15). There are 8 data points, so the median is the average of the 4th and 5th values: \(Q1 = \frac{5 + 5}{2} = 5\)
  • Q3: The third quartile is the median of the second half (17, 20, 25, 30, 40, 75, 80, 90). There are 8 data points, so the median is the average of the 4th and 5th values: \(Q3 = \frac{30 + 40}{2} = 35\)

Step 3: Calculate the IQR

\[ \text{IQR} = Q3 - Q1 = 35 - 5 = 30 \]

Step 4: Identify the outliers

Outliers are data points outside the range \([Q1 - 1.5 \cdot \text{IQR},\ Q3 + 1.5 \cdot \text{IQR}] = [5 - 1.5 \cdot 30,\ 35 + 1.5 \cdot 30] = [5 - 45,\ 35 + 45] = [-40,\ 80]\)

Any data point less than -40 or greater than 80 is considered an outlier. In this data set, the only outlier is: 90

Summary

  • Q1: 5
  • Q2 (Median): 16
  • Q3: 35
  • IQR: 30
  • Outliers: 90
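
For reference, here is a small Python sketch of the same calculation. Note that library routines such as numpy.percentile use interpolation rules that can differ slightly from the median-of-halves convention used above.

```python
from statistics import median

data = [90, 1, 17, 3, 16, 5, 20, 75, 14, 15, 5, 25, 30, 40, 1, 13, 80]
data.sort()

n = len(data)
q2 = median(data)
# Median-of-halves convention: exclude the middle value when n is odd
q1 = median(data[: n // 2])
q3 = median(data[(n + 1) // 2 :])

iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low or x > high]

print(q1, q2, q3, iqr, outliers)  # 5 16 35 30 [90]
```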

Q.3) a. Explain the importance of normalization in data preprocessing
[1 Marks]
b. Normalize the data given in the table for the range [0,1]
[4 Marks]
LOAN AMOUNT   INTEREST RATE   APPLICANT INCOME   CREDIT SCORE
5000          5.5             45000              700
10000         7.0             54000              650
15000         4.5             60000              720

Answer :

### a. Importance of Normalization in Data Preprocessing

Normalization is a crucial step in data preprocessing because it ensures that all features contribute equally to the analysis and modeling processes. Here are some key reasons why normalization is important:

1. **Improves Convergence of Algorithms**: Many machine learning algorithms, especially those based on gradient descent, converge faster when the features are normalized. This is because normalization scales the features to a similar range, leading to a more stable and efficient optimization process.

2. **Prevents Dominance of Features**: Without normalization, features with larger ranges may dominate those with smaller ranges. This can skew the results of algorithms that rely on distance measurements, like k-nearest neighbors or support vector machines.

3. **Enhances Model Performance**: Normalization can improve the performance of models by ensuring that each feature contributes equally to the prediction. This can lead to more accurate and reliable models.

4. **Facilitates Comparison**: Normalized data allows for better comparison between different features and data points. It ensures that the scale of the data does not affect the comparison.

### b. Normalize the Data for the Range [0,1]

To normalize the data in the range [0,1], we use the following formula:

\[ \text{Normalized value} = \frac{x - \min(x)}{\max(x) - \min(x)} \]

Let’s normalize the given data.

Data:

LOAN AMOUNT INTEREST RATE APPLICANT INCOME CREDIT SCORE
5000 5.5 45000 700
10000 7.0 54000 650
15000 4.5 60000 720

**Normalization:**

1. **Loan Amount:**
– Min = 5000
– Max = 15000
\[
\begin{align*}
\text{Normalized } 5000 &= \frac{5000 - 5000}{15000 - 5000} = 0 \\
\text{Normalized } 10000 &= \frac{10000 - 5000}{15000 - 5000} = 0.5 \\
\text{Normalized } 15000 &= \frac{15000 - 5000}{15000 - 5000} = 1 \\
\end{align*}
\]

2. **Interest Rate:**
– Min = 4.5
– Max = 7.0
\[
\begin{align*}
\text{Normalized } 5.5 &= \frac{5.5 - 4.5}{7.0 - 4.5} = \frac{1}{2.5} = 0.4 \\
\text{Normalized } 7.0 &= \frac{7.0 - 4.5}{7.0 - 4.5} = 1 \\
\text{Normalized } 4.5 &= \frac{4.5 - 4.5}{7.0 - 4.5} = 0 \\
\end{align*}
\]

3. **Applicant Income:**
– Min = 45000
– Max = 60000
\[
\begin{align*}
\text{Normalized } 45000 &= \frac{45000 - 45000}{60000 - 45000} = 0 \\
\text{Normalized } 54000 &= \frac{54000 - 45000}{60000 - 45000} = \frac{9000}{15000} = 0.6 \\
\text{Normalized } 60000 &= \frac{60000 - 45000}{60000 - 45000} = 1 \\
\end{align*}
\]

4. **Credit Score:**
– Min = 650
– Max = 720
\[
\begin{align*}
\text{Normalized } 700 &= \frac{700 - 650}{720 - 650} = \frac{50}{70} \approx 0.714 \\
\text{Normalized } 650 &= \frac{650 - 650}{720 - 650} = 0 \\
\text{Normalized } 720 &= \frac{720 - 650}{720 - 650} = 1 \\
\end{align*}
\]

Normalized Data:

LOAN AMOUNT INTEREST RATE APPLICANT INCOME CREDIT SCORE
0.0 0.4 0.0 0.714
0.5 1.0 0.6 0.0
1.0 0.0 1.0 1.0
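
The same scaling can be reproduced with a few lines of Python; the column names below are our own, and the values are rounded to three decimals.

```python
# Min-max normalization of each column to the range [0, 1]
columns = {
    "loan_amount": [5000, 10000, 15000],
    "interest_rate": [5.5, 7.0, 4.5],
    "applicant_income": [45000, 54000, 60000],
    "credit_score": [700, 650, 720],
}

def min_max(values):
    """Scale each value to (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [round((x - lo) / (hi - lo), 3) for x in values]

for name, vals in columns.items():
    print(name, min_max(vals))
# loan_amount [0.0, 0.5, 1.0]
# interest_rate [0.4, 1.0, 0.0]
# applicant_income [0.0, 0.6, 1.0]
# credit_score [0.714, 0.0, 1.0]
```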

 


Q.4)

Trace the results of using the Apriori algorithm on the grocery store example with support threshold s = 33.34% and confidence threshold c = 60%. Show the candidate and frequent itemsets for each database scan. Enumerate all the final frequent itemsets. [5 Marks]

Transaction   Items
T1            HotDogs, Buns, Ketchup
T2            HotDogs, Buns
T3            HotDogs, Coke, Chips
T4            Chips, Coke
T5            Chips, Ketchup
T6            HotDogs, Coke, Chips

Answer :

The Apriori algorithm is a classic algorithm used for mining frequent itemsets and relevant association rules. It uses a bottom-up approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm uses a breadth-first search and a hash tree structure to count candidate item sets efficiently.

Given the transactions and the support and confidence thresholds, we’ll apply the Apriori algorithm step by step.

### Step 1: Calculate the Support Count

First, we list all transactions and count the occurrences of each item.

– T1: {HotDogs, Buns, Ketchup}
– T2: {HotDogs, Buns}
– T3: {HotDogs, Coke, Chips}
– T4: {Chips, Coke}
– T5: {Chips, Ketchup}
– T6: {HotDogs, Coke, Chips}

The total number of transactions (N) is 6.

### Step 2: Generate Candidate Itemsets of Length 1 (C1)

We list all unique items and count their occurrences.

– HotDogs: 4
– Buns: 2
– Ketchup: 2
– Coke: 3
– Chips: 4

### Step 3: Generate Frequent Itemsets of Length 1 (L1)

Using the support threshold \( s = 33.34\% \), we calculate the minimum support count:
\[ \text{Minimum Support Count} = \frac{33.34}{100} \times 6 \approx 2 \]

Thus, all items with a support count of at least 2 are frequent:

– HotDogs: 4 (66.67%)
– Buns: 2 (33.34%)
– Ketchup: 2 (33.34%)
– Coke: 3 (50.00%)
– Chips: 4 (66.67%)

### Step 4: Generate Candidate Itemsets of Length 2 (C2)

We generate pairs of frequent items:

– {HotDogs, Buns}: 2
– {HotDogs, Ketchup}: 1
– {HotDogs, Coke}: 2
– {HotDogs, Chips}: 2
– {Buns, Ketchup}: 1
– {Buns, Coke}: 0
– {Buns, Chips}: 0
– {Ketchup, Coke}: 0
– {Ketchup, Chips}: 1
– {Coke, Chips}: 3

### Step 5: Generate Frequent Itemsets of Length 2 (L2)

Select itemsets from C2 with a support count of at least 2:

– {HotDogs, Buns}: 2 (33.34%)
– {HotDogs, Coke}: 2 (33.34%)
– {HotDogs, Chips}: 2 (33.34%)
– {Coke, Chips}: 3 (50.00%)

### Step 6: Generate Candidate Itemsets of Length 3 (C3)

We join frequent 2-itemsets from L2 that share a common item, then prune any candidate containing an infrequent 2-item subset:

– {HotDogs, Buns, Coke}: pruned, since {Buns, Coke} is not frequent
– {HotDogs, Buns, Chips}: pruned, since {Buns, Chips} is not frequent
– {HotDogs, Coke, Chips}: all 2-item subsets are frequent; support count 2

### Step 7: Generate Frequent Itemsets of Length 3 (L3)

Select itemsets from C3 with a support count of at least 2:

– {HotDogs, Coke, Chips}: 2 (33.34%)

### Final Frequent Itemsets

Combining the results from L1, L2, and L3, the final frequent itemsets are:

– {HotDogs} (4)
– {Buns} (2)
– {Ketchup} (2)
– {Coke} (3)
– {Chips} (4)
– {HotDogs, Buns} (2)
– {HotDogs, Coke} (2)
– {HotDogs, Chips} (2)
– {Coke, Chips} (3)
– {HotDogs, Coke, Chips} (2)

These itemsets meet the support threshold. To generate association rules, each frequent itemset would need to be evaluated against the confidence threshold, but the problem specifically requested the itemsets only.
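
As an illustration, the level-wise search can be reproduced with a short Python sketch. This is a minimal version (the join step generates candidates without a separate subset-pruning pass, and the variable names are our own); it prints the same frequent itemsets and support counts listed above.

```python
from itertools import combinations

# Transactions from the question
transactions = [
    {"HotDogs", "Buns", "Ketchup"},
    {"HotDogs", "Buns"},
    {"HotDogs", "Coke", "Chips"},
    {"Chips", "Coke"},
    {"Chips", "Ketchup"},
    {"HotDogs", "Coke", "Chips"},
]
min_count = 2  # 33.34% of 6 transactions, taken as 2 whole transactions

def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Level-wise search: C1 is all single items; each frequent level Lk is joined into C(k+1)
candidates = [frozenset([item]) for item in sorted({i for t in transactions for i in t})]
frequent = {}
k = 1
while candidates:
    counts = {c: support_count(c) for c in candidates}
    level = {c: n for c, n in counts.items() if n >= min_count}  # frequent k-itemsets (Lk)
    frequent.update(level)
    # Join step: union pairs of frequent k-itemsets that form a (k + 1)-itemset
    candidates = list({a | b for a, b in combinations(level, 2) if len(a | b) == k + 1})
    k += 1

for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```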


Q.5)

a. What is the entropy of this collection of training examples with respect to the target class?
b. What are the information gains of a1 and a2 relative to these training examples?

Instance a1 a2 a3 Target Class
1 T T 1.0 Positive
2 T T 6.0 Positive
3 T F 5.0 Negative
4 F F 4.0 Positive
5 F T 7.0 Negative
6 F T 3.0 Negative
7 F F 8.0 Negative
8 T F 7.0 Negative
9 F T 5.0 Positive

Answer :

a. **Entropy of the Target Class**:

To calculate the entropy, we use the formula:

\[ \text{Entropy} = - \sum_{i=1}^{n} p_i \log_2(p_i) \]

Where \( p_i \) represents the proportion of instances of class \( i \) in the dataset.

Given the target class data:

– Positive: 4 instances
– Negative: 5 instances

Total instances: 9

\[ p_{\text{Positive}} = \frac{4}{9} \]

\[ p_{\text{Negative}} = \frac{5}{9} \]

\[ \text{Entropy(Target Class)} = - \left( \frac{4}{9} \log_2\left(\frac{4}{9}\right) + \frac{5}{9} \log_2\left(\frac{5}{9}\right) \right) \]

\[ \text{Entropy(Target Class)} \approx 0.991 \]

b. **Information Gain for \( a1 \) and \( a2 \)**:

To calculate the information gain for each attribute, we first need to split the dataset based on that attribute and calculate the entropy for each subset. Then, we calculate the weighted average of the entropies of the subsets.

For \( a1 \) (from the table, a1 = T for instances 1, 2, 3, 8 and a1 = F for instances 4, 5, 6, 7, 9):

– True (T): 4 instances
– False (F): 5 instances

Subsets based on \( a1 \):
– Subset 1 (a1 = T): 2 positive, 2 negative
– Subset 2 (a1 = F): 2 positive, 3 negative

Entropy of subset 1:
\[ \text{Entropy(T)} = - \left( \frac{2}{4} \log_2\left(\frac{2}{4}\right) + \frac{2}{4} \log_2\left(\frac{2}{4}\right) \right) = 1.0 \]

Entropy of subset 2:
\[ \text{Entropy(F)} = - \left( \frac{2}{5} \log_2\left(\frac{2}{5}\right) + \frac{3}{5} \log_2\left(\frac{3}{5}\right) \right) \approx 0.971 \]

Weighted average entropy for \( a1 \):
\[ \text{Entropy}(a1) = \frac{4}{9} \times 1.0 + \frac{5}{9} \times 0.971 \approx 0.984 \]

Similarly, for \( a2 \):

– True (T): 5 instances (3 positive, 2 negative), so \( \text{Entropy(T)} \approx 0.971 \)
– False (F): 4 instances (1 positive, 3 negative), so \( \text{Entropy(F)} \approx 0.811 \)

\[ \text{Entropy}(a2) = \frac{5}{9} \times 0.971 + \frac{4}{9} \times 0.811 \approx 0.900 \]

Finally, the information gain for each attribute is:

\[ \text{Information Gain}(a) = \text{Entropy(Target Class)} - \text{Entropy}(a) \]

\[ \text{Information Gain}(a1) = 0.991 - 0.984 \approx 0.007 \]
\[ \text{Information Gain}(a2) = 0.991 - 0.900 \approx 0.091 \]

Comparing the two, \( a2 \) provides the higher information gain and is therefore the better attribute for splitting the dataset.
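
These numbers can be checked with a short Python sketch; the data layout and function names here are our own.

```python
from math import log2

# Training examples from the question: (a1, a2, target class)
data = [
    ("T", "T", "+"), ("T", "T", "+"), ("T", "F", "-"),
    ("F", "F", "+"), ("F", "T", "-"), ("F", "T", "-"),
    ("F", "F", "-"), ("T", "F", "-"), ("F", "T", "+"),
]

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels))

def info_gain(index):
    """Information gain of splitting on attribute column `index` (0 = a1, 1 = a2)."""
    targets = [row[-1] for row in data]
    weighted = 0.0
    for value in {row[index] for row in data}:
        subset = [row[-1] for row in data if row[index] == value]
        weighted += len(subset) / len(data) * entropy(subset)
    return entropy(targets) - weighted

print(round(entropy([row[-1] for row in data]), 3))  # overall entropy, about 0.991
print(round(info_gain(0), 3))                        # gain of a1, about 0.007
print(round(info_gain(1), 3))                        # gain of a2, about 0.091
```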


Q.6) Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio).
a. Number of students in WIMS Cohort
b. BitsID of the students of WIMS 2022 Batch
c. Grades (A, A…E) of the Students in Data Mining Course
d. Age in years of the students of this Wipro Batch
e. Marks of the students in Data Mining Course

Answer :

Let’s classify each attribute based on its nature and type:

1. **Number of students in WIMS Cohort**
– **Type**: Discrete
– **Qualitative/Quantitative**: Quantitative (Ratio)
– **Explanation**: The number of students is a countable quantity, and it has a meaningful zero (no students). The difference and ratio between counts are meaningful.

2. **BitsID of the students of WIMS 2022 Batch**
– **Type**: Discrete
– **Qualitative/Quantitative**: Qualitative (Nominal)
– **Explanation**: BitsID is an identifier and categorizes students uniquely. It does not have a meaningful order or scale.

3. **Grades (A, A…E) of the Students in Data Mining Course**
– **Type**: Discrete
– **Qualitative/Quantitative**: Qualitative (Ordinal)
– **Explanation**: Grades are discrete categories with a meaningful order (A > B > C, etc.), but the differences between the grades are not necessarily consistent.

4. **Age in years of the students of this Wipro Batch**
– **Type**: Continuous
– **Qualitative/Quantitative**: Quantitative (Ratio)
– **Explanation**: Age is a continuous variable measured in years. It has a meaningful zero point (birth), and differences and ratios are meaningful.

5. **Marks of the students in Data Mining Course**
– **Type**: Continuous
– **Qualitative/Quantitative**: Quantitative (Ratio)
– **Explanation**: Marks are measured on a continuous scale, have a meaningful zero, and allow for the calculation of differences and ratios.

In summary:

Attribute Type Qualitative/Quantitative
Number of students in WIMS Cohort Discrete Quantitative (Ratio)
BitsID of the students of WIMS 2022 Batch Discrete Qualitative (Nominal)
Grades (A, A…E) of the Students in Data Mining Course Discrete Qualitative (Ordinal)
Age in years of the students of this Wipro Batch Continuous Quantitative (Ratio)
Marks of the students in Data Mining Course Continuous Quantitative (Ratio)
