Classification of Pottery from pre-classical sites in Italy, using Euclidean and Mahalanobis distance measures

Measurements of elemental composition were performed on 58 samples of pottery from Southern Italy, divided into two groups: A (black carbon-containing bulks) and B (clayey ones). The data are as follows.

Sample Ti(%) Sr(ppm) Ba(ppm) Mn(ppm) Cr(ppm) Ca(%) Al(%) Fe(%) Mg(%) Na(%) K(%) Class
A1 0.304 181 1007 642 60 1.640 8.342 3.542 0.458 0.548 1.799 A
A2 0.316 194 1246 792 64 2.017 8.592 3.696 0.509 0.537 1.816 A
A3 0.272 172 842 588 48 1.587 7.886 3.221 0.540 0.608 1.970 A
A4 0.301 147 843 526 62 1.032 8.547 3.455 0.546 0.664 1.908 A
A5 0.908 129 913 775 184 1.334 11.229 4.637 0.395 0.429 1.521 A
E1 0.394 105 1470 1377 90 1.370 10.344 4.543 0.408 0.411 2.025 A
E2 0.359 96 1188 839 86 1.396 9.537 4.099 0.427 0.482 1.929 A
E3 0.406 137 1485 1924 90 1.731 10.139 4.490 0.502 0.415 1.930 A
E4 0.418 133 1174 1325 91 1.432 10.501 4.641 0.548 0.500 2.081 A
L1 0.360 111 410 652 70 1.129 9.802 4.280 0.738 0.476 2.019 A
L2 0.280 112 1008 838 59 1.458 8.960 3.828 0.535 0.392 1.883 A
L3 0.271 117 1171 681 61 1.456 8.163 3.265 0.521 0.509 1.970 A
L4 0.288 103 915 558 60 1.268 8.465 3.437 0.572 0.479 1.893 A
L5 0.253 102 833 415 193 1.226 7.207 3.102 0.539 0.577 1.972 A
C1 0.303 131 601 1308 65 0.907 8.401 3.743 0.784 0.704 2.473 A
C2 0.264 121 878 921 69 1.164 7.926 3.431 0.636 0.523 2.032 A
C3 0.264 112 1622 1674 63 0.922 7.980 3.748 0.549 0.497 2.291 A
C4 0.252 111 793 750 53 1.171 8.070 3.536 0.599 0.551 2.282 A
C5 0.261 127 851 849 61 1.311 7.819 3.770 0.668 0.508 2.121 A
G8 0.397 177 582 939 61 1.260 8.694 4.146 0.656 0.579 1.941 A
G9 0.246 106 1121 795 53 1.332 8.744 3.669 0.571 0.477 1.803 A
G10 1.178 97 886 530 441 6.290 8.975 6.519 0.323 0.275 0.762 A
G11 0.428 457 1488 1138 85 1.525 9.822 4.367 0.504 0.422 2.055 A
P1 0.259 389 399 443 175 11.609 5.901 3.283 1.378 0.491 2.148 B
P2 0.185 233 456 601 144 11.043 4.674 2.743 0.711 0.464 0.909 B
P3 0.312 277 383 682 138 8.430 6.550 3.660 1.156 0.532 1.757 B
P6 0.183 220 435 594 659 9.978 4.920 2.692 0.672 0.476 0.902 B
P7 0.271 392 427 410 125 12.009 5.997 3.245 1.378 0.527 2.173 B
P8 0.203 247 504 634 117 11.112 5.034 3.714 0.726 0.500 0.984 B
P9 0.182 217 474 520 92 12.922 4.573 2.330 0.590 0.547 0.746 B
P14 0.271 257 485 398 955 11.056 5.611 3.238 0.737 0.458 1.013 B
P15 0.236 228 203 592 83 9.061 6.795 3.514 0.750 0.506 1.574 B
P16 0.288 333 436 509 177 10.038 6.579 4.099 1.544 0.442 2.400 B
P17 0.331 309 460 530 97 9.952 6.267 3.344 1.123 0.519 1.746 B
P18 0.256 340 486 486 132 9.797 6.294 3.254 1.242 0.641 1.918 B
P19 0.292 289 426 531 143 8.372 6.874 3.360 1.055 0.592 1.598 B
P20 0.212 260 486 605 123 9.334 5.343 2.808 1.142 0.595 1.647 B
F1 0.301 320 475 556 142 8.819 6.914 3.597 1.067 0.584 1.635 B
F2 0.305 302 473 573 102 8.913 6.860 3.677 1.365 0.616 2.077 B
F3 0.300 204 192 575 79 7.422 7.663 3.476 1.060 0.521 2.324 B
F4 0.225 181 160 513 94 5.320 7.746 3.342 0.841 0.657 2.268 B
F5 0.306 209 109 536 285 7.866 7.210 3.528 0.971 0.534 1.851 B
F6 0.295 396 172 827 502 9.019 7.775 3.808 1.649 0.766 2.123 B
F7 0.279 230 99 760 129 5.344 7.781 3.535 1.200 0.827 2.305 B
D1 0.292 104 993 723 92 7.978 7.341 3.393 0.630 0.326 1.716 B
D2 0.338 232 687 683 108 4.988 8.617 3.985 1.035 0.697 2.215 B
D3 0.327 155 666 590 70 4.782 7.504 3.569 0.536 0.411 1.490 B
D4 0.233 98 560 678 73 8.936 5.831 2.748 0.542 0.282 1.248 B
M1 0.242 186 182 647 92 5.303 8.164 4.141 0.804 0.734 1.905 B
M2 0.271 473 198 459 89 10.205 6.547 3.035 1.157 0.951 0.828 B
M3 0.207 187 205 587 87 6.473 7.634 3.497 0.763 0.729 1.744 B
G1 0.271 195 472 587 104 5.119 7.657 3.949 0.836 0.671 1.845 B
G2 0.303 233 522 870 130 4.610 8.937 4.195 1.083 0.704 1.840 B
G3 0.166 193 322 498 80 7.633 6.443 3.196 0.743 0.460 1.390 B
G4 0.227 170 718 1384 87 3.491 7.833 3.971 0.783 0.707 1.949 B
G5 0.323 217 267 835 122 4.417 9.017 4.349 1.408 0.730 2.212 B
G6 0.291 272 197 613 86 6.055 7.384 3.343 1.214 0.762 2.056 B
G7 0.461 318 42 653 123 6.986 8.938 4.266 1.579 0.946 1.687 B

Questions

  1. Standardise this matrix, and explain why this transformation is important. Why is it normal to use the population rather than the sample standard deviation? All calculations below should be performed on this standardised data matrix.

  2. Perform PCA, initially calculating 11 PCs, on the data of question 1. What is the total sum of the eigenvalues for all 11 components, and what does this number relate to?

  3. Plot the scores of PC2 versus PC1, using different symbols for classes A and B. Is there a good separation between the classes? One object appears to be an outlier: which one?

  4. Plot the loadings of PC2 versus PC1. Label these with the names of the elements.

  5. Compare the loadings plot to the scores plot. Pick two elements that appear diagnostic of the two classes: these elements will appear in the loadings plot in the same direction as the classes in the scores plot (there may be more than one answer to this question). Plot the standardised readings of these elements against each other, using different symbols for the two classes, and show that reasonable (but not perfect) discrimination is possible.

  6. From the loadings plot, choose a pair of elements that are very poor at discriminating (at right angles to the discriminating direction) and show that the resultant graph of the standardised readings of one element against the other does not provide good discrimination.

  7. Calculate the centroids of class A (excluding the outlier) and class B. Calculate the Euclidean distance of each of the 58 samples to both of these centroids. Produce a class distance plot of the distance to the centroid of class A against the distance to the centroid of class B, indicating the classes using different symbols, and comment.

  8. Determine the variance-covariance matrix for the 11 elements for each of the classes (so there should be two matrices of dimensions 11 × 11), removing the outlier first. Hence calculate the Mahalanobis distance of each sample to each of the class centroids. What is the reason for using Mahalanobis distance rather than Euclidean distance? Produce a class distance plot for this new measure, and comment.

  9. Calculate the %Correctly Classified using the class distances in 8, using the lowest distance to indicate correct classification.
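
The calculations asked for in questions 1, 2 and 7 to 9 can be sketched in Python with NumPy. This is an illustrative sketch under stated assumptions, not a model answer. To keep the listing short it uses only ten rows of the table and three of the eleven variables (Ca, Al and Mg). It takes the eigenvalues as the sums of squares of the scores (other texts divide by N or N - 1), and it uses the sample (N - 1) covariance returned by np.cov for the Mahalanobis distance. These choices, and all function and variable names, are assumptions made here rather than anything prescribed by the questions.

```python
import numpy as np

# Ca%, Al%, Mg% for ten rows copied from the table:
# A1, A2, A3, A4, E1 (class A) and P1, P2, P3, F3, D2 (class B).
X = np.array([
    [1.640, 8.342, 0.458],
    [2.017, 8.592, 0.509],
    [1.587, 7.886, 0.540],
    [1.032, 8.547, 0.546],
    [1.370, 10.344, 0.408],
    [11.609, 5.901, 1.378],
    [11.043, 4.674, 0.711],
    [8.430, 6.550, 1.156],
    [7.422, 7.663, 1.060],
    [4.988, 8.617, 1.035],
])
labels = np.array(["A"] * 5 + ["B"] * 5)


def standardise(data):
    """Centre each column and divide by its population standard deviation."""
    return (data - data.mean(axis=0)) / data.std(axis=0, ddof=0)


def pca(data):
    """PCA via SVD; returns scores, loadings (one PC per row) and eigenvalues.

    Assumption: an eigenvalue is the sum of squares of a score vector.
    """
    u, s, vt = np.linalg.svd(data, full_matrices=False)
    return u * s, vt, s ** 2


def euclidean(data, centroid):
    """Euclidean distance of every row of data to the centroid."""
    return np.sqrt(((data - centroid) ** 2).sum(axis=1))


def mahalanobis(data, centroid, cov):
    """Mahalanobis distance of every row to the centroid, given a class covariance."""
    diff = data - centroid
    inv = np.linalg.inv(cov)
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, inv, diff))


Z = standardise(X)
scores, loadings, eigenvalues = pca(Z)

dist_e, dist_m = {}, {}
for c in ("A", "B"):
    members = Z[labels == c]                 # samples used to model class c
    centroid = members.mean(axis=0)
    cov = np.cov(members, rowvar=False)      # sample (N - 1) covariance, an assumption
    dist_e[c] = euclidean(Z, centroid)
    dist_m[c] = mahalanobis(Z, centroid, cov)

# Assign each sample to the class whose centroid is nearer (Mahalanobis).
predicted = np.where(dist_m["A"] < dist_m["B"], "A", "B")
percent_cc = 100.0 * np.mean(predicted == labels)

print("sum of eigenvalues:", eigenvalues.sum())
print("%CC on this subset:", percent_cc)
```

For the full exercise, X would hold all 58 rows and 11 columns, and the outlier would be left out of class A before its centroid and covariance matrix are computed. Plotting dist_e["A"] against dist_e["B"], or dist_m["A"] against dist_m["B"], with one symbol per class gives the class distance plots.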