Equity Portfolio Cluster Analysis
I am working on a little tool to help identify clusters in an equity portfolio. Ideally, I want to identify highly correlated ‘pockets’ of holdings, such that the holdings within each cluster provide little diversification from one another. Mentally, I imagine that a PCA of the correlation matrix of log differences (log returns) for an individual cluster would have one eigenvector, representing a roughly parallel shift of all the cluster’s holdings, that explains the majority of the variance.
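To make that intuition concrete, here is a minimal sketch of the check I have in mind. The data is synthetic and the function name `explained_variance_ratios` is made up for illustration; it assumes a price array with one column per holding and uses only NumPy.

```python
import numpy as np

def explained_variance_ratios(prices):
    """Eigenvalue shares of the correlation matrix of log returns.

    prices: array of shape (n_days, n_holdings), strictly positive.
    Returns the shares sorted from largest to smallest; they sum to 1.
    """
    log_returns = np.diff(np.log(prices), axis=0)
    corr = np.corrcoef(log_returns, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # eigvalsh returns ascending order
    return eigenvalues / eigenvalues.sum()

# Synthetic 'cluster' (an assumption for illustration): five holdings
# driven by one shared factor plus small idiosyncratic noise.
rng = np.random.default_rng(0)
n_days, n_holdings = 500, 5
common = rng.normal(0.0, 0.01, size=(n_days, 1))
idio = rng.normal(0.0, 0.003, size=(n_days, n_holdings))
prices = 100.0 * np.exp(np.cumsum(common + idio, axis=0))

ratios = explained_variance_ratios(prices)
print(ratios.round(3))
```

For a tight cluster like this one, the first share should dominate; for a set of unrelated holdings the shares would be much flatter.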
My first attempts have had limited success…
1) Perform a PCA decomposition of the correlation matrix, then regress each column of the correlation matrix against the eigenvectors (scaled by their eigenvalues). After each regression, I find the coefficient with the largest absolute value (for the time being, I ignored ‘significance’). This coefficient identifies that holding’s associated cluster. The problem, of course, is that each stock ends up associated with an eigenvector that contributes very little to the overall explained variance.
2) Similar to above, but I didn’t scale the eigenvectors. This ends up with all the holdings being associated with 1 vector, which on inspection was basically the ‘market’ vector (i.e. the linear shift of all holdings).
3) Use PCA to identify the number of eigenvectors required to explain 95% of the variance, and use hierarchical centroid clustering (using the correlation matrix rows as my ‘points’ in n-space) to form that many clusters. The issue here is that for a larger portfolio (say, 50 stocks), the eigenvalues fall off very precipitously, and I end up with 30-40 eigenvectors that each explain ~1.5% of the variance, and therefore with ~30-40 clusters.
4) Skip the PCA, and just use hierarchical centroid clustering, using the rows of the correlation matrix as my points in n-space, with a ‘maximum distance’ criterion that disallows merging points that are ‘too far’ apart. However, without any principled way to decide what this ‘maximum distance’ should be, this didn’t feel very good.
5) I realized that the problem with using my correlation matrix rows as my points in n-space was that as the number of holdings increased, the correlation between two individual holdings mattered less and less. That is, if I have two holdings that have similar correlations with every other holding and a high correlation with each other, I want them clustered. But if they have a low correlation with each other (which typically implies, if their other correlations are ‘similar’, that the other correlations are low), I do NOT want them clustered. However, when the rows of the correlation matrix are the points, each dimension matters less and less as the number of holdings increases, so the single dimension that captures their low correlation with each other barely affects the distance between them.
To fix this, I mapped my correlation matrix into n-space (using optimization) such that the cosine between two holdings’ vectors in n-space was as close as possible to their correlation. This fixed the problem described in 5), but I still don’t know how to decide how many clusters to select. Furthermore, the optimization seems to be quite slow (though this is a lower-priority problem).
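For reference, here is a minimal sketch of the embedding-then-cluster idea from 4) and 5). Two things differ from what I actually ran, so treat them as assumptions: the embedding comes from an eigendecomposition instead of an iterative optimization (if the correlation matrix is positive semidefinite, the rows of V·sqrt(Λ) are unit vectors whose cosines equal the correlations exactly), and the linkage is average rather than centroid. The names `embed_correlation`, `cluster_embedded` and `min_corr` are made up for illustration, and the example matrix is synthetic.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def embed_correlation(corr):
    """Return one row vector per holding such that X @ X.T reproduces corr."""
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    eigenvalues = np.clip(eigenvalues, 0.0, None)  # guard against tiny negative round-off
    return eigenvectors * np.sqrt(eigenvalues)     # scales each eigenvector column

def cluster_embedded(points, min_corr):
    """Average-linkage clustering, cut at the distance implied by min_corr.

    For unit vectors, Euclidean distance d and correlation rho are related
    by d = sqrt(2 * (1 - rho)), so a correlation floor maps to a distance cap.
    """
    tree = linkage(points, method="average", metric="euclidean")
    cutoff = np.sqrt(2.0 * (1.0 - min_corr))
    return fcluster(tree, t=cutoff, criterion="distance")

# Synthetic example: two pairs, 0.8 correlation within a pair, 0.1 across pairs.
corr = np.array([
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
])
points = embed_correlation(corr)
labels = cluster_embedded(points, min_corr=0.5)
print(labels)
```

Because the distance depends only on the pairwise correlation, the dilution described in 5) does not occur. The `min_corr` cutoff is still an arbitrary choice, though, so this does not answer how many clusters there should be.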
In summary…
Given a portfolio of equity holdings, how would you recommend I go about identifying a) how many clusters exist in the portfolio and b) what those clusters are?
These are the problems I am stuck on…