Equity Portfolio Cluster Analysis

I am working on a little tool to help identify clusters in an equity portfolio. Ideally, I want to identify highly correlated ‘pockets’ of holdings, such that the holdings in each cluster do not provide much diversification from each other. Mentally, I imagine that a PCA of the correlation matrix of log differences for an individual cluster would have one eigenvector, representing a linear shift of all its holdings, that explains the majority of the variance.
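To make that mental picture concrete, here is a minimal sketch that measures how much of a cluster's variance the first principal component explains. The function name `first_component_share` and the synthetic prices (one common factor plus small idiosyncratic noise) are assumptions for illustration only, not part of the tool described here.

```python
import numpy as np

def first_component_share(prices):
    """Share of total variance explained by the first principal component
    of the correlation matrix of log price differences.

    prices: 2-D array, rows are dates and columns are holdings.
    """
    log_returns = np.diff(np.log(prices), axis=0)
    corr = np.corrcoef(log_returns, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)  # returned in ascending order
    return eigenvalues[-1] / eigenvalues.sum()

# Synthetic 'cluster': five holdings driven mostly by one common factor.
rng = np.random.default_rng(0)
common = rng.normal(0.0, 0.01, size=(500, 1))
idiosyncratic = rng.normal(0.0, 0.003, size=(500, 5))
prices = 100.0 * np.exp(np.cumsum(common + idiosyncratic, axis=0))

share = first_component_share(prices)
print(f"first component explains {share:.1%} of the variance")
```

For a tight cluster this share should be close to 1; for unrelated holdings it should be close to 1/n.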

My first attempts have been limited in success…

1) Perform a PCA decomposition of the correlation matrix, then regress each column of the correlation matrix against the eigenvectors (scaled by their eigenvalues). After each regression, I find the coefficient with the largest absolute value (for the time being, I ignored ‘significance’). This coefficient identifies that holding’s associated cluster. The problem, of course, is that each stock becomes associated with an eigenvector that has a very low contribution to the overall explained variance.

2) Similar to above, but I didn’t scale the eigenvectors. This ends up with all the holdings being associated with 1 vector, which on inspection was basically the ‘market’ vector (i.e. the linear shift of all holdings).

3) Use PCA to identify the number of eigenvectors required to explain 95% of the variance, and use hierarchical centroid clustering (using the correlation matrix rows as my ‘points’ in n-space) with that number of clusters. The issue here is that for a larger portfolio (say, 50 stocks), the eigenvalues fall off very precipitously, and I end up with 30-40 eigenvectors that each explain ~1.5% of the variance, and therefore with ~30-40 clusters.

4) Skip the PCA, and just use hierarchical centroid clustering, using the rows of the correlation matrix as my points in n-space, with a ‘maximum distance’ criterion that does not allow clustering if points are ‘too far’ apart. However, without any way to decide what this ‘maximum distance’ should be, this didn’t feel very good.

5) I realized that the problem with using my correlation matrix rows as my points in n-space was that as the number of holdings increased, the correlation between two individual holdings mattered less and less. That is, if I had two holdings that had similar correlations with every other holding, and a high correlation with each other, I want them clustered. But if they have a low correlation with each other (which typically implies, if their other correlations are ‘similar’, that the other correlations are low), I do NOT want them clustered. However, using the correlation matrix rows, each dimension matters less and less as the number of holdings increases, so the single dimension identifying their low correlation with each other barely comes into play.

To fix this, I mapped my correlation matrix into n-space (using optimization) such that the cosine between two holdings’ vectors in n-space was as close as possible to their correlation. This fixed the problems associated with the last issue, but I still don’t know how to decide how many clusters to select. Furthermore, the optimization seems to be quite slow (though this is a fairly low-priority problem).
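As a possible alternative to the slow optimization, an eigendecomposition can produce such a mapping directly, assuming the correlation matrix is positive semi-definite (which holds for a sample correlation matrix computed from complete data). This is a sketch of a swapped-in technique, not the optimization used above; `embed_correlation` and the 3x3 example matrix are made up for illustration.

```python
import numpy as np

def embed_correlation(corr):
    """Return one row vector per holding such that the cosine between
    rows i and j equals corr[i, j] (assumes corr is positive semi-definite)."""
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    # Clip tiny negative eigenvalues caused by floating-point round-off.
    eigenvalues = np.clip(eigenvalues, 0.0, None)
    # Scale each eigenvector column by sqrt(eigenvalue): X @ X.T == corr.
    return eigenvectors * np.sqrt(eigenvalues)

# Illustrative correlation matrix (hypothetical values).
corr = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])

points = embed_correlation(corr)
norms = np.linalg.norm(points, axis=1, keepdims=True)
cosines = (points @ points.T) / (norms @ norms.T)
print(np.round(cosines, 6))
```

Because every row has unit length, the squared Euclidean distance between two rows is 2(1 - correlation), so clustering these points is equivalent to clustering on the distance sqrt(2(1 - correlation)) without building the embedding at all.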

In summary…

Given a portfolio of equity holdings, how would you recommend I go about identifying a) how many clusters exist in the portfolio and b) what those clusters are?

These are the problems I am stuck on…

5 Responses to “Equity Portfolio Cluster Analysis”

  • jb Says:

    I’m a newbie but found this problem interesting so I’m throwing a couple of thoughts (uneducated guesses) out there.

    Why not try a range for the number of clusters and choose the one which has the max average correlation across clusters (or some other metric)? Perhaps a simpler kNN algorithm would work OK too?

    I was trying to find out if an approach based on cointegration works. If you find out the cointegration vectors on a per holding basis with the dependent variables being the rest of the holdings, can the cointegration vectors be used somehow to identify clusters?
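A rough sketch of the "try a range of cluster counts and keep the best-scoring one" idea from this comment. Everything specific here is an assumption for illustration: the score (mean within-cluster correlation minus mean between-cluster correlation), average-linkage clustering on the distance sqrt(2(1 - correlation)), and the 6x6 example matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_score(corr, labels):
    """Hypothetical metric: mean within-cluster correlation minus mean
    between-cluster correlation (higher is better)."""
    same = labels[:, None] == labels[None, :]
    off_diagonal = ~np.eye(len(labels), dtype=bool)
    within = corr[same & off_diagonal]
    between = corr[~same]
    if within.size == 0 or between.size == 0:
        return -np.inf  # all singletons, or a single cluster
    return within.mean() - between.mean()

def best_cluster_count(corr, max_clusters):
    """Scan k = 2..max_clusters and return (best k, labels for that k)."""
    dist = np.sqrt(np.clip(2.0 * (1.0 - corr), 0.0, None))
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    scores = {}
    for k in range(2, max_clusters + 1):
        labels = fcluster(tree, t=k, criterion="maxclust")
        scores[k] = cluster_score(corr, labels)
    best_k = max(scores, key=scores.get)
    return best_k, fcluster(tree, t=best_k, criterion="maxclust")

# Illustrative portfolio: two pockets of three holdings each.
corr = np.full((6, 6), 0.1)
corr[0, 1] = 0.80; corr[0, 2] = 0.75; corr[1, 2] = 0.70
corr[3, 4] = 0.85; corr[3, 5] = 0.80; corr[4, 5] = 0.70
corr = np.triu(corr, 1) + np.triu(corr, 1).T  # symmetrize
np.fill_diagonal(corr, 1.0)

best_k, labels = best_cluster_count(corr, max_clusters=5)
print(best_k, labels)
```

The scan reuses one linkage tree for every k, so it is cheap compared with re-clustering from scratch; the choice of metric remains the open question.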

  • Nik Says:

    Factor analysis is one way to go. Rotate until more factors do not add any more significance (based on your personal choice of significance threshold). As far as I remember I gave you a reply on NP on this matter a while ago. For the singular value decomposition, you can use a scree test rather than trying to explain a certain amount of variance; that should reduce your number of factors massively in large portfolios.

    -Nik
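A scree test is normally read off a plot by eye, but two numerical stand-ins are sometimes used instead: the Kaiser rule (keep eigenvalues above 1) and the largest gap between consecutive eigenvalues. The sketch below shows both; they are substitutes named plainly here, not the scree test itself, and `count_factors` and the block-structured example matrix are assumptions for illustration.

```python
import numpy as np

def count_factors(corr):
    """Return (kaiser_count, elbow_count) for a correlation matrix.

    kaiser_count: number of eigenvalues greater than 1.
    elbow_count:  number of eigenvalues before the largest drop
                  between consecutive (descending) eigenvalues.
    """
    eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]  # descending
    kaiser_count = int(np.sum(eigenvalues > 1.0))
    gaps = eigenvalues[:-1] - eigenvalues[1:]
    elbow_count = int(np.argmax(gaps)) + 1
    return kaiser_count, elbow_count

# Illustrative matrix: two pockets of three holdings (0.8 within, 0.1 between).
corr = np.full((6, 6), 0.1)
corr[:3, :3] = 0.8
corr[3:, 3:] = 0.8
np.fill_diagonal(corr, 1.0)

kaiser_count, elbow_count = count_factors(corr)
print(kaiser_count, elbow_count)
```

On real portfolios the two rules can disagree, and a dominant ‘market’ eigenvalue can pull the largest gap to the very first position, so the output is a starting point rather than an answer.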

  • Corey Says:

    Thanks for the suggestions!

    @jb: I am trying to save ‘brute-force’ as a last resort, though your method of choosing the number of clusters that maximizes a metric among clusters definitely seems feasible. I just have to figure out what sort of metric I would choose.

    @nik: Factor analysis is definitely a possibility, though I was playing with PCA because it allowed me to plot the components and ‘identify’ what they represented, so I could play with my methodology to try to match what I was seeing. Factor analysis loses that ability, though I can still compare my results to what I see with PCA.

    The scree test may not be a possibility, because I am looking to use a mathematical method…

  • jb Says:

    What/where is NP?
