Jul 15 2010

Equity Portfolio Cluster Analysis

I am working on a little tool to help identify clusters in an equity portfolio. Ideally, I want to identify highly correlated ‘pockets’ of holdings, such that the holdings in each cluster do not provide much diversification from each other. Mentally, I imagine that a PCA analysis of the correlation of log differences for individual cluster would have 1 vector that represented a linear shift which explains the majority of the variance.

My first attempts have been limited in succes…

1) Perform a PCA decomposition of the correlation matrix, then regress the eigenvectors (scaled by their eigenvalues) against each column of the correlation matrix . After each regression, I find the coefficient with the largest absolute value (for the time being, I ignored ‘significance’). This coefficient identifies that holdings associated cluster. The problem, of course, becomes that each stock becomes associated with an eigenvector that has a very low contribution to the overall explained variance.

2) Similar to above, but I didn’t scale the eigenvectors. This ends up with all the holdings being associated with 1 vector, which on inspection was basically the ‘market’ vector (i.e. the linear shift of all holdings).

3) Use PCA to identify the number of eigenvectors required to explain 95% of the variance, and use hierarchical centroid clustering (using the correlation matrix rows as my ‘points’ in n-space). The issue here is that for a larger portfolio (say, 50 stocks), the eigenvectors fall off very precipitously, and I end up with 30-40 eigenvectors that explain ~1.5% of the variance — and therefore I end up with ~30-40 clusters.

4) Skip the PCA, and just use hierarchical centroid clustering, using the rows of the correlation matrix as my points in n-space, with a ‘maximum distance’ criteria, not allowing clustering if points are ‘too far’ apart. However, without any way to decide what this ‘maximum distance’ is, this didn’t feel very good.

5) I realized that the problem with using my correlation matrix rows as my points in n-space was that as the number of holdings increased, the correlation between two individual holdings mattered less and less. i.e. if I had two holdings that had similar correlations on every other holding, but a high correlation with each other, I want them clustered. But if they have a low correlation with each other (which typically implies, if their other correlations are ‘similar’, that the other correlations are low), I do NOT want them clustered. However, using the correlation matrix, as the number of holdings increased, the each dimension matters less and less, so the dimension identifying their low correlation with each other does not come into play.

To fix this, I mapped my correlation matrix into n-space (using optimization) s.t. the cosine between two holdings’ vectors in n-space was as close as possible to their correlation. This fixed the problems associated with the last issue, but I still don’t know how to identify how many clusters to select. Furthermore, the optimization seems to be quite slow (though, this is sort of a low priority problem)

In summary…

Given a portfolio of equity holdings, how would you recommend I go about identifying a) how many clusters exist in the portfolio and b) what those clusters are?

These are the problems I am stuck on…

  • Share/Bookmark

Jun 16 2010

Weighted Random Array

I had an array that I was trying to do weighted random pulls from. There are a few basic algorithms for this online, but my issue was that they were not efficient for very, very large tables — most were O(n) for each time you wanted to pull an element. So I wrote my own wrapper class that would handle the task for me using Alias Tables.

class WeightedRandomArray < Array
      def initialize(other_array, weights)
        total_weights = weights.inject(0.0) { |t,e| t+e }
        proportions = weights.map { |e| e / total_weights }
 
        elements = other_array.zip(proportions)
 
        # construct an alias table for faster access
        @table = []
 
        n = elements.size
        elements.map! { |a,w| [a, w*(n-1)] }
        elements.sort! { |e1,e2| e1[1] <=> e2[1] }
 
        while elements.size > 2
          p = elements[0][1]
          elements[-1][1] -= (1.0 - p)
          @table << [p, elements[0][0], elements[-1][0]]
          elements.delete_at(0)
          elements.sort! { |a,b| a[1] <=> b[1] }
        end
 
        p = elements[0][1]
        elements[-1][1] -= p
        @table << [p, elements[0][0], elements[-1][0]]
      end
 
      def random_element
        entry = @table[(Kernel.rand * @table.size).floor]
        if Kernel.rand < entry[0]
          return entry[1]
        else
          return entry[2]
        end
      end
    end

So now the work is all up-front — and while it is considerable, so long as you will be doing enough random draws, the O(1) time to get a random entry should trump…

  • Share/Bookmark

Jun 8 2010

Passing around Ruby blocks

I have some code that I am bundling together that required me to tackle a rather strange problem. Basically, I had one function that took a block, and another function that wrapped around that function. It looks something like this:

def inner(*args)
   yield args[0], args[1]*args[2]
end
 
def wrapper
   inner(3,4,5)
end

The question is, how do I pass a block to wrapper and have it get passed to inner? Google wasn’t much help here. My solution looks like:

def call_block
   yield 4, 5, 6
end
 
def wrap_block(&blk)
   call_block { |*args| blk.call(*args) }
end
 
wrap_block { |x,y,z| puts x+y+z }

Would love to see a smarter solution…

  • Share/Bookmark

May 19 2010

Immutable Ruby?

Got bored and hacked around.  Found this sort of interesting…
class Object
  def self.new( *args, &blk )
    o = allocate
    o.instance_eval{initialize( *args, &blk )}
    o.freeze
    o
  end
end
Forcing objects to be frozen after initialization?  Almost sounds … functional.
  • Share/Bookmark

Apr 1 2010

Long time, no post

So it has been well over a month since my last post, and that is just far too long.  Finals snuck around the corner, then it was spring break, and then getting back into the swing of school.  But now that I am back, I thought I would jump back on the train.

Today, I thought I would provide a couple thoughts on some tools I have been trying out.  I use Matlab fairly extensively in school, but since I work for myself, I have to use other, alternative products to develop my strategies in to avoid having to pay Matlab’s exorbitant licensing fees.  If you’ve read some of my past posts, you will also know that I have been trying to find some sort of alternative to my Ruby addiction — something preferably statically typed and with strong functional roots.

But I am also looking for something that allows me to do easy numerical computing and statistics work.  This means a good matrix interfacing library, as well as a strong set of statistics (and preferably, econometrics) packages.

My first stop was the “Matlab” clones and cousins: Octave, SciLab, R, SciPy and Sage.  My initial thoughts were:

  • Octave: I would really like to have a GUI.  And a lot of Matlab functions I want are missing (particularly, textscan).
  • SciLab: Not enough to differentiate itself from Octave.  Why bother learning the differences?
  • R: I absolutely despise R’s semantics — but it is a brilliant statistics package.
  • Sage: Too ‘mathematica’ like for me
  • SciPy (with NumPy and IPython): Not a bad alternative, if I could get the damn thing to compile on Mac OS X 10.6

After continued searching, I found QtOctave, which solved my Octave GUI problem.  For now, I have stuck with that.

I also have been playing around with Haskell, doing as much homework in it as possible.  Being more of an ‘academic’ language, it has a lot of scientific packages available.  Unfortunately, it doesn’t really have a uniform set of interfaces, which I think is the largest bonus of Matlab.  Can Haskell do everything Matlab can?  Probably.  But can it do it as easily and without as much glue code?  Not at all.  I really do think that Haskell is a fantastic language and would like to work on a project where I can unify several projects into one easy to use package — but my Haskell capabilities are just not there yet.  This all makes the learning curve twice as frustrating compared to any of the Matlab clones.

Another tool I re-stumbled upon, which I like, is Weka.  Instead of having to code up all my classification algorithms to perform analysis, I can quickly load up a file into Weka and have it perform the tests.  Saves me a whole lot of time.  I highly recommend checking it out for any quick machine learning tasks you have.

That is pretty much it, for now.  I am currently working on installing RLaBPlus.  I will provide my thoughts in a few days.

  • Share/Bookmark

Jan 29 2010

Random thoughts of the day

I always bounce through a fairly regular pattern when it comes to programming.  It goes a little something like: prototype an idea with Ruby code.  Find out it is slow.  Get frustrated.  Start writing code in C++.  Take a lot longer, but be impressed by how the strict type system tends to force me to write better programs.  Get frustrated with code verbosity.  Start looking for something else to program in that mixes the expressiveness of Ruby with the speed of C++ (one day I will realize that these two are probably mutually exclusive).  Decide that a more functional language will suit my needs.  OCaml?  Erlang?  Maybe something on the JVM!  Scala?  Clojure?  Get frustrated with build system (I don’t care what you say, Maven is just stupid) or lack of libraries.  Go back with my tail between my legs to Ruby.  Start looking at all the Ruby options: MRI, MacRuby, JRuby …

But wait!  How is it that I have somehow always overlooked Rubinius?  JIT compilation on LLVM?  Future multi-VM implementation (hello sandbox!)?  Holy cow!  That is AWESOME.  Rubinius is definitely my VM of choice from now on.

P.S. Totally unrelated, but check out Ruby Processing — it is pretty damn cool.

  • Share/Bookmark

Jan 18 2010

github

I started actively publishing some random code on github. If you want to check out my repos, click here.

  • Share/Bookmark

Jan 18 2010

[C++] Learn something new every day

If you want to inherit from a class that does not have a virtual destructor, you will be in trouble if you inherit from it publicly, because deleting through a base-class pointer will lead to nasty behavior. But having a private or protected inheritance means that you won’t be able to delete through a base class — because you won’t be ABLE to obtain a pointer to the base class.

So, no virtual destructor doesn’t mean you can’t inherit — it just means that it should be private or protected inheritance.

Of course, composition is probably a better choice, but whatever.

  • Share/Bookmark

Jan 6 2010

Some thoughts on Back Testing

Back testing is fairly common when analyzing the profitability of a strategy, but there are many other things to be considered besides returns.  Much of this list came from perusing Nuclear Phynance (particularly, FDAXHunter’s input).

  • Length of the Period Tested: Over what time-frame did you run the test? How long was that time-frame?  What market conditions persisted over it?  The goal should be to run the strategy on as diverse a time-frame as possible, to help you discover what market factors play a critical role in the success or failure of your method.
  • Out of Sample Tests & Locations: Much like above, you want to have a fairly significant and diverse set of out of sample data to test on.
  • Average Trade / Win / Loss: What does the average trade of the system look like?  Are you normally profitable, or do you lost money and it was a couple fat-tail trades that gave you profitability?
  • Volatility & Skewness of P&L Stream: Is your profitability stable?  Do you have a fat loss tail?  How skewed positive are you?
  • Maximum Consecutive Losers / Winners: This is very important for when the system goes live.  Is five bad trades in a row a reason to pull the system?  Ten?  What is normal for the system?  When should we start getting concerned?
  • Maximum Draw Down & Time: If we implemented the system, what sort of draw-downs would we have to stomach, and over how long would we have to stomach them?  I don’t care much about a 1300% return over 5 years if for 4 of them, I faced an 80% draw-down.  You would probably pull the plug long before the fifth year came around.
  • Average Draw Down & Time: What does the average draw down look like?  Is it stable?
  • Percent of Winners Removed Until Neutral: What percentage of our best trades do we have to remove before the system breaks?  Is it only a few?  Is our success based on a few large winners, or do we have a stable set of success?
  • Histogram of P&L: What does the P&L look like, historically?  This is a visualization of the skew and volatility from above.
  • Shape of Equity Curve: Are we talking about a long, smooth curve?  A curve with lots of jumps?  How much interim volatility between new highs?
  • Optimal Parameter Location: Are the parameters we used in a stable location, or are they at a pin-point?  If they are at a pin-point, the success of the model is most likely the result of data-mining, instead of a true edge.  Instead, we would like to see that our parameters are in a plateau — the model remains stable for moderate changes in our parameter values.
  • Performance in Other Markets: How does the model perform on securities it wasn’t designed to trade for?  Does it succeed in similar securities?  Does it fail in securities which the edge shouldn’t exist on?

All of these things should be considered along-side a simple profitability analysis, or else you will end up with a ‘successful back-test of three years, blow up in three days’ scenario.

  • Share/Bookmark

Jan 6 2010

Painting with Computers

Inspired by Roger Alsing, a poster at Gamedev.net has created two very cool projects. The first is a near copy of Roger’s work, and can be found here. The second is more recent, and uses the idea of Boids to paint pictures. Check it out here. Very cool stuff.

  • Share/Bookmark