# Back-testing Stock Selection - Data Analysis Help Needed

1. Mar 29, 2012

### unleash

Hey everyone!

I'm a finance grad and am doing my first big project back-testing some stock selection methods.

I have spent the last few weeks writing a big VBA program to run the back-test and I have the following:

10 dates (5 years semi-annual) and 40 companies where, for a given date, if data is available then I have
a) the stock price on that date
b) a valuation
from which I then calculate % difference to determine whether I value the stock at more or less than it's trading.

On each of these dates, I have a set of companies (fewer - approx 10 for the earlier dates since not all companies had sufficient data for a back-test that far back and a full 40 for the latest few dates) and I have for each company a 'spread' which is used to indicate whether to buy or sell the stock.

I have tried a non-parametric method of testing whether the stock-selection method works by ranking the stocks on each date by spread and creating an equally weighted portfolio of the top quartile and similarly for the lower quartile and then check the return over the next 6 months.
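For anyone who wants to reproduce the ranking step outside VBA, here is a minimal Python sketch of it for a single date. The function name `quartile_returns`, the dictionary layout and all the numbers are made up for illustration; they are not the actual back-test data.

```python
from statistics import mean

def quartile_returns(spreads, forward_returns):
    """Equal-weighted forward returns of the top quartile, bottom quartile
    and all stocks for a single date.

    spreads         -- dict: ticker -> spread on the ranking date
    forward_returns -- dict: ticker -> return over the following 6 months
    """
    # Rank tickers from highest spread (strongest "buy") to lowest.
    ranked = sorted(spreads, key=spreads.get, reverse=True)
    k = max(1, len(ranked) // 4)  # quartile size, at least one stock
    top = mean(forward_returns[t] for t in ranked[:k])
    bottom = mean(forward_returns[t] for t in ranked[-k:])
    everything = mean(forward_returns[t] for t in ranked)
    return top, bottom, everything

# Hypothetical numbers for one date with 8 stocks (not real data).
spreads = {"A": 0.30, "B": 0.20, "C": 0.10, "D": 0.05,
           "E": -0.05, "F": -0.10, "G": -0.20, "H": -0.30}
forward_returns = {"A": 0.12, "B": 0.08, "C": 0.03, "D": 0.02,
                   "E": 0.01, "F": -0.01, "G": -0.04, "H": -0.06}

top, bottom, everything = quartile_returns(spreads, forward_returns)
print(f"top {top:.4f}, bottom {bottom:.4f}, all {everything:.4f}")
```

Because the quartile size is recomputed from however many stocks have data on that date, the same function works for the early dates with about 10 companies and the later ones with 40.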

The results are as hoped with the quartile with the highest spread (valuation suggests that they're a 'buy') yielding the highest return over the following period and conversely, the lower quartile significantly under-performs relative to the top quartile and relative to an equally weighted holding of all stocks tested for that given sub-period.

I would now like to statistically test this relationship. A t-test comes to mind, but I'm unsure whether I should just take the top quartile versus the lower quartile for each sub-period and do 10 t-tests (similarly for buy vs the equally weighted sample portfolio) ... or whether I should somehow do a test over the entire set of 10 dates (given that the number of companies on each date is different and so each portfolio is different).

Also, any other suggestions of nonparametric or other statistical methods to draw some juicyness out of the data will be much appreciated! :)

Regards
a

2. Mar 29, 2012

### Stephen Tashi

In my opinion, you can't think clearly about problems involving statistics unless you have a probability model for the phenomenon you are studying. If you do the traditional type of statistics without formulating such a model, the methods you use are actually assuming a particular probability model, so you haven't escaped the requirement - you've only succeeded in pushing it into the background or remaining ignorant of it.

If the primary goal is to do a conceptually clear analysis, you should formulate a probability model for how the data is generated.

On the other hand, if the primary goal is to write an acceptable academic document, then focus your attention on which people are going to approve it. The use of statistics is subjective and different people may have very strong opinions on statistical methods. Certain methods are established traditions in certain fields of study. The simplest course of action is to ask the people who are going to review the document for suggestions and to follow those suggestions.

Your question of how to extract information from the data is a natural question, but it puts people who answer in the position of doing a mind-reading exercise as to the goals of the analysis.

My mind reading attempt is this: There is some sort of "utility" function that defines how well your valuation method works. I can't follow exactly how you compute it, but I'll imagine it as something like this: We assume that at time t = n, investor A invests, say, $10,000 in stocks that your method recommends, dividing his money equally among those stocks. At the same time, investor B invests $10,000 in the same manner in stocks your method recommends against. We assume that at time t = n + 1, both investors sell their stocks. The utility of your method from time t = n to t = n + 1 is (profit of investor A - profit of investor B).
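As a rough sketch of that imagined utility in Python: the function name `period_utility` and the sample returns are hypothetical, and the code assumes each stock's return over the period is already known as a fraction (0.12 meaning +12%).

```python
def period_utility(buy_returns, sell_returns, capital=10_000.0):
    """Profit of investor A minus profit of investor B over one period.

    Each investor splits `capital` equally among their stocks, so the
    profit is capital times the average return of those stocks.
    """
    profit_a = capital * sum(buy_returns) / len(buy_returns)
    profit_b = capital * sum(sell_returns) / len(sell_returns)
    return profit_a - profit_b

# Hypothetical returns: A holds two recommended stocks,
# B holds two stocks the method recommends against.
utility = period_utility([0.12, 0.08], [-0.04, -0.06])
print(round(utility, 2))
```

With 10 dates this gives one utility number per period that has a following period to measure, and that short series is what a test would have to work with.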

Let's assume your goal is to use the traditional (non-Bayesian) sort of statistics to "prove" that your method "works". The simplest formulation of this is a "hypothesis test". For this, we need a "null hypothesis", which will express the general idea "There is no difference in the performance of investor A and investor B". However, the null hypothesis must say more than this. It must say enough to let us compute the probability of observing some statistic (like the t-statistic).

For example, if you assume that each "step" in the observed data from time t = n to time t = n+1 is an independent draw from a random variable representing the distribution of utility, then you can compute the probability of various statistics.
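To make that concrete, here is a small sketch of a one-sample t-statistic on the per-period utilities, under exactly that independence assumption plus the usual t-test assumption that the utilities are roughly normal. The ten utility values are invented, and `one_sample_t` is just an illustrative name. The constant 2.262 is the standard two-sided 5% critical value of the t distribution with 9 degrees of freedom.

```python
import math
from statistics import mean, stdev

def one_sample_t(utilities):
    """t-statistic for the null hypothesis 'mean utility is zero'."""
    n = len(utilities)
    standard_error = stdev(utilities) / math.sqrt(n)  # sample std dev, n - 1
    return mean(utilities) / standard_error

# Hypothetical utilities (A's profit minus B's profit) for 10 periods.
utilities = [1500, -200, 800, 1200, -500, 900, 300, 1100, -100, 600]

t_stat = one_sample_t(utilities)
T_CRIT_DF9 = 2.262  # two-sided 5% critical value, 9 degrees of freedom
print(f"t = {t_stat:.3f}, reject null at 5%: {abs(t_stat) > T_CRIT_DF9}")
```

This treats the whole set of dates as one sample of period-level differences, rather than running a separate t-test inside each date.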

Since you are dealing with time series data, I'm dubious of such a simplified assumption. It seems to me that the successive steps of the process aren't independent events if you believe in "trends" in the stock market. Again, I emphasize that if you know the people who will review your document, then you should try to divine their opinions on such matters. If you can't speak to them directly, then try to look at work that they have written and see what they did.
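One crude way to look at the independence question is the lag-1 autocorrelation of the per-period utility series. The sketch below is only illustrative (the function name and the toy series are made up), and with around 10 observations the estimate is very noisy, so it can hint at dependence but not rule it out.

```python
from statistics import mean

def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation: covariance between consecutive
    values divided by the overall sum of squared deviations."""
    m = mean(series)
    num = sum((a - m) * (b - m) for a, b in zip(series, series[1:]))
    den = sum((x - m) ** 2 for x in series)
    return num / den

# Toy series: a steady trend versus a strict alternation.
trending = [1, 2, 3, 4, 5, 6]
alternating = [1, -1, 1, -1, 1, -1]

print(lag1_autocorrelation(trending))     # positive: neighbours move together
print(lag1_autocorrelation(alternating))  # negative: neighbours flip sign
```

A value well away from zero on the real series would be a reason to distrust a test that treats the periods as independent draws.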