Creating Diversified Portfolios Using Cluster Analysis


– Independent Work Report Fall, 2015 –

Karina Marvin
Adviser: Swati Bhatt

Abstract

Because of randomness in the market, as well as biases often seen in human behavior related to investing and illogical decision making, creating and managing successful portfolios of financial assets is a difficult practice. Achieving high returns with low risk is ideal, but seemingly impossible. Modern portfolio theory states that diversification of assets is the most effective way to get low risk-reward ratios [6]. However, how can one diversify effectively while at the same time avoiding biases? The solution is an automated method of diversification using cluster analysis of financial assets. The cluster analysis serves as a method to find which assets are different from each other [15]. What is the best measure of difference or similarity between stocks? Previous works have attempted to utilize this type of algorithm using the correlation between stocks as the similarity measure driving the clustering [11, 12]. However, correlations often change during periods of financial stress [14]. This would cause the clusters to no longer be structurally sound at a time when risk is high. This paper proposes an alternative measure of similarity to avoid this risk: an average of the two ratios Revenues/Assets and Net Income/Assets. Therefore, companies with similar Revenues/Assets and Net Income/Assets ratios will be in the same cluster, while companies with varying Revenues/Assets and Net Income/Assets ratios will be in different clusters. Then, diversified portfolios of high performing stocks can be created by picking assets with the highest Sharpe ratios from different clusters. Testing delays in the financial statements that are used as well as weighting of the ratios, this algorithm results in high performance portfolios when compared to the S&P 500 from July 2000 to July 2015. This is particularly true during the pre-crisis and post-crisis periods, 2000 to 2007 and 2009 to 2015 respectively. During the crisis period of 2007 to 2009, the algorithm portfolios do not perform as well as they do in the other periods; however, they still perform adequately compared to the S&P 500.

1. Introduction

Maintaining and managing a portfolio of investments is a much puzzled-over activity. There are many issues that contribute to the confusion and difficulty of investing. Primarily, investors are subject to a great deal of uncertainty and randomness. There are even schools of thought and economic theory that state it is not possible to beat the market using savvy stock selection or good timing. The efficient market hypothesis claims that stocks always trade at fair value because share prices incorporate all relevant information [5]. This hypothesis therefore implies that it is impossible to sell stocks that are overvalued or buy stocks that are undervalued. Although there is opposition to the efficient market hypothesis, the extreme availability of data in recent years suggests that market prices more quickly reflect a value close to the true value of the stocks. This implies that an investor cannot beat the market consistently because market prices will only change in reaction to news, which by definition is random. An investor, therefore, can only obtain higher returns by partaking in riskier investments.

Additionally, when one picks an asset to invest in, there are many different biases that affect the investor. Illogical decision making has been observed in studies on human behavior when financial choices are being made. For example, it has been shown that people tend to overweight tail events, or events that are highly unlikely to happen, in their decision making [2]. This is why many participate in lotteries, despite the minuscule chance of winning. Additionally, people tend to have a negativity bias, weighting bad news more heavily than good news [13]. This causes investors to be more cautious or risk averse about losses than about gains, despite the fact that they should be treated equally in economic analysis [13].
Due to these biases and the many more that influence the decision-making of investors, a more quantitative and formulaic way to choose investments is necessary to reduce biases in judgement, as well as reduce risk and maximize returns despite uncertainty in the market.

Modern portfolio theory is the most widely used practice by individuals to develop portfolios. It is based on a principle of attempting to maximize expected return for a given amount of risk or, equivalently, minimizing risk for a given amount of return [6]. A highly utilized method to reduce risk is diversification. The idea of diversification is to split investment between varying companies so that if a few of the securities one owns were to take a downturn, the others would not, reducing the loss. The downturn could be related to many factors, including the industry, market, country, type of asset, or company itself.

Risk can be diversified by picking assets that are different from each other with respect to a particular aspect of the assets themselves. This aspect could be related to industry, country, type of asset, or more. For example, two stocks in the same industry may move together. Additionally, two European stocks may move together while an American stock moves differently. However, there are an infinite number of ways that assets can be related and connected to one another that could result in an issue for the investor in a crisis. There is no formulaic mechanism one can use to diversify portfolios. Furthermore, the biases common in human behavior are still present and influential over investment decisions. Therefore, an automated method to classify or cluster assets would be very useful to investment decision-making and the practice of diversification. The stocks would be separated into groups via a clustering method that maximizes similarity within groups and minimizes similarity between groups. Doing this with securities would allow one to figure out what combination of assets could make up a well diversified portfolio.

2. Background

2.1. 2008 Financial Crisis

The 2008 Financial Crisis, which is also known as the Global Financial Crisis, is thought to be the worst and most severe crisis since the Great Depression in the 1930s. There were many causes of the crisis, particularly the increase in subprime mortgages and mortgage-backed securities that were only insured by credit default swaps.
House prices rose steadily from the 1990s to 2006. During this time, it was commonly thought that real estate was a safe and guaranteed successful investment. As a result, home ownership increased to 68.6% of households by 2007 [1]. During this time, mortgage lenders began to make subprime loans to people who usually would not be approved for one. Banks securitized these loans through mortgage-backed securities. The two biggest issuers of these securities were Fannie Mae and Freddie Mac. When the housing market began to plummet dramatically in 2006, the mortgage-backed securities became worthless as subprime borrowers defaulted. Furthermore, personal wealth fell, reducing consumption and negatively affecting businesses. It became very difficult to sell any suspicious assets, and the stock market plummeted. Banks were further hurt by the lack of confidence of depositors, who began to withdraw all of their money. When Lehman Brothers went bankrupt in September 2008 due to large losses on mortgage-backed securities, the panic in the markets multiplied. Governments and central banks responded with large amounts of fiscal stimulus and institutional bailouts. There was an influx of regulatory action and a slow recovery began. In June 2009, the financial crisis was declared to be over, as the world continued to recover.

During the few years of the 2008 financial crisis, the stock markets were in complete turmoil. In less than a month, from September 19, 2008 to October 10, 2008, the Dow Jones Industrial Average plummeted 3600 points [1]. The S&P 500 fell 38.5% and the NASDAQ composite index fell 40.5% in 2008 alone [1]. These three are US market indices that are commonly followed and are thought to be representative of the health of the American economy. The extreme financial stress and systematic risk during this period affected investors and markets worldwide. While the economy has improved greatly from 2008 to 2015, it is important to note that this period of financial stress occurred, and could possibly occur again. Therefore, in order to accurately and effectively stress test any investing algorithms or strategies, the 2007 to 2009 period must be used as a worst case scenario test.

2.2. Diversification and the Financial Crisis

Diversification is the process of choosing investments in order to reduce exposure to any one particular asset. This is typically done by investing in a variety of assets. If the asset prices do not move together, then a diversified portfolio of those assets will have a lower variance than the weighted average variance of the assets. The portfolio’s volatility could even be lower than that of any single asset inside the portfolio.

Risk, which is defined as the chance that an investment’s return will be different than expected, is what diversification tries to reduce. There are two basic types of risk: systematic risk and idiosyncratic risk. Systematic risk influences a large number of assets at once. It is inherent to the entire market, not just a particular stock or industry [9]. Because of this, it is impossible to diversify away systematic risk.

Idiosyncratic risk, on the other hand, only affects a small number of assets. This type of risk can also be referred to as unsystematic risk or specific risk, because it typically will affect a specific stock [9]. For example, news about a company such as a sudden strike will change the stock price of that company, and possibly of a couple of its competitors. The risk of this occurring is idiosyncratic risk, which one can protect oneself from using diversification.

The capital asset pricing model, or CAPM, describes the relationship between risk and expected return [10]:

\[ E[r_a] = r_f + \beta_a \left( E[r_m] - r_f \right) \tag{1} \]

where $E[r_a]$ is the expected return on asset $a$, $r_f$ is the risk-free rate, $\beta_a$ is the beta of $a$, and $E[r_m]$ is the expected return on the market. $\beta$ is a measure of how much the security will respond to swings in the market. In fact, it is a measure of the systematic risk of a security in comparison to the market as a whole [10]:

\[ \beta_a = \frac{\sigma_{am}}{\sigma_m^2} \tag{2} \]

where $\sigma_{am}$ is the covariance between asset $a$ and the market and $\sigma_m^2$ is the variance of the market. If a regression is run between the return on the market and the return on an individual asset, the following is obtained:

\[ r_a = \alpha_a + \beta_a r_m + \varepsilon_a \tag{3} \]

where $\varepsilon_a$ is the error term, and measures the variability in $r_a$ that is independent of all other securities in $r_m$. Using this regression, the variance of a stock can be decomposed into its systematic and idiosyncratic parts [10]:

\[ \sigma_a^2 = \beta_a^2 \sigma_m^2 + \sigma_\varepsilon^2 \tag{4} \]

where $\sigma_a^2$ is the total risk, $\beta_a^2 \sigma_m^2$ is the systematic risk, and $\sigma_\varepsilon^2$ is the idiosyncratic risk. This idiosyncratic risk can be removed through diversification.

Portfolio variance is:

\[ \sigma^2(r_p) = \sum_{i=1}^{n} \sum_{j=1}^{n} w_i w_j \sigma_{ij} \tag{5} \]

where $\sigma^2(r_p)$ is the portfolio variance, $n$ is the number of assets in the portfolio, $w_i$ is the portfolio weight of asset $i$, and $\sigma_{ij}$ is the covariance of assets $i$ and $j$, noting that the covariance of an asset with itself is simply the variance of that asset. Therefore, lower covariance between assets can greatly reduce the portfolio variance.

For an equally weighted portfolio ($w_i = 1/n$), portfolio variance becomes [10]:

\[ \sigma^2(r_p) = \frac{1}{n} \bar{\sigma}^2 + \frac{n-1}{n} \bar{\sigma}_{ij} \tag{6} \]

where $\bar{\sigma}^2$ in the first term is the average variance of the individual investments, representing the idiosyncratic risk, and $\bar{\sigma}_{ij}$ in the second term is the average covariance, representing the systematic risk. As $n$ approaches infinity, the first term approaches zero and the idiosyncratic risk is diversified away.

However, the systematic risk remains. It should also be noted that there was a great deal of systematic risk during the 2008 financial crisis [9]. The economic event was certainly market-wide, with all types of assets and securities decreasing in value due to the lack of confidence present. Diversification can reduce risk to a large degree, but systematic risk cannot be removed.
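As a rough illustration of Equations 2, 4, 5 and 6, the following Python sketch estimates beta from sample moments, splits an asset's variance into systematic and idiosyncratic parts, and compares the double sum for portfolio variance with its equal-weight form. The function names and all numbers are hypothetical and chosen only for illustration; they are not taken from the data used in this paper.

```python
from statistics import mean, variance


def covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def beta_and_risk_split(asset_returns, market_returns):
    """Return (beta, systematic variance, idiosyncratic variance)."""
    var_m = variance(market_returns)
    beta = covariance(asset_returns, market_returns) / var_m   # Eq. 2
    systematic = beta ** 2 * var_m                              # Eq. 4, first term
    idiosyncratic = variance(asset_returns) - systematic        # Eq. 4, second term
    return beta, systematic, idiosyncratic


def portfolio_variance(weights, cov):
    """Eq. 5: double sum of w_i * w_j * sigma_ij over all asset pairs."""
    n = len(weights)
    return sum(weights[i] * weights[j] * cov[i][j]
               for i in range(n) for j in range(n))


def equal_weight_variance(n, avg_var, avg_cov):
    """Eq. 6, assuming equal weights w_i = 1/n."""
    return avg_var / n + (n - 1) / n * avg_cov


# Hypothetical inputs: 4 assets, each with variance 0.04 and pairwise covariance 0.01.
n, avg_var, avg_cov = 4, 0.04, 0.01
cov = [[avg_var if i == j else avg_cov for j in range(n)] for i in range(n)]
direct = portfolio_variance([1 / n] * n, cov)
closed_form = equal_weight_variance(n, avg_var, avg_cov)
```

With these inputs the two calculations agree, and increasing `n` in `equal_weight_variance` shrinks only the first term, which is the sense in which idiosyncratic risk is diversified away while the covariance term remains.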

2.3. Clustering Methods

Cluster analysis is the task of grouping a set of objects such that the objects within a group are more similar to each other than to the objects in other groups [15]. It is commonly used in statistical data analysis. There are many algorithms designed to solve this task that differ in the definition of a cluster and how the clusters are found. Typical problem parameters are similarity or distance functions and the number of clusters.

Clustering methods can be divided into two basic types: hierarchical and partitional clustering. Hierarchical clustering either merges smaller clusters into larger clusters or splits larger clusters into smaller clusters [15]. This is typically used when the underlying structure of the data is a tree, and the result is presented in a dendrogram. Partitional clustering, by contrast, directly partitions the data set into a set of disjoint clusters [15]. The most commonly used partitional clustering method is k-means clustering.

k-means clustering is a method of partitioning n observations into k clusters, in which each observation belongs to the cluster with the nearest mean [15]. The goal is to minimize the within-cluster sum of squares, or:

\[ \underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2 \tag{7} \]

where $x$ are the observations, $S = \{S_1, S_2, \ldots, S_k\}$ are the sets of observations, and $\mu_i$ is the mean of the points in $S_i$. While this problem is NP-hard, there are commonly used heuristic algorithms, such as Lloyd’s algorithm [15]. The algorithm employs an iterative refinement technique. It is initialized by randomly assigning each observation to a cluster. Then, in the update step, the means are calculated to be the centroids of the clusters. In the assignment step, each observation is assigned to the cluster whose mean yields the least within-cluster sum of squares, where the sum of squares is the sum of the squared Euclidean distances. Then, the means are updated again and the process continues until convergence, when the assignment step no longer changes which cluster the observations are assigned to. The process is shown graphically in Figure 1.
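The update and assignment steps described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation used for the results in this paper; the function names, the `seed` parameter, and the handling of empty clusters (re-seeding with a random observation) are assumptions of the sketch.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def lloyd_kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's heuristic for k-means; points is a list of equal-length tuples."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in points]  # random initial assignment

    def update(current_labels):
        # Update step: mean of each cluster. An empty cluster is re-seeded
        # with a random observation (a choice made for this sketch).
        groups = [[p for p, l in zip(points, current_labels) if l == i]
                  for i in range(k)]
        return [centroid(g) if g else rng.choice(points) for g in groups]

    for _ in range(max_iter):
        means = update(labels)
        # Assignment step: nearest mean by squared Euclidean distance.
        new_labels = [min(range(k), key=lambda i: squared_distance(p, means[i]))
                      for p in points]
        if new_labels == labels:  # converged: assignments no longer change
            break
        labels = new_labels

    means = update(labels)
    wcss = sum(squared_distance(p, means[l]) for p, l in zip(points, labels))
    return labels, means, wcss


# Hypothetical one-dimensional data with two obvious groups.
data = [(0.0,), (1.0,), (2.0,), (100.0,), (101.0,), (102.0,)]
labels, means, wcss = lloyd_kmeans(data, k=2)
```

Because the result of Lloyd's algorithm depends on the random initial assignment, it is common to run it several times and keep the solution with the lowest within-cluster sum of squares.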

Figure 1: Lloyd’s algorithm as a heuristic for k-means

k-means was picked to be the clustering algorithm for the purposes of this paper. Because the clustering is to be run on stocks, there is no hierarchical nature to the data. A partitional clustering method such as k-means is much more appropriate.

3. Related Work

Automated methods of classifying assets for the purpose of diversification are recent innovations. For example, in 2005, Zhiwei Ren in Portfolio Construction Using Clustering Methods uses cluster analysis to group highly correlated stocks and then uses those clusters to run mean-variance portfolio optimization [11]. Similarly, Fredrik Rosen in Correlation Based Clustering of the Stockholm Stock Exchange classifies stocks based only on the correlation between them, where correlation is the measure of the extent to which the stock price returns fluctuate together [12]. This method of clustering on the similarity measure of correlation is the most obvious and straightforward approach. If one cluster’s stock price decreases, it is likely that the other cluster’s stock price will not decrease, therefore creating a hedge and reducing loss. The papers use different methods of portfolio creation and optimization, but both find success in evaluating the performance of the portfolios that were created after clustering.

However, both test on time periods of data that were not stressful in nature, in the few years before the 2008 financial crisis. If they were tested using a period such as 2007 to 2009, it is likely that the clusters would not remain structurally sound. Correlations often reverse or change during periods of stress [3]. Therefore the clusters that were formed based on correlations between stocks would no longer be strongly positively correlated within the clusters and uncorrelated or strongly negatively correlated between the clusters. This would mean the investor is no longer holding a diversified portfolio, putting them at risk of large losses. Furthermore, the increased risk would be coming at a time of financial stress, which is the worst possible time [14].

Figure 2: S&P 500 levels from 2000 to 2015

The Standard & Poor’s 500, or the S&P 500, is an American stock market index based on 500 large companies with stock listed on the NYSE or NASDAQ [8]. The S&P 500 is one of the most commonly followed equity indices and is considered the best representation of the US stock market [8]. In Figure 2, the past 15 years, from 2000 to 2015, of the S&P 500 levels are shown. Looking at the levels in Figure 2 as a financial indicator of wellness, it is clear that there were very tumultuous periods in recent history. Using a period of time such as 2002 to 2007 or 2010 to 2015 to test the portfolio creation methods that Ren and Rosen proposed would not be truly representative of the risk one could face. It is unlikely that the methods centered around correlations would survive during periods of financial distress, during which the stability of one’s investments is arguably most important. The question is then: is there a safer measure than correlation to run the cluster analysis on that results in portfolios that perform just as well?

4. Methodology

4.1. Clustering Measure of Similarity

Correlation as a measure of similarity does not hold up in periods of stress. A potential candidate for a good measure of similarity would be related to the previous success or potential for growth of the companies and be more inherent to the companies than correlation of stock prices. Then, the structure of the clusters formed on the basis of the similarity measure would not crumble during stressful time periods.

The measure of similarity proposed and evaluated in this paper is based on two financial ratios: revenues to assets and net income to assets. A weighted average of these ratios is taken, and the difference between the weighted averages of different firms serves as the similarity measurement.

Financial statements are released by all public firms every quarter. Revenue, net income and assets are some of the measures that are present on these financial statements. Revenues, which are on a company’s income statement, are the amount of money that a company receives during a specific period, also known as sales [7]. Net income, by contrast, is a company’s total earnings or profit. It is calculated by taking revenues and subtracting the cost of doing business including cost of goods, interest, taxes, depreciation, and other expenses [7]. These two measures are often considered when evaluating the strength or health of a business. They are thought to be the most important figures on quarterly reports and are both evaluated because it is possible for net income to increase while revenue remains the same, suggesting costs were cut. Revenue can indicate potential for growth or success in the market, while net income is the actual profit the firm takes in.
Both are inherent to a company’s business every quarter and relate to the company’s performance over the quarter.

However, a larger company is more likely to have high revenues and net income than a startup or a small company. Size, however, does not necessarily correlate with worthiness of investment. In order to scale for size, the revenues and net income values are divided by assets. Assets are also released in financial statements every quarter and are shown on the balance sheet of a corporation. Generally, assets include cash, accounts receivable, inventory, real estate, and equipment [4]. This measure is commonly used as a representation of the size of a company. Therefore, the revenues and net income can be scaled by size by dividing by assets.

4.2. Portfolio Creation

Using the difference of weighted averages as a measure of similarity, a clustering method is run on the data to partition it into groups. Then, a stock must be picked from each cluster. The clusters are ideally very different from each other. Therefore, a portfolio containing a stock from each cluster will be diversified. How should the stock from each cluster be picked? In the interest of having high returns and low risks, stocks are picked to maximize the return to risk ratio. The Sharpe ratio, or the average return earned in excess of the risk-free rate per unit of volatility, is a measure of risk-adjusted return. The stock from each cluster with the highest Sharpe ratio is therefore picked to be in the portfolio. The portfolio is then diversified and composed of historically high performing stocks.

4.3. Variables

Multiple values were varied to determine the most successful combination in portfolio creation. The weight of the revenues to assets ratio is varied from 0% to 100% in the weighted average used in the similarity measure. The similarity measure is then:

\[ x \cdot \frac{\text{Revenues}}{\text{Assets}} + (1 - x) \cdot \frac{\text{Net Income}}{\text{Assets}} \tag{8} \]

where $x$ ranges from 0 to 1. This covers any combination of the two ratios, including the pure Revenues/Assets ratio and the pure Net Income/Assets ratio.

Additionally, a delay in effect from previous quarters of data is also varied. Three delay models are investigated: one period, two period and average. A one period delay simply uses the most recently published quarterly data to make a portfolio for the upcoming quarter. A two period delay uses the financial data published the quarter before the most recently published data. In addition to these two delays, the average delay model uses the average of the previous two quarters’ data for the cluster analysis. These three methods are compared to see which is the most successful.

5. Implementation

668 stocks in the technology sector that are traded on the NYSE and NASDAQ were used for the purpose of testing. Quarterly data on revenues, net income and assets, as well as daily data on stock price returns, were found from 2000 to 2015. The companies that did not have data for this period of time were removed, leaving 229 potential stocks to invest in. k-means clustering was run every quarter to create portfolios.

5.1. Determination of k for k-means

In order to maximize viability of the clusters created, an appropriate value for k must be determined. There is unlikely to be an obvious separation between clusters in the data; additionally, it would be ideal to not have to visually determine the number of clusters every quarter. To determine the appropriate k, two measures were considered: the ratio of between-cluster sum of squares to total sum of squares, as well as silhouette width.

The between-cluster sum of squares is the sum, over all clusters, of the squared Euclidean distance between the cluster mean and the overall mean of the data, weighted by the number of observations in the cluster [15]. The within-cluster sum of squares, by contrast, is the sum of the squared Euclidean distance between every observation and the mean of the cluster it belongs to. The total sum of squares is the sum of the two. Therefore, the ratio of between-cluster sum of squares to total sum of squares measures what percentage of the variance in the clustering is between clusters.
Intuitively, the ratio of between-cluster sum of squares to total sum of squares is the share of variance in the data that is accounted for between clusters, rather than within clusters.

Silhouette widths are a common measure of consistency of clusters [15]. For each point p, the average distance between that point and all other points within its cluster is calculated, as well as the average distance between p and all points in the nearest other cluster. The silhouette width is the difference between these two divided by the greater of the two. If there is strong cohesion within a group and weak cohesion between groups, the coefficients will be high.

Figure 3: Silhouette widths for varying values of k after k-means clustering

Firstly, a variety of values of k between 3 and 200 were examined to find a potential range. Clustering was run 100 times for each k, and the ratio of between-cluster sum of squares to total sum of squares was calculated. Ideally, this value is close to one, and consistent results are achieved across the 100 trials. As can be seen in Figure 3, a higher number of clusters appears to be better, but any number of clusters above five is fairly consistent across the 100 trials with a very high ratio over 0.9.
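The two measures considered for choosing k can be sketched as follows. This is an illustrative Python sketch using the standard decomposition of the total sum of squares into within-cluster and between-cluster parts, and the per-point silhouette width (b - a) / max(a, b); the function names and the toy data are hypothetical, and the sketch assumes at least two clusters and points that are not all identical.

```python
from math import dist
from statistics import mean


def centroid(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    return tuple(mean(coords) for coords in zip(*points))


def between_to_total_ss(points, labels):
    """Ratio of between-cluster sum of squares to total sum of squares."""
    overall = centroid(points)
    total = sum(dist(p, overall) ** 2 for p in points)
    within = 0.0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        mu = centroid(members)
        within += sum(dist(p, mu) ** 2 for p in members)
    return (total - within) / total  # between = total - within


def silhouette_widths(points, labels):
    """Silhouette width of every point: (b - a) / max(a, b)."""
    widths = []
    for i, (p, own) in enumerate(zip(points, labels)):
        same = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
                if l == own and j != i]
        if not same:
            widths.append(0.0)  # common convention for a single-point cluster
            continue
        a = mean(same)  # average distance to the rest of its own cluster
        b = min(mean([dist(p, q) for q, l in zip(points, labels) if l == c])
                for c in set(labels) if c != own)  # nearest other cluster
        widths.append((b - a) / max(a, b))
    return widths


# Hypothetical one-dimensional data and cluster labels.
data = [(0.0,), (1.0,), (2.0,), (100.0,), (101.0,), (102.0,)]
labels = [0, 0, 0, 1, 1, 1]
ratio = between_to_total_ss(data, labels)
widths = silhouette_widths(data, labels)
```

For well-separated clusters such as these, both the ratio and the silhouette widths are close to one; repeating the calculation over many values of k and many random restarts gives curves of the kind summarized in the figures.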

Figure 4: Between cluster SS to total sum of squares for varying values of k after k-means clustering

Clustering was run again 100 times for each k, computing the silhouette widths. The silhouette widths are calculated for each cluster, and the median of these is taken as a measure of the success of the value of k in the clustering. In Figure 4, all potential values of k are adequate, with high silhouette widths, other than 200.

It was concluded that all values of k between 5 and 100 would be tested for each clustering. The silhouette widths and sum of squares ratios would be calculated, and a k picked based on a weighted combination of these measurements. The ideal number of clusters is therefore dynamically picked in every portfolio creation.

6. Results

As a preliminary evaluation of the investing algorithm, a weight of 50% on Revenues/Assets in the weighted average of the financial ratios Revenues/Assets and Net Income/Assets and a single period of delay is tested.
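To make the configuration being tested concrete, the following Python sketch strings together the weighted ratio of Equation 8, the three delay models, and the selection of the highest Sharpe ratio stock in each cluster. It is a simplified illustration, not the code behind the results reported here: the tickers, financial figures, returns and cluster labels are all hypothetical, and the cluster labels are assumed to come from a separate k-means run on the similarity scores.

```python
from statistics import mean, stdev


def similarity_score(revenues, net_income, assets, x=0.5):
    """Eq. 8: x * Revenues/Assets + (1 - x) * Net Income/Assets."""
    return x * revenues / assets + (1 - x) * net_income / assets


def delayed_value(quarterly_values, model="one"):
    """Apply a delay model to a chronological list of quarterly values."""
    if model == "one":       # most recently published quarter
        return quarterly_values[-1]
    if model == "two":       # the quarter before the most recent one
        return quarterly_values[-2]
    if model == "average":   # average of the previous two quarters
        return (quarterly_values[-1] + quarterly_values[-2]) / 2
    raise ValueError(f"unknown delay model: {model}")


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return per unit of volatility (sample standard deviation)."""
    excess = [r - risk_free for r in returns]
    return mean(excess) / stdev(excess)


def pick_portfolio(labels, sharpes):
    """For each cluster label, keep the ticker with the highest Sharpe ratio."""
    best = {}
    for ticker, label in labels.items():
        if label not in best or sharpes[ticker] > sharpes[best[label]]:
            best[label] = ticker
    return sorted(best.values())


# Hypothetical inputs for four made-up tickers.
returns = {
    "AAA": [0.02, 0.01, 0.03],
    "BBB": [0.01, -0.01, 0.00],
    "CCC": [0.05, -0.04, 0.02],
    "DDD": [0.03, 0.02, 0.04],
}
labels = {"AAA": 0, "BBB": 0, "CCC": 1, "DDD": 1}  # assumed k-means output
sharpes = {ticker: sharpe_ratio(r) for ticker, r in returns.items()}
portfolio = pick_portfolio(labels, sharpes)
```

Setting `x=0.5` and `model="one"` corresponds to the 50% weighting and single period of delay used in the preliminary evaluation; other weights and delay models only change the inputs to the clustering step.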

Figure 5: Returns from clustering algorithm portfolios versus S&P 500 returns from 2000 to 2015

The quarterly returns on the clustering algorithm portfolios are shown in blue in Figure 5, with the quarterly returns of the S&P 500 shown in red, from 2000 to 2015.

The clustering portfolios’ returns, as can be seen in Figure 5, are more volatile than the S&P 500’s returns. There are three locations where there are large negative peaks compared to the S&P, around October 2001, October 2002 and October 2014. However, there are significantly more large positive peaks in relation to the S&P throughout the 15 years, in July 2001, January 2002, October 2003, January 2005, January 2011, and April 2012, as well as a few smaller positive peaks.

The performance can also be evaluated in multiple periods. 2000 to the beginning of 2007 can be thought of as a pre-crisis period, before the 2008 financial crisis, 2007 to 2009 as the crisis period, and 2009 to 2015 as the post-crisis period. Splitting up the clustering portfolios’ performance into these three periods, it appears that there is similar performance in the pre- and post-crisis periods. There are many high peaks and a couple of low peaks during these periods. By contrast, in the crisis period, the clustering portfolios had many small positive and negative peaks. Portfolios had no large peaks during this period, on the positive or negative side. The algorithm portfolios are less volatile during the crisis period, due to the systematic risk across all markets during the period. The conservative nature of the investments during this period compared to the other periods is a good feature of the algorithm.

Figure 6: Portfolio variance of clustering algorithm portfolios versus S&P 500 volatilities from 2000 to 2015

Figure 6 shows the portfolio standard deviation of the clustering algorithm portfolios versus the S&P 500 volatility from 2000 to 2015. As can be seen in the figure, the clustering portfolios do have higher volatility than the S&P 500, as expected. However, the values are still very close and the portfolio volatility is still low, typically around 2-3%, with the exception of the 2008 financial crisis. Additionally, the average volatility of the stocks that the portfolios are composed of is shown in purple. It is noted that this volatility is higher than the portfolio volatility, suggesting that the diversification was successful. Looking at the three different periods of time, the pre- and post-crisis volatilities of the clustering algorithm portfolios are similar to each other. The volatilities waver between 1 and 3 percent, following a similar pattern to the movements of the S&P 500 volatilities. In the crisis period, volatility spikes to around 6%. During this time, the S&P 500 volatility also spikes, although to a lower value. Because the 2008 financial crisis caused high systematic risk, the portfolio volatility is expected to increase greatly. Overall, the volatility of the portfolios is greater than ideal, but is still low enough to be acceptable and is representative of the underlying diversification.

Using the same data as for Figures 5 and 6, a regression was run between S&P 500 returns and the clustering portfolio returns for the 50%-50% weighting and one delay model. Using the quarters where the S&P 500 had positive returns, the corresponding quarters’ data were taken from the clustering portfolios’ returns and a regression was run between the two.

Figure 7: Regression of the positive values of the S&P 500 returns to the clustering portfolio returns

In Figure 7, the blue line is the best fit line for the relationship between the S&P 500 and the clustering portfolios, whereas the red line is a line of slope 1 as a reference point. Ideally, the coefficient on the regression is greater than 1. For example, if the coefficient is 2, and the S&P 500 returned 10% in one quarter, it would be estimated that the portfolios would return 20%. As can be seen in Figure 7, the coefficient is well above 1, at just below 2. In fact, a return on the S&P of 15% is estimated to correspond to a return of around 30% on the clustering algorithm portfolios. However, it is noted that there is a decent amount of variation around this regression, as can be seen by the points in Figure 7 and the $R^2$ value from the regression of 0.32.

In order to determine if the coefficient is significantly different from 1, the same regression is run with an of