Hi everyone, Long time lurker, first time poster! I'm developing a website where users can read and participate in discussions, articles and questions. I'm currently developing an algorithm that will recommend to each user discussions, articles and questions that they would be interested in. I've written up what I've come up with so far. I'm not looking for anyone to spend too much time on this, just mostly looking for pointers. It'd be a great result to have someone say "ha! What are you doing? That thing will fall apart with more than 200 users!" etc. Please poke holes in it and make any suggestions you like. If you think the entire approach is wrong, please let me know if there's a particular area I should be researching to re-work it. On a final note, I'm not a computer scientist and I have no experience with these kind of algorithms. Thanks in advance! Aim: To successfully recommend to a user discussions, articles and questions that he or she would be interested in reading. This algorithm is not intended to be a perfect solution. Rather, it is intended to be a solution until we can employ computer scientists to re-write it. Whilst functional, it is far from efficient in its current state (read: it may be slow and most likely won’t scale well). Required definitions: A category is a broad field - e.g. Mathematics, Physics, Biology, etc. A sub-category is the next level - e.g. Pure Mathematics, Applied mathematics, etc. A sub-sub-category is the next level again - e.g. Number Theory, Real analysis, etc. A thread is any question, article or discussion - e.g. "is religion incompatible with science?" The order is then category > sub-category > sub-sub-category > thread Overview: Each user is connected to a unique recommendation 'cluster' that feeds them the following recommendations: discussions, articles and questions to read. The cluster a user is connected to is determined by their top 3 most viewed categories (e.g. Biology, Physics and Maths). That is, there is a cluster for Biology, Physics and Maths, a cluster for Biology, Physics and Sport, etc. Using our current number of categories, 22, this amounts to 1,540 clusters (22 choose 3). Each cluster then uses information from the connected users to determine recommendations for those users. It will do so by recommending the most viewed threads (discussions, articles, questions, etc.) from all of its users combined. More specifically, to reduce computation it will identify the total most viewed sub-sub-categories from those users combined, then simply take the most viewed thread from each of those sub-sub-categories. Connections are therefore based on categories, but recommendations are based on sub-sub-categories. Figure 1: http://imgur.com/An9cawZ Consider the cluster above, C1 (cluster 1). Users A, B and C are connected to the Maths, Bio, Physics cluster because they each have these three as their top three most viewed categories. If both B and C have the sub-sub-category, Number Theory, then A will most likely receive a recommendation of the most viewed thread in Number Theory. The particulars: - Each cluster must have at least 40 users connected before it becomes operational - this will prevent recommendations that are unequally weighted (e.g. user A gets recommended everything user B reads). - If a user’s primary (read: best) cluster is not operational, they will be connected to the next best cluster. They will remain here until their primary cluster becomes operational. - A user’s cluster is determined by their most viewed categories. As this changes, their cluster changes. - Every user connected to a particular cluster receives the same recommendations from that cluster. The assumption is that users with the same categories in their most viewed will be interested in similar things. The recommendation list will be filtered based on the users top sub-subs so that no redundant recommendations occur. For instance, a user which has a particular sub-sub in their most viewed list that is also a top sub-sub in the cluster will not be recommended anything from this sub-sub. This is based on the assumption there’s a good chance they have viewed the most viewed material in that sub-sub. - We only show the top 20 recommendations at any point in time, then we load more when they scroll - If we assume that there are 22 categories, an average of 20 sub-categories per category and 15 sub-sub-categories per sub-category, we have approx. 6,600 sub-sub-categories. This should be an overestimate. - If we assume that there are 22 categories, then there are 1,540 clusters (22 choose 3). The method: Each user has a data profile detailing the following: http://imgur.com/td6ApWF From this data, the user’s ‘cluster profile’ can be determined. To determine the cluster, we simply sort the views column and extract the top three most viewed sub-sub-categories; these constitute the cluster profile. This table will consist of a maximum of 26,400 (6,600 * 4) cells per user. If we don’t create cells until they are required and we make the assumption that in reality most users won’t view threads from more than 150 sub-sub-categories, we can use a working assumption of a maximum of 600 (150 * 4) cells per user. This will be even lower until they reach this working maximum. In addition to the above, another site-wide table will be kept to determine what each cluster should recommend. It will appear as follows: http://imgur.com/rKHdjpB Each time a user views a thread in a sub-sub-category, the cell corresponding to that sub-sub-category and that user’s cluster is incremented. That is, if a user that’s connected to cluster 1 views a thread in particle physics, the first cell is incremented by 1. Cluster 1’s recommendations are then simply the highest viewed threads from the most viewed of these sub-sub-categories. Each user from cluster 1 will receive say the top 20 of these as recommendations. Alternatively, we can do away with this table entirely. Instead of performing a cluster-check each time a thread is viewed and incrementing the site-wide table, we can utilise the data we already have. As we have already stored how many views a particular user has for each sub-sub-category, we can calculate the total number of views in a particular cluster for each sub-sub-category on the fly. That is, when cluster 1 needs to make a recommendation, it looks at each user connected to it, totals their views for each sub-sub-category, calculates the total views per sub-sub-category for the entire cluster and makes the recommendation based on that. Clearly, the trade-off here is saving storage in exchange for computation. The computation appears minimal in comparison to the storage however, so the trade-off appears worthwhile at first glance. Thanks again for taking the time to read this. Any assistance would be greatly appreciated.