First off, I compiled my data in August of 2014, using only projects listed under Kickstarter's “Tabletop Games” category. At the time, 4,432 projects had been launched in that category.
A programmer friend of mine built a “bot” to scrape the Kickstarter website, and the scraped data was compiled into a spreadsheet. This data consisted mainly of information anyone can see when viewing a Kickstarter project page, such as the funding goal, total number of backers, total funding raised, launch date, and end date.
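I didn't build the bot myself, but for the curious, here's a rough idea of what a scraper like that looks like in Python. This is a minimal sketch, not my friend's actual code, and the CSS selectors and column names are hypothetical placeholders (Kickstarter's real markup differs and changes over time):

```python
import csv
import requests
from bs4 import BeautifulSoup

def scrape_project(url):
    """Pull the publicly visible numbers off one project page."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # These selectors are stand-ins, not Kickstarter's actual markup.
    return {
        "url": url,
        "goal": soup.select_one(".goal").get_text(strip=True),
        "pledged": soup.select_one(".pledged").get_text(strip=True),
        "backers": soup.select_one(".backers").get_text(strip=True),
    }

def write_spreadsheet(urls, path="tabletop_projects.csv"):
    """Compile the scraped records into a CSV spreadsheet."""
    rows = [scrape_project(u) for u in urls]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```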
With this data, I've tried to identify specific variables that correlate with a project's success or failure.
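As a preview of what that kind of comparison looks like in practice, here's a minimal sketch using pandas. The column names (`funded`, `has_video`) are hypothetical stand-ins for whatever variable is being tested:

```python
import pandas as pd

# Load the compiled spreadsheet of scraped projects.
df = pd.read_csv("tabletop_projects.csv")

# The mean of a True/False column is the proportion of True values,
# so this gives the success rate within each group.
success_by_group = df.groupby("has_video")["funded"].mean()
print(success_by_group)
```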
A Problem With Our Kickstarter Data
We were only able to collect data for 3,629 of those 4,432 projects. The remaining 803 projects redirected to another location, and at first we could not figure out why. After much deliberation, we worked out what was going on.
Those 803 projects were all failed projects that had been relaunched at a later date. Their URLs redirected to the later version of the campaign and gave no information about the original launch. Of those 803 projects, some failed and some succeeded in their subsequent launches. So we decided to perform our statistical analysis on the 3,629 projects that were readily available.
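For anyone wondering how you detect this sort of thing, here's one way: request each URL without following redirects and check the status code. The URL below is just a placeholder:

```python
import requests

def is_redirected(url):
    """True if the project page now points somewhere else."""
    resp = requests.get(url, allow_redirects=False, timeout=30)
    # A 3xx status code means the page has moved; the Location header
    # tells you where it now points (e.g., the relaunched campaign).
    return resp.status_code in (301, 302, 303, 307, 308)

# Placeholder URL -- not a real project.
print(is_redirected("https://www.kickstarter.com/projects/example/old-campaign"))
```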
The Kickstarter Data Is Skewed Positively
But if you are an intelligent person, a huge problem may be glaring you in the face: the data is skewed! It will show a higher proportion of successful projects than it realistically should, because the 803 excluded projects all failed on their first launch.
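To make the skew concrete, here's a quick back-of-the-envelope calculation. The success count is made up purely for illustration; the point is the mechanics, not the numbers:

```python
collected = 3629      # projects we could scrape
excluded = 803        # relaunched projects, all of which failed the first time
successes = 1500      # hypothetical count of funded projects, for illustration

observed_rate = successes / collected           # what our data shows
true_rate = successes / (collected + excluded)  # what reality would show
print(f"observed: {observed_rate:.1%}, true: {true_rate:.1%}")
# observed: 41.3%, true: 33.8% -- the missing failures inflate the rate
```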
Yes, this is a problem, but this is the data we have and this is what we will use. That said, I think the data is still credible for a couple of reasons:
- In general, we will be performing statistical tests that compare the average success rate of one group against that of another. Since the 803 missing projects were projects of all types, their absence should affect every group in our analyses in a similar way, so it shouldn't really change the results of those comparisons.
- According to a sample size analysis (this is going to get super geeky – so put on your stats cap for just a moment), having 3,629 pieces of data from a set of 4,432 gives us a 99% confidence level with a confidence interval of just under 2%. (I'll go into more detail about these terms later in this blog; there's also a sketch of the math just below.)
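For those who want to check the math, here's a standard margin-of-error calculation at a 99% confidence level, using the usual worst-case proportion of 0.5. Depending on whether you apply the finite-population correction (which accounts for having sampled most of the population), the margin lands on either side of the ~2% figure quoted above:

```python
import math

N = 4432   # total Tabletop Games projects
n = 3629   # projects we actually collected
z = 2.576  # z-score for a 99% confidence level
p = 0.5    # worst-case proportion (maximizes the margin of error)

moe = z * math.sqrt(p * (1 - p) / n)  # simple formula: ~2.14%
fpc = math.sqrt((N - n) / (N - 1))    # finite-population correction
print(f"without FPC: {moe:.2%}, with FPC: {moe * fpc:.2%}")
# without FPC: 2.14%, with FPC: 0.91%
```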