Sunday, July 13, 2014

05 - Week

This week's readings focused on how big data can be used to identify patterns that are helpful in numerous ways. The Numerati discusses a company that employs mathematicians to sift through massive troves of online data, gathered from people's interactions with technology, to identify patterns and relationships that could be beneficial to advertisers. In one interesting example from the reading, the author discusses a correlation found between people who had just watched a romantic movie and people who clicked a rental car banner ad. Mining big data provides us with these curious insights into human behavior. To make sense of why a correlation exists, further analysis of the data is needed. By analyzing data and understanding the relationships, companies create customer profiles that can be used to target customers with the right mix of products and services.
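To make the idea of "finding a correlation" between two behaviors concrete, here is a minimal Python sketch. It is only an illustration, not the method described in The Numerati: the function name `phi_and_lift` is hypothetical, and the counts are made up purely for demonstration. It computes the phi coefficient (a correlation measure for two yes/no events) and the lift (how much more likely an ad click is among movie watchers than in the whole population).

```python
from math import sqrt


def phi_and_lift(records):
    """Measure association between two yes/no behaviors.

    records: iterable of (watched_romance, clicked_ad) boolean pairs.
    Returns (phi, lift).
    """
    n11 = n10 = n01 = n00 = 0
    for watched, clicked in records:
        if watched and clicked:
            n11 += 1
        elif watched:
            n10 += 1
        elif clicked:
            n01 += 1
        else:
            n00 += 1

    n = n11 + n10 + n01 + n00
    watched_yes, watched_no = n11 + n10, n01 + n00
    clicked_yes, clicked_no = n11 + n01, n10 + n00

    # Phi coefficient: Pearson correlation for a 2x2 table.
    denom = sqrt(watched_yes * watched_no * clicked_yes * clicked_no)
    phi = (n11 * n00 - n10 * n01) / denom if denom else 0.0

    # Lift: P(clicked | watched) divided by P(clicked).
    if watched_yes and clicked_yes:
        lift = (n11 / watched_yes) / (clicked_yes / n)
    else:
        lift = 0.0
    return phi, lift


# Made-up toy data (an assumption for illustration, not real figures).
records = (
    [(True, True)] * 30
    + [(True, False)] * 70
    + [(False, True)] * 10
    + [(False, False)] * 90
)
phi, lift = phi_and_lift(records)
print(f"phi={phi:.2f}, lift={lift:.2f}")
```

A lift above 1 only says the two behaviors occur together more often than chance would suggest. It does not explain why, which is the further analysis the paragraph above calls for.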

Big data analysis is not just for marketing and providing the right product, at the right time, at the right price to a customer. Data mining is also used in the medical industry to understand the causes of diseases. As the "Big Data and Your Health" article discusses, by using large pools of data gathered about patients, researchers can understand the specific genes that cause certain diseases. They can test their hypotheses through experimentation and data mining, and then formulate medical treatment plans and recommendations to lower the probability of (or even prevent) diseases in high-risk patients. In some cases, heightened surveillance is recommended for patients who have a predisposition to a particular disease. To make these recommendations, medical researchers mine the vast quantities of patient data available in hospitals.

There are several algorithms and techniques available to help mine data. With increased computing power and Web 2.0's propensity to collect masses of data, more correlations will be found and investigated. Six years ago, as a graduate student in Computer Science, I published a paper on a data mining algorithm called "Random Forests". This algorithm could be used to learn from data sets that are imbalanced (http://dl.acm.org/citation.cfm?id=1337332). As an analogy, the learner was trained to find a needle in a haystack. This is a classic example of working with big data to find root causes for a hypothesis. Eight years ago, I helped a Gastroenterology fellow at the Cleveland Clinic validate the recommended age for a first-time colonoscopy. Similar to the study mentioned in the "Big Data and Your Health" article, we mined thousands of patient records to identify whether there were any correlations between age, sex, race, gender, income, etc. and the presence of pre-cancerous polyps in the colon. The study ultimately recommended that the first colonoscopy be performed 10 years earlier than the current recommendation (the current recommended age for initial colonoscopy screening is 50). The abstract was published in the American Journal of Gastroenterology (430 Comparison of the Detection Rate of Colorectal Neoplasia by Colonoscopy in Average-Risk Patients Ages 40-49 vs. 50-59 Years).
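The "needle in a haystack" problem with imbalanced data can be sketched in a few lines of Python. What follows is a generic illustration and not necessarily the method from the paper linked above: one common way to adapt tree ensembles to imbalanced data is to give each tree a balanced bootstrap sample, drawing the same number of records from every class. The helper name `balanced_bootstrap` and the toy labels are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def balanced_bootstrap(labels, rng):
    """Return record indices sampled so every class is equally represented.

    Each class contributes as many draws (with replacement) as the
    rarest class has records, so the rare "needles" are not drowned out.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    per_class = min(len(indices) for indices in by_class.values())

    sample = []
    for indices in by_class.values():
        sample.extend(rng.choices(indices, k=per_class))
    rng.shuffle(sample)
    return sample


# Toy haystack (made up): 5 positive records among 1000.
labels = [1] * 5 + [0] * 995
rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = balanced_bootstrap(labels, rng)
counts = Counter(labels[i] for i in sample)
print(counts)
```

In a full ensemble, each tree would be trained on its own balanced sample like this one, and the trees' votes would then be combined.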

Data mining has been a field of study in computer science for the last 10-15 years. The ever-increasing stores of data provide the Numerati with limitless possibilities in several disciplines, from medicine, to advertising, to counterterrorism. Privacy hawks will consider the proliferation of big data an assault on individual privacy. They may have a point, as the case of NSA leaker Edward Snowden has revealed. Ultimately, we have to decide whether the benefit that big data brings us outweighs the potential privacy violations that can occur. The current trend in the Web 2.0 world seems to be headed in the direction of creating ever more data that can be used to benefit (or harm) large segments of humanity.
