Following last Wednesday’s post, I will continue today on the topic of data mining. I found a 2011 video on the Stanford Online YouTube channel in which a Google data engineer talks about data mining. The title of the video is well suited to this blog: Data Mining: The Tool of the Information Age.
Some introductory aspects of the video are worth thinking about. Google Search Scientist and Stanford instructor Rajan Patel makes a point, from the outset, of distinguishing what is from what isn’t data mining, while providing the definition he finds best helps us understand the topic:
Data Mining: Data Mining is the process of automatically discovering useful information in large data repositories.
One other aspect that I thought worth mentioning is the speaker’s Neuroscience background. Brain science is one of the most significant scientific fields of the early 21st century. It is inspiring several other fields of study. Not least, it is also of major significance to recent technological developments, mainly in software, but also in cutting-edge imaging and machine vision hardware. What isn’t often properly recognized is that data-driven approaches are just one of many approaches to understanding the human brain. Data-driven approaches are actually quite new and emergent, but are gaining traction within the broader Neuroscience community. The also relatively new and emergent field of Bioinformatics has been one of the first to recognize that data-driven approaches can structure and inform both theoretical and experimental developments in its own field, and because it is closely related to Neuroscience, this trend had just the right ecosystem for its influence to spill over. After all, biological systems can be viewed as big data systems. One patch of brain tissue, representing a network of, say, 100 nodes, is actually full of data in itself.
Data mining techniques thrive in big, complex data environments. As the speaker notes, the complexity or high dimensionality of the data brings an enhanced potential to uncover new patterns that conventional techniques do not reveal. From this follows the other important distinction, made clearly by Rajan Patel, between descriptive tasks and predictive tasks. Descriptive tasks normally include, for example, clustering or association analysis; classification and regression are typically predictive tasks.
The video’s treatment of classification in medical imaging was valuable. I liked the part on classifying microcalcifications in digital mammography images. Notice the number and significance of the attributes that modern data mining techniques are able to parse through. From this, Rajan moved on to clustering, where identifying attributes and choosing similarity measures are the relevant tasks.
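To make the idea of a similarity measure a little more concrete, here is a minimal sketch in Python. It is not taken from the video: the two measures shown (Euclidean distance and cosine similarity) are common choices I picked for illustration, and the records and their attribute values are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two attribute vectors (smaller = more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two attribute vectors (closer to 1 = more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical records, each described by three numeric attributes.
records = {
    "a": [1.0, 2.0, 3.0],
    "b": [2.0, 4.0, 6.0],  # same direction as "a", twice the magnitude
    "c": [3.0, 0.0, 1.0],
}

for name in ("b", "c"):
    d = euclidean_distance(records["a"], records[name])
    s = cosine_similarity(records["a"], records[name])
    print(f"a vs {name}: distance={d:.3f}, cosine={s:.3f}")
```

The two measures can disagree: records "a" and "b" point in the same direction, so their cosine similarity is essentially 1 even though they are some distance apart. Which measure is appropriate depends on what the attributes mean, which is why choosing it is a task in its own right.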
The next part of the video pointed out the relevance of association rules to marketing, to sales prediction, and to finding what correlations or patterns are associated with a certain buying behavior by consumers. This is especially relevant in modern retail businesses, which increasingly use data mining to make rigorous judgements about their revenue and profit streams, both at present and in forecasting.
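As a rough illustration of how an association rule is usually scored, the sketch below computes the two standard measures, support and confidence, over a handful of shopping baskets. The baskets, the item names and the function names are all made up for this example and do not come from the video.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
print(f"bread -> milk: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A retailer would typically keep only the rules whose support and confidence both exceed chosen thresholds; the thresholds themselves are a business decision rather than something the data dictates.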
Finally, the speaker introduced a list of current data mining challenges:
featured image: Stanford: Data Mining, Data Science Online Courses, Certificate