data mining

(noun)

a technique for searching large-scale databases for patterns; used mainly to find previously unknown correlations between variables that may be commercially useful

Examples of data mining in the following topics:

Data Snooping: Testing Hypotheses Once You've Seen the Data
- Testing hypotheses once you've seen the data may result in inaccurate conclusions.
- The error is particularly prevalent in data mining and machine learning.
- Sometimes, people deliberately test hypotheses once they've seen the data.
- Data snooping (also called data fishing or data dredging) is the inappropriate (sometimes deliberately so) use of data mining to uncover misleading relationships in data.
- Although data-snooping bias can occur in any field that uses data mining, it is of particular concern in finance and medical research, both of which make heavy use of data mining.
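To make the data-snooping risk concrete, here is a minimal sketch that searches many unrelated random variables for the one most correlated with a random target. Everything in it is hypothetical: the function names, sample size, number of candidates, and seed are assumptions chosen for illustration, not taken from any study. Because all variables are independent noise, any strong correlation the search turns up is misleading by construction.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def strongest_spurious_correlation(n_obs=30, n_candidates=200, seed=0):
    """Search pure-noise variables for the one most correlated with a noise target."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_candidates):
        # Each candidate is generated independently of the target.
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        r = pearson(candidate, target)
        if abs(r) > abs(best):
            best = r
    return best


best_r = strongest_spurious_correlation()
print(f"strongest correlation found among pure-noise variables: {best_r:+.2f}")
```

The more candidates are searched, the stronger the best accidental correlation tends to be, which is why a hypothesis suggested by the data should be tested on data that was not used to find it.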
Exploratory Data Analysis (EDA)
- Exploratory data analysis (EDA) is an approach to analyzing data sets in order to summarize their main characteristics, often with visual methods.
- Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data and possibly formulate hypotheses that could lead to new data collection and experiments.
- Tukey promoted the use of the five-number summary of numerical data: the two extremes (minimum and maximum), the median, and the quartiles.
- Many EDA techniques have been adopted into data mining and are being taught to young students as a way to introduce them to statistical thinking.
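The five-number summary that Tukey promoted is commonly taken to be the minimum, lower quartile, median, upper quartile, and maximum. The sketch below computes it with Python's standard library. The function name and sample values are hypothetical, and the choice of the "inclusive" quartile method is an assumption: several quartile conventions exist and they can give slightly different answers on small samples.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two observations")
    # 'inclusive' treats the data as the whole population of interest;
    # other quartile conventions may differ slightly.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]


# Made-up sample for illustration only.
summary = five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6])
print(summary)
```

These five values are also the ones a box plot draws, which is one reason the summary works well as a first look at a numerical variable.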
Applications of Statistics
- In calculating the arithmetic mean of a sample, for example, the algorithm works by summing all the data values observed in the sample and then dividing this sum by the number of data items.
- Statistical methods can summarize or describe a collection of data.
- These inferences may take the form of: answering yes/no questions about the data (hypothesis testing), estimating numerical characteristics of the data (estimation), describing associations within the data (correlation) and modeling relationships within the data (for example, using regression analysis).
- It can include extrapolation and interpolation of time series or spatial data and can also include data mining.
- This boxplot represents Michelson and Morley's data on the speed of light.
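The arithmetic-mean procedure described above (sum all observed values, then divide by the number of data items) can be sketched as follows. The function name and the sample values are hypothetical and chosen only for illustration.

```python
def arithmetic_mean(values):
    """Sum the observed values, then divide by how many there are."""
    data = list(values)
    if not data:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(data) / len(data)


# Made-up sample for illustration only.
sample_mean = arithmetic_mean([2, 4, 6, 8])
print(sample_mean)
```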
Fundamentals of Statistics
- Data collected about this kind of "population" constitutes what is called a time series.
- Numerical descriptors include mean and standard deviation for continuous data types (like heights or weights), while frequency and percentages are more useful in terms of describing categorical data (like race).
- These inferences may take the form of: answering yes/no questions about the data (hypothesis testing), estimating numerical characteristics of the data (estimation), describing associations within the data (correlation) and modeling relationships within the data (for example, using regression analysis).
- It can include extrapolation and interpolation of time series or spatial data, and can also include data mining.
Stepwise Regression
- Because stepwise regression searches a large space of possible models, it is prone to overfitting the data.
- This method is particularly valuable when data is collected in different settings.
- Stepwise regression procedures are used in data mining, but are controversial.
- The tests themselves are biased, since they are based on the same data.
- The models that are created may be oversimplifications of the real relationships in the data.
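The concern that selection tests are biased because they reuse the same data can be illustrated with a small simulation. This is a sketch under stated assumptions, not a full stepwise procedure: it performs only the first step of forward selection (pick the single predictor with the highest squared correlation with the response), every variable is independent noise, and the function names, sizes, and seed are hypothetical. It compares the fit of the selected predictor on the data used to select it with the fit on fresh data.

```python
import random
import statistics


def r_squared(x, y):
    """Squared Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def noise(rng, n):
    """n independent standard-normal draws."""
    return [rng.gauss(0, 1) for _ in range(n)]


def selection_bias(n_obs=40, n_predictors=50, n_trials=100, seed=1):
    """Average fit of the selected predictor: same data versus fresh data."""
    rng = random.Random(seed)
    in_sample, fresh = [], []
    for _ in range(n_trials):
        y = noise(rng, n_obs)
        # First forward-selection step: keep the best-fitting predictor.
        best_fit = max(r_squared(noise(rng, n_obs), y) for _ in range(n_predictors))
        in_sample.append(best_fit)
        # The chosen predictor is pure noise, so on new observations it is
        # again just an independent draw alongside a new response.
        fresh.append(r_squared(noise(rng, n_obs), noise(rng, n_obs)))
    return statistics.fmean(in_sample), statistics.fmean(fresh)


mean_in_sample, mean_fresh = selection_bias()
print(f"mean R^2 on the selection data: {mean_in_sample:.3f}")
print(f"mean R^2 on fresh data:         {mean_fresh:.3f}")
```

The gap between the two averages is the optimism introduced by selecting and evaluating on the same observations, which is why held-out data is often used to check a stepwise model.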
Distorting the Truth with Descriptive Statistics
- Reporting bias involves a skew in the availability of data, such that observations of a certain kind may be more likely to be reported and consequently used in research.
- Descriptive statistics is a powerful form of research because it collects and summarizes vast amounts of data and information in a manageable and organized manner.
- correlate (associate) data or model any type of statistical relationship among variables;
- In other words, every time you try to describe a large set of observations with a single descriptive statistic, you run the risk of distorting the original data or losing important detail.
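As a small illustration of how a single descriptive statistic can hide detail, the sketch below compares two made-up data sets that share the same mean but differ completely in spread. The variable names and values are hypothetical.

```python
import statistics

# Two made-up samples with identical means.
steady = [50, 50, 50, 50, 50]
spread = [0, 25, 50, 75, 100]

mean_steady = statistics.fmean(steady)
mean_spread = statistics.fmean(spread)

# Population standard deviation shows the difference the mean conceals.
sd_steady = statistics.pstdev(steady)
sd_spread = statistics.pstdev(spread)

print(f"means: {mean_steady} and {mean_spread}")
print(f"standard deviations: {sd_steady:.2f} and {sd_spread:.2f}")
```

Reporting only the mean would make these two samples look identical, so a measure of spread (or a plot) is usually worth reporting alongside it.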
Examining numerical data exercises
- Data were collected on life spans (in years) and gestation lengths (in days) for 62 mammals.
- Workers at a particular mining site receive an average of 35 days paid vacation, which is lower than the national average.
- Exercise 1.6 introduces a data set on the smoking habits of UK residents.
- Create a box plot for the data given in Exercise 1.30.
- (d) The time series plot shown below is another way to look at these data.
Confounding
- Beyond these factors, researchers may not consider or have access to data on other causal factors.
- Smoking and confounding are reviewed in occupational risk assessments such as the safety of coal mining.
Observations, variables, and data matrices
- These observations will be referred to as the email50 data set, and they are a random sample from a larger data set that we will see in Section 1.7.
- The data in Table 1.3 represent a data matrix, which is a common way to organize data.
- Data matrices are a convenient way to record and store data.
- How might these data be organized in a data matrix?
- These data were collected from the US Census website.
Optional Collaborative Classroom Exercise
- The science of statistics deals with the collection, analysis, interpretation, and presentation of data. We see and use data in our everyday lives.
- Your instructor will record the data.
- For example, consider the following data:
- Where do your data appear to cluster?
- Effective interpretation of data (inference) is based on good procedures for producing data and thoughtful examination of the data.
