What Is Data Analysis
With the advent of the mobile Internet era, and especially the continuing development of technologies such as virtual reality, artificial intelligence, the Internet of Things, and the Internet of Vehicles, today's world depends increasingly on information technology, and massive amounts of data are generated and stored every day.
There are many sources of data. Besides the large volumes generated by automatic detection systems, sensors, and scientific instruments in production processes, everyday activities such as shopping online, booking tickets, sending WeChat messages, and posting on Weibo also produce large amounts of data. The process of handling this massive data and extracting valuable information from it is data analysis.
Data analysis refers to the process of examining a large amount of collected raw data with appropriate statistical methods, studying and summarizing the data in detail in order to extract useful information and form conclusions. Its purpose is to extract and analyze information that is not easy to infer directly. Once this information is understood, it becomes possible to study the mechanisms of the system that produced the data and thereby predict the system's possible responses and evolution.
Data analysis, originally used mainly for data protection, has evolved into a methodology for data modeling. Modeling means translating the system under study into mathematical form; once a mathematical or logical model is established, the system's responses can be predicted with varying degrees of accuracy. The predictive power of a model depends not only on the quality of the modeling but also on the ability to select high-quality datasets for analysis.
Therefore, preprocessing tasks such as data acquisition, data extraction, and data preparation also fall within the scope of data analysis, and they have an important impact on the final result.
In data analysis, there is no better way to understand data than to turn it into a visualization that conveys the (sometimes hidden) information contained in the numbers. Data analysis therefore yields not just a model but also its graphical presentation. The model is tested against a dataset with known outputs, which measures its ability to predict the response of the system under study.
This data is not used to build the model but to check whether the model can reproduce the actual observed outputs, so as to measure the model's error and understand its validity and limitations. A new model is then compared with the original, and if the new model performs better, the final step of data analysis, deployment, can take place. In the deployment stage, the model's prediction results are presented, the corresponding decisions are implemented, and the potential risks predicted by the model are guarded against.
The Process of Data Analysis
The process of data analysis can be described as the following steps: transforming and processing the raw data, presenting the data visually, and modeling and making predictions, where each step is critical to those that follow. Data analysis can thus be summarized as problem definition, data collection, data preprocessing, data exploration, data visualization, creation and selection of predictive models, model evaluation, and deployment.
1. Problem Definition
Before data analysis begins, its goals must be made clear: the main problems this analysis will study, the expected outcomes, and so on. This is called problem definition.
Defining the problem precisely is possible only after the system under study has been explored in depth. A comprehensive or exhaustive study of a system can be complex, and there may not be enough information at the outset. The definition of the problem, and especially its planning, therefore serves as the guideline followed throughout the data analysis project.
The problem definition produces a set of related documents. Once the problem is defined and documented, the project planning phase can begin. This phase works out which professionals and resources are needed to complete the project efficiently, for example finding experts in the relevant fields and setting up data analysis software. An efficient data analysis team should therefore be assembled during project planning. In general, the team should be interdisciplinary, because looking at data from different perspectives helps solve problems.
2. Data Collection
After the problem definition phase, the data must be acquired before it can be analyzed. Data selection must serve the goal of building a predictive model, and it plays a crucial role in the success of the analysis. The sample data collected should reflect the actual situation as fully as possible, that is, it should describe the system's responses to real-world stimuli.
If inappropriate data is selected, or the analysis is performed on a dataset that does not represent the system well, the resulting model will deviate from the system under study. For example, to explore air quality trends in Beijing, you would collect recent years of Beijing air quality data, weather data, and even factory environmental data, gas emission data, and important event schedules; to analyze the key factors affecting a company's sales, you would retrieve the company's historical sales data, user data, advertising data, and so on.
The data acquisition methods are as follows:
① Use SQL statements to retrieve relevant business data directly from the enterprise's database, for example extracting all sales data for 2017 and the top 20 products by sales, or extracting the consumption data of users in East China, South China, and West China.
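As a minimal sketch of this approach, the following Python snippet builds a small in-memory SQLite database standing in for an enterprise sales table (the `sales` table name and its columns are hypothetical) and retrieves the top-selling products of 2017 with a SQL query:

```python
import sqlite3

# Build a small in-memory database standing in for an enterprise sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, region TEXT, amount REAL, year INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Widget A", "East China", 1200.0, 2017),
        ("Widget B", "South China", 800.0, 2017),
        ("Widget A", "East China", 950.0, 2016),
    ],
)

# Extract all 2017 sales, ordered by amount, keeping only the top 20 sellers.
top_2017 = conn.execute(
    "SELECT product, amount FROM sales WHERE year = 2017 "
    "ORDER BY amount DESC LIMIT 20"
).fetchall()
print(top_2017)  # → [('Widget A', 1200.0), ('Widget B', 800.0)]
```

Against a real enterprise database, only the connection setup would change; the SQL query itself stays the same.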
② Download public datasets released by research institutions, enterprises, and governments from their websites. These datasets are usually well curated and of relatively high quality. The drawback is that their release usually lags behind events, but their objectivity and authority still make them very valuable.
③ Write a web crawler to collect data from the Internet. For example, product sales and review information on Taobao, rental listings for a city on rental websites, movie lists and ratings on Douban, and comment rankings on NetEase Cloud Music can all be obtained with crawlers. Analyzing data crawled from the Internet for a particular industry or group of people is a very precise way to do market research and competitive product analysis.
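A real crawler would first download pages over HTTP; as a self-contained sketch, the snippet below parses a saved HTML fragment (the `name`/`rating` class names are hypothetical) with Python's standard `html.parser` to extract movie ratings of the kind found on a ratings site:

```python
from html.parser import HTMLParser

# A saved HTML snippet standing in for a fetched listing page; a real
# crawler would first download such pages (e.g. with urllib.request).
PAGE = """
<ul>
  <li class="item"><span class="name">Movie A</span><span class="rating">9.2</span></li>
  <li class="item"><span class="name">Movie B</span><span class="rating">8.7</span></li>
</ul>
"""

class RatingParser(HTMLParser):
    """Collect (name, rating) pairs from <span class="name"> / <span class="rating"> tags."""
    def __init__(self):
        super().__init__()
        self.field = None      # class of the span we are currently inside, if any
        self.records = []      # finished (name, rating) pairs
        self._name = None
    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.field = dict(attrs).get("class")
    def handle_data(self, data):
        if self.field == "name":
            self._name = data
        elif self.field == "rating":
            self.records.append((self._name, float(data)))
    def handle_endtag(self, tag):
        if tag == "span":
            self.field = None

parser = RatingParser()
parser.feed(PAGE)
print(parser.records)  # → [('Movie A', 9.2), ('Movie B', 8.7)]
```

In practice, crawlers must also respect robots.txt, rate limits, and the site's terms of service.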
Although data collection cannot always obtain all the required data, useful information can still be extracted from the limited data available.
3. Data Preprocessing
Most of the data obtained through collection is incomplete, inconsistent "dirty data" that cannot be analyzed directly; using it as-is leads to unsatisfactory results. Data preprocessing turns the raw data obtained in the collection stage into "clean" data through data cleaning and data transformation, and with this "clean" data more accurate analysis results can be obtained.
Data cleaning is the process of re-examining and verifying data in order to remove duplicate information, correct errors, check consistency, and handle invalid and missing values. For example, in air quality data, many days may be missing because of equipment problems, some records may be duplicated, and some may be invalid because of equipment failure. Whether to delete such incomplete data outright or to fill it in with adjacent values is exactly the kind of question that must be considered.
Data transformation is the process of converting data from one representation to another, such as converting date formats or units of measurement.
In addition, computing basic descriptive statistics and drawing basic statistical graphics can help find missing values and outliers.
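The cleaning steps above can be sketched in Python. The snippet below works on hypothetical daily air-quality readings, removing a duplicated record, filling missing values with the adjacent observation, and using basic descriptive statistics to flag outliers (whether these particular choices are appropriate depends on the dataset):

```python
from statistics import mean, stdev

# Hypothetical daily readings: None marks days the equipment failed to
# record, and the first day was accidentally logged twice.
raw = [42.0, 42.0, 55.0, None, 61.0, None, 300.0, 58.0]

# Remove consecutive duplicate records.
deduped = [v for i, v in enumerate(raw) if i == 0 or v != raw[i - 1]]

# Fill each missing value with the adjacent (previous) observation.
cleaned = []
for v in deduped:
    cleaned.append(v if v is not None else cleaned[-1])

# Basic descriptive statistics flag suspicious values
# (here: more than 2 standard deviations from the mean).
m, s = mean(cleaned), stdev(cleaned)
outliers = [v for v in cleaned if abs(v - m) > 2 * s]
print(cleaned)    # → [42.0, 55.0, 55.0, 61.0, 61.0, 300.0, 58.0]
print(outliers)   # → [300.0]
```

The flagged value (300.0) would then be examined: it might be a genuine pollution spike or an equipment fault to be corrected.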
4. Data Exploration and Data Visualization
The essence of data exploration is to examine data through graphs and statistics in order to discover patterns, connections, and relationships in it. Data visualization is one of the best ways to extract information: presenting data visually not only makes key information quickly graspable but can also reveal patterns and conclusions that simple statistics cannot.
Data exploration includes preliminary data inspection; determining the type of data, that is, categorical data or numerical data; and selecting the most suitable data analysis method to define the model.
Typically, in addition to a detailed study of the graphs obtained with visualization methods, this stage may include one or more of the following activities.
- Summarize the data.
- Group the data.
- Explore relationships between different attributes.
- Identify patterns and trends.
- Build a regression model.
- Build a classification model.
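As an illustration of the summarizing and grouping activities above, the following sketch groups hypothetical (city, AQI) observations and condenses each group into its mean:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (city, AQI) observations; grouping and summarizing
# condenses them into one descriptive figure per group.
readings = [
    ("Beijing", 120), ("Beijing", 90), ("Shanghai", 60),
    ("Shanghai", 80), ("Beijing", 150),
]

# Group the observations by city.
groups = defaultdict(list)
for city, aqi in readings:
    groups[city].append(aqi)

# Summarize each group with its mean: a compact description that keeps
# the essential information while discarding the raw detail.
summary = {city: mean(values) for city, values in groups.items()}
print(summary)
```

The same grouping-then-aggregating pattern underlies the exploration of relationships between attributes, only with richer statistics than a single mean.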
Often, data analysis requires summarization. During summarization, the data is condensed into a description of the system without losing important information.
Clustering is a data analysis method used to find groups of common attributes.
Another important step in data analysis is identifying relationships, trends, and anomalies in the data. Finding this information requires appropriate tools as well as careful study of the resulting visualizations.
Other data mining methods, such as decision tree or association rule mining, automatically extract important facts or rules from data. These methods can be used in conjunction with data visualization to discover the various relationships that exist between data.
5. Creation and selection of predictive models
A predictive model is a quantitative relationship between things, described in mathematical language or formulas and used for prediction. It reveals, to some extent, the inherent regularity between things and serves as the direct basis for computing predicted values. In the model creation and selection phase of data analysis, an appropriate statistical model is created or selected to predict the probability of an outcome.
Specifically, the model is mainly used in the following two aspects.
- Use regression models to predict the value of data produced by the system.
- Classify new data using a classification model or a clustering model.
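As a sketch of the first use, the snippet below fits a simple one-variable regression model by ordinary least squares to hypothetical (advertising spend, sales) pairs and uses it to predict a new value:

```python
# Hypothetical observations: x = advertising spend, y = resulting sales.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 8.0]

# Ordinary least squares for y = a*x + b:
# a = cov(x, y) / var(x), b = mean(y) - a * mean(x).
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
b = mean_y - a * mean_x

def predict(x):
    """Predict the system's output for a new input value."""
    return a * x + b

print(round(predict(5.0), 2))  # → 10.0
```

Real predictive models usually involve many variables and more robust fitting procedures, but the structure (fit on observed data, then predict unseen inputs) is the same.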
In fact, according to the type of output results, the models can be divided into the following 3 types.
- Classification model: The output of the model is categorical data.
- Regression model: The output of the model is numerical data.
- Clustering model: The output of the model is descriptive data.
Simple methods for generating these models include linear regression, logistic regression, classification and regression trees, and the k-nearest neighbors algorithm. There are many kinds of analysis methods, however, each best suited to handling and analyzing specific types of data. Each method generates a particular kind of model, and the choice of method depends on the desired characteristics of the model.
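As one example of these simple methods, the following is a minimal k-nearest neighbors classifier over hypothetical two-feature samples:

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    nearest = sorted(train, key=lambda item: sq_dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature samples with categorical labels.
train = [
    ((1.0, 1.0), "low"), ((1.2, 0.8), "low"), ((0.9, 1.1), "low"),
    ((5.0, 5.0), "high"), ((5.2, 4.8), "high"), ((4.9, 5.1), "high"),
]
print(knn_predict(train, (1.1, 1.0)))  # → low
print(knn_predict(train, (5.1, 5.0)))  # → high
```

k-nearest neighbors is a good example of a "black box" in the sense described below: it often predicts well, yet it offers little insight into why the classes are separated.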
Some models output predicted values consistent with the system's actual behavior, and their structure explains certain characteristics of the studied system concisely and clearly. Other models can also give correct predictions, but their structure is a "black box" with limited ability to explain the system's characteristics.
6. Model Evaluation
The model evaluation phase is also called the testing phase. In this phase, part of the original dataset is set aside as a validation set, which is used to evaluate whether the model built from the previously collected data is effective.
In general, the data used for modeling is called the training set, and the data used to validate the model is called the validation set.
By comparing the model's output with that of the actual system, the error rate can be assessed. Using different test sets, the model's range of validity can be derived: predictions are typically valid only within a certain range, or their accuracy varies with the range of the predicted values.
Model evaluation not only measures exactly how effective a model is but also allows it to be compared with other models. Among the many evaluation techniques, the best known is cross-validation. Its basic operation is to divide the training data into several parts, with each part in turn serving as the validation set while the rest serve as the training set. Iterating in this way yields the best model.
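Cross-validation can be sketched in a few lines of Python. The snippet below splits hypothetical one-dimensional data into three folds, uses a deliberately trivial "model" (the training mean) for illustration, and measures the mean squared error on each held-out fold:

```python
from statistics import mean

def k_fold_splits(data, k):
    """Yield (train, validation) pairs; each fold serves once as the validation set."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation

# Hypothetical one-dimensional observations.
data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

errors = []
for train, validation in k_fold_splits(data, k=3):
    model = mean(train)  # "fit" the trivial model on the training folds
    mse = mean((v - model) ** 2 for v in validation)
    errors.append(mse)

print(errors)  # → [18.0, 9.0, 18.0]
```

Averaging the per-fold errors gives an overall estimate of how the model generalizes; comparing these averages across candidate models guides model selection.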
7. Deployment
The final step of data analysis is deployment, which aims to present the results, that is, the conclusions of the analysis. If the application scenario is commercial, deployment converts the analysis results into solutions that benefit the customers who purchased the data analysis service; if it is scientific or technical, the results are converted into design proposals or scientific and technical publications. In other words, deployment is essentially putting the results of the data analysis into practice.
The results of data analysis can be deployed in various ways; this stage is often also called data report writing. A data report should describe the following points in detail.
- Analysis results.
- Deployment decisions.
- Risk analysis.
- Business impact assessment.
If the output of the project includes generating predictive models, these models can be deployed as stand-alone applications or integrated into other software.
The Role of Data Analysis
At present, both Internet companies and traditional companies need data analysis. When an enterprise makes a business decision or launches a new product, it uses data analysis to integrate and summarize scattered data and determine a concrete direction from it. In the business analysis of enterprises, data analysis serves three functions: analyzing the status quo, analyzing causes, and predicting the future.
The so-called status quo has two meanings: what has already happened, and what is happening now. By analyzing an enterprise's basic weekly or monthly reports, you can understand its overall operation, discover problems in that operation, and grasp its current situation.
If the analysis of the current situation reveals a hidden danger in the enterprise, that danger must then be analyzed: why it exists and how it arose.
After analyzing the current situation and its causes, predictive analysis is needed: using the data now available to predict future development trends.
In short, these three functions analyze the enterprise's past operation, examine the hidden dangers of the present, and predict its future development.