Data Security Assignment: Security & Privacy In Data Mining Techniques
Question
Task:
You are required to complete a data security assignment regarding the software vulnerability detection, consisting of 2500-4000 words on Data mining (unsupervised learning) techniques.
Answer
Abstract
Numerous data-oriented technologies have been developed in response to the growing use of data sets, and one of them, data mining, is explored in this data security assignment. Data mining can be carried out with unsupervised learning techniques. These techniques carry risks and challenges, and security and privacy concerns can arise in data mining applications that rely on unsupervised learning.
Keywords: Data mining, unsupervised learning, security, privacy
Introduction
Data mining refers to the technique of identifying anomalies and patterns in massive data sets in order to predict outcomes. A wide range of data mining techniques is available to business firms to enhance their decision-making capabilities and earn better revenues [1].
Advances in technology have made it possible to manage data with automated systems and mechanisms, so time-consuming manual processes are no longer required to analyse data sets. As data sets grow more complex, so does the potential to manage the information they contain. The relationships discovered between data sets are being used in a number of business and industrial sectors, such as the retail industry, banking firms, manufacturing units, and telecommunications [2]. Data mining is an umbrella term that covers a large number of techniques and concepts. Broadly, data mining techniques are classified into two major categories: supervised and unsupervised learning. Supervised learning is learning by example: the system attempts to determine specific concepts and descriptions on the basis of pre-classified examples. In unsupervised learning, such examples are not present, which makes the learning process more difficult; specific patterns and concepts must be determined without labelled data sets. Specific techniques from both categories are used to carry out data mining activities [3].
Existing Methods and Techniques for Unsupervised Learning – Data Mining Techniques
Clustering
Clustering is one of the unsupervised learning techniques used for exploring data sets. Some data sets do not have default or obvious groupings. In such cases, clustering algorithms can be used to find the natural groupings and determine the clusters present in the data [4]. A cluster is a collection of data objects that are similar to one another. A suitable clustering method produces high-quality clusters, in which the similarity between objects in different clusters (inter-cluster similarity) is low and the similarity between objects in the same cluster (intra-cluster similarity) is high.
The data mining process involves a number of steps. Clustering is often used as a data pre-processing step so that homogeneous groups can be identified effectively. The results are not guided by or based on labelled examples, which is what distinguishes clustering from supervised learning models. The technique works on the basis of optimization criteria that determine high and low similarity between the clusters [5], and data points are allotted to clusters on the basis of the outcomes. When clustering is used for organizational data mining, each cluster is described by its centroid, its attribute histograms, and its position in the hierarchical cluster tree. Several algorithms can be used to implement clustering for data mining; k-means and orthogonal partitioning are among the primary ones. The clusters identified by these algorithms are used to determine the primary characteristics of the data sets [6].
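The k-means idea mentioned above can be illustrated with a minimal sketch. It assumes two-dimensional numeric points and squared Euclidean distance; the function name `kmeans`, its parameters, and the sample points are illustrative choices and do not come from any particular data mining product.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means (Lloyd's algorithm) on a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for members, old in zip(clusters, centroids):
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two visibly separate groups of points.
points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
centroids, clusters = kmeans(points, k=2)
```

No labels are supplied at any point; the grouping emerges only from the distances between the points, which is the property that separates clustering from supervised models.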
Association
Another unsupervised learning technique used for data mining is association. It is usually applied to find relationships and specific patterns in data sets. Business firms can carry out market basket analysis using association, which helps discover data patterns that can then be used to develop marketing strategies. The retail industry, for instance, can make use of unsupervised data mining through association [7]. Barcode technology is now widely used in retail to monitor and collect sales information. Association models can be applied to the sales data captured with barcodes for cross-marketing and promotional operations, and the patterns identified can also be used for customer segmentation or target marketing.
Earlier, association models were used to determine customer trends through the analysis of customer transactions. With advances in technology, it is now also possible to predict web access patterns for personalization. This enhances the level of interaction with customers and can strengthen customer loyalty to the business. E-commerce firms in particular use such patterns and trends to link web pages according to customer interests and preferences [8].
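Market basket analysis of the kind described above is commonly expressed through two measures: support (the share of transactions that contain an item set) and confidence (how often the right-hand item appears when the left-hand item does). The sketch below is a simplified illustration limited to item pairs; the function name `pair_rules`, the thresholds, and the sample baskets are assumptions made for the example.

```python
from itertools import combinations


def pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) for item pairs that pass both thresholds."""
    baskets = [set(t) for t in transactions]
    n = len(baskets)
    items = sorted(set().union(*baskets))

    def support(itemset):
        # Fraction of baskets that contain every item in the item set.
        return sum(1 for b in baskets if itemset <= b) / n

    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue  # the pair is too rare to be interesting
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules


# Hypothetical sales data, one list per scanned basket.
sales = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]
rules = pair_rules(sales)
```

A rule such as butter implying bread can then feed cross-marketing decisions, for example placing the two products together in a promotion.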
Feature Extraction
Another technique commonly used for unsupervised learning is feature extraction. It creates new features that are derived from the original data [9]. A feature is a combination of attributes that is of particular interest to users and that captures significant characteristics of the data set. The technique has several applications in data mining, such as data decomposition, semantic analysis, and pattern recognition, and it can improve the effectiveness of the overall process. For example, it may be used to identify the themes in a document collection: each document is represented by keywords, and the features are expressed as the identified keywords together with their frequencies [10].
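The keyword-and-frequency representation of documents can be sketched in a few lines. This is a minimal illustration that assumes English text and a very small, hypothetical stop-word list; `keyword_features` and `STOPWORDS` are names introduced for the example only.

```python
import re
from collections import Counter

# Illustrative stop-word list; a real application would use a fuller one.
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to", "for"}


def keyword_features(documents, top_n=3):
    """Represent each document by its most frequent keywords and their counts."""
    features = []
    for text in documents:
        tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
        features.append(Counter(tokens).most_common(top_n))
    return features


docs = ["Privacy and privacy of data", "mining the data mining"]
features = keyword_features(docs, top_n=1)
```

Each document is reduced to a short list of (keyword, frequency) pairs, which can then serve as input to clustering or theme detection.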
Significance of Unsupervised Data Mining
The unsupervised techniques in data mining offer a wide range of benefits. The use of clustering technique provides the ability to automatically split the data sets in specific groups. The grouping is performed on the basis of the similarities in the data sets. This can be used by the business firms to make strategies and business decisions.
A large number of transactions take place over networking mediums and channels. A few unusual data points may be present in such data sets, and these anomalies can be detected using unsupervised learning techniques [11], which helps identify fraudulent transactions. Patterns determined using association models can assist marketing, advertising, and customer interaction processes.
Latent variable models are used in data pre-processing, where the number of features in a data set can be reduced or decomposed for more effective analysis. Unsupervised learning techniques therefore offer numerous benefits.
Literature Review – Security & Privacy in Unsupervised Data Mining
There are security and privacy concerns that are determined for a majority of the automated technologies and concepts. The increase in the use of the automated technologies and concepts has also resulted in the increase in the data requirements. The frequency of the data operations has also increased with the enhancement of the data usage. This has also led to the increase in the privacy issues and concerns. This has resulted in privacy emerging as an organizational and a government issue that needs to be resolved [12].
Data mining is becoming more popular as the use of data sets increases. Large volumes of data are collected from varied sources and include personal details, and this personal and critical information is then analysed. While analysing such data sets, it is essential to make sure that the privacy of the data owners is not violated. Insights drawn from the data sets can be used to enhance revenues. Unsupervised learning techniques, such as clustering and association, are often used to analyse massive amounts of information through data mining tools, and this has also increased the occurrence of privacy attacks [13]. A number of research studies suggest the use of privacy-preserving data mining techniques, in which knowledge is extracted while the user’s privacy is preserved. Data mining is also referred to as knowledge discovery in databases (KDD). It provides a mechanism for extracting implicit and previously unknown information from databases. While creating the knowledge sets, it is also necessary to make sure that data privacy is effectively preserved.
Technical and implementation concerns associated with data mining often lead to security and privacy issues. Researchers have also highlighted social issues with data mining that lead to violations of data privacy and security and put individual privacy at risk. For instance, data mining is common in e-commerce businesses and applications, where it is applied to determine customer interests and preferences. This information can be misused, and confidentiality violations may occur. Data integrity issues are also common with data mining techniques [14]. When data is analysed using unsupervised techniques, such as association or clustering, problems in the data sets themselves may lead to redundancy or conflicts. For instance, banking applications that rely on data mining store users' credit card details in banking databases. The address or contact details of a user may vary across these databases, which can lead to integration issues. Security and privacy incidents involving the data sets can also lead to financial losses.
Unsupervised data mining techniques use numerous queries, and these queries may differ from one technique or algorithm to another. With the increasing reliance on data mining operations and outcomes, it has become crucial to enhance the power of the database queries being used. Data mining queries may be used to extract hidden information, and network-based security issues or injection attacks may occur in the process [15]. These security and privacy issues can have significant impacts because the attack surface involved is large, and it may not be easy to prevent them. Data mining techniques provide several benefits to end-users, such as the ability to assess data sets accurately or to analyse massive amounts of information within a few minutes. This is resulting in the increased use of data mining in business applications and operations. However, the predictive information extracted using data mining techniques may be misused by attackers.
The research studies also determine the role of the end-users in the unsupervised learning and mining procedures. The end-users now have the awareness regarding the use of automated systems and technology in the handling of the data sets and in the execution of the business operations. The consumers are now aware of the data that they shall share over the networking channels. This has also enabled the business firms to make changes in the ways of extracting and using the customer information. However, there are newer forms of attacks that often come up which result in the occurrence of the security and privacy attacks. These may include the different forms of social engineering and impersonation techniques. The users are tricked to share the confidential information [16].
Analysis – Examples & Challenges
Specific security concerns may arise with data mining techniques based on unsupervised learning. Data mining tools and applications hold massive amounts of data, which are stored on platforms protected by security measures such as user IDs and passwords or antivirus software. However, additional vulnerabilities may be present that increase the chances of the information being compromised, and existing access control measures may not be effective against all forms of security and privacy attacks. The literature review showcases a number of security and privacy concerns associated with data mining and unsupervised learning techniques. The use of data mining applications is increasing with every passing day, resulting in an increased frequency of security and privacy attacks [17].
The privacy issues and concerns are significant challenges as the collection and analysis of information involved in the clustering or association techniques is done for the business-oriented applications. The information is also shared from one platform to the other using the networking channels which expose the information to the security attacks. It may be the unstructured information being shared for the clustering procedures or the prediction models and details obtained after the application of the association techniques [18]. The exposure of any such information can have significant security concerns which may lead to the violation of the security and privacy of the information sets.
Despite the research conducted on security and privacy concerns around data mining (unsupervised learning), a number of theoretical and practical challenges remain. Technologies such as data mining and machine learning are still emerging, and many of their aspects are not yet properly understood. New theoretical information emerges regularly, which makes it difficult to implement the security or privacy controls that are designed. Unsupervised learning techniques do not rely upon labelled data and examples, and they must ensure the privacy of the data sets before they can be used for critical applications. For example, applications in the finance and healthcare sectors hold confidential and critical information and cannot afford privacy violations or security attacks. Significant controls must be applied to these data sets so that the privacy of the data is always preserved [19]. However, the involvement of multiple data sources and the use of a large number of networking techniques in unsupervised data mining make it complex to ensure the security and privacy of the information sets. Large-scale systems require frequent data integration, linkage of data sets, and information sharing, which makes it difficult to keep the data secure from attacks.
Additional challenges and limitations have been identified for unsupervised data mining. It is not easy to define a normal region in unsupervised learning, because it is difficult to establish a single region that covers all normal behaviour. The boundary between normal and anomalous behaviour is extremely fine, and an observation that lies near the boundary may in fact be normal. When malicious entities adapt their activities to make anomalous behaviour appear normal, significant complications can arise [20]. Normal behaviour is also not static; it evolves continuously, so the current notion of normal may not hold in the future. In addition, there is a wide range of application domains, and these may have varied notions of anomaly. For example, the tolerance for fluctuations may be high in marketing applications, while the same fluctuations are not acceptable in the healthcare domain. As a result, security and privacy protocols cannot be generic techniques that apply to all unsupervised data mining techniques.
Therefore, the anomaly detection issue is difficult to deal with. It becomes even more challenging in the case of unsupervised learning as there is no labelled data that is present.
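The dependence of anomaly detection on a domain-specific tolerance can be shown with a small, label-free detector. The sketch below uses the modified z-score based on the median and the median absolute deviation (MAD), which is one common choice among many and is not prescribed by the studies cited here. The function name `flag_anomalies`, the default threshold of 3.5, and the transaction amounts are assumptions made for illustration.

```python
from statistics import median


def flag_anomalies(values, threshold=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds the threshold."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return [False] * len(values)  # no spread, so nothing can be ranked as unusual
    # 0.6745 is the conventional scaling constant for the modified z-score.
    return [0.6745 * abs(v - center) / mad > threshold for v in values]


# Hypothetical transaction amounts with one obvious outlier.
amounts = [20, 22, 19, 21, 23, 20, 500]
tolerant = flag_anomalies(amounts)               # a domain that tolerates fluctuation
strict = flag_anomalies(amounts, threshold=1.0)  # a domain that does not
```

With the tolerant threshold only the value 500 is flagged, while the strict threshold also flags 19 and 23. The same data therefore yields different anomalies depending on the tolerance chosen for the application domain.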
Privacy Preserving Techniques
There are a number of privacy preserving techniques for unsupervised data mining that are identified which can be used to preserve the privacy and security of the overall information and data sets.
Cryptographic approaches are preferred for unsupervised data mining concepts and applications. With such protocols, higher efficiency can be obtained while the privacy of the data owners' information is preserved, and the data owners or information sources are not required to remain available over the network at all times. These techniques are suited to horizontally partitioned data, where the same features are extracted for different data objects [21]. Face recognition is an example of such an application: the different data owners extract the same set of feature vectors. Homomorphic encryption can be used with unsupervised data mining techniques such as feature extraction, because it enables computation to be carried out on encrypted data sets. Specific operations, such as addition or multiplication, can be applied and combined to evaluate more complex functions. Currently, only a limited set of applications and operations can be carried out using homomorphic encryption. The technique can be extended by using advanced homomorphic encryption algorithms, and additive homomorphic schemes can be extended with data packing mechanisms. The data owners can also take part in such mechanisms: they encrypt their data with the public key of the privacy service provider and send the encrypted data to that provider.
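Additive homomorphic encryption can be illustrated with a toy version of the Paillier cryptosystem, in which multiplying two ciphertexts yields an encryption of the sum of the plaintexts. The sketch below is for illustration only: the primes are far too small to be secure, the function names are hypothetical, and the text above does not name Paillier as the scheme used in the cited work. It assumes Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import random


def keygen(p=1789, q=1861):
    """Toy Paillier key pair; p and q are small demo primes, NOT secure sizes."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1                 # standard simple choice of generator
    mu = pow(lam, -1, n)      # modular inverse; valid because g = n + 1
    return (n, g), (lam, mu)


def encrypt(public, m, rng=random):
    n, g = public
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:  # the random factor must be invertible mod n
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def decrypt(public, private, c):
    n, _ = public
    lam, mu = private
    l_value = (pow(c, lam, n * n) - 1) // n
    return (l_value * mu) % n


def add_encrypted(public, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts (mod n)."""
    n, _ = public
    return (c1 * c2) % (n * n)


public, private = keygen()
c_sum = add_encrypted(public, encrypt(public, 12), encrypt(public, 30))
total = decrypt(public, private, c_sum)
```

In the setting described above, each data owner would encrypt its values with the service provider's public key, and the sum could be computed without any party seeing the individual inputs.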
In many cases, it is required that the data is published in its original form and is shared in the public domain. The data in such cases cannot be encrypted; however, there are certain measures that need to be taken so that the anonymization of the data is maintained. The data shall be protected from the security issues around identity theft and the associated frauds that may occur. Anonymization is one of the privacy preserving techniques that can be used with the unsupervised data mining tools and applications [22].
There are numerous methods defined under the anonymization techniques for unsupervised data mining, such as suppression, generalization, permutation, and swapping. The k-anonymity technique is one of the conventional approaches. More advanced mechanisms, such as t-closeness and km-anonymization, can also be used.
A quasi-identifier is a combination of person-specific parameters, such as age, name, or pin code. Removing such identifiers from the data sets cannot by itself assure protection from identity theft. However, the k-anonymization technique can be used when publishing data through data mining tools and applications. The fields involved are generalized, and a bottom-up technique can be used in the generalization process to group the specific identifiers involved [23]. Maintaining the trade-off between the privacy and the utility of the data sets is often complex in unsupervised data mining procedures, and a task-based technique can be used to balance it. A mining technique, such as clustering or association, is applied once the sensitive data has been properly hidden. However, the anonymization of the quasi-identifiers can lead to the loss of significant information, which can negatively affect the outcomes of the mining processes.
The k-anonymization technique can be used for optimal feature-set partitioning. Cluster analysis can be conducted with it to enhance the privacy and security of the overall data sets. A data reconstruction approach can also be used, in which k-anonymity protection is provided to data sets managed using clustering or association models; predictive data mining applications can benefit from such techniques. Mathematical and statistical approaches can also be used to preserve the security and privacy of the information sets. Condensation is one such approach: it constructs constrained clusters from the data sets and then generates pseudo data from the statistics of those clusters [24]. The technique can address classification problems effectively, and because it preserves the aggregate behaviour of the data sets, the individual records are protected while the data remains useful.
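Generalization and the k-anonymity condition can be sketched as follows. A table is k-anonymous when every combination of quasi-identifier values is shared by at least k records. The record layout (an `age`, a six-digit `pin`, and a sensitive `diagnosis` field), the band width, and the function names are hypothetical choices made for this example.

```python
from collections import Counter


def generalize(record, age_band=10, pin_digits=3):
    """Coarsen the quasi-identifiers: age becomes a band, pin code keeps only a prefix."""
    low = (record["age"] // age_band) * age_band
    masked = "*" * (len(record["pin"]) - pin_digits)
    return {
        "age": f"{low}-{low + age_band - 1}",
        "pin": record["pin"][:pin_digits] + masked,
        "diagnosis": record["diagnosis"],  # sensitive attribute is left untouched
    }


def is_k_anonymous(records, quasi_identifiers, k):
    """True when every quasi-identifier combination occurs in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


raw = [
    {"age": 23, "pin": "560001", "diagnosis": "flu"},
    {"age": 27, "pin": "560014", "diagnosis": "asthma"},
    {"age": 34, "pin": "560102", "diagnosis": "flu"},
    {"age": 38, "pin": "560117", "diagnosis": "diabetes"},
]
published = [generalize(r) for r in raw]
```

The raw records are all unique on age and pin code, whereas the generalized records fall into two groups of two. The coarser age bands and masked pin codes are also an instance of the information loss noted above, since a mining task that needs exact ages can no longer use them.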
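The condensation idea can be illustrated with a deliberately simplified, one-dimensional sketch: records are grouped into clusters of at least k members, only the group statistics are kept, and pseudo values are drawn from those statistics. This is not the published algorithm, which works on multi-dimensional records; the grouping by sorted order, the Gaussian sampling, and the name `condense` are assumptions made for this example.

```python
import random
from statistics import mean, pstdev


def condense(values, k, seed=0):
    """Replace each group of at least k values with pseudo values sharing its mean."""
    rng = random.Random(seed)
    ordered = sorted(values)
    groups = [ordered[i:i + k] for i in range(0, len(ordered), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()      # fold an undersized last group into its neighbour
        groups[-1].extend(tail)
    pseudo = []
    for group in groups:
        mu, sigma = mean(group), pstdev(group)
        noise = [rng.gauss(0, sigma) for _ in group]
        shift = mean(noise)      # re-centre so the group mean is preserved
        pseudo.extend(mu + e - shift for e in noise)
    return pseudo


# Hypothetical ages; the released list contains no original record.
ages = [21, 22, 25, 40, 41, 45, 47]
released = condense(ages, k=3)
```

The released values keep the mean of each group, and therefore the overall mean, while no individual value is published, which reflects the preservation of aggregate behaviour described above.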
There are also other techniques that can be followed to make sure that the preservation of the privacy is done at all times. The fuzzy algorithms can be used to make sure that anonymization is achieved and the relevant and critical information does not get lost in the process. The use of these algorithms can be done to amalgamate the varied sets of data in the form of clusters. These clusters that are developed do not have the similarity with the rest of the clusters and therefore, the identification of other clusters using the information associated with one cluster cannot be done [25]. The fuzzy algorithms can be easily used with the k-anonymization technique which can ensure that the privacy is preserved in the clustering mechanisms that are used for unsupervised learning and data mining.
The combination of different techniques and approaches shall be done so that the overall data security and privacy is preserved. It will also make sure that the dynamic nature of the data mining tools and applications is also handled using such an approach.
Recommendations
There are certain recommendations that are provided on the basis of the further research surveys and reviews that were conducted. There are additional privacy preserving methods and techniques that are identified and the same are recommended for use.
Neural networks are one of the techniques that can be used to enhance the overall privacy of data mining applications. The technique helps avoid data loss and maintain the privacy of the data sets, and peer-to-peer data mining can be carried out in a secure manner with the aid of neural networks. A hybrid model for data mining can also be adopted, in which supervised and unsupervised methods are combined to enhance overall security and privacy [26].
It is also recommended that the use of the basic security tools and controls is always maintained so that the overall privacy can be improved. For instance, the use of network-based intrusion detection and prevention systems, anti-malware systems, firewalls, etc. shall always be done so that the security concerns can be managed properly. The avoidance of the security issues in the beginning will ensure that the overall security attacks and privacy issues are prevented.
Conclusion
Data mining techniques are being used in a large number of business sectors and industrial areas. Security and privacy concerns with data mining can lead to resistance among end-users to making effective use of the applications. There are specific concerns associated with the unsupervised techniques used in mining applications, and these concerns must be addressed effectively. Privacy-preserving techniques should be used with unsupervised data mining mechanisms to make sure that overall privacy and security are ensured.
References
[1] P. Subhashini and P. R. B, “Confidential Data Identification Using Data Mining Techniques in Data Leakage Prevention System,” International Journal of Data Mining & Knowledge Management Process, vol. 5, no. 5, pp. 65–73, Sep. 2015.
[2] E. J. Wegman, “Special issue of statistical analysis and data mining,” Statistical Analysis and Data Mining, vol. 5, no. 3, pp. 177–177, May 2015.
[3] J. Clark and F. Provost, “Unsupervised dimensionality reduction versus supervised regularization for classification from sparse data,” Data Mining and Knowledge Discovery, vol. 33, no. 4, pp. 871–916, Feb. 2019.
[4] A. Zimmermann, “Method evaluation, parameterization, and result validation in unsupervised data mining: A critical survey,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Jul. 2019.
[5] S. Dua and P. Chowriappa, Data mining for bioinformatics. Boca Raton: CRC Press/Taylor & Francis Group, 2016.
[6] E. Padmalatha, C. R. K. Reddy, and P. Rani, “Mining Concept Drift from Data Streams by Unsupervised Learning,” International Journal of Computer Applications, vol. 117, no. 15, pp. 35–34, May 2015.
[7] B. Nath, D. K. Bhattacharyya, and A. Ghosh, “Incremental association rule mining: a survey,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 3, no. 3, pp. 157–169, Feb. 2015.
[8] S. Pandey, “Multilevel Association Rules in Data Mining,” Journal of Advances and Scholarly Researches in Allied Education, vol. 15, no. 5, pp. 74–78, Jul. 2018.
[9] U. Sreelekshmi, “A Survey on Feature Extraction Techniques for Image Retrieval using Data Mining & Image Processing Techniques,” International Journal Of Engineering And Computer Science, Nov. 2016.
[10] Y. Han, “Extraction and Mining of Video Feature in Sport Videos,” International Journal of Performability Engineering, 2018.
[11] A. Derntl and C. Plant, “Clustering techniques for neuroimaging applications,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 6, no. 1, pp. 22–36, Dec. 2015.
[12] R. Mendes and J. P. Vilela, “Privacy-Preserving Data Mining: Methods, Metrics, and Applications,” IEEE Access, vol. 5, pp. 10562–10582, 2017.
[13] S. Nathiya, C. Kuyin, and J. D. Sundari, “Providing Multi Security In Privacy Preserving Data Mining,” International Journal Of Engineering And Computer Science, Jan. 2016.
[14] “Multi-Level Trust Privacy Preserving Data Mining to Enhance Data Security and Prevent Leakage of the Sensitive Data,” Bonfring International Journal of Industrial Engineering and Management Science, vol. 7, no. 2, pp. 21–25, May 2017.
[15] S. Dua and X. Du, Data mining and machine learning in cybersecurity. Boca Raton, Fla.: CRC Press, 2015.
[16] M. K. Gupta and P. Chandra, “A comprehensive survey of data mining,” International Journal of Information Technology, Feb. 2020.
[17] R. A. Kautkar, “A Comprehensive Survey on Data Mining,” International Journal of Research in Engineering and Technology, vol. 03, no. 08, pp. 185–191, Aug. 2014.
[18] A. Bhardwaj and R. Gupta, “Financial Frauds: Data Mining based Detection – A Comprehensive Survey,” International Journal of Computer Applications, vol. 156, no. 10, pp. 20–28, Dec. 2016.
[19] A. Naik and N. Naik, “Prognosis of Heart Disease using Data Mining Techniques: A Comprehensive Survey,” International Journal of Computer Applications, vol. 181, no. 17, pp. 14–18, Sep. 2018.
[20] N. R. Nanavati and D. C. Jinwala, “A novel privacy-preserving scheme for collaborative frequent itemset mining across vertically partitioned data,” Security and Communication Networks, vol. 8, no. 18, pp. 4407–4420, Oct. 2015.
[21] P. Wang, T. Chen, and Z. Wang, “Research on Privacy Preserving Data Mining,” Journal of Information Hiding and Privacy Protection, vol. 1, no. 2, pp. 61–68, 2019.
[22] S. Reddi, “Privacy Preserving Data Mining Using Time Series Data Aggregation,” International Journal of Strategic Information Technology and Applications, vol. 8, no. 4, pp. 1–15, Oct. 2017.
[23] B. Fabian and T. Göthling, “Privacy-preserving data warehousing,” International Journal of Business Intelligence and Data Mining, vol. 10, no. 4, p. 297, 2015.
[24] J. Vaidya, C. W. Clifton, and M. Zhu, Privacy preserving data mining. New York; London: Springer, 2015.
[25] S. Fletcher and M. Z. Islam, “Measuring Information Quality for Privacy Preserving Data Mining,” International Journal of Computer Theory and Engineering, vol. 7, no. 1, pp. 21–28, Feb. 2015.
[26] B. Custers, T. Calders, B. Schermer, and T. Zarsky, Discrimination and Privacy in the Information Society: Data Mining and Profiling in Large Databases. Berlin: Springer, 2015.