Technological advances have made the classification of high-dimensional data a widespread challenge. Examples include bag-of-words text representations built on vast dictionaries, gene expression classification, and multimedia classification. Compared with the low-dimensional settings that traditional methods were designed for, high dimensionality poses serious mathematical challenges in terms of space complexity and computation time. There are two main problems in high-dimensional data processing. The first is the curse of dimensionality (Ando, 2005): increasing the number of dimensions can cause explosive growth in memory use and computation time. The second is an ineffective similarity measure. Euclidean distance, which is commonly used as a similarity measure between data points, is meaningful in spaces of roughly 2 to 10 dimensions (Cai, 2007). In high-dimensional spaces, however, Euclidean distances between data points are no longer comparable because the data are sparse, so the accuracy of the measure falls as the number of dimensions grows.
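As a rough illustration of the similarity problem, the following Python sketch measures how the relative contrast between the farthest and nearest neighbor of a query point shrinks as the dimension grows. The function name `relative_contrast`, the uniform random data, and the sample size are assumptions made for this illustration; they do not come from the studies cited above, and dense uniform data is only a stand-in for the sparse data those studies discuss.

```python
import numpy as np

def relative_contrast(dim, n_points=500, seed=0):
    """Return (max - min) / min over Euclidean distances from one query point.

    Assumption for illustration: points and query are drawn uniformly
    from the unit hypercube [0, 1]^dim.
    """
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# A large value means near and far points are easy to tell apart;
# a value close to zero means all points look almost equally distant.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

When the contrast approaches zero, the nearest and farthest points are almost equally far from the query, which is one way to see why a Euclidean similarity measure loses discriminative power in very high dimensions.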
The literature identifies two main techniques for dealing with high-dimensional data: feature selection and dimensionality reduction (Zhu, 2005). Feature selection retains the most useful features and discards irrelevant ones; current studies of feature selection focus on evaluation criteria and search strategies. Dimensionality reduction obtains a low-dimensional embedding of the high-dimensional data, and its methods fall into two categories, linear and nonlinear. Common nonlinear techniques include Locally Linear Embedding, Laplacian Eigenmaps, Isometric Mapping (Isomap), and multidimensional scaling. The linear techniques include Independent Component Analysis, Principal Component Analysis, and Linear Discriminant Analysis (Tenenbaum, et al. 2000). Both approaches share the same goal: to represent the original data with fewer, more significant and discriminative features.
Rather than relying on these two techniques, this paper proposes a different way to classify high-dimensional data. It considers the semi-supervised setting, for two reasons. First, technological advances have made data acquisition easy, so the data sets being collected keep growing in both size and dimensionality (Raina, 2007); this study refers to such data as Big Data. Second, in many application domains labeled data is scarce and expensive, whereas unlabeled data is plentiful and cheap (Grandvalet, 2004). Purely supervised learning is therefore poorly suited to this study. Semi-supervised learning, which uses a large amount of unlabeled data to help improve classification performance, is appropriate because the unlabeled data it relies on is abundant and inexpensive to obtain.
To address the challenges of high-dimensional data processing described above, the paper draws on two main ideas. The first is Anchor Graph, a subset-based method for large-scale graph construction that uses a small subset of data points, called anchors, to build the graph for the entire data set (Basu, et al. 2004). The Anchor Graph method is applied to build a graph in a randomly selected feature space, and semi-supervised inference is then carried out on that graph. The second idea is the Random Forest method, which can handle high-dimensional data without dimensionality reduction or feature selection. It generalizes well and, because of its randomness, is not prone to overfitting (Criminisi, et al. 2012). The paper adopts the same idea to introduce randomness into graphs by randomly choosing a subset of features from which to construct each graph. The feature subset is normally far smaller than the original dimension, so in the chosen feature space a similarity measure based on Euclidean distance remains meaningful. Combining the idea of Random Forests with semi-supervised learning based on the Anchor Graph yields a new semi-supervised framework, Random Multi-Graphs, for data that is both large-scale and high-dimensional (Chawla, 2004). The framework randomly chooses a subset of features and uses the Anchor Graph to construct a graph. This process is repeated to obtain multiple graphs, which can be built in parallel for runtime efficiency, and the graphs then vote to determine the labels of the unlabeled data.
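The procedure described above can be sketched in Python roughly as follows. This is a simplified illustration under stated assumptions, not the method's actual implementation: anchors are sampled at random, anchor weights use a Gaussian kernel with a heuristic bandwidth, and the function names (`anchor_graph_scores`, `random_multi_graph_predict`), parameter values, and synthetic data are all hypothetical.

```python
import numpy as np

def anchor_graph_scores(X, y_labeled, labeled_idx, n_classes, rng,
                        n_anchors=20, s=3, gamma=0.01):
    """Simplified anchor-graph inference; returns an (n, n_classes) score matrix."""
    n = X.shape[0]
    # Assumption: anchors are data points sampled at random.
    anchors = X[rng.choice(n, size=n_anchors, replace=False)]
    # Squared Euclidean distance from every point to every anchor.
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :s]            # s closest anchors per point
    near_d2 = np.take_along_axis(d2, nearest, axis=1)
    sigma2 = near_d2.mean() + 1e-12                    # heuristic kernel bandwidth
    w = np.exp(-near_d2 / sigma2)
    w /= w.sum(axis=1, keepdims=True)                  # each row of Z sums to 1
    Z = np.zeros((n, n_anchors))
    np.put_along_axis(Z, nearest, w, axis=1)
    # Graph Laplacian reduced to the anchors: Z^T Z - Z^T Z Lambda^{-1} Z^T Z.
    ZtZ = Z.T @ Z
    lam = Z.sum(axis=0) + 1e-12
    L_red = ZtZ - ZtZ @ np.diag(1.0 / lam) @ ZtZ
    # Regularized least squares for anchor labels, then propagate to all points.
    Zl = Z[labeled_idx]
    Y = np.eye(n_classes)[y_labeled]
    lhs = Zl.T @ Zl + gamma * L_red + 1e-8 * np.eye(n_anchors)
    A = np.linalg.solve(lhs, Zl.T @ Y)
    return Z @ A

def random_multi_graph_predict(X, y_labeled, labeled_idx, n_classes,
                               n_graphs=15, n_features=10, seed=0):
    """Build one graph per random feature subset and combine them by majority vote."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    votes = np.zeros((n, n_classes))
    for _ in range(n_graphs):  # graphs are independent, so this loop could run in parallel
        feats = rng.choice(X.shape[1], size=n_features, replace=False)
        scores = anchor_graph_scores(X[:, feats], y_labeled, labeled_idx,
                                     n_classes, rng)
        votes[np.arange(n), scores.argmax(axis=1)] += 1
    return votes.argmax(axis=1)

# Synthetic example (an assumption): two Gaussian classes in 200 dimensions,
# with only five labeled points per class.
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(0.0, 1.0, (100, 200)),
               data_rng.normal(2.0, 1.0, (100, 200))])
y_true = np.repeat([0, 1], 100)
labeled_idx = np.concatenate([np.arange(5), np.arange(100, 105)])
pred = random_multi_graph_predict(X, y_true[labeled_idx], labeled_idx, n_classes=2)
print("accuracy:", (pred == y_true).mean())
```

Each graph sees only 10 of the 200 features, so its Euclidean distances are computed in a low-dimensional space, and the majority vote across graphs plays the same role as the vote across trees in a Random Forest.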
The paper evaluates the proposed method on eight real-world data sets, comparing it against two traditional graph-based techniques and one state-of-the-art semi-supervised learning method based on the Anchor Graph to demonstrate its effectiveness (Belkin, 2004). It also presents an analysis of the data as applied to face recognition from images.
Ando, R. K., & Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov), 1817-1853.
Basu, S., Bilenko, M., & Mooney, R. J. (2004, August). A probabilistic framework for semi-supervised clustering. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 59-68). ACM.
Belkin, M., & Niyogi, P. (2004). Semi-supervised learning on Riemannian manifolds. Machine Learning, 56(1-3), 209-239.
Cai, D., He, X., & Han, J. (2007, October). Semi-supervised discriminant analysis. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on (pp. 1-7). IEEE.
Chawla, N. V., Japkowicz, N., & Kotcz, A. (2004). Editorial: special issue on learning from imbalanced data sets. ACM Sigkdd Explorations Newsletter, 6(1), 1-6.
Criminisi, A., Shotton, J., & Konukoglu, E. (2012). Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision, 7(2-3), 81-227.
Grandvalet, Y., & Bengio, Y. (2004, December). Semi-supervised Learning by Entropy Minimization. In NIPS (Vol. 17, pp. 529-536).
Raina, R., Battle, A., Lee, H., Packer, B., & Ng, A. Y. (2007, June). Self-taught learning: transfer learning from unlabeled data. In Proceedings of the 24th international conference on Machine learning (pp. 759-766). ACM.
Tenenbaum, J. B., De Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319-2323.
Zhu, X. (2005). Semi-supervised learning literature survey.
Cite this page
A Semi-supervised Learning Framework for Classification of High Dimensional Data. (2021, Jun 22). Retrieved from https://proessays.net/essays/a-semi-supervised-learning-framework-for-classification-of-high-dimensional-data