A Scalable Data Mining Framework for Knowledge Discovery Using Distributed Big Data Analytics in Heterogeneous Systems
Keywords:
Big Data Analytics, Distributed Data Mining, Knowledge Discovery, Heterogeneous Systems, Machine Learning, Apache Spark, Hadoop, Scalability, Distributed Computing.
Abstract
The rapid growth of structured and unstructured data generated by cloud platforms, IoT devices, distributed networks, and enterprise systems has made big data analytics a critical area of research. Conventional data mining methods struggle with big data because of their limited scalability, high computational cost, latency, and inefficient use of resources in distributed settings. These issues call for scalable and efficient frameworks that can handle large volumes of heterogeneous data and support accurate knowledge discovery. This study presents a scalable distributed data mining framework designed to improve knowledge discovery through big data analytics in heterogeneous systems. The framework combines Apache Hadoop and Apache Spark to provide efficient distributed storage, parallel processing, and in-memory computing. The proposed model incorporates machine learning and data mining algorithms, namely Random Forest, Decision Tree, K-Means clustering, and FP-Growth, to perform classification, clustering, and pattern extraction on distributed datasets. The framework is evaluated using machine learning and distributed-system performance metrics: accuracy, precision, recall, F1-score, execution time, scalability, throughput, and resource utilization. Experimental results show that the proposed framework substantially improves processing efficiency, scalability, and predictive performance compared with traditional centralized systems and Hadoop-based systems, and that it provides efficient and reliable knowledge discovery in large-scale heterogeneous settings.
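
To illustrate the predictive metrics named above, the following minimal Python sketch computes accuracy, precision, recall, and F1-score from true and predicted labels using their standard definitions. The function name, the binary 0/1 label encoding, and the sample labels are assumptions made for illustration only; they are not part of the proposed framework, and the sketch does not reproduce any reported result.

```python
from typing import Dict, Sequence


def binary_classification_metrics(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for binary 0/1 labels.

    Hypothetical helper: label 1 is treated as the positive class.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four confusion-matrix cells.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    total = len(y_true)
    # Guard each ratio against a zero denominator.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Made-up labels, used only to exercise the function.
    sample_true = [1, 0, 1, 1, 0, 0, 1, 0]
    sample_pred = [1, 0, 0, 1, 0, 1, 1, 0]
    print(binary_classification_metrics(sample_true, sample_pred))
```

In a distributed setting the same four counts could be accumulated per partition and summed before the ratios are taken, since each count is additive across partitions.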




