Privacy-Preserving Mining Using a Data Encryption Scheme for the Hadoop Ecosystem
Authors: Sonal Jain, Mohit Jain
Abstract
Nowadays, an explosive amount of data is generated every day. Data from sensors, mobile devices, social networking websites, scientific experiments, and enterprises all contribute to this huge growth. The scale can be grasped from the fact that a large share of the world's data has been created in just the last two years. Big Data, as these large volumes of data are generally called, has become one of the hottest research trends today. Research suggests that tapping the potential of this data can benefit businesses, scientific disciplines, and the public sector, contributing to their economic gains as well as development in every sphere. Security is an essential feature for keeping information safe from unwanted and unintended access. A study of existing work concludes that HDFS does not provide any built-in security framework or algorithm to keep data safe and secure. This work proposes a solution that encrypts large data sets before they are placed into HDFS, so that the stored data remains safe and secure.
Introduction
Big Data is the aggregation of bulk quantities of data, and that data can be in any form, structured or unstructured. It is widely popular in several fields because it accommodates relational and non-relational, structured and unstructured data. For big organizations it is an opportunity to enhance business development. Data is generated in large amounts through communication and transmission, and this data needs to be processed with data mining algorithms. Big Data is commonly characterized by three V's: Volume, Variety, and Velocity. The need is to develop efficient systems that exploit this potential to the maximum, keeping in mind the current challenges associated with analysis, structure, scale, timeliness, and privacy. There has also been a shift in the architecture of data-processing systems, from centralized to distributed architectures.

Big Data research invariably encounters Hadoop. Hadoop is designed to process large amounts of data, regardless of its structure. The core of Hadoop is the MapReduce framework, created by Google to solve the problem of building web search indexes. The nonprofit Apache Software Foundation (ASF) [2] maintains and manages the Hadoop framework and its surrounding technologies. Frameworks such as MongoDB (a NoSQL database), Pig, and many others have been introduced into the big data environment to manage massive amounts of sensitive data at any given time.

Several technologies related to Hadoop [3] include HDFS, which provides the distributed file system. The Hive component was developed to support data warehouse applications on the Hadoop server. MapReduce is the programming model of Hadoop. Pig offers a query language for Hadoop that is similar in spirit to SQL, although SQL targets relational databases. Sqoop provides connectivity for uploading data into HDFS and Hive from relational databases such as MySQL. Several other technologies have been developed in the Hadoop environment for working with Big Data, as the sketch below illustrates.
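Since HDFS itself stores file blocks in the clear, the approach proposed in this paper is to encrypt data on the client side before it is written to the cluster. The Java sketch below is a minimal illustration of that general idea only, not the exact scheme developed in this paper; the key handling, the AES-128/CBC cipher choice, and the file paths data.csv and /secure/data.enc are illustrative assumptions.

import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.CipherOutputStream;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EncryptedHdfsUpload {
    public static void main(String[] args) throws Exception {
        // Generate a 128-bit AES key. In a real deployment the key would
        // be obtained from a key management service, not created per run.
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);
        SecretKey key = keyGen.generateKey();

        // Random IV for CBC mode; it is written ahead of the ciphertext
        // so a decrypting reader can recover it later.
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);

        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));

        // Connect to the cluster (fs.defaultFS from the Hadoop config).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        try (InputStream local = Files.newInputStream(Paths.get("data.csv"));
             OutputStream hdfsOut = fs.create(new Path("/secure/data.enc"))) {
            hdfsOut.write(iv); // prepend the IV in the clear
            try (OutputStream enc = new CipherOutputStream(hdfsOut, cipher)) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = local.read(buf)) != -1) {
                    enc.write(buf, 0, n); // ciphertext streams straight to HDFS
                }
            }
        }
        fs.close();
    }
}

Prepending the IV is a common convention: a reader first consumes the 16 IV bytes from the HDFS stream, initializes a Cipher in DECRYPT_MODE with the same key, and wraps the remainder in a CipherInputStream. Because only ciphertext ever reaches the DataNodes, the data stays protected even if HDFS block files are accessed directly.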
Copyright
Copyright © 2025 Sonal Jain, Mohit Jain. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.