A Survey on Big Data Analytics Using HADOOP
Author: S. Mamatha and T. Sudha
Volume 8, No. 3, Special Issue: June 2019, pp. 35-40
In today's digital world, organizations increasingly treat data as a core asset, and both the volume of data and the size of databases are growing exponentially. Data is generated from many sources, such as business processes, transactions, social networking sites, and web servers, and exists in both structured and unstructured form. The term “Big data” refers to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. Big data sets range in size from a few dozen terabytes to many petabytes in a single data set. The difficulties include capture, storage, search, sharing, analytics, and visualization. Big data comes in structured, semi-structured, and unstructured formats, and relational databases fail to store this multi-structured data. Apache Hadoop is an efficient, robust, reliable, and scalable framework to store, process, transform, and extract big data. The Hadoop framework is free, open-source software available from the Apache Software Foundation. In this paper we present Hadoop, HDFS, MapReduce, and a c-means big data algorithm to minimize the effort of big data analysis using MapReduce code. The objective of this paper is to summarize the state-of-the-art efforts in clinical big data analytics and to highlight what might be needed to enhance the outcomes of clinical big data analytics tools and related fields.
Keywords: Big Data, Mining, Heterogeneity, HDFS, MapReduce, Hadoop, Cluster, NameNode, DataNode