Trends and Platforms of Big Data technology and Business Analytics


Most research shows that the data generated in the modern world is huge and growing fast. Big data and business analytics are therefore major trends with a positive impact on the business world. This data includes both structured and unstructured data flowing into organizations daily. Unstructured data cannot be managed in a traditional Relational Database Management System (RDBMS), because it includes text files, web content, social media posts, emails, images, audio, video and so on (the majority of digital data). This data proliferation requires new techniques for capturing, storing and processing big data, which draws attention to the various platforms, applications and trends of big data technology and business analytics.

Recent Developments in Big Data Technology

Big Data emerged alongside the growth of social media and blogs. Some real-world examples:

  • Walmart processes more than a million customer transactions hourly and stores 2.5 petabytes of customer data.
  • The Library of Congress collects 235 terabytes of new data per year and stores around 60 petabytes of data in total.
  • McKinsey estimates that around 30 billion pieces of content are uploaded to Facebook each month, and that big data could be worth about $300 billion to the healthcare industry.

All of this growth is driven by technological change, e-commerce activity, business operations, manufacturing and healthcare systems. The surge in data volume is enabled by a number of technologies:

Distributed computing: Large-scale distributed systems provide direct access and long-term storage for petabytes of data while delivering extreme performance.

Flash memory: Flash delivers random access times of less than 0.1 milliseconds, compared with the 3 to 12 milliseconds typical of hard disks. Future big data solutions are therefore likely to use flash memory heavily to improve data access times.

Cloud computing: The cloud has created a new economics of computing by moving storage, databases and services off premises, and it offers a fast path for deploying big data solutions.

Data analytics: A multistage approach that covers collecting, preparing, processing, analyzing and visualizing large-scale data to produce actionable insight for business intelligence.

In-memory applications: Keeping working data in RAM rather than on disk significantly increases database performance.
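The multistage data analytics approach described above can be sketched in miniature. This is an illustrative, single-machine example only; the records, field names and stages are hypothetical stand-ins for the collection, preparation and analysis steps an organization would run at scale.

```python
# A minimal sketch of the multistage analytics pipeline: collect raw data,
# prepare (clean) it, then analyze it into an actionable summary.
# All records and field names here are hypothetical.

from statistics import mean

def collect():
    # Stage 1: gather raw records (hard-coded here; in practice from logs,
    # transactions, sensors or social media feeds)
    return [
        {"region": "north", "amount": "120.5"},
        {"region": "south", "amount": "80.0"},
        {"region": "north", "amount": None},      # dirty record
        {"region": "south", "amount": "95.25"},
    ]

def prepare(records):
    # Stage 2: clean and convert types, dropping unusable rows
    return [
        {"region": r["region"], "amount": float(r["amount"])}
        for r in records if r["amount"] is not None
    ]

def analyze(records):
    # Stage 3: aggregate into an actionable summary (average amount per region)
    by_region = {}
    for r in records:
        by_region.setdefault(r["region"], []).append(r["amount"])
    return {region: mean(vals) for region, vals in by_region.items()}

summary = analyze(prepare(collect()))
print(summary)  # {'north': 120.5, 'south': 87.625}
```

The same collect-prepare-analyze shape recurs in real platforms, only distributed across many machines and far larger datasets.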


Consider an example: a solution could be developed to tie a customer's or merchant's bank verification number (BVN) and subscriber identification module (SIM) registration details to a unique digital identity. The solution would use this unique digital identification number (ID) and stream mobile payment transaction data from the mobile device into a big data repository. Monitoring the collected data with standard machine learning techniques could then reveal fraudulent or false payment alerts sent from a customer to a merchant.

In such a case, a warning alert can be shared with the mobile operators and the merchant's bank, possibly even before the merchant releases the product. On the mobile operator's side, the SIM registration record and Global Positioning System (GPS) technology can be used to build the customer's crime chart and alert the police to arrest the offender. At the other end, an intelligent agent running in the bank's application would warn the merchant to ignore the transaction request.
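The monitoring step in this scenario could be as simple as a statistical screen over each customer's transaction history. The sketch below is a hedged illustration, not the system described above: the threshold, field values and rule are invented for demonstration, and a production system would use proper machine learning models over the streamed data.

```python
# Illustrative fraud screen: flag a payment whose amount deviates from the
# customer's transaction history by more than `threshold` standard deviations.
# Threshold and sample amounts are hypothetical.

from statistics import mean, stdev

def is_suspicious(history, amount, threshold=3.0):
    """Return True if `amount` is a statistical outlier for this customer."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return amount != mu
    return abs(amount - mu) / sigma > threshold

history = [50.0, 45.0, 55.0, 48.0, 52.0]   # past payments tied to one digital ID
print(is_suspicious(history, 51.0))    # False: typical payment passes
print(is_suspicious(history, 500.0))   # True: outlier triggers a warning alert
```

When the check fires, the alert would be routed to the mobile operator and the merchant's bank as the scenario describes.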

Big Data Analytics Platforms

Big data technology includes a large number of open source software components, many of them Apache projects, available for constructing a big data platform. This software is designed to work in distributed and cloud computing environments. Computer scientists have faced several common problems in designing efficient and effective big data computing platforms: how to move large volumes of transactional data; how to move large volumes of static data across the network; how to process large volumes of data very fast; how to ensure even job scheduling and fair usage of resources; how to keep errors from interrupting other jobs; and how to coordinate and optimize resources.

To address these problems, Hadoop was designed as an open source framework that handles big data analytics through a batch processing approach. Hadoop's guiding principles include: reduced dependence on expensive high-end hardware and infrastructure; parallel processing to cut computing time; moving computation to the data instead of moving data from disk to a central application; embracing failure; and building applications that depend less on the underlying infrastructure. These principles help reduce cost, optimize the platform, speed up processing and achieve efficiency. In particular, the Hadoop ecosystem enables big data and business analytics through its structure, components and tools.
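The batch model at the heart of Hadoop can be illustrated with a toy, single-process word count. This is only a conceptual sketch: real Hadoop distributes the map, shuffle and reduce phases across cluster nodes and reads input from HDFS, whereas here everything runs in one Python process over hard-coded lines.

```python
# Toy illustration of Hadoop's batch MapReduce model: map emits (key, value)
# pairs, a shuffle groups them by key, and reduce aggregates each group.

from collections import defaultdict

def map_phase(line):
    # Mapper: emit (word, 1) for every word in an input line
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group intermediate values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: aggregate each key's values into a final count
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data needs big platforms", "data flows into big repositories"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

Because each mapper works on its own slice of input and each reducer on its own key group, the phases parallelize naturally, which is why the pattern suits commodity clusters.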

Big Data Analytics Tools

The huge volume of big data has made traditional data analysis approaches ineffective for processing the amounts of data generated today. Various big data tools have therefore been proposed and implemented for efficient generation, transmission, processing, storage and analysis of big data. These tools are continuously updated, and new tools are introduced to the industry on a regular basis.

  1. MapReduce: A Hadoop distributed programming framework responsible for job scheduling and management. It has two parts: the Mapper filters and transforms the data, while the Reducer aggregates the key-sorted intermediate output into a final result.
  2. Hive: A SQL-like interface to Hadoop, originated at Facebook.
  3. Pig: A scripting layer originally created at Yahoo; Pig converts its scripts into MapReduce jobs.
  4. Flume: Uses agents to move large amounts of data into and out of Hadoop. Flume is well suited to gathering weblogs from multiple sources; its agents connect to one another, and it scales across many machines.
  5. Sqoop: A tool for moving data into and out of an RDBMS. The name combines SQL and Hadoop, as it imports and exports data between any RDBMS and the Hadoop Distributed File System (HDFS).
  6. Apache Spark: An open source computing framework that runs on top of HDFS and can use YARN.
  7. Oozie: A workflow and coordination tool used in a Hadoop cluster.
  8. HBase: A popular NoSQL columnar database deployed on top of Hadoop.
  9. Mahout: A scalable, simple and extensible machine learning library, written for Java and Scala, for building distributed learning algorithms on Hadoop.
  10. MLlib: An open source machine learning library native to Apache Spark.
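To make the Sqoop entry above concrete, the sketch below shows what an RDBMS-to-flat-file export does conceptually, using only the Python standard library. This is not Sqoop itself: the table, columns and in-memory SQLite database are invented stand-ins, and real Sqoop runs parallel MapReduce jobs that write directly to HDFS.

```python
# Conceptual sketch of a Sqoop-style export: read rows from an RDBMS and
# write them as delimited text, the flat-file format a Hadoop job would
# consume from HDFS. Table and column names are hypothetical.

import csv
import io
import sqlite3

# Source RDBMS: an in-memory SQLite database standing in for MySQL/Oracle
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace")])

# "Export": dump each row as comma-separated text
out = io.StringIO()
writer = csv.writer(out)
for row in conn.execute("SELECT id, name FROM customers ORDER BY id"):
    writer.writerow(row)

print(out.getvalue())
```

The reverse direction (import) simply parses such delimited files back into `INSERT` statements, which is why the SQL-plus-Hadoop pairing gives the tool its name.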

