Problems with traditional large-scale systems in big data

A Taxonomy and Survey of Stream Processing Systems

Xinwei Zhao, ... Rajkumar Buyya, in Software Architecture for Big Data and the Cloud, 2017

11.6 Conclusions and Future Directions

Big data problems have changed the way data is processed and managed over time. Today, data poses a challenge not only in terms of volume but also in terms of the high speed at which it is generated. Data quality and validity vary from source to source, which makes the data difficult to process. These issues have led to the development of several stream processing engines/platforms by companies such as Yahoo and LinkedIn. Besides better latency, stream processing overcomes another shortcoming of batch data processing systems, i.e., scaling with high “velocity” data. The availability of several platforms has also created a new challenge for user organizations: selecting the most appropriate stream processing platform for their needs. In this chapter, we proposed a taxonomy that facilitates the comparison of the features offered by stream processing platforms. Based on this taxonomy we presented a survey and comparison of five open source stream processing engines. Our comparison study provides insight into how to select the best platform for a given use case.

From the comparison of different open source stream processing platforms based on our proposed taxonomy, we observed that each platform offers specific features that make its architecture unique. However, some features make a stream processing platform more suitable for particular scenarios. For example, if an organization's data volume changes dynamically, it is better to choose a platform such as Storm, which allows dynamic addition of nodes, rather than Yahoo! S4. Similarly, if an organization wants to process all the data that is ingested into the system, guaranteed data processing is the feature it should look for. In contrast to commercial offerings, organizations can save on licensing fees by using open-source platforms. However, the support available for maintaining such platforms becomes an important factor when deciding whether to adopt a particular platform. The user base and support for each platform vary quite drastically: Storm has the largest user base and the most extensive support services, whereas Yahoo! S4 comes with almost no support.

The survey also makes clear that the performance of a stream processing system depends on multiple factors. However, performance will always be limited by the capacity of the underlying cluster environment in which the actual processing is done. Almost none of the surveyed systems can use cloud computing resources that scale up and down according to the volume and velocity of the data to be processed. Moreover, the job scheduling mechanisms used by these systems are not very sophisticated and do not take into consideration the performance of the underlying infrastructure, which can be quite heterogeneous in some cases. In the future, we would like to conduct a cost and risk analysis of different streaming platforms and carry out a more extensive comparison study. The current taxonomy was derived from studying open-source stream processing platforms, which limits its scope. To overcome this limitation, we would also study key commercially available stream processing platforms such as IBM Streams.

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B9780128054673000119

Big Data Integration

April Reeve, in Managing Data in Motion, 2013

Traditional big data use cases

Certain big data problems and use cases are common to every organization, such as e-mail, contracts and documents, web data, and social media data. In addition, almost every industry has specific cases where they have big data management problems to solve. Many industries have always had to deal with huge volumes and varied types of data. Organizations in the telecommunications industry must keep track of the huge network of nodes through which communications can pass and the actual activity and history of connections. Finance has to process both the history of prices of financial products and the detailed history of financial transactions. Organizations in airplane manufacturing and operation must track the history of every part and screw of every airplane and vehicle planned and operated. Publishing organizations must track all the components of documents through the development versions and production process. Interestingly, pharmaceutical firms have similar strict document management requirements for their drug submissions to the FDA in the United States, and comparable requirements in other countries; thus, for pharmaceutical firms advanced document management capabilities are a core competency and a traditional “big data” problem.

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B9780123971678000212

A Deep Dive into NoSQL Databases: The Use Cases and Applications

Mohammad Samadi Gharajeh, in Advances in Computers, 2018

4.2 Big Data Problems in Bioinformatics

The main categories of big data problems in bioinformatics comprise the seven areas shown in Fig. 9 [6]. They are described in more detail below.

Fig. 9. Major categories of big data problems in bioinformatics.

Microarray data analysis: Because of falling operational costs and the widespread use of microarray experiments, the size and number of microarray datasets are increasing quickly. Besides, microarray experiments are carried out over gene-sample-time space to capture notable changes over time or across the stages of a disease. Fast construction of coexpression and regulatory networks from such voluminous microarray data requires big data technologies.

Gene–gene network analysis: Gene coexpression network analysis predicts relations among genes in gene–gene networks derived from gene expression analysis. Differential coexpression analysis looks for changes in gene complexes over time or across the stages of a disease, which helps reveal correlations between gene complexes and traits of interest. Consequently, gene coexpression network analysis is a complex and highly iterative problem that needs large-scale data analytics systems [a toy sketch of the underlying coexpression computation appears after this list].

PPI data analysis: PPI complexes and their changes carry rich information about various diseases. PPI networks are studied across many domains of the life sciences, and these studies produce voluminous data. The volume, velocity, and variety of the data make PPI complex analytics a genuine big data problem. Hence, an efficient and scalable architecture is needed to support fast and accurate PPI complex generation, validation, and rank aggregation.

Sequence analysis: RNA sequencing technology is regarded as a strong successor to microarray technology because it yields more accurate and quantitative gene expression measurements. RNA sequence data carry additional information that calls for sophisticated machine-learning models. Therefore, big data technologies can be applied to identify mutations, allele-specific expression, and exogenous RNA content [e.g., viruses].

Evolutionary research: Recent advances in molecular biological technologies have become a notable source of big data. Various projects at the microbial level [e.g., genome sequencing and microarrays] have generated huge amounts of data. Bioinformatics is an important platform for analyzing and extracting this essential information. Understanding functional trends in adaptation and evolution through microbial research is an important big data problem in bioinformatics.

Pathway analysis: Pathway analysis associates genetic products with phenotypes of interest. This association is used to estimate gene function, identify biomarkers and traits, and categorize patients and samples. Association analysis over huge volumes of data is made possible by growing genetic, genomic, metabolomic, and proteomic datasets together with big data technologies.

Disease network analysis: Large disease networks are built for various species [e.g., human]. The complexity and size of these networks grow over time, and new networks are added from different sources, each in its own format. Sophisticated networks of molecular phenotypes make it possible to estimate candidate genes or mechanisms for disease-associated traits. Efficient integration methods are needed to evaluate multiple, heterogeneous omics databases.
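To make the coexpression idea above concrete, the following toy sketch [not taken from the chapter] builds a small coexpression network by thresholding pairwise Pearson correlations over a synthetic expression matrix; real analyses operate on far larger gene-sample matrices and repeat this across time points or disease stages, which is what pushes them into big data territory.

```python
import numpy as np

# Synthetic expression matrix: rows = genes, columns = samples.
rng = np.random.default_rng(0)
expression = rng.normal(size=(200, 30))  # 200 genes, 30 samples

# Pairwise Pearson correlation between gene expression profiles.
corr = np.corrcoef(expression)           # shape: (200, 200)

# Toy coexpression "network": connect gene pairs whose absolute
# correlation exceeds a chosen threshold, ignoring self-correlation.
threshold = 0.6
adjacency = (np.abs(corr) > threshold) & ~np.eye(corr.shape[0], dtype=bool)

edges = np.argwhere(np.triu(adjacency, k=1))
print(f"{len(edges)} coexpression edges at |r| > {threshold}")
```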

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/S0065245817300335

Big Data, Hadoop, and Cloud Computing

Nauman Sheikh, in Implementing Analytics, 2013

Hadoop as an Analytical Engine

The other option for Hadoop within an analytics solution is where the entire data mining algorithm is implemented within the Hadoop programming environment, so that Hadoop acts as the data mining engine shown in Figure 11.4. This is used when there is no option of reducing, aggregating, or sampling the data to eliminate the V’s from the Big Data characteristics. This becomes quite complex because the performance variables, the heart of innovation within analytics modeling, cannot easily be added to the data set without creating an additional storage layer. However, various problems actually require running over the entire Big Data set, looking for specific trends or performing fuzzy searches or correlations between variables through Hadoop and its data access programs.

Figure 11.4. Hadoop as an analytical engine.

Outside of analytics—as defined in the first chapter—Hadoop can be used as a generic data processing engine that can search and look for patterns and correlations, almost like a reporting tool of sorts. It should be treated as a commodity data processing layer that can be applied to any overwhelming data size problem.

Big Data and Hadoop—Working Example

Let’s take a Big Data problem from the retail shopping sector. A large brick-and-mortar retailer with thousands of square feet of floor space wants to understand the following three things.

1. How is the floor placement of promotions and products attracting people, i.e., how many people stop and how many ignore them and pass by?

2. How many people actually enter the store, broken down by entrance and time of day, and how does that correlate with sales?

3. What is the probability that a customer stopping at a display will buy the product?

In order to answer these questions, it is evident that a new form of data is required. The traditional structured data available to retailers, such as point of sale, inventory, merchandising, and marketing data, cannot answer them, so an additional source of data is needed. As most stores are fitted with security cameras and digital surveillance technology has improved dramatically, we will assume that security cameras cover the entire store and that digital footage from the stores is available 24/7.

Dozens of video files are available from security cameras on a daily basis, and for a large retail chain the number could run into the thousands. The video files are typically voluminous, so we are certainly looking at the “V” of volume in running this analysis. The “V” of variety is not applicable here, while velocity will be a challenge if we decide to do this analysis in real time as the videos stream in. So this is certainly a Big Data problem.

Specialized code will be required to analyze the videos and identify people, entrances, and the floor layout. Some kind of logical grid of the floor plan will be needed, and the displays will then have to be marked in that logical construct. In the first step, we will use Hadoop as an ETL engine and process all the video files.

For Question 1: The output should provide the required metrics for each display, as the video image processing logic [implemented using the Hadoop technology stack described above] should identify the display, count the people passing, and count the people stopping. This is a massive computing challenge in both storage and CPU, and Hadoop is designed exactly to handle it. The structured output data should be recorded in the data warehouse.
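As a hedged sketch of what this counting step could look like [the chapter gives no code], assume an upstream video-analysis stage has already emitted one line per detection in the form display_id<TAB>event, where event is either "stopped" or "passed"; the event format and file names are illustrative assumptions. A Hadoop Streaming job in Python could then aggregate the counts per display:

```python
#!/usr/bin/env python3
"""Hadoop Streaming sketch: per-display foot-traffic counts [illustrative].

Assumed input lines:  display_id<TAB>event   where event is "stopped" or "passed".
Example submission:
  hadoop jar hadoop-streaming.jar -input /videos/detections -output /videos/counts \
      -mapper "footfall.py map" -reducer "footfall.py reduce" -file footfall.py
"""
import sys
from collections import defaultdict


def map_phase():
    # Pass through well-formed detection records as key<TAB>value pairs.
    for line in sys.stdin:
        parts = line.strip().split("\t")
        if len(parts) == 2 and parts[1] in ("stopped", "passed"):
            print(f"{parts[0]}\t{parts[1]}")


def reduce_phase():
    # Tally stop/pass events per display; Hadoop groups mapper output by key.
    counts = defaultdict(lambda: {"stopped": 0, "passed": 0})
    for line in sys.stdin:
        display_id, event = line.strip().split("\t")
        counts[display_id][event] += 1
    for display_id, c in sorted(counts.items()):
        print(f"{display_id}\tstopped={c['stopped']}\tpassed={c['passed']}")


if __name__ == "__main__":
    map_phase() if sys.argv[1:] == ["map"] else reduce_phase()
```

Locally, the same script can be smoke-tested with cat detections.tsv | ./footfall.py map | sort | ./footfall.py reduce before submitting it to the cluster.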

For Question 2: This one combines the data extracted from the videos using the Hadoop implementation with structured data already available within the data warehouse, since Point-of-Sale [PoS] systems track the sales transactions and timestamp them. The same logical grid of the floor plan will be used to identify entrances, and the video imaging program [implemented in the Hadoop technology stack] will count people entering and leaving. This output will be recorded in the data warehouse that already holds the PoS information. With the two datasets in structured form in the data warehouse, all sorts of additional questions can be asked by intermixing the data.
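A minimal sketch of how the two structured extracts might be joined and correlated once they are in the warehouse [the file and column names below are invented for illustration]:

```python
import pandas as pd

# Hypothetical extracts already landed in the warehouse as flat files:
# hourly entrance counts from the video pipeline and hourly PoS sales totals.
entrances = pd.read_csv("entrance_counts.csv", parse_dates=["hour"])
#   columns assumed: store_id, entrance_id, hour, people_in
sales = pd.read_csv("pos_sales.csv", parse_dates=["hour"])
#   columns assumed: store_id, hour, sales_amount

# Roll entrance counts up to store/hour level and join with sales on the keys.
store_traffic = (entrances
                 .groupby(["store_id", "hour"], as_index=False)["people_in"]
                 .sum())
joined = store_traffic.merge(sales, on=["store_id", "hour"], how="inner")

# Simple first look: how strongly does hourly foot traffic track hourly sales?
print("traffic/sales correlation:",
      joined["people_in"].corr(joined["sales_amount"]))
```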

For Question 3: This is an extremely complicated question because it requires identifying customers with a unique ID so they can be tracked as they pass displays as well as when they buy something. We will first limit our solution to the video footage of shoppers who stopped at a display. At the time of purchase, we will need structured data to link their loyalty card, or another form of ID we may already have on file, with this video file ID. The two streaming videos have to be combined, but we do not know which ones might contain the same customer, although we can limit ourselves to footage from the same store and within a 2–3 hour window of the video showing someone entering. This completes one part of the solution: we are able to track the shoppers who stopped at a display and also bought something [these make up our “1” records for predictive model training]. The other part of the data [the “0” records], shoppers who stopped at a display but did not buy, will be the remaining population in the video footage with a unique ID. Now that we have two population sets, we can start to work on performance variables. As described in Chapter 4, there is a lot of freedom in deriving variables from the videos: clothing, individual or family, young or old, other shopping bags in hand, food in hand, on the phone or not, and so on can all be extracted from the video files. Image processing technology has improved dramatically in the past few years, so do not be overwhelmed by the massive video image processing required here. Once the variables are identified, the model needs to be trained; and if predictions are needed in real time, so that a sales associate can attend to shoppers who may not buy, then we are dealing with the “V” of velocity as well. This type of problem requires the entire solution to be coded in the Hadoop cluster, because even if we extract the performance variables as structured data and use a conventional data mining algorithm to build the model, for real-time use the video has to be constantly processed and run through the model. Anything outside the Hadoop environment may create a scalability bottleneck.
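Once the performance variables have been extracted from the footage into structured rows, training the model itself is conventional. A hedged sketch using a logistic-regression classifier on invented feature names [any real implementation would draw these columns from the video-processing output]:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical training extract: one row per shopper who stopped at a display,
# with numeric performance variables derived from the video and a 1/0 label.
df = pd.read_csv("display_stops.csv")
features = ["in_group", "has_bags", "on_phone", "dwell_seconds", "hour_of_day"]
X, y = df[features], df["bought"]  # "bought" = 1 if the shopper purchased

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```

For the real-time scoring path described above, the same trained model would be applied to features computed continuously from the streaming video inside the Hadoop environment.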

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B9780124016965000116

Developing the Big Data Roadmap

David Loshin, in Big Data Analytics, 2013

11.1 Introduction

The goal of this book is to influence rational thinking when it comes to big data and big data analytics. Not every business problem is a big data problem, and not every big data solution is appropriate for all organizations. One benefit of rational thought is that, instead of blindly diving into the deep end of the big data pool, following the guidance in this book will help frame the scenarios in which your business can benefit from big data methods and technologies [as well as help determine when it cannot]. If so, the guidance should also help clarify the evaluation criteria for scoping the acquisition and integration of technology.

This final chapter is meant to bring the presented ideas together and motivate the development of a roadmap for evaluating and potentially implementing big data within the organization. Each of the sections in this chapter can be used as a springboard for refining a more detailed task plan with specific objectives and deliverables. Hopefully, this chapter can help devise a program management approach for transitioning to an environment that effectively exploits these exciting technologies.

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B9780124173194000119

The Value Proposition for MDM and Big Data

John R. Talburt, Yinle Zhou, in Entity Information Life Cycle for Big Data, 2015

MDM and Big Data – The N-Squared Problem

Although many traditional data processes can easily scale to take advantage of Big Data tools and techniques, MDM is not one of them. MDM has a Big Data problem with Small Data. Because MDM is based on ER, it is subject to the O[n²] problem: the effort and resources needed to complete an algorithm or data process grow in proportion to the square of the number of records being processed. In other words, the effort required to perform ER on 100 records is 4 times more than the effort to perform the same ER process on 50 records, because 100² = 10,000, 50² = 2,500, and 10,000/2,500 = 4. More simply stated, it takes 4 times more effort to scale from 50 records to 100 records because [100/50]² = 4.
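The quadratic growth is easy to verify by counting the record pairs a naive, all-pairs ER match step would compare; a quick illustration in Python [not from the chapter, just the arithmetic]:

```python
def naive_pairwise_comparisons(n):
    # A naive, all-pairs ER match step compares every record with every other
    # record: n * (n - 1) / 2 comparisons, which grows on the order of n^2.
    return n * (n - 1) // 2

for n in (50, 100, 1_000, 1_000_000):
    print(f"{n:>9,} records -> {naive_pairwise_comparisons(n):,} comparisons")

# Doubling the input roughly quadruples the work, matching the text above.
print(naive_pairwise_comparisons(100) / naive_pairwise_comparisons(50))  # ≈ 4.04
```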

Big Data not only brings challenging performance issues to ER and MDM, it also exacerbates all of the dimensions of MDM previously discussed. Multi-domain, multi-channel, hierarchical, and multi-cultural MDM are impacted by the growing volume of data that enterprises must deal with. Although the problems are formidable, ER and MDM can still be effective for Big Data. Chapters 9 and 10 focus on Big Data MDM.

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B9780128005378000016

Big Data Technologies and Cloud Computing

Wenhong Tian, Yong Zhao, in Optimized Cloud Resource Management and Scheduling, 2015

Summary

Big Data is the hot frontier of today’s information technology development. The Internet of Things, the Internet, and the rapid development of mobile communication networks have spawned big data problems, creating challenges of speed, structure, volume, cost, value, security and privacy, and interoperability. Traditional IT processing methods fall short when faced with big data problems because they lack scalability and efficiency. Big data problems need to be solved by cloud computing technology, while big data can in turn promote the practical use and implementation of cloud computing technology; there is a complementary relationship between the two. We focus on infrastructure support, data acquisition, data storage, data computing, data display, and interaction to describe several types of technology developed for big data, and then describe the challenges and opportunities of big data technology from a different angle than scholars in related fields. Big data technology is constantly growing with the surge of data volume and processing requirements, and it is affecting our daily habits and lifestyles.

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B9780128014769000021

Big Data Analytics = Machine Learning + Cloud Computing

C. Wu, ... K. Ramamohanarao, in Big Data, 2016

1.1 Introduction

Although the term “Big Data” has become popular, there is no general consensus about what it really means. Often, professional data analysts take Big Data to connote the process of extraction, transformation, and load [ETL] for large datasets. A popular description of Big Data is based on three main attributes of data: volume, velocity, and variety [or 3Vs]. Nevertheless, this does not capture all the aspects of Big Data accurately. In order to provide a comprehensive meaning of Big Data, we will investigate the term from a historical perspective and see how it has evolved from yesterday’s meaning to today’s connotation.

Historically, the term Big Data is quite vague and ill defined. It is not a precise term and does not carry a particular meaning other than the notion of its size. The word “big” is too generic; the question of how “big” is big and how “small” is small [1] is relative to time, space, and circumstance. From an evolutionary perspective, the size of “Big Data” is always evolving. If we use the current global Internet traffic capacity [2] as a measuring stick, Big Data volume would lie in the range between the terabyte [TB, 10¹² or 2⁴⁰ bytes] and the zettabyte [ZB, 10²¹ or 2⁷⁰ bytes]. Based on the historical data traffic growth rate, Cisco claimed that humanity entered the ZB era in 2015 [2]. To understand the significance of the data volume’s impact, let us glance at the average sizes of different data files shown in Table 1.

Table 1. Typical Size of Different Data Files

Media | Average Size of Data File | Notes [2014]
Web page | 1.6–2 MB | Average 100 objects
eBook | 1–5 MB | 200–350 pages
Song | 3.5–5.8 MB | Average 1.9 MB per minute [MP3, 256 Kbps rate, 3 mins]
Movie | 100–120 GB | 60 frames per second [MPEG-4 format, Full High Definition, 2 hours]

The main aim of this chapter is to provide a historical view of Big Data and to argue that it is not just 3Vs, but rather 3²Vs or 9Vs. These additional Big Data attributes reflect the real motivation behind Big Data analytics [BDA]. We believe that these expanded features clarify some basic questions about the essence of BDA: what problems Big Data can address, and what problems should not be confused with BDA. These issues are covered in the chapter through an analysis of historical developments, along with the associated technologies that support Big Data processing. The rest of the chapter is organized into eight sections as follows:

1] A historical review of Big Data

2] Interpretation of Big Data 3Vs, 4Vs, and 6Vs

3] Defining Big Data from 3Vs to 3²Vs

4] Big Data and Machine Learning [ML]

5] Big Data and cloud computing

6] Hadoop, Hadoop distributed file system [HDFS], MapReduce, Spark, and Flink

7] ML + CC [Cloud Computing] → BDA and guidelines

8] Conclusion

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B9780128053942000015

Machine learning

Patrick Schneider, Fatos Xhafa, in Anomaly Detection and Complex Event Processing over IoT Data Streams, 2022

XGBoost

XGBoost is an open-source library for optimized distributed gradient tree boosting [28] that has gained popularity lately as the algorithm of choice for many winning teams in ML competitions. This gradient boosting framework provides parallel tree boosting to solve many data science problems quickly. XGBoost runs in distributed environments [Hadoop, SGE, MPI] and can solve big data problems. The term gradient boosting originates in the idea of improving a single weak model by combining it with several other weak models to produce a collectively robust model. The weak learners are strengthened through iterative learning.
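A minimal, self-contained sketch of training an XGBoost model with its scikit-learn-style Python interface [the dataset here is synthetic, and the hyperparameter values are illustrative; distributed Hadoop/Spark/Flink deployments expose the same core API through their respective wrappers]:

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic tabular dataset standing in for a real problem.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Gradient-boosted trees; n_jobs=-1 parallelizes tree construction across cores.
model = xgb.XGBClassifier(
    n_estimators=200, max_depth=6, learning_rate=0.1,
    n_jobs=-1, eval_metric="logloss")
model.fit(X_train, y_train)

print("test accuracy:", model.score(X_test, y_test))
```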

Strengths

Interfaces for C++, Java, Python, R, and Julia; works on Linux, Windows, and macOS.

Supports the Apache Hadoop, Spark, Flink, and DataFlow distributed processing frameworks.

High performance in model training and execution speed.

Parallelization of the tree construction using all CPU cores during the training.

Distributed computing for training large models with the help of a machine cluster.

Out-of-core computing for huge amounts of data that do not fit in memory.

Cache optimized data structures for optimal use of the hardware.

Weaknesses

A boosting library designed for tabular data; hence, it is not suited to tasks such as NLP or computer vision.

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B9780128238189000195

People, Process and Politics

Rick Sherman, in Business Intelligence Guidebook, 2015

Getting Started with Data Governance

Ideally, data governance is, eventually, an enterprise-wide effort involving multiple business groups and IT. In most companies, though, it starts with a project or two and then grows. If you’re responsible for getting a governance effort launched, start by creating the organization, then focus on defining processes and data.

Start Small, Build for the Future

A typical approach would be to start small and take a tactical approach with a handful of projects that have data issues. The processes you build may be focused on these projects, but they should also be reusable for future projects. The budget, too, should be set with the goal of sustaining the governance effort over the long haul.

A significant trap that many data governance efforts fall into is trying to solve all of an organization’s data problems in the initial phase of the project. Or, companies start with their biggest data problems, issues that span the entire enterprise, and are likely to be very political. It’s almost impossible to establish a data governance program while at the same time tackling data problems that have taken years to build up. This is a case in which you need to “think globally and act locally.” In other words, data problems need to be broken down into incremental deliverables. “Too big, too fast” is a sure recipe for disaster.

Narrow Down the Scope

Another issue that can arise is that the program is too high-level and substantive data issues are never really dealt with; or, the opposite problem, it attempts to create definitions and rules for every data field in every table in every application that an enterprise has—with the result being that the effort gets bogged down in minutiae. There needs to be a happy compromise between those two extremes that enables the data governance initiative to create real business value.

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B9780124114616000174

What are the challenges that the traditional database technologies face when it comes to big data?

Top 6 Big Data Challenges:
Lack of knowledgeable professionals: to run these modern technologies and large data tools, companies need skilled data professionals.
Lack of proper understanding of massive data.
Data growth issues.
Confusion over big data tool selection.
Integrating data from a spread of sources.
Securing data.

Why can big data not be held in a traditional database structure?

RDBMSs lack support for high-velocity data because they are designed for steady data retention rather than rapid growth. Even if an RDBMS is used to handle and store “big data,” it turns out to be very expensive. As a result, the inability of relational databases to handle “big data” led to the emergence of new technologies.

What is a traditional system in big data?

Traditional data: Traditional data is the structured data that is maintained by all types of businesses, from very small firms to large organizations. In a traditional database system, a centralized database architecture is used to store and maintain the data in a fixed format, or in fixed fields in a file.

What are the major challenges with traditional RDBMSs?

5 Sure Signs It's Time to Give Up Your Relational Database:
The Problem with Relational Databases
A Large Number of JOINs
Numerous Self-JOINs [or Recursive JOINs]
Frequent Schema Changes
Slow-Running Queries [Despite Extensive Tuning]
Pre-Computing Your Results
