What is heterogeneous data in a data warehouse?

Data heterogeneity mitigation in healthcare robotic systems leveraging the Nelder–Mead method

Pritam Khan, ... Sudhir Kumar, in Artificial Intelligence for Future Generation Robotics, 2021

6.1 Introduction

Robotic systems are gaining importance with the progress of artificial intelligence. Robots are designed to aid mankind with accurate results and enhanced productivity. In the medical domain, time-series data and image data are used for research and analysis purposes. However, heterogeneity in data from various devices processing the same physical phenomenon can result in inaccurate analysis. Robots are not humans but are trained by humans; implementing a data heterogeneity mitigation technique in robotic systems therefore helps improve their performance. In IoT (Internet of Things) networks, heterogeneity is a known challenge, and various methods have been designed to mitigate the problem. Machine learning algorithms used for classification, regression, or clustering depend on the data given as input to the model; therefore, data from different devices can yield different classification accuracies.

6.1.1 Related work

Data heterogeneity and its mitigation have been explored in a few works. Jirkovzky et al. [1] discuss the various types of data heterogeneity present in a cyberphysical system. The different categories of data heterogeneity are syntactic heterogeneity, terminological heterogeneity, semantic heterogeneity, and semiotic heterogeneity [1]. That work also explores the causes of the heterogeneity. Device heterogeneity is considered for smart localization using residual neural networks in [2]; however, mitigating heterogeneity in the data used for localization would have produced more consistent results. Device heterogeneity is addressed using a localization method and a Gaussian mixture model by Pandey et al. [3]. Zero-mean and unity-mean features of Wi-Fi (wireless fidelity) received signal strength used for localization help mitigate device heterogeneity in Refs. [4] and [5]; however, the approach is not used for data classification or prediction from multiple devices using neural networks. The work presented in Ref. [6] aims at bringing interoperability into one common layer by using semantics to store heterogeneous data streams generated by different cyberphysical systems; a common data model based on linked data technologies is used for the purpose. The concept of service-oriented architecture is introduced in Ref. [7] for the mitigation of data heterogeneity. However, the versatility of the data management system remains unexplored.

In this work, we leverage the Nelder–Mead (NM) method for heterogeneity mitigation in raw data from multiple sources. Mitigating data heterogeneity across sources increases the reliability of robotic systems owing to the consistency of prediction, classification, or clustering results. We classify normal and abnormal electrocardiogram (ECG) signals from two different datasets. Although the ECG signals in the two datasets belong to different persons, we classify the normal and abnormal ECG signals from both datasets using a Long Short-Term Memory (LSTM) network in order to demonstrate the benefit of heterogeneity mitigation.
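The chapter's exact formulation of the mitigation step is given in Section 6.2. As a rough, hypothetical sketch of the idea only, the derivative-free NM simplex search can tune a simple affine transform of one device's data so that its summary statistics match those of a reference device; the objective, the transform, and the synthetic data below are illustrative assumptions rather than the authors' formulation, and SciPy's NM implementation is used for brevity.

# Hypothetical sketch: align one device's signal statistics to a reference
# device's by fitting an affine transform with the Nelder-Mead simplex method.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=2000)          # stand-in for device A data
other = 1.8 * rng.normal(loc=0.0, scale=1.0, size=2000) + 0.5  # stand-in for device B data

def mismatch(params):
    # Squared differences between summary statistics after the transform
    scale, offset = params
    transformed = scale * other + offset
    return ((transformed.mean() - reference.mean()) ** 2
            + (transformed.std() - reference.std()) ** 2)

result = minimize(mismatch, x0=[1.0, 0.0], method="Nelder-Mead")
scale, offset = result.x
harmonized = scale * other + offset
print("fitted scale and offset:", result.x)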

6.1.2 Contributions

The major contributions of our work are as follows:

1. Data heterogeneity is mitigated, thereby increasing the consistency in performance among various devices processing the same physical phenomenon.

2. The NM method, a proven optimization technique, validates our data heterogeneity mitigation process.

3. A robotic system classifies normal and abnormal ECG signals with consistent classification accuracies after mitigating data heterogeneity.

The rest of the paper is organized as follows. In Section 6.2, we discuss the preprocessing of data from two datasets and mitigate the heterogeneity using the NM method. The classification of the ECG data is discussed in Section 6.3. The results are illustrated in Section 6.4. Finally, Section 6.5 concludes the paper indicating the scope for future work.
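For concreteness, the following is a minimal, hypothetical sketch of the kind of LSTM classifier referred to in the contributions; the window length, layer sizes, and placeholder data are illustrative assumptions, and the architecture actually used is the one described in Section 6.3.

# Hypothetical sketch: a binary LSTM classifier for fixed-length ECG windows.
# Window length, layer sizes, and the random placeholder data are illustrative only.
import numpy as np
import tensorflow as tf

timesteps, features = 200, 1                                            # e.g., 200-sample ECG windows
x_train = np.random.randn(128, timesteps, features).astype("float32")   # placeholder signals
y_train = np.random.randint(0, 2, size=(128, 1))                        # 0 = normal, 1 = abnormal

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(timesteps, features)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=2, batch_size=32, verbose=0)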


URL: //www.sciencedirect.com/science/article/pii/B9780323854986000125

Machine learning in precision medicine

Dipankar Sengupta, in Machine Learning, Big Data, and IoT for Medical Informatics, 2021

3.1 Detection and diagnosis of a disease

In precision medicine, data heterogeneity forms a major challenge in the development of early diagnostic applications. This can be addressed with the aid of machine learning, which assists in extracting relevant knowledge from clinical and omics-based datasets, such as disease-specific clinical-molecular signatures or population-specific group patterns. One illustrative example is the recent development of a classifier for predicting skin lesions (skin cancer) using a single CNN (convolutional neural network), with competence comparable to that of a dermatologist (Esteva et al., 2017). Detection and diagnosis of a disease lay the groundwork for clinicians to plan and provide a targeted treatment ensuring minimal or no side effects, taking into account the patient's past clinical history and medications. In the past five years, numerous machine learning-based efforts have been made toward a better understanding of diseases, facilitating predictive diagnosis in cancer, cardiac arrhythmia, gastroenterology, ophthalmology, and other areas. The resulting genotype–phenotype associative analysis would help translate clinical management through early diagnosis and patient stratification, and thus support decision-making on the selection of available drug treatments, treatment alterations, and prognosis, providing personalized care to each patient.

In the past two decades, omics-based technologies have advanced remarkably, making a tremendous impact on the understanding of complex diseases such as cancer. With growing data complexity, machine learning is helping to extract insights that support the development of computational tools for early diagnosis of different cancer types; in diseases like cancer, an early diagnosis ensures higher chances of survival for the patient. Leukemia, a hematological malignancy, has a high occurrence and prevalence rate, and its early diagnosis remains a key challenge. To address this, diagnostic applications have been developed based on CNNs, SVMs (support vector machines), hybrid hierarchical classifiers, and other pattern-based approaches (Salah et al., 2019). Similar studies have explored different supervised machine learning approaches for breast cancer diagnosis, primarily based on histopathological images and mammograms (Gardezi et al., 2019; Yassin et al., 2018). There have also been population-specific diagnostic predictors developed, considering the genetic and physiological differences among human ethnicities; for example, models based on SVM, least squares support vector machine (LS-SVM), artificial neural network (ANN), and random forest (RF) have been used to detect prostate cancer in Chinese populations using prebiopsy data (Wang et al., 2018).
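As a generic, hypothetical illustration of the kind of supervised tabular models cited above, and not a reproduction of any published pipeline, a random-forest diagnostic classifier on synthetic prebiopsy-style features might look like the following.

# Hypothetical sketch: a random-forest diagnostic classifier on tabular clinical features.
# The features, data, and labels are synthetic placeholders, not a published model.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 6))     # e.g., age, PSA level, and other numeric markers
y = (X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))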


URL: //www.sciencedirect.com/science/article/pii/B9780128217771000136

Automatic Integration and Querying of Semantic Rich Heterogeneous Data

Muhammad Rizwan Saeed, ... Viktor K. Prasanna, in Managing the Web of Things, 2017

9.4.1 Smart Oilfields

One environment characterized by data heterogeneity, where Semantic Web technologies are being used for discovering, modeling, integrating, and sharing knowledge, is the Smart (or Digital) Oilfield. Multiple processes on oil and gas facilities generate vast volumes of data. The data can be generated by a variety of sensors and process controllers or gathered manually through inspections. This gathered data is often located in multiple independent repositories and is further processed by different users, who perform analytics and other domain-specific activities that result in the generation of derived data. For example, engineers working in the oilfields often create or reuse different simulation models [56] such as geographic models, reservoir simulation models, network models, and integrated (coupled) simulation models, all of which produce and consume vast amounts of data.

Another area where effective decision making is critical is Asset Integrity Management, where assets are continuously monitored to ensure that they perform their required functions effectively while operating within defined safe operating ranges. Asset integrity is affected by many parameters [22], including normal wear and tear, weather, production, repair, and human factors. In a rare operating condition, these seemingly independent parameters may collectively trigger a fault that could lead to a potentially disastrous loss of containment (LOC) event, which the oil and gas industry always seeks to prevent. To prevent such incidents, engineers rely on inputs from various asset databases and software tools to make important safety-related assessments and decisions on a daily basis. Due to the heterogeneity of these data sources, providing on-demand access to information with an integrated view can be challenging. A unified view of current data sources is desirable for decision making, as it could lead to the identification of telltale signatures of LOC events; however, manually cross-referencing and analyzing such data sources is labor intensive. Another challenge is knowledge management, which refers to a systematic way to capture the results of various engineering analyses and prediction models. To summarize, there are three key challenges that must be addressed to facilitate all such decision making processes [56]:

Integrated view of the information: For effective decision making, there needs to be a system that presents a comprehensive and continuous view of the assets and processes. Useful information may reside in multiple databases or files. Having scattered information makes it difficult to connect the dots and come up with actionable information. Hence, integration of multiple data sources is a crucial step.

Knowledge management: As the models (e.g., for modeling production or external corrosion) are constantly being used and improved over time (due to the availability of more data), the rationale behind the changes and decisions is generally lost. Such knowledge could be extremely useful for auditing and training purposes.

Efficient access to information: It is critical for the decision maker (production engineer, asset integrity manager, etc.) to have access to the relevant pieces of information required to make an informed decision. For example, an asset integrity manager who is looking for a solution to a problem can benefit from a database of surveys conducted on other facilities to look for similar problems and the associated recommendations or solutions.

9.4.1.1 Semantic Model of Asset Integrity Data

Semantic Web technologies can be used for an expressive representation of various heterogeneous data sources to deal with the aforementioned challenges. First of all, the data streams need to be modeled using appropriate ontological models that capture the relevant domain knowledge, as is done through the SOFOS (Smart Oilfield Ontology) and ECD (External Corrosion Detection) ontologies in [22]. The ontologies capture physical entities and their inter-relationships as well as the associated observed data, metadata, and derived data. Once all the entities in the data streams have been identified, the next step is to perform record linkage across data streams, since different types of data may be recorded for the same assets in different databases. The key idea is to integrate and present a unified view of the environment. Fig. 9.4 shows data organized as an integrated RDF graph, where the raw data was originally stored as multiple CSV files, databases, images, and text files. Data from multiple sources is integrated into a central repository serving as a single endpoint for maintaining and retrieving knowledge. Asset integrity managers can then issue meaningful queries over previously separate databases, e.g., retrieving all work orders for assets that have been labeled severely corroded in the most recent survey. This query requires data from the work orders database and the inspection database, which have now been linked together by the integration process. In this way, data from disjoint repositories can be combined to provide actionable information. This is, essentially, the end goal for any enterprise: not just to store and manage vast amounts of data, but also to derive actionable insights faster for a robust decision-making process. Other automatic and semi-automatic approaches for using ontologies to organize data and metadata into semantic repositories are discussed in [56] and [57].

Figure 9.4. Integrated asset integrity data. Data from one oil and gas facility is organized as a graph where the center node represents the facility. Each group of nodes represents all information related to a single piece of equipment, organized hierarchically. A hierarchy of data for a single piece of equipment is shown in the expanded view.

9.4.1.2 Accessing Integrated Information

Once we have the integrated repository, the data become available for querying. However, non-expert users require IT experts to build applications or create queries for them. Our definition of a non-expert user is someone who is not familiar with the concepts of databases, querying, the Semantic Web, and ontologies but is an expert in his or her area of specialization, e.g., an asset integrity manager or a production engineer. To do their job, such users only require access to data, irrespective of the underlying techniques of data organization and retrieval. For such users, the ASQFor algorithm (discussed in Section 9.3.4.1) abstracts away Semantic Web concepts (such as ontologies, RDF, and SPARQL) and allows them to formulate queries at a higher level, which the system then translates into formal SPARQL queries automatically.
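To make the running example concrete, the sketch below issues the work-order query from the previous subsection against an integrated RDF graph using rdflib. The namespace, property names, and file name are invented placeholders (the actual SOFOS/ECD vocabularies are not reproduced here); only the pattern of joining work-order and inspection data across the formerly separate databases is the point.

# Hypothetical sketch: query an integrated RDF graph for work orders on assets
# labeled severely corroded in the most recent survey. All names are placeholders.
from rdflib import Graph

g = Graph()
g.parse("integrated_asset_data.ttl", format="turtle")   # assumed export of the repository

query = """
PREFIX ex: <http://example.org/asset#>
SELECT ?workOrder ?asset
WHERE {
    ?workOrder ex:issuedFor      ?asset .
    ?asset     ex:inspectedIn    ?survey .
    ?survey    ex:isMostRecent   true .
    ?asset     ex:corrosionLevel "severe" .
}
"""
for row in g.query(query):
    print(row.workOrder, row.asset)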


URL: //www.sciencedirect.com/science/article/pii/B9780128097649000123

Data Fusion Methodology and Applications

Federica Mandreoli, Manuela Montangero, in Data Handling in Science and Technology, 2019

3.2.1 Dealing With Conflicts

Possible conflicts fall into two categories: uncertainty and contradiction. Uncertainty occurs in the case of missing information, that is, when a data source either does not represent a property for a given entity that is represented in other data sources or exhibits a null value for that property. Contradiction, instead, is a conflict between two nonnull values for the same property and the same entity.

Data conflicts occur frequently in highly heterogeneous and data-intensive scenarios such as the life sciences and can be addressed in different ways. The various strategies range from conflict ignoring strategies to conflict avoiding and conflict resolution strategies. The first kind does not take a position on what to do in case of conflict, for instance presenting all the different values, as the Pass It On strategy does. The second kind always adopts the same strategy to solve conflicts, regardless of the involved values. As an example, the Trust Your Friends strategy prefers data coming from a special data source over the others. For instance, data sources such as patient forums can be of interest from various points of view, for example to understand the feelings of patients about diseases, treatment effectiveness, side effects, and so on. However, not all kinds of data available in these sources are trustworthy, because of the lack of scientific knowledge of the people involved. So, in case of data conflicts with "more controlled" data sources (see, e.g., those presented in Ex. 1), the Trust Your Friends strategy would prefer the data coming from the more controlled sources.

Finally, conflict resolution strategies consider all the data and the knowledge about the data before deciding. A conflict resolution strategy that chooses one of the available values is called a deciding strategy. If, instead, it may present a value that is not among the available values, it is a mediating strategy. For instance, the Keep Up to Date strategy is a deciding strategy that selects the most recent value, whereas the Meet in the Middle strategy is a mediating strategy that takes an average value. Details on alternative strategies and their implementations are available in [63].

To implement a conflict resolution strategy, [63,64] introduce the notion of a conflict resolution function, that is, a function that, applied to a set of conflicting values, outputs a single resolved value. Several conflict resolution functions can be defined. Samples of conflict resolution functions related to the strategies presented earlier are shown in Table 9.1.

Table 9.1. Samples of Conflict Resolution Functions From [65] for the Strategies Described in the Text

Function | Definition | Strategy
CONCAT | Returns the concatenated values; may include annotations, such as the names of the data sources | Pass It On
MOST COMPLETE | Returns the nonnull value of the source that contains the fewest null values in the property in question | Trust Your Friends
AVERAGE/MEDIAN | Returns the average or median of all present nonnull data values | Meet in the Middle
MOST RECENT | Returns the most recent value; recency is evaluated with the help of another attribute or other metadata about the values | Keep Up to Date

Definition 1 (Conflict Resolution Function [64]). An n-ary conflict resolution function is a function f defined on a domain D that maps a set C of n conflicting input values to one output value of the same or another domain S:

f : D × … × D → S

Choosing a good conflict resolution function for a specific data heterogeneity problem is not an easy task. To this end, expert users usually weigh different aspects such as the cost of the strategy in terms of computation, the quality of the results, and the information at the function's disposal.
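As a minimal sketch, the functions in Table 9.1 can be written down directly; the record representation below (a list of (source, value) pairs for one property, plus per-source metadata) is an assumption made for illustration.

# Hypothetical sketches of the Table 9.1 conflict resolution functions, applied to the
# conflicting values of a single property of a single entity.
from statistics import mean

def concat(values):                       # Pass It On
    return "; ".join(f"{src}: {val}" for src, val in values if val is not None)

def most_complete(values, null_counts):   # Trust Your Friends
    # null_counts[src] = number of nulls that source has for this property
    non_null = [(src, val) for src, val in values if val is not None]
    return min(non_null, key=lambda sv: null_counts[sv[0]])[1]

def average(values):                      # Meet in the Middle
    return mean(val for _, val in values if val is not None)

def most_recent(values, timestamps):      # Keep Up to Date
    # timestamps[src] = metadata indicating how recent that source's value is
    non_null = [(src, val) for src, val in values if val is not None]
    return max(non_null, key=lambda sv: timestamps[sv[0]])[1]

vals = [("sourceA", 10.0), ("sourceB", 14.0), ("sourceC", None)]
print(average(vals))   # 12.0 under the Meet in the Middle strategy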

Algorithm 1 Data fusion
1: Remove exact duplicates from R
2: if fuse without subsumed records then
3:   R = R \ subsumed-records(R)
4: end if
5: Compute r = (p1:v1, …, pn:vn)

URL: //www.sciencedirect.com/science/article/pii/B9780444639844000090

Technical and clinical challenges of A.I. in retinal image analysis

Gilbert Lim, ... Tien Yin Wong, in Computational Retinal Image Analysis, 2019

3.3 Heterogeneous data

A third major technical challenge for A.I. in retinal imaging is data heterogeneity. Heterogeneity may manifest itself in many ways, both technological and methodological.

Firstly, images in a dataset may be heterogeneous due to equipment and differing image capture standards. For example, images from various sources may have been saved with different compression algorithms and compression ratios, as discussed above. Similarly, images taken by different camera models may be trivially distinguished from each other by features such as tabs at the edge of the retinal circle and the presence of borders at the top and bottom (Fig. 2). Cameras with the same major specifications (such as resolution and FOV) may in fact retain subtle differences in image capture. Even cameras of the exact same make may be differentiated by features such as minute scratches and artifacts.

Fig. 2. Example of retinal photographs that may be expected to be classified with the same model.

Secondly, images in a dataset may be heterogeneous because of the underlying distribution of the subjects. For example, the image data may be composed almost entirely of subjects of a certain ethnicity, gender, or age (Fig. 3). There may be further hidden environmental variables, such as room lighting and the time of day.

Fig. 3. Example of retinal photographs from subjects of different ethnicities.

The technical significance of these observations is that popular neural network architectures are extremely sensitive to image details and are capable of learning features that are not discernible by humans, as evidenced by adversarial images [61]. Therefore, data heterogeneity may cause a model to learn spurious relationships. For example, consider researchers testing a hypothesis as to whether a particular medical condition is predictable by a machine learning model from retinal fundus photographs. It may be that very good performance is achieved on the available training and validation data, but only because of technological biases (e.g., all images from one class are uncompressed, while all images from the other class have very minor compression artifacts) or methodological biases (e.g., all images from one class are of one ethnicity, while all images from the other class are of another ethnicity).

Data heterogeneity is ideally mitigated by ensuring that the training set has a distribution as similar as possible to that of the actual application. For example, a model to be used at a particular site is usually best trained or fine-tuned on data from that site. However, this may not always be possible. In these cases, the impact of heterogeneity may be reduced by principled sampling according to known variables. For example, it may be that for a particular rare condition there exist 500 images each from Ethnicity A and B, whereas for normal retina images there are ten thousand images from Ethnicity A but only 200 from Ethnicity B. If a model is then trained to detect the rare condition without taking the underlying ethnic distribution into consideration, it is possible that the model would ultimately classify for ethnicity rather than for the target condition. This may not even be noticed during validation if the validation data set maintains the same distribution as the training set. Preprocessing may also be undertaken to reduce heterogeneity before the retinal images are input to the actual model; techniques such as color and contrast normalization eliminate the most obvious image variation. Adaptation to new data distributions may be attempted by additional classifier training on model outputs, given a sample of the new distribution. A simple realization would be adjustable score thresholds [40].
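A minimal sketch of the kind of preprocessing mentioned above uses contrast-limited adaptive histogram equalization (CLAHE) on the lightness channel of a fundus image; the file names and parameters are illustrative placeholders, not a recommended protocol.

# Hypothetical preprocessing sketch: contrast normalization of a retinal fundus image.
# File names, clip limit, and tile size are illustrative placeholders.
import cv2

bgr = cv2.imread("fundus.png")                          # assumed input image
lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)              # work in a perceptual color space
l, a, b = cv2.split(lab)

clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
l_eq = clahe.apply(l)                                   # equalize lightness only

normalized = cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)
cv2.imwrite("fundus_normalized.png", normalized)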


URL: //www.sciencedirect.com/science/article/pii/B9780081028162000228

Machine learning

Patrick Schneider, Fatos Xhafa, in Anomaly Detection and Complex Event Processing over IoT Data Streams, 2022

Challenges and considerations

Despite the benefits of FL, it does not solve all the inherent problems in learning on medical data. Successful model training still depends on factors such as data quality, bias, and standardization [104]. These issues need to be addressed for both federated and nonfederated learning efforts through suitable measures, such as careful study design, standard protocols for data collection, structured reporting, and sophisticated methods for detecting bias and hidden stratification.

Below, key aspects of FL are reviewed that are particularly relevant when applied to digital health and need to be considered when building an FL system. Technical details and in-depth discussions can be found in recent studies [115,50,112].

Data heterogeneity

Medical data are diverse, not only because of the variety of modalities, dimensionality, and characteristics in general, but also, within a given protocol, due to factors such as acquisition differences, medical device brand, or local demographics. FL can help address specific sources of bias through potentially increased diversity of data sources, but heterogeneous data distributions present a challenge for FL algorithms and strategies, as many assume independently and identically distributed (IID) data across participants. In general, strategies such as FedAvg [71] tend to fail under these conditions [71,62,120], which in part defeats the very purpose of collaborative learning strategies. However, recent results suggest that FL training is still feasible [65] even when medical data are not evenly distributed across institutions [63,87] or contain local bias [8]. Research addressing this problem includes, for example, FedProx [62], a partial data sharing strategy [120], and FL with domain adaptation [64]. Another challenge is that data heterogeneity can lead to a situation where the optimal global solution is not optimal for an individual local participant. Therefore, the definition of model-training optimality should be agreed upon by all participants prior to training.
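For orientation, the following is a minimal sketch of the weighted averaging at the core of FedAvg [71], the baseline that heterogeneity-aware strategies such as FedProx modify; the two-client example and layer shapes are illustrative assumptions.

# Hypothetical sketch of federated averaging (FedAvg): the server combines client model
# weights, weighting each client by its number of local training samples.
import numpy as np

def fedavg(client_weights, client_sizes):
    # client_weights: one list of np.ndarrays per client, with matching layer shapes
    total = sum(client_sizes)
    return [
        sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
        for layer in range(len(client_weights[0]))
    ]

# Two hospitals with different amounts of local data
hospital_a = [np.ones((2, 2)), np.zeros(2)]
hospital_b = [np.zeros((2, 2)), np.ones(2)]
global_weights = fedavg([hospital_a, hospital_b], client_sizes=[900, 100])
print(global_weights[0])   # 0.9 everywhere: dominated by the larger hospital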

Privacy and security

Healthcare data is sensitive and must be protected accordingly, with appropriate confidentiality procedures. Therefore, some of the most important considerations are the trade-offs, strategies, and remaining risks regarding the privacy-friendly potential of FL.

Privacy vs. performance

FL does not solve all possible privacy issues, and, similar to ML algorithms in general, it will always introduce some risks. Privacy-preserving techniques for FL provide a level of protection that exceeds current commercially available ML models [50]. However, there is a trade-off in performance, and these techniques may affect the accuracy of the final model [61]. In addition, future techniques and auxiliary data could be used to compromise a model that was previously considered low risk.

Level of trust

Participating parties can enter into two types of FL collaboration:

Trusted

- For FL consortia, where all parties are deemed trustworthy and bound by an enforceable collaboration agreement, we can rule out many of the more nefarious motivations, such as deliberate attempts to extract sensitive information or deliberate damage to the model. This reduces the need for sophisticated countermeasures and draws on the principles of standard collaboration research.

Untrusted

- For FL systems operating at scale, it may be impractical to establish an enforceable collaboration agreement. Some clients may intentionally attempt to degrade performance, crash the system, or extract information from other parties. Therefore, security strategies are required to mitigate these risks, such as advanced encryption of model inputs, secure authentication of all parties, action traceability, differential privacy, verification systems, execution integrity, model confidentiality, and protection against adversarial attacks.

Information leakage

By definition, FL systems avoid sharing health data among participating institutions. However, the shared information may still indirectly reveal private data used for local training, for example through model inversion [111] of model updates, through the gradients themselves [121], or via adversarial attacks [106,46]. FL differs from traditional training in that the training process is exposed to multiple parties, which increases the risk of leakage through reverse engineering. Leakage can occur when attackers can observe model changes over time, observe specific model updates (i.e., the update from a single institution), or manipulate the model (e.g., inducing additional memorization by others through gradient-ascent-style attacks). Developing countermeasures, such as limiting the granularity of the updates, adding noise [63,64], and ensuring adequate differential privacy [10], may be required and is still an active area of research [50].
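One of the countermeasures mentioned above can be sketched very simply: clip each shared update and add Gaussian noise before it leaves the institution, in the style of differentially private training. The clipping norm and noise scale below are illustrative placeholders, not calibrated privacy parameters.

# Hypothetical sketch: clip a client's update and add Gaussian noise before sharing it.
# The clipping norm and noise scale are illustrative, not calibrated privacy parameters.
import numpy as np

def privatize_update(update, clip_norm=1.0, noise_std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))          # bound each client's influence
    return clipped + rng.normal(scale=noise_std, size=update.shape)  # mask the exact values

local_update = np.array([0.8, -2.5, 0.3])   # e.g., a flattened gradient or weight delta
print(privatize_update(local_update))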

Traceability and accountability

As for all safety-critical applications, system reproducibility is critical for FL in healthcare. Unlike centralized training, FL requires computation across multiple participants in environments with a significant variety of hardware, software, and networks. Traceability of all system resources, including data access history, training configurations, and hyperparameter tuning throughout the training process, is imperative.

Especially for untrusted federations, traceability and accountability processes require execution integrity. When the training process achieves the mutually agreed-upon model optimality criteria, it may be helpful to measure each participant's contribution, such as the computational resources consumed or the quality of the data used for local training. These measurements could then determine appropriate compensation and establish a revenue model among the participants [42]. One implication of FL is that researchers cannot directly examine the data on which models are trained when trying to understand unexpected outcomes. In addition, taking statistical measurements of the training data as part of the model development workflow must be approved by the collaborating parties so as not to violate data privacy. Although each individual site will have access to its own raw data, the collaboration may decide to provide some type of secure intra-node viewing capability to meet this need, or provide another way to increase the explainability and interpretability of the global model.

System architecture

Unlike FL run at large scale among consumer devices [71], healthcare participants have relatively powerful computational resources and reliable, higher-throughput networks, enabling larger models, many more local training steps, and the sharing of more model information between nodes. These characteristics of FL in healthcare also carry challenges, such as ensuring data integrity in communications by using redundant nodes, designing secure encryption methods to prevent data leaks, and designing appropriate node schedulers to optimize the use of distributed computing devices and reduce idle time.

The management of such a federation can be implemented in a variety of ways. In situations that require the strictest privacy between parties, training can be done through a type of "honest broker" system, in which a trusted third party acts as an intermediary and facilitates access to the data. This setup requires an independent controlling entity for the overall system, which is not always feasible because it can involve additional costs and procedural overhead. However, it has the advantage of abstracting the exact internal mechanisms away from the clients, making the system more agile and easier to update. In a peer-to-peer system, each site communicates directly with the other participants. There is no gatekeeper function, so all protocols must be agreed upon in advance, which requires significant coordination effort, and changes must be made by all participants in a synchronized manner to avoid problems.


URL: //www.sciencedirect.com/science/article/pii/B9780128238189000195

Compared Evaluation of Scientific Data Management Systems

Zoé Lacroix, Terence Critchlow, in Bioinformatics, 2003

13.3 TRADEOFFS

This section explores some of the tradeoffs to be considered when evaluating systems and some of the unique characteristics of biological data management systems that complicate their design and evaluation. As with the evaluation metrics, there is no clearly best approach, but rather the user requirements and system constraints need to be included in the evaluation. The purpose of the following sections is simply to call out certain characteristics and encourage the readers to consider them. This is meant to be an illustrative, not exhaustive, list. There are many tradeoffs and considerations that are not discussed but that may be important for evaluating a system within a specific environment.

13.3.1 Materialized vs. Non-Materialized

Materialized approaches are usually faster than non-materialized ones for query execution. This makes intuitive sense because the data is stored in a single location and in a format supportive of the queries. To confirm this intuition, tests were run in 1995 with several implementations of the query: "Retrieve the HUGO names, accession numbers, and amino acid sequences of all known human genes mapped to chromosome c" [12]. These tests used the Genomic Unified Schema (GUS) warehouse as the materialized source and the K2/Kleisli system as the non-materialized source. The query requires integrating data from the Genome DataBase (GDB), the Genome Sequence DataBase (GSDB), and GenBank. Measurements showed that, for all implementations, the warehouse is significantly faster. In certain cases, queries executed by K2 as part of this evaluation failed to complete due to network timeouts. The expression of the query (using semi-joins rather than nested-loop iterations) also affected its execution performance.

In addition to the communication overhead, the middleware between the user interface and the remote data may introduce computational overhead. More recently, tests have been performed at IBM to determine whether or not a middleware approach such as DiscoveryLink (presented in Chapter 11) affects access costs when interacting with a single database. They conducted two series of tests in which DiscoveryLink was compared to a production database at Aventis [13]. The results show that, in the tested context, for a single user, the middleware did not affect the performance. None of the tested queries involved the manipulation of large amounts of data; however, they contained many sub-queries and unions. In some cases, accessing the database through middleware and a wrapper was even faster than direct access to the database system. The load test shows that both configurations scale well, and the response times for both approaches are comparable to the single-user case.

There are a variety of factors to be considered beyond the execution cost. Materialized databases are generally more secure because queries can be performed entirely behind a firewall. Non-materialized approaches have the advantage that they always return the most up-to-date information available from the sources, which can be important in a highly dynamic environment. They also require significantly less disk space and can be easier to maintain (particularly if the system does not resolve semantic conflicts).

13.3.2 Data Distribution and Heterogeneity

Many systems presented in the previous chapters are mediation systems. Mediation systems integrate fully autonomous, distributed, heterogeneous data sources such as various database systems (relational, object-relational, object, XML, etc.) and flat files. In general, the performance characteristics of distributed database systems are not well understood [14]. There are not enough distributed database applications to provide a framework for evaluation and comparison. In addition, the performance models of distributed database systems are not sufficiently developed, and it is not clear that the existing benchmarks for testing the performance of transaction processing applications in pure database contexts can be used to measure the performance of distributed transaction management. Furthermore, because the resources are not always databases, the mediation approach is more complex than the multi-database and other distributed database architectures typically studied in computer science.

For many bioinformatics systems, issues related to data distribution and heterogeneity are considerable and significantly affect the performance. As a result, they typically integrate only the minimal number of sources required to perform a given task, even when additional information could be useful. The complexity of this domain and the lack of objective information favor domain-specific evaluation approaches over generic ones for this characteristic.

13.3.3 Semi-Structured Data vs. Fully Structured Data

Previous chapters have pointed out that scientific data are usually complex, and their structures can be fluid. For these reasons, a system relying on a semi-structured framework rather than a fully structured approach, such as a relational database, seems more adequate. Although there are systems that utilize meta-level capabilities within relational databases to develop and maintain flexibility, they are usually not scalable enough to meet the demands of modern genomics. The success of XML as a self-describing data representation language for electronic information interchange makes it a good candidate for biological data representation. The design of a generic benchmark for evaluating XML management systems is a non-trivial task in general, and it becomes much more challenging when combined with data management and performance issues inherent to genomics.

Some attempts have been made to design a generic XML benchmark. Three generic XML benchmarks, limited to locally stored data in a single-machine or single-user environment, have been designed: XOO7 [3], XMach-1 [15], and XMark [16]. XOO7 attempts to harness the similarities between the data models of XML and object-oriented approaches. The XMach-1 benchmark [15] is a multi-user benchmark designed for business-to-business applications, which assumes that the data size is rather small (1 to 14 KB). XMark [16] is a newer benchmark for XML data stores. It consists of an application scenario that models an Internet auction site and 20 XQuery queries designed to cover the essentials of XML query processing.

XOO7 appears to be the most comprehensive benchmark. Both XMark and XMach-1 focus on a data-centric usage of XML. All three benchmarks provide queries to test relational model characteristics such as selection, projection, and reduction. Properties such as transaction processing, view manipulation, aggregation, and update are not yet tested by any of the benchmarks. XMach-1 covers delete and insert operations, although the semantics of such operations are not yet clearly defined for the XML query model. Additional information about XML benchmarks can be found in Bressan et al.'s XML Management System Benchmarks [17].

Native XML systems have been compared to XML-enabled systems (relational systems that provide an XML interface allowing users to view and query their data in XML) with three collections of queries: data-driven, document-driven, and navigational queries [18]. Tests confirm that XML-enabled management systems perform better than native XML systems for data-driven queries. However, native XML systems outperform XML-enabled ones on document-driven and navigational queries. This is not unexpected, because enabled systems are tuned to optimize the execution of relational queries but do not efficiently represent nested or linked data. Thus, navigational queries within enabled systems are rather slow, whereas native systems are able to exploit the concise representation of data in XML. Finally, document queries may use the implicit order of elements within the XML file. This ordering is not typically represented in relational databases; therefore, defining an appropriate representation is a tedious task and negatively affects performance.

The type of system that is most appropriate depends heavily on the types of queries expected, the data being integrated, and the tools with which the system must interact. Scientific queries exploit all characteristics of XML queries: data, navigation, and document. An XML biological information system will need to perform well in all these contexts. An XML biological benchmark will be needed to evaluate XML biological information systems.

13.3.4 Text Retrieval

For many tasks, scientists access their data through a document-based interface. Indeed, a large amount of the data consists of textual annotations. Life scientists extensively use search engines to access data and navigation to explore the data. Unlike database approaches, structured models cannot be used to represent a document or many queries over document sets (e.g., given a document, find other documents that are similar to it). The evaluation of a textual retrieval engine typically relies on the notion of relevance of a document: a document is relevant if it satisfies the query. The notion of relevance is subjective because retrieval engines typically provide users with a limited query language consisting of Boolean expressions of keywords or phrases (strings of characters). In such a context, the query often does not express the user's intent, and thus the notion of relevance is used to capture the level of satisfaction of the user rather than the validation of the query. Relevance is considered to have two components: recall and precision. Recall is the ratio of the number of relevant documents retrieved by the engine to the total number of relevant documents in the entire data set. A recall equal to one means all relevant documents were retrieved, whereas a recall of zero means no relevant document was retrieved. A recall of one does not guarantee the satisfaction of the user; indeed, the engine may have retrieved numerous non-relevant documents (noise).

Precision is the ratio of the number of relevant documents retrieved by the engine to the total number of retrieved documents, and thus reflects the noise in the response. A precision equal to one means all retrieved documents are relevant, whereas a precision of zero means no retrieved document is relevant. Ideally, a query would have both a precision and a recall of one, returning exactly the set of documents desired. Unfortunately, state-of-the-art text query engines are far from that ideal. Currently, recall and precision are inversely related in most systems, and a balance is sought to obtain the best overall performance while not being overly restrictive.
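These two definitions reduce to simple set arithmetic; the document identifiers below are made-up placeholders.

# Small sketch of the recall and precision definitions above, computed over sets of
# document identifiers (placeholder values).
def recall(relevant, retrieved):
    return len(relevant & retrieved) / len(relevant) if relevant else 1.0

def precision(relevant, retrieved):
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0

relevant = {"doc1", "doc2", "doc3", "doc4"}
retrieved = {"doc2", "doc4", "doc9"}
print(recall(relevant, retrieved), precision(relevant, retrieved))   # 0.5 and roughly 0.67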

13.3.5 Integrating Applications

System requirements usually include the ability to use sophisticated applications to access and analyze scientific data. The more applications that are available, the more functionality the system has. However, integrating applications such as BLAST may significantly affect system performance in unanticipated and unpredictable ways. For example, a call to blastp against a moderately sized data set will return a result within seconds, whereas a call to tblastn against a large data set may require hours. The evaluation of the performance of the overall integration approach must include information about the stand-alone performance of the integrated resources. This information, including the context in which optimal performance can be obtained, is often poorly documented, partially because many of the useful analysis tools are developed in academic contexts where little effort is made to characterize and advertise their performance. Readers who are involved in tool development are invited to better characterize the performance of their tools so that systems can integrate them more effectively.


URL: //www.sciencedirect.com/science/article/pii/B9781558608290500152

Creating University Analytical Information Systems

I. Guitart, J. Conesa, in Formative Assessment, Learning Data Analytics and Gamification, 2016

5.2 A Grand Challenge?

Tony Hoare [2003] suggested seventeen criteria that must be satisfied by a research goal in order for it to qualify as a grand challenge. These criteria include the five key properties dealt with in the previous section. Twelve additional criteria relate mainly to the maturity of the discipline and the feasibility of the goal. This section assesses the proposed research goals with respect to the additional criteria.

Fundamental. Creating analytic systems that deal with educational data in universities is a fundamental concern in the information systems and the e-learning fields.

Astonishing. The results of a system like the one proposed would bring analytics to another level, allowing the creation of decision support systems useful for teachers, teacher managers, students, and researchers. These decision support systems would allow teachers to improve their teaching activities (the perceived quality of learning resources, the main doubts of students, students' perceptions of the subject obtained by analyzing the messages they have written in the communication channels, a 360-degree view of students to easily detect the strengths and weaknesses of each of them, etc.). Managers could update the subjects or academic programs they manage according to job market needs; students would have a more personalized learning experience, where the resources, activities, and feedback depend on their performance and expectations; and researchers would have more information about related research and be better able to deal with the infoxication they experience due to the huge number of papers that are published. In our past experience (Guitart and Conesa, 2014; Guitart et al., 2013; Guitart et al., 2015; Guitart and Conesa, 2015), these agents are not aware of the possibilities of analytics for learning processes, and in some cases we even found some resistance at the beginning of the project because people believed that the analytical questions the project needed to solve were unsolvable.

Revolutionary. If the goal is achieved, it will lead to a radical change in universities. In particular, an analytical system such as the one proposed would change the way in which teaching and learning are performed.

Research-directed. Significant parts of the goal can be achieved only through research. Among these are the gathering and integration of diverse data in the context of universities, the management of very large volumes of data, heterogeneity, defining methodologies to address UAIS projects more effectively, defining metrics and evaluation models to state the relevance and usefulness of UAIS systems in a given environment, and efficiently adapting the lessons learnt from BI in enterprises to the higher education context.

Inspiring. The goal [using analytics in education] has the support of almost the entire e-learning research community and the university community.

International. The goal has an international scope. Researchers from all around the world are already participating in it by working in LA and related areas.

Historical. The goal of creating analytic information systems in the context of universities has already been proposed in the past (Merriam, 1998; Rothschild and White, 1995). Since then, the use of analytics in universities has continued to be a very prolific research area, which has led to different approaches. Some of these have focused on reformulating analytics for use in universities, e.g., learning analytics (Ferguson, 2012), academic analytics (Campbell et al., 2007), and action analytics (Norris et al., 2008). Others have focused on advancing in small steps to take full advantage of analytics in the context of universities, such as educational data mining (Romero and Ventura, 2010), or on the definition of standards and specifications in universities (IEEE-LOM, 2011; Dublin Core, 2001).

Feasible. The goal is now more feasible than ever. Over the last few years, the economic crisis has worsened the panorama for universities, creating new constraints: less public funding, more competition, and more demanding students with higher expectations and tighter economic restrictions. In this new panorama, universities have to be managed efficiently and gain competitiveness in order to guarantee their sustainability. In this context, university staff have focused on analytical approaches in order to improve their internal management, better understand their context and stakeholders, and improve their main activities (teaching and research). The new analytical approaches, mostly under the umbrella of learning analytics, have improved the awareness of university staff about the relevance of analytics. Thus, UAIS are more feasible than ever, because there is a critical mass of researchers working in this direction, many novel and relevant approaches are being created, and universities have the culture to adopt and use UAIS efficiently. In addition, the advancement of other research areas that may help in the attainment of the goal of UAIS, such as Big Data or Visual Analytics, supports its feasibility as well.

Cooperative. The work required to achieve the proposed goal can be parceled out to teams working independently. For example, some groups may focus on the definition of the skeleton of an analytical IS for universities, others on the definition of relevant indicators that should be taken into account in universities, and others on the data management techniques necessary to deal with the huge quantity of data gathered from the VLE and on the definition of models to evaluate analytic IS in the university context. Different groups can perform these research activities concurrently, and their outcomes can be adapted by other groups to create an integrated analytical system cooperatively.

Competitive. The proposed goal encourages and benefits from competition among teams. The cultural particularities of different universities and the high levels of autonomy teachers enjoy make each university a particular scenario with its own actors, processes, and politics. The inclusion of analytics cannot limit the autonomy of teachers and researchers, but should promote their creativity and productivity; otherwise, the academic field would be standardized and the academic community as we know it would disappear. The heterogeneity that universities promote means that most of the problems related to the research agenda have different solutions with different ranges of application and different levels of efficiency for each university.

Effective. The promulgation of the necessity of creating analytical IS for universities is intended to cause a shift in the attitudes and activities of the relevant research, student, academic and professional communities. Academics should be aware of the amount of data that is generated during teaching activities and the amount of data that can be extracted from their learning resources. Accordingly they should begin to participate in the definition of the key indicators that may help in their work and to document thoroughly the learning resources used in order to facilitate data extraction. VLE developers should begin to gather data from all the implementations they perform. The success of analytics depends greatly on the amount of data gathered, so each process implemented in a VLE should gather and store the data related to the process and the involved actors. Researchers in the field of LA, educational data mining and action analytics should take into account the lessons learnt in the BI field, profiting from their methodologies, knowledge, technologies and tools to create analytical systems that are simpler, more powerful, more generalizable and more testable. Students should be aware that all their data [personal, enrollment, performance, navigational, etc.] can be used in order to create services that improve teaching, learning and university sustainability. These new services will surely improve the learning experience of students but at the cost of their privacy. Therefore, students should be aware of this fact and play an active role in stating what privacy they agree to forgo in order to gain a better learning experience.

Risk-managed. The main critical risks to the project arise from the impossibility of gathering relevant information from some learning resources, privacy aspects, and the adoption of the analytical IS by the leadership of universities. Text learning resources (such as PDF files, ePub books, etc.) can be parsed to extract information from their content in order to nourish the analytical systems. However, parsing some kinds of resources, such as videos or audio files, cannot generate relevant information; in these cases, information can be extracted from their use (how students consume the resources) but not from their content. This mismatch between different learning resources may pose difficulties for the creation of some analytical services related to the content of courses. The problem can be minimized if universities develop governance over which formats should be used for each kind of learning resource. Another potential problem is convincing university leaders of the necessity of creating analytical ISs for teachers, researchers, and students. There is a risk of implementing such systems from a purely technical perspective rather than as a management project that uses the right methodology to solve a given problem; that may result in the failure of the project and generate incomplete results that do not reach the expected objectives and are not generalizable. The development of this kind of system is very expensive. If they are seen as systems that benefit academics in a task they are supposed to do with or without the system, the temptation may be not to invest in such analytical systems. To avoid this problem, validation models should be created to evaluate the usefulness of analytical ISs, and measures such as Return on Investment or Payback should be reformulated for universities to show clearly how the improvement of teaching and research can impact university sustainability. The privacy problem is ongoing and lies outside the scope of this research community; however, some measures can be taken, such as avoiding the collection of students' personal information and using anonymization systems to make sure that particular students cannot be singled out in the analysis of the data.


URL: //www.sciencedirect.com/science/article/pii/B9780128036372000099

The Rebirth of Enterprise Data Management

Alan Simon, in Enterprise Business Intelligence and Data Management, 2014

1.1 In the beginning: how we got to where we are today

Those who cannot remember the past are condemned to repeat it.

- George Santayana [1863–1952]

To best understand the state of enterprise data management [EDM] today, it’s important to understand how we arrived at this point during a journey that dates back nearly 50 years to the days when enormous, expensive mainframe computers were the backbone of “data processing” [as Information Technology was commonly referred to long ago] and computing technology was still in its adolescence.

1.1.1 1960s and 1970s

Many data processing textbooks of the 1960s and 1970s proposed a vision much like that depicted in Figure 1.1.

Fig. 1.1. 1960s/1970s vision of a common “data base.”

The simplified architecture envisioned by many prognosticators called for a single common "data base" that would provide a single primary store of data for core business applications such as accounting (general ledger, accounts payable, accounts receivable, payroll, etc.), finance, personnel, procurement, and others. One application might write a new record into the data base that would then be used by another application.

In many ways, this “single data base” vision is similar to the capabilities offered today by many enterprise systems vendors in which a consolidated store of data underlies enterprise resource planning [ERP], customer relationship management [CRM], supply chain management [SCM], human capital management [HCM], and other applications that have touch-points with one another. Under this architecture the typical company or governmental agency would face far fewer conflicting data definitions and semantics; conflicting business rules; unnecessary data duplication; and other hindrances than what is found in today’s organizational data landscape.

Despite this vision of a highly ordered, quasi-utopian data management architecture, the result for most companies and governmental agencies looked far more like the diagram in Figure 1.2, with each application “owning” its own file systems, tapes, and first-generation database management systems [DBMSs].

Fig. 1.2. The reality of most 1960s/1970s data environments.

Even when an organization’s portfolio of applications was housed on a single mainframe, the vision of a shared pool of data among those applications was typically nowhere in the picture. However, the various applications – many of which were custom-written in those days – still needed to share data among themselves. For example, Accounts Receivable and Accounts Payable applications needed to feed data into the General Ledger application. Most organizations found themselves rapidly slipping into the “spider’s web quagmire” of numerous one-by-one data exchange interfaces as depicted in Figure 1.3.

Fig. 1.3. Ungoverned data integration via proliferating one-by-one interfaces.

By the time the 1970s drew to a close and computing was becoming more and more prevalent within business and government, any vision of managing one’s data assets at an enterprise level was far from a reality for most organizations. Instead, a world of uncoordinated, often conflicting data silos was what we were left with.

1.1.2 1980s

As the 1980s progressed, the data silo problem actually began to worsen. Minicomputers had been introduced in the 1960s and had grown in popularity during the 1970s, led by vendors such as Digital Equipment Corporation [DEC] and Data General. Increasingly, the fragmentation of both applications and data moved from the realm of the mainframe into minicomputers as organizations began deploying core applications on these newer, smaller-scale platforms. Consequently, the one-by-one file transfers and other types of data exchange depicted in Figure 1.3 were now increasingly occurring across hardware, operating system platforms, and networks, many of which were only beginning to “talk” to one another. As the 1980s proceeded and personal computers [often called “microcomputers” at the time] grew wildly in popularity, the typical enterprise’s data architecture grew even more fragmented and chaotic.

Many organizations realized that they now were facing a serious problem with their fragmented data silos, as did many of the leading technology vendors. Throughout the 1980s, two major approaches took shape in an attempt to overcome the fragmentation problem:

Enterprise data models

Distributed database management systems [DDBMSs]

1.1.2.1 Enterprise Data Models

Companies and governmental agencies attempted to get their arms around their own data fragmentation problems by embarking on enterprise data model initiatives. Using conceptual and logical data modeling techniques that began in the 1970s such as entity-relationship modeling, teams of data modelers would attempt to understand and document the enterprise’s existing data elements and attributes as well as the details of relationships among those elements. The operating premise governing these efforts was that by investing the time and resources to analyze, understand, and document all of the enterprise’s data across any number of barriers – application, platform, and organizational, in particular – the “data chaos” would begin to dissipate and new systems could be built leveraging the data structures, relationships, and data-oriented business rules that already existed.

While many enterprise data modeling initiatives did produce a better understanding of an organization’s data assets than before a given initiative had begun, these efforts largely withered over time and tended not to yield anywhere near the economies of scale originally envisioned at project inception. The application portfolio of the typical organization in the 1980s was both fast-growing and highly volatile, and an enterprise data modeling initiative almost inevitably fell behind the new and rapidly changing data under the control of any given application or system. The result: even before completion, most enterprise data models became “stale” and outdated, and were quietly mothballed.

[As most readers know, data modeling techniques are still widely used today, although primarily as part of the up-front analysis and design phase for a specific software development or package implementation project rather than attempting to document the entire breadth of an enterprise’s data assets.]

1.1.2.2 Distributed Database Management Systems [DDBMSs]

Enterprise data modeling efforts on the parts of companies and governmental agencies were primarily an attempt to understand an organization’s highly fragmented data. The data models themselves did nothing to help facilitate the integration of data across platforms, databases, organizational boundaries, etc.

To address the data fragmentation problem from an integration perspective, most of the leading computer companies and database vendors of the 1980s began work on DDBMSs. The specific technical approaches from companies such as IBM [Information Warehouse], Digital Equipment Corporation [RdbStar], Ingres [Ingres Star], and others varied from one to another, but the fundamental premise of most DDBMS efforts was as depicted in Figure 1.4.

Fig. 1.4. The DDBMS concept.

The DDBMS story went like this: regardless of how scattered an organization’s data might be, a single data model-driven interface could sit between applications and end-users and the underlying databases, including those from other vendors operating under different DBMSs [#2 and #3 in Figure 1.4]. The DDBMS engine would provide location and platform transparency to abstract applications and users from the underlying data distribution and heterogeneity, and both read-write access as well as read-only access to the enterprise’s data through the DDBMS would be possible.

For a number of reasons the DDBMS approach of the late 1980s faltered. Computing technology of the day wasn’t robust or powerful enough to handle the required levels of cross-referencing, filtering, and other data management operations across vast networks. Consequently, the immature state of the art in distributed transaction management – needed to support relational database COMMIT and ROLLBACK operations across multiple physical databases, and in particular multiple databases under the control of heterogeneous DBMSs – became the undoing of the DDBMS movement. Other reasons also came into play that are beyond the scope of our discussion here; but the key takeaway is that as the 1980s gave way to the 1990s, organizations were still left with an enterprise data fragmentation problem that was becoming worse by the year.
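
To make concrete why that coordination was so demanding, the following is a minimal Python sketch of the two-phase commit idea a DDBMS engine would have needed across heterogeneous databases. The Participant class, its methods, and the database names are hypothetical illustrations, not any vendor’s actual API.

# Minimal two-phase commit sketch; Participant is a hypothetical stand-in, not a vendor API.
class Participant:
    """Stand-in for one physical database running under its own DBMS."""
    def __init__(self, name):
        self.name = name
        self.pending = None

    def prepare(self, work):
        # Phase 1: stage the work tentatively and vote on whether it can commit.
        self.pending = work
        return True  # a real participant could vote False (abort) here

    def commit(self):
        print(f"{self.name}: COMMIT {self.pending}")

    def rollback(self):
        print(f"{self.name}: ROLLBACK")


def two_phase_commit(participants, work_items):
    # Phase 1: ask every database to prepare its piece of the transaction.
    votes = [p.prepare(w) for p, w in zip(participants, work_items)]
    # Phase 2: commit everywhere only if every vote was "yes"; otherwise roll back everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


if __name__ == "__main__":
    dbs = [Participant("orders_db"), Participant("ledger_db")]
    two_phase_commit(dbs, ["INSERT order", "INSERT journal entry"])

The hard part in practice was not this happy path but coordinator crashes, in-doubt transactions, and slow or partitioned networks – precisely where the technology of the day fell short.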

1.1.3 1990s

Throughout the 1980s and even back into the 1970s, many organizations built extract files that pulled select data from an operational system and loaded the data into a separate file system or database to produce reports. The primary reason for creating duplicate data was to avoid adversely impacting the operational systems, which were usually finely tuned to achieve the best possible performance with the technology of the day. With this in mind, two new approaches sprouted in the early 1990s:

Data warehousing

Read-only distributed data access

1.1.3.1 Data Warehousing

The more popular and long-lasting of the two by far was the data warehouse, which essentially was taking the extract file approach of the 1970s and 1980s and adding a great deal more rigor and discipline to how organizations copied data from source systems into a separate “reporting database.”

Whereas most reporting databases pulled data from only one or two source applications to produce a precisely targeted set of reports, the data warehousing concept was originally envisioned by most early proponents to be enterprise wide in scale. Figure 1.5 depicts the typical enterprise data warehouse [EDW] attempt of the early 1990s, with the vast majority of any given organization’s key applications feeding data into the EDW…which would then be the primary source for reporting and other data access needs for the majority of users across the enterprise.

Fig. 1.5. The EDW vision.
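
As a rough illustration of the copying pattern underlying the EDW, the sketch below extracts rows from a hypothetical Accounts Receivable source, applies a simple transformation, and loads them into a warehouse table. The table names, columns, and use of in-memory SQLite are invented purely for illustration and do not reflect any particular product or the author’s tooling.

import sqlite3

# Hypothetical source system and warehouse, both in-memory purely for illustration.
src_ar = sqlite3.connect(":memory:")     # stand-in for an Accounts Receivable system
warehouse = sqlite3.connect(":memory:")  # stand-in for the EDW's reporting database

src_ar.execute("CREATE TABLE invoices (cust TEXT, amount_cents INTEGER, posted TEXT)")
src_ar.executemany("INSERT INTO invoices VALUES (?, ?, ?)",
                   [("acme ", 125000, "1993-04-01"), ("Globex", 99900, "1993-04-02")])

warehouse.execute("CREATE TABLE fact_revenue (customer TEXT, amount_usd REAL, posted_date TEXT)")

def etl_invoices(source, target):
    # Extract: pull raw rows from the operational system.
    rows = source.execute("SELECT cust, amount_cents, posted FROM invoices").fetchall()
    # Transform: normalize names and units to the warehouse's conventions.
    cleaned = [(cust.strip().upper(), cents / 100.0, posted) for cust, cents, posted in rows]
    # Load: write into the separate reporting database, off the operational system.
    target.executemany("INSERT INTO fact_revenue VALUES (?, ?, ?)", cleaned)
    target.commit()

etl_invoices(src_ar, warehouse)
print(warehouse.execute("SELECT * FROM fact_revenue").fetchall())

In a real EDW this extraction, transformation, and loading would run on a schedule against dozens of source systems, which is exactly where the scoping, quality, and governance issues discussed below tended to surface.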

Even though an EDW appears to be a straightforward proposition, project cost and schedule overruns, as well as outright failures, in the early and mid-1990s were fairly common. EDWs failed for a number of reasons, and not all of those reasons had to do with database technology or underperforming/overpromising first-generation business intelligence [BI] tools. EDW initiatives ran into problems in the 1990s for many of the same reasons they run into problems today: scoping problems, master data management [MDM] discrepancies, data governance conflicts, and the other issues addressed in this book.

While EDWs made only slight headway in addressing the overall problem of enterprise data management [EDM], the discipline did establish enough of a beachhead and gained enough momentum that we still have enterprise data warehousing as a key weapon in today’s and tomorrow’s efforts to once and for all make headway in addressing our enterprise data challenges.

1.1.3.2 Read-Only Distributed Data Access

Even as data warehousing and its companion discipline BI gained in popularity throughout the early and mid-1990s, some technologists rebelled against the concept of copying data into a separate database where reports and analytics would then be run. To their way of thinking, storage was still a somewhat precious commodity, and duplicating data was a costly prospect. Further, each extraction, transformation, and loading [ETL] job to copy data from one or more source systems into a data warehouse was ripe for introducing errors and anomalies in the data. [Never mind that the quality of the original-form data housed in many applications was itself highly suspect.]

Taking a fresh look at the failure of the DDBMSs of the 1980s discussed earlier, the belief emerged that DDBMSs had failed primarily because they were built to be read-write environments rather than read-only. By removing “write” operations from the DDBMS picture, the thinking went, the synthesized data model sitting on top of multiple underlying databases would be able to help address the data fragmentation problem, at least for reporting and data access.

Many of the DDBMS vendors repurposed their platforms into read-only environments as alternatives to the copying-based approach of data warehousing. DEC, for example, attempted to repurpose its RdbStar DDBMS into a new environment called the Information Network. A Computerworld article in September 1992 noted that:

DEC officials also spoke about the remnants of the earlier RdbStar distributed database technology, now referred to as the Information Network. They hope to release a version of the product by early 1993 that will act as a manager of heterogeneous RDBMSs so that users will be able to access and manage data located across a range of databases.

Other vendors such as Information Builders, with their Enterprise Data Access [EDA]/SQL product, joined in the approach of “virtual data warehousing” as an alternative to what we might term “physical data warehousing,” as discussed in the previous section. The virtual data warehousing approach never gained much traction as the 1990s progressed, but it has persisted as a niche offering over the years. In the mid-2000s, enterprise information integration [EII] capabilities were offered by some vendors, and the basic concept has evolved into today’s data virtualization capabilities offered by a number of vendors.

Essentially, read-only distributed data access and its generational successors attempted to address a large part of the EDM fragmentation problem by overlaying the many different underlying databases and their respective DBMSs with a unified, understandable, and well-governed layer that maps onto whatever physical topology quagmire exists underneath.
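
As a minimal sketch of that read-only overlay idea, assume two hypothetical underlying databases that each expose a customers table; a small federation layer pushes the same read-only query to every source and merges the results, with no data copied into a separate warehouse. The class name, connection names, and data are invented for illustration.

import sqlite3

# Two hypothetical underlying databases, each owning part of the "customers" data.
crm = sqlite3.connect(":memory:")
erp = sqlite3.connect(":memory:")
for conn, rows in [(crm, [(1, "ACME"), (2, "Globex")]), (erp, [(3, "Initech")])]:
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)

class ReadOnlyFederation:
    """Unified, read-only layer over several physical databases."""
    def __init__(self, sources):
        self.sources = sources  # the underlying connections being federated

    def query(self, sql):
        # Push the same read-only query to every source and merge the result sets.
        results = []
        for conn in self.sources:
            results.extend(conn.execute(sql).fetchall())
        return results

federation = ReadOnlyFederation([crm, erp])
print(federation.query("SELECT id, name FROM customers"))  # one logical answer, no copying

Real data virtualization products layer query planning, push-down optimization, caching, and governance on top of this basic pattern.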

Even as organizations tried to gain a foothold with their EDM problems, even more challenges resulted [albeit inadvertently] from the Y2K problem. Companies and governmental agencies had a choice between two different approaches to addressing Y2K as the clock ticked down:

1.

Remediate [patch and fix] existing custom and packaged software to correct any two-digit date issues in the code; or

2.

Replace outdated legacy software with modern, Y2K-compliant software packages…typically ERP software from vendors such as SAP, Oracle, PeopleSoft, and others.

Given the urgency of the Y2K problem, many organizations that chose option #2 – replacing legacy applications with well-architected ERP software – were so focused on beating the Y2K clock that they didn’t have the time, personnel, or financial resources to take advantage of the rare opportunity to address their EDM challenges at the same time. These organizations were also trying to come to grips with the first generation of eCommerce as well as new CRM and SCM applications, and with all that was going on in most organizations it isn’t surprising that data architecture and governance took a back seat to getting systems installed.

Most organizations had every intention of addressing EDM – as well as integrating their new enterprise systems, and a host of other on-the-books initiatives – after Y2K came and went.

1.1.4 2000s

Between early 2000 and late 2002, the global economy was subjected to:

The dot-com meltdown;

The aftermath of the 9/11 terrorist attacks, including deep budget cuts in many companies and governmental agencies as the economy continued to slow;

The fallout from the accounting and business scandals of the early 2000s [Enron, Tyco, WorldCom, and others] that did further damage to the overall economy and business budgets.

For close to 3 years, many organizations retrenched into “maintenance mode” in which they focused their efforts largely on break-fix support work, with significant cutbacks in enhancing and integrating their new systems…not to mention putting many initiatives that fell under the EDM umbrella on the back burner. Data warehousing-type projects continued to get scaled back to more modest data marts that often successfully addressed specific reporting and analysis needs…but also increased the data fragmentation and reporting silo problem. Mantras from the 1990s and the dawn of the BI/data warehousing modern era such as “seeking a single version of the truth” were as distant a dream as ever for most organizations in the early 2000s.

By early 2004, most economies around the world had recovered sufficiently that technology spending increased and organizations once again began to take a critical look at their EDM problems. Some organizations made significant progress over the next couple of years, while others were far less successful. But regardless of the gains any given organization did or didn’t achieve in the 2004–2008 timeframe, the Great Recession that began in late 2008 had an even more severe impact on technology investment for most companies than the recession at the beginning of the decade. Even though conventional wisdom holds that the actual recession was over by mid-2009, the severity of the downturn resulted in overall suppressed technology investments for several more years.

And all the while, organizations continued to struggle through many of the same EDM challenges that they’ve faced for decades.

1.1.5 Today

For most businesses and governmental agencies, the Great Recession is behind us. Technology investment is on the upswing, and has been for several years. The Big Data Era is upon us, with an entirely new portfolio of high-capacity, high-velocity technology available for a new generation of data management. More importantly, we have a quarter century’s worth of best practices, success stories, and lessons learned to draw from.

Further, many organizations are finally coming to grips with the realization that failure to get their EDM house in order is a recipe for even greater chaos than they may have experienced in the past. Data volumes are exploding, and even if organizations can apply data warehousing appliances and Big Data technologies and architecture to deal with the data volumes, meaningful progress will be hard to come by without an accompanying well-formulated EDM roadmap.

Chapter 5 contains further discussion about today’s – and tomorrow’s – data management architecture; we will look at the concept of the Big Data-driven “data lake” and various architectural options for how “data lakes” either coexist or supplant traditional data warehousing. Stay tuned.

Read full chapter

URL: //www.sciencedirect.com/science/article/pii/B9780128015391000010

Middleware for the Internet of Things: A survey on requirements, enabling technologies, and solutions

Jingbin Zhang, ... Xiao-dong Sun, in Journal of Systems Architecture, 2021

2.1.2 Things-oriented characteristics

Heterogeneity: The heterogeneity of IoT mainly includes three aspects: device heterogeneity, data heterogeneity, and communication heterogeneity. In IoT, various types of connected devices sense and store data in different formats and support different communication technologies [e.g., LoRa [29], ZigBee, Bluetooth Low Energy, 4G, 5G]. Rationally handling the interoperation and data exchange between heterogeneous devices is an urgent need, but it remains a significant challenge; a minimal sketch of one normalization approach follows this list.

Resource-constrained: Embedded devices are limited in their memory, communication, and processing power, and cannot handle complex computation. Since these devices do not have enough power to run traditional operating systems, lightweight operating systems such as Contiki [30], LiteOS [31], and TinyOS [32] have been designed.
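
As a minimal sketch of how middleware might smooth over data heterogeneity, assume two invented payload formats – a JSON reading from a BLE device and a CSV-style reading from a LoRa node – that are mapped into one common record, normalizing units along the way. The field names, payload shapes, and conversions are hypothetical and are not drawn from any of the cited systems.

import json
from dataclasses import dataclass

@dataclass
class Reading:
    """Common record the middleware exposes to applications."""
    device_id: str
    metric: str
    value: float
    unit: str

def from_json_device(payload: str) -> Reading:
    # Hypothetical BLE device reporting JSON such as {"id": "ble-3", "temp_c": 36.6}.
    msg = json.loads(payload)
    return Reading(msg["id"], "temperature", float(msg["temp_c"]), "C")

def from_csv_device(payload: str) -> Reading:
    # Hypothetical LoRa node reporting CSV such as "node-7,temperature,97.9,F".
    dev, metric, value, unit = payload.split(",")
    value = float(value)
    if unit == "F":  # normalize units as well as formats
        value, unit = (value - 32) * 5 / 9, "C"
    return Reading(dev, metric, value, unit)

readings = [
    from_json_device('{"id": "ble-3", "temp_c": 36.6}'),
    from_csv_device("node-7,temperature,97.9,F"),
]
for r in readings:
    print(r)  # every device now yields the same record shape

Communication heterogeneity would be handled analogously, with protocol-specific adapters all feeding the same common record type.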

Read full article

URL: //www.sciencedirect.com/science/article/pii/S1383762121000795

What is meant by heterogeneous data?

Heterogeneous data are any data with high variability of data types and formats. They may be ambiguous and of low quality due to missing values, high data redundancy, and untruthfulness. This makes it difficult to integrate heterogeneous data to meet business information demands.

What is heterogeneous in DBMS?

A heterogeneous database system is an automated [or semi-automated] system for the integration of heterogeneous, disparate database management systems to present a user with a single, unified query interface.

What are heterogeneous sources?

A class of traffic consisting of a number of flows of the same traffic category [e.g., video or voice] but with different QoS parameters [e.g., inter-arrival time, packet length distribution, etc.].

What is heterogeneity in big data?

From another perspective, it is defined as “any data with high variability of data types and formats” [3]. Combining the previous definitions, big data heterogeneity can be defined as “any massive data gathered from diverse data sources with a high variety of data types and formats.”
