Visualizing Airline Delays with Spark

In Chapter 3, we explored using Hadoop Streaming and MapReduce to compute the average flight delay per airport using the US Department of Transportation's on-time flight dataset. This kind of computation—parsing a CSV file and performing an aggregate computation—is an extremely common use case of Hadoop, particularly as CSV data is easily exported from relational databases. This dataset, which records all US domestic flight departure and arrival times along with their delays, is also interesting because while a single month is easily computed upon, the entire dataset benefits from distributed computation due to its size.

In this example, we'll use Spark to perform an aggregation of this dataset, in particular determining which airlines were the most delayed in April 2014. We will specifically look at the slightly more advanced (and Pythonic) techniques we can use due to the increased flexibility of the Spark Python API. Moreover, we will show how central the driver program is to the computation by pulling the results back and displaying a visualization on the driver machine using matplotlib.

In order to get a feel for how Spark applications are structured, and to see the template described in the previous section in action, we will first inspect a 10,000-foot view of the complete structure of the program with the details snipped out:
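A minimal sketch of that skeleton might look like the following (all identifiers here, including APP_NAME, Flight, split, parse, plot, and main, are assumed names for illustration, not necessarily those used in the original listing):

```python
import csv                               # used by the snipped bodies
from datetime import datetime            # used by the snipped bodies
from collections import namedtuple
from operator import add, itemgetter     # used by the snipped bodies

# Module constants: configurable data shared with every executor.
APP_NAME = "Flight Delay Analysis"
DATE_FMT = "%Y-%m-%d"
TIME_FMT = "%H%M"

# Lightweight, name-addressable parsed rows.
fields = ("date", "airline", "flightnum", "origin", "dest",
          "dep", "dep_delay", "arv", "arv_delay", "airtime", "distance")
Flight = namedtuple("Flight", fields)

def split(line):
    """Split a single line of CSV text into a row (details snipped)."""
    ...

def parse(row):
    """Convert a split row into a Flight named tuple (details snipped)."""
    ...

def plot(delays):
    """Visualize the collected results on the driver (details snipped)."""
    ...

def main(sc):
    """Define the transformations and actions on the airline dataset."""
    ...

if __name__ == "__main__":
    # Configure Spark and execute main with the SparkContext.
    try:
        from pyspark import SparkConf, SparkContext
    except ImportError:
        SparkConf = None  # sketch stays importable without pyspark
    if SparkConf is not None:
        conf = SparkConf().setMaster("local[*]").setAppName(APP_NAME)
        sc = SparkContext(conf=conf)
        main(sc)
```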
This snippet of code, while long, provides a good overview of the structure of an actual Spark program. The imports show the usual mixture of standard library tools as well as a third-party library, matplotlib.

Note: As with Hadoop Streaming, any third-party Python dependencies that are not part of the Python standard library must be preinstalled on each node in the cluster. However, unlike Hadoop Streaming, the fact that there are two contexts, the driver context and the executor context, means that some heavyweight libraries (particularly visualization libraries) can be installed only on the driver machine, so long as they are not used in a closure passed to a Spark operation that will execute on the cluster. To prevent errors, wrap such imports in try/except blocks.

The application then defines some configurable data, including the date and time formats for parsing datetime strings and the application name. A specialized namedtuple data structure is also created in order to produce lightweight and accessible parsed rows from the input data. This information should be available to all executors, but is lightweight enough not to require a broadcast variable. Next, the processing functions, parse, split, and plot, are defined, as well as a main function that uses the SparkContext to define the actions and transformations on the airline dataset. Finally, the if __name__ == "__main__" block configures Spark and executes the main function.

With this high-level overview complete, let's dive deeper into the specifics of the code, starting with the main function that defines the primary Spark operations and the analytical methodology:
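The body of main might be sketched as follows (a reconstruction under assumed names, with split and parse stubbed here because they are covered separately; sc is the SparkContext handed in by the driver block):

```python
from operator import add, itemgetter

def split(line):
    ...  # closure function: CSV line -> row (described later)

def parse(row):
    ...  # closure function: row -> Flight named tuple (described later)

def main(sc):
    # Load the airline code -> name jump table, collect it to the
    # driver as a dict, and broadcast it to every node in the cluster.
    airlines = dict(sc.textFile("ontime/airlines.csv").map(split).collect())
    airline_lookup = sc.broadcast(airlines)

    # Load and parse the much larger flight records dataset as an RDD.
    flights = sc.textFile("ontime/flights.csv").map(split).map(parse)

    # Key each flight by airline name (via the broadcast lookup);
    # the value is the sum of the departure and arrival delays.
    delays = flights.map(lambda f: (airline_lookup.value[f.airline],
                                    add(f.dep_delay, f.arv_delay)))

    # Sum the delays per airline on the cluster and collect the result.
    delays = delays.reduceByKey(add).collect()

    # Back on the driver: sort in memory and print to the console.
    delays = sorted(delays, key=itemgetter(1))
    for airline, delay in delays:
        print("%0.0f minutes delayed\t%s" % (delay, airline))
```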
Our first job is to load our two data sources from disk: first, a lookup table of airline codes to airline names, and second, the flight instances dataset. The airlines.csv dataset is a small jump table that allows us to join airline codes with the full airline name; because this dataset is so small, we don't have to perform a distributed join of two RDDs. Instead, we store this information as a Python dictionary and broadcast it to every node in the cluster using sc.broadcast, which transforms the local Python dictionary into a broadcast variable.

The creation of this broadcast variable proceeds as follows. First, an RDD is created from the text file on the local disk called airlines.csv (note the relative path). Creating an RDD is useful because this data could instead be coming from a Hadoop data source, which would be specified with a URI to the location (e.g., hdfs:// for HDFS data or s3:// for S3); if the file is simply on the local machine, then loading it into an RDD is not strictly necessary. The split function is then mapped to every element in the dataset, as discussed momentarily. Finally, the collect action is applied to the RDD, which brings the data back from the cluster to the driver as a Python list. Because the collect action was applied, when this line of code executes, a job is sent to the cluster to load the RDD, split it, and return the results to the driver program:
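The lines in question might look like the following sketch, here wrapped in a hypothetical helper named build_airline_lookup for illustration (split is the CSV line parser described next and is stubbed out):

```python
def split(line):
    ...  # closure function: CSV line -> row (described next)

def build_airline_lookup(sc):
    # Build an RDD from the local airlines.csv (note the relative path;
    # an hdfs:// or s3:// URI would name a Hadoop data source instead),
    # map split over each line, and collect the rows to the driver.
    pairs = sc.textFile("ontime/airlines.csv").map(split).collect()
    # Turn the (code, name) pairs into a dict and broadcast it so every
    # executor holds a read-only copy.
    return sc.broadcast(dict(pairs))
```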
The split function parses each line of text using the csv module by creating a file-like object from the line of text using StringIO, which is then passed into csv.reader. Because there is only a single line of text, we can simply return the first row from the reader. While this method of CSV parsing may seem heavyweight, it allows us to more easily deal with delimiters, escaping, and other nuances of CSV processing. For larger datasets, a similar methodology is applied to entire files using sc.wholeTextFiles to process many CSV files that are split into blocks of 128 MB each (i.e., the block size used on HDFS):
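A sketch of this closure function, assuming Python 3's io.StringIO:

```python
import csv
from io import StringIO

def split(line):
    # Wrap the single line of text in a file-like object so the csv
    # module handles delimiters, quoting, and escaping for us.
    reader = csv.reader(StringIO(line))
    # Only one line of text was passed in, so return the first row.
    return next(reader)
```

For example, split('WN,"Southwest Airlines Co."') returns ['WN', 'Southwest Airlines Co.'], correctly unwrapping the quoted field.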
Next, the main function loads the much larger flights.csv, which needs to be computed upon in a parallel fashion using an RDD. After splitting the CSV rows, we map the parse function to each row, which converts dates and times to Python date and time objects, and casts floating-point numbers appropriately. The output of this function is a namedtuple called Flight that was defined in the module constants section of the application.

Named tuples are lightweight data structures that contain record information such that data can be accessed by name—for example, flight.dep_delay rather than by position (e.g., row[6]). Like normal Python tuples, they are immutable, so they are safe to use in processing applications because the data can't be modified. Additionally, they are much more memory and processing efficient than dictionaries, and as a result provide a noticeable benefit in big data applications like Spark, where memory is at a premium.

With an RDD of Flight objects in hand, the final transformation is to map an anonymous function that transforms the RDD into a series of key/value pairs, where the key is the name of the airline and the value is the sum of the arrival and departure delays. At this point, besides the creation of the airlines dictionary, no execution has been performed on the collection. However, once we begin to sum the per-airline delays using the reduceByKey transformation and the add operator, and then collect the result, the job is executed across the cluster and the output brought back to the driver program.

At this point, the cluster computation is complete, and we proceed in a sequential fashion on the driver program. The delays are sorted by delay magnitude in the memory of the driver program. Note that this is possible for the same reason that we created the airlines lookup table as a broadcast variable: the number of airlines is small, and it is more efficient to sort in memory. However, if this RDD were extremely large, a distributed sort using sortByKey could be used. Finally, instead of writing the results to disk, the output is printed to the console.
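The named tuple and the parse function described above might be sketched as follows (the exact field names and formats are assumptions based on the on-time dataset's columns, not a reproduction of the original listing):

```python
from datetime import datetime
from collections import namedtuple

# Assumed date/time formats from the module constants section.
DATE_FMT = "%Y-%m-%d"
TIME_FMT = "%H%M"

fields = ("date", "airline", "flightnum", "origin", "dest",
          "dep", "dep_delay", "arv", "arv_delay", "airtime", "distance")
Flight = namedtuple("Flight", fields)

def parse(row):
    # Convert date and time strings into Python objects and numeric
    # strings into floats, then build an immutable Flight record.
    row[0] = datetime.strptime(row[0], DATE_FMT).date()
    row[5] = datetime.strptime(row[5], TIME_FMT).time()
    row[6] = float(row[6])
    row[7] = datetime.strptime(row[7], TIME_FMT).time()
    row[8] = float(row[8])
    row[9] = float(row[9])
    row[10] = float(row[10])
    return Flight(*row[:11])
```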
If this dataset were big, the take action might be used to return only the first n items to the driver rather than printing the entire dataset, or saveAsTextFile could be used to write the data back to our local disk or to HDFS. Finally, because we have the data available in the driver, we can visualize the results using matplotlib.
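For instance (a sketch; summarize is a hypothetical helper, and delays stands for the RDD of per-airline totals before any collect is called):

```python
def summarize(delays, n=10):
    # take returns just the first n records to the driver rather than
    # collecting (and printing) the entire result set.
    sample = delays.take(n)
    # saveAsTextFile writes the whole RDD out in parallel instead of
    # returning it; the path may be local or an hdfs:// URI for HDFS.
    delays.saveAsTextFile("ontime/delay-totals")
    return sample
```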
Hopefully this example illustrates the interplay of the cluster and the driver program (sending work out to the cluster for analysis, then bringing results back to the driver), as well as the role of Python code in a Spark application. To run this code (presuming that you have a directory called ontime containing the two CSV files), use the spark-submit command. Because we hardcoded the master as local[*] in the configuration under the if __name__ == "__main__" block, this command creates a Spark job with as many processes as are available on the localhost. It then begins executing the transformations and actions specified in the main function with the local SparkContext. First it loads the jump table as an RDD, collects it, and broadcasts it to all processes; then it loads the flight data RDD and processes it to compute the per-airline delays in a parallel fashion. Once the output from the collect is returned to the driver, we can visualize the result with matplotlib.
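The submission and the driver block it triggers look roughly like this (the filename app.py and the application name are assumptions):

```python
# Submitted from the shell as:
#
#   $ spark-submit app.py
#
# spark-submit runs this file as __main__, so the block below fires,
# builds a local[*] SparkContext, and hands it to main.
APP_NAME = "Flight Delay Analysis"  # assumed application name

def main(sc):
    ...  # the transformations and actions described above

if __name__ == "__main__":
    try:
        from pyspark import SparkConf, SparkContext
    except ImportError:
        SparkConf = None  # sketch stays readable without pyspark
    if SparkConf is not None:
        conf = SparkConf().setMaster("local[*]").setAppName(APP_NAME)
        sc = SparkContext(conf=conf)
        main(sc)
```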