Google File System in Big Data

Because one HDFS instance may consist of thousands of servers, failure of at least one server is inevitable. HDFS has been built to detect faults and automatically recover quickly.

Access to streaming data

HDFS is intended more for batch processing than for interactive use, so the design emphasizes high data throughput rates, which suit streaming access to data sets.

Accommodation of large data sets

HDFS accommodates applications that have data sets typically gigabytes to terabytes in size. HDFS provides high aggregate data bandwidth and can scale to hundreds of nodes in a single cluster.

Portability

To facilitate adoption, HDFS is designed to be portable across multiple hardware platforms and to be compatible with a variety of underlying operating systems.


An example of HDFS

Consider a file that includes the phone numbers for everyone in the United States; the numbers for people with a last name starting with A might be stored on server 1, B on server 2, and so on.

With Hadoop, pieces of this phonebook would be stored across the cluster, and to reconstruct the entire phonebook, your program would need the blocks from every server in the cluster.

To ensure availability if and when a server fails, HDFS replicates these smaller pieces onto two additional servers by default. (The redundancy can be increased or decreased on a per-file basis or for a whole environment; for example, a development Hadoop cluster typically doesn’t need any data redundancy.) This redundancy offers multiple benefits, the most obvious being higher availability.

The redundancy also allows the Hadoop cluster to break up work into smaller chunks and run those jobs on all the servers in the cluster for better scalability. Finally, you gain the benefit of data locality, which is critical when working with large data sets.
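
As a concrete sketch of how a client works with such a file, the snippet below uses Hadoop's standard Java FileSystem API to raise the replication factor of one file and read it back. The file path and the replication factor of 5 are illustrative assumptions, not values taken from this article.

    // Minimal HDFS client sketch (Hadoop Java API).
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PhonebookReader {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);     // connects to the configured HDFS
            Path phonebook = new Path("/data/phonebook.txt"); // hypothetical file

            // Per-file redundancy: raise this file's replication factor from
            // the default (3) to 5.
            fs.setReplication(phonebook, (short) 5);

            // Read the file; HDFS fetches each block from whichever server holds it.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(phonebook)))) {
                System.out.println(reader.readLine()); // first entry
            }
        }
    }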

In this article, you will learn about the distributed file system, including its features, components, advantages, and disadvantages.

What is Distributed File System?

A distributed file system (DFS) is a file system that is spread across multiple file servers and locations. It permits programs to access and store remote data in the same way as local files, and it allows users to access files from any system. Network users can share information and files in a regulated, authorized manner, while the servers retain complete control over the data and grant access to users.

The primary goal of a DFS is to enable users of physically distributed systems to share resources and information through a common file system. It runs as part of the operating system, and a typical configuration is a collection of workstations and mainframes connected by a LAN. The process of creating a namespace in DFS is transparent to the clients.

A DFS service has two components, which are as follows:

  1. Local Transparency
  2. Redundancy

Local Transparency

It is achieved via the namespace component.

Redundancy

It is achieved via a file replication component.

In the case of failure or heavy load, these components work together to increase data availability by allowing shares in multiple locations to be logically grouped under a single folder known as the "DFS root".

It is not required to use both DFS components simultaneously; the namespace component can be used without the file replication component, and the file replication component can be used between servers without the namespace component.
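
As a rough illustration of how the two components fit together, the sketch below models a namespace that maps one logical path under a DFS root to several physical shares (the namespace component) and falls back to the next replica when one server is unreachable (the redundancy component). The server names are hypothetical, and a real DFS namespace is managed by Windows Server, not by application code like this.

    import java.util.List;
    import java.util.Map;

    public class DfsNamespaceSketch {
        // Namespace component: one logical folder maps to several physical shares.
        static final Map<String, List<String>> ROOT = Map.of(
            "\\\\corp\\dfsroot\\reports",
            List.of("\\\\server1\\reports", "\\\\server2\\reports"));

        // Redundancy component: if the first target is down, try the next replica.
        static String resolve(String logicalPath) {
            for (String target : ROOT.getOrDefault(logicalPath, List.of())) {
                if (isReachable(target)) {
                    return target;
                }
            }
            throw new IllegalStateException("no replica available for " + logicalPath);
        }

        static boolean isReachable(String share) {
            return true; // placeholder: a real client would probe the server
        }

        public static void main(String[] args) {
            System.out.println(resolve("\\\\corp\\dfsroot\\reports"));
        }
    }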

Features

There are various features of the DFS. Some of them are as follows:

Transparency

There are mainly four types of transparency. These are as follows:

1. Structure Transparency

The client does not need to know the number or location of file servers and storage devices. For structure transparency, multiple file servers should be provided for adaptability, dependability, and performance.

2. Naming Transparency

There should be no hint of the file's location in the file's name. The file name should not change when the file is moved from one node to another.

3. Access Transparency

Local and remote files must be accessible in the same way. The file system must automatically locate the accessed file and deliver it to the client.

4. Replication Transparency

When a file is replicated across multiple nodes, the copies and their locations must be hidden from clients.
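
Access and naming transparency can be made concrete with Hadoop's FileSystem abstraction, where the same client code reads a local or a remote file and only the URI scheme differs. The URIs below are illustrative assumptions.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TransparentRead {
        static void printFirstByte(String uri) throws Exception {
            Configuration conf = new Configuration();
            // The scheme (file:// or hdfs://) selects the implementation;
            // the calling code is identical either way.
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            try (FSDataInputStream in = fs.open(new Path(uri))) {
                System.out.println(uri + " -> first byte " + in.read());
            }
        }

        public static void main(String[] args) throws Exception {
            printFirstByte("file:///tmp/example.txt");          // local file
            printFirstByte("hdfs://namenode:8020/example.txt"); // remote file
        }
    }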

Scalability

A distributed system inevitably grows over time as more machines are added to the network or two networks are linked together. A good DFS must be designed to scale rapidly as the number of nodes and users in the system increases.

Data Integrity

A file system is usually shared by many users. The file system needs to secure the integrity of data saved in a shared file, so a concurrency control method must correctly synchronize concurrent access requests from several users competing for access to the same file. For data integrity, a file system commonly provides users with atomic transactions, a high-level concurrency-control mechanism.
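
As a minimal single-machine illustration of concurrency control, the sketch below takes an exclusive file lock before writing, so concurrent writers cannot interleave their updates. Real distributed file systems use distributed lock or lease protocols rather than local java.nio locks, and the file name is hypothetical.

    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class LockedWrite {
        public static void main(String[] args) throws Exception {
            Path file = Path.of("shared-counter.txt"); // hypothetical shared file
            try (FileChannel ch = FileChannel.open(file,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                // lock() blocks until no other process holds the lock, so the
                // write below happens as one isolated step.
                try (FileLock lock = ch.lock()) {
                    ch.write(ByteBuffer.wrap("updated atomically\n".getBytes()));
                }
            }
        }
    }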

High Reliability

The risk of data loss must be limited as much as feasible in an effective DFS. Users must not feel compelled to make backups of their files due to the system's unreliability. Instead, a file system should back up key files so that they may be restored if the originals are lost. As a high-reliability strategy, many file systems use stable storage.

High Availability

A DFS should continue to function in the case of a partial failure, such as a node failure, a storage device crash, or a link failure.

Ease of Use

In a multiprogramming environment, the file system's user interface must be simple, and the number of commands must be kept as small as possible.

Performance

Performance is assessed by the average time it takes to satisfy client requests. It must be comparable to that of a centralized file system.

Distributed File System Replication

Initial versions of DFS used Microsoft's File Replication Service (FRS), enabling basic file replication among servers. FRS detects new or altered files and distributes the most recent versions of the full file to all servers.

Windows Server 2003 R2 introduced "DFS Replication" (DFSR). It improves on FRS by copying only the parts of files that have changed and by compressing data to reduce network traffic. It also gives administrators flexible configuration options to control network traffic on a configurable schedule.
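
The idea of replicating only the changed parts of a file can be sketched with per-block checksums: hash fixed-size blocks of the old and new versions and re-send only the blocks whose hashes differ. The sketch below illustrates the concept only; it is not Microsoft's actual Remote Differential Compression algorithm, and the 64 KB block size is an arbitrary choice.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.zip.CRC32;

    public class BlockDiff {
        static final int BLOCK = 64 * 1024; // 64 KB blocks, chosen arbitrarily

        // One checksum per fixed-size block of the file.
        static List<Long> checksums(byte[] data) {
            List<Long> sums = new ArrayList<>();
            for (int off = 0; off < data.length; off += BLOCK) {
                CRC32 crc = new CRC32();
                crc.update(data, off, Math.min(BLOCK, data.length - off));
                sums.add(crc.getValue());
            }
            return sums;
        }

        // Indices of the blocks that must be re-sent to the replica.
        static List<Integer> changedBlocks(byte[] oldFile, byte[] newFile) {
            List<Long> before = checksums(oldFile);
            List<Long> after = checksums(newFile);
            List<Integer> changed = new ArrayList<>();
            for (int i = 0; i < after.size(); i++) {
                if (i >= before.size() || !before.get(i).equals(after.get(i))) {
                    changed.add(i);
                }
            }
            return changed;
        }
    }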

History of Distributed File System

The server component of DFS was first introduced as an add-on feature. When it was incorporated into Windows NT 4.0 Server, it was called "DFS 4.1". It was later declared a standard component of all editions of Windows 2000 Server. Windows NT 4.0 and later versions of Windows include client-side support.

Linux kernels 2.6.14 and later include a DFS-compatible SMB client VFS known as "cifs". DFS support is available in Mac OS X 10.7 (Lion) and later.

Working of Distributed File System

A DFS can be implemented in either of two ways, which are as follows:

  1. Standalone DFS namespace
  2. Domain-based DFS namespace

Standalone DFS namespace

It does not use Active Directory and only permits DFS roots that exist on the local computer. A standalone DFS can only be accessed on the computer on which it was created. It does not offer fault tolerance and cannot be linked to any other DFS.

Domain-based DFS namespace

It stores the DFS configuration in Active Directory and creates a namespace root accessible at \\<domainname>\<dfsroot> or \\<FQDN>\<dfsroot>.

DFS namespace

Traditional file shares, which are tied to a single server, use SMB paths of the form:

    \\<servername>\<sharename>

Domain-based DFS file share paths are identified by using the domain name in place of the server's name, in the form:

    \\<domainname>\<dfsroot>

When users access such a share, either directly or by mapping a drive, their computer connects to one of the available servers hosting that share, following rules defined by the network administrator. By default, users access the server nearest to them; this behavior can be changed to prefer a specific server.
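
The referral-ordering rule can be sketched as follows: the client receives a list of servers hosting the share, each with a site cost, and connects to the cheapest (nearest) one. The server names and costs below are hypothetical; real DFS clients obtain this ordering from Active Directory site information.

    import java.util.Comparator;
    import java.util.List;

    public class ReferralPicker {
        record Target(String server, int siteCost) {}

        // Pick the lowest-cost (nearest) server from the referral list.
        static Target pick(List<Target> referrals) {
            return referrals.stream()
                    .min(Comparator.comparingInt(Target::siteCost))
                    .orElseThrow();
        }

        public static void main(String[] args) {
            System.out.println(pick(List.of(
                    new Target("\\\\nyc-fs01\\share", 10),
                    new Target("\\\\local-fs01\\share", 0)))); // nearest wins
        }
    }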

Applications of Distributed File System

There are several applications of the distributed file system. Some of them are as follows:

Hadoop

Hadoop is a collection of open-source software utilities. It is a software framework that uses the MapReduce programming model for distributed storage and processing of large amounts of data. Hadoop is made up of a storage component, known as the Hadoop Distributed File System (HDFS), and a processing component based on the MapReduce programming model.
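
The classic illustration of the MapReduce style is a word-count job. The sketch below is written against the standard Hadoop Java MapReduce API; the input and output paths are supplied as command-line arguments.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map step: emit (word, 1) for every word in the input split.
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        ctx.write(word, ONE);
                    }
                }
            }
        }

        // Reduce step: sum the counts emitted for each word.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }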

NFS (Network File System)

NFS is a client-server protocol that enables a computer user to store, update, and view files remotely. It is one of several DFS standards for network-attached storage.

SMB (Server Message Block)

IBM developed the SMB protocol for file sharing. It was designed to permit systems to read and write files on a remote host across a LAN. The directories on the remote host that can be accessed through SMB are known as "shares".

NetWare

NetWare is a discontinued computer network operating system developed by Novell, Inc. It mainly used the IPX network protocol and cooperative multitasking to run various services on a computer system.

CIFS (Common Internet File System)

CIFS is a dialect of SMB; the CIFS protocol is Microsoft's implementation of the SMB protocol.

Advantages and Disadvantages of Distributed File System

There are various advantages and disadvantages of the distributed file system. These are as follows:

Advantages

There are various advantages of the distributed file system. Some of the advantages are as follows:

  1. It allows users to access and store data from any system.
  2. It helps to improve access time, network efficiency, and the availability of files.
  3. It provides data transparency even if a server or disk fails.
  4. It permits data to be shared remotely.
  5. It makes it easier to scale the amount of stored data and to exchange data.

Disadvantages

There are various disadvantages of the distributed file system. Some of the disadvantages are as follows:

  1. Nodes and connections must be secured, so the security of data in a DFS is harder to guarantee.
  2. Messages and data may be lost in the network while moving from one node to another.
  3. Connecting to and handling the data is more complicated than in a single-user system.
  4. Overloading may occur if all nodes try to send data at once.

What is the Google File System used for?

GFS provides a familiar file system interface, though it does not implement a standard API such as POSIX. Files are organized hierarchically in directories and identified by pathnames. GFS supports the usual operations to create, delete, open, close, read, and write files.

How does the Google File System solve big data processing challenges?

To detect data corruption, GFS uses a technique called checksumming. The system breaks each 64 MB chunk into blocks of 64 kilobytes (KB). Each block within a chunk has its own 32-bit checksum, which acts like a fingerprint. Chunkservers verify these checksums whenever data is read, so corrupted blocks are detected before they reach the client.
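
The scheme described above can be sketched in a few lines: split a chunk into 64 KB blocks, keep a 32-bit checksum per block, and recompute and compare on every read. The article does not say which 32-bit checksum GFS uses, so CRC32 below is an assumption.

    import java.util.zip.CRC32;

    public class ChunkChecksums {
        static final int BLOCK_SIZE = 64 * 1024; // 64 KB blocks

        // Checksum of one 64 KB block within the chunk.
        static int blockChecksum(byte[] chunk, int block) {
            CRC32 crc = new CRC32();
            int off = block * BLOCK_SIZE;
            crc.update(chunk, off, Math.min(BLOCK_SIZE, chunk.length - off));
            return (int) crc.getValue(); // keep the low 32 bits
        }

        // One 32-bit checksum per block, computed when the chunk is written.
        static int[] checksum(byte[] chunk) {
            int blocks = (chunk.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
            int[] sums = new int[blocks];
            for (int i = 0; i < blocks; i++) {
                sums[i] = blockChecksum(chunk, i);
            }
            return sums;
        }

        // On read, recompute the block's checksum and compare it with the
        // stored one; a mismatch signals corruption.
        static boolean verifyBlock(byte[] chunk, int block, int[] stored) {
            return blockChecksum(chunk, block) == stored[block];
        }
    }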

Why does the Google File System use large chunks?

All chunk metadata is stored in the memory of the master server in order to reduce latency and increase throughput. Large chunks mean less metadata, and less metadata means less time for the master to load and serve it.
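
To see the effect, consider an illustrative calculation (the per-chunk figure comes from the GFS paper, which reports under 64 bytes of metadata per 64 MB chunk): one petabyte of data divides into 16,777,216 chunks of 64 MB, whose metadata fits in roughly 1 GB of master memory, whereas 1 MB chunks would multiply both the chunk count and the metadata by 64.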

What are the advantages of GFS?

GFS has a relaxed consistency model that guarantees atomicity, correctness, definedness, fast recovery, and no undetected data corruption, and GFS applications are written to accommodate this model. The system is also designed to minimize the master's involvement in all operations.