15 July 2013

SONAS vs. Isilon Part 7: Access Protocols and HDFS Support

One of the most important aspects of choosing a scale-out NAS system is the list of supported protocols. In this post I will focus on the access protocols and add some notes on the Hadoop Distributed Filesystem (HDFS) support in Isilon OneFS. The supported authentication protocols were covered in this other blog post, and the management protocols may be covered in a future post.

At the time of writing, the current Isilon operating system version is OneFS 7.1 and IBM's SONAS version is 1.4.2.


IPv6 Support

Although it is not directly related to the access protocols, it is worth mentioning that SONAS does not appear to support IPv6: it is not mentioned in the IBM SONAS documentation, nor is it listed on the IBM IPv6 compliance product list. Considering that many enterprise customers are planning or starting their IPv6 rollout, this is another indication that IBM is not seriously innovating with SONAS and that it is not a strategic asset in their storage portfolio.
Isilon, by contrast, has supported IPv6 for many years; I was able to find IPv6 support going back to version 6.5, released in 2011. The current version 7.1 builds upon this IPv6 maturity.
The following table lists the supported file access protocols, as well as RESTful access options and HDFS support for Hadoop and compatible access methods.


Protocol        Isilon     SONAS
SMB 1           Yes        Yes
SMB 2           Yes        Partly 2)
SMB 2.1         Yes        No
SMB 3           planned    No
NFSv3           Yes        Yes 3)
NFSv4           Yes        No
HTTP            Yes        Yes
FTP             Yes        Yes
RESTful API     Yes        No
HDFS 1.0        Yes        No
HDFS 2.0        Yes        No
HDFS 2.1        planned    No

Table 1: Supported Access Protocols in SONAS 1.4.2 and OneFS 7.1


SONAS SMB Limitations

Regarding SMB support, IBM lists several limitations and considerations in this official IBM document:
  • Alternate Data Streams (ADS) are not supported. ADS were first introduced in Windows NT for compatibility with Apple's Hierarchical File System (HFS), where data is stored in two parts, the data fork and the resource fork. But Alternate Data Streams are used for other purposes as well: for example, metadata for Office documents (which you can access and modify via the File Properties menu) can be stored in ADS. Since SONAS does not support ADS, this information is not accessible to clients and cannot be used for indexing (a short sketch after this list shows what ADS access looks like from a client).
  • SMB 2.1 is not supported at all.
  • Level 2 oplocks are not supported, which means client requests for Level 2 oplocks are not granted. This limits the clients' ability to cache data locally and increases network traffic.
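For illustration, here is a minimal sketch of what Alternate Data Stream access looks like from a Windows client in Python. The UNC path and stream name are hypothetical; against an NTFS volume or an ADS-capable SMB server this works, while against a SONAS share the stream open would fail because ADS is not supported.

```python
# Minimal sketch of Alternate Data Stream (ADS) access from a Windows client.
# The UNC path and stream name are hypothetical. NTFS and ADS-capable SMB
# servers expose a stream as "<filename>:<streamname>"; on a server without
# ADS support (such as SONAS, per the IBM documentation) this open would fail.

main_path = r"\\fileserver\share\report.docx"   # hypothetical UNC path
ads_path = main_path + ":Summary"               # hypothetical alternate stream

# Write metadata into the alternate stream; the main data fork is untouched.
with open(ads_path, "w", encoding="utf-8") as stream:
    stream.write("Author: Jane Doe\nTitle: Quarterly Report\n")

# Read it back.
with open(ads_path, "r", encoding="utf-8") as stream:
    print(stream.read())
```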
Besides the fact that SONAS does not currently support SMB 2.1, IBM also lists significant technical limitations in the documentation. Some of these are precautionary considerations, but others are real product limitations.

SONAS NFS Limitations

SONAS limitations for NFS are documented in this IBM article. Some of the comments are just considerations (e.g. one should not mount the same data via different paths or exports on the same client; the potential data corruption is a result of how NFSv3 handles data, not a SONAS issue). However, the following limits are relevant and do not exist in Isilon OneFS:
  • NFS version 4.0 is not supported
  • Clients should mount IBM SONAS NFS exports using IP addressing only. Do not mount an IBM SONAS NFS export using a DNS round-robin (RR) entry name. If you mount an IBM SONAS NFS export using a host name, ensure that the name is unique and remains unique. This restriction prevents data corruption and data unavailability, as the lock manager on the IBM SONAS system is not "clustered-system-aware". In other words, they call it a scale-out cluster, yet its lock manager is not cluster-aware. Well done…
  • Files created on an NFSv3 mount on a Linux client are visible only to CIFS clients connected to the same server node. CIFS clients connected to different server nodes cannot view these files.

The last two points mean that you cannot properly use a load balancer or DNS round robin to distribute the SMB and NFS mounts evenly across the interface nodes. This static mapping is very inflexible and adds administration overhead! Isilon shines here with SmartConnect. SmartConnect is an intelligent DNS server that responds to client queries with an IP address from a relevant pool, and it can balance client connections based upon CPU load, interface throughput or connection count, or simply provide round-robin load balancing. So there is no need to take care of SMB or NFS clients individually; OneFS is fully cluster-aware.
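As a small illustration of the idea (not an Isilon API, just plain DNS): the sketch below resolves a hypothetical SmartConnect zone name a number of times. Depending on the configured policy, and ignoring any caching done by the local resolver, different lookups return different interface-node IP addresses from the pool.

```python
import socket
from collections import Counter

# Hypothetical SmartConnect zone name delegated to the Isilon cluster.
ZONE = "nas.example.com"

answers = Counter()
for _ in range(20):
    # Each query reaches the SmartConnect DNS service, which answers with an
    # IP address from the pool according to the configured balancing policy
    # (round robin, CPU load, connection count, interface throughput).
    answers[socket.gethostbyname(ZONE)] += 1

for ip, count in sorted(answers.items()):
    print(f"{ip}: returned {count} times")
```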

RESTful access to namespace

In today's world, mobile devices are used on a daily basis to access data. Typically these devices access data via HTTP rather than NFS or CIFS, and the same is true for many applications. Therefore, a REST API called RESTful Access to Namespace (RAN) has been introduced in Isilon [1]. RAN enables applications to create, delete and modify data through the API via HTTP/1.1 requests. Over time you may also see other RESTful API functionality integrated into OneFS. If you would like to use existing REST APIs like Swift or S3, you can already do so by using ViPR.
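To give an idea of what RAN usage looks like, here is a hedged Python sketch using the requests library against the OneFS namespace endpoint. Cluster name, credentials and file path are hypothetical, and the exact headers and port should be verified against the OneFS File System Access API documentation.

```python
import requests

# Hypothetical cluster address and credentials; the namespace API of OneFS
# is served from the same port as the platform API (8080 by default).
BASE = "https://isilon.example.com:8080/namespace"
AUTH = ("apiuser", "secret")

# Create a file under /ifs via HTTP PUT. The x-isi-ifs-target-type header
# tells RAN whether the target is a file ("object") or a directory
# ("container"); check the API guide for the headers your release expects.
resp = requests.put(
    f"{BASE}/ifs/data/demo/hello.txt",
    data=b"hello from RAN\n",
    auth=AUTH,
    headers={"x-isi-ifs-target-type": "object"},
    verify=False,  # lab clusters often use self-signed certificates
)
resp.raise_for_status()

# Read the same file back; NFS/SMB clients see it at /ifs/data/demo/hello.txt.
resp = requests.get(f"{BASE}/ifs/data/demo/hello.txt", auth=AUTH, verify=False)
print(resp.text)
```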

Hadoop Distributed Filesystem (HDFS) Support

Data analytics is a very hot topic these days as companies massively start to explore the value in their data. Classic Business Intelligence (BI) workloads have done this for decades, but were traditionally focused on structured data stored in large, monolithic databases.

However, since the introduction of MapReduce algorithms that work on unstructured data (i.e. files/objects) residing in a Hadoop Distributed File System (HDFS), a whole new set of analytics opportunities has appeared. In the meantime, a number of Hadoop distributions have established themselves in the market, such as Apache, Cloudera, Pivotal HD, Hortonworks and others. They all have in common that they can analyse massive amounts of data residing in an HDFS file system. All Hadoop cluster nodes typically have a compute component and a storage component (providing the HDFS layer). The storage is typically implemented with internal disks attached to the compute nodes. Most Hadoop projects start small, so this is the most cost-effective solution (in terms of CAPEX, but not operational cost). Figure 1 illustrates the components (compute, storage, I/O path) of a traditional Hadoop cluster.



Figure 1: Traditional Hadoop cluster where all nodes contain the compute and storage components (typically DAS). Data must be copied into the HDFS cluster, and results copied out of it, to access them with POSIX clients like NFS/SMB/FTP.

HDFS is optimized for this purpose but it has some drawbacks:

  • Data protection: by default, each block is stored at least three times
  • Many distributions have a single point of failure in the lone primary NameNode
  • Missing enterprise storage features like remote replication, snapshots and backup APIs
  • HDFS is not POSIX compliant, so existing applications cannot access the data without special ‘gateway’ tools
  • Existing data that resides in a traditional filesystem must be imported into the HDFS namespace. This is a challenge in terms of time, bandwidth, computation, and the capacity needed to hold the data in both places. Imagine you need to copy 200 TB over a 10 gigabit link into an HDFS namespace: even if the link is dedicated to the task, this copy process would take more than 48 hours (a quick back-of-the-envelope estimate follows this list).
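A quick back-of-the-envelope calculation for that last point, assuming a fully dedicated 10 Gbit/s link running at 100% utilisation (which a real copy job will not reach):

```python
# How long does it take to push 200 TB through a dedicated 10 Gbit/s link?
data_bytes = 200 * 2**40           # 200 TiB; with decimal TB (200e12 bytes) it is ~44 h
link_bits_per_second = 10 * 10**9  # 10 Gbit/s at 100% utilisation

hours = data_bytes * 8 / link_bits_per_second / 3600
print(f"{hours:.1f} hours")        # ~48.9 hours at line rate; real copies take longer
```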

Isilon has HDFS integrated as a protocol

EMC has engineered HDFS as a built-in protocol in Isilon OneFS. That means Isilon understands and talks HDFS with the compute nodes, but stores the data internally in OneFS with POSIX semantics. The Hadoop cluster is thereby split into two parts: compute and storage. The following figure illustrates this.


Figure 2: Hadoop cluster with Isilon as the HDFS storage backend. There is no requirement to copy data into the HDFS cluster; data can be accessed from HDFS compute nodes as well as by traditional SMB/NFS/FTP clients.
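In practice, pointing the compute farm at Isilon is largely a client-side configuration change: fs.defaultFS (fs.default.name on Hadoop 1.x) in core-site.xml is set to the cluster's SmartConnect zone instead of a local NameNode. The sketch below generates such a file; the zone name and port are assumptions, so check the Isilon Hadoop documentation for your distribution.

```python
import xml.etree.ElementTree as ET

# Hypothetical SmartConnect zone of the HDFS access pool; Isilon's HDFS
# service answers the NameNode/DataNode RPCs behind this name (port 8020
# is the usual HDFS default; verify against your cluster's settings).
ISILON_HDFS_URI = "hdfs://hdfs.nas.example.com:8020"

config = ET.Element("configuration")
prop = ET.SubElement(config, "property")
ET.SubElement(prop, "name").text = "fs.defaultFS"   # fs.default.name on Hadoop 1.x
ET.SubElement(prop, "value").text = ISILON_HDFS_URI

ET.ElementTree(config).write("core-site.xml", xml_declaration=True, encoding="utf-8")
print(open("core-site.xml").read())
```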

This has some significant advantages:

  • You do not need to dedicate silos of compute and storage to Hadoop analytics. You can analyse the data directly where it already lives, it can be accessed through other protocols by other application sets, and you can snapshot and replicate that data elsewhere. This shared platform avoids the time-consuming process of copying data in and out of a siloed HDFS filesystem.
  • Isilon stores the data much more efficiently than a native HDFS (see my other blog post on data protection). Roughly, you get about 80% usable-to-raw disk efficiency rather than about 30% with the native HDFS approach. This efficiency cannot be matched by IBM, NetApp, or any other major storage vendor.
  • You can utilize your compute farm flexibly with different Hadoop distributions and HDFS versions on the same data sets. This is very nice for migrations and (as far as I know) not possible with the native implementations.
  • You can use the enterprise storage features that you get with Isilon, such as file system snapshots, replication, enhanced Kerberized security, access zones, and more.
  • You can scale your compute farm independently from storage. If you need more compute capability, add more multicore physical or virtual servers; if you need more storage for your data sets, add Isilon storage nodes. All HDFS data on Isilon can be accessed with traditional existing tools that follow POSIX semantics (see the short sketch after this list).
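As a closing illustration of that last point, a small sketch that lists the same OneFS directory twice: once through a Hadoop command-line client configured against the Isilon cluster, and once through an ordinary NFS mount. All paths are hypothetical, and the HDFS-side path depends on the HDFS root directory configured on the cluster.

```python
import os
import subprocess

# Hypothetical locations of the same OneFS directory, seen two ways.
HDFS_PATH = "/data/analytics"              # path as seen by the Hadoop client
NFS_MOUNT = "/mnt/isilon/data/analytics"   # same directory over an NFS mount

# View 1: through the Hadoop command-line client (must be configured to use
# the Isilon cluster as its default filesystem, see the config sketch above).
subprocess.run(["hadoop", "fs", "-ls", HDFS_PATH], check=True)

# View 2: through plain POSIX calls over NFS -- same files, no copy step.
for name in sorted(os.listdir(NFS_MOUNT)):
    print(name)
```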

The performance figures I have seen so far are quite similar for Isilon versus the direct-attached, non-virtualized compute+storage model. Some workloads are slower, but many are faster. However, this only covers the compute time to result. As mentioned before, you save the majority of the time by not being forced to copy data between an HDFS and a POSIX filesystem.

Conclusion

If you consider using SONAS, you need to carefully check your use cases and environment, since many basic functions and protocols are not supported, such as IPv6, SMB 2.1 and NFSv4. Even for SMB 2 and NFSv3 there are several restrictions that can cause problems if you want to implement it in a heterogeneous environment. Isilon has a much greater set of supported protocols, and all of them are supported in a cluster-aware manner. Furthermore, Isilon directly supports the Hadoop Distributed Filesystem with key improvements over classic architectures.

Acknowledgements

Thanks to my colleague Ryan Sayre for reviewing this text and for some great ideas for improvements.
