Benefits of Isilon and Issues Faced

It has been a while since we started using Isilon in a full-fledged manner and managing it completely. During this time we have found that Isilon is in many ways better than the traditional NAS provided by EMC.

Initially, as always, we were hesitant to accept it because we had to learn something new from scratch, but as we went deeper into it we started liking it. Below are some of the points that show how Isilon is better than the old NAS.

1: Scale-out concept: Isilon is based on the scale-out NAS concept. As and when required, we can keep adding nodes (minimum 3, maximum 144) and grow our storage capacity. There is no dependency on back-end storage such as a CLARiiON, which will eventually fill up and leave us no option other than replacing it with a new unit altogether.

2: Single FS: A single file system is another big benefit of Isilon. One file system can be created across all the nodes of the cluster, irrespective of the host OS types connected, the generation of the Isilon nodes (NL400, NL410, EX100) or the kinds of drives in them (SSD, SATA, FC, etc.). Unlike VNX, where the limit is 16 TB, there is no limit on the maximum size of the shares/exports that can be created on the Isilon.

3: No RAID configured, data striped across nodes, quick recovery from failure: All the data stored on the Isilon is striped across all the nodes that are part of the cluster. This helps in drive or node failure situations, where rebuilding data on the new drive/node takes much less time than on the back-end storage of a VNX or Celerra.

4: High granularity in replication: On VNX, replication can be configured only at the file system level, but on Isilon replication can also be done at the folder and sub-folder level within a file system. For data protection, Isilon has its own schemes known as N+1, N+2 or N+1+1, which are quite similar to RAID 5 and 6. We can also keep up to 8 copies of data within the cluster for redundancy and availability.

5: High speed: EMC has claimed the highest file system NAS throughput of any vendor with Isilon, at 106 GB/sec.

6: Performance: A high level of performance is what best describes Isilon. This comes from its multi-node configuration: each added node increases processing capability through more processors, more ports and more cache memory. On a VNX, this hardware stays constant for a given model. Also, with the active-active configuration of the Isilon, data can be accessed via any node irrespective of where it is saved, so when hosts read or write, Isilon gives better output than VNX.

7: Easy management: Administration and management of the Isilon is very easy, with very little involvement required from an engineer once everything is configured. Even in case of hardware failure, adding hardware to the cluster or removing it, especially nodes, requires no outage at all. Node addition/removal and code upgrades (on nodes or drives) can be done on the go. Even when a node fails, the “SmartFail” feature in Isilon rebuilds its data onto the other online nodes quickly. Users never get to know that anything of this sort has happened in the background. Isilon assures 99.999% availability. With its protection schemes, it can even handle multiple node and drive failures at the same time.


Issues Faced

After deploying Isilon in production we kept running into one issue or another, and it took us approximately 5 to 8 months to stabilize the setup completely. These issues taught us a lot that can help others working on an Isilon setup. Below is the list of issues we faced and their resolutions.

1: High number of snapshots creating space crunch: We had configured daily and monthly snapshots, and replication snapshots were there as well. All these snapshots and their retention resulted in high space utilization, which caused a space crunch. We had to remove some folders with a high rate of data change from replication and snapshots. We also implemented deduplication on the Isilon to bring down space utilization. These steps helped us control the space utilization and bring it down to the desired level.

2: Snapshot count limit of 20000: As we had implemented hourly and daily snapshots in our setup, we reached the limit on the number of snapshots set on the Isilon. When snapshots started failing to create, we found that there is a hard limit of 20000 on the Isilon. This count includes both regular and replication snapshots.
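The arithmetic behind hitting the limit is easy to check before creating schedules. Below is a minimal sketch, assuming each schedule keeps (snapshots per day × retention days × number of paths) snapshots once retention is in full effect. The `Schedule` class, its field names and the sample numbers are hypothetical and are not taken from our cluster; replication snapshots would simply be added as further entries.

```python
from dataclasses import dataclass

SNAPSHOT_LIMIT = 20000  # cluster-wide hard limit mentioned above


@dataclass
class Schedule:
    name: str              # hypothetical label for the schedule
    per_day: float         # snapshots taken per day, per path
    retention_days: float  # how long each snapshot is kept
    paths: int             # number of directories the schedule covers


def steady_state_count(s: Schedule) -> int:
    """Snapshots that exist at the same time once retention is in full effect."""
    return round(s.per_day * s.retention_days * s.paths)


def total_snapshots(schedules) -> int:
    return sum(steady_state_count(s) for s in schedules)


# Illustrative numbers only.
schedules = [
    Schedule("hourly", per_day=24, retention_days=7, paths=100),
    Schedule("daily", per_day=1, retention_days=30, paths=100),
]

total = total_snapshots(schedules)
print(f"{total} of {SNAPSHOT_LIMIT} snapshots")
if total > SNAPSHOT_LIMIT:
    print("over the limit: shorten retention or snapshot fewer paths")
```

With these sample values the hourly schedule alone accounts for 16800 snapshots, which shows how quickly hourly snapshots on many paths eat into the limit.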

3: Snapshots failed to create with status “schedule collide with other schedules”: We found that the pattern section was set to the same value in all schedules, which caused the schedules to collide. A snapshot schedule has three main parts: Name, Pattern and Path. They need to be unique for every snapshot schedule, otherwise the schedule collision error occurs. All of them had to be edited according to the naming convention set for them.
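A quick way to spot such collisions before they cause failures is to look for repeated values in an exported list of schedules. The sketch below takes the uniqueness rule above at face value; the dictionary layout, the schedule names and the pattern strings are hypothetical and do not represent the Isilon's own output format. The `fields` parameter can be narrowed if only some parts need to be unique in your setup.

```python
from collections import defaultdict


def find_collisions(schedules, fields=("name", "pattern", "path")):
    """Return {field: {value: [schedule names]}} for values used more than once."""
    collisions = {}
    for field in fields:
        seen = defaultdict(list)
        for sched in schedules:
            seen[sched[field]].append(sched["name"])
        dupes = {value: names for value, names in seen.items() if len(names) > 1}
        if dupes:
            collisions[field] = dupes
    return collisions


# Hypothetical schedules: both reuse the same pattern string.
schedules = [
    {"name": "home_daily", "pattern": "snap_%Y-%m-%d", "path": "/ifs/data/home"},
    {"name": "proj_daily", "pattern": "snap_%Y-%m-%d", "path": "/ifs/data/projects"},
]

for field, dupes in find_collisions(schedules).items():
    for value, names in dupes.items():
        print(f"{field} '{value}' is shared by: {', '.join(names)}")
```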

4: Capacity reporting on InsightIQ not enabled: Capacity reporting was supported only for OneFS versions above 7.2. We upgraded OneFS on the Isilon to make it compatible with InsightIQ capacity reporting.

5: All reporting halted on InsightIQ: InsightIQ is a very important tool and needs proper planning during implementation. In our setup we configured it on a small LUN, which resulted in the data store filling up and all report generation stopping. We migrated the InsightIQ data store to a bigger LUN to fix this problem.

6: FSA connection for InsightIQ: No FSA connection was configured between InsightIQ and the Isilon clusters, so it had to be re-established to get report generation working for all of them. By default, a cluster's database is stored in a share on the cluster (path: /ifs/.ifsvar/modules/fsa), from which InsightIQ generates reports. For this, FSA (File System Analytics) has to be enabled within IIQ, once per monitored cluster. Enabling FSA is nothing but establishing a connection between this share and the InsightIQ server, in other words mounting this share on the InsightIQ server via NFS. Note that no other host will have access to this share or path.

7: Isilon OneFS and InsightIQ not at compatible versions: The OneFS and InsightIQ code were not at compatible versions, so we upgraded the InsightIQ code. Below are the compatible code versions for OneFS and InsightIQ.

OneFS code version:

InsightIQ code version: v3.2.2.0007

8: Message “AD server missing needed SPN(s)”: An SPN (Service Principal Name) is a concept from Kerberos: it is the name by which a Kerberos client uniquely identifies an instance of a service on a given target computer. If you install multiple instances of a service on computers throughout a forest, each instance must have its own SPN. A given service instance can have multiple SPNs if there are multiple names that clients might use for authentication. For example, an SPN always includes the name of the host computer on which the service instance is running, so a service instance might register an SPN for each name or alias of its host. We recreated the SPN(s) for the missing hosts to fix this issue.

9: Four nodes in the Isilon were over-utilized, putting them at risk of failure: We enabled the auto-balance feature on the Isilon to address the node utilization issue, but it did not work as expected. Later, with the vendor's help, we found that the best option was to run auto-balance through the Multi-Scan job. Multi-Scan is recommended because it combines Collect and Auto-Balance. This balanced all the nodes and brought them to the same level.
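To notice this kind of imbalance early, per-node utilization can be compared against the cluster average. The following is a minimal sketch, assuming the used-capacity percentage of each node has already been collected into a dictionary. The node numbers, the percentages and the 10-point tolerance are made-up values for illustration, not a vendor recommendation.

```python
from statistics import mean


def overused_nodes(used_pct, tolerance=10.0):
    """Return nodes whose used % exceeds the cluster mean by more than `tolerance` points."""
    average = mean(used_pct.values())
    return sorted(node for node, pct in used_pct.items() if pct - average > tolerance)


# Hypothetical utilization per node, in percent.
usage = {1: 91, 2: 93, 3: 90, 4: 92, 5: 62, 6: 60, 7: 61, 8: 63}

print("cluster average:", mean(usage.values()))
print("nodes needing rebalance:", overused_nodes(usage))
```

A check like this only flags the imbalance; the rebalancing itself is still done by the Multi-Scan job on the cluster.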

10: Time sync issue between nodes: We received the message “The current time differs from the Windows Active Directory server by over 5 minutes. Authentication services may be affected.” NTP was not enabled on the Isilon for time syncing. We enabled the NTP service using the DC as the time reference. With time sync set to automatic, this fixed the problem.
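The check behind this message boils down to comparing two clocks against a 5-minute tolerance. Here is a minimal sketch of that comparison, assuming both timestamps have already been obtained as timezone-aware values. How the node and DC times are collected is left out, and the function and variable names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAX_SKEW = timedelta(minutes=5)  # tolerance quoted in the message above


def skew_exceeded(node_time, dc_time, max_skew=MAX_SKEW):
    """True when the two clocks differ by more than `max_skew`, in either direction."""
    return abs(node_time - dc_time) > max_skew


# Made-up timestamps for illustration.
dc_time = datetime(2016, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
fast_node = dc_time + timedelta(minutes=6, seconds=30)
slow_node = dc_time - timedelta(minutes=2)

print(skew_exceeded(fast_node, dc_time))  # 6.5 minutes ahead: over the tolerance
print(skew_exceeded(slow_node, dc_time))  # 2 minutes behind: within the tolerance
```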

11: Replication issue: SyncIQ encountered problems with multiple policies during replication: Replication was not working because SyncIQ could not find the reference snapshot for comparison. The SyncIQ schedules had to be reset so that the first replication snapshot was recreated for reference and comparison.

12: ESRS version not compatible: ESRS configuration for Isilon is only possible when the ESRS version is above 2.28.
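Several of the issues above (OneFS above 7.2 for capacity reporting, ESRS above 2.28) come down to comparing dotted version numbers, which is easy to get wrong when they are compared as text or as decimals. The sketch below compares them component by component. It assumes "above" means strictly newer, which is our reading rather than a vendor statement, and the helper names are hypothetical.

```python
def parse_version(text):
    """Turn '7.2.1.0' or 'v3.2.2.0007' into a tuple of integers."""
    return tuple(int(part) for part in text.lstrip("vV").split("."))


def is_above(installed, minimum):
    """True when `installed` is strictly newer than `minimum`."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad the shorter version with zeros so that 7.2 and 7.2.0.0 compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a > b


print(is_above("7.2.1.0", "7.2"))  # newer maintenance release
print(is_above("2.3", "2.28"))     # 2.3 is older than 2.28, unlike a decimal comparison
```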

13: Schedule adjustments: We found that whenever we modified or adjusted a schedule for snapshots, SyncIQ or deduplication, it did not work as expected afterwards. Our conclusion was that modifications do not take effect on existing schedules. We therefore deleted the schedules and recreated them to get them working as required.

After all these modifications our setup has become quite stable, but we are still on our toes for any new challenge with Isilon. We have gained a great understanding from all the challenges we encountered. I hope others will benefit from our experience.

Note: I would like to thank my friend Kumar Pallav for helping me with this article with his time and views.


