NFSv4 offers a lot of improvements over NFSv3, but it also works differently. If you're a current NFSv3 user and are considering a switch, there are a few things you need to know, ESPECIALLY if you're using Oracle databases.
See TR-4570: NetApp Storage Solutions for Apache Spark: Architecture, Use Cases, and Performance Results for customer challenges, major AI/ML/DL use cases and architectures, hardware and software environments and versions, and other details.
For this validation, we used four worker nodes and one master node with an AFF-A800 HA pair. All cluster members are connected through 10GbE network switches.
For other NetApp Spark solution validations, we used three different storage controllers: the E5760, the E5724, and the AFF-A800. The E-Series storage controllers are connected to five data nodes with 12Gbps SAS connections. The AFF HA-pair storage controller provides exported NFS volumes through 10GbE connections to Hadoop worker nodes. The Hadoop cluster members are connected through 10GbE connections in the E-Series, AFF, and StorageGRID Hadoop solutions.
We used the TeraSort and TeraValidate scripts in the TeraGen benchmarking tool to measure Spark performance with the E5760, E5724, and AFF-A800 configurations. In addition, three major use cases were tested: Spark NLP pipelines and TensorFlow distributed training, Horovod distributed training, and multi-worker deep learning using Keras for CTR Prediction with DeepFM.
For both E-Series and StorageGRID validation, we used Hadoop replication factor 2. For AFF validation, we only used one source of data.
Table 1 presents the hardware configuration for the Spark performance validation, while Table 2 shows the software requirements.

| Type | Hadoop Worker Nodes | Drive Type | Drives per Node | Storage Controller |
|---|---|---|---|---|
| SG5712 | 4 | SAS | 12 | Single high-availability (HA) pair |
| E5760 | 4 | SAS | 60 | Single HA pair |
| E5724 | 4 | SAS | 24 | Single HA pair |
| AFF800 | 4 | SSD | 6 | Single HA pair |

| Software | Version |
|---|---|
| RHEL | 7.9 |
| OpenJDK Runtime Environment | 1.8.0 |
| OpenJDK 64-Bit Server VM | 25.302 |
| Git | 2.24.1 |
| GCC/G++ | 11.2.1 |
| Spark | 3.2.1 |
| PySpark | 3.1.2 |
| SparkNLP | 3.4.2 |
| TensorFlow | 2.9.0 |
| Keras | 2.9.0 |
| Horovod | 0.24.3 |

We have published TR-4910: Sentiment Analysis from Customer Communications with NetApp AI, in which an end-to-end conversational AI pipeline was built using the NetApp DataOps Toolkit, AFF storage, and NVIDIA DGX System. The pipeline performs batch audio signal processing, automatic speech recognition (ASR), transfer learning, and sentiment analysis, leveraging the DataOps Toolkit, NVIDIA Riva SDK, and Tao framework. Expanding the sentiment analysis use case to the financial services industry, we built a SparkNLP workflow, loaded three BERT models for various NLP tasks such as named entity recognition, and obtained sentence-level sentiment for NASDAQ Top 10 companies’ quarterly earnings calls.
The following script, sentiment_analysis_spark.py, uses the FinBERT model to process transcripts in HDFS and produce positive, neutral, and negative sentiment counts, as shown in the table below:
-bash-4.2$ time ~/anaconda3/bin/spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.3 --master yarn --executor-memory 5g --executor-cores 1 --num-executors 160 --conf spark.driver.extraJavaOptions="-Xss10m -XX:MaxPermSize=1024M" --conf spark.executor.extraJavaOptions="-Xss10m -XX:MaxPermSize=512M" /sparkusecase/tr-4570-nlp/sentiment_analysis_spark.py hdfs:///data1/Transcripts/ > ./sentiment_analysis_hdfs.log 2>&1
real 13m14.300s
user 557m11.319s
sys 4m47.676s
Table 3) Earnings call sentence-level sentiment analysis for NASDAQ Top 10 companies from 2016 to 2020.

| Sentiment counts | All 10 Companies | AAPL | AMD | AMZN | CSCO | GOOGL | INTC | MSFT | NVDA |
|---|---|---|---|---|---|---|---|---|---|
| Positive counts | 7447 | 1567 | 743 | 290 | 682 | 826 | 824 | 904 | 417 |
| Neutral counts | 64067 | 6856 | 7596 | 5086 | 6650 | 5914 | 6099 | 5715 | 6189 |
| Negative counts | 1787 | 253 | 213 | 84 | 189 | 97 | 282 | 202 | 89 |
| Uncategorized counts | 196 | 0 | 0 | 76 | 0 | 0 | 0 | 1 | 0 |
| (total counts) | 73497 | 8676 | 8552 | 5536 | 7521 | 6837 | 7205 | 6822 | 6695 |

In terms of percentages, most sentences spoken by the CEOs and CFOs are factual and therefore carry neutral sentiment. During an earnings call, analysts ask questions that might convey positive or negative sentiment. It is worth investigating further, and quantitatively, how negative or positive sentiment affects stock prices on the same or the next day of trading.
Table 4) Sentence-level sentiment analysis for NASDAQ Top 10 companies, expressed as percentages.

| Sentiment percentage | All 10 Companies | AAPL | AMD | AMZN | CSCO | GOOGL | INTC | MSFT | NVDA |
|---|---|---|---|---|---|---|---|---|---|
| Positive | 10.13% | 18.06% | 8.69% | 5.24% | 9.07% | 12.08% | 11.44% | 13.25% | 6.23% |
| Neutral | 87.17% | 79.02% | 88.82% | 91.87% | 88.42% | 86.50% | 84.65% | 83.77% | 92.44% |
| Negative | 2.43% | 2.92% | 2.49% | 1.52% | 2.51% | 1.42% | 3.91% | 2.96% | 1.33% |
| Uncategorized | 0.27% | 0% | 0% | 1.37% | 0% | 0% | 0% | 0.01% | 0% |

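As a quick cross-check, the percentages can be recomputed from the sentence counts reported above. The following is a minimal sketch, not part of the tested scripts in TR-4570. The dictionary layout and the helper name `to_percentages` are illustrative assumptions; only the counts themselves come from the counts table, and only three of the columns are included for brevity.

```python
# Sentence counts copied from the counts table above
# (three columns only: all companies, AAPL, and NVDA).
counts = {
    "All": {"Positive": 7447, "Neutral": 64067, "Negative": 1787, "Uncategorized": 196},
    "AAPL": {"Positive": 1567, "Neutral": 6856, "Negative": 253, "Uncategorized": 0},
    "NVDA": {"Positive": 417, "Neutral": 6189, "Negative": 89, "Uncategorized": 0},
}


def to_percentages(label_counts):
    """Convert per-label counts to percentages of the column total (hypothetical helper)."""
    total = sum(label_counts.values())
    return {label: round(100.0 * n / total, 2) for label, n in label_counts.items()}


percentages = {name: to_percentages(c) for name, c in counts.items()}

for name, pct in percentages.items():
    print(name, pct)
```

Each column total (for example, 8676 sentences for AAPL) is the denominator, so the four percentages in a column add up to roughly 100% after rounding.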
In terms of workflow runtime, we see a significant 4.78x improvement when moving from local mode to a distributed environment with HDFS, and a further 0.14% improvement by leveraging NFS:
-bash-4.2$ time ~/anaconda3/bin/spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.3 --master yarn --executor-memory 5g --executor-cores 1 --num-executors 160 --conf spark.driver.extraJavaOptions="-Xss10m -XX:MaxPermSize=1024M" --conf spark.executor.extraJavaOptions="-Xss10m -XX:MaxPermSize=512M" /sparkusecase/tr-4570-nlp/sentiment_analysis_spark.py file:///sparkdemo/sparknlp/Transcripts/ > ./sentiment_analysis_nfs.log 2>&1
real 13m13.149s
user 537m50.148s
sys 4m46.173s
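The 0.14% figure can be reproduced from the two wall-clock ("real") timings printed by `time` in the HDFS and NFS runs. This is a minimal sketch under the assumption that the timing output has the `real XmY.ZZZs` shape shown above. The helper name `parse_real_time` is hypothetical and is not taken from the TR-4570 scripts.

```python
import re


def parse_real_time(text):
    """Return the 'real' wall-clock time in seconds from bash `time` output."""
    match = re.search(r"real\s+(\d+)m([\d.]+)s", text)
    if match is None:
        raise ValueError("no 'real' timing found")
    return int(match.group(1)) * 60 + float(match.group(2))


# Timing lines copied from the HDFS and NFS runs in this post.
hdfs_seconds = parse_real_time("real 13m14.300s user 557m11.319s sys 4m47.676s")
nfs_seconds = parse_real_time("real 13m13.149s user 537m50.148s sys 4m46.173s")

# Relative runtime reduction of the NFS run compared with the HDFS run.
improvement_pct = 100.0 * (hdfs_seconds - nfs_seconds) / hdfs_seconds

print(f"HDFS: {hdfs_seconds:.3f} s, NFS: {nfs_seconds:.3f} s, "
      f"improvement: {improvement_pct:.2f}%")
```

The difference is only about one second on a run of roughly 13 minutes, which is consistent with the observation that model download, rather than data access, dominates this workflow.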
As the figure below shows, data and model parallelism improve the data processing and distributed TensorFlow model inferencing speed. Storing the data on NFS yields only a slightly better runtime because the bottleneck of the workflow is downloading the pretrained models. If we increase the size of the transcripts dataset, the advantage of NFS will be more obvious.
Figure) Earnings call sentiment analysis runtime comparison
For other use cases and the complete Python scripts tested, please refer to TR-4570.
See Hybrid Cloud Solution with Apache Spark and NetApp AI for hybrid cloud applications in distributed Spark clusters with the NetApp® storage portfolio.
If you have questions, feel free to contact @rickhuang or the AI Solutions Team (ng-ai-inquiry@netapp.com). 😊
NetApp® StorageGRID® S3 Object Lock functionality has now been validated for Veritas NetBackup 10.1.1 and later versions.
Object storage protection
With S3 Object Lock protection, you can create write once, read many (WORM) buckets that allow object uploads but restrict overwrites, edits, and deletes. To address your organization’s data security and regulatory concerns, S3 Object Lock ensures that objects are immutably retained for a set period or indefinitely, according to policies that you set.
Complete control and massive scalability with StorageGRID
NetApp StorageGRID is an enterprise-grade, on-premises object storage solution that supports the native Amazon Simple Storage Service (Amazon S3) API. StorageGRID is massively scalable, supporting low-touch, nondisruptive expansions, and can store billions of objects. In a single namespace, StorageGRID can scale up to 16 data centers worldwide. StorageGRID information lifecycle management (ILM) policies give you complete and granular control over how long your data is retained, where your data sits, when it is tiered to lower-cost object storage, and more.
NetBackup + S3 Object Lock for extra secure backup
The NetBackup enterprise backup and recovery solution automatically backs up data from VMware, Oracle, Microsoft Hyper-V, Microsoft SQL Server, and others. NetBackup can store backup data and snapshots on StorageGRID. Now, as a StorageGRID and NetBackup customer, you can rely on S3 Object Lock for an extra layer of security.
Other S3 Object Lock validations
To learn about some of the other StorageGRID S3 Object Lock validations, check out this blog post.
To help us continue to provide superior technology that meets worldwide standards, NetApp® StorageGRID® 11.6 has renewed certification with French standard NF Logiciel (NF-203) and with the international standard ISO/IEC 25051:2014.