SnapMirror is the core replication technology in NetApp ONTAP storage solutions for protecting your most important data. With the general availability of the latest ONTAP 9.12.1 release, see what new and expanded capabilities SnapMirror provides from on-premises solutions.
NFSv4 offers a lot of improvements over NFSv3, but it also works differently. If you're a current NFSv3 user and considering a switch, there are a few things you need to know, especially if you're using Oracle databases.
See TR-4570 NetApp Storage Solutions for Apache Spark: Architecture, Use Cases, and Performance Results for customer challenges, major AI/ML/DL use cases and architectures, the hardware and software environment and versions, and other details.
For this validation, we used four worker nodes and one master node with an AFF-A800 HA pair. All cluster members are connected through 10GbE network switches.
For the other NetApp Spark solution validations, we used three different storage controllers: the E5760, the E5724, and the AFF-A800. The E-Series storage controllers are connected to five data nodes with 12Gbps SAS connections. The AFF HA-pair storage controller provides exported NFS volumes through 10GbE connections to Hadoop worker nodes. The Hadoop cluster members are connected through 10GbE connections in the E-Series, AFF, and StorageGRID Hadoop solutions.
We used the TeraSort and TeraValidate scripts in the TeraGen benchmarking tool to validate Spark performance with the E5760, E5724, and AFF-A800 configurations. In addition, three major use cases were tested: Spark NLP pipelines and TensorFlow distributed training, Horovod distributed training, and multi-worker deep learning using Keras for CTR prediction with DeepFM.
For both E-Series and StorageGRID validation, we used Hadoop replication factor 2. For AFF validation, we only used one source of data.
Table 1 presents the hardware configuration for the Spark performance validation, and Table 2 lists the software requirements.
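Table 1) Hardware configuration for the Spark performance validation.

| Type | Hadoop Worker Nodes | Drive Type | Drives per Node | Storage Controller |
|---|---|---|---|---|
| SG5712 | 4 | SAS | 12 | Single high-availability (HA) pair |
| E5760 | 4 | SAS | 60 | Single HA pair |
| E5724 | 4 | SAS | 24 | Single HA pair |
| AFF800 | 4 | SSD | 6 | Single HA pair |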
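Table 2) Software requirements.

| Software | Version |
|---|---|
| RHEL | 7.9 |
| OpenJDK Runtime Environment | 1.8.0 |
| OpenJDK 64-Bit Server VM | 25.302 |
| Git | 2.24.1 |
| GCC/G++ | 11.2.1 |
| Spark | 3.2.1 |
| PySpark | 3.1.2 |
| SparkNLP | 3.4.2 |
| TensorFlow | 2.9.0 |
| Keras | 2.9.0 |
| Horovod | 0.24.3 |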
We have published TR-4910: Sentiment Analysis from Customer Communications with NetApp AI, in which an end-to-end conversational AI pipeline was built using the NetApp DataOps Toolkit, AFF storage, and NVIDIA DGX System. The pipeline performs batch audio signal processing, automatic speech recognition (ASR), transfer learning, and sentiment analysis leveraging the DataOps Toolkit, NVIDIA Riva SDK, and Tao framework. Expanding the sentiment analysis use case to the financial services industry, we built a SparkNLP workflow, loaded three BERT models for various NLP tasks such as named entity recognition, and obtained sentence-level sentiment for NASDAQ Top 10 companies’ quarterly earnings calls.
The following script, sentiment_analysis_spark.py, uses the FinBERT model to process transcripts in HDFS and produce positive, neutral, and negative sentiment counts, as shown in the table below:
-bash-4.2$ time ~/anaconda3/bin/spark-submit \
    --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.3 \
    --master yarn --executor-memory 5g --executor-cores 1 --num-executors 160 \
    --conf spark.driver.extraJavaOptions="-Xss10m -XX:MaxPermSize=1024M" \
    --conf spark.executor.extraJavaOptions="-Xss10m -XX:MaxPermSize=512M" \
    /sparkusecase/tr-4570-nlp/sentiment_analysis_spark.py hdfs:///data1/Transcripts/ \
    > ./sentiment_analysis_hdfs.log 2>&1

real 13m14.300s
user 557m11.319s
sys 4m47.676s
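To get a feel for the size of the job, the executor flags in the command above can be multiplied out. The following is a minimal sketch, not part of the tested scripts: the helper `parse_flags` is a hypothetical name introduced here, and the totals cover only the requested executor heap and cores. They exclude the driver and any YARN memory overhead.

```python
import shlex

# Executor-related flags copied from the spark-submit command above.
command = (
    "spark-submit --master yarn --executor-memory 5g "
    "--executor-cores 1 --num-executors 160"
)


def parse_flags(cmd):
    """Hypothetical helper: map simple '--flag value' pairs to a dict."""
    tokens = shlex.split(cmd)
    flags = {}
    for i, tok in enumerate(tokens[:-1]):
        if tok.startswith("--") and not tokens[i + 1].startswith("--"):
            flags[tok] = tokens[i + 1]
    return flags


flags = parse_flags(command)
executors = int(flags["--num-executors"])
cores_per_executor = int(flags["--executor-cores"])
heap_gb_per_executor = int(flags["--executor-memory"].rstrip("g"))

# Aggregate resources requested for executors only.
total_cores = executors * cores_per_executor
total_heap_gb = executors * heap_gb_per_executor

print(f"executor cores requested: {total_cores}")
print(f"executor heap requested:  {total_heap_gb} GB")
```

With 160 single-core executors at 5g each, the request comes to 160 cores and 800 GB of executor heap across the cluster.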
Table 3) Earnings call sentence-level sentiment analysis for NASDAQ Top 10 companies from 2016 to 2020.

| Sentiment counts | All 10 Companies | AAPL | AMD | AMZN | CSCO | GOOGL | INTC | MSFT | NVDA |
|---|---|---|---|---|---|---|---|---|---|
| Positive counts | 7447 | 1567 | 743 | 290 | 682 | 826 | 824 | 904 | 417 |
| Neutral counts | 64067 | 6856 | 7596 | 5086 | 6650 | 5914 | 6099 | 5715 | 6189 |
| Negative counts | 1787 | 253 | 213 | 84 | 189 | 97 | 282 | 202 | 89 |
| Uncategorized counts | 196 | 0 | 0 | 76 | 0 | 0 | 0 | 1 | 0 |
| (total counts) | 73497 | 8676 | 8552 | 5536 | 7521 | 6837 | 7205 | 6822 | 6695 |
In terms of percentages, most sentences spoken by the CEOs and CFOs are factual and therefore carry neutral sentiment. During an earnings call, analysts ask questions that might convey positive or negative sentiment. It is worth investigating further, in quantitative terms, how negative or positive sentiment affects stock prices on the same or the next trading day.
Table 4) Sentence-level sentiment analysis for NASDAQ Top 10 companies, expressed in percentage.

| Sentiment percentage | All 10 Companies | AAPL | AMD | AMZN | CSCO | GOOGL | INTC | MSFT | NVDA |
|---|---|---|---|---|---|---|---|---|---|
| Positive | 10.13% | 18.06% | 8.69% | 5.24% | 9.07% | 12.08% | 11.44% | 13.25% | 6.23% |
| Neutral | 87.17% | 79.02% | 88.82% | 91.87% | 88.42% | 86.50% | 84.65% | 83.77% | 92.44% |
| Negative | 2.43% | 2.92% | 2.49% | 1.52% | 2.51% | 1.42% | 3.91% | 2.96% | 1.33% |
| Uncategorized | 0.27% | 0% | 0% | 1.37% | 0% | 0% | 0% | 0.01% | 0% |
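The percentages follow directly from the sentence counts: each sentiment count is divided by the total for that column. The sketch below reproduces three of the columns from the counts reported above. The function name `to_percentages` is an illustrative choice, and this is not the code used in sentiment_analysis_spark.py.

```python
# Sentence counts copied from the counts table above (three columns only).
counts = {
    "All 10 Companies": {
        "Positive": 7447, "Neutral": 64067, "Negative": 1787, "Uncategorized": 196,
    },
    "AAPL": {
        "Positive": 1567, "Neutral": 6856, "Negative": 253, "Uncategorized": 0,
    },
    "AMZN": {
        "Positive": 290, "Neutral": 5086, "Negative": 84, "Uncategorized": 76,
    },
}


def to_percentages(row):
    """Convert one column of sentiment counts into percentages of its total."""
    total = sum(row.values())
    return {label: round(100 * n / total, 2) for label, n in row.items()}


percentages = {name: to_percentages(row) for name, row in counts.items()}

for name, row in percentages.items():
    cells = ", ".join(f"{label} {value:.2f}%" for label, value in row.items())
    print(f"{name}: {cells}")
```

For the "All 10 Companies" column this gives 10.13% positive, 87.17% neutral, 2.43% negative, and 0.27% uncategorized, matching the reported values.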
In terms of workflow runtime, we see a significant 4.78x improvement when moving from local mode to a distributed environment with HDFS, and a further 0.14% improvement by leveraging NFS:
-bash-4.2$ time ~/anaconda3/bin/spark-submit \
    --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.3 \
    --master yarn --executor-memory 5g --executor-cores 1 --num-executors 160 \
    --conf spark.driver.extraJavaOptions="-Xss10m -XX:MaxPermSize=1024M" \
    --conf spark.executor.extraJavaOptions="-Xss10m -XX:MaxPermSize=512M" \
    /sparkusecase/tr-4570-nlp/sentiment_analysis_spark.py file:///sparkdemo/sparknlp/Transcripts/ \
    > ./sentiment_analysis_nfs.log 2>&1

real 13m13.149s
user 537m50.148s
sys 4m46.173s
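The NFS improvement can be checked against the `real` wall-clock values reported by the two `time` runs above. The sketch below is one way to do that. The helper `real_seconds` is a hypothetical name, and the strings are the timing lines copied from the HDFS and NFS runs.

```python
import re


def real_seconds(time_output):
    """Extract the `real` wall-clock time from bash `time` output, in seconds."""
    match = re.search(r"real\s+(\d+)m([\d.]+)s", time_output)
    if match is None:
        raise ValueError("no 'real' entry found in time output")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Timing lines copied from the two runs above.
hdfs = real_seconds("real 13m14.300s user 557m11.319s sys 4m47.676s")
nfs = real_seconds("real 13m13.149s user 537m50.148s sys 4m46.173s")

# Relative reduction in wall-clock time when reading transcripts from NFS.
improvement_pct = 100 * (hdfs - nfs) / hdfs

print(f"HDFS: {hdfs:.3f} s, NFS: {nfs:.3f} s")
print(f"improvement: {improvement_pct:.2f}%")
```

The difference is a little over one second out of roughly 794 seconds, which corresponds to the 0.14% figure.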
As the figure below shows, data and model parallelism improve the speed of data processing and distributed TensorFlow model inferencing. Placing the data in NFS yields only a slightly better runtime because the bottleneck of the workflow is downloading the pretrained models. If we increase the size of the transcripts dataset, we expect the advantage of NFS to become more obvious.
Figure) Earnings call sentiment analysis runtime comparison
For other use cases and the complete Python scripts tested, please refer to TR-4570.
See Hybrid Cloud Solution with Apache Spark and NetApp AI for hybrid cloud applications in distributed Spark clusters with the NetApp® storage portfolio.
If you have questions, feel free to contact @rickhuang or AI Solutions Team (ng-ai-inquiry@netapp.com). 😊