See TR-4570 NetApp Storage Solutions for Apache Spark: Architecture, Use Cases, and Performance Results for Customer Challenges, Major AI/ML/DL Use Cases and Architectures, Hardware/software environment & versions, and other details.
For this validation, we used four worker nodes and one master node with an AFF-A800 HA pair. All cluster members are connected through 10GbE network switches.
For the other NetApp Spark solution validations, we used three different storage controllers: the E5760, the E5724, and the AFF-A800. The E-Series storage controllers are connected to five data nodes with 12Gbps SAS connections. The AFF HA-pair storage controller provides exported NFS volumes through 10GbE connections to Hadoop worker nodes. The Hadoop cluster members are connected through 10GbE connections in the E-Series, AFF, and StorageGRID Hadoop solutions.
We used the TeraSort and TeraValidate scripts in the TeraGen benchmarking tool to measure Spark performance with the E5760, E5724, and AFF-A800 configurations. In addition, we tested three major use cases: Spark NLP pipelines and TensorFlow distributed training, Horovod distributed training, and multi-worker deep learning using Keras for CTR prediction with DeepFM.
For both E-Series and StorageGRID validation, we used Hadoop replication factor 2. For AFF validation, we only used one source of data.
Table 1 presents the hardware configuration for the Spark performance validation, and Table 2 lists the software requirements.

| Type | Hadoop Worker Nodes | Drive Type | Drives per Node | Storage Controller |
|---|---|---|---|---|
| SG5712 | 4 | SAS | 12 | Single high-availability (HA) pair |
| E5760 | 4 | SAS | 60 | Single HA pair |
| E5724 | 4 | SAS | 24 | Single HA pair |
| AFF800 | 4 | SSD | 6 | Single HA pair |


| Software | Version |
|---|---|
| RHEL | 7.9 |
| OpenJDK Runtime Environment | 1.8.0 |
| OpenJDK 64-Bit Server VM | 25.302 |
| Git | 2.24.1 |
| GCC/G++ | 11.2.1 |
| Spark | 3.2.1 |
| PySpark | 3.1.2 |
| SparkNLP | 3.4.2 |
| TensorFlow | 2.9.0 |
| Keras | 2.9.0 |
| Horovod | 0.24.3 |

We have published TR-4910: Sentiment Analysis from Customer Communications with NetApp AI, in which an end-to-end conversational AI pipeline was built using the NetApp DataOps Toolkit, AFF storage, and NVIDIA DGX System. The pipeline performs batch audio signal processing, automatic speech recognition (ASR), transfer learning, and sentiment analysis leveraging the DataOps Toolkit, NVIDIA Riva SDK, and TAO framework. Expanding the sentiment analysis use case to the financial services industry, we built a SparkNLP workflow, loaded three BERT models for various NLP tasks such as named entity recognition, and obtained sentence-level sentiment for NASDAQ Top 10 companies' quarterly earnings calls.
The following script, sentiment_analysis_spark.py, uses the FinBERT model to process transcripts in HDFS and produce positive, neutral, and negative sentiment counts, as shown in the table below:
-bash-4.2$ time ~/anaconda3/bin/spark-submit \
    --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.3 \
    --master yarn \
    --executor-memory 5g \
    --executor-cores 1 \
    --num-executors 160 \
    --conf spark.driver.extraJavaOptions="-Xss10m -XX:MaxPermSize=1024M" \
    --conf spark.executor.extraJavaOptions="-Xss10m -XX:MaxPermSize=512M" \
    /sparkusecase/tr-4570-nlp/sentiment_analysis_spark.py hdfs:///data1/Transcripts/ \
    > ./sentiment_analysis_hdfs.log 2>&1
real 13m14.300s
user 557m11.319s
sys 4m47.676s
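The script itself is not reproduced in this section, so the following is only a minimal sketch of how per-sentence labels could be tallied into positive, neutral, negative, and uncategorized buckets. The label strings, the `KNOWN_LABELS` tuple, and the `tally_sentiments` helper are hypothetical names chosen for illustration; they are not taken from sentiment_analysis_spark.py.

```python
from collections import Counter

# Hypothetical label set: the labels emitted by the real script are not shown here.
KNOWN_LABELS = ("positive", "neutral", "negative")


def tally_sentiments(labels):
    """Count sentence-level labels; anything unexpected goes to 'uncategorized'."""
    counts = Counter({name: 0 for name in KNOWN_LABELS + ("uncategorized",)})
    for label in labels:
        # Normalize case and whitespace; non-string values fall through to uncategorized.
        key = label.strip().lower() if isinstance(label, str) else ""
        counts[key if key in KNOWN_LABELS else "uncategorized"] += 1
    counts["total"] = sum(counts.values())
    return dict(counts)


# Illustrative input only, not data from the earnings call transcripts.
sample = ["neutral", "Positive", "neutral", "negative", "", "neutral"]
result = tally_sentiments(sample)
print(result)
```

Keeping an explicit uncategorized bucket means the four categories always add up to the total, which is the same row structure used in the counts table.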
Table 3) Earnings call sentence-level sentiment analysis for NASDAQ Top 10 companies from 2016 to 2020.

| Sentiment counts | All 10 Companies | AAPL | AMD | AMZN | CSCO | GOOGL | INTC | MSFT | NVDA |
|---|---|---|---|---|---|---|---|---|---|
| Positive counts | 7447 | 1567 | 743 | 290 | 682 | 826 | 824 | 904 | 417 |
| Neutral counts | 64067 | 6856 | 7596 | 5086 | 6650 | 5914 | 6099 | 5715 | 6189 |
| Negative counts | 1787 | 253 | 213 | 84 | 189 | 97 | 282 | 202 | 89 |
| Uncategorized counts | 196 | 0 | 0 | 76 | 0 | 0 | 0 | 1 | 0 |
| Total counts | 73497 | 8676 | 8552 | 5536 | 7521 | 6837 | 7205 | 6822 | 6695 |

In terms of percentages, most sentences spoken by the CEOs and CFOs are factual and therefore carry neutral sentiment. During an earnings call, analysts ask questions that might convey positive or negative sentiment. It is worth investigating quantitatively how negative or positive sentiment affects stock prices on the same or the next trading day.
Table 4) Sentence-level sentiment analysis for NASDAQ Top 10 companies, expressed in percentage.

| Sentiment percentage | All 10 Companies | AAPL | AMD | AMZN | CSCO | GOOGL | INTC | MSFT | NVDA |
|---|---|---|---|---|---|---|---|---|---|
| Positive | 10.13% | 18.06% | 8.69% | 5.24% | 9.07% | 12.08% | 11.44% | 13.25% | 6.23% |
| Neutral | 87.17% | 79.02% | 88.82% | 91.87% | 88.42% | 86.50% | 84.65% | 83.77% | 92.44% |
| Negative | 2.43% | 2.92% | 2.49% | 1.52% | 2.51% | 1.42% | 3.91% | 2.96% | 1.33% |
| Uncategorized | 0.27% | 0% | 0% | 1.37% | 0% | 0% | 0% | 0.01% | 0% |

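The percentages above follow directly from the sentence counts reported earlier: each category count is divided by the column total. The sketch below reproduces three of the columns as a check. The `COUNTS` dictionary and the `to_percentages` helper are illustrative names, and the numbers are copied from the counts table rather than read from the job output.

```python
# Counts copied from the sentence-level counts table for three of its columns.
COUNTS = {
    "All 10 Companies": {"Positive": 7447, "Neutral": 64067, "Negative": 1787, "Uncategorized": 196},
    "AAPL": {"Positive": 1567, "Neutral": 6856, "Negative": 253, "Uncategorized": 0},
    "AMZN": {"Positive": 290, "Neutral": 5086, "Negative": 84, "Uncategorized": 76},
}


def to_percentages(counts):
    """Return each category's share of the column total, rounded to two decimals."""
    total = sum(counts.values())
    return {label: round(100.0 * n / total, 2) for label, n in counts.items()}


for column, counts in COUNTS.items():
    # Print the column name, its total sentence count, and the category shares.
    print(column, sum(counts.values()), to_percentages(counts))
```

The same helper can be applied to the remaining company columns; the four categories in each column add up to the total counts row.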
In terms of the workflow runtime, we see a significant 4.78x improvement from local mode to a distributed environment in HDFS, and a further 0.14% improvement by leveraging NFS:
-bash-4.2$ time ~/anaconda3/bin/spark-submit \
    --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.3 \
    --master yarn \
    --executor-memory 5g \
    --executor-cores 1 \
    --num-executors 160 \
    --conf spark.driver.extraJavaOptions="-Xss10m -XX:MaxPermSize=1024M" \
    --conf spark.executor.extraJavaOptions="-Xss10m -XX:MaxPermSize=512M" \
    /sparkusecase/tr-4570-nlp/sentiment_analysis_spark.py file:///sparkdemo/sparknlp/Transcripts/ \
    > ./sentiment_analysis_nfs.log 2>&1
real 13m13.149s
user 537m50.148s
sys 4m46.173s
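The 0.14% figure can be checked from the two `real` timings reported by `time`. The sketch below parses the `XmY.YYYs` format and computes the relative improvement of the NFS run over the HDFS run. The `parse_real_time` helper is a hypothetical name introduced for this check, and the two values are copied from the outputs shown in this section.

```python
import re


def parse_real_time(value):
    """Convert a bash `time` value such as '13m14.300s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognized time format: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


hdfs_seconds = parse_real_time("13m14.300s")  # 'real' time of the HDFS run
nfs_seconds = parse_real_time("13m13.149s")   # 'real' time of the NFS run

saved = hdfs_seconds - nfs_seconds
improvement_pct = 100.0 * saved / hdfs_seconds

print(f"HDFS: {hdfs_seconds:.3f} s, NFS: {nfs_seconds:.3f} s")
print(f"NFS is {saved:.3f} s faster ({improvement_pct:.2f}%)")
```

With these inputs the difference is a little over one second out of roughly 794 seconds, which corresponds to the 0.14% improvement quoted above.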
As the figure below shows, data and model parallelism improve the data processing and distributed TensorFlow model inferencing speed. Placing the data in NFS yields only a slightly better runtime because the bottleneck of the workflow is downloading the pretrained models. If we increase the transcripts dataset size, the advantage of NFS will be more obvious.
Figure) Earnings call sentiment analysis runtime comparison
For other use cases and the complete Python scripts tested, please refer to TR-4570.
See Hybrid Cloud Solution with Apache Spark and NetApp AI for hybrid cloud applications in distributed Spark clusters with the NetApp® storage portfolio.
If you have questions, feel free to contact @rickhuang or the AI Solutions Team (ng-ai-inquiry@netapp.com). 😊