Deep Learning with Apache Spark and NetApp AI (1) – Financial Sentiment Analysis results

rickhuang · 2023-01-23

See TR-4570 NetApp Storage Solutions for Apache Spark: Architecture, Use Cases, and Performance Results for Customer Challenges, Major AI/ML/DL Use Cases and Architectures, Hardware/software environment & versions, and other details.

For this validation, we used four worker nodes and one master node with an AFF-A800 HA pair. All cluster members are connected through 10GbE network switches.

For other NetApp Spark solution validations, we used three different storage controllers: the E5760, the E5724, and the AFF-A800. The E-Series storage controllers are connected to five data nodes with 12Gbps SAS connections. The AFF HA pair provides exported NFS volumes to the Hadoop worker nodes through 10GbE connections. The Hadoop cluster members are connected through 10GbE connections in the E-Series, AFF, and StorageGRID Hadoop solutions.

We used the TeraSort and TeraValidate scripts in the TeraGen benchmarking tool to measure Spark performance with the E5760, E5724, and AFF-A800 configurations. In addition, three major use cases were tested: Spark NLP pipelines and TensorFlow distributed training, Horovod distributed training, and multi-worker deep learning using Keras for CTR prediction with DeepFM.

For both E-Series and StorageGRID validation, we used Hadoop replication factor 2. For AFF validation, we only used one source of data.

Table 1 presents the hardware configuration for the Spark performance validation, and Table 2 lists the software requirements.

Table 1) Hardware configuration for the Spark performance validation.

Type	Hadoop Worker Nodes	Drive Type	Drives per Node	Storage Controller
SG5712	4	SAS	12	Single high-availability (HA) pair
E5760	4	SAS	60	Single HA pair
E5724	4	SAS	24	Single HA pair
AFF-A800	4	SSD	6	Single HA pair

Table 2) Software requirements.

Software	Version
RHEL	7.9
OpenJDK Runtime Environment	1.8.0
OpenJDK 64-Bit Server VM	25.302
Git	2.24.1
GCC/G++	11.2.1
Spark	3.2.1
PySpark	3.1.2
SparkNLP	3.4.2
TensorFlow	2.9.0
Keras	2.9.0
Horovod	0.24.3

We have published TR-4910: Sentiment Analysis from Customer Communications with NetApp AI, in which an end-to-end conversational AI pipeline was built using the NetApp DataOps Toolkit, AFF storage, and an NVIDIA DGX system. The pipeline performs batch audio signal processing, automatic speech recognition (ASR), transfer learning, and sentiment analysis by leveraging the DataOps Toolkit, the NVIDIA Riva SDK, and the TAO framework. Expanding the sentiment analysis use case to the financial services industry, we built a SparkNLP workflow, loaded three BERT models for various NLP tasks such as named entity recognition, and obtained sentence-level sentiment for the NASDAQ Top 10 companies' quarterly earnings calls.

The following script, sentiment_analysis_spark.py, uses the FinBERT model to process transcripts in HDFS and produce positive, neutral, and negative sentiment counts, as shown in the table below:

-bash-4.2$ time ~/anaconda3/bin/spark-submit \
--packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.3 \
--master yarn \
--executor-memory 5g \
--executor-cores 1 \
--num-executors 160 \
--conf spark.driver.extraJavaOptions="-Xss10m -XX:MaxPermSize=1024M" \
--conf spark.executor.extraJavaOptions="-Xss10m -XX:MaxPermSize=512M" \
/sparkusecase/tr-4570-nlp/sentiment_analysis_spark.py hdfs:///data1/Transcripts/ \
> ./sentiment_analysis_hdfs.log 2>&1

real   13m14.300s
user   557m11.319s
sys    4m47.676s
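As a rough sizing check for the submission above, the sketch below estimates the aggregate resources the executors request from YARN. It assumes Spark's default executor memory overhead (the larger of 384 MiB and 10% of executor memory), which the command does not override. The function name `yarn_request` is ours and does not come from TR-4570. The driver and any rounding by the YARN scheduler are ignored.

```python
def yarn_request(num_executors, executor_mem_gib, executor_cores,
                 overhead_factor=0.10, min_overhead_mib=384):
    """Estimate the aggregate YARN request for the Spark executors.

    overhead_factor and min_overhead_mib mirror Spark's documented
    defaults for executor memory overhead (an assumption here, since
    the spark-submit command does not set them explicitly).
    """
    mem_mib = executor_mem_gib * 1024
    overhead_mib = max(min_overhead_mib, int(mem_mib * overhead_factor))
    container_mib = mem_mib + overhead_mib
    return {
        "container_gib": container_mib / 1024,
        "total_mem_gib": num_executors * container_mib / 1024,
        "total_cores": num_executors * executor_cores,
    }


# Values taken from the spark-submit flags above.
req = yarn_request(num_executors=160, executor_mem_gib=5, executor_cores=1)
print(req)
```

Under these assumptions each executor container is about 5.5 GiB, so the 160 executors ask for roughly 880 GiB of memory and 160 cores across the worker nodes.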

Table 3) Earnings call sentence-level sentiment analysis for NASDAQ Top 10 companies from 2016 to 2020.

Sentiment counts	All 10 Companies	AAPL	AMD	AMZN	CSCO	GOOGL	INTC	MSFT	NVDA
Positive counts	7447	1567	743	290	682	826	824	904	417
Neutral counts	64067	6856	7596	5086	6650	5914	6099	5715	6189
Negative counts	1787	253	213	84	189	97	282	202	89
Uncategorized counts	196	0	0	76	0	0	0	1	0
(total counts)	73497	8676	8552	5536	7521	6837	7205	6822	6695

In terms of percentages, most sentences spoken by the CEOs and CFOs are factual and therefore carry neutral sentiment. During an earnings call, analysts ask questions that might convey positive or negative sentiment. It is worth investigating quantitatively how negative or positive sentiment affects stock prices on the same or the next trading day.

Table 4) Sentence-level sentiment analysis for NASDAQ Top 10 companies, expressed in percentage

Sentiment percentage	All 10 Companies	AAPL	AMD	AMZN	CSCO	GOOGL	INTC	MSFT	NVDA
Positive	10.13%	18.06%	8.69%	5.24%	9.07%	12.08%	11.44%	13.25%	6.23%
Neutral	87.17%	79.02%	88.82%	91.87%	88.42%	86.50%	84.65%	83.77%	92.44%
Negative	2.43%	2.92%	2.49%	1.52%	2.51%	1.42%	3.91%	2.96%	1.33%
Uncategorized	0.27%	0%	0%	1.37%	0%	0%	0%	0.01%	0%
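The percentages above follow directly from the sentence counts in the preceding table. The sketch below recomputes them for three of the columns as a consistency check. The dictionary layout and the `percentages` helper are illustrative only and are not part of sentiment_analysis_spark.py.

```python
# Sentence counts copied from the counts table (three columns shown).
counts = {
    "All": {"positive": 7447, "neutral": 64067, "negative": 1787, "uncategorized": 196},
    "AAPL": {"positive": 1567, "neutral": 6856, "negative": 253, "uncategorized": 0},
    "AMZN": {"positive": 290, "neutral": 5086, "negative": 84, "uncategorized": 76},
}


def percentages(column):
    """Return each sentiment's share of the column total, in percent."""
    total = sum(column.values())
    return {label: round(100 * n / total, 2) for label, n in column.items()}


pct = {name: percentages(column) for name, column in counts.items()}
for name, shares in pct.items():
    print(name, sum(counts[name].values()), shares)
```

The column totals (73497, 8676, and 5536) and the resulting shares match the two tables to two decimal places.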

In terms of workflow runtime, we see a significant 4.78x improvement when moving from local mode to a distributed environment with HDFS, and a further 0.14% improvement by leveraging NFS:

-bash-4.2$ time ~/anaconda3/bin/spark-submit \
--packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.3 \
--master yarn \
--executor-memory 5g \
--executor-cores 1 \
--num-executors 160 \
--conf spark.driver.extraJavaOptions="-Xss10m -XX:MaxPermSize=1024M" \
--conf spark.executor.extraJavaOptions="-Xss10m -XX:MaxPermSize=512M" \
/sparkusecase/tr-4570-nlp/sentiment_analysis_spark.py file:///sparkdemo/sparknlp/Transcripts/ \
> ./sentiment_analysis_nfs.log 2>&1

real   13m13.149s
user   537m50.148s
sys    4m46.173s
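The 0.14% figure can be checked from the two `real` times reported by `time`. The sketch below converts both wall-clock values to seconds and computes the relative saving. The helper name `to_seconds` is ours, and the parser only handles the `NmS.SSSs` form shown in this post.

```python
import re


def to_seconds(stamp):
    """Convert a bash `time` value such as '13m14.300s' to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m([\d.]+)s", stamp).groups()
    return int(minutes) * 60 + float(seconds)


hdfs = to_seconds("13m14.300s")  # real time, transcripts read from HDFS
nfs = to_seconds("13m13.149s")   # real time, transcripts read from NFS
saving = hdfs - nfs
relative = 100 * saving / hdfs

print(f"HDFS: {hdfs:.3f} s, NFS: {nfs:.3f} s")
print(f"NFS saves {saving:.3f} s ({relative:.2f}% of the HDFS runtime)")
```

The difference is a little over one second on a run of about 13 minutes, so these two single runs are better read as roughly equal than as a clear win for either data location.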

As the figure below shows, data and model parallelism improve the data processing and distributed TensorFlow model inferencing speed. Placing the data on NFS yields only a slightly better runtime because the bottleneck of the workflow is downloading the pretrained models. If we increase the size of the transcripts dataset, we expect the advantage of NFS to become more obvious.

Figure) Earnings call sentiment analysis runtime comparison

For other use cases and the complete Python scripts tested, please refer to TR-4570.

See Hybrid Cloud Solution with Apache Spark and NetApp AI for hybrid cloud applications in distributed Spark clusters with the NetApp® storage portfolio.

If you have questions, feel free to contact @rickhuang or the AI Solutions Team (ng-ai-inquiry@netapp.com). 😊