Tech ONTAP Blogs

Deep Learning with Apache Spark and NetApp AI (1) – Financial Sentiment Analysis results

rickhuang
NetApp

See TR-4570: NetApp Storage Solutions for Apache Spark: Architecture, Use Cases, and Performance Results for customer challenges, major AI/ML/DL use cases and architectures, hardware and software environments and versions, and other details.

 

For this validation, we used four worker nodes and one master node with an AFF-A800 HA pair. All cluster members are connected through 10GbE network switches.

 

For the other NetApp Spark solution validations, we used three different storage controllers: the E5760, the E5724, and the AFF-A800. The E-Series storage controllers are connected to five data nodes with 12Gbps SAS connections. The AFF HA-pair storage controller provides exported NFS volumes through 10GbE connections to the Hadoop worker nodes. The Hadoop cluster members are connected through 10GbE connections in the E-Series, AFF, and StorageGRID Hadoop solutions.

 

We used the TeraSort and TeraValidate scripts in the TeraGen benchmarking tool to measure Spark performance with the E5760, E5724, and AFF-A800 configurations. In addition, we tested three major use cases: Spark NLP pipelines and TensorFlow distributed training, Horovod distributed training, and multi-worker deep learning using Keras for CTR prediction with DeepFM.

 

For both E-Series and StorageGRID validation, we used Hadoop replication factor 2. For AFF validation, we only used one source of data.

 

Table 1 presents the hardware configuration for the Spark performance validation, and Table 2 lists the software requirements.

 

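Table 1) Hardware configuration for the Spark performance validation.

| Type | Hadoop Worker Nodes | Drive Type | Drives per Node | Storage Controller |
| --- | --- | --- | --- | --- |
| SG5712 | 4 | SAS | 12 | Single high-availability (HA) pair |
| E5760 | 4 | SAS | 60 | Single HA pair |
| E5724 | 4 | SAS | 24 | Single HA pair |
| AFF-A800 | 4 | SSD | 6 | Single HA pair |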

 

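Table 2) Software requirements.

| Software | Version |
| --- | --- |
| RHEL | 7.9 |
| OpenJDK Runtime Environment | 1.8.0 |
| OpenJDK 64-Bit Server VM | 25.302 |
| Git | 2.24.1 |
| GCC/G++ | 11.2.1 |
| Spark | 3.2.1 |
| PySpark | 3.1.2 |
| SparkNLP | 3.4.2 |
| TensorFlow | 2.9.0 |
| Keras | 2.9.0 |
| Horovod | 0.24.3 |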

 

We have published TR-4910: Sentiment Analysis from Customer Communications with NetApp AI, in which an end-to-end conversational AI pipeline was built using the NetApp DataOps Toolkit, AFF storage, and an NVIDIA DGX system. The pipeline performs batch audio signal processing, automatic speech recognition (ASR), transfer learning, and sentiment analysis by leveraging the DataOps Toolkit, the NVIDIA Riva SDK, and the TAO framework. Expanding the sentiment analysis use case to the financial services industry, we built a SparkNLP workflow, loaded three BERT models for various NLP tasks such as named entity recognition, and obtained sentence-level sentiment for the NASDAQ Top 10 companies’ quarterly earnings calls.

The following command runs the script sentiment_analysis_spark.py, which uses the FinBERT model to process transcripts in HDFS and produce positive, neutral, and negative sentiment counts, as shown in the table below:

-bash-4.2$ time ~/anaconda3/bin/spark-submit \
--packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.3 \
--master yarn \
--executor-memory 5g \
--executor-cores 1 \
--num-executors 160 \
--conf spark.driver.extraJavaOptions="-Xss10m -XX:MaxPermSize=1024M" \
--conf spark.executor.extraJavaOptions="-Xss10m -XX:MaxPermSize=512M" \
/sparkusecase/tr-4570-nlp/sentiment_analysis_spark.py hdfs:///data1/Transcripts/ \
> ./sentiment_analysis_hdfs.log 2>&1

real   13m14.300s
user   557m11.319s
sys    4m47.676s
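
To get a feel for what the flags above ask of the cluster, the following minimal sketch adds up the requested cores and memory. It assumes Spark's default executor memory overhead on YARN (the larger of 384 MiB or 10% of the executor memory) and an even spread of executors across the four worker nodes. Neither assumption is confirmed by this validation, and the function names are hypothetical.

```python
# Rough resource arithmetic for the spark-submit flags shown above.
NUM_EXECUTORS = 160              # --num-executors 160
EXECUTOR_CORES = 1               # --executor-cores 1
EXECUTOR_MEMORY_MIB = 5 * 1024   # --executor-memory 5g
WORKER_NODES = 4                 # four worker nodes in this validation


def container_memory_mib(executor_memory_mib, overhead_percent=10, min_overhead_mib=384):
    """Executor heap plus the assumed default YARN memory overhead, in MiB."""
    overhead = max(min_overhead_mib, executor_memory_mib * overhead_percent // 100)
    return executor_memory_mib + overhead


def cluster_request(num_executors, cores_per_executor, executor_memory_mib):
    """Total cores and memory requested across all executors."""
    per_container = container_memory_mib(executor_memory_mib)
    return {
        "total_cores": num_executors * cores_per_executor,
        "heap_gib": num_executors * executor_memory_mib / 1024,
        "container_gib": num_executors * per_container / 1024,
    }


request = cluster_request(NUM_EXECUTORS, EXECUTOR_CORES, EXECUTOR_MEMORY_MIB)
# Assumption: YARN spreads the executors evenly over the worker nodes.
executors_per_node = NUM_EXECUTORS // WORKER_NODES

print(request)
print(f"executors per worker node (even spread): {executors_per_node}")
```

Under these assumptions, the job requests 160 cores and 800 GiB of executor heap, or 880 GiB once the overhead is included.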

 

Table 3) Earnings call sentence-level sentiment analysis for NASDAQ Top 10 companies from 2016 to 2020.

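| Sentiment counts | All 10 Companies | AAPL | AMD | AMZN | CSCO | GOOGL | INTC | MSFT | NVDA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Positive counts | 7447 | 1567 | 743 | 290 | 682 | 826 | 824 | 904 | 417 |
| Neutral counts | 64067 | 6856 | 7596 | 5086 | 6650 | 5914 | 6099 | 5715 | 6189 |
| Negative counts | 1787 | 253 | 213 | 84 | 189 | 97 | 282 | 202 | 89 |
| Uncategorized counts | 196 | 0 | 0 | 76 | 0 | 0 | 0 | 1 | 0 |
| Total counts | 73497 | 8676 | 8552 | 5536 | 7521 | 6837 | 7205 | 6822 | 6695 |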

 

In terms of percentages, most sentences spoken by the CEOs and CFOs are factual and therefore carry neutral sentiment. During an earnings call, analysts ask questions that might convey positive or negative sentiment. It is worth investigating further, and quantitatively, how negative or positive sentiment affects stock prices on the same or the next day of trading.

 

Table 4) Sentence-level sentiment analysis for NASDAQ Top 10 companies, expressed as percentages.

| Sentiment percentage | All 10 Companies | AAPL | AMD | AMZN | CSCO | GOOGL | INTC | MSFT | NVDA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Positive | 10.13% | 18.06% | 8.69% | 5.24% | 9.07% | 12.08% | 11.44% | 13.25% | 6.23% |
| Neutral | 87.17% | 79.02% | 88.82% | 91.87% | 88.42% | 86.50% | 84.65% | 83.77% | 92.44% |
| Negative | 2.43% | 2.92% | 2.49% | 1.52% | 2.51% | 1.42% | 3.91% | 2.96% | 1.33% |
| Uncategorized | 0.27% | 0% | 0% | 1.37% | 0% | 0% | 0% | 0.01% | 0% |
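
The percentages above follow directly from the sentence counts in the earlier counts table. As a quick check, the minimal sketch below recomputes them for three of the columns. The dictionary layout and the function name `to_percentages` are illustrative only and are not taken from the TR-4570 scripts.

```python
# Sentence-level sentiment counts copied from the counts table above.
COUNTS = {
    "All 10 Companies": {"positive": 7447, "neutral": 64067, "negative": 1787, "uncategorized": 196},
    "AAPL": {"positive": 1567, "neutral": 6856, "negative": 253, "uncategorized": 0},
    "AMZN": {"positive": 290, "neutral": 5086, "negative": 84, "uncategorized": 76},
}


def to_percentages(counts):
    """Return each label's share of the total sentence count, in percent."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


for name, counts in COUNTS.items():
    shares = to_percentages(counts)
    print(name, sum(counts.values()), shares)
```

For the All 10 Companies column, this reproduces 10.13% positive, 87.17% neutral, 2.43% negative, and 0.27% uncategorized out of 73497 sentences.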

 

In terms of workflow runtime, we see a significant 4.78x improvement when moving from local mode to a distributed environment with HDFS, and a further 0.14% improvement when reading the data from NFS:

-bash-4.2$ time ~/anaconda3/bin/spark-submit \
--packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.3 \
--master yarn \
--executor-memory 5g \
--executor-cores 1 \
--num-executors 160 \
--conf spark.driver.extraJavaOptions="-Xss10m -XX:MaxPermSize=1024M" \
--conf spark.executor.extraJavaOptions="-Xss10m -XX:MaxPermSize=512M" \
/sparkusecase/tr-4570-nlp/sentiment_analysis_spark.py file:///sparkdemo/sparknlp/Transcripts/ \
> ./sentiment_analysis_nfs.log 2>&1

real   13m13.149s
user   537m50.148s
sys    4m46.173s
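
The 0.14% figure can be reproduced from the two `real` timings reported by `time` for the HDFS and NFS runs. The sketch below is only a convenience check. The helper `parse_time` is a hypothetical name, and it assumes the `XmY.YYYs` format that bash prints.

```python
import re


def parse_time(value):
    """Convert a bash `time` value such as '13m14.300s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


hdfs_real = parse_time("13m14.300s")  # wall-clock time of the HDFS run
nfs_real = parse_time("13m13.149s")   # wall-clock time of the NFS run

# Relative wall-clock improvement of the NFS run over the HDFS run.
improvement_pct = 100 * (hdfs_real - nfs_real) / hdfs_real
print(f"HDFS: {hdfs_real:.3f} s, NFS: {nfs_real:.3f} s, improvement: {improvement_pct:.2f}%")
```

The two runs differ by roughly 1.15 seconds of wall-clock time out of about 794 seconds.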

As the figure below shows, data and model parallelism improve the speed of data processing and distributed TensorFlow model inferencing. Storing the data in NFS yields only a slightly better runtime because the bottleneck of the workflow is downloading the pretrained models. If we increase the size of the transcripts dataset, we expect the advantage of NFS to become more obvious.

 

Figure) Earnings call sentiment analysis runtime comparison


 

For other use cases and the complete Python scripts tested, please refer to TR-4570.

 

See Hybrid Cloud Solution with Apache Spark and NetApp AI for hybrid cloud applications in distributed Spark clusters with the NetApp® storage portfolio.

 

If you have questions, feel free to contact @rickhuang or the AI Solutions Team (ng-ai-inquiry@netapp.com). 😊

 
