import React, {Component } from 'react'

import ArchDiagram from '../images/stockProject/appArchitecture.png';


class StockProject extends Component {
    render() {
        return(
            <div className="project-box">
                <h1 className="project-title-text">
                    Serverless Machine Learning – Stock Sentiment App
                </h1>
                <p className="project-main-text">
                    "Stocks In English" is an app that scrapes financial articles, then uses machine learning to determine their sentiment. 
                    The sentiment trends are shown in the app and the whole dataset can be downloaded to be used with other machine learning models. 
                    <br/>
                    <br/>
                    I created this app because there weren’t many good datasets of real financial articles, and I 
                    wanted to learn more about serverless infrastructure. The main components of the app are the serverless infrastructure and machine learning. 
                </p>

                <p className="project-main-text" style={{display:'block'}}>
                You can view the app here: &nbsp;
                    <a href="http://stocks.prestonblackburn.com/" target="_blank" rel="noopener noreferrer" style={{textDecoration:"none", }}> Stocks In English </a>
                </p>


                <h2 className="project-title-text-2">
                    Serverless Infrastructure
                </h2>
                <p className="project-main-text">
                    Serverless infrastructure is a good fit for machine learning functions that run infrequently. By hosting my machine learning
                     model serverlessly, I only pay for the compute time that I use instead of paying monthly for a virtual server. In the case of this 
                     app, I only pull articles every three days, so I save around $10-$20 per month by hosting it serverlessly. To create this app I used
                      several serverless resources, listed below. To orchestrate all of them I used the AWS Serverless Application Model (SAM).
                       SAM lets you define your infrastructure as configuration in YAML files, which helps you manage all your resources in one place. For 
                       this project I used five different AWS services: Amplify, DynamoDB, Lambda functions, Elastic Container Registry (ECR),
                        and API Gateway. An AWS architecture diagram of the app is shown below. 
                </p>
                <div className="blog-pics">
                    <img src={ArchDiagram} alt = "AWS Architecture Diagram" width='70%'/>
                </div>


                    <p className="project-main-text">
                        1.	Database table that holds all the stock ticker symbols to be scraped
                        <br />
                        I stored them in a database so that, in the future, users could request different stocks that they want data on.
                        <br />
                        <br />

                        2.	Function that crawls and scrapes data based on the requested stock tickers
                        <br />
                        This function is triggered every three days to scrape financial article data.
                        <br />
                        <br />


                        3.	Database table that holds the scraped data results
                        <br />
                        <br />

                        4.	Container repository that holds the image for the machine learning inference function
                        <br />
                        <br />

                        5.	Function that uses machine learning to determine the sentiment of the scraped text
                        <br />
                        This function is triggered by the DynamoDB table (#3), so every time new data is added it is passed to this Lambda function and the machine learning inference is run. 
                        <br />
                        <br />

                        6.	Database table that holds the analyzed data
                        <br />
                        <br />

                        7.	Function to pull the data for the API
                        <br />
                        This function reads the DynamoDB table and serves the data to API Gateway. I do some pre-processing at this level to make my life easier in the frontend.
                        <br />
                        <br />
                        8.	API Gateway to serve the data to the frontend
                        <br />
                        <br />
                        9.	Website hosting with Amplify
                    </p>

                <p className="project-main-text" style={{display:'block'}}>
                    The SAM YAML file for this project can be found on my GitHub: &nbsp;
                    <a href="https://github.com/PrestonBlackburn/stocks-in-english" target="_blank" rel="noopener noreferrer" style={{textDecoration:"none", }}> GitHub Repo </a>
                </p>



                <h2 className="project-title-text-2">
                    Machine Learning
                </h2>
                <p className="project-main-text">
                        For this project I ran into a couple of challenges due to the limitations of serverless hosting, mostly 
                        related to package and model size. To be fair, I could have used Amazon SageMaker, which would have solved some of the issues I ran into.
                        However, using SageMaker would incur additional costs and add complexity to this project. In future projects I plan on playing around with SageMaker more. 
                    <br />
                    <br />
                       First off, machine learning models can’t be deployed in the normal way using AWS Lambda due to their size. The cutoff for Lambda package
                       deployments is 250 MB (unzipped), which is not much in relation to machine learning models. For perspective, the small BERT model that I used was
                       around 115 MB and the traditional BERT model is around 400 MB, not including the 2 GB TensorFlow library that is required to run the models.
                        The good news is that around a year ago AWS started supporting container images for Lambda functions using Docker. With container images, 
                        up to 10 GB can be packaged with the Lambda function.
                    <br />
                    <br />
                        Second, the original library I used (Hugging Face Transformers) was a whopping 9.2 GB. While up to 10 GB can be uploaded to AWS, 
                        I would not recommend it due to long upload times. The first time I uploaded the image it took about an hour. Since I’m not that patient, 
                        I switched over to TensorFlow instead. TensorFlow didn’t have a ready-to-use sentiment model out of the box like Hugging Face, so I ended up 
                        fine-tuning a pre-trained small BERT model. To fine-tune the model I used the “Financial News Sentiment” dataset from Kaggle.
                        I chose to train the model only on the positive and negative statements and omit the mixed statements. For now, I’m operating 
                        on the assumption that mixed financial statements would not be newsworthy, but in the future I plan on going back, adding them in, and comparing the results. 
                </p>

                <p className="project-main-text" style={{display:'block'}}>
                The dataset for fine-tuning small BERT on Kaggle: &nbsp;
                    <a href="https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news" target="_blank" rel="noopener noreferrer" style={{textDecoration:"none", }}> Financial News Dataset </a>
                </p>


                <h2 className="project-title-text-2">
                    Web Scraping
                </h2>

                <p className="project-main-text">
                    Web scraping was an important feature of this project. I didn’t have much experience with web scraping before 
                    this project, so it ended up being a challenge. To make things easy, I piggyback on the search results for financial 
                    articles from Google Finance. From there I pull the URLs of all the listed articles, then scrape the HTML content 
                    from each page. Once I have the HTML content, I pull the first couple of sentences and save them in a DynamoDB table. 
                    I could do more with the web scraping.  
                </p>


                <h2 className="project-title-text-2">
                    Future Steps
                </h2>

                <p className="project-main-text" style={{display:'block'}}>
                    This project sparked my curiosity about other AWS services that I will experiment with in the future. Each of these services 
                    could have been used in this project: Amazon SageMaker, AWS Step Functions, and the AWS Cloud Development Kit (CDK). With SageMaker
                     I could develop better ML pipelines and maintain better version control of my models. Step Functions could help me organize 
                     my Lambda functions, and it would be very useful if my workflow increased in complexity. Lastly, CDK sounds interesting now that
                      I’m more familiar with some of the AWS services, since it would let me define the serverless infrastructure in Python instead of a YAML file.  
                </p>
            </div>
        )
    }
}

export default StockProject;
