class: inverse, middle #
Cloud
An Intro to Cloud Computing <br> <span style = 'font-size: 130%;'>Sam Csik | Data Training Coordinator</span> National Center for Ecological Analysis & Synthesis<br> <br> <span style = 'font-size: 130%;'>Master of Environmental Data Science | Spring 2021</span> Slides & source code available on [
GitHub
](https://github.com/UCSB-MEDS/cloud-computing) --- class: inverse, middle, center ##
Cloud
What is cloud computing? The buzzword we all hear, but maybe don't quite understand --- ### Cloud computing is a lot of things... <span style = 'font-size: 85%;'>*...but generally speaking:*</span> <span style = 'font-size: 85%;'>
WiFi
Cloud computing is the delivery of on-demand computer resources over the internet (i.e. "the cloud")</span> <span style = 'font-size: 85%;'>
Server
"The Cloud" isn't just an invisible entity -- it's powered by a network of data centers, found globally, which house the hardware (servers), power and backup systems, etc. These data centers and infrastructure are managed by cloud providers</span> <span style = 'font-size: 85%;'>
Dollar Sign
Cloud computing services are typically offered using a "pay-as-you-go" pricing model, which reduces capital expenses</span> .center[ <img src="media/cloud.png" width="60%" height="60%" /> ] .footnote[ <span style = 'font-size: 80%;'>Check out this article by Ingrid Burrington in The Atlantic, [Why Amazon's Data Centers are Hidden in Spy Country](https://www.theatlantic.com/technology/archive/2016/01/amazon-web-services-data-center/423147/), for some interesting insight into one of the largest cloud provider's data centers.</span> ] --- ### There are a lots of different cloud computing platforms, but the big ones are:
angle-right
[Amazon Web Services](https://aws.amazon.com/what-is-cloud-computing/) (AWS)
angle-right
[Google Cloud Platform](https://cloud.google.com/) (GCP)
angle-right
[Microsoft Azure](https://azure.microsoft.com/en-us/) <br> .center[ <img src="media/cloud_platforms.png" width="80%" height="80%" /> ] <span style = 'font-size: 70%;'>There are *many* other cloud service providers that offer varying degrees of infrastructure, ease of use, and cost. Check out [DigitalOcean](digitalocean.com), [Kamatera](https://www.kamatera.com/), and [Vultr](https://mamboserver.link/vultr) to start.</span> --- ### Different cloud service and deployment models offer a suite of options to fit client needs <span style = 'font-size: 80%;'>
Server
**Service Models:** When you work in "the cloud" you're using resources -- including servers, storage, networks, applications, services, (and more!) -- from a very large resource pool that is managed by you or the cloud service provider. Three cloud service models describe to what extent your resources are managed by yourself or by your cloud service providers.</span> <span style = 'font-size: 80%;'>
angle-right
Infrastructure as a Service (IaaS)</span> <span style = 'font-size: 80%;'>
angle-right
Platform as a Service (PaaS)</span> <span style = 'font-size: 80%;'>
angle-right
Software as a Service (SaaS)</span> <span style = 'font-size: 80%;'>
rocket
**Deployment Models:** Cloud deployment models describe the type of cloud environment based on ownership, scale, and access.</span> <span style = 'font-size: 80%;'>
angle-right
Private Cloud</span> <span style = 'font-size: 80%;'>
angle-right
Public Cloud</span> <span style = 'font-size: 80%;'>
angle-right
Hybrid Cloud</span> --- ### Service Models <span style = 'font-size: 70%;'>
angle-right
**Infrastructure as a Service (IaaS)** provides users with computing resources like processing power, data storage capacity, and networking. IaaS platforms offer an alternative to on-premise physical infrastructure, which can be costly and labor-intensive. In comparison, IaaS platforms are more cost-effective (pay-as-you-go), flexible, and scalable.</span> <span style = 'font-size: 70%;'>One example of IaaS is [Amazon EC2](https://aws.amazon.com/ec2/), which allows users to rent virtual computers on which to run their own computer applications (e.g. R/RStudio).</span> -- <span style = 'font-size: 70%;'>
angle-right
**Platform as a Service (PaaS)** provides developers with a framework and tools for creating unique applications and software. A benefit of SaaS is that developers don't need to worry about managing servers and underlying infrastructure (e.g. managing software updates or security patches). Rather, they can focus on the development, testing, and deploying of their application/software.</span> <span style = 'font-size: 70%;'>One example of SaaS is [AWS Lambda](https://aws.amazon.com/lambda/), a serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without provisioning or managing servers.</span> -- <span style = 'font-size: 70%;'>
angle-right
**Software as a Service (SaaS)** makes software available to users via the internet. With SaaS, users don't need to install and run software on their computers. Rather, they can access everything they need over the internet by logging into their account(s). The software/application owner does not have any control over the backend except for application-related management tasks.</span> <span style = 'font-size: 70%;'>Some examples of SaaS applications include [Dropbox](https://www.dropbox.com/), [Slack](https://slack.com/), and [DocuSign](https://www.docusign.com/).</span> --- ###
Pizza Slice
An Analogy: Pizza as a Service <br> .center[ <img src="media/pizza.png" width="80%" height="80%" /> .center[ <span style = 'font-size: 70%;'>Image Source: David Ng, Oursky</span> ] ] ??? * Made in House (On-premise): Choose all your pizza ingredients and make it from scratch, you will also want complete control of all the tools in your kitchen: the oven that has the exact settings and capabilities you want, the pizza stone many places don’t have, or let your dough sit overnight at the most optimal corner of your fridge. In real world, sometimes the effort is necessary if you can’t find a service that will let you customize for that one essential deployment setting, such as a firewall configuration or that specific network setting requirements. * Kitchen as a Service (IaaS): Like buying your own ingredients to prepare exactly the pizza you want, with the specific brands you like. But you don’t have to worry about the oven, rolling pin, whether you counter is big enough, or the pizza cutter. You have absolute control on what pizza you are going to make, but maybe the rental kitchen doesn’t have every type of tool you want. * Walk in and Bake (PaaS): Like walking into a kitchen with a ready-to-cook package of ingredients. IaaS and PaaS are when you want to make your own pizza, but you don’t need to worry about buying ingredients, prep work, or having the right tools. You can focus on rolling out the dough, assembling the pre-washed ingredients to your taste and factors like size and thickness before sticking it into someone’s provided oven. * Pizza as a Service (SaaS): enjoy a fully-finished product (pizza) after ordering; you call the pizza delivery so the pizza is already designed, prepared and baked by somebody else and arrives hot and ready to eat. --- ### Deployment Models <span style = 'font-size: 80%;'>
Bus
**Public Cloud:** The cloud infrastructure is made available to the general public over the internet and is owned by a cloud provider. The resources are shared among all users. With this model, users are only billed for the resources they use.</span> <span style = 'font-size: 80%;'>You can think of the **public cloud** as a **bus** -- it's accessible to everyone and you only pay for your seat/the time you're actually riding it.</span> -- <span style = 'font-size: 80%;'>
Car
**Private Cloud:** The cloud infrastructure and resources are dedicated to a single organization.</span> <span style = 'font-size: 80%;'>You can think of the **private cloud** as buying your own **car** -- you incur a larger up-front cost, but it's available only for you to use.</span> -- <span style = 'font-size: 80%;'>
Trailer
**Hybrid Cloud:** This model includes aspects of both private and public cloud deployment models -- meaning a company might be using the public cloud but also owns on-premise systems (or private cloud infrastructure) and there is a connection between the two.</span> <span style = 'font-size: 80%;'>You can think of a **hybrid cloud** as a U-Haul **trailer** -- you have your own vehicle, but need additional, temporary capacity for your move or a road trip.</span> --- class: inverse, middle, center ###
Question
Soooo...when should I use cloud computing? Well, it depends. --- ### It depends on your goals, needs, and budget...but the possibilities are pretty limitless <span style = 'font-size: 90%;'>Basically, anything that you'd want to use a computer for, you can do using cloud computing services:</span> * <span style = 'font-size: 80%;'>Big data analytics</span> * <span style = 'font-size: 80%;'>Cloud storage</span> * <span style = 'font-size: 80%;'>Disaster recovery</span> * <span style = 'font-size: 80%;'>Data backup</span> * <span style = 'font-size: 80%;'>Machine learning</span> * <span style = 'font-size: 80%;'>Web and mobile app development</span> * <span style = 'font-size: 80%;'>Networking & content delivery</span> * <span style = 'font-size: 80%;'>Security & identity management</span> * <span style = 'font-size: 80%;'>and more!</span> --- class: inverse, middle, center ###
Amazon Web Services (AWS)
But where are *we* going to start? Let's learn how to configure and launch an Amazon EC2 instance (i.e. virtual server) that can run R & RStudio! Cool, cool. What the heck does any of that mean? --- ### What is Amazon EC2? <span style = 'font-size: 80%;'>Amazon EC2 (Elastic Compute Cloud) is just one of Amazon Web Services (AWS) cloud platform offerings. It allows users to rent virtual, scalable computers on which to run their own computer applications (like R & RStudio).</span> .center[ <iframe width="520" height="415" src="https://www.youtube.com/embed/TsRBftzZsQo" frameborder="0" allowfullscreen></iframe> ] --- ### How do I set one of those up? <span style = 'font-size: 80%;'>Most (all?) cloud service providers, including
Amazon Web Services (AWS)
(should) have extensive documentation available online, but fair warning, it can be challenging for newcomers (like myself) to decipher. Below is my *best* attempt at synthesizing that information in a way that I could understand and follow along with.</span> <span style = 'font-size: 80%;'>I'll demonstrate these steps on my own computer, but you are welcome to [create an AWS account](https://portal.aws.amazon.com/billing/signup#/start/email) to follow along on your computer as well.</span> <span style = 'font-size: 80%;'>_**Note:** Creating an AWS account does require a credit card, though you are able to access their "Free Tier" services for 12 months after creating your account._</span> <br> <br> .center[ [Here are my instructions for launching an Amazon EC2 instance to run R & RStudio.](https://docs.google.com/document/d/1Bxp7OMq1_d1l7sMD7qoh8c8cA0rhYIGx-nr4IchH98E/edit?usp=sharing) If you have suggestions for how to improve these instructions and/or correct any potential misinformation, please [file a issue on GitHub](https://github.com/UCSB-MEDS/cloud-computing/issues). ] --- class: inverse, middle, center ###
Code
A use case: leveraging more computing resources to run computationally-intesive processes ~faster~ --- ### Slow code is less fun .center[ .center2[ Big data? Computationally-intensive workflows? Maybe you've been here before...? <img src="media/sloth.gif" width="100%" height="100%" /> ] ] --- ### Consider *parallelization* (aka parallel processing or parallel computing) <span style = 'font-size: 85%;'>A method where a process is broken up into smaller parts that can then be carried out **simultaneously**, i.e. **in parallel**. Traditionally, software is written for **serial processing**. As a default, **R runs serially**.</span> .footnote[ <span style = 'font-size: 90%;'>The contents of this slide were adapted from [Danielle Ferraro](https://github.com/danielleferraro)'s talk, [A Gentle Introduction to Parallel Processing in R](https://danielleferraro.github.io/rladies-sb-parallel/#1)</span> ] -- .pull-left[ Some computer architecture jargon: <span style = 'font-size: 80%;'>
angle-right
A **core** is the part of your computer's **CPU (central processing unit)** that performs computations (most modern computers have > 1 core)</span> <span style = 'font-size: 80%;'>
angle-right
A **process** is a single running task or program (like R) -- each core runs one process at a time</span> <span style = 'font-size: 80%;'>
angle-right
A **cluster** refers to a network of computers that work together (each with many cores), but it can also mean the collection of cores on your personal computer</span> ] .pull-right[ <img src="media/CPU.png" width="100%" height="100%" /> ] --- ### Serial vs. Parallel Processing .center[ <img src="media/processing_diagram.png" width="100%" height="100%" /> ] --- ###
Check
Use parallel processing when: <span style = 'font-size: 80%;'>
Check
Tasks are [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel) i.e. when your analysis can be separated into many identical but separate tasks that do not rely on one another (e.g. bootstrapping, computations by group)</span> <span style = 'font-size: 80%;'>A case where parallel processing is **not** effective:</span> ```r function(a) { # each step depends on the previous step, so they can't be completed simultaneously b <- a + 1 c <- b + 2 d <- c + 3 return(d) } ``` <span style = 'font-size: 80%;'>
Check
Tasks are computationally intensive and take time to complete</span> .footnote[ <span style = 'font-size: 80%;'>The contents of this slide were adapted from [Danielle Ferraro](https://github.com/danielleferraro)'s talk, [A Gentle Introduction to Parallel Processing in R](https://danielleferraro.github.io/rladies-sb-parallel/#1)</span> ] -- ###
Times
Do not use parallel processing when: <span style = 'font-size: 80%;'>
Times
It may not be the right tool for the job (e.g. maybe you're memory limited)</span> <span style = 'font-size: 80%;'>
Times
It may not be efficient. There is computational overhead for setting up, maintaining, and terminating multiple processors -- for small processes, the overhead cost may be greater than the speedup</span> --- ### Implementation & R packages for parallelization <span style = 'font-size: 80%;'>How you implement parallelization is operating system-dependent:</span> * <span style = 'font-size: 80%;'>There are two ways code can be parallelized: **forking** vs. **sockets**</span> * <span style = 'font-size: 80%;'>**Forking** is faster, but doesn't work on Windows and *may* cause problems in IDEs</span> * <span style = 'font-size: 80%;'>**Sockets** are a bit slower, but work across operating systems and in IDEs</span> .footnote[ <span style = 'font-size: 90%;'>The contents of this slide were adapted from [Danielle Ferraro](https://github.com/danielleferraro)'s talk, [A Gentle Introduction to Parallel Processing in R](https://danielleferraro.github.io/rladies-sb-parallel/#1)</span> ] -- <span style = 'font-size: 80%;'>There are a number of different approaches for parallelization in R, but they use either the `parallel` package (a base R package) or `future` package (a newer and bit friendlier package) as a *backend*.</span> <span style = 'font-size: 80%;'>We'll cover the following packages:</span> * <span style = 'font-size: 80%;'>**`parallel`**, which supplies parallel versions of the base R `apply()` functions and can be implemented using either forking (Unix/Linux) or sockets (any OS)</span> * <span style = 'font-size: 80%;'>**`foreach`**, which enables the use of `for` loops in parallel and can be implemented using either forking (Unix/Linux) or sockets (any OS); the `foreach` package is used inconjunction with either `doParallel` or `doFuture`, which allows users to choose their desired backend (i.e. `parallel` vs. `future`)</span> -- .center[ <span style = 'font-size: 80%;'>Check out Danielle Ferraro's, talk [A Gentle Introduction to Parallel Processing in R](https://danielleferraro.github.io/rladies-sb-parallel/#1) to learn more about the `furrr` and `future.apply` packages, which provide parallel versions of `purrr:map()` and base R `apply()` functions, respectively, both using `future` as a backend.</span> ] --- ### What does this have to do with that Amazon EC2 instance we created?? <br> <br> <br> .center[ <span style = 'font-size: 90%;'>While we configured and launched a "Free Tier" instance ~ which includes only 1 core ☹️~ you can configure much more powerful instances (with up to 48 cores) to run parallelized code.</span> <span style = 'font-size: 90%;'>For demonstration purposes, we'll practice parallelizing code on our own multi-core computers (which are still incredibly powerful!).</span> <span style = 'font-size: 90%;'>To get started, open up RStudio and create a new RMarkdown (.Rmd) file. You can follow along with me and/or reference the [parallel_processing_KEY.md](https://github.com/UCSB-MEDS/cloud-computing/blob/main/R/parallel_processing_KEY.md) file.</span> ]