! pip install yfinance
import yfinance as yf
data = yf.download("AAPL IBM", start="2009-01-01", end="2019-12-31")
[*********************100%***********************] 2 of 2 completed
data['Open']
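With two tickers, yfinance returns MultiIndex columns (price field, ticker), so data['Open'] is a two-column frame, one column per ticker. A quick sanity check before writing it out (a sketch; column order may differ across yfinance versions):

print(data['Open'].columns.tolist())  # expected something like ['AAPL', 'IBM']
data['Open'].head()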
%%bash
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [p-e2004cac-a471-4cd4-bbff-85df40c8ef44]
Starting resourcemanager
Starting nodemanagers
from hdfs3 import HDFileSystem
hdfs = HDFileSystem(host='localhost', port=9000)
with hdfs.open('AAPL_IBM_open.csv', 'wb') as f:
    # hdfs3 file handles take bytes, so serialize the frame first
    f.write(data['Open'].to_csv(header=True).encode())
hdfs.ls('.')
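To confirm the upload landed intact, a few bytes can be read straight back through the same hdfs3 client (a quick sketch; the byte count is arbitrary):

with hdfs.open('AAPL_IBM_open.csv', 'rb') as f:
    print(f.read(120))  # header row plus the first data row or two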
%%file stock_analysis.py
from mrjob.job import MRJob
import sys

class StockAnalysis(MRJob):

    def mapper(self, key, value):
        # Each input line is "Date,AAPL,IBM"; the header row slips through
        # harmlessly because its month slice never matches Q4 below.
        date, apple_open, ibm_open = value.split(',')
        # print(value, file=sys.stderr)  # uncomment to debug via stderr
        year = date[:4]
        month = date[5:7]
        if month in ('10', '11', '12'):  # fourth quarter only
            yield 'apple_%s' % year, float(apple_open)
            yield 'ibm_%s' % year, float(ibm_open)

    def reducer(self, key, values):
        # Highest Q4 opening price for each ticker-year key
        yield key, max(values)

if __name__ == '__main__':
    StockAnalysis.run()
Overwriting stock_analysis.py
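Before submitting to YARN, the job can be smoke-tested with mrjob's default inline runner against a local copy of the data (a sketch; AAPL_IBM_open_local.csv is a hypothetical local export, not the HDFS path used below):

!python stock_analysis.py AAPL_IBM_open_local.csv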
!python stock_analysis.py -r hadoop hdfs:///user/root/AAPL_IBM_open.csv
No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in /root/hadoop-3.3.1/bin...
Found hadoop binary: /root/hadoop-3.3.1/bin/hadoop
Using Hadoop version 3.3.1
Looking for Hadoop streaming jar in /root/hadoop-3.3.1...
Found Hadoop streaming jar: /root/hadoop-3.3.1/share/hadoop/tools/lib/hadoop-streaming-3.3.1.jar
Creating temp directory /tmp/stock_analysis.root.20210919.204945.260953
uploading working dir files to hdfs:///user/root/tmp/mrjob/stock_analysis.root.20210919.204945.260953/files/wd...
Copying other local files to hdfs:///user/root/tmp/mrjob/stock_analysis.root.20210919.204945.260953/files/
Running step 1 of 1...
packageJobJar: [/tmp/hadoop-unjar14646427661634042380/] [] /tmp/streamjob4819328614983767640.jar tmpDir=null
Connecting to ResourceManager at /0.0.0.0:8032
Connecting to ResourceManager at /0.0.0.0:8032
Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1632084584955_0001
Total input files to process : 1
number of splits:2
Submitting tokens for job: job_1632084584955_0001
Executing with tokens: []
resource-types.xml not found
Unable to find 'resource-types.xml'.
Submitted application application_1632084584955_0001
The url to track the job: http://p-e2004cac-a471-4cd4-bbff-85df40c8ef44:8088/proxy/application_1632084584955_0001/
Running job: job_1632084584955_0001
Job job_1632084584955_0001 running in uber mode : false
map 0% reduce 0%
map 100% reduce 0%
map 100% reduce 100%
Job job_1632084584955_0001 completed successfully
Output directory: hdfs:///user/root/tmp/mrjob/stock_analysis.root.20210919.204945.260953/output
Counters: 55
    File Input Format Counters
        Bytes Read=129826
    File Output Format Counters
        Bytes Written=704
    File System Counters
        FILE: Number of bytes read=47046
        FILE: Number of bytes written=924947
        FILE: Number of large read operations=0
        FILE: Number of read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=130028
        HDFS: Number of bytes read erasure-coded=0
        HDFS: Number of bytes written=704
        HDFS: Number of large read operations=0
        HDFS: Number of read operations=11
        HDFS: Number of write operations=2
    Job Counters
        Data-local map tasks=2
        Killed map tasks=1
        Launched map tasks=2
        Launched reduce tasks=1
        Total megabyte-milliseconds taken by all map tasks=20226048
        Total megabyte-milliseconds taken by all reduce tasks=5371904
        Total time spent by all map tasks (ms)=19752
        Total time spent by all maps in occupied slots (ms)=19752
        Total time spent by all reduce tasks (ms)=5246
        Total time spent by all reduces in occupied slots (ms)=5246
        Total vcore-milliseconds taken by all map tasks=19752
        Total vcore-milliseconds taken by all reduce tasks=5246
    Map-Reduce Framework
        CPU time spent (ms)=2670
        Combine input records=0
        Combine output records=0
        Failed Shuffles=0
        GC time elapsed (ms)=241
        Input split bytes=202
        Map input records=2768
        Map output bytes=44252
        Map output materialized bytes=47052
        Map output records=1394
        Merged Map outputs=2
        Peak Map Physical memory (bytes)=277606400
        Peak Map Virtual memory (bytes)=2725294080
        Peak Reduce Physical memory (bytes)=204193792
        Peak Reduce Virtual memory (bytes)=2725457920
        Physical memory (bytes) snapshot=739270656
        Reduce input groups=22
        Reduce input records=1394
        Reduce output records=22
        Reduce shuffle bytes=47052
        Shuffled Maps =2
        Spilled Records=2788
        Total committed heap usage (bytes)=592445440
        Virtual memory (bytes) snapshot=8171941888
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
job output is in hdfs:///user/root/tmp/mrjob/stock_analysis.root.20210919.204945.260953/output
Streaming final output from hdfs:///user/root/tmp/mrjob/stock_analysis.root.20210919.204945.260953/output...
"apple_2009" 7.611785888671875
"apple_2010" 11.650713920593262
"apple_2011" 15.062856674194336
"apple_2012" 23.97321319580078
"apple_2013" 20.451786041259766
"apple_2014" 29.8174991607666
"apple_2015" 30.782499313354492
"apple_2016" 29.545000076293945
"apple_2017" 43.77750015258789
"apple_2018" 57.69499969482422
"apple_2019" 72.77999877929688
"samsung_2009" 132.41000366210938
"samsung_2010" 146.72999572753906
"samsung_2011" 193.63999938964844
"samsung_2012" 211.14999389648438
"samsung_2013" 186.49000549316406
"samsung_2014" 189.91000366210938
"samsung_2015" 152.4600067138672
"samsung_2016" 168.97000122070312
"samsung_2017" 162.0500030517578
"samsung_2018" 154.0
"samsung_2019" 145.58999633789062
Removing HDFS temp directory hdfs:///user/root/tmp/mrjob/stock_analysis.root.20210919.204945.260953...
Removing temp directory /tmp/stock_analysis.root.20210919.204945.260953...
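The same job can also be driven from Python instead of the shell, which makes it easy to collect the reducer output into a dict; a minimal sketch using mrjob's runner API, assuming stock_analysis.py is importable from the working directory:

from stock_analysis import StockAnalysis

job = StockAnalysis(args=['-r', 'hadoop', 'hdfs:///user/root/AAPL_IBM_open.csv'])
with job.make_runner() as runner:
    runner.run()
    # parse_output() decodes the (key, value) pairs streamed back from HDFS
    results = dict(job.parse_output(runner.cat_output()))
print(results.get('apple_2019'))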
%%bash
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/stop-dfs.sh