Analysis of Crypto Assets Using Apache Spark

Suraj Sharma
8 min read · Feb 13, 2022

This blog only scratches the surface: it covers how Apache Spark (an open-source distributed computing framework) can be used through its Scala API to process a crypto-asset dataset collected from Kaggle, and how to build some basic visualizations using Zeppelin notebooks.

GitHub Link: https://github.com/strikersps/Exploratory-Data-Analysis/tree/main/Analysis-of-Crypto-Assets-Using-Apache-Spark

1. Introduction

Before we delve into the Scala code snippets, it helps to have some background on crypto assets and on blockchain technology, the fundamental building block behind the crypto assets traded on crypto exchanges.

A blockchain is a digitized, distributed, and decentralized append-only ledger. It securely stores, as a chain of blocks, every transaction that has happened on the network since the genesis block, and it establishes distributed trust through a consensus algorithm. It is the foundational technology on which many kinds of decentralized applications are built, such as cryptocurrencies, decentralized finance (DeFi) apps, and decentralized autonomous organizations (DAOs).

Blockchain was first introduced on October 31, 2008, in the wake of the global financial crisis, as the technology behind the world's first decentralized money, Bitcoin, in a white paper published under the pseudonym Satoshi Nakamoto.

In my opinion, the paper was published just after the global financial crisis of 2008 because that was the best time to make people aware of the loopholes in the traditional financial systems we rely on so heavily, and of how the organizations and third parties that provide the infrastructure for exchanging value became greedy, which led to various kinds of financial crises. Bitcoin was proposed not only as a solution to those loopholes but also as an algorithmic way to remove the central authority from transactions, by building distributed trust through the ecosystem that governs Bitcoin.

Crypto assets have also become quite popular and controversial. The various crypto assets tradable on crypto exchanges have become an alternative asset class used to hedge against the risks of other asset classes such as equities, bonds, currencies, and commodities. Do remember, however, that crypto assets have not yet created a large value system for ordinary citizens, so their true potential is yet to be realized in a meaningful way.

Let's now focus on the analysis of three crypto tokens: Bitcoin, Ethereum, and Cardano. Even though this is a basic exploratory data analysis, there are still some interesting insights in the trends of the prices and market capitalizations of these tokens, especially during the COVID-19 period.

2. Setting Up The Environment

The analysis uses Apache Spark and the Scala programming language, so make sure you have Apache Spark (2.3.0) with its Scala interface set up on your system. If you want to run the Zeppelin notebook that contains the visualizations, also install Zeppelin (0.10.0) before you clone the GitHub repo.

3. Explanation of the Source-Code

3.1 Importing Libraries: We first import the type classes defined in org.apache.spark.sql.types and the Row class from org.apache.spark.sql.Row, which allow us to define or change the schema of our dataset. We then import org.apache.spark.sql.functions and org.apache.spark.sql.expressions.Window to perform some basic time-series processing.

Importing the Required Libraries and Sub-modules
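
A minimal sketch of the imports described above, assuming Spark 2.3.x is on the classpath (the exact import list in the repository may differ). The exampleSchema value and its column names are hypothetical and only illustrate what the type classes are for:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Hypothetical schema, shown only to illustrate the imported type classes.
// The column names are assumptions, not confirmed from the dataset.
val exampleSchema = StructType(Seq(
  StructField("Date", TimestampType, nullable = true),
  StructField("Close", DoubleType, nullable = true),
  StructField("MarketCap", DoubleType, nullable = true)
))
```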

3.2 Setting up paths to the datasets: While setting up the paths to the required crypto-asset *.csv files, make sure the paths are adjusted to your work environment.

Setting up paths to different crypto asset datasets
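
As a rough sketch, the paths can be kept in plain string values. The directory and file names below are placeholders, not the names used in the repository, so replace them with the locations of your own downloaded files:

```scala
// Placeholder directory and file names: adjust them to your own environment.
val dataDir = "/path/to/crypto-datasets"

val bitcoinPath = s"$dataDir/bitcoin.csv"
val ethereumPath = s"$dataDir/ethereum.csv"
val cardanoPath = s"$dataDir/cardano.csv"
```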

3.3 Reading the data from *.csv files into Apache Spark data frames: Using spark.read.options().csv() we can read the data from a CSV file and create a data frame. The options() function takes a Map() of extra options that should be applied while building the data frame.

In this case, we ask Spark to treat the first line of the CSV file as the header, which supplies the column names, and to infer the schema automatically, so that each column gets a data type inferred from its values.

Creating three data frames to ingest the three crypto-asset datasets
Creation of Data-Frames in Scala
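
The read step can be sketched as follows. To keep the example runnable without the Kaggle files, it writes a two-row stand-in CSV to a temporary file first; the column names and values are made up for illustration. In a Zeppelin notebook or spark-shell the spark session already exists, and getOrCreate() simply returns it:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("csv-read-sketch")
  .getOrCreate()

// Stand-in CSV with made-up values; the real files come from Kaggle.
val samplePath = Files.createTempFile("coin_sample", ".csv")
val sampleCsv =
  "Date,Close,MarketCap\n2021-01-01,100.0,1000.0\n2021-01-02,110.0,1100.0\n"
Files.write(samplePath, sampleCsv.getBytes(StandardCharsets.UTF_8))

// header: use the first line as column names.
// inferSchema: let Spark infer each column's data type from the values.
val readOptions = Map("header" -> "true", "inferSchema" -> "true")
val sampleDf = spark.read.options(readOptions).csv(samplePath.toString)
```

The same call would be repeated once per crypto asset, each time with that asset's own path.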

Once the data frames are created, we first print their schemas before exploring the dataset with the functions provided by the Apache Spark DataFrame class.

Printing the schema of the data-frames created

After printing the schema, we print the first 20 observations from each of the data frames created from the respective crypto-asset *.csv files.

Printing the contents of the data-frames using the show() method defined in DataFrame class
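
A small sketch of both inspection steps, printSchema() and show(), on a made-up data frame (the column names and values are illustrative assumptions):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("inspect-sketch")
  .getOrCreate()

// Made-up sample standing in for one of the crypto-asset data frames.
val sample = spark.createDataFrame(Seq(
  ("2021-01-01", 100.0),
  ("2021-01-02", 110.0),
  ("2021-01-03", 99.0)
)).toDF("Date", "Close")

sample.printSchema()   // prints column names, data types, and nullability
sample.show()          // prints up to 20 rows by default
sample.show(2, false)  // prints 2 rows without truncating long values
```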

3.4 Time series Preprocessing: As we are dealing with a financial time-series dataset, we first extract components such as the year, month, and day of week from the date column, which lets us study each crypto asset on a monthly and yearly basis. We also add a new column of daily returns, calculated according to the equation shown below, to get an idea of the volatility of the underlying crypto asset.

Daily Returns Calculation Equation
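
A common definition of the simple daily return is given here as an assumption, since the article's exact formula may differ (for example, it may not be scaled to a percentage). Here P_t denotes the closing price on day t:

```latex
R_t = \frac{P_t - P_{t-1}}{P_{t-1}} \times 100
```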
Basic Time-Series Preprocessing
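
The preprocessing described above can be sketched as follows, assuming columns named Date and Close and a percentage-based simple return. The sample values are made up, and the repository's actual column names and return formula may differ:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dayofweek, lag, month, to_date, year}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("preprocessing-sketch")
  .getOrCreate()

// Made-up sample values, not taken from the dataset.
val prices = spark.createDataFrame(Seq(
  ("2021-01-01", 100.0),
  ("2021-01-02", 110.0),
  ("2021-01-03", 99.0)
)).toDF("Date", "Close")

// Order rows chronologically so lag() picks up the previous day's close.
// A window without partitionBy moves all rows to one partition, which is
// acceptable for a dataset of this size.
val byDate = Window.orderBy("Date")

val enriched = prices
  .withColumn("Date", to_date(col("Date")))
  .withColumn("Year", year(col("Date")))
  .withColumn("Month", month(col("Date")))
  .withColumn("DayOfWeek", dayofweek(col("Date"))) // 1 = Sunday ... 7 = Saturday
  .withColumn("PrevClose", lag(col("Close"), 1).over(byDate))
  .withColumn("DailyReturn",
    (col("Close") - col("PrevClose")) / col("PrevClose") * 100)

enriched.show()
```

The first row has no previous close, so its DailyReturn is null.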

Once the time-series preprocessing is done, we print the schema and the data frames again to look at the values of the new columns produced by the above block of code.

Printing Schema and Data Frames after performing the preprocessing of the time series dataset

We found a high positive correlation between the price and the market cap of Bitcoin (0.9997094987294977), Ethereum (0.9983514954142878), and Cardano (0.9980783002967658). This is expected: market cap is directly proportional to price, because it is calculated as:

The equation for Calculation of Market Capitalization of Crypto Assets
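
Written out from the definition in the surrounding text, the relation is:

```latex
\text{Market Cap} = \text{Price} \times \text{Circulating Supply}
```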

where circulating supply is the best approximation of the number of coins that are circulating in the market and in the public's hands.

Please go through the link https://coinmarketcap.com/alexandria/glossary/circulating-supply to understand circulating supply in more depth. Crypto assets have a total supply, a circulating supply, and a maximum supply, so don't confuse them.

Calculation of correlation coefficient between different crypto assets
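
One way to compute such a coefficient is the stat.corr() method of a data frame, which returns the Pearson correlation of two numeric columns. This is a sketch with made-up values and assumed column names (Close, MarketCap); the repository may compute it differently:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("correlation-sketch")
  .getOrCreate()

// Made-up prices; market cap = price * a slowly growing circulating supply.
val sample = spark.createDataFrame(Seq(
  (100.0, 100.0 * 18.0),
  (200.0, 200.0 * 18.1),
  (300.0, 300.0 * 18.2),
  (400.0, 400.0 * 18.3)
)).toDF("Close", "MarketCap")

// Pearson correlation coefficient between the two columns.
val priceCapCorr: Double = sample.stat.corr("Close", "MarketCap")
println(s"Correlation between Close and MarketCap: $priceCapCorr")
```

Because the supply changes slowly compared with the price, the coefficient stays close to 1, which matches the pattern reported above.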

3.5 Monthly and Yearly Analysis: To perform the monthly and yearly analysis, we call groupBy() followed by agg(mean("Close"), mean("MarketCap")) and then sort() to order the values by increasing year or month. We then use the Zeppelin notebook's built-in visualization tools, which are accessible through the z.show() method.

Block of code for performing groupBy() on a monthly basis on Bitcoin data frame
Block of code for performing groupBy() on a yearly basis on Bitcoin data frame
Bitcoin Average Market Capitalization Change Over the Years
Block of code for performing groupBy() on a monthly basis on Ethereum data frame
Block of code for performing groupBy() on a yearly basis on Ethereum data frame
Ethereum Average Market Capitalization Change Over the Years
Block of code for performing groupBy() on a monthly basis on Cardano data frame
Block of code for performing groupBy() on a yearly basis on Cardano data frame
Cardano Average Market Capitalization Change Over the Years
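
The aggregation pattern behind these blocks can be sketched as follows. The sample values, the column aliases, and the choice to group by Month alone (rather than by Year and Month together) are assumptions made for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.mean

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("groupby-sketch")
  .getOrCreate()

// Made-up rows standing in for a preprocessed crypto-asset data frame.
val btc = spark.createDataFrame(Seq(
  (2020, 12, 100.0, 1000.0),
  (2021, 1, 200.0, 2000.0),
  (2021, 1, 300.0, 3000.0),
  (2021, 2, 400.0, 4000.0)
)).toDF("Year", "Month", "Close", "MarketCap")

// Average close and market cap per year, in increasing order of year.
val yearly = btc.groupBy("Year")
  .agg(mean("Close").as("AvgClose"), mean("MarketCap").as("AvgMarketCap"))
  .sort("Year")

// Average close and market cap per calendar month.
val monthly = btc.groupBy("Month")
  .agg(mean("Close").as("AvgClose"), mean("MarketCap").as("AvgMarketCap"))
  .sort("Month")

yearly.show()
monthly.show()
// In a Zeppelin paragraph, z.show(yearly) renders the result with the
// built-in charting tools instead of as plain text.
```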

The above visualizations are built using Zeppelin. You can clearly see that Bitcoin's market capitalization soared in 2021: at one point it was worth around $1 trillion, and on average it was around $850 billion. Over the years Bitcoin's market capitalization has increased exponentially, and the same is the case with Ethereum (average market cap in 2021 of $240 billion) and Cardano (average market cap in 2021 of $36.1 billion). The following visualizations show price volatility over the same period.

Bitcoin Price Volatility Over Years
Ethereum Price Volatility Over Years
Cardano Volatility Over Years

From a volatility standpoint, the above visualizations show a high level of volatility in all three crypto assets. From a time-series perspective, the daily returns column resembles a white noise process, which suggests that the underlying price series behaves like a random walk.

3.6 Analysis from the year 2020: We were also interested in how the three crypto assets have performed from 2020 onward. For that, we first filter the data frames using the filter() method defined in the DataFrame class, then create a SQL view on top of each of the three filtered data frames so that we can run SQL queries using the sql() method defined in the org.apache.spark.sql.SQLContext class.

Block of code to perform filtering and creation of SQL view on top of the filtered data frame
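
A sketch of the filter-then-view step for one asset, assuming a Year column from the preprocessing step. The view name bitcoin_since_2020 and the sample values are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("filter-view-sketch")
  .getOrCreate()

// Made-up rows standing in for the preprocessed Bitcoin data frame.
val bitcoin = spark.createDataFrame(Seq(
  ("2019-12-31", 2019, 90.0),
  ("2020-01-01", 2020, 100.0),
  ("2021-01-01", 2021, 300.0)
)).toDF("Date", "Year", "Close")

// Keep only the observations from 2020 onward.
val bitcoinSince2020 = bitcoin.filter(col("Year") >= 2020)

// Register the filtered data frame as a temporary SQL view
// (hypothetical view name) so it can be queried with spark.sql().
bitcoinSince2020.createOrReplaceTempView("bitcoin_since_2020")

spark.sql("SELECT Date, Close FROM bitcoin_since_2020 ORDER BY Date").show()
```

The same two steps would be repeated for the Ethereum and Cardano data frames.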

After creating the SQL views on top of the filtered data frames as per the previous snippet of code, we execute a SQL query that performs an inner join of the three filtered data frames so that we have access to the prices of all three crypto assets for a given date.

Block of code responsible for performing inner-join by leveraging SQL views
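
A sketch of such a join, with hypothetical view names (bitcoin, ethereum, cardano), made-up prices, and Date assumed as the join key:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("join-sketch")
  .getOrCreate()

// Made-up closing prices; Cardano deliberately lacks the first date.
spark.createDataFrame(Seq(("2021-01-01", 100.0), ("2021-01-02", 110.0)))
  .toDF("Date", "Close").createOrReplaceTempView("bitcoin")
spark.createDataFrame(Seq(("2021-01-01", 10.0), ("2021-01-02", 11.0)))
  .toDF("Date", "Close").createOrReplaceTempView("ethereum")
spark.createDataFrame(Seq(("2021-01-02", 1.5)))
  .toDF("Date", "Close").createOrReplaceTempView("cardano")

// An inner join keeps only the dates present in all three views.
val joined = spark.sql(
  """SELECT b.Date,
    |       b.Close AS BitcoinClose,
    |       e.Close AS EthereumClose,
    |       c.Close AS CardanoClose
    |FROM bitcoin b
    |INNER JOIN ethereum e ON b.Date = e.Date
    |INNER JOIN cardano c ON b.Date = c.Date
    |ORDER BY b.Date""".stripMargin)

joined.show()
```

Because this is an inner join, a date missing from any one asset is dropped from the result.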
The output of the previous block of code after performing INNER JOIN

4. Conclusions

From the visualizations, we can clearly see that the market capitalization of Bitcoin, Ethereum, and Cardano went up significantly in 2021, which can be attributed to the following reasons:

  • More investors are investing in crypto assets to hedge against the downside risk of the equity market, which became visible in the COVID-19 crash of late March 2020.
  • Crypto is now thought of as another store of value in addition to gold.
  • The U.S. Securities and Exchange Commission (SEC) has allowed exchange-traded funds (ETFs) based on cryptocurrency futures, which attract investors who want to invest in crypto assets or make them part of their portfolios.
  • Many services now accept cryptocurrency, and a handful of economies have recognized it as legal tender alongside fiat currency, although some large economies have either banned cryptocurrency-related activities or imposed high taxes on them.
