aleno-malicious-smart-contract-detection-ml

v0.1.6

Published

3 months ago

This bot detects when suspicious smart contracts are deployed.

angeloca

Malicious Smart Contract ML

Description

This repository contains an advanced smart contract detection bot, designed by Aleno, to identify malicious smart contracts on the Ethereum blockchain. The bot achieves this by analyzing the opcodes of smart contracts. Our approach builds upon the foundation laid by Forta's ML Bot.

This bot uses an improved model from the original bot by enhancing data quality and addressing the dataset's inherent imbalance.

A full description of the model can be found in malicious-contract-detection.

Model Configuration

Data Used For Training

Malicious Dataset

Our malicious dataset is primarily derived from Forta's publicly available dataset on GitHub. We have augmented this dataset with recent hacking incidents while excluding phishing contracts. Data sources include Forta, DeFiMon, and DeFiHackLabs.

We've further refined the dataset by eliminating contracts with duplicated code, resulting in 154 malicious contract records.

Benign Dataset

Benign contract data was sourced from Ethereum smart contracts verified on Etherscan, accessible here. We extended this dataset with 5000 recently verified contracts from Etherscan to align with recent hacking incidents. To maintain consistency with Forta's work, we chose to use only 15000 records from this dataset.

Algorithm

For the analysis of smart contracts' opcodes and the extraction of common and important opcodes found in malicious and benign contracts, we employed a technique borrowed from natural language processing called TF-IDF (term frequency–inverse document frequency). This technique extracts numerical features from text (opcodes, in this case). These features are then fed into a LogisticRegression model to predict whether a contract is malicious or not.

TF-IDF extracts opcodes in chunks: unigrams, bigrams, trigrams, and 4-grams.
- Example of a unigram: PUSH1
- Example of a 4-gram: PUSH1 MSTORE PUSH1 CALLDATASIZE
Analyzing in chunks helps retain the relative position information of the smart contract opcodes.
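
The chunking described above can be sketched in plain Python. This is a minimal illustration and not the bot's own code: the function name `opcode_ngrams` is hypothetical, and it only shows how 1-grams to 4-grams are cut from an opcode sequence, not the TF-IDF weighting. In practice a library vectorizer, for example scikit-learn's `TfidfVectorizer` with `ngram_range=(1, 4)`, would typically handle both steps.

```python
from collections import Counter


def opcode_ngrams(opcodes, max_n=4):
    """Return every contiguous n-gram (n = 1..max_n) as a space-joined string."""
    grams = []
    for n in range(1, max_n + 1):
        for start in range(len(opcodes) - n + 1):
            grams.append(" ".join(opcodes[start:start + n]))
    return grams


# Opcode sequence taken from the 4-gram example above.
sequence = ["PUSH1", "MSTORE", "PUSH1", "CALLDATASIZE"]

# Term counts per n-gram; these are the raw inputs to a TF-IDF weighting.
counts = Counter(opcode_ngrams(sequence))
```

Because a bigram such as `PUSH1 MSTORE` is counted separately from `MSTORE PUSH1`, the features keep some of the ordering that a bag of single opcodes would lose.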

NOTE: Compared to the original Forta work, we improved the model training phase using the SMOTE oversampling technique to synthesize malicious contract records and address the fact that the malicious contract class represents only 1% of the dataset.
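
The core idea behind SMOTE, creating new minority-class rows by interpolating between existing ones, can be illustrated with a short standard-library sketch. This is a simplified illustration only: the function name `smote_like_samples`, the toy feature vectors, and the parameter values are assumptions, and the source does not state which SMOTE implementation was used for training (a common choice is `imblearn.over_sampling.SMOTE` from imbalanced-learn).

```python
import random


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy 2-D vectors standing in for TF-IDF rows of malicious contracts
# (illustrative values, not real data).
malicious = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_rows = smote_like_samples(malicious, n_new=6)
```

Each synthetic row lies on a line segment between two real malicious rows, so oversampling adds variety to the minority class rather than duplicating records.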

Model Versions

The table below lists each model version with its performance metrics, alongside Forta's results for comparison:

| Model Version | Created Date | Avg Precision | Avg Recall | Avg F1-Score | Alert Rate | Notes |
|---------------|--------------|---------------|------------|--------------|------------|--------------------------|
| Forta V1 | 09/30/2022 | 88.6% | 59.4% | 69.6% | 222.125 | |
| Forta V2 | 11/05/2022 | 73.36% | 48.37% | 53.97% | 112.75 | FP Mitigation for V1 |
| Forta V3 | 02/06/2023 | 87.78% | 55.195% | 62.077% | TODO | FP Mitigation for V2 |
| Aleno V1 | 18/12/2023 | 81.17% | 87% | 84% | 68 | No FP mitigation yet |

For Forta models, average precision and recall were calculated via stratified 5-fold cross-validation with a decision threshold of 0.5. Aleno results were obtained on a test dataset instead. Because SMOTE changes the training data, the datasets are not the same and the two sets of results are not directly comparable.
Alert Rate is the number of daily alerts on Ethereum, averaged over 7 days.
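
As a quick check on how these columns relate, the sketch below computes an F1-score from a precision and recall pair and an alert rate from daily counts. The function names are hypothetical, and the daily counts are made-up illustration values, not measurements from the bot.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def alert_rate(daily_alert_counts):
    """Mean number of alerts per day over the observed window."""
    return sum(daily_alert_counts) / len(daily_alert_counts)


# Precision and recall from the Aleno V1 row of the table.
aleno_f1 = f1_score(0.8117, 0.87)

# Hypothetical daily alert counts for one week, for illustration only.
example_rate = alert_rate([3, 5, 4, 6, 2, 4, 4])
```

For the Aleno V1 row this gives roughly 0.84, in line with the reported F1-score. Applying the same formula to averaged precision and recall need not reproduce an averaged F1, since the mean of per-fold F1 scores generally differs from the F1 of the mean precision and recall.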

Improvements

Chain-Specific Models: The model is currently trained exclusively on Ethereum smart contracts. To enhance its effectiveness, it may be beneficial to create machine learning models tailored to each blockchain, trained on chain-specific smart contracts. For instance, a dedicated model could be trained specifically for Binance Smart Chain (BSC) contracts, taking into account the unique characteristics of that blockchain.
Dynamic Management of Known Contracts: The model currently operates only on unknown contracts, while known contract types are identified by their signature. However, the list of known contracts is static. To improve the model's adaptability, consider implementing a mechanism that dynamically updates the list of known contract hashes. This can be achieved by regularly adding newly observed, frequently encountered contract hashes to common_contract_hash_set.json, so that the bot stays up to date with emerging contract types.
False Positive Mitigation: Improve false positive mitigation by analyzing cross-chain funding and by querying an API to check whether contract deployers have known labels. An open question is how to improve the mitigation that relies on the Etherscan API for verified contracts: because Etherscan can lag behind the bot, a contract may initially trigger a CRITICAL alert that a second, delayed bot could later downgrade.
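
For the dynamic management of known contracts suggested above, one possible shape of the update step is sketched below. Everything here is an assumption made for illustration: the function name `update_known_hashes`, the `min_count` cutoff, the placeholder hash strings, and the idea that common_contract_hash_set.json holds a JSON array of hash strings. The actual file layout and hashing scheme are not described in this README.

```python
import json
from collections import Counter


def update_known_hashes(known_hashes, observed_hashes, min_count=3):
    """Return known_hashes extended with hashes seen at least min_count times."""
    counts = Counter(observed_hashes)
    frequent = {h for h, c in counts.items() if c >= min_count}
    return set(known_hashes) | frequent


# Placeholder values, not real contract hashes.
known = {"hash_a"}
observed = ["hash_b", "hash_b", "hash_b", "hash_c", "hash_a"]

updated = update_known_hashes(known, observed)

# Assumed file layout: a JSON array of hash strings, sorted for stable diffs.
serialized = json.dumps(sorted(updated))
```

A frequency cutoff keeps one-off deployments out of the known set, so only contract types that appear repeatedly are exempted from model scoring.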

Supported Chains

Ethereum
BSC
Polygon
Optimism
Arbitrum
Avalanche
Fantom

Alerts

SUSPICIOUS-CONTRACT-CREATION
- Fired when a new non-token and non-proxy contract is predicted as malicious.
- Metadata will include the following:
  - Link to OKO Contract Explorer to review decompiled contract code and ABI. This only works for Ethereum.
  - Function sighashes
  - ML model score and threshold
  - Addresses observed in the created contract (either through storage or static analysis)
  - Any wallet tags associated with the addresses. The bot queries the wallet tags from Luabase. This only works for Ethereum.
- Finding type: Suspicious
- Finding severity: High
- Attack Stage: Preparation
SUSPICIOUS-CONTRACT-CREATION-SUSPICIOUS-FUNDING
- Fired when a new non-token and non-proxy contract is predicted as malicious and the funding analysis of its deployer is suspicious (for example, funding through privacy tools such as Tornado Cash)
- Metadata will include the following:
  - Link to OKO Contract Explorer to review decompiled contract code and ABI. This only works for Ethereum.
  - Function sighashes
  - ML model score and threshold
  - Addresses observed in the created contract (either through storage or static analysis)
  - Any wallet tags associated with the addresses. The bot queries the wallet tags from Luabase. This only works for Ethereum.
- Finding type: Suspicious
- Finding severity: Critical
- Attack Stage: Preparation
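
The relationship between the two alerts can be summarized in a short sketch. It is not the bot's implementation: the `Alert` class and `select_alert` function are hypothetical, the comparison `model_score >= threshold` is an assumption about how the threshold is applied, and the sketch assumes a single alert is emitted per contract.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    alert_id: str
    severity: str
    finding_type: str = "Suspicious"
    attack_stage: str = "Preparation"


def select_alert(model_score, threshold, suspicious_funding):
    """Return the alert for a new non-token, non-proxy contract, or None."""
    if model_score < threshold:
        return None  # not predicted as malicious, so no alert
    if suspicious_funding:
        # Malicious prediction plus suspicious deployer funding.
        return Alert("SUSPICIOUS-CONTRACT-CREATION-SUSPICIOUS-FUNDING", "Critical")
    return Alert("SUSPICIOUS-CONTRACT-CREATION", "High")


# Example values chosen for illustration only.
high = select_alert(0.9, 0.5, suspicious_funding=False)
critical = select_alert(0.9, 0.5, suspicious_funding=True)
quiet = select_alert(0.1, 0.5, suspicious_funding=True)
```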
