Phishing Doc Lures

Threat actors commonly use Microsoft Office documents with VBA macros in phishing attacks to infect unsuspecting end users. To get the user to enable macros, the documents commonly contain image lures. Some claim the document is

  • from an older version of office,
  • encrypted, or
  • protected.

Initial Attempt at OCR

InQuest wrote a blog post showing some common lures contained in malicious documents. One of the methods security researchers have used to detect documents with these lures is Optical Character Recognition (OCR). OCR works very well if the images are well formed and the threat actor hasn't taken steps to confuse OCR systems. Some methods to make OCR harder (illustrated in the sketch after this list) are

  • lowering the image resolution,
  • introducing noise into the image, or
  • using similar colors for text and background.
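
For illustration only, here is a minimal Pillow/numpy sketch of what the first two degradations might look like (the file names and parameter values are made up, not taken from any real sample):

# illustrative only: degrading a lure image to confuse OCR
import numpy as np
from PIL import Image

img = Image.open('lure.png').convert('RGB')        # hypothetical input image

# 1. lower the resolution, then scale back up, which blurs character edges
small = img.resize((img.width // 3, img.height // 3))
img = small.resize((img.width, img.height))

# 2. introduce random noise
arr = np.asarray(img).astype(np.int16)
noise = np.random.randint(-25, 26, arr.shape)
img = Image.fromarray(np.clip(arr + noise, 0, 255).astype(np.uint8))

# (using similar colors for text and background would require editing the
# original artwork, so it is not shown here)
img.save('degraded_lure.png')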

Below is an image lure used by Emotet.
[Image: Emotet lure]

I use pytesseract for OCR to attempt to extract the text.

# quick script for using OCR
import sys
import pytesseract
from PIL import Image

img = Image.open(sys.argv[1])
print(pytesseract.image_to_string(img))  # some images are too blurry for OCR to pick anything up

Below is the result of using OCR on the above image.

image_ocr.py 110eadfb5f462cfd22bfbcb0d8cc0b218cdb720a357997e4afeb636491f8ffaa_image.png 
ÿIOfiiaeSGS

mmwnmmummm

“mumummmmwmm
autumn-magnum“

Unfortunately, OCR was not able to recognize anything. This is likely due to the noise and blurring near the characters.

OCR With Pre-processed Image

My first thought was to apply some pre-processing before passing the image to OCR. I had worked on something similar in the past and decided to take the same approach: reduce the image to a two-color image.
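
The exact pre-processing I used is not reproduced here; as a minimal sketch, one way to get a two-color image with Pillow is to convert to grayscale and apply a fixed threshold (the threshold value below is arbitrary):

# rough sketch of two-color pre-processing; the threshold of 128 is arbitrary
from PIL import Image

img = Image.open('lure.png').convert('L')                         # grayscale
two_tone = img.point(lambda p: 255 if p > 128 else 0).convert('1')
two_tone.save('two_tone.png')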
[Image: pre-processed two-color version of the lure]

image_ocr.py 110eadfb5f462cfd22bfbcb0d8cc0b218cdb720a357997e4afeb636491f8ffaa_mod_image.png 
fl ,A ,
Ulhoirmt: 3m;

-2½.- m m-yun ‘ mi v gum-L: ‘/:)‘nm- a“ mlwwm

It looks even worse. I believed I had the right idea but not the right approach. I thought that if I could determine the background color, I could adjust my method.

from collections import Counter
from PIL import Image

i = Image.open('edit.jpg')
colors = Counter(i.getdata())
print(colors[max(colors, key=colors.get)])  # count of the most common color
# 124104

print(i.size)
# (707, 244)

print(707 * 244)
# 172508

print(124104 / 172508.0)
# 0.7194101143135391

It turns out that over 71% of the image is a single color, the light blue background. I thought I could simply list the most frequently used colors and keep only the top 4.

# the 4 most frequently used colors (e.g. taken from colors.most_common(4))
t = [(0, 174, 234), (1, 174, 234), (0, 175, 234), (0, 174, 235)]

# replace every pixel that is not one of the top 4 colors with the dominant background color
for x in range(i.width):
    for y in range(i.height):
        if i.getpixel((x, y)) not in t:
            i.putpixel((x, y), (0, 174, 234))

i.save('edit_4_colors.jpg')

[Image: result of keeping only the top 4 colors]

OCR with K-Means Clustering

It turns out the top 4 colors were all background, just in slightly different shades. I needed a way to combine these groups no matter what the colors were, so I turned to clustering.

# Modified from code by Peter Hansen, https://stackoverflow.com/questions/3241929/python-find-dominant-most-common-color-in-an-image
import binascii
import sys

import imageio
import numpy as np
import scipy.cluster
from PIL import Image

NUM_CLUSTERS = 2    # number of color clusters to reduce the image to

print('reading image')
im = Image.open(sys.argv[1])
ar = np.asarray(im)
shape = ar.shape
ar = ar.reshape(np.prod(shape[:2]), shape[2]).astype(float)

print('finding clusters')
codes, dist = scipy.cluster.vq.kmeans(ar, NUM_CLUSTERS)
print('cluster centres:\n', codes)

vecs, dist = scipy.cluster.vq.vq(ar, codes)        # assign each pixel to its nearest cluster
counts, bins = np.histogram(vecs, len(codes))      # count pixels per cluster

index_max = np.argmax(counts)                      # find the most frequent cluster
peak = codes[index_max]
colour = binascii.hexlify(bytearray(int(c) for c in peak)).decode('ascii')
print('most frequent is %s (#%s)' % (peak, colour))

# recolor every pixel with its cluster centre and save the reduced-color image
c = ar.copy()
for i, code in enumerate(codes):
    c[np.where(vecs == i)[0], :] = code
imageio.imwrite('clusters.png', c.reshape(*shape).astype(np.uint8))
print('saved clustered image')

I used k-means clustering to group the pixels into 5 clusters, then set each pixel to the average color (the cluster centre) of its cluster. This reduces the number of colors in the image using the colors actually present in it rather than an arbitrary threshold.
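
As an aside, Pillow's built-in quantize() does a similar color reduction (median cut by default rather than k-means) and would be a one-line alternative to the script above:

# alternative color reduction using Pillow's built-in quantizer (median cut)
from PIL import Image

img = Image.open('lure.png').convert('RGB')
reduced = img.quantize(colors=2).convert('RGB')   # 2-color palette, converted back to RGB for OCR
reduced.save('quantized.png')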
[Images: lure clustered into 5 colors]

image_ocr.py 
flOfi'IGBBSS
M»mmmmmuwmw

nmummmmmwuw
mmmmu.umnmu-Iu

Still no luck, so I decided to try 2 clusters to get the most contrast possible.
[Image: lure clustered into 2 colors]

image_ocr.py 
I] Office 365

This document created In online version of Microsoft Office Word

 

To View Dr edn this document, please (llck "En-bl: edit n5" button
on me [up yellow bar, and men cnck "Enable :onzem"

YES!!! Progress. It still can't recognize everything, but we can clearly see indicators of a lure. Let's see if we can improve this.

OCR With Further Improvements

I looked into using Bézier curves to estimate the characters, redraw them cleanly, and pass the result to the OCR. I wasn't able to get anywhere with that, but I remembered someone on Stack Overflow mentioning that making the image bigger before OCR can improve the output. So I decided to give this a try (a sketch of the resize step follows the list) and

  • resized the image to 3 times its normal size,
  • clustered on colors,
  • reduced colors to 2 tones, and
  • passed to OCR.
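
Only the resize is a new step; as a sketch (the choice of resample filter here is mine), it is just:

# sketch of the added step: enlarge the image 3x before clustering and OCR
from PIL import Image

img = Image.open('lure.png')
big = img.resize((img.width * 3, img.height * 3), Image.LANCZOS)
big.save('lure_3x.png')   # then run the clustering script and OCR on this file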

[Image: resized and clustered lure]

image_ocr.py
I] Office 365

This document created in online version of Microsoft Office Word

To view or edit this document, please click "Enable editing" button
on the top yellow bar, and then click "Enable content"

And it WORKED! It improved the output drastically and now you can clearly read the lure text.

Here it is running on a lure from Zloader.

[Image: Zloader lure]

image_ocr.py fig½ 4.png 
X Document created in previous version of
MS Office Excel

To wew W5 comer“ please CNEK «Enable Edmngn (rem me yellow Darand than
ÿluck uEnabVe Cement»

[Image: clustered Zloader lure]

image_ocr.py clusters.png 
X Document created in previous version of
MS Office Excel

To view this content. please click «Enable Editing» from the yellow bar and then
click «Enable Content»

Results

I have updated my tool, doctools, which I created to extract images from DOC files. It still needs to be expanded to handle DOCX and Excel formats. Below is the output from running it on a recent Emotet sample.
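
doctools itself is not reproduced here, but as a rough sketch of the idea, embedded images in an OLE .doc can be found by scanning the document's streams for image magic bytes (the olefile calls are real; the scanning helper below is my own illustration, not doctools' actual code):

# rough sketch: dump PNG/JPEG blobs found in an OLE .doc's streams
import olefile

MAGICS = {b'\x89PNG\r\n\x1a\n': 'png', b'\xff\xd8\xff': 'jpg'}

ole = olefile.OleFileIO('sample.doc')
count = 0
for stream in ole.listdir():
    data = ole.openstream(stream).read()
    for magic, ext in MAGICS.items():
        offset = data.find(magic)
        if offset != -1:
            # naive: write from the magic bytes to the end of the stream
            with open('img_%d.%s' % (count, ext), 'wb') as out:
                out.write(data[offset:])
            count += 1
ole.close()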

python extract_img.py -f ../Z_ZG9552902820ZM.doc -o
most frequent is [254.55600768 254.74310052 254.55694112] (#fefefe)
I] Office 365

This document only available for desktop or laptop versions of Microsoft Office Word.
To open the document, follow these steps:

Click Enable editing button from the yellow bar above,
Once you have enabled editing, please click Enable content button.
Suspious words found in extracted text
['image sha256 hash is: 07e0449189d63cb013e70a44d60411965cf32e6d5880cb7c2ad64470130b453e']
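
The "suspicious words" check is essentially keyword matching on the OCR output; a minimal sketch of the idea (this keyword list is illustrative, not doctools' actual list):

# minimal sketch: flag common lure phrases in OCR output (keyword list is illustrative)
SUSPICIOUS = ['enable editing', 'enable content', 'protected', 'previous version of']

def suspicious_words(ocr_text):
    text = ocr_text.lower()
    return [word for word in SUSPICIOUS if word in text]

# e.g. suspicious_words(pytesseract.image_to_string(img))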

Below is an example of running OCR on an image with a multicolored background. The image is from InQuest, a1cd9613ecd69483134f09d9794965396f224579feeb6aec58d4c11b76b19344.

[Image: lure with a multicolored background]

malspam@malspam:~/malspam_analytics/year=2020/month=04$ python image_ocr.py EX820fYXQAE059F\?format\=jpg 
This invoice is protected
by Microsoft Windows

1. Open the invoice in Microsoft Office. Seeing on the web isn't
accessible for ensured archives.

2. On the off chance that you'vejust opened it by means of Microsoft
Office and you see a brief to Enable Edmng as well as Enable Comom, it
would be ideal if you empower either or both

rmricrmwav >wM~Mm m mm. H . r 'l'it'n‘w mum“. in mm.

 

3. When you‘ve clicked Enable Conlem, the invoice will be safely
downloaded.

xiumnvmwlluc r. :‘mrlyl.aprcmira.l.n

[Image: clustered version of the multicolored lure]

image_ocr.py clusters.png 
1. Open the invoice in Microsoft Office. Seeing on the web isn't
accessible for ensured archives.

2. On the off chance that you'vejust opened it by means of Microsoft
Office and you see a brief to ' i ' i as well as i ‘ ' , it
would be ideal if you empower either or both.

i PROTECTED VIEW Se careful—files from the Internet can contain ‘-'lrUSCS Unless you need to edit. it‘s. safer to stay In Protected Vic-w Enable Editing

 

3. When you‘ve clicked 7 . e ,the invoice will be safely
downloaded.

 

Cl," SECURITY WARNING Memos. have been disabled, Enable ("mm
Written on June 9, 2020