Small blog about different CTFs completed or software I have worked on that I find interesting.
Phishing Doc Lures
Threat actors commonly use Microsoft Office documents with VBA macros in phishing attacks to infect unsuspecting end users.
To get the user to enable macros, the documents commonly contain image lures. Some claim, for example, that the document is from an older version of Office.
Initial Attempt at OCR
InQuest wrote a blog post showing some common lures contained in malicious documents.
One method security researchers have tried for detecting documents with these lures is Optical Character Recognition (OCR).
OCR works well when the images are well formed and the threat actor hasn’t taken steps to confuse OCR systems. Some ways to make OCR harder are
lowering the image resolution,
introducing noise into the image, or
using similar colors for text and background.
Below is an image lure used by Emotet.
I used pytesseract to attempt to extract the text.
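For reference, a call like this can be as small as the sketch below. It is a minimal illustration, not the exact script used here. `pytesseract.image_to_string` is the library's real entry point and accepts a file path, a PIL image, or a NumPy array. The `extract_text` wrapper and its `ocr` parameter are hypothetical names added so the OCR engine can be swapped out, and the default path assumes the Tesseract binary is installed.

```python
from typing import Any, Callable, Optional


def extract_text(image: Any, ocr: Optional[Callable[[Any], str]] = None) -> str:
    """Run OCR on an image and return the text with surrounding whitespace removed.

    `extract_text` and `ocr` are illustrative names, not part of pytesseract.
    """
    if ocr is None:
        # Imported lazily: pytesseract also needs the Tesseract binary installed.
        import pytesseract

        ocr = pytesseract.image_to_string
    return ocr(image).strip()


# Example call (the path is a placeholder, not a file from this post):
# print(repr(extract_text("lure.png")))
```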
Below is the result of using OCR on the above image.
Unfortunately, OCR was not able to recognize anything. This is likely due to the noise and blurring near the characters.
OCR With Pre-processed Image
My first thought was to pre-process the image before passing it to OCR. I have worked on something similar in the past, so I decided to take a similar approach.
I used some pre-processing to reduce the image to two colors.
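One common way to get a two color image is a fixed brightness cutoff, sketched below with NumPy. This is an assumption about the general technique, not necessarily the exact pre-processing I ran. The function name `to_two_tone` and the cutoff of 128 are made up for illustration. Note that a fixed cutoff sends a light background and slightly darker text on it to the same side, so low-contrast lures can lose their text entirely.

```python
import numpy as np


def to_two_tone(rgb: np.ndarray, threshold: int = 128) -> np.ndarray:
    """Return a black-and-white (uint8, values 0 or 255) version of an RGB image.

    `to_two_tone` and the default threshold are illustrative assumptions.
    """
    # ITU-R BT.601 luma weights turn RGB into a single brightness value.
    gray = rgb[..., :3] @ np.array([0.299, 0.587, 0.114])
    # Everything at or above the cutoff becomes white, the rest black.
    return np.where(gray >= threshold, 255, 0).astype(np.uint8)
```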
It looks even worse. I believed I had the right idea, but not the right approach. I thought that if I could determine the background color, I could then adjust my method.
It turns out that over 71% of the image is a single color, the light blue. I thought I could list the most frequently used colors and keep only the top 4.
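Counting how much of the image each color covers can be sketched with `np.unique`, as below. The helper name `top_colors` is hypothetical, and the image is assumed to already be loaded as an H x W x 3 `uint8` array.

```python
import numpy as np


def top_colors(rgb: np.ndarray, n: int = 4):
    """Return the n most common colors as (color_tuple, fraction_of_pixels) pairs.

    `top_colors` is an illustrative helper, not taken from an existing tool.
    """
    # Flatten to one row per pixel, then count identical rows.
    pixels = rgb.reshape(-1, rgb.shape[-1])
    colors, counts = np.unique(pixels, axis=0, return_counts=True)
    # Indices of the n largest counts, most common first.
    order = np.argsort(counts)[::-1][:n]
    return [
        (tuple(int(c) for c in colors[i]), float(counts[i]) / len(pixels))
        for i in order
    ]
```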
OCR with K-Means Clustering
Turns out the top 4 colors were all background, just slightly different shades. I needed a way to combine these groups no matter the color. So I turned to clustering.
I used k-means clustering to group the pixels into 5 clusters. I then went through each cluster, setting every pixel in it to the average color for that cluster. This reduces the number of colors using colors actually present in the image rather than an arbitrary threshold.
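The clustering step can be sketched with a plain NumPy version of k-means (Lloyd's algorithm), shown below. This is a minimal illustration under assumptions: the function name `kmeans_quantize`, the iteration cap, and the seeded random start are all made up here, and a library implementation such as scikit-learn's `KMeans` would be a more typical choice for full-size images, since this version builds a pixels x clusters distance table in memory.

```python
import numpy as np


def kmeans_quantize(rgb: np.ndarray, k: int = 5, iters: int = 20, seed: int = 0) -> np.ndarray:
    """Replace every pixel with the mean color of its k-means cluster.

    `kmeans_quantize`, `iters`, and `seed` are illustrative assumptions.
    """
    pixels = rgb.reshape(-1, rgb.shape[-1]).astype(np.float64)
    rng = np.random.default_rng(seed)
    # Start from k distinct colors that actually occur in the image.
    unique = np.unique(pixels, axis=0)
    k = min(k, len(unique))
    centers = unique[rng.choice(len(unique), size=k, replace=False)]
    for _ in range(iters):
        # Assign each pixel to its nearest center (squared Euclidean distance).
        dists = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the average color of its members.
        new_centers = centers.copy()
        for j in range(k):
            members = pixels[labels == j]
            if len(members):
                new_centers[j] = members.mean(axis=0)
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    # Paint each pixel with its cluster's average color.
    return centers[labels].round().astype(np.uint8).reshape(rgb.shape)
```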
Still no luck, but I decided to try 2 clusters, to get the most contrast possible.
YES!!! Progress. It still can’t recognize everything, but we can clearly see indicators of a lure. Let’s see if we can improve this.
OCR With Further Improvements
I looked into using Bézier curves to estimate the characters, redraw them somewhere else, and pass that to the OCR.
I wasn’t able to get anywhere, but I remembered someone on Stack Overflow suggesting that making the image bigger before OCR helps improve the output.
So I decided to give this a try and
resized the image to 3 times its normal size,
clustered on colors,
reduced colors to 2 tones, and
passed to OCR.
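Put together, the four steps above might look like the sketch below. It is an illustration under several assumptions, not the code in my tool: `upscale`, `two_tone`, and `read_lure` are hypothetical names, the resize is a simple nearest-neighbor repeat to keep the example NumPy-only (an interpolating resize such as Pillow's `Image.resize` or OpenCV's `cv2.resize` is a more likely choice in practice), and the 2-cluster k-means is seeded with the darkest and brightest pixels for determinism. Only `pytesseract.image_to_string` is a real library call.

```python
import numpy as np


def upscale(rgb: np.ndarray, factor: int = 3) -> np.ndarray:
    """Enlarge an image by an integer factor using nearest-neighbor repetition."""
    return rgb.repeat(factor, axis=0).repeat(factor, axis=1)


def two_tone(rgb: np.ndarray, iters: int = 20) -> np.ndarray:
    """Cluster pixel colors into 2 groups and paint each with its average color."""
    pixels = rgb.reshape(-1, rgb.shape[-1]).astype(np.float64)
    brightness = pixels.sum(axis=1)
    # Seed the two centers with the darkest and the brightest pixel.
    centers = np.stack([pixels[brightness.argmin()], pixels[brightness.argmax()]])
    for _ in range(iters):
        # Assign each pixel to the nearer center, then recompute the averages.
        dists = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(2):
            if np.any(labels == j):
                centers[j] = pixels[labels == j].mean(axis=0)
    return centers[labels].round().astype(np.uint8).reshape(rgb.shape)


def read_lure(rgb: np.ndarray, ocr=None) -> str:
    """Resize x3, reduce to 2 tones, and pass the result to OCR."""
    prepared = two_tone(upscale(rgb, 3))
    if ocr is None:
        # Imported lazily: pytesseract also needs the Tesseract binary installed.
        import pytesseract

        ocr = pytesseract.image_to_string
    return ocr(prepared).strip()
```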
And it WORKED! It improved the output drastically and now you can clearly read the lure text.
Here it is running on a lure from Zloader.
I have updated my tool, doctools, which I created to extract images from DOC files. It still needs to be expanded to support DOCX and Excel formats. Below is the output from running it on a recent Emotet sample.