My failed LSB steganography algorithm

Published

Estimated Reading Time: 7 minutes

Last month (December 2014), I started developing a new GUI steganography software after building a simple steganography tool for my post at Infosec Institue. The simple tool (stegman) used a really simple approach that can be thought of and implemented by anyone in few minutes.

stego girl

Image exif data is stored in about the first 30 hex values of the image hexadecimal data. The number may not be accurate enough but the point remains exif data is stored at the top of the image at it should be kept untampered with to retain a proper and valid image format. The software appends data to the end of image files.

From my analysis as done on the infosec article, most JPEG file formats have a hexadecimal tail of 0xFFD9 and PNG images have 0x426082. I couldn’t play around with GIF or BMP as there was inconsistency with the hexadecimal structure of most of them I had examined. The tool had a little function to grab hexadecimal values with the binascii module:

import sys, re, binascii, string
def gethex(image):
  f = open(image, 'rb')
  data = f.read()
  f.close()
  hexcode = binascii.hexlify(data)
  return hexcode

Every steganography tool performs two main functions which are embed and extract. Before file is embeded, it checks to see if there is any extra data after the usual JPEG or PNG tail hex values

def extradatacheck(data, type):
  if type == 'png':
    pattern = r'(?<=426082)(.*)'
  elif type == 'jpg':
    pattern == r'(?<=FFD9)(.*)'
  match = re.search(pattern, data)
  if match:
    return match.group(0)
  else:
    false

The embed function

def embed(embedFile, coverFile, stegFile):
  filetype = coverFile[-3:]
  stegtype = stegFile[-3:]
  if filetype != 'png' and filetype != 'jpg':
    print 'Invalid format'
  elif filetype != stegtype:
    print 'Output file has to be in the same format as cover image (%s)' % string.swapcase(filetype)
  else:
    data = open(embedFile, 'r').read()
    info = gethex(coverFile)
    if extradatacheck(info, filetype):
      print 'File already contains embedded data'
    else:
      info += data.encode('hex')
      f = open(stegFile, 'w')
      f.write(binascii.unhexlify(info))
      f.close()
      print 'Storing data to', stegFile

The function ends up converting the manipulated hex back to ASCII and writes to the new output file. The extract function performs the same check for appended data after regular tails and converts found data to ASCII if found

def extract(stegFile, outFile):
  filetype = stegFile[-3:]
  data = gethex(stegFile)
  if extradatacheck(data, filetype):
    store = open(outFile, 'w')
    store.write( binascii.unhexlify(extradatacheck(data, filetype)) )
    store.close()
    print 'Extracted data stored to', outFile
  else:
    print 'File has no embedded data in it'

The program achieved its objective which is the Steganography process but the sophistication level was 10%. Not good enough I thought. Oh! It wasn’t just me, Daniel Lerch thought the same too


I explored Oni49’s stegoBlue and created a fork to understand how it worked and see how I could derive a new implementation from it. Took a while but I later found how he used the PIL (Python Imaging Library) to list image data in RGB tuples. StegoBlue is a manipulation of the blue pixels in a pixel list.

from PIL import Image
img = Image.open('cool.bmp')
pixelList = list(img.getdata())

The image data in RGB tuples were so much in the list but I managed to grab a few from the bottom to show what the data looked like.

[(14, 16, 15), (14, 16, 15), (11, 15, 14), (13, 17, 16), (13, 17, 16), (13, 17, 16), (15, 16, 18), (14, 15, 17), (13, 13, 15), (15, 15, 17), (16, 16, 16), (17, 17, 17), (18, 20, 19), (15, 17, 16), (13, 17, 18), (14, 18, 19), (16, 20, 21), (17, 21, 22), (17, 21, 20), (15, 19, 18), (16, 18, 13), (18, 20, 15), (18, 23, 19), (18, 23, 19), (20, 24, 23), (22, 26, 25), (23, 27, 26), (24, 28, 27), (25, 29, 28), (23, 27, 26), (26, 30, 29), (26, 30, 29), (26, 28, 27), (27, 29, 28), (29, 29, 29), (33, 33, 33), (33, 31, 32), (31, 31, 31), (28, 30, 29), (27, 31, 30), (26, 28, 27), (27, 29, 28), (26, 28, 27), (27, 29, 28), (23, 29, 27), (24, 30, 28), (20, 31, 27), (19, 30, 26), (19, 31, 27), (19, 30, 26), (15, 24, 21), (14, 18, 17), (14, 18, 17), (18, 20, 19), (21, 21, 23), (19, 19, 21), (18, 16, 19), (17, 15, 18), (18, 16, 19), (17, 15, 18), (17, 15, 20), (17, 15, 20), (17, 15, 18), (17, 15, 18), (15, 13, 16), (15, 13, 16), (14, 14, 14), (13, 13, 13), (11, 11, 11), (12, 12, 12)]

That’s just about 5% of the whole data from a 5M BMP image. I was digging his approach even though it breaks sometimes. I tried with JPGs and PNGs and it passed with some of the files that were tested.

###My New Algorithm

#!/usr/bin/env python
import binascii, os, base64, gnupg, hashlib
from Crypto.Cipher import AES
from Crypto import Random

def embed(file, text, key, output = 'output.jpg'):
  #==== Using GPG  ====
  gpg = gnupg.GPG()
  cipher = gpg.encrypt(text, recipients=None, symmetric='AES256', passphrase=key, armor=True)
  ctext = hashlib.md5( str(cipher) ).hexdigest()
  #==== Using AES ====
  #iv = Random.new().read(AES.block_size)
  #cipher = AES.new(key, AES.MODE_CFB, iv)
  #ctext = hashlib.md5( iv + cipher.encrypt(text) ).hexdigest()

  ctexthex = binascii.hexlify( ctext )
  ctextbin = bin( int(ctexthex, 16) )[2:]
  print len(ctextbin)
  try:
    f = open(file, 'r')
    filebin = f.read()
    hexdata = binascii.hexlify(filebin)
    # tuples of each byte in hex
    bytesTuple = zip(hexdata[::2], hexdata[1::2])
    # list of every byte in hexadecimal
    bytes = [''.join(tuple) for tuple in bytesTuple]
    # split bytes into two keeping first segment untouched to avoid metadata tampering
    byteDivisor = len(bytes) / 2
    byteSegment1, byteSegment2 = bytes[:byteDivisor], bytes[byteDivisor:]
    print 'Segment 2 length: '+ str(len(byteSegment2))
    for i in range( len(ctextbin) ):
      # modifying the LSB
      binary = bin(int( byteSegment2[i], 16) )[:-1] + ctextbin[i]
      hexback = hex(int(binary, 2))[2:] if len(hex(int(binary, 2))[2:]) == 2 else '0' + hex(int(binary, 2))[2:]
      byteSegment2[i] = hexback
    # rejoin both byte segments
    bytes = byteSegment1 + byteSegment2
    # converting bytes list back to string
    mergehex = ''
    for byte in bytes:
      mergehex += byte
    rawbin = binascii.unhexlify(mergehex)
    outdata = open(output, 'w')
    outdata.write(rawbin)
  except IOError:
    print "Failed to locate file"

def extract(file, key, output = 'output.txt'):
  try:
    f = open(file, 'r')
    filebin = f.read()
    hexdata = binascii.hexlify(filebin)
    bytesTuple = zip(hexdata[::2], hexdata[1::2])
    bytes = [''.join(tuple) for tuple in bytesTuple]
    byteDivisor = len(bytes) / 2
    byteSegment1, byteSegment2 = bytes[:byteDivisor], bytes[byteDivisor:]
    dataBytes = byteSegment2[:32] # md5 data occupied 32 chars
    mergehex = ''
    for byte in dataBytes:
      mergehex += byte
    f = open(output, 'w')
    f.write( binascii.unhexlify(mergehex) )
  except IOError:
    print 'Failed to locate file'


filename = raw_input("Enter the name of the file:")
embed(filename, 'some awesome stuff', 'abcdefghijklmnop', 'output.png')
#extract(filename, 'abcdefghijklmnop')

What the heck is going on here? I know that’s like bunch of crap but it seemed like a nice idea to me. Let me explain:

I built a GUI already with PyQt4 with hopes that the logic will work just as I’d thought it would. On completion of a functionless nice GUI, I thought it will be nice to create a separate module to handle the logic of the software implementation.

When I wrote about symmetric encryption in python, I mentioned how I had used AES from the Crypto module to try achieving a symmetric encryption. I might have not been at my best with that module but it didn’t seem to work fine enough for me. I had resorted to the GPG module which stored the characters

some awesome stuff

with the cover image file as this

Steganography input

The resulting output has had some major pixels tampered with

Steganography output

The distortion is from the middle to the bottom only because I splitted the image hex into two to leave the first half containing EXIF data untampered with. The other half whose LSB was modified now produces a malformed image output.

A steganography software is to modify the media file with no obvious changes but I have a modification with a way too obvious change. I’ve had to put this aside for a while to get on with other work. If you have any suggestions to this algorithm, I’ll appreciate them

show comments