Java byte-code obfuscation using Stratego

 Introduction

To prevent compiling a distribution of a program for every platform it has to run on,modern imperative language like Java and .NET compile in some intermediate form. For Java this intermediate form is called byte-code and .NET calls it Microsoft Intermediate Language (MSIL). Besides the advantages of compile-once-run-everywhere also secure execution of third-party programs in a sandbox environment is possible. But thedrawback is that this intermediate format has to contain all or most information that’s also present in the original source code. Since such codes are relative easy to decompile,they increase the risk of malicious reverse engineering attacks.There exists several ways to protect this code . One way to protect reverse engineering of your code is to not distribute it completely, but only distribute an interface. This can be in the form of a web-service, a web-interface or some other kind of client-server implementation. Although the advantages of this approach are numerous, it’s not always what you want, because of limitations on network bandwidth, latency and availability.Encryption of code is another possibility. However, unless the entire encryption and decryption process takes place in hardware, it is possible for the user to intercept and decrypt the code. Unfortunately, specialized hardware tends to limit the portability of programs.Distribute programs in a form less vulnerable to decompilation is another possibility. When native object codes are supplied instead of Java byte-code, the task of decompilation is made more difficult, although not impossible. But when native object codes have to be supplied the above described advantages of an intermediate form such as byte-code are thrown away.Code obfuscations transform a program to an equivalent one that it is more difficult to understand and reverse engineer. This approach preserves all the advantages of an intermediate compiled form such as byte-code or MSIL, but can never protect an application against all malicious reverse engineering attacks.

Some obfuscating transformations

After being obfuscated, the program still must produce the same results, although it may execute slower or have additional side effects, due to added code. There is a trade-off between the security provided by code obfuscation and the execution time-space penalty imposed on the transformed program. Obfuscation transformations can be classified in the following three categories .

 Layout obfuscations

Layout obfuscations are some simple transformations concerning the information provided by the naming of methods, variables, comments and debug info. Most current Java obfuscators target the lexical syntax of the application, such as source code formatting, commends removal, debug info removal and identifier scrambling. After this transformation the program still is executed in exact the same way. Data- and control obfuscation are more interesting transformations.

Data obfuscations

The obfuscation of data can be divided in three categories, data storage, data aggregation and data ordering. All three affect the manner of how data is accessed within the program.

 Storage and encoding

Data encoding involves the use of variables. For instance one could rewrite the use of an
variable i in terms of 2*i-1.

Untitled

Further, a local variable can be converted to a global and sometimes the other way around. A more advanced variant is the use of aliasing. Several global variables could be created pointing to the same object. Copying the references and using them under
different names or setting some of them to null makes it difficult to understand.Another use in this category is the use of cluttering up the use of standard API’s. The calls to the Java API are a good entry point to understand what happens in that part of the application. Obfuscating these standard calls are not possible without providing your ownimplementation of the API. However, it is possible to delegate all operations performed
on these standard objects. For instance

Untitled

Untitled

Placing the different operations all scattered throughout the program tends to be confusing.

Aggregation

An example of the aggregation of data is converting a two-dimensional-array to a onedimensional-array. OrderingIn stead of accessing a location of an array, the position also could be retrieved invoking an method returning the correct location.

Untitled

 Control flow obfuscations

Control obfuscations influences the order the statements are executed in. But this can also mean confusing the programmer by trying to let him think something has a meaning, while it actually has not. Inserting methods similar to each other, without ever invoking them or with invoking it and discarding the result, tend to be very difficult to understand.The cluttering prevents the programmer to be able to understand the whole picture of the application. Understanding an algorithm used is made virtually impossible using this kind
of transformations.

Aggregation

Aggregation influences the groupment of the statements. For example method inlining and interleaving methods. With method inlining a method call is replaced with the body of that method leading to lots of duplicate code.

Untitled

Interleaving methods means fusing methods. Methods with the same amount and type of parameters and with the same return type are easy to combine. Lets consider the following example.

The three methods have the same structural type and could therefore easily be combined into one method.

Untitled

As can be seen in the figure every call is being rewritten to an call with an extra parameter. This parameter is used in the if or switch to execute the right body.The last technique discussed in this category is the clone method obfuscation. It confuses
the programmer by simulating two different calls to two different methods, while in fact both the calls execute the same body.In the example shown below an class A contains an add method. This method is invoked multiple times. This class and method however could be split up in two. In the result shown below the method and class are split up. Class B however still is of type A.B.add() contains the correct implementation and ensures that A.add() never will be invoked. The if always evaluates to false creating always an B object. A.add() just is an incorrect bogus implementation. Notice that B.add() and A.add2() have the same correct implementation, and hence the name method cloning. Finally the two different methods,with the same implementation, are invoked leading to the same result as the original code.

Untitled

Untitled

Ordering

Affects the order statements are executed in. Loops can be made to iterate backwards instead of forward.

Computation

Control computations are trying to hide the real control flow of an program. This can be achieved by for instance dead code insertion.Another technique is affecting the flow with for instance loops. Java has no goto statement , however there is a Java bytecode goto instruction . Now we can make use of the bytecode goto statement to create a non-reducible control flow graph. Control flow graphs produced from Java programs will always be reducible, but the Java bytecode can express non-reducible flow graphs. Expressing non-reducible flow graphs becomes very awkward in languages without gotos. A decompiler would have to turn a non-reducible flow graph into one with either duplicate code or with extraneous booleans. Turning a structured loop into a loop with multiple headers will achieve this.

 Preventive transformation

This type of transformation doesn’t really obfuscate the source, but tries to attack the decompiler. One example is adding extra instructions after the RETURN byte code instruction. This transformation doesn’t affect the runtime behavior, for it is valid in byte code format, but makes the life of decompilers much harder.Also the reconstruction of nice for loops can be obstructed by adding extra useless
administration.

Untitled

Another annoying transformation is the use of overloading. In Java methods can have the same name as long as the required parameters are of an different type (or different order of the types) regardless of the return value type. In bytecode this requirement is not applicable, for the return type is also taken in consideration .Therefore the following example is valid in byte code but not in Java code.

Untitled

It is needless to say that it would be quite hard for the decompiler to detect this and
decide which one should be ignored.

 Implementation

This chapter describes the implementation of a simple Java byte-code obfuscator implemented in Stratego [7]. First the global architecture is discussed and later each of the different transformations applied is discussed. We have implemented all layout obfuscations and a aggregation control obfuscation named method interleaving.

Architecture

The global architecture consists of a pipeline of different Stratego programs. The input of that pipeline is a list of files and the output is a list of files, this is achieved with the strategy xtc-multi-io-wrap. Below the complete source-code of this pipeline.

main =
xtc-multi-io-wrap(map(xtc-transform(!”class2aterm”, ![“-c”]))
;map(xtc-transform(!”relative2label”))
;map(read-from)
;rename
;interleave-methods
;map(write-to)
;map(xtc-transform(!”label2relative”))
;map(xtc-transform(!”aterm2class”))
)

First the Stratego program class2aterm is mapped over each class-file, so each binary class-file is parsed into the aterm format. Each jump in the Java byte-code is absolute, class2aterm translates these jumps into relative jumps, but the transformations applied by the obfuscator needs labeled jumps. Hence we map the Stratego program relative2 label over each file which replaces each relative jump with a labeled jump. Next we read each class-files and apply the obfuscating transformations. After the transformations each class-file is written to the hard disk, and then the labeled jumps are replaced by relative ones with the label2relative. Finally the different aterms are transformed to binary class-files with aterm2class. The Stratego programs class2aterm and aterm2class are part of The Dryad Front,available on www.strategoxt.org.

Layout obfuscations

Some layout information, such as line-numbers, comments and local-variable names are already thrown away by the java-compiler or are only meta-data in the class-file for debugging purposes. Obfuscating this information is very trivial and can be achieved by just removing this meta-information. But global-variable- and method-names are part of the byte-code instructions and renaming those must be done very carefully.

Global-variable renaming

Global-variables in Java are called Fields in byte-code, so we use the name Field for a global-variable and FieldRef for the use of a global-variable in the rest of this chapter. Global variable renaming is easy, if you encounter a Field, just rename it and define a new dynamic rule [1] to rename all FieldRefs of the current class and all it’s subclasses.But this is to straight-forward because if there is a Field which shadows another Field, two rewrite rules are generated for the FieldRefs, one defined in some parent-class and one in the current class. The simple solution is that if you encounter a Field, first check if there exists a rewrite rule for a Field with the same name, if true use the same new-name,if not generate a new rewrite rule. But if the obfuscator behaves like this shadowing information is still present in the obfuscated code, so this obfuscator only generates rename-rules for sub-classes which doesn’t shadow the current Field. The Stratego-code
which achieves this is displayed below.

RenameField :
Field(AccessFlags(flags), Name(fldnm), tp, atrs) ->
Field(AccessFlags(flags), Name(newfldnm), tp, atrs)
where <CurrentClass>() => currentclass
; <new>() => newfldnm
; if (is-private(|flags))
then <rename-field-ref-dec(|fldnm, tp, newfldnm)>currentclass
else <shadow-rename-field-ref-dec(|fldnm
, tp
, newfldnm)>currentclass
end
shadow-rename-field-ref-dec(|fldnm, tp, newfldnm) =
?class
; <rename-field-ref-dec(|fldnm, tp, newfldnm)>class
; <filter(not(has-field-name(|fldnm)))>
<bagof-SubClasses>class => subclasses
; <all(shadow-rename-field-ref-dec(|fldnm
, tp
, newfldnm))>subclasses
rename-field-ref-dec(|fldnm, tp, newfldnm) =
?class
;rules(RenameFieldRef :
FieldRef(Class(class), Name(fldnm), tp) ->
FieldRef(Class(class), Name(newfldnm), tp))
has-field-name(|fldnm) =
?classname
;<NewFieldName>(classname, fldnm)

The name generated by the strategy new possibly shadows a Field from a third-party super-class. The chance is very small, but it is possible, the solution would be to generate a large random number as new-name, this would make the chance of a conflict smaller.

 Method renaming

Method renaming is more complicated. Because methods that override must get the same new name. The simple solution is again that if you encounter a method, first check if there exists a rewrite rule for a method with the same name ,if true use the same new name, if not generate a new rewrite rule. This solution is very powerful, but stillpreserves some information in the code, for example, it’s still easy to see which methodsoverriding each other.This obfuscator applies a more advanced solution, it first checks if there are some unknown classes or interfaces in the inheritance hierarchy, because then it can’t rename any non-private or non-static methods. If not, the obfuscator checks if there are subclasses which import interfaces, the current method implements, then we have to use the obfuscated name of that interface method. Finally the obfuscator checks if the current method overrides another method, if so we have to use the obfuscated name of the
overridden method. And otherwise we can generate a name for this method ourselves.

 Control obfuscations

Because of time-constraints the only control obfuscations this obfuscator implements is method-interleaving.

Method-interleaving

We have also implemented a control obfuscation transformation named methodinterleaving. The idea of method interleaving is that methods of the same signature are interleaved in one method with a kind of switch parameter.In our implementation only private methods with the same signature are grouped, these methods are then replaced by one method with an extra parameter added. This parameter is used to jump to the instructions of the formal method. The calls to the methods have to be rewritten to invoke the new interleaved function. Also an extra parameter has to be pushed on the stack to jump to the instructions of the formal method.

 Labelling

The call to an interleaved method must be rewritten, because an extra argument must be pushed on the stack to switch to the right instructions in the interleaved method.Therefore an extra load-constant instruction has to be inserted. If the function-call is inside some control flow construction, the relative jumps of that control-flow are not valid anymore.The bottom-up solution for this problem is to change the relative jumps to labelled jumps
at the begin of the pipeline an the other way around at the end. The two Strategoprograms to achieve this are described below.

Relative to Label

The first Stratego-program takes as input the aterm of a class and does a generic traversal. If it encounters a method declaration it scopes the dynamic rule [1] LabelIndex to that method. And if it encounters instructions it first replaces all the relative jumps to labelled jumps and stores the location where a label has to be introduced in the scoped dynamic
rule LabelIndex. Finally it introduces the labels into the instructions on the location
stored in the LabelIndex rule.
relative-to-label =
?Method(_, _, _, _)
; {| LabelIndex : all(relative-to-label) |}
<+ ?Instructions(_)
; ReplaceCounts
; Instructions(IntroduceLabels(|0))
<+ all(relative-to-label)

So this program iterates the instructions twice, first to rewrite the jumps and second to introduce the labels on the right locations. Because jumps can either jump forward or backward this has to be done in two iterations. The complete code of this program can be found and page 14 of this document.

 Label to Relative

The second Stratego-program takes as input the aterm of a class and does a generic traversal. If it encounters a method declaration it scopes the dynamic rule LabelLocation to that method. And if it encounters instructions it first stores every label and the corresponding index in the scoped dynamic rule LabelLocation and immediately removes the label. Finally it iterates the instructions again and replaces every labeled jump with a relative jump, by looking up the label in the LabelLocation
rule.
label-to-relative =
?Method(_, _, _, _)
; {| LabelLocation : all(label-to-relative) |}
<+ ?Instructions(_)
; Instructions(RemoveLabels(|0))
; ReplaceLabels
<+ all(label-to-relative)
The complete code of this program can be found at page 15 of this document.

 Disscusion

In a relative small amount of time we achieved a great part of what non-commercial obfuscator basically do. Most of them rename methods and attributes. Sometimes they also do something with the control flow but that’s about it.
One of the things that made us a bit curious is the fact that the university of Auckland has published a lot of papers about obfuscation. And in some of those papers they describe Kava, a Konfusing Java program. However an implementation is nowhere to be found. StrategoXT provided us means to separate functionality. All the different type of transformations easily could be bundled together, making it scalable and reusable.Even for testing purposes this sometimes was easy. Showing interleaving without doing variable and method renaming and without changing the Stratego code a lot sure saved us
a lot of time.Method inlining is a way to make a program less readable. To illustrate how transformations used in obfuscating could be reused, we show an optimisation used in the HotSpot VM.One of the important optimisations HotSpot performs is inlining frequently-called methods. That means that instead of being invoked with the overhead of a method call,
the method’s code is copied into the calling routine and executed as though it had been coded there. The smaller the method, the more sense inlining makes. A one-line method might spend more than twice as much time entering and exiting the routine as it does executing.One of the things that are very important in the subject of obfuscating is the insight of an
third party library. This functionality is provided by the Dryad library but is not yet used
in this simple obfuscator, adding this functionality should not be a lot of work. This would complement the obfuscation process in an important way.To complete the current small implementation there still is a lot of work to do. This
includes the evaluation of external libraries, class and package renaming, making the current interleaving strategy solid enough to handle all possible situations. Implementing some of the other obfuscation techniques according to metrics given as parameter allowing an amount of performance penalty.Our conclusion is that StrategoXT definitely is suitable for implementing an obfuscator.Especially the combination of Dryad makes it very useful and powerful. The idea in the
beginning was to perform the obfuscation transformations on the bytecode itself. Our conclusion afterwards is that many of the transformations could just as well be performed on the source itself. After doing this project, it is in our believe that all obfuscation transformations as described by the university of Auckland [2] could be realised using
StrategoXT.

 References

1] M. Bravenboer, A. van Dam, K. Olmos, and E. Visser. Program transformation with
scoped dynamic rewrite rules, Technical Report UU-CS-2005-005, Institute of
Information and Computing Science, Utrecht University, 2005.
[2] C. Collberg, C. Thomborson, and D. Low, A Taxonomy of Obfuscating
Transformations, Technical Report 148, Dept. of Computer Science, Univ. of
Auckland, July 1997
[3] C. Collberg, C. Thomborson, and D. Low, Manufacturing cheap, resilient, and
stealthy opaque constructs, Proceedings of the 25th ACM SIGPLAN-SIGACT
symposium on Principles of programming languages, p.184-196, January 19-21,
1998, San Diego, California, United States
[4] James Gosling, Bill Joy and Guy Steele. The Java Language Specification. Addison-
Wesley, 1996. ISBN 0-201-63451-1.
[5] Tim Lindholm and Frank Yellin. The Java Virtual Machine Specification.
http://java.sun.com/docs/books/vmspec/2nd-edition/html/VMSpecTOC.doc.html,
1999.
[6] Douglas Low, Java control flow obfuscation, Master thesis, Dept. of Computer
Science, Univ. of Auckland, June 3 1998
[7] E. Visser. Program transformation with Stratego/XT: Rules, strategies, tools, and
systems in StrategoXT-0.9. In C. Lengauer et al., editors, Domain-Specific Program
Generation, volume 3016 of LNCS, pages 216-238. Springer-Verslag, June 2004