C code "Juicer"

December 1, 2009 2 minutes read | 407 words by Ruben Berenguel

As I said in a previous post, I am looking for some way to detect copied code. I have now a working prototype of a “code juicer”. From a C file it creates a set of PostScript files (well, almost, as they need postprocessing) which are later processed and selected to print. As an example, here is the output applied to a Runge-Kutta-Fehlberg 4-5.

It is something like an execution tree, with function calls squared, and different slopes for if, for, while clauses. It is still somewhat buggy and fails to recognize a few closings. I intend to “juice” all the assignments, and extract the interesting functions (around 5, with about 50 lines of code each, 35 assignments, probably). Then I can print all these pages and just look at them to see matches. Yesterday I already did for 3 examples… they didn’t cheat, they are clearly different.My idea is that the structure of calls is more significant than looking at some diff. A student can change variable names, add or change comments, and then diff is dead. But this structure tree is more like a fingerprint: it maps the structure of the algorithm. It can change from student to student (indeed does), but detects minor changes that diff misses. Let’s see if I find cheaters!The code is a flex .lex file, looking something like:

“void” {def=0;}“char*” {def=key?1:0;}";" {def=0; fprintf(auxiliar, “%% Flags: inc %d def %d par %d for %d RecKey %d\n”,include, def, par, InsideFor, RecentKey); if(!par&&(InsideFor||InsideIf)&&!RecentKey){GetOut(Last);}RecentKey=0;}"(" {par++;}")" {par–;}"<" {}">" {include=0;}

A set of rules and the code to process it. It has a few auxiliar functions to process calls, (and a stack push-pop for tracking calls) and is getting harder and harder to read. as I find more bugs. I almost wrote an entire parser in the .lex source… That happens when you are not sure how to use Bison, or even what it does. But it is working, so I should not complain.In fact, this kind of analysis is probably done, and I can find some similar things in the Wikipedia entries for Syntax Trees. But it has been a really fun coding, and when finished… well, it has already been fun, and almost finished ;) Anyway, the structure I have done suits best several functions per page in landscape mode… and that is exactly what I need, so why bother? Let’s reinvent the wheel a few more times.

You may be also interested in:ParseList(ScrambleList(Relateds(Linux,Programming)))