Hello everybody, I received an assigment I’m struggling with. The assignment is as follows:
We’ve 5 <= N <= 1000 pieces of genomic DNA (strings) which have to be reassembled into a single string, which is a palindrome (might not be unique, any palindrome is a valid solution). The length of each piece is 1 <= L <= 255. The sequences are either as provided in the input, or reversed (TGGA → AGGT).
5 //number of strings
the reassembled palindrome reads:
3 -5 1 -2 4 // Id (line) from input
AGGTGC CCGA TGCCGT AGCCCG TGGA
For now no code, but the brute force solution (permutations of each combination) seems unfeasible. Any ideas leading toward reducing the needed permutations, palindrome properties, etc., will be greatly appreciated.
Interesting one. I have no finished solution. Have you thought about systematically trying to construct a palindrome outside-in with the pieces?
Like in the example you would sort the pieces
and a second sequence of the reverse strings.
Then you start with the first one
and look for a possible piece at the other end of the palindrome, which is only
AGCC … GCCCGA.
Okay, now the right piece is longer. You look for something starting with CG… to append to the left one. There’s only
AGCC CGTGGA … GCCCGA
and so on. You can do this recursively, and back up whenever you hit a dead end.
Since sorting is only O(n log n), and accessing an item in a sorted sequence is O(log n) with a fitting data structure (e.g. binary search trees) you should be able to reduce the running time significantly compared to the brute force solution. In the worst case you have something like O(N * log N) steps for the sorting and O(N * log N * L) steps for choosing and comparing substrings, but you’ll likely abort early more often than not, and it’s unlikely that the solution is the last permutation you ever test.
Thanks a lot for the idea. Conceptually I tried constructing the palindrome inside-out, as well as starting with the longest piece and trying to find the opposite, but this seems more robust.