Stack and Heap

Got questions? Got answers? Go here for both.

Moderator: MaxCoderz Staff

User avatar
Wesley
Regular Member
Posts: 137
Joined: Sun 12 Oct, 2008 1:42 am
Location: Boise, Idaho
Contact:

Stack and Heap

Post by Wesley »

I won't go into too much detail of what I'm doing right now for the sake of time.

I was wondering if anyone has any good ideas on memory usage regarding arrays. I am implementing the Needleman-Wunsch algorithm for comparing two strings. The first and most easy idea to implement is that of a 2D array storing values, then travelling back to recover the best string comparison.

My problem is that I have to create an array of size string1.length+1 by string2.length+1; which is massive and requires a lot of memory.

I was thinking of travelling with a 2*2 array to find the edit distance between two strings and then create a "Cell" object for each cell and have a pointer to where the "Cell" object can travel back to recover the best string comparison.

Any ideas? Any questions?
ImageImage
ImageImage
Image
polarBody
New Member
Posts: 23
Joined: Wed 02 Sep, 2009 3:01 pm
Location: Chicago, IL

Re: Stack and Heap

Post by polarBody »

Wesley wrote:I was thinking of travelling with a 2*2 array to find the edit distance between two strings and then create a "Cell" object for each cell and have a pointer to where the "Cell" object can travel back to recover the best string comparison.
Could you elaborate on this a little more? I'm not sure I understand your plan of attack :)
King Harold
Calc King
Posts: 1513
Joined: Sat 05 Aug, 2006 7:22 am

Re: Stack and Heap

Post by King Harold »

How long are the strings?
What language will this be in?
And where/how do stacks an heaps come into this - do you mean The Heap and The Stack? Since your array is dynamically allocated (dynamic size, so.., well you could use alloca.. but The Stack is typically only about 1MB) it would be on the heap but that shouldn't be a problem..

How about a sparse matrix?
User avatar
Wesley
Regular Member
Posts: 137
Joined: Sun 12 Oct, 2008 1:42 am
Location: Boise, Idaho
Contact:

Re: Stack and Heap

Post by Wesley »

Okay, a more detailed description of the goal is here.

1) The strings can be as long as 500,000 characters long (or longer). My goal is to make my program work for 500,000 character-long strings.
2) I am coding in Java. That was a good question, since Java handles memory quite differently from other languages (such as C++ and C).
3) I am unsure what you mean by
"stacks an heaps come into this - do you mean The Heap and The Stack?"

You see, my problem is this. Whenever I have to compare two strings I need a grid of integers: opt[N+1][M+1] where N and M are the sizes of the two strings being compared. Do you see where this can get really ugly?

When I want to compare a string of size N or M (assuming the two are the same) that is even 100,000 characters long, I need a 100,000*100,000 grid of int values. That means I need 100,000*100,000*(4 bytes) = 40,000,000,000 bytes! That's roughly 40,000 MB.

So, my goal is NOT to use a grid in this manner. So, I'm thinking of using "Cell" objects which have references to their previous cell. I'm not sure if this will work, because I would need to have separate "Cell" objects so they don't get collected by the garbage collector.
ImageImage
ImageImage
Image
darkstone knight
New Member
Posts: 67
Joined: Sun 09 Nov, 2008 1:56 pm

Re: Stack and Heap

Post by darkstone knight »

cant you just lz77 compress the entire.. thing?

modified lz77 seems like the perfect algorithm for comparing strings strings :D

(then again, im far too lazy to find out what proper string alignement is :P )
King Harold
Calc King
Posts: 1513
Joined: Sat 05 Aug, 2006 7:22 am

Re: Stack and Heap

Post by King Harold »

Well I don't see how lz77 would come in.. You'd still need to store those grid entries "somewhere", and that "somewhere" had better not get too big..

Anyway this is DP, and with DP you can always cheat - for example, instead of just filling a grid, you could do it the "old fashioned" way: recursion with memoization (and then treat the memoization "database" as a limited-size cache)
That's possible..

Anyway, this algorithm uses 2 passes, 1 to fill a matrix and 1 to go back and "follow the optimal path". But that doesn't mean that you actually have to do that, what's wrong with building the optimal path while you're filling that matrix? Nothing, right? And if you do that, you may find that you only need to store row i-1 and row i itself (and a little something extra to remember which "way you went" at every "crossroad")

Then again it is rather late, so I may just be spewing nonsense.

This trick applies to many DP algorithms though
User avatar
Wesley
Regular Member
Posts: 137
Joined: Sun 12 Oct, 2008 1:42 am
Location: Boise, Idaho
Contact:

Re: Stack and Heap

Post by Wesley »

King Harold, you are now on the same track as me. Wow. We are thinking the exact same thing!

I am trying to implement dynamic programming. That's where my idea of the "Cell" objects comes in. Where each cell progressively has a path back to the beginning (or in this case the end of the strings).

One of my questions lies in this cell communication. When does the Java garbage collector actually scoop up everything (it's after)? Is it after a method is finished? If that is so, this plan won't be too difficult.
ImageImage
ImageImage
Image
polarBody
New Member
Posts: 23
Joined: Wed 02 Sep, 2009 3:01 pm
Location: Chicago, IL

Re: Stack and Heap

Post by polarBody »

The Java garbage collector will only reclaim allocated data if there is no longer any reference to the data. For example:

Code: Select all

String s1 = new String("blah");
String s2 = new String("blah blah");
s1 = s2;
After the third statement, both s1 and s2 are pointing to the "blah blah" string, and nothing is pointing to the "blah" string, so the garbage collector will free up the "blah" string since it is unreachable to the programmer after the third line. Now imagine that the following code is executed after the code above:

Code: Select all

s2 = new String("flippy floppy");
Remember that s1 still points to "blah blah", so the "blah blah" string will not be deallocated.

As for whether the garbage collector will reclaim an object after a method ends, again it depends on whether or not a reference variable still points to the object at that point in the execution of the program.
User avatar
Wesley
Regular Member
Posts: 137
Joined: Sun 12 Oct, 2008 1:42 am
Location: Boise, Idaho
Contact:

Re: Stack and Heap

Post by Wesley »

Hmm... From what I'm finding, the Needleman-Wunsch algorithm seems to only compute in O(MN) (M and N being the size of the two strings being compared) time and takes up at least MN memory (depending on the type of objects being used).

I'm not sure what more I can do.

I guess I could use shorts or even bytes instead of integers. That would reduce the size some, but not enough for a 500,000 character string.

I think the cell idea is the only way, but I'm quite sketchy on how I'd go about implementing it.
ImageImage
ImageImage
Image
polarBody
New Member
Posts: 23
Joined: Wed 02 Sep, 2009 3:01 pm
Location: Chicago, IL

Re: Stack and Heap

Post by polarBody »

Perhaps Hirschberg will come to the rescue.
User avatar
Wesley
Regular Member
Posts: 137
Joined: Sun 12 Oct, 2008 1:42 am
Location: Boise, Idaho
Contact:

Re: Stack and Heap

Post by Wesley »

That's awesome and crazy all at the same time!

I originally had the idea presented in the Hirschberg algorithm. I didn't think it would work. It's basically implementing the same thing as my second idea, the "Cell" objects, but with only the arrays and no "Cell"s.

Great! I will get working on research again. Thanks again!
ImageImage
ImageImage
Image
User avatar
Wesley
Regular Member
Posts: 137
Joined: Sun 12 Oct, 2008 1:42 am
Location: Boise, Idaho
Contact:

Re: Stack and Heap

Post by Wesley »

So much for working ahead! My professor just posted Hirschberg's paper on his algorithm for computing the LCS. LCS is somewhat different from the Edit Distance problem, but the algorithm can be used in the same fashion.

Why, oh why, must the diligent suffer!?

See the article here. :cry:

I'm kind of confused on how he slices the grid in twos...
Last edited by Wesley on Fri 09 Oct, 2009 12:57 pm, edited 1 time in total.
ImageImage
ImageImage
Image
King Harold
Calc King
Posts: 1513
Joined: Sat 05 Aug, 2006 7:22 am

Re: Stack and Heap

Post by King Harold »

So.. you go to a Chinese college? :?
User avatar
Wesley
Regular Member
Posts: 137
Joined: Sun 12 Oct, 2008 1:42 am
Location: Boise, Idaho
Contact:

Re: Stack and Heap

Post by Wesley »

King Harold wrote:So.. you go to a Chinese college? :?
I'm not sure what you mean... I was simply looking around for how Hirschberg solved the problem. I spent an awful long time reading other articles and trying to figure out the algorithm, but to no avail. Until, I found Hirschberg's paper. But, by the time I found it, my professor announced that he'd post a paper about the Edit Distance problem, and he posted the very same article!

lol. I see what you mean now. That link isn't from my school. My professor used Blackboard to post the paper. I only posted a link to a public copy. I go to Boise State University. It's not saying much though. I'm finding more and more how uncompetitive BSU is... We are so behind in our computer science classes compared to other universities. I should've been learning about data structures and algorithms my freshman year, not my sophomore year.

EDIT: At least I have my own side web page! I just haven't had much time to add to it... I've never done anything with web development, but I like what I've done to it so far; at least it looks better than my CS professor's page.
ImageImage
ImageImage
Image
King Harold
Calc King
Posts: 1513
Joined: Sat 05 Aug, 2006 7:22 am

Re: Stack and Heap

Post by King Harold »

Oh I see, sorry

btw, you're allowed to learn in your own time
That's what I did, in fact all of the stuff I've posted about here is stuff I learned on my own :)
And of course it's the only way to get to be actually better than the rest..
Good luck

Oh and edit distance is nice, it easily allows you to do the "2 rows"-trick, so that's cool :)
Post Reply