[Cuis-dev] [squeak-dev] Re: [squeak-smalltalk/squeak-object-memory] Something wrong with integer hashes and/or set grow, Cuis is 1000x faster (Issue #129)

Luciano Notarfrancesco luchiano at gmail.com
Wed Jan 15 09:18:13 PST 2025


Hi Martin,
Yes, I think we agree. I also make my hashes SmallIntegers and I try to
make them “random”. On the other hand, I think not everybody agreed with
the idea that hashes should try to look random, and in that case the
collection would need to “randomize” the hashes and try to produce
uniformly distributed indices in the internal array. Anyway, yes, it works
fine as it is in Cuis now.
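
Something like this is what I mean by the collection doing the scrambling -- just
a sketch, not the actual Set>>scanFor: of Cuis or Squeak, but it keeps the classic
linear-probing scan and only adds a hashMultiply on the way in:

scanFor: anObject
    "Answer the index of the slot holding anObject, or of the first empty slot
     found starting from anObject's scrambled hash. Answer 0 if the scan wraps
     all the way around without finding either."
    | start index element |
    start := index := anObject hash hashMultiply \\ array size + 1.
    [
        (element := array at: index) == nil ifTrue: [^ index].
        element = anObject ifTrue: [^ index].
        (index := index \\ array size + 1) = start
    ] whileFalse.
    ^ 0

With that in place, SmallInteger>>hash could simply answer self again, and the
table indices would still come out well spread.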

On Wed, Jan 15, 2025 at 23:48 Martin McClure <martin at hand2mouse.com> wrote:

> Hi Luciano,
>
> I wonder what advantage you see in having integers be their own #hash. I'm
> having trouble thinking of any reason that would be more desirable than
> using something like hashMultiply.
>
> In general, a hash function maps some key to an index in a table. In
> Smalltalk, a hash function consists of two parts. The first part is the
> #hash method. The second part is whatever the collection decides to do to
> map the result of the #hash method to indices in its table.
>
> So sure, you could move hashMultiply from the hash method to the
> collections and get the same result -- for SmallIntegers. But classes like
> Set and Dictionary are intended to have keys of any kind of object, not
> just SmallIntegers. Some objects produce better-scrambled #hash values than
> integers do, and for those objects sending hashMultiply in the collection
> is an extra unnecessary step (and in some cases might even make things
> worse, but I haven't looked deeply into the theory for that).
>
> What has worked best for me is to make the #hash method of all objects
> answer SmallIntegers, spread across the entire non-negative SmallInteger
> range in a manner that "looks random" as much as possible. Then the
> collections can typically just modulo the hash value with their table size
> in order to complete the hash function, and get good results.
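>
> For example, that second step can be as simple as (an illustrative sketch,
> not the actual collection code):
>
>     index := aKey hash \\ array size + 1.
>
> so a hash of 123456789 in a table of 4096 slots lands at
> 123456789 \\ 4096 + 1 = 3350, and with well-scrambled hashes those indices
> come out roughly uniformly distributed over the table.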
>
> Regards,
>
> -Martin
> On 1/14/25 11:10 PM, Luciano Notarfrancesco via Cuis-dev wrote:
>
> Yes, I came across that issue with hashed collections of consecutive
> integers some years ago, and that’s why we changed SmallInteger>>hash in
> Cuis. However, I think it might be better to change Set>>scanFor: instead
> (and in the appropriate subclasses like Dictionary, etc), put the call to
> hashMultiply there (or “shuffle” the hash in some other way), and make
> integers be their own hash again.
>
> On Wed, Jan 15, 2025 at 06:14 Nicolas Cellier <
> nicolas.cellier.aka.nice at gmail.com> wrote:
>
>> This time it seems a bit different though: the bench is not dominated by
>> GC, but by hash collisions.
>>
>> Cuis' Set uses the old-fashioned capacity computation:
>> sizeFor: nElements
>>      "Large enough size to hold nElements with some slop (see fullCheck)"
>>      nElements <= 0 ifTrue: [^ 1].
>>      ^ nElements+1*4//3
>> Squeak's Set uses the more advanced #goodPrimeAtLeast: for determining the
>> capacity.
>> So we would expect Squeak to be better... But no, it isn't.
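>>
>> (For instance, with the Cuis formula above, asking room for 96 elements
>> answers 96 + 1 * 4 // 3 = 129 slots -- binary messages evaluate left to
>> right -- i.e. about 4/3 of the requested size, whereas Squeak rounds the
>> capacity up to a "good" prime.)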
>>
>> First observe that we have:
>>
>>     (hashes sorted = (1 to: hashes size) asArray).
>>
>> Hence, we try to put consecutive SmallIntegers in a Set.
>>
>> If I bench Eliot's test, but with gradually growing size (1<<18-1),
>> (1<<19-1),(1<<20-1), (1<<21-1)
>> I get:
>>  '6.87 per second. 146 milliseconds per run. 13.48116 % GC time.'
>>  '3.23 per second. 310 milliseconds per run. 14.60311 % GC time.'
>>  '1.56 per second. 641 milliseconds per run. 15.1462 % GC time.'
>>  '0.087 per second. 11.5 seconds per run. 0.04346 % GC time.'
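>>
>> The kind of expression benched here is Eliot's loop wrapped in #bench, with
>> n standing for each of the sizes above (a sketch):
>>
>>     [ | hashes |
>>       hashes := Set new.
>>       1 to: n do: [:ign | hashes add: Object new identityHash] ] bench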
>>
>> So the cost grows linearly with size up to 2^20,
>> but 2^21 costs us a factor of x20 instead of the expected x2,
>> that's 10 times more than expected,
>> and the GC cost drops as a percentage...
>>
>> I won't even try (1<<22-1) on my slow machine...
>> What happens?
>>
>> Simple: the answer is in SmallInteger>>hash, which answers ^self.
>> The identity hashes being added vary from 1 to 1<<22-1, and each SmallInteger
>> is its own hash.
>> Up to 1<<20-1 elements, statistically we do not have that many consecutive
>> hashes.
>> But we have way more consecutive hashes with 1<<21-1 elements (among 1<<22-1
>> possible hashes).
>> And since the candidate index in scanFor: is simply object hash \\ array size + 1,
>> once we have a collision (and we will, until the array size grows above 1<<22-2),
>> and since we have already filled consecutive slots with consecutive hashes,
>> we get long collision chains.
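>>
>> (A small worked example of the effect: with an 8-slot array already holding
>> elements whose hashes are 1 to 5 -- so slots 2 to 6 are occupied -- adding
>> an element with hash 9 starts probing at 9 \\ 8 + 1 = 2 and must walk
>> slots 2, 3, 4, 5 and 6 before finding slot 7 free. Scrambled hashes make
>> such runs of occupied consecutive slots rare.)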
>>
>> So why is Cuis better behaved?
>> Because Integer>>hash will return self hashMultiply...
>> Note that above 1<<54, consecutive Integer hashes will show quite a few
>> collisions, because the hash first converts asFloat, and many consecutive
>> LargeIntegers convert to the same Float.
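>>
>> (For instance, (1 << 60) asFloat = ((1 << 60) + 1) asFloat answers true,
>> since a Float only carries 53 bits of precision, so both integers would end
>> up with the same hash under such an asFloat-based scheme.)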
>>
>> Just redefine SmallInteger>>hash as ^self hashMultiply, and you'll get a more
>> decent benchmark.
>> Except that you can't do it interactively, because you have to rehash all
>> hashed collections before Morphic takes control back.
>>
>>
>> Nicolas
>>
>> Le mar. 14 janv. 2025 à 21:55, Nicolas Cellier <
>> nicolas.cellier.aka.nice at gmail.com> a écrit :
>>
>>> That reminds me of this thread:
>>>
>>> https://lists.cuis.st/mailman/archives/cuis-dev/2024-March/008555.html
>>>
>>> Le mar. 14 janv. 2025 à 20:38, David T Lewis via Squeak-dev <
>>> squeak-dev at lists.squeakfoundation.org> a écrit :
>>>
>>>> Reported by Eliot on squeak-dev:
>>>>
>>>> https://lists.squeakfoundation.org/archives/list/squeak-dev@lists.squeakfoundation.org/thread/ABGLWILW6YJ2NPAHICWSLCOUIAKXKLFL/
>>>>
>>>> Copied from the original report:
>>>>
>>>> Hi All,
>>>>
>>>>    recently the identityHash implementation in the VM has been
>>>> upgraded.
>>>>
>>>> Open Smalltalk Cog[Spur] VM [CoInterpreterPrimitives
>>>> VMMaker.oscog-eem.3500] 5.20250103.0325
>>>> platform sources revision VM: 202501030325 eliot at Machado.local:oscogvm
>>>> Date: Thu Jan 2 19:25:19 2025 CommitHash: 5e05e38
>>>>
>>>> It now uses an xorshift-register PRNG that cycles through all possible
>>>> 2^22 - 1 valid identity hashes in 2^22 - 1 iterations. One way to test this
>>>> is to create a set, and loop creating 2^22-1 objects, adding their hashes
>>>> to the set, and answering the size of the set.
>>>>
>>>> So let's do this and, at the same time, see how long it takes. E.g. this
>>>> is on my 2021 M1 Max MBP:
>>>>
>>>> | size |
>>>> { [| hashes |
>>>>     hashes := Set new: (2 raisedTo: 22).
>>>>     1 to: (2 raisedTo: 22) - 1 do:
>>>>         [:ign| hashes add: Object new identityHash].
>>>>     size := hashes size] timeToRun.
>>>>     size.
>>>>     (2 raisedTo: 22) - 1 }. #(450 4194303 4194303)
>>>>
>>>> also #(483 4194303 4194303)
>>>>
>>>> So, avoiding grows, this takes less than half a second.
>>>>
>>>> However, if we just use Set new and rely on it growing, we get
>>>> approximately a 1,500- to 1,800-fold slowdown:
>>>>
>>>> | size |
>>>> { [| hashes |
>>>>     hashes := Set new.
>>>>     1 to: (2 raisedTo: 22) - 1 do:
>>>>         [:ign| hashes add: Object new identityHash].
>>>>     size := hashes size] timeToRun.
>>>>     size.
>>>>     (2 raisedTo: 22) - 1 }. #(768992 4194303 4194303)
>>>>
>>>> also #(800874 4194303 4194303).
>>>>
>>>> 768992 / 483.0 = 1592
>>>> 800874 / 450.0 = 1780
>>>>
>>>> Now Cuis has moved away from allowing stretchy collections to have
>>>> their capacity initialized with new:. One has to
>>>> use newWithRoomForMoreThan:. So
>>>>
>>>> | size |
>>>> { [| hashes |
>>>>     hashes := Set newWithRoomForMoreThan: (2 raisedTo: 22).
>>>>     1 to: (2 raisedTo: 22) - 1 do:
>>>>         [:ign| hashes add: Object new identityHash].
>>>>     size := hashes size] timeToRun.
>>>>     size.
>>>>     (2 raisedTo: 22) - 1 }. #(506 4194303 4194303)
>>>>
>>>> BUT!!!! If we just use new and allow Cuis to grow the set we get e.g.
>>>>
>>>> | size |
>>>> { [| hashes |
>>>>     hashes := Set new.
>>>>     1 to: (2 raisedTo: 22) - 1 do:
>>>>         [:ign| hashes add: Object new identityHash].
>>>>     size := hashes size] timeToRun.
>>>>     size.
>>>>     (2 raisedTo: 22) - 1 }. #(725 4194303 4194303)
>>>>
>>>> This is at least 1,000 times faster than Squeak.  Something is
>>>> seriously wrong with the Squeak code.
>>>>
>>>> 768992 / 725.0 = 1061
>>>>
>>>> *,,,^..^,,,*
>>>> Happy New Year! Eliot
>>>>
>
>
>

