Yes, the results were merged, and to the user it appeared that nothing had happened except that the code ran faster. In actuality the threads ("microtasks") had been created earlier and hung around in a sleep state waiting to be called into a fray. At the end of a given fray the results were merged via semaphore mechanisms and the tasks would depart, returning to a wait state ("spinwait") to await another summons. If they were not promptly summoned for more work, they eventually went to sleep, yielding up the physical CPU for rescheduling by the OS. For microtasking and autotasking the underlying tasks weren't continually destroyed and recreated because the instantiation cost was too high (OS call, resource allocation, rescheduling). It was much more efficient to create them once, at the beginning of a program or at the start of a chunk where parallelism was going to be used, and have them ready to go, either spin-waiting or, at worst, waiting for an available CPU to be attached.
But the implementation details are not what the user saw (or rather, what they weren't supposed to have to see...).
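For anyone who wants to see the shape of that scheme, here is a minimal sketch in modern Python, not the original implementation. It assumes a pool of workers created once, a short spin-wait followed by a sleep on a condition variable, and a merge step guarded by a lock standing in for the semaphore mechanism. The names `TaskPool`, `spin_seconds`, and `partial_sum` are made up for illustration.

```python
import threading
import time

class TaskPool:
    """Workers are created once, then reused for every parallel region."""

    def __init__(self, n_workers, spin_seconds=0.001):
        self.n_workers = n_workers
        self.spin_seconds = spin_seconds   # hypothetical spin budget before sleeping
        self.cond = threading.Condition()
        self.generation = 0                # bumped once per parallel region ("fray")
        self.work = None                   # function each worker runs in the current region
        self.total = 0                     # merged result, guarded by self.cond
        self.pending = 0                   # workers that have not merged yet
        self.shutdown = False
        self.threads = [
            threading.Thread(target=self._worker, args=(i,), daemon=True)
            for i in range(n_workers)
        ]
        for t in self.threads:
            t.start()

    def _worker(self, index):
        seen = 0
        while True:
            # Spin-wait briefly: cheap to resume if work arrives promptly.
            deadline = time.monotonic() + self.spin_seconds
            while (time.monotonic() < deadline
                   and self.generation == seen and not self.shutdown):
                pass
            # Not summoned in time: sleep and give the CPU back to the OS.
            with self.cond:
                while self.generation == seen and not self.shutdown:
                    self.cond.wait()
                if self.shutdown:
                    return
                seen = self.generation
                work = self.work
            partial = work(index, self.n_workers)
            # Merge the partial result under the lock, then report completion.
            with self.cond:
                self.total += partial
                self.pending -= 1
                self.cond.notify_all()

    def run(self, work):
        """Summon all workers into one parallel region; return the merged sum."""
        with self.cond:
            self.total = 0
            self.pending = self.n_workers
            self.work = work
            self.generation += 1
            self.cond.notify_all()
            while self.pending:
                self.cond.wait()
            return self.total

    def close(self):
        with self.cond:
            self.shutdown = True
            self.cond.notify_all()
        for t in self.threads:
            t.join()

def partial_sum(index, n_workers):
    # Each worker sums a strided slice of 0..999.
    return sum(range(index, 1000, n_workers))

pool = TaskPool(4)
first = pool.run(partial_sum)    # first parallel region
second = pool.run(partial_sum)   # same threads are reused, none are recreated
pool.close()
```

The caller only ever sees `run()` return a merged number; the thread creation, spinning, and sleeping all stay hidden behind it, which is the point being made above.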
In the case of register access, register "scoreboarding" was used to indicate that a register was "waiting" for the output from some operation, suspending any operations wanting to use that register as input. I don't think there was any form of hardware "scoreboarding" for memory, so coordinating access to memory required a software handshake (mutex, semaphore, ...).
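A toy model of register scoreboarding, purely as an illustration: each register carries a "ready at cycle N" mark, and an instruction cannot issue until all of its inputs are ready. The in-order, one-issue-per-cycle behaviour, the reservation of the destination register, and the latencies below are assumptions chosen for the example, not timings from any real machine.

```python
class Scoreboard:
    """Toy scoreboard: one 'ready at cycle N' entry per register."""

    def __init__(self, n_registers):
        self.ready_at = [0] * n_registers   # cycle at which each register's value is valid
        self.next_issue = 0                 # earliest cycle the next instruction may issue

    def issue(self, dest, sources, latency):
        """Issue one instruction in program order; return its issue cycle."""
        # Stall until every source register, and the destination, is free.
        waits = [self.ready_at[dest]] + [self.ready_at[s] for s in sources]
        start = max([self.next_issue] + waits)
        self.ready_at[dest] = start + latency   # dest is "waiting" until the result lands
        self.next_issue = start + 1             # assumed: at most one issue per cycle
        return start

sb = Scoreboard(8)
t_mul = sb.issue(dest=1, sources=[2, 3], latency=7)    # nothing pending, issues at once
t_add = sb.issue(dest=4, sources=[1, 5], latency=3)    # needs register 1, so it stalls
t_next = sb.issue(dest=6, sources=[2, 3], latency=3)   # independent, but held up in order
```

Here the second instruction is held until the first one's result arrives in register 1, with no software involvement at all.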
No, no scoreboarding for memory, but there were shared registers and semaphore registers that could be used for managing memory-based structures.
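The software handshake for memory can be sketched like this, again in present-day Python rather than anything from the original hardware. A `threading.Semaphore` stands in for a semaphore register, and the dictionary `shared` is a hypothetical stand-in for a memory-based structure.

```python
import threading

shared = {"value": None}          # stands in for a memory-based structure
ready = threading.Semaphore(0)    # count 0 means "data not yet written"

def producer():
    shared["value"] = sum(range(10))   # write the result to memory first...
    ready.release()                    # ...then signal that it is valid

def consumer(out):
    ready.acquire()                    # block until the producer has signalled
    out.append(shared["value"])        # safe to read now

results = []
c = threading.Thread(target=consumer, args=(results,))
p = threading.Thread(target=producer)
c.start()   # consumer starts first and has to wait
p.start()
c.join()
p.join()
```

The ordering is what matters: the write to memory happens before the signal, and the read happens only after the wait, which is the job the hardware scoreboard did automatically for registers.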