Programming robots for safe interaction with humans is extremely complex, especially in collaborative tasks. One reason is the unpredictable behaviour of humans, whose intentions may not be clear to the robot. We present a novel architecture for safe human-robot collaboration in a shared tabletop workspace, based on intuitive multimodal language and gesture instructions and on behaviour recognition. In our example scenario, a human and a robot arm have to assemble a Tangram puzzle collaboratively. The configuration space of the robot is constrained by a combination of the user's behaviour patterns, learned by tracking the user's arm, and direct audio-visual instructions regarding the sharing of the workspace. This ensures safe and non-obstructive collaboration behaviour of the robot, which can be updated continuously during task execution. In this paper, we present initial results with a focus on instruction understanding.