Add missing locks in agent and service code#1570
Conversation
|
@aboch could you also take care of other missing locks for moby/moby#28712 ? I think we need a sweep of the swarm related changes in libnetwork and look for missing locks and other concurrency issues. cc @sanimej |
|
@mavenugo Sure thing, I was not aware of that issue. |
| func (nDB *NetworkDB) SetKey(key []byte) { | ||
| nDB.Lock() | ||
| defer nDB.Unlock() | ||
| logrus.Debugf("Adding key %s", hex.EncodeToString(key)[0:5]) |
There was a problem hiding this comment.
Debug logging should be outside of the lock. Same comment for SetPrimaryKey and RemoveKey as well.
There was a problem hiding this comment.
I intentionally put it inside the lock so that it gets printed when this function effectively does the job
There was a problem hiding this comment.
That is not the purpose of lock. After printing the debug if the lock can't be acquired the go routine is going to be stuck. If its a deadlock the stack decode will capture the state when the user dumps it.
| } | ||
| } | ||
| } | ||
| c.Unlock() |
There was a problem hiding this comment.
Calling a.networkDB.SetKey(key.Key) inside the controller lock doesn't look correct; its not necessary. Also,SetKey takes nDB.Lock which can potentially lead to deadlocs.. We should save the new key and call a.networkDB.SetKey(key.Key) outside of the controller lock.
There was a problem hiding this comment.
How can it deadlock ?
There was a problem hiding this comment.
Two concurrent go routines, one takes c.Lock() and tries nDB.Lock(); second go routine takes nDB.Lock() and tries c.Lock(). We had such deadlocks before, between network and endpoint locks IIRC.
There was a problem hiding this comment.
Understand, but the nDB object does not have any reference to the agent or controller. It's a leaf object. From what I see, networkDB methods are properly coded so that they are the only one to exercise the nDB lock over the nDB data.
I don't think the situation you are describing can happen.
But I can take care of moving the function call outside of the controller lock, if that makes it less confusing.
There was a problem hiding this comment.
I took care of it. PTAL
There was a problem hiding this comment.
yeah, networkdb is currently a separate package without any dependency on libnetwork core. But its safer to avoid nested locks unless its really required. In this case there is no need to call networkDB.SetKey under the controller lock.
There was a problem hiding this comment.
And yes. I agree, being conservative when it comes to holding locks is preferred. Especially when we have locks around a function call (with closely related code), it is much better to avoid it.
| return | ||
| } | ||
|
|
||
| agent.Lock() |
There was a problem hiding this comment.
Can we get the slice of agent.driverCancelFuncs under the lock and do the actual invocation outside of the lock ? Only the access to agent.driverCancelFuncs is racy here. But we are locking around the execution of those functions.
There was a problem hiding this comment.
I will take care of it. Thanks.
There was a problem hiding this comment.
I took care of it. PTAL
Signed-off-by: Alessandro Boch <aboch@docker.com>
|
Thanks @aboch. LGTM |
| c.keys = append(c.keys, key) | ||
| if key.Subsystem == subsysGossip { | ||
| a.networkDB.SetKey(key.Key) | ||
| added = key.Key |
There was a problem hiding this comment.
am not too confident on this code-path. So consider this as a question... It looks like SetKey can be potentially called multiple times based on how many unique keys of type "subsysGossip" is present ?
With this change, it seems to be assumed that it will be only once and it is cached in added variable which is later used to set Key again. Is that the expectation ?
There was a problem hiding this comment.
Yeah, it follows the existing logic for the deleted key (look for the deleted variable).
This is the code path for the key rotation, where we know there will only be one deleted and one added key. But @sanimej can correct us here, he initially wrote this logic.
There was a problem hiding this comment.
@mavenugo What aboch said is correct.The key rotation logic adds one new key to the set and removes one. So before this change also only one key would change on a rotation.
| return n.ctrlr.agent.networkDB.Peers(n.id) | ||
| } | ||
| return []networkdb.PeerInfo{} | ||
| return agent.networkDB.Peers(n.ID()) |
There was a problem hiding this comment.
Is it safe to assume that networkDB will never be nil ?
The current code might behave this way. But why should we have such assumptions built in (which might cause panics in the future) ? I think it is safe to add the nil check.
There was a problem hiding this comment.
Yes it is safe, because the networkDB field is instantiated at agent creation, and never reset.
This is why there is no check over agent.networkDB == nil throughout the agent code.
|
LGTM |
|
@mavenugo : Q: Does this locking changes effective on global/swarm scope or just local scope? |
|
@bklau it should be effective mostly for global/swarm scoped networks. |
|
@mavenugo : Is my understanding correct that if I have a globally-scoped overlay network based on Consul, say, (not 1.12 Swarm mode); then if I issued two network overlay to create/delete the SAME network "my_net" concurrently, a global lock would be contested and held first before doing any create/delete for "my_net"? |
Related to moby/moby#28697, moby/moby#28845 and moby/moby#28712,
Signed-off-by: Alessandro Boch aboch@docker.com