close
Skip to content

Add missing locks in agent and service code#1570

Merged
mavenugo merged 1 commit into
moby:masterfrom
aboch:lck
Nov 30, 2016
Merged

Add missing locks in agent and service code#1570
mavenugo merged 1 commit into
moby:masterfrom
aboch:lck

Conversation

@aboch

@aboch aboch commented Nov 22, 2016

Copy link
Copy Markdown
Contributor

Related to moby/moby#28697, moby/moby#28845 and moby/moby#28712,

Signed-off-by: Alessandro Boch aboch@docker.com

@mavenugo

Copy link
Copy Markdown
Contributor

@aboch could you also take care of other missing locks for moby/moby#28712 ?

I think we need a sweep of the swarm related changes in libnetwork and look for missing locks and other concurrency issues. cc @sanimej

@aboch

aboch commented Nov 22, 2016

Copy link
Copy Markdown
Contributor Author

@mavenugo Sure thing, I was not aware of that issue.

@aboch aboch changed the title Add missing locks in service code Add missing locks in agent and service code Nov 22, 2016
@aboch

aboch commented Nov 29, 2016

Copy link
Copy Markdown
Contributor Author

ping @mavenugo @sanimej This is ready for review

Comment thread networkdb/cluster.go Outdated
func (nDB *NetworkDB) SetKey(key []byte) {
nDB.Lock()
defer nDB.Unlock()
logrus.Debugf("Adding key %s", hex.EncodeToString(key)[0:5])

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug logging should be outside of the lock. Same comment for SetPrimaryKey and RemoveKey as well.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intentionally put it inside the lock so that it gets printed when this function effectively does the job

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is not the purpose of lock. After printing the debug if the lock can't be acquired the go routine is going to be stuck. If its a deadlock the stack decode will capture the state when the user dumps it.

Comment thread agent.go
}
}
}
c.Unlock()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling a.networkDB.SetKey(key.Key) inside the controller lock doesn't look correct; its not necessary. Also,SetKey takes nDB.Lock which can potentially lead to deadlocs.. We should save the new key and call a.networkDB.SetKey(key.Key) outside of the controller lock.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can it deadlock ?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two concurrent go routines, one takes c.Lock() and tries nDB.Lock(); second go routine takes nDB.Lock() and tries c.Lock(). We had such deadlocks before, between network and endpoint locks IIRC.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understand, but the nDB object does not have any reference to the agent or controller. It's a leaf object. From what I see, networkDB methods are properly coded so that they are the only one to exercise the nDB lock over the nDB data.

I don't think the situation you are describing can happen.
But I can take care of moving the function call outside of the controller lock, if that makes it less confusing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took care of it. PTAL

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, networkdb is currently a separate package without any dependency on libnetwork core. But its safer to avoid nested locks unless its really required. In this case there is no need to call networkDB.SetKey under the controller lock.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yes. I agree, being conservative when it comes to holding locks is preferred. Especially when we have locks around a function call (with closely related code), it is much better to avoid it.

Comment thread agent.go
return
}

agent.Lock()

@sanimej sanimej Nov 29, 2016

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get the slice of agent.driverCancelFuncs under the lock and do the actual invocation outside of the lock ? Only the access to agent.driverCancelFuncs is racy here. But we are locking around the execution of those functions.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will take care of it. Thanks.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took care of it. PTAL

Signed-off-by: Alessandro Boch <aboch@docker.com>
@sanimej

sanimej commented Nov 29, 2016

Copy link
Copy Markdown

Thanks @aboch. LGTM

Comment thread agent.go
c.keys = append(c.keys, key)
if key.Subsystem == subsysGossip {
a.networkDB.SetKey(key.Key)
added = key.Key

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

am not too confident on this code-path. So consider this as a question... It looks like SetKey can be potentially called multiple times based on how many unique keys of type "subsysGossip" is present ?
With this change, it seems to be assumed that it will be only once and it is cached in added variable which is later used to set Key again. Is that the expectation ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it follows the existing logic for the deleted key (look for the deleted variable).
This is the code path for the key rotation, where we know there will only be one deleted and one added key. But @sanimej can correct us here, he initially wrote this logic.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mavenugo What aboch said is correct.The key rotation logic adds one new key to the set and removes one. So before this change also only one key would change on a rotation.

Comment thread network.go
return n.ctrlr.agent.networkDB.Peers(n.id)
}
return []networkdb.PeerInfo{}
return agent.networkDB.Peers(n.ID())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it safe to assume that networkDB will never be nil ?
The current code might behave this way. But why should we have such assumptions built in (which might cause panics in the future) ? I think it is safe to add the nil check.

@aboch aboch Nov 29, 2016

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is safe, because the networkDB field is instantiated at agent creation, and never reset.
This is why there is no check over agent.networkDB == nil throughout the agent code.

@mavenugo

Copy link
Copy Markdown
Contributor

LGTM

@mavenugo mavenugo merged commit 6cfa15e into moby:master Nov 30, 2016
@bklau

bklau commented Dec 16, 2016

Copy link
Copy Markdown

@mavenugo : Q: Does this locking changes effective on global/swarm scope or just local scope?
Thx

@mavenugo

Copy link
Copy Markdown
Contributor

@bklau it should be effective mostly for global/swarm scoped networks.

@bklau

bklau commented Dec 16, 2016

Copy link
Copy Markdown

@mavenugo : Is my understanding correct that if I have a globally-scoped overlay network based on Consul, say, (not 1.12 Swarm mode); then if I issued two network overlay to create/delete the SAME network "my_net" concurrently, a global lock would be contested and held first before doing any create/delete for "my_net"?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants