One possible way is to use segmentation. Segmentation lets you set a variable for each user that you can then use to group different types of users. You might have a segmentation variable with values like "employee", "customer", and "general public".
You set the segmentation variable in Google Analytics by adding a line to your tracking code that looks something like:
pageTracker._setVar("segmentation value"); [using the newer GA tracking code]
__utmSetVar("segmentation value"); [using the older GA tracking code]
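As a concrete sketch of the idea, here's one way you might pick the segment value for a visitor before handing it to the tracker. The `classifyVisitor` helper and the cookie names are my own illustrations, not part of the GA API; only the `_setVar` call itself comes from the (legacy) tracking code.

```javascript
// Sketch: choose a segment label per visitor type. The cookie names and
// the classifyVisitor helper are hypothetical; only pageTracker._setVar()
// is from the legacy GA API.
function classifyVisitor(cookieString) {
  if (cookieString.indexOf("employee_sso=") !== -1) return "employee";
  if (cookieString.indexOf("customer_id=") !== -1) return "customer";
  return "general public";
}

// On a real page, after the standard tracking snippet has loaded:
// pageTracker._setVar(classifyVisitor(document.cookie));
```

The point is simply that the segment value is computed once from something you already know about the visitor and then attached to every report via `_setVar`.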
The other tag-based analytics programs have a similar feature (often more advanced with the ability to set multiple variable values).
That said, in at least some of the cases I've seen, the bots are able to set segmentation variables, which makes me think they use embedded browsers, much like the screen capture program WebShots (which is controllable from the command line).
Segmentation + A Honey Pot
Essentially, you set up a page and hope the bad bot visits it. When the bad bot hits the page, you set the segmentation variable to something like "bad bot", and the bot then shows up in your analytics reports. Because you're using a segmentation variable, just about any report can be broken down by the presence of this value.
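A minimal sketch of the honey pot page's logic might look like this; the honey pot path is a made-up example, not anything GA prescribes:

```javascript
// Sketch: only the honey pot page flags the visitor; every other page
// leaves the segment alone. HONEYPOT_PATH is a hypothetical example.
var HONEYPOT_PATH = "/specials/do-not-follow.html";

function segmentForPage(path) {
  return path === HONEYPOT_PATH ? "bad bot" : null;
}

// On the honey pot page itself you'd then call something like:
// var seg = segmentForPage(location.pathname);
// if (seg) pageTracker._setVar(seg);
```

Keeping the decision in one function makes it easy to avoid accidentally tagging real pages.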
You'll also want to make sure you don't later overwrite the segmentation variable. If you use a honey pot, consider setting the segmentation variable only on the honey pot page, so you don't wipe out the value on a subsequent page. Alternatively, check the cookie that holds the segmentation value before setting any other segmentation values.
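That guard could be sketched like this. The `__utmv` cookie name is the real cookie legacy GA uses to store the segmentation value, but the helper functions and their logic are my own simplification, not GA code:

```javascript
// Sketch: don't overwrite a "bad bot" flag with a later, friendlier value.
// __utmv is the legacy GA segmentation cookie; the helpers are hypothetical.
function hasSegment(cookieString, value) {
  return cookieString.indexOf("__utmv") !== -1 &&
         cookieString.indexOf(value) !== -1;
}

function safeSetVar(cookieString, newValue, setVar) {
  if (hasSegment(cookieString, "bad bot")) return false; // keep the flag
  setVar(newValue); // e.g. pass in pageTracker._setVar on a real page
  return true;
}
```

On a real page you'd pass `document.cookie` and the tracker's `_setVar` method; here they're parameters so the logic stands alone.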
This solution is far from perfect. For starters, a bad bot may not visit your honey pot page - some will and some won't. These bad bots are not spiders; they're not trying to find every page on your site. They're programmed to act like humans, so they'll just click on a few links and leave. Still, if you do see "bad bot" in your segmentation values, that tells you you may have bigger problems.
A Honey Pot Without Segmentation
You can also use the honey pot concept without segmentation. In this case you'd simply look for the honey pot page in your content reports and examine entrance paths to the page to determine who's sending you bad traffic.
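If you happen to have raw pageview data (say, exported from server logs) rather than just GA's built-in reports, the same idea can be sketched in a few lines. The record shape (`{path, referrer}`) and the honey pot path are assumptions for illustration:

```javascript
// Sketch: count honey-pot hits grouped by referrer, to see who's sending
// the bad traffic. The hit-record shape is hypothetical.
function honeypotReferrers(hits, honeypotPath) {
  var counts = {};
  hits.forEach(function (h) {
    if (h.path !== honeypotPath) return;
    var ref = h.referrer || "(direct)";
    counts[ref] = (counts[ref] || 0) + 1;
  });
  return counts;
}
```

The resulting tally is a crude stand-in for the entrance-path report: the referrers with the most honey pot hits are your prime suspects.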
This may be sufficient. Segmentation variables can be applied to many different types of reports, but because the variable won't be set every time a bad bot crawls your site, they may not be as helpful as you'd think. The general knowledge that bad bots are crawling your site, plus their entrance paths, may be as good as it gets.
But do realize that just because a bot says it's coming from xyz.com doesn't mean it actually came from xyz.com - referrers can be faked too. The particulars of your situation will determine how best to interpret what you see in the data.
Segmentation + Honey Pot, Take 2 (best option)
Using both of the approaches above is probably the best solution. That gives you coverage on both types of possible bad-bot implementation: bots that keep cookies get flagged by the segmentation variable, and those that don't will still show up in your content reports as visits to the honey pot page.
Let me reiterate that there's no way to catch every bad bot - at least, I can't think of one. But if you use some of the strategies I've mentioned here and can see that bad bots are crawling your site and affecting your analytics, you're ahead of the game. The worst-case scenario is that bad bots change your data, lead you to false conclusions, and cause you to make bad decisions. If you know your data have been affected, you're more likely to think twice and not rely on faulty numbers.
And also remember, as I've discussed in a previous post, there are ways to use segmentation to set up bot-free zones in your web analytics. Segments like "paying customers" or "registered users" can be relatively immune to bot traffic.