I'm sharing this for anyone else who has run into a situation where dedup backups run far slower than expected even though there is no actual bottleneck anywhere in the hardware.
In my situation, I can push more than 500 MB/sec doing a simple file copy of a VHD between a Hyper-V host and my backup server target, but backup job rates don't come anywhere close to that. During backups there is no obvious bottleneck - not even a single-core CPU bottleneck from dedup - on either the source server or the target server.
The short story is that the Hyper-V agent only keeps a queue depth of 1 - a single outstanding IO - when reading the source, so the source storage array never sees the IO pressure that would cause it to scale up read-ahead for better sequential read throughput. A normal file copy keeps the queue depth at 4; the source storage sees IO pressure and scales up read-ahead, so you get much better throughput.
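To make the queue-depth idea concrete, here's a rough user-space sketch (my own illustration, not the agent's actual code): reading a file sequentially with `queue_depth` reads in flight at once, using a thread pool and `os.pread` (POSIX-only). With `queue_depth=1` you get the agent's one-outstanding-IO pattern; with `queue_depth=4` you get something closer to a normal file copy, which is what lets the array's read-ahead kick in.

```python
import os
import tempfile
import concurrent.futures

CHUNK = 1024 * 1024  # 1 MiB per read request

def read_sequential(path, queue_depth):
    """Read the whole file in CHUNK-sized pieces, keeping up to
    `queue_depth` reads in flight at once - a crude stand-in for the
    number of outstanding IOs the backup agent issues against the
    source storage."""
    size = os.path.getsize(path)
    offsets = range(0, size, CHUNK)

    def read_at(off):
        # Each worker opens its own handle; os.pread takes an explicit
        # offset, so concurrent reads don't fight over a file position.
        with open(path, 'rb') as f:
            return len(os.pread(f.fileno(), CHUNK, off))

    with concurrent.futures.ThreadPoolExecutor(max_workers=queue_depth) as pool:
        return sum(pool.map(read_at, offsets))

# Demo: queue depth 1 (what the agent does) vs 4 (a normal copy).
# Both read the same number of bytes; on real hardware the QD=4 run
# is what triggers the array to scale up read-ahead.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(8 * CHUNK))
    path = tmp.name

assert read_sequential(path, queue_depth=1) == read_sequential(path, queue_depth=4)
os.remove(path)
```

On a small temp file the timing difference won't show, of course - the point is only the IO pattern: one request at a time versus several outstanding at once.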
My particular storage lets me manually specify a large read-ahead, which I now do in pre-commands as a workaround. It increases job rates by more than 50% and prevents cases where unrelated IO causes inexplicable variations in job rates. Full details are at the link below.
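For illustration only - my array has its own CLI for this, which I won't reproduce here - on a Linux source the same pre/post-command trick could look like the following. The device path is a placeholder; substitute whatever backs your VHDs.

```shell
# Hypothetical pre-command: force a large read-ahead before the backup
# job starts. blockdev's --setra value is in 512-byte sectors, so
# 65536 sectors = 32 MiB of read-ahead. /dev/sdb is a placeholder.
blockdev --setra 65536 /dev/sdb

# Hypothetical post-command: restore the typical default (256 sectors
# = 128 KiB) once the job completes.
blockdev --setra 256 /dev/sdb
```

The idea is the same regardless of vendor: instead of waiting for the array to notice IO pressure that never comes (because of the queue depth of 1), you tell it up front to read ahead aggressively for the duration of the job.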
Frankly, I think this should be embarrassing to Symantec - the fact that IO queue depth matters is such a basic storage concept. It's incredibly frustrating to have good hardware completely underutilized because of this kind of software idiocy. If they fixed it with some added concurrency, I could probably double or triple my backup rates. So far, no one at Symantec has cared at all.
Anyway, if after investigating you find this could be a cause in your environment, please thumbs-up my idea at the link above. Maybe others will find that forcing higher read-ahead on the source storage during backups helps their job rates too. (BTW, you also want to make sure your VHD access is truly sequential - not fragmented, etc.) Maybe someone at Symantec will pay attention. But maybe it's time to look at other products.