OK, so if our main computing load these days is this kind of highly parallelizable operation, does each individual operation have to be fast? Instead of using 16 adder circuits to add 16 values to 16 other values in parallel, and doing that 16 times, might we not add 256 values to 256 other values in a single pass using only simple logical operations? Just calculate every result one bit at a time: each step handles one bit position across all 256 values and passes the carries along to the next step.
A CPU with only vector logical operations could be much simpler, allowing either processing of very large vectors or multiple independent processors on a single chip.
This also has the cute side effect of making arithmetic on arbitrarily sized numbers as easy as arithmetic on numbers that fit into registers: a wider number just means more one-bit steps.
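To make the idea concrete, here is a minimal sketch in Python, not a description of any real hardware. It assumes the values are stored "transposed" as bit-planes: plane i is one wide word holding bit i of every value, so each lane of the word belongs to a different addition. Python integers stand in for the wide vector registers, and the names to_planes, from_planes and add_planes are made up for this illustration. The adder itself uses only AND, OR and XOR on whole planes.

```python
def to_planes(values, width):
    """Transpose a list of non-negative integers into `width` bit-planes.

    Plane i collects bit i of every value; lane j of a plane belongs to values[j].
    """
    planes = []
    for i in range(width):
        plane = 0
        for lane, v in enumerate(values):
            plane |= ((v >> i) & 1) << lane
        planes.append(plane)
    return planes


def from_planes(planes, lanes):
    """Inverse of to_planes: rebuild one integer per lane."""
    return [
        sum(((p >> lane) & 1) << i for i, p in enumerate(planes))
        for lane in range(lanes)
    ]


def add_planes(a, b):
    """Add all lanes at once, one bit position per step (ripple carry).

    Only AND, OR and XOR are applied, each to a whole plane.
    """
    carry = 0
    out = []
    for x, y in zip(a, b):
        t = x ^ y
        out.append(t ^ carry)            # sum bit for every lane
        carry = (x & y) | (carry & t)    # carry into the next bit position
    out.append(carry)                    # final carry-out becomes the top plane
    return out


if __name__ == "__main__":
    # 256 independent additions of 10-bit numbers.
    a = list(range(256))
    b = [3 * i + 7 for i in range(256)]
    total = from_planes(add_planes(to_planes(a, 10), to_planes(b, 10)), 256)
    assert total == [x + y for x, y in zip(a, b)]
```

The number of steps depends only on the width of the numbers, not on how many of them there are, and each step is five logical operations on a full plane. Adding 200-bit numbers uses exactly the same code as adding 10-bit ones; only the number of planes changes.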