for the pixel intensities in M3 would be... a single scalar value between 0 and 1? (That is, a "vector" of length 1, consisting of just the original pixel intensity.)
Am I misunderstanding something? Or is this maybe a case of "Yes, you could just do that much simpler thing in this simplified example, but in general you need the complicated approach that we're trying to demonstrate"?
> learned token embeddings
for the pixel intensities in M3 would be... a single scalar value between 0 and 1? (That is, a "vector" of length 1, consisting of just the original pixel intensity.)
Am I misunderstanding something? Or is this maybe a case of "Yes, you could just do that much simpler thing in this simplified example, but in general you need the complicated approach that we're trying to demonstrate"?