1 00:00:03,340 --> 00:00:09,400 Okay, so let's switch topics here. We've finished our vector processors, 2 00:00:11,740 --> 00:00:14,800 and we're going to start talking about a technique 3 00:00:15,100 --> 00:00:20,009 that looks at how to exploit a different form of parallelism, 4 00:00:20,009 --> 00:00:24,660 thread-level parallelism, than we've been looking at up to this point. 5 00:00:25,080 --> 00:00:31,020 So this is not data-level parallelism. This is actual thread-level parallelism or 6 00:00:31,020 --> 00:00:39,757 process-level parallelism in your machine. So what's the motivation for changing your 7 00:00:39,757 --> 00:00:43,773 computer architecture to run threads effectively? 8 00:00:43,773 --> 00:00:48,979 So we've talked about threading or multiprogramming in the past, where you 9 00:00:48,979 --> 00:00:54,907 time-slice between different processes. That's not what this is about. 10 00:00:54,907 --> 00:01:00,594 This is about either executing multiple programs at the same 11 00:01:00,594 --> 00:01:05,512 time, or time-multiplexing between processes at the granularity of 12 00:01:05,512 --> 00:01:10,520 a single instruction. So where this really came from is that 13 00:01:10,980 --> 00:01:15,823 sometimes you can't really extract instruction-level 14 00:01:15,823 --> 00:01:20,950 parallelism, which is what out-of-order processors 15 00:01:20,950 --> 00:01:25,788 and superscalars try to exploit. And sometimes you can't necessarily 16 00:01:25,788 --> 00:01:31,117 exploit data-level parallelism in a program, because the data-level parallelism 17 00:01:31,117 --> 00:01:37,300 doesn't necessarily work out, or you don't have the dense-matrix sorts of 18 00:01:37,300 --> 00:01:41,279 codes in which it's very easy to find data-level parallelism. 19 00:01:41,279 --> 00:01:45,756 Instead, sometimes your workloads have thread-level parallelism, 20 00:01:45,756 --> 00:01:51,037 and what we mean by thread-level parallelism is that there are 21 00:01:51,037 --> 00:01:54,742 independent sequential jobs that need to happen. 22 00:01:54,742 --> 00:02:00,300 An example would be that you have to operate on a million things, 23 00:02:00,680 --> 00:02:04,995 but the control flow for each of those million things is wildly different. 24 00:02:04,995 --> 00:02:09,776 So the most efficient way to go about doing this is just to actually have 25 00:02:09,776 --> 00:02:15,097 a process or a job per work unit that you need to work on. 26 00:02:15,097 --> 00:02:18,953 A good example of this is network processing. 27 00:02:18,953 --> 00:02:25,300 You are trying to build a firewall, and a packet comes in on your network card. 28 00:02:26,040 --> 00:02:30,652 Your firewall wants to inspect this packet and look for, I don't know, 29 00:02:30,652 --> 00:02:34,127 malicious attacks. It wants to 30 00:02:34,127 --> 00:02:39,435 look for something like viruses, or attack code sequences, or 31 00:02:39,435 --> 00:02:48,180 trojans, or something else like that. This isn't data-level parallelism, because 32 00:02:48,180 --> 00:02:53,430 you can't have each of the operations working on these packets be the 33 00:02:53,430 --> 00:02:56,045 same. Depending on the type of packet, you're 34 00:02:56,045 --> 00:02:59,049 going to make wildly different control-flow decisions.
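
[Editor's sketch] To make the thread-per-packet idea concrete, here is a minimal sketch using POSIX threads. The packet structure and the inspection rules are invented for illustration; they are not from the lecture, which continues with the TCP/UDP decision below.

    /* One thread per incoming packet; control flow diverges freely
       per packet, which is exactly what lockstep data-parallel
       hardware cannot do cheaply. Compile with -pthread. */
    #include <pthread.h>
    #include <stdio.h>

    enum proto { TCP, UDP };

    struct packet {
        enum proto proto;
        int        port;
        /* payload omitted for brevity */
    };

    static void *inspect(void *arg)
    {
        struct packet *p = arg;

        if (p->proto == TCP) {
            if (p->port == 22)
                puts("TCP/22: apply SSH rules");
            else if (p->port == 80)
                puts("TCP/80: scan HTTP payload for attack signatures");
            else
                puts("TCP: generic stateful inspection");
        } else {
            puts("UDP: stateless filtering");
        }
        return NULL;
    }

    int main(void)
    {
        struct packet pkts[] = { {TCP, 22}, {UDP, 53}, {TCP, 80} };
        pthread_t tid[3];

        for (int i = 0; i < 3; i++)      /* one thread per work unit */
            pthread_create(&tid[i], NULL, inspect, &pkts[i]);
        for (int i = 0; i < 3; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }

Each packet gets its own thread of control, so the TCP path and the UDP path can take wildly different branches without slowing each other down.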
35 00:02:59,049 --> 00:03:01,720 So for instance, whether the packet is TCP versus UDP 36 00:03:01,720 --> 00:03:04,921 is a big choice right at the beginning. 37 00:03:04,921 --> 00:03:10,856 And then, what port is it on? Is it SSH, is it HTTP, is it web traffic? 38 00:03:10,856 --> 00:03:16,478 You know, you're going to do lots of different processing in your firewall 39 00:03:16,478 --> 00:03:21,944 based on that decision tree. So typically the most efficient way to go 40 00:03:21,944 --> 00:03:27,801 out and attack this is to just have a job or a process or possibly a 41 00:03:27,801 --> 00:03:35,766 thread per data element that comes in. Another reason you might want to do this 42 00:03:35,766 --> 00:03:40,972 is that there are examples of applications where having parallel threads 43 00:03:40,972 --> 00:03:46,189 will actually solve the problem faster. So, believe it or not, take the traditional 44 00:03:46,189 --> 00:03:50,080 traveling salesman problem. If you throw threads at that, 45 00:03:50,080 --> 00:03:54,804 you can actually get super-linear speed-ups, because you can 46 00:03:54,804 --> 00:03:59,004 have other threads basically pruning off the search tree faster. 47 00:03:59,004 --> 00:04:03,378 The threads can share data with each other, and when another 48 00:04:03,378 --> 00:04:07,519 thread comes to a certain point, it doesn't have to recompute a certain 49 00:04:07,519 --> 00:04:12,186 location. It can just use the previous result of another thread that already got 50 00:04:12,186 --> 00:04:16,268 to that point and know that it doesn't have to go down a certain path. 51 00:04:16,268 --> 00:04:20,526 So this is kind of like dynamic programming, if you will, but for some 52 00:04:20,526 --> 00:04:24,575 sort of large search. That's one very plain 53 00:04:24,575 --> 00:04:28,704 example of having threads get you super-linear speed-up. 54 00:04:28,938 --> 00:04:35,171 But you can also just use threads to go after certain types of parallelism that 55 00:04:35,171 --> 00:04:41,248 are hard to get at with data-level parallelism. Also, you can actually hide 56 00:04:41,248 --> 00:04:46,780 latencies by using threads. So as an example of hiding latency using threads, 57 00:04:46,780 --> 00:04:52,780 let's say you have a program that you know typically misses in your cache, 58 00:04:53,320 --> 00:04:59,275 but you have parallelism in this program. One way to attack it is to cut up the 59 00:04:59,275 --> 00:05:05,081 problem and have a thread per sub-problem, and when one of the threads 60 00:05:05,081 --> 00:05:08,700 blocks on a cache miss, switch to another thread. 61 00:05:09,180 --> 00:05:13,580 So while you're waiting for memory to respond, you can do some useful work. 62 00:05:16,520 --> 00:05:22,425 Okay, so let's look at this from a pipeline perspective and look at ways to 63 00:05:22,425 --> 00:05:25,755 recover cycles. So here we have some loads. 64 00:05:25,755 --> 00:05:32,177 A load, then a load that's dependent on 65 00:05:32,177 --> 00:05:36,854 the first load, and an add that's dependent on the second load. 66 00:05:36,854 --> 00:05:41,135 And then finally, a store which is dependent on the add. 67 00:05:41,135 --> 00:05:47,002 So the downside of doing this is you've got all this dead time. 68 00:05:47,002 --> 00:05:51,680 All this purple time here on the slide is dead time on the processor.
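
[Editor's sketch] Stepping back briefly before the pipeline walkthrough continues: here is a minimal sketch of the shared-bound pruning behind those super-linear traveling-salesman speed-ups. The bounds, costs, and the search stub are hypothetical; only the sharing pattern comes from the lecture.

    /* Worker threads explore disjoint parts of a search tree and
       publish the best tour cost seen so far, so every other thread
       can prune subtrees that cannot beat it. Compile with -pthread. */
    #include <limits.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    static long best = LONG_MAX;       /* best tour cost found by any thread */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Publish a completed tour's cost if it beats the global best. */
    static void publish(long cost)
    {
        pthread_mutex_lock(&lock);
        if (cost < best)
            best = cost;
        pthread_mutex_unlock(&lock);
    }

    /* A subtree is worth exploring only if its lower bound can still
       beat the best tour some other thread has already found. */
    static int worth_exploring(long lower_bound)
    {
        pthread_mutex_lock(&lock);
        long b = best;
        pthread_mutex_unlock(&lock);
        return lower_bound < b;
    }

    /* Stand-in for a real branch-and-bound search over one subtree. */
    static void *search_subtree(void *arg)
    {
        long bound = (long)(intptr_t)arg;   /* hypothetical lower bound */
        if (worth_exploring(bound))
            publish(bound + 42);            /* pretend we found a tour  */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (long i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, search_subtree,
                           (void *)(intptr_t)(i * 100));
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        printf("best tour cost: %ld\n", best);
        return 0;
    }

As soon as one thread publishes a cheap tour, every other thread's worth_exploring test starts cutting off subtrees it would otherwise have searched, which is where the super-linear behavior comes from.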
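[Editor's sketch] And a sketch of the latency-hiding idea, again assuming POSIX threads: the array size and the two-way split are arbitrary, and the pay-off depends on hardware that can overlap one thread's cache misses with another thread's work, which is exactly where the lecture is headed.

    /* Cut the problem in two and give each half its own thread.
       When one thread stalls on a cache miss, a multithreaded core
       can issue instructions from the other, so useful work covers
       the memory latency. Compile with -pthread. */
    #include <pthread.h>
    #include <stdio.h>

    #define N (1L << 22)   /* large enough that walks miss in the cache */

    static long data[N];

    struct half { long lo, hi, sum; };

    static void *sum_half(void *arg)
    {
        struct half *h = arg;
        long s = 0;
        for (long i = h->lo; i < h->hi; i++)
            s += data[i];
        h->sum = s;
        return NULL;
    }

    int main(void)
    {
        struct half halves[2] = { {0, N / 2, 0}, {N / 2, N, 0} };
        pthread_t t[2];

        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, sum_half, &halves[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);

        printf("sum = %ld\n", halves[0].sum + halves[1].sum);
        return 0;
    }

Now, back to the pipeline.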
69 00:05:54,040 --> 00:05:58,047 We could throw an out-of-order superscalar at this. 70 00:05:58,047 --> 00:06:08,986 It's not going to do any better. Well, yes? So you're right: 71 00:06:08,986 --> 00:06:13,723 if we're doing bypassing, we can pull these one cycle earlier. 72 00:06:13,723 --> 00:06:21,091 But we still have a lot of dead time. And if you were to have an out-of-order 73 00:06:21,091 --> 00:06:25,093 superscalar, out-of-order would not actually help you 74 00:06:25,093 --> 00:06:31,022 out in this case either, because there is an actual dependency chain through all these 75 00:06:31,022 --> 00:06:36,940 instructions. Hmm, now that's not great. 76 00:06:38,080 --> 00:06:46,117 So can we come up with some ideas to cope 77 00:06:46,117 --> 00:06:48,852 with this? So one thing we said is we can add 78 00:06:48,852 --> 00:06:53,174 bypassing to decrease the time here. But an out-of-order superscalar is 79 00:06:53,174 --> 00:06:58,314 not going to make this go faster, so that technique doesn't work. What other 80 00:06:58,314 --> 00:07:00,948 techniques have we talked about? Vector processors? 81 00:07:00,948 --> 00:07:04,342 Well, there are no vectors of data here. We could go wide. 82 00:07:04,342 --> 00:07:08,323 Well, that doesn't help. We're still going to be executing only one 83 00:07:08,323 --> 00:07:10,196 instruction per cycle. 84 00:07:10,196 --> 00:07:12,888 We can try VLIW. That doesn't help either. 85 00:07:12,888 --> 00:07:17,161 If an out-of-order superscalar can't do it, VLIW probably can't do it. 86 00:07:17,161 --> 00:07:21,317 So we have all these dead cycles, and we want to try to recover some of these 87 00:07:21,317 --> 00:07:22,020 dead cycles.
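
[Editor's sketch] To see why no amount of single-thread cleverness helps, here is the dependency chain from the slide written out as C. The names are made up; what matters is that each operation needs the previous one's result.

    /* load -> dependent load -> dependent add -> dependent store:
       a serial chain, so out-of-order issue, wider superscalar, and
       VLIW all have nothing independent to run during the load-miss
       latency. A multithreaded pipeline could instead fill those
       dead cycles with another thread's instructions. */
    long chain(long **head, long *out)
    {
        long *p = *head;   /* load 1                              */
        long  v = *p;      /* load 2: address depends on load 1   */
        long  s = v + 1;   /* add: depends on load 2's value      */
        *out = s;          /* store: depends on the add           */
        return s;
    }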