Traditionally, every process was single-threaded, which made scheduling straightforward. Modern operating systems, however, use multithreading, and that complicates scheduling. We also need to consider whether threads are implemented in the kernel or in user space.
If threading is done in user space and the kernel knows nothing about it, scheduling happens per process, as before. With kernel threads the situation is different: the kernel is aware of every thread and can choose individual threads from any process. In multiprocessor scheduling, the issue is that it becomes harder to judge which thread should run next.
Uniprocessor scheduling is one-dimensional, while multiprocessor scheduling is two-dimensional: the scheduler has to decide both which thread to run and which CPU to run it on. This extra dimension makes scheduling considerably more complicated.
Another complicating issue is the relationship between the threads. In some systems all the threads are unrelated and have nothing to do with one another; in others they belong together and work as a group.
There are several ways to schedule threads on a multiprocessor; the main ones are described below.

1. Gang Scheduling

Consider threads A0 and A1 belonging to process A, and B0 and B1 belonging to process B. Threads A0 and B0 are timeshared on CPU 0; likewise, A1 and B1 on CPU 1. Threads A0 and A1 need to communicate often: A0 sends a message to A1, and A1 replies to A0.
Suppose A0 and B1 start first, as shown in the figure below.
In time slice 0, A0 sends a request to A1, but A1 does not get to run until time slice 1 begins at 100 msec. It then replies immediately, but A0 does not see the reply until it runs again at 200 msec. The net result is one request-reply sequence every 200 msec, which is poor performance.
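The timing above can be sketched numerically. The following toy model (all names are hypothetical; 100-msec quanta are assumed from the text) computes when each thread next gets to run, given that A0 occupies even time slices on CPU 0 and A1 occupies odd time slices on CPU 1:

```python
QUANTUM_MS = 100

def next_run_time(now_ms, runs_in_even_slices):
    """Earliest time >= now_ms at which the thread is scheduled."""
    slot = now_ms // QUANTUM_MS
    wanted_parity = 0 if runs_in_even_slices else 1
    if slot % 2 != wanted_parity:
        slot += 1                       # wait for the next matching slice
    return max(slot * QUANTUM_MS, now_ms)

# A0 (even slices) sends a request at t = 0.
t_request = next_run_time(0, runs_in_even_slices=True)
# A1 (odd slices) can only reply once it is scheduled.
t_reply = next_run_time(t_request + 1, runs_in_even_slices=False)
# A0 sees the reply the next time it runs.
t_seen = next_run_time(t_reply + 1, runs_in_even_slices=True)

print(t_request, t_reply, t_seen)  # 0 100 200
```

The round trip takes two full quanta, confirming the 200-msec request-reply cycle described above.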
The solution to this problem is gang scheduling, an outgrowth of an earlier concept called co-scheduling. Gang scheduling has three parts:
• Groups of related threads are scheduled as a unit, or gang.
• All members of a gang run simultaneously on different timeshared CPUs.
• All gang members start and end their time slices together.
In gang scheduling, all CPUs are scheduled synchronously. Time is divided into discrete quanta. At the start of each quantum, all CPUs are rescheduled and a new thread is started on each one. At the start of the next quantum, another scheduling event occurs. In between, no scheduling is done. If a thread blocks, its CPU stays idle until the quantum ends.
An example of how gang scheduling works is given in the figure below. Here we have a multiprocessor with six CPUs being used by five processes, A through E, with a total of 24 ready threads. During time slot 0, threads A0 through A5 are scheduled and run. During time slot 1, threads B0, B1, B2, C0, C1, and C2 are scheduled and run. During time slot 2, D's five threads and E0 get to run. The remaining six threads belonging to process E run in time slot 3. Then the cycle repeats, with slot 4 being the same as slot 0, and so on.
The idea of gang scheduling is to have all the threads of a process run together, so that if one of them sends a request to another, the receiver gets the message almost immediately and can reply almost immediately. In the figure above, since all the A threads run together during one quantum, they may exchange a very large number of messages in that quantum.
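The slot layout from the figure can be reproduced with a simple greedy packer. This is only an illustrative sketch, not the actual algorithm from the figure's source: it fills each time slot with whole gangs first-fit, spilling a gang into the next slot when the CPUs run out, which yields exactly the assignment described above (A has 6 threads, B and C have 3 each, D has 5, E has 7):

```python
NUM_CPUS = 6
gangs = {"A": 6, "B": 3, "C": 3, "D": 5, "E": 7}  # 24 ready threads

def pack_slots(gangs, num_cpus):
    """Greedy first-fit packing of gang threads into time slots.
    Each slot holds at most one thread per CPU."""
    slots = [[]]
    for proc, count in gangs.items():
        for i in range(count):
            if len(slots[-1]) == num_cpus:
                slots.append([])        # current slot full: open a new one
            slots[-1].append(f"{proc}{i}")
    return slots

for t, slot in enumerate(pack_slots(gangs, NUM_CPUS)):
    print(f"slot {t}: {slot}")
```

Slot 0 holds A0 through A5, slot 1 holds B's and C's threads, slot 2 holds D's five threads plus E0, and slot 3 holds E's remaining six threads, matching the figure.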
Co-scheduling of the Medusa OS
Co-scheduling was proposed by Ousterhout for the Medusa operating system on the Cm* multiprocessor. In co-scheduling, all runnable processes of an application are scheduled on the processors simultaneously. Whenever any process of the application is preempted, all of its processes are preempted together. Effectively, co-scheduling context-switches between the task sets of several different applications: all the tasks of one application run for a time slice, then all the tasks of another application run for a time slice, and so on.
Co-scheduling reduces the resource waste caused by processes spinning on locks while waiting for a preempted task to release a critical section. But it reduces neither the context-switching overhead nor the performance degradation due to cache corruption. The cache corruption problem may even be aggravated by co-scheduling: by the time an application is rescheduled, its working set has very likely been flushed out of all the caches. Note that the designers of the Medusa operating system did not face this problem, because the Cm* multiprocessor did not employ caches.
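The "all tasks of one application per time slice" behaviour can be sketched as a round robin over applications (application names and task labels below are made up for illustration):

```python
from itertools import cycle

# Each application owns a set of tasks; co-scheduling switches whole
# applications in and out, never individual tasks.
apps = {"render": ["r0", "r1", "r2"], "solver": ["s0", "s1"]}

def co_schedule(apps, num_slices):
    """Return (application, tasks) per time slice, round-robin over apps."""
    order = cycle(apps)                      # round-robin over applications
    timeline = []
    for _ in range(num_slices):
        app = next(order)
        timeline.append((app, list(apps[app])))  # all tasks run together
    return timeline

print(co_schedule(apps, 4))
```

In each slice, every task of the chosen application occupies a processor simultaneously, so no task spins on a lock held by a preempted sibling.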
2. Smart Scheduling
First we consider scheduling unrelated threads; afterward we will look at related threads. For unrelated threads there is a simple algorithm: a single system-wide data structure, such as a list or priority queue, holds all the ready threads. The structure is shown in the figure below. Here 16 CPUs are currently busy and 14 ready threads are waiting. When CPU 4 finishes its current work, it selects the highest-priority thread from the queue, namely A. The CPUs take threads one by one as they become free, which provides automatic load balancing: it can never happen that one CPU is idle while another is overloaded.
Multiprocessor scheduling using list data structure
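A minimal sketch of such a system-wide ready queue is shown below (class and method names are my own; lower numbers are assumed to mean higher priority). Any idle CPU locks the shared structure and pops the best thread:

```python
import heapq
import threading

class GlobalReadyQueue:
    """Single system-wide ready queue shared by all CPUs."""

    def __init__(self):
        self._heap = []                    # min-heap of (priority, name)
        self._lock = threading.Lock()      # one lock: potential contention

    def add(self, priority, name):
        with self._lock:
            heapq.heappush(self._heap, (priority, name))

    def pick_next(self):
        """Called by an idle CPU: take the highest-priority ready thread."""
        with self._lock:
            return heapq.heappop(self._heap)[1] if self._heap else None

q = GlobalReadyQueue()
for prio, name in [(2, "B"), (1, "A"), (3, "C")]:
    q.add(prio, name)
print(q.pick_next())  # "A": the highest-priority waiting thread
```

The single lock is exactly why this design suffers contention as the CPU count grows, which is the first disadvantage discussed next.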
This type of scheduling has two disadvantages: potential contention for the single scheduling data structure as the number of CPUs grows, and the usual overhead of a context switch when a process blocks for I/O.
A context switch may also occur when a process's or thread's quantum expires. On a multiprocessor this has consequences that do not arise on a uniprocessor. Suppose the process holds a spinlock, which is not unusual on multiprocessors, as discussed above. Other CPUs waiting on the spinlock simply waste their time spinning until that process is scheduled again and releases the lock. On a uniprocessor, spinlocks are rarely used, so if a process is suspended while holding a mutex and another process tries to acquire it, the second process is immediately blocked and little time is wasted.
To solve this problem, smart scheduling, proposed by Zahorjan et al., is used. The smart scheduler has two nice features:
• It avoids preempting a process while it is inside a critical section.
• It avoids rescheduling a process that was busy-waiting on a lock whose holder was preempted inside its critical section.
The mechanism uses a flag to indicate that a process currently holds a spinlock: the process sets the flag when it acquires the lock and clears it when it exits the critical section. The scheduler does not preempt a process whose flag is set; instead it gives the process a little more time to complete its critical section and release the lock.
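The flag mechanism can be sketched as follows (all names are hypothetical, and the real check happens inside the kernel scheduler, not in user code):

```python
class Task:
    def __init__(self, name):
        self.name = name
        self.in_critical_section = False  # the "holds a spinlock" flag

    def acquire_spinlock(self):
        self.in_critical_section = True   # set on entering the section

    def release_spinlock(self):
        self.in_critical_section = False  # clear on exit

def should_preempt(task, quantum_expired):
    """Preempt only if the quantum is up AND the task is not inside a
    critical section; otherwise grant a short extension."""
    return quantum_expired and not task.in_critical_section

t = Task("worker")
t.acquire_spinlock()
print(should_preempt(t, quantum_expired=True))   # False: holding the lock
t.release_spinlock()
print(should_preempt(t, quantum_expired=True))   # True: safe to preempt
```

A real implementation would also bound the extension, so a buggy task cannot hold its CPU forever by leaving the flag set.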
The smart scheduler thus eliminates the resource waste caused by spinning on a lock whose holder has been preempted inside its critical section.
Scheduling in the NYU Ultracomputer
The NYU Ultracomputer combines the co-scheduling and smart-scheduling techniques. Tasks can be formed into groups, and the tasks in a group can be scheduled in one of the following ways:
• A task can be scheduled or preempted in the normal manner.
• All tasks in the group are scheduled or preempted simultaneously, as in co-scheduling.
• Tasks in the group are never preempted.
In addition, a task can prevent its own preemption irrespective of the scheduling policy of its group. This facility can be used to implement spinlocks efficiently, as in smart scheduling.
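The three group policies and the per-task preemption override can be sketched like this (the enum and function names are my own shorthand, not the Ultracomputer's actual interface):

```python
from enum import Enum, auto

class GroupPolicy(Enum):
    NORMAL = auto()        # members scheduled/preempted individually
    GANG = auto()          # members scheduled/preempted together
    NO_PREEMPT = auto()    # members are never preempted

def preemptible(policy, pinned):
    """A task with `pinned` set blocks its own preemption regardless of
    the group policy (e.g. to implement a spinlock efficiently)."""
    return not (pinned or policy is GroupPolicy.NO_PREEMPT)

print(preemptible(GroupPolicy.NORMAL, pinned=True))   # False: pinned wins
print(preemptible(GroupPolicy.GANG, pinned=False))    # True
```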
3. Affinity-Based scheduling
Affinity-based scheduling, proposed by Lazowska and Squillante, is the first scheduling policy to target the cache corruption problem. Here, a process is scheduled on the processor where it last executed. Suppose process A has run for a long time on CPU k; CPU k's cache will then be full of A's blocks. If A gets to run again soon, it may perform better on CPU k, because the cache may still contain some of its blocks. Having cache blocks preloaded increases the cache hit rate and thus the processing speed. In addition, the TLB may also still contain the right pages, reducing TLB faults.
The basic idea here is to make a serious effort to have a process run on the same CPU it ran on last time. One way to create this affinity is to use a two-level scheduling algorithm. When a process is created, it is assigned to a CPU, for example, based on which one has the smallest load at that moment. This assignment of processes to CPUs is the top level of the algorithm. As a result, each CPU acquires its own collection of processes.
The actual scheduling of the processes is the bottom level of the algorithm. It is done by each CPU separately, using priorities or some other means. By trying to keep a process on the same CPU, cache affinity is maximized. However, if a CPU has no processes to run, it takes one from another CPU rather than go idle. Two-level scheduling has three benefits. First, it distributes the load roughly evenly over the available CPUs. Second, advantage is taken of cache affinity where possible. Third, by giving each CPU its own ready list, contention for the ready lists is minimized, because attempts to use another CPU's ready list are relatively infrequent.
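The two levels described above can be sketched as follows. This is a minimal illustration under stated assumptions (per-CPU FIFO queues, least-loaded admission, stealing only when a CPU's own queue is empty), not the authors' actual implementation:

```python
from collections import deque

class TwoLevelScheduler:
    def __init__(self, num_cpus):
        self.queues = [deque() for _ in range(num_cpus)]  # per-CPU ready lists

    def admit(self, proc):
        """Top level: assign a new process to the least-loaded CPU."""
        target = min(range(len(self.queues)),
                     key=lambda c: len(self.queues[c]))
        self.queues[target].append(proc)
        return target

    def pick_next(self, cpu):
        """Bottom level: run a local process; steal only when idle."""
        if self.queues[cpu]:
            return self.queues[cpu].popleft()     # preserves cache affinity
        for q in self.queues:                     # work stealing fallback
            if q:
                return q.popleft()
        return None                               # nothing ready anywhere

sched = TwoLevelScheduler(num_cpus=2)
for p in ["P0", "P1", "P2"]:
    sched.admit(p)         # P0 -> CPU 0, P1 -> CPU 1, P2 -> CPU 0
print(sched.pick_next(0))  # "P0" from CPU 0's own queue
print(sched.pick_next(1))  # "P1"
print(sched.pick_next(1))  # "P2": stolen from CPU 0, avoiding idleness
```

Keeping each process on its admission CPU captures the affinity benefit, while the stealing path implements the "take one from another CPU rather than go idle" rule.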