### Differences between Multiprocessing and Multithreading
| Attribute | Multiprocess | Multithreading |
|---|---|---|
| Development | Can be easier. Uses fork(). | Uses the threads API. |
| Memory overhead | Separate address space per process consumes some memory resources.| Small. Requires only extra stack and register space.|
| CPU overhead |Cost of fork()/exit(), which includes MMU work to manage address spaces. |Small. API calls. |
| Communication | Via IPC. This incurs CPU cost, including context switching for moving data between address spaces, unless shared memory regions are used. | Fastest. Direct access to shared memory. Integrity via synchronization primitives (e.g., mutex locks). |
| Memory usage | While some memory may be duplicated, separate processes can exit() and return all memory back to the system. | Via system allocator. This may incur some CPU contention from multiple threads, and fragmentation before memory is reused. |
Changing a Program's Launch Priority
Unix has always provided a nice() system call for adjusting process priority, which sets a nice-ness value. Positive nice values result in lower process priority (nicer), and negative values, which can be set only by the superuser (root), result in higher priority. A nice(1) command became available to launch programs with nice values, and a renice(1M) command was later added (in BSD) to adjust the nice value of already running processes. The man page from Unix 4th edition provides this example [4]: "The value of 16 is recommended to users who wish to execute long-running programs without flak from the administration."
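As a usage sketch in the modern syntax (the program name and PID are hypothetical), launching a long-running program at the suggested niceness of 16 and later pushing it even lower in priority with renice:
```
# launch at niceness 16 (lower priority), in the background
nice -n 16 ./long_batch_job &
# later: raise the niceness of the running process further (an unprivileged user can only increase it)
renice -n 19 -p 12345
```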
CPU Analysis Tools
| Linux| Description|
|---|---|
| uptime| load averages|
| vmstat| includes system-wide CPU averages|
| mpstat| per-CPU statistics|
| sar| historical statistics|
| ps | process status|
| top | monitor per-process/thread CPU usage|
| pidstat| per-process/thread CPU breakdowns|
| time| time a command, with CPU breakdowns|
| DTrace, perf| CPU profiling and tracing|
| perf| CPU performance counter analysis|
uptime
The load average indicates the demand for CPU resources and is calculated by summing the number of threads running (utilization) and the number that are queued waiting to run (saturation). A newer method for calculating load averages uses utilization plus the sum of thread scheduler latency, rather than sampling the queue length, which improves accuracy. For reference, the internals of these calculations on Solaris-based kernels are documented in [McDougall 06b]. To interpret the value, if the load average is higher than the CPU count, there are not enough CPUs to service the threads, and some are waiting. If the load average is lower than the CPU count, it (probably) means that there is headroom, and the threads could run on-CPU when they wanted. The three load average numbers are exponentially damped moving averages, which reflect load beyond the 1-, 5-, and 15-minute times (the times are actually constants used in the exponential moving sum [Myer 73]). Figure 6.14 shows the results of a simple experiment where a single CPU-bound thread was launched and the load averages plotted.
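As a sketch of the classic calculation (the constants and the exact set of counted threads vary by kernel; Linux, for example, also counts threads blocked in uninterruptible sleep), each average is an exponentially damped sum updated roughly every 5 seconds:
```
\mathrm{load}_m(t) = \mathrm{load}_m(t-1)\, e^{-5/(60m)} + n(t)\,\bigl(1 - e^{-5/(60m)}\bigr)
```
where m is 1, 5, or 15 (minutes) and n(t) is the sampled number of threads running or queued to run.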
vmstat
The virtual memory statistics command, vmstat(8), prints system-wide CPU
averages in the last few columns, and a count of runnable threads in the first column.
Here is example output from the Linux version:
```
vmstat 1
```
- r: run-queue length, the total number of runnable threads
- us: user-time
- sy: system-time (kernel)
- id: idle
- wa: wait I/O, which measures CPU idle time when threads are blocked on disk I/O
- st: stolen, which for virtualized environments shows CPU time spent servicing other tenants
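As a rough saturation check (a sketch assuming nproc and awk are available), the r column can be compared with the CPU count; values persistently higher than the CPU count mean threads are queueing:
```
# print intervals where the run queue exceeds the CPU count (the first two lines are headers)
vmstat 1 5 | awk -v ncpu="$(nproc)" 'NR > 2 && $1 > ncpu { print "saturated:", $0 }'
```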
mpstat
The multiprocessor statistics tool, mpstat, can report statistics per CPU. Here is some example output from the Linux version:
```
mpstat -P ALL 1
```
The -P ALL option was used to print the per-CPU report. By default, mpstat(1)
prints only the system-wide summary line (all). The columns are
- CPU: logical CPU ID, or all for summary
- %usr: user-time
- %nice: user-time for processes with a nice'd priority
- %sys: system-time (kernel)
- %iowait: I/O wait
- %irq: hardware interrupt CPU usage
- %soft: software interrupt CPU usage
- %steal: time spent servicing other tenants
- %guest: CPU time spent in guest virtual machines
- %idle: idle
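An important per-CPU pattern to look for is a single hot CPU, which can indicate a single-threaded application as the bottleneck; watching the per-CPU report over several intervals makes this visible (a usage sketch):
```
# five 1-second per-CPU reports; one CPU near 0% idle while the rest are idle
# suggests a single-threaded (unbalanced) workload
mpstat -P ALL 1 5
```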
pidstat
The pidstat(1) tool prints CPU usage by process or thread, including user- and system-time breakdowns. By default, a rolling output is printed of only active processes. For example:
```
pidstat 1
```
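Per-thread breakdowns can also be printed with the -t option (a usage sketch):
```
# per-thread CPU usage, refreshed every second
pidstat -t 1
```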
time
The time(1) command can be used to run programs and report CPU usage. It is provided either in the operating system under /usr/bin, or as a shell built-in.
```
/usr/bin/time -v cp fileA fileB
```
DTrace
DTrace can be used to profile CPU usage for both user- and kernel-level code, and to trace the execution of functions, CPU cross calls, interrupts, and the kernel scheduler. These abilities support workload characterization, profiling, drill-down analysis, and latency analysis.
### Kernel Profiling
This one-liner samples kernel stack traces at 997 Hz:
```
dtrace -n 'profile-997 /arg0/ { @[stack()] = count(); }'
```
### User Profiling
This one-liner samples user-level stack traces at 97 Hz for processes named "mysqld":
```
dtrace -n 'profile-97 /arg1 && execname == "mysqld"/ { @[ustack()] = count(); }'
```
### Function Tracing
This one-liner uses the fbt provider to time the kernel zio_checksum_generate() function (part of ZFS), printing the distribution of its on-CPU duration in nanoseconds:
```
dtrace -n 'fbt::zio_checksum_generate:entry { self->v = vtimestamp; } fbt::zio_checksum_generate:return /self->v/ { @["ns"] = quantize(vtimestamp - self->v); self->v = 0; }'
```
### CPU Cross Calls
CPU cross calls can be counted by the kernel stack trace that issued them, to identify their source:
```
dtrace -n 'sysinfo:::xcalls { @[stack()] = count(); }'
```
### Interrupts
Interrupt CPU usage can be summarized with intrstat(1M), a DTrace-based tool on Solaris-based systems:
```
intrstat 1
```
perf
Originally called Performance Counters for Linux (PCL), the perf(1) command
has evolved and become a collection of tools for profiling and tracing, now called
Linux Performance Events (LPE). Each tool is selected as a subcommand. For
example, perf stat executes the stat command, which provides CPC-based statistics.
These commands are listed in the USAGE message, and a selection is
reproduced here in Table 6.8 (from version 3.2.6-3).
|Command |Description|
|----|-----|
|annotate| Read perf.data (created by perf record) and display annotated code.|
|diff| Read two perf.data files and display the differential profile.|
|evlist| List the event names in a perf.data file.|
|inject |Filter to augment the events stream with additional information.|
|kmem |Tool to trace/measure kernel memory (slab) properties.|
|kvm |Tool to trace/measure kvm guest OS.|
|list| List all symbolic event types.|
|lock| Analyze lock events.|
|probe| Define new dynamic tracepoints.|
|record| Run a command and record its profile into perf.data.|
|report| Read perf.data (created by perf record) and display the profile.|
|sched| Tool to trace/measure scheduler properties (latencies).|
|script| Read perf.data (created by perf record) and display trace output.|
|stat |Run a command and gather performance counter statistics.|
|timechart| Tool to visualize total system behavior during a workload.|
|top| System profiling tool.|
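As a quick illustration of the stat subcommand (the gzip command and file name are arbitrary placeholders), running a command and printing counter statistics when it completes:
```
# run gzip on a file and summarize performance counters on exit
perf stat gzip file1
```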
System Profiling
perf(1) can be used to profile CPU call paths, summarizing where CPU time is
spent in both kernel- and user-space. This is performed by the record command,
which captures samples at regular intervals to a perf.data file. A report command
is then used to view the file.
```
perf record -a -g -F 997 sleep 10
```
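The resulting perf.data file can then be summarized as text:
```
# read perf.data from the current directory and print the call-path profile
perf report --stdio
```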
Process Profiling
Apart from profiling across all CPUs, perf(1) can profile an individual process (perf record -g followed by the command to run). The sched subcommand also records scheduler events, which can be used to study scheduler latency. For example, recording them for 10 seconds:
```
perf sched record sleep 10
```
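The scheduler recording can then be summarized per task (a usage sketch; the output format varies by perf version):
```
# per-task scheduling latency summary from the perf sched recording
perf sched latency
```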
Context switches can also be traced system-wide as a software event, with stack traces recorded to show why threads left the CPU:
```
perf record -f -g -a -e context-switches sleep 10
```
cpustat
On Solaris-based systems, cpustat(1M) performs system-wide analysis of CPU performance counters (CPC). For example, instrumenting cycle and instruction counts once per second, from which CPI can be calculated:
```
cpustat -tc PAPI_tot_cyc,PAPI_tot_ins,sys 1
```
Other Linux CPU performance tools include:
- oprofile: the original CPU profiling tool by John Levon.
- htop: includes ASCII bar charts for CPU usage and has a more powerful interactive interface than the original top(1).
- atop: includes many more system-wide statistics and uses process accounting to catch the presence of short-lived processes.
- /proc/cpuinfo: can be read to see processor details, including clock speed and feature flags (see the example after this list).
- getdelays.c: an example of delay accounting observability, which includes CPU scheduler latency per process. It was demonstrated in Chapter 4, Observability Tools.
- valgrind: a memory debugging and profiling toolkit [6]. It contains callgrind, a tool to trace function calls and gather a call graph, which can be visualized using kcachegrind; and cachegrind for analysis of hardware cache usage by a given program.
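For example (a sketch using standard shell tools), counting logical CPUs and listing processor models from /proc/cpuinfo:
```
# number of logical CPUs
grep -c '^processor' /proc/cpuinfo
# processor model names and how many CPUs report each
grep 'model name' /proc/cpuinfo | sort | uniq -c
```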
# Tuning
### Scheduling Priority and Class
The nice(1) command can be used to adjust process priority. Positive nice values decrease priority, and negative nice values increase priority, which only the superuser can set. The range is from -20 to +19. For example:
```
nice -n 19 command
```
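On Linux, the scheduling class can also be changed using chrt(1), for example to run under a real-time class; a sketch (the command and PID are placeholders, and real-time classes normally require root):
```
# run a command under the SCHED_FIFO real-time class at priority 10
chrt -f 10 command
# change the class and priority of an already running process
chrt -f -p 10 12345
```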
### Exclusive CPU Sets
Linux provides cpusets, which allow CPUs to be grouped and processes assigned to them. This can improve performance similarly to process binding, but performance can be improved further by making the cpuset exclusive—preventing other processes from using it. The trade-off is a reduction in available CPU for the rest of the system. The following commented example creates an exclusive set:
```
# mkdir /dev/cpuset
# mount -t cpuset cpuset /dev/cpuset
# cd /dev/cpuset
# mkdir prodset # create a cpuset called "prodset"
# cd prodset
# echo 7-10 > cpus # assign CPUs 7-10
# echo 1 > cpu_exclusive # make prodset exclusive
# echo 1159 > tasks # assign PID 1159 to prodset
```
### Resource Controls
For Linux, there are container groups (cgroups), which can also control resource usage by processes or groups of processes. CPU usage can be controlled using shares, and the CFS scheduler allows fixed limits to be imposed (CPU bandwidth), in terms of allocating microseconds of CPU cycles per interval. CPU bandwidth is relatively new, added in 2012 (3.2).
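As a sketch of both mechanisms, assuming the cgroup v1 cpu controller is mounted at /sys/fs/cgroup/cpu (paths differ by distribution; the PID follows the cpuset example above):
```
# mkdir /sys/fs/cgroup/cpu/prodgroup                            # create a cgroup called "prodgroup"
# echo 512 > /sys/fs/cgroup/cpu/prodgroup/cpu.shares            # relative CPU share (default 1024)
# echo 100000 > /sys/fs/cgroup/cpu/prodgroup/cpu.cfs_period_us  # bandwidth period: 100 ms
# echo 50000 > /sys/fs/cgroup/cpu/prodgroup/cpu.cfs_quota_us    # quota: 50 ms per period (0.5 CPU)
# echo 1159 > /sys/fs/cgroup/cpu/prodgroup/tasks                # assign PID 1159 to prodgroup
```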