1M HTTP rps on one CPU core. DPDK instead of nginx + the Linux kernel TCP/IP stack

I want to talk about DPDK, a framework for working with the network while bypassing the kernel. That is, you can read and write the NIC's queues directly from userland, without a single system call. This saves a lot of overhead on copying and the like. As an example, I will write an application that serves a test page over HTTP and compare its speed with nginx.

DPDK can be downloaded here. Don't take the stable release: on EC2 it did not work for me, 18.05 was needed, and with that version everything came to life. Before you start, you need to reserve hugepages in the system for the framework to work properly. In principle, the test application can be run with an option to work without hugepages, but I always enabled them. (Don't forget to run update-grub after grub-mkconfig.) Once the hugepages are set up, go straight to ./usertools/dpdk-setup.py — that script will build and configure everything else. On Google you can find instructions that suggest building and configuring things manually, bypassing dpdk-setup.py — don't do that. Or rather, do it if you like, but without dpdk-setup.py nothing worked for me. Briefly, the sequence of actions in dpdk-setup.py is:

  • Build the x86_64 Linux target
  • Load the igb_uio kernel module
  • Map hugepages to /mnt/huge
  • Bind the required NIC to the uio driver (don't forget to bring it down first with ifconfig ethX down)

After that you can build the examples by running make in their directory. The only thing needed is the RTE_SDK environment variable pointing to the directory containing DPDK.

The full code of the example is here. It consists of the initialization, an implementation of a primitive version of TCP/IP, and a primitive HTTP parser. Let's start with the initialization.

int main(int argc, char** argv)
{
    int ret;
    // initialize the DPDK environment abstraction layer
    ret = rte_eal_init(argc, argv);
    if (ret < 0) { rte_panic("Cannot init EAL\n"); }
    // memory pools for packets and for per-connection tcp state
    g_packet_mbuf_pool = rte_pktmbuf_pool_create("mbuf_pool", 131071, 32, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (g_packet_mbuf_pool == NULL) { rte_exit(EXIT_FAILURE, "Cannot init mbuf pool\n"); }
    g_tcp_state_pool = rte_mempool_create("tcp_state_pool", 65535, sizeof(struct tcp_state), 0, 0, NULL, NULL, NULL, NULL, rte_socket_id(), 0);
    if (g_tcp_state_pool == NULL) { rte_exit(EXIT_FAILURE, "Cannot init tcp_state pool\n"); }
    // hash table with established connections
    struct rte_hash_parameters hash_params = {
        .entries = 64536,
        .key_len = sizeof(struct tcp_key),
        .socket_id = rte_socket_id(),
        .hash_func_init_val = 0,
        .name = "tcp clients table"
    };
    g_clients = rte_hash_create(&hash_params);
    if (g_clients == NULL) { rte_exit(EXIT_FAILURE, "No hash table created\n"); }
    // expect exactly one ethernet port in the system
    uint8_t nb_ports = rte_eth_dev_count();
    if (nb_ports == 0) { rte_exit(EXIT_FAILURE, "No Ethernet ports - bye\n"); }
    if (nb_ports > 1) { rte_exit(EXIT_FAILURE, "Not implemented. Too much ports\n"); }
    // port configuration
    struct rte_eth_conf port_conf = {
        .rxmode = {
            .split_hdr_size = 0,
            .header_split = 0,      /**< Header Split disabled */
            .hw_ip_checksum = 0,    /**< IP checksum offload disabled */
            .hw_vlan_filter = 0,    /**< VLAN filtering disabled */
            .jumbo_frame = 0,       /**< Jumbo Frame Support disabled */
            .hw_strip_crc = 0,      /**< CRC stripped by hardware */
        },
        .txmode = {
            .mq_mode = ETH_MQ_TX_NONE,
        },
    };
    // let the NIC compute IP and TCP checksums of outgoing packets
    port_conf.txmode.offloads |= DEV_TX_OFFLOAD_IPV4_CKSUM;
    port_conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_CKSUM;
    ret = rte_eth_dev_configure(0, RX_QUEUE_COUNT, TX_QUEUE_COUNT, &port_conf);
    if (ret < 0) { rte_exit(EXIT_FAILURE, "Cannot configure device: err=%d\n", ret); }
    // set up RX queues
    for (uint16_t j=0; j<RX_QUEUE_COUNT; ++j) {
        ret = rte_eth_rx_queue_setup(0, j, 1024, rte_eth_dev_socket_id(0), NULL, g_packet_mbuf_pool);
        if (ret < 0) { rte_exit(EXIT_FAILURE, "rte_eth_rx_queue_setup:err=%d\n", ret); }
    }
    // set up TX queues
    struct rte_eth_txconf txconf = {
        .offloads = port_conf.txmode.offloads,
    };
    for (uint16_t j=0; j<TX_QUEUE_COUNT; ++j) {
        ret = rte_eth_tx_queue_setup(0, j, 1024, rte_eth_dev_socket_id(0), &txconf);
        if (ret < 0) { rte_exit(EXIT_FAILURE, "rte_eth_tx_queue_setup:err=%d\n", ret); }
    }
    // start the NIC
    ret = rte_eth_dev_start(0);
    if (ret < 0) { rte_exit(EXIT_FAILURE, "rte_eth_dev_start:err=%d\n", ret); }
    // main processing loop
    lcore_hello(NULL);
    return 0;
}

From the moment we bound the selected network interface to the DPDK driver via dpdk-setup.py, that interface is no longer visible to the kernel. After that, the NIC writes every packet arriving on this interface, via DMA, into the queues we handed to it.

Here is the packet processing loop.

struct rte_mbuf* packets[MAX_PACKETS];
uint16_t rx_current_queue = 0;
while (1) {
    // read a burst of packets from the next RX queue
    unsigned packet_count = rte_eth_rx_burst(0, (++rx_current_queue) % RX_QUEUE_COUNT, packets, MAX_PACKETS);
    for (unsigned j=0; j<packet_count; ++j) {
        struct rte_mbuf* m = packets[j];
        // pointer to the ethernet header at the start of the frame
        struct ether_hdr* eth_header = rte_pktmbuf_mtod(m, struct ether_hdr*);
        // IPv4 packet?
        if (RTE_ETH_IS_IPV4_HDR(m->packet_type)) {
            do {
                // length and header sanity checks
                if (rte_pktmbuf_data_len(m) < sizeof(struct ether_hdr) + sizeof(struct ipv4_hdr) + sizeof(struct tcp_hdr)) { TRACE; break; }
                struct ipv4_hdr* ip_header = (struct ipv4_hdr*)((char*)eth_header + sizeof(struct ether_hdr));
                if ((ip_header->next_proto_id != 0x6) || (ip_header->version_ihl != 0x45)) { TRACE; break; }
                if (ip_header->dst_addr != MY_IP_ADDRESS) { TRACE; break; }
                if (rte_pktmbuf_data_len(m) < htons(ip_header->total_length) + sizeof(struct ether_hdr)) { TRACE; break; }
                if (htons(ip_header->total_length) < sizeof(struct ipv4_hdr) + sizeof(struct tcp_hdr)) { TRACE; break; }
                struct tcp_hdr* tcp_header = (struct tcp_hdr*)((char*)ip_header + sizeof(struct ipv4_hdr));
                size_t tcp_header_size = (tcp_header->data_off >> 4) * 4;
                if (rte_pktmbuf_data_len(m) < sizeof(struct ether_hdr) + sizeof(struct ipv4_hdr) + tcp_header_size) { TRACE; break; }
                if (tcp_header->dst_port != 0x5000) { TRACE; break; }
                size_t data_size = htons(ip_header->total_length) - sizeof(struct ipv4_hdr) - tcp_header_size;
                void* data = (char*)tcp_header + tcp_header_size;
                // key for the connection hash table
                struct tcp_key key = {
                    .ip = ip_header->src_addr,
                    .port = tcp_header->src_port
                };
                // hand the segment over to the tcp state machine
                process_tcp(m, tcp_header, &key, data, data_size);
            } while(0);
        } else if (eth_header->ether_type == 0x0608) { // ARP
            // the kernel no longer answers ARP requests for us, so we do it ourselves
            do {
                if (rte_pktmbuf_data_len(m) < sizeof(struct arp) + sizeof(struct ether_hdr)) { TRACE; break; }
                struct arp* arp_packet = (struct arp*)((char*)eth_header + sizeof(struct ether_hdr));
                if (arp_packet->opcode != 0x100) { TRACE; break; }
                if (arp_packet->dst_pr_add != MY_IP_ADDRESS) { TRACE; break; }
                send_arp_response(arp_packet);
            } while(0);
        } else {
            TRACE;
        }
        rte_pktmbuf_free(m);
    }
}

The rte_eth_rx_burst function reads packets from a queue: if there is something in the queue, it reads the packets and puts them into the array; if the queue is empty, it returns 0, in which case you simply call it again right away. Yes, when there is no traffic this approach burns CPU time for nothing, but since we have already chosen the DPDK route, that is the deal. *Importantly, the function is not thread-safe: you cannot read from the same queue from different threads.* After a packet has been processed, rte_pktmbuf_free must be called on it. To send a packet, use rte_eth_tx_burst, which places an rte_mbuf obtained from rte_pktmbuf_alloc into the NIC's TX queue.
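The send side is hidden inside the article's build_packet/send_packet helpers, which live in the full source. As a rough illustration of what transmitting a prepared frame boils down to, here is a minimal sketch; the helper name send_raw_frame and the use of TX queue 0 are assumptions for illustration, not the author's code:

// Sketch only: allocate an mbuf from the same pool, copy a ready Ethernet frame
// into it and hand it to TX queue 0 of port 0.
static void send_raw_frame(const void* frame, uint16_t frame_len)
{
    struct rte_mbuf* m = rte_pktmbuf_alloc(g_packet_mbuf_pool);
    if (m == NULL) { ERROR("rte_pktmbuf_alloc failed"); return; }
    char* dst = rte_pktmbuf_append(m, frame_len); // reserve frame_len bytes of data room
    if (dst == NULL) { rte_pktmbuf_free(m); return; }
    memcpy(dst, frame, frame_len);
    // tx_burst may accept fewer packets than requested; here the frame is simply dropped
    if (rte_eth_tx_burst(0 /*port*/, 0 /*tx queue*/, &m, 1) != 1) { rte_pktmbuf_free(m); }
}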

After the headers have been parsed, a TCP session has to be established. The TCP protocol is full of corner cases, special situations, and denial-of-service pitfalls; implementing a more or less complete TCP is an excellent exercise for an experienced developer, but that is not what is done here. The example implements only as much TCP as the test needs: a connection table built on the hash table that DPDK provides, connection setup and teardown, sending and receiving data, with no handling of loss or packet reordering. DPDK's hash table has an important limitation: it can be read from multiple threads, but not written. The example is single-threaded, so this is not an issue here; if you want to process traffic on several cores, you can use RSS, shard the hash table per core, and get by without locks (see the sketch below).
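For reference, spreading flows across several RX queues with RSS is configured in port_conf; a sketch against the DPDK 18.05 API could look like this (the hash fields chosen here are an assumption, not taken from the example):

// Sketch: let the NIC hash each IP/TCP flow onto one of RX_QUEUE_COUNT queues,
// so that all segments of one connection always arrive on the same queue/core.
struct rte_eth_conf port_conf = {
    .rxmode = { .mq_mode = ETH_MQ_RX_RSS },
    .rx_adv_conf = {
        .rss_conf = {
            .rss_key = NULL,                     // default RSS hash key of the PMD
            .rss_hf  = ETH_RSS_IP | ETH_RSS_TCP, // hash on addresses and TCP ports
        },
    },
    .txmode = { .mq_mode = ETH_MQ_TX_NONE },
};
// Each core then polls its own queue and keeps its own connection table, no locks needed.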

TCP implementation
static void process_tcp(struct rte_mbuf* m, struct tcp_hdr* tcp_header, struct tcp_key* key, void* data, size_t data_size)
{
    TRACE;
    struct tcp_state* state;
    if (rte_hash_lookup_data(g_clients, key, (void**)&state) < 0) //Documentaion lies!!!
    {
        TRACE;
        if ((tcp_header->tcp_flags & 0x2) != 0) // SYN
        {
            TRACE;
            struct ether_hdr* eth_header = rte_pktmbuf_mtod(m, struct ether_hdr*);
            if (rte_mempool_get(g_tcp_state_pool, (void**)&state) < 0) { ERROR("tcp state alloc fail"); return; }
            memcpy(&state->tcp_template, &g_tcp_packet_template, sizeof(g_tcp_packet_template));
            memcpy(&state->tcp_template.eth.d_addr, &eth_header->s_addr, 6);
            state->tcp_template.ip.dst_addr = key->ip;
            state->tcp_template.tcp.dst_port = key->port;
            state->remote_seq = htonl(tcp_header->sent_seq);
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wpointer-to-int-cast"
            state->my_seq_start = (uint32_t)state; // not very secure.
#pragma GCC diagnostic pop
            state->fin_sent = 0;
            state->http.state = HTTP_START;
            state->http.request_url_size = 0;
            //not thread safe! only one core used
            if (rte_hash_add_key_data(g_clients, key, state) == 0) {
                struct tcp_hdr* new_tcp_header;
                struct rte_mbuf* packet = build_packet(state, 12, &new_tcp_header);
                if (packet != NULL) {
                    new_tcp_header->rx_win = TX_WINDOW_SIZE;
                    new_tcp_header->sent_seq = htonl(state->my_seq_start);
                    state->my_seq_sent = state->my_seq_start+1;
                    ++state->remote_seq;
                    new_tcp_header->recv_ack = htonl(state->remote_seq);
                    new_tcp_header->tcp_flags = 0x12;
                    // mss = 1380, no window scaling
                    uint8_t options[12] = {0x02, 0x04, 0x05, 0x64, 0x03, 0x03, 0x00, 0x01, 0x01, 0x01, 0x01, 0x01};
                    memcpy((uint8_t*)new_tcp_header + sizeof(struct tcp_hdr), options, 12);
                    new_tcp_header->data_off = 0x80;
                    send_packet(new_tcp_header, packet);
                } else {
                    ERROR("rte_pktmbuf_alloc, tcp synack");
                }
            } else {
                ERROR("can't add connection to table");
                rte_mempool_put(g_tcp_state_pool, state);
            }
        } else {
            ERROR("lost connection");
        }
        return;
    }
    if ((tcp_header->tcp_flags & 0x2) != 0) // SYN retransmit
    {
        //not thread safe! only one core used
        if (rte_hash_del_key(g_clients, key) < 0) {
            ERROR("can't delete key");
        } else {
            rte_mempool_put(g_tcp_state_pool, state);
            return process_tcp(m, tcp_header, key, data, data_size);
        }
    }
    if ((tcp_header->tcp_flags & 0x10) != 0) // ACK
    {
        TRACE;
        uint32_t ack_delta = htonl(tcp_header->recv_ack) - state->my_seq_start;
        uint32_t my_max_ack_delta = state->my_seq_sent - state->my_seq_start;
        if (ack_delta == 0) {
            if ((data_size == 0) && (tcp_header->tcp_flags == 0x10)) { ERROR("need to retransmit. not supported"); }
        } else if (ack_delta <= my_max_ack_delta) {
            state->my_seq_start += ack_delta;
        } else {
            ERROR("ack on unsent seq");
        }
    }
    if (data_size > 0)
    {
        TRACE;
        uint32_t packet_seq = htonl(tcp_header->sent_seq);
        if (state->remote_seq == packet_seq) {
            feed_http(data, data_size, state);
            state->remote_seq += data_size;
        } else if (state->remote_seq-1 == packet_seq) { // keepalive
            struct tcp_hdr* new_tcp_header;
            struct rte_mbuf* packet = build_packet(state, 0, &new_tcp_header);
            if (packet != NULL) {
                new_tcp_header->rx_win = TX_WINDOW_SIZE;
                new_tcp_header->sent_seq = htonl(state->my_seq_sent);
                new_tcp_header->recv_ack = htonl(state->remote_seq);
                send_packet(new_tcp_header, packet);
            } else {
                ERROR("rte_pktmbuf_alloc, tcp ack keepalive");
            }
        } else {
            // unexpected sequence number: resend the last response
            struct tcp_hdr* new_tcp_header;
            struct rte_mbuf* packet = build_packet(state, state->http.last_message_size, &new_tcp_header);
            TRACE;
            if (packet != NULL) {
                new_tcp_header->rx_win = TX_WINDOW_SIZE;
                new_tcp_header->sent_seq = htonl(state->my_seq_sent - state->http.last_message_size);
                new_tcp_header->recv_ack = htonl(state->remote_seq);
                memcpy((char*)new_tcp_header+sizeof(struct tcp_hdr), &state->http.last_message, state->http.last_message_size);
                send_packet(new_tcp_header, packet);
            } else {
                ERROR("rte_pktmbuf_alloc, tcp fin ack");
            }
            //ERROR("my bad tcp stack implementation(((");
        }
    }
    if ((tcp_header->tcp_flags & 0x04) != 0) // RST
    {
        TRACE;
        //not thread safe! only one core used
        if (rte_hash_del_key(g_clients, key) < 0) {
            ERROR("can't delete key");
        } else {
            rte_mempool_put(g_tcp_state_pool, state);
        }
    }
    else if ((tcp_header->tcp_flags & 0x01) != 0) // FIN
    {
        struct tcp_hdr* new_tcp_header;
        struct rte_mbuf* packet = build_packet(state, 0, &new_tcp_header);
        TRACE;
        if (packet != NULL) {
            new_tcp_header->rx_win = TX_WINDOW_SIZE;
            new_tcp_header->sent_seq = htonl(state->my_seq_sent);
            new_tcp_header->recv_ack = htonl(state->remote_seq + 1);
            if (!state->fin_sent) {
                TRACE;
                new_tcp_header->tcp_flags = 0x11; // !@#$ the last ack
            }
            send_packet(new_tcp_header, packet);
        } else {
            ERROR("rte_pktmbuf_alloc, tcp fin ack");
        }
        //not thread safe! only one core used
        if (rte_hash_del_key(g_clients, key) < 0) {
            ERROR("can't delete key");
        } else {
            rte_mempool_put(g_tcp_state_pool, state);
        }
    }
}


The HTTP parser supports only GET: it reads the URL from the request and returns an HTML page containing the requested URL.
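The response itself is glued together from three static template parts (g_http_part1, g_http_part2, g_http_part3) that are defined in the full source but not shown here. Judging by the Content-Length arithmetic in the parser below, they could look roughly like this (a guess for illustration, not the actual strings):

// Hypothetical response templates; part2 starts with the header terminator "\r\n\r\n",
// which is why the code subtracts 4 from g_http_part2_size when computing Content-Length.
static const char g_http_part1[] = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nContent-Length: ";
static const char g_http_part2[] = "\r\n\r\n<html><body>";   // the requested URL is inserted after this
static const char g_http_part3[] = "</body></html>";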

HTTP parser
static void feed_http(void* data, size_t data_size, struct tcp_state* state)
{
    TRACE;
    size_t remaining_data = data_size;
    char* current = (char*)data;
    struct http_state* http = &state->http;
    if (http->state == HTTP_BAD_STATE) { TRACE; return; }
    while (remaining_data > 0)
    {
        switch(http->state)
        {
            case HTTP_START: {
                if (*current == 'G') { http->state = HTTP_READ_G; } else { http->state = HTTP_BAD_STATE; }
                break;
            }
            case HTTP_READ_G: {
                if (*current == 'E') { http->state = HTTP_READ_E; } else { http->state = HTTP_BAD_STATE; }
                break;
            }
            case HTTP_READ_E: {
                if (*current == 'T') { http->state = HTTP_READ_T; } else { http->state = HTTP_BAD_STATE; }
                break;
            }
            case HTTP_READ_T: {
                if (*current == ' ') { http->state = HTTP_READ_SPACE; } else { http->state = HTTP_BAD_STATE; }
                break;
            }
            case HTTP_READ_SPACE: {
                if (*current != ' ') {
                    http->request_url[http->request_url_size] = *current;
                    ++http->request_url_size;
                    if (http->request_url_size > MAX_URL_SIZE) { http->state = HTTP_BAD_STATE; }
                } else {
                    http->state = HTTP_READ_URL;
                    http->request_url[http->request_url_size] = '\0';
                }
                break;
            }
            case HTTP_READ_URL: {
                if (*current == '\r') { http->state = HTTP_READ_R1; }
                break;
            }
            case HTTP_READ_R1: {
                if (*current == '\n') { http->state = HTTP_READ_N1; }
                else if (*current == '\r') { http->state = HTTP_READ_R1; }
                else { http->state = HTTP_READ_URL; }
                break;
            }
            case HTTP_READ_N1: {
                if (*current == '\r') { http->state = HTTP_READ_R2; }
                else { http->state = HTTP_READ_URL; }
                break;
            }
            case HTTP_READ_R2: {
                if (*current == '\n') {
                    TRACE;
                    char content_length[32];
                    sprintf(content_length, "%lu", g_http_part2_size - 4 + http->request_url_size + g_http_part3_size);
                    size_t content_length_size = strlen(content_length);
                    size_t total_data_size = g_http_part1_size + g_http_part2_size + g_http_part3_size + http->request_url_size + content_length_size;
                    struct tcp_hdr* tcp_header;
                    struct rte_mbuf* packet = build_packet(state, total_data_size, &tcp_header);
                    if (packet != NULL) {
                        tcp_header->rx_win = TX_WINDOW_SIZE;
                        tcp_header->sent_seq = htonl(state->my_seq_sent);
                        tcp_header->recv_ack = htonl(state->remote_seq + data_size);
#ifdef KEEPALIVE
                        state->my_seq_sent += total_data_size;
#else
                        state->my_seq_sent += total_data_size + 1; //+1 for FIN
                        tcp_header->tcp_flags = 0x11;
                        state->fin_sent = 1;
#endif
                        char* new_data = (char*)tcp_header + sizeof(struct tcp_hdr);
                        memcpy(new_data, g_http_part1, g_http_part1_size);
                        new_data += g_http_part1_size;
                        memcpy(new_data, content_length, content_length_size);
                        new_data += content_length_size;
                        memcpy(new_data, g_http_part2, g_http_part2_size);
                        new_data += g_http_part2_size;
                        memcpy(new_data, http->request_url, http->request_url_size);
                        new_data += http->request_url_size;
                        memcpy(new_data, g_http_part3, g_http_part3_size);
                        memcpy(&http->last_message, (char*)tcp_header+sizeof(struct tcp_hdr), total_data_size);
                        http->last_message_size = total_data_size;
                        send_packet(tcp_header, packet);
                    } else {
                        ERROR("rte_pktmbuf_alloc, tcp data");
                    }
                    http->state = HTTP_START;
                    http->request_url_size = 0;
                } else if (*current == '\r') {
                    http->state = HTTP_READ_R1;
                } else {
                    http->state = HTTP_READ_URL;
                }
                break;
            }
            default: {
                ERROR("bad http state");
                return;
            }
        }
        if (http->state == HTTP_BAD_STATE) { return; }
        --remaining_data;
        ++current;
    }
}


With the example ready, its performance can be compared with nginx. Since I couldn't assemble a real test rig at home, I used Amazon EC2. EC2 forced one adjustment to the tests: I had to give up "Connection: close" requests, because a few seconds into the test SYN packets started getting dropped at around 300k rps. Apparently there is some kind of SYN-flood protection, so the requests were made keep-alive instead. On EC2 DPDK does not work on every instance type; for example, on m1.medium it does not work. The rig used one r4.8xlarge instance for the application and two r4.8xlarge instances to generate load; they communicated over separate network interfaces in a dedicated VPC subnet. I tried loading it with different tools: ab, wrk, h2load, siege. The most convenient was wrk, because ab is single-threaded and produces distorted statistics when there are errors on the network.

With a large volume of traffic on EC2, a certain number of packet drops can be observed. For an ordinary application this is invisible, but with ab every retransmit inflates the total time, so its averaged requests-per-second figures are useless. The cause of the drops is a separate mystery still to be solved; the problem shows up not only with DPDK but also with nginx, which suggests it is not a bug in the example.

I ran the tests in two stages: first with wrk on one instance, then on two at once. If the combined throughput of the two instances had come out noticeably higher than that of one, it would have meant the single-instance run was limited by wrk itself rather than by the server.

Test results for the DPDK example on r4.8xlarge
Running wrk from 1 instance

root@ip-172-30-0-127:~# wrk -t64 -d10s -c4000 --latency http://172.30.4.10/testu
Running 10s test @ http://172.30.4.10/testu
64 threads and 4000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 32.19ms 63.43ms 817.26ms 85.01%
Req/Sec 15.97k 4.04k 113.97k 93.47%
Latency Distribution
50% 2.58ms
75% 17.57ms
90% 134.94ms
99% 206.03ms
10278064 requests in 10.10s, 1.70GB read
Socket errors: connect 0, read 17, write 0, timeout 0
Requests/sec: 1017645.11
Transfer/sec: 172.75MB

Running wrk from 2 instances simultaneously
root@ip-172-30-0-127:~# wrk -t64 -d10s -c4000 --latency http://172.30.4.10/testu
Running 10s test @ http://172.30.4.10/testu
64 threads and 4000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 67.28ms 119.20ms 1.64s 88.90%
Req/Sec 7.99k 4.58k 132.62k 96.67%
Latency Distribution
50% 2.31ms
75% 103.45ms
90% 191.51ms
99% 563.56ms
5160076 requests in 10.10s, 0.86GB read
Socket errors: connect 0, read 2364, write 0, timeout 1
Requests/sec: 510894.92
Transfer/sec: 86.73MB

root@ip-172-30-0-225:~# wrk -t64 -d10s -c4000 --latency http://172.30.4.10/testu
Running 10s test @ http://172.30.4.10/testu
64 threads and 4000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 74.87ms 148.64ms 1.64s 93.45%
Req/Sec 8.22k 2.59k 42.51k 81.21%
Latency Distribution
50% 2.41ms
75% 110.42ms
90% 190.66ms
99% 739.67ms
5298083 requests in 10.10s, 0.88GB read
Socket errors: connect 0, read 0, write 0, timeout 148
Requests/sec: 524543.67
Transfer/sec: 89.04MB


Nginx gave the following results
Running wrk from 1 instance

root@ip-172-30-0-127:~# wrk -t64 -d10s -c4000 --latency http://172.30.4.10/testu
Running 10s test @ http://172.30.4.10/testu
64 threads and 4000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 14.36ms 56.41ms 1.92s 95.26%
Req/Sec 15.27k 3.30k 72.23k 83.53%
Latency Distribution
50% 3.38ms
75% 6.82ms
90% 10.95ms
99% 234.99ms
9813464 requests in 10.10s, 2.12GB read
Socket errors: connect 0, read 1, write 0, timeout 3
Requests/sec: 971665.79
Transfer/sec: 214.94MB


Running wrk from 2 instances simultaneously

root@ip-172-30-0-127:~# wrk -t64 -d10s -c4000 --latency http://172.30.4.10/testu
Running 10s test @ http://172.30.4.10/testu
64 threads and 4000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 52.91ms 82.19ms 1.04s 82.93%
Req/Sec 8.05k 3.09k 55.62k 89.11%
Latency Distribution
50% 3.66ms
75% 94.87ms
90% 171.83ms
99% 354.26ms
5179253 requests in 10.10s, 1.12GB read
Socket errors: connect 0, read 134, write 0, timeout 0
Requests/sec: 512799.10
Transfer/sec: 113.43MB

root@ip-172-30-0-225:~# wrk -t64 -d10s -c4000 --latency http://172.30.4.10/testu
Running 10s test @ http://172.30.4.10/testu
64 threads and 4000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 64.38ms 121.56ms 1.67s 90.32%
Req/Sec 7.30k 2.54k 34.94k 82.10%
Latency Distribution
50% 3.68ms
75% 103.32ms
90% 184.05ms
99% 561.31ms
4692290 requests in 10.10s, 1.01GB read
Socket errors: connect 0, read 2, write 0, timeout 21
Requests/sec: 464566.93
Transfer/sec: 102.77MB


Nginx configuration

user www-data;
worker_processes auto;
pid /run/nginx.pid;
worker_rlimit_nofile 50000;
events {
    worker_connections 10000;
}
http {
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 2048;

    include /etc/nginx/mime.types;
    default_type text/plain;

    error_log /var/log/nginx/error.log;
    access_log off;

    server {
        listen 80 default_server backlog=10000 reuseport;
        location / {
            return 200 "answer.padding _____________________________________________________________";
        }
    }
}

Overall, both setups deliver about one million requests per second, but nginx uses all 32 CPUs to do it while the DPDK example uses only one. Perhaps EC2 is playing a trick here again and 1M rps is a network limit, but even then the result is not distorted much: adding an artificial delay to the example, something like for (int x = 0; x < 100; ++x) http->request_url[0] = 'a' + (http->request_url[0] % 10); before sending the packet, already lowers the rps, which means almost all of the CPU load is already useful work.
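To be explicit about where that delay was injected: the busy loop goes right before the response is handed to send_packet inside feed_http, roughly like this (placement reconstructed from the description above, not copied from the measured code):

// Artificial delay: ~100 iterations of useless work per request, kept observable
// through request_url so the compiler cannot remove the loop.
for (int x = 0; x < 100; ++x) {
    http->request_url[0] = 'a' + (http->request_url[0] % 10);
}
send_packet(tcp_header, packet);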

In the course of the experiments one mystery turned up that I still have not solved. If checksum offload is enabled, i.e. the checksums of the IP and TCP headers are computed by the NIC itself, overall performance drops and latency grows.
Here is a run with offloading enabled

root@ip-172-30-0-127:~# wrk -t64 -d10s -c4000 --latency http://172.30.4.10/testu
Running 10s test @ http://172.30.4.10/testu
64 threads and 4000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 5.91ms 614.33us 28.35ms 96.17%
Req/Sec 10.48k 1.51k 69.89k 98.78%
Latency Distribution
50% 5.91ms
75% 6.01ms
90% 6.19ms
99% 6.99ms
6738296 requests in 10.10s, 1.12GB read
Requests/sec: 667140.71
Transfer/sec: 113.25MB


And here is a run with the checksums computed on the CPU

root@ip-172-30-0-127:~# wrk -t64 -d10s -c4000 --latency http://172.30.4.10/testu
Running 10s test @ http://172.30.4.10/testu
64 threads and 4000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 32.19ms 63.43ms 817.26ms 85.01%
Req/Sec 15.97k 4.04k 113.97k 93.47%
Latency Distribution
50% 2.58ms
75% 17.57ms
90% 134.94ms
99% 206.03ms
10278064 requests in 10.10s, 1.70GB read
Socket errors: connect 0, read 17, write 0, timeout 0
Requests/sec: 1017645.11
Transfer/sec: 172.75MB


OK, the performance drop I could explain by the NIC being slow at checksumming, although that would be strange: offloading is supposed to speed things up. But why, when the checksums are computed by the card, does the latency sit almost constant at about 6 ms, while with the CPU it floats from 2.5 ms up to 817 ms? A non-virtualized rig with a direct connection would make investigating this much simpler, but unfortunately I don't have one. DPDK itself does not work with every NIC, so before using it check the list of supported hardware.
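For context, the difference between the two runs comes down to which side fills in the checksum fields of the outgoing packet. A sketch of both variants against the DPDK 18.05 API (the surrounding ip_header/tcp_header pointers and the mbuf m being built are assumed, as in build_packet):

// Variant 1: hardware offload. Requires DEV_TX_OFFLOAD_IPV4_CKSUM / DEV_TX_OFFLOAD_TCP_CKSUM
// in port_conf.txmode.offloads (as set in main() above).
m->l2_len = sizeof(struct ether_hdr);
m->l3_len = sizeof(struct ipv4_hdr);
m->ol_flags |= PKT_TX_IPV4 | PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM;
ip_header->hdr_checksum = 0;                                      // filled in by the NIC
tcp_header->cksum = rte_ipv4_phdr_cksum(ip_header, m->ol_flags);  // pseudo-header checksum only

// Variant 2: software. Compute both checksums on the CPU before rte_eth_tx_burst.
ip_header->hdr_checksum = 0;
ip_header->hdr_checksum = rte_ipv4_cksum(ip_header);
tcp_header->cksum = 0;
tcp_header->cksum = rte_ipv4_udptcp_cksum(ip_header, tcp_header);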

And finally, a poll.

Source: https://habr.com/ru/post/416651/

