SSH keeps disconnecting (by quqi99)

张开发
2026/4/21 14:56:35, 15 min read


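Author: Zhang Hua (张华). Published: 2020-10-28.
Copyright notice: this article may be freely reposted, provided the repost indicates the original source, the author information, and this copyright notice with a hyperlink.

The company server was upgraded today, and afterwards I hit a problem: while logged in to the bastion VM on that server, my SSH session always dropped when a script named configure ran the following code:

    set_img_properties ()
    {
        img_name=$1
        img_version=$2
        img_file=$3
        ts=$(stat ~/images/$img_file | sed -rn 's/Modify:\s([[:digit:]-]+)\s.+/\1/p' | tr -d '-')
        declare -A props=( [architecture]=x86_64
                           [os_distro]=ubuntu
                           [os_version]=$img_version
                           [version_name]=$ts
                           [product_name]=com.ubuntu.cloud:server:${img_version}:amd64 )
        for p in ${!props[@]}; do
            openstack image set --property $p=${props[$p]} $img_name &
        done
        wait
    }
    set_img_properties bionic 18.04 bionic-server-cloudimg-amd64.img &

Once the SSH session dropped, only rebooting the bastion fixed it. Since this had always worked and only broke after today's server upgrade, I did not suspect anything else at first and just asked a colleague to run the same test; he did not hit the problem. Next I suspected an MTU issue, but changing the MTU from 8930 to 1450 did not help:

    ip netns exec qrouter-1c6f53cf-9b6d-4794-ae8e-61ad1b8e4042 ping -O -M do -s 8930 10.5.0.8
    ip netns exec qrouter-1c6f53cf-9b6d-4794-ae8e-61ad1b8e4042 nc -vz 10.5.0.8 22

So I assumed the underlying OpenStack was at fault. I have no admin rights on it, so I had to ask a colleague who does, and in the end we confirmed there was no problem on that side. That left the bastion VM itself as the suspect. It turned out that devstack had been installed on it at some point, and after removing devstack everything was back to normal. But why had there been no problem before? With devstack still installed, the problem also disappears if the call is changed to run in the foreground:

    #set_img_properties bionic 18.04 bionic-server-cloudimg-amd64.img &
    set_img_properties bionic 18.04 bionic-server-cloudimg-amd64.img

Do not install things like devstack on a machine carelessly. Good habits matter; lesson learned.

Also, if SSH connections are slow, try adding the following to /etc/ssh/sshd_config:

    UseDNS no
    GSSAPIAuthentication no

20210826 update

If SSH cannot connect, also check whether the host has been added to the proxy whitelist.

20210910 update

SSH could not connect. Other machines could connect; only my working machine could not log in. Rebooting the remote machine fixed it.

20210916 update - time-wait

If SSH or some other service suddenly becomes unreachable and only a reboot restores it, too many connections in TIME_WAIT may be the cause:

    $ netstat -ant | awk '/^tcp/ {++S[$NF]} END {for(a in S) print (a,S[a])}'
    LAST_ACK 1
    LISTEN 14
    SYN_RECV 3
    CLOSE_WAIT 2
    ESTABLISHED 194
    FIN_WAIT1 16
    FIN_WAIT2 2
    SYN_SENT 23
    TIME_WAIT 1016

Some network tuning parameters:

    # 1, Queue tuning
    # Raise the maximum number of queued packets so packets are not dropped on 10G networks
    sysctl net.core.netdev_max_backlog=4096
    # Maximum number of queued connection requests. The default is 128 on low-memory systems
    # and 1024 on systems with more than 128 MB of memory; 512 is used here.
    # Increase it if the server is overloaded
    sysctl net.ipv4.tcp_max_syn_backlog=512
    # Maximum number of pending connections. Web servers and similar services that handle
    # many connections need a higher value for those connections to work properly
    sysctl net.core.somaxconn=4096

    # 2, TCP FIN timeout tuning
    # The side that actively closes a connection keeps the socket until the peer finishes the close.
    # tcp_fin_timeout bounds how long an orphaned socket stays in FIN_WAIT_2 (the TIME_WAIT
    # length itself is fixed at 60 s in the kernel). The default of 60 is quite high and can be
    # lowered so that connections are closed and their resources freed for new connections
    sysctl net.ipv4.tcp_fin_timeout=30

    # 3, Reuse sockets in TIME_WAIT state for new connections
    # https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=79e9fed460385a3d8ba0b5782e9e74405cb199b1
    # Allows sockets in TIME_WAIT state to be reused for new connections. A good choice for
    # services that handle many short TCP connections which end up in TIME_WAIT
    net.ipv4.tcp_tw_reuse = 1

    # 4, tcp_keepalive_time tuning
    # When one side wants to terminate a connection it sends an RST packet to the other side and
    # closes the socket. Until then, both sides keep their sockets open indefinitely.
    # This means one side may close its socket, intentionally or because of some error, without
    # notifying the other end with an RST. TCP keepalive is used to detect this and close stale connections.
    # In this example a dead TCP connection is removed once the other end has been unresponsive
    # for 1200 + 75*9 seconds
    net.ipv4.tcp_keepalive_time = 1200
    # Once the number of TIME_WAIT sockets exceeds this value the system starts destroying them.
    # Used to guard against DoS attacks; can be raised if memory is plentiful
    net.ipv4.tcp_max_tw_buckets = 32768

    # 5, Enable MTU black hole detection
    # https://blog.cloudflare.com/path-mtu-discovery-in-practice/
    # Path MTU Discovery uses ICMP to find the correct MTU between client and server, but if a
    # firewall drops the ICMP packets the result is called an ICMP black hole.
    # The whole connection then hangs: the sender keeps retransmitting the lost packets while
    # the receiver only acknowledges the small packets that got through.
    # tcp_mtu_probing=1 means disabled by default, but enabled when an ICMP black hole is detected
    sysctl net.ipv4.tcp_mtu_probing=1

    # 6, Kernel buffer tuning
    # Maximum write and read buffer sizes. The defaults are conservative; raising them helps
    # services such as NFS. Increasing them to 256 KB / 4 MB is usually most effective
    # 256 KB / 4 MB
    net.core.rmem_default = 262144
    net.core.wmem_default = 262144
    net.core.rmem_max = 4194304
    net.core.wmem_max = 4194304
    # Or 256 KB / 64 MB
    net.core.rmem_default = 262144
    net.core.wmem_default = 262144
    net.core.rmem_max = 67108864
    net.core.wmem_max = 67108864

    # 7, TCP buffer tuning
    # Auto-tune net.ipv4.tcp_rmem and net.ipv4.tcp_wmem
    sysctl -w net.ipv4.tcp_moderate_rcvbuf=1

The final settings are:

    cat << EOF | sudo tee /etc/sysctl.d/98-network-custom.conf
    net.core.netdev_max_backlog = 4096
    net.ipv4.tcp_max_syn_backlog = 4096
    net.core.somaxconn = 4096
    net.ipv4.tcp_fin_timeout = 30
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.tcp_keepalive_time = 1200
    net.ipv4.tcp_mtu_probing = 1
    EOF
    sudo sysctl --system

    sudo modprobe tcp_bbr
    echo "tcp_bbr" | sudo tee -a /etc/modules-load.d/modules.conf
    echo "net.core.default_qdisc=fq" | sudo tee -a /etc/sysctl.conf
    echo "net.ipv4.tcp_congestion_control=bbr" | sudo tee -a /etc/sysctl.conf
    sudo sysctl -p

Even with the tuning above applied, the problem remained. It seems to be caused by the company office software that also runs on the server: it redirects some routes to us.archive.ubuntu.com through tun0. The system on the router may also play a part.

20220614 - debug1: SSH2_MSG_KEXINIT sent

SSH intermittently fails to connect, and the log shows: debug1: SSH2_MSG_KEXINIT sent. Only 1462 works:

    ping -c 2 -s 1464 -M do justhost

Why is the MTU 1492 (1464+28)?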
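The relation between the `ping -s` payload size and the MTU in the test above can be written out explicitly. This is a minimal sketch, assuming IPv4 without IP options (20-byte header), an 8-byte ICMP echo header, and a 20-byte TCP header without options. The function names `icmp_payload_for_mtu` and `tcp_mss_for_mtu` are made up for illustration and do not belong to any tool.

```shell
#!/bin/bash
# Assumed header sizes: IPv4 without options, ICMP echo, TCP without options.
IP_HDR=20
ICMP_HDR=8
TCP_HDR=20

# Largest `ping -s` payload that fits into one unfragmented packet for a given MTU.
icmp_payload_for_mtu() {
    echo $(( $1 - IP_HDR - ICMP_HDR ))
}

# TCP MSS that corresponds to a given MTU.
tcp_mss_for_mtu() {
    echo $(( $1 - IP_HDR - TCP_HDR ))
}

# Print the values for a few common MTUs (Ethernet, PPPoE, and the 1450 tried earlier).
for mtu in 1500 1492 1450; do
    echo "MTU $mtu: ping -s $(icmp_payload_for_mtu "$mtu"), TCP MSS $(tcp_mss_for_mtu "$mtu")"
done
```

With these assumptions, an MTU of 1492 (typical for PPPoE) gives a payload of 1464, so `ping -c 2 -M do -s "$(icmp_payload_for_mtu 1492)" justhost` is the largest probe expected to pass on such a path.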
The router clearly has the following TCPMSS setting:

    # iptables-save | grep mss
    ...
    -A FORWARD -o pppoe-wan -p tcp -m tcp --tcp-flags SYN,RST SYN -m comment --comment "!fw3: Zone wan MTU fixing" -j TCPMSS --clamp-mss-to-pmtu

Perhaps I have run into this problem again: https://blog.csdn.net/quqi99/article/details/82346840

Real life is full of paradoxes. PMTUD was designed to find a workable packet size when MTUs along the path differ, yet many servers and intermediate routers wrongly block ICMP type 3 (including its code 4, "fragmentation needed"). PMTUD then stops working, and so does the clamp-mss-to-pmtu setting found on many routers (iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu). As a result, TCP access to certain sites with a mismatched MTU fails in all sorts of baffling ways.

After setting the NIC MTU to 1492 the problem seems to have disappeared; I will keep observing for a while.

20220718 update

It has little to do with the MTU. I am currently testing reneg-sec 0.

20221119 - Hardening ssh

hosts.allow only applies to services that implement TCP Wrappers, such as the sshd daemon.

    cat << EOF | sudo tee -a /etc/hosts.deny
    sshd:ALL
    EOF
    cat << EOF | sudo tee -a /etc/hosts.allow
    sshd: /etc/ssh_dyn_allow/hosts_sshd.allow
    EOF
    sudo mkdir -p /etc/ssh_dyn_allow
    cat << EOF | sudo tee /etc/ssh_dyn_allow/allowed_domain.list
    xxx.oicp.net
    EOF

    $ cat /etc/ssh_dyn_allow/renew_allowed_ip_list.sh
    #!/bin/bash
    SCRIPT=$(readlink -f $0)
    WORKDIR=$(dirname $SCRIPT)
    ALLOWFILE="$WORKDIR/hosts_sshd.allow"
    DOMAINFILE="$WORKDIR/allowed_domain.list"
    DOMAINS=$(cat $DOMAINFILE | grep -v "^#")
    echo "# automatic generated : $(date)" > $ALLOWFILE
    for DOMAIN in $DOMAINS
    do
        echo
        # Resolve the A record and take the address column of the answer line
        IP=$(dig $DOMAIN A | grep "^$DOMAIN" | sed -e 's/\s\s*/ /g' | cut -d ' ' -f5)
        echo "# $DOMAIN"
        echo $IP
        echo
    done >> $ALLOWFILE
    echo OTHER-IP >> $ALLOWFILE

    sudo chmod +x /etc/ssh_dyn_allow/renew_allowed_ip_list.sh
    $ sudo crontab -l | tail -n1
    */60 * * * * /etc/ssh_dyn_allow/renew_allowed_ip_list.sh
    tail -f /var/log/auth.log

    sudo ufw status
    #sudo ufw delete allow proto tcp to 0::/0 port 22
    sudo ufw allow proto tcp to 0.0.0.0/0 port 22
    sudo ufw enable
    sudo ufw reload

    cat /etc/apt/apt.conf.d/20auto-upgrades
    vim /etc/apt/apt.conf.d/50unattended-upgrades
    Unattended-Upgrade::Automatic-Reboot "true";
    Unattended-Upgrade::Automatic-Reboot-Time "12:00";
    sudo systemctl restart unattended-upgrades

20221229

I could not log in. The cause was that DDNS on the router had stopped working. I solved it by installing phddns on the onecloud machine, so the router's DDNS is no longer needed. This is not exactly stable either: the same day I noticed that the IP shown at https://console.hsk.oray.com/device was sometimes wrong. The cause turned out to be pptunnel, which reports the IP of the proxy point. Another cause was that the earlier free Oray domain had probably expired; applying for a new one fixed that. I have since switched to the free DNS from dnsexit.

    #openwrt ddns always has the problem, so change to use phddns, but phddns can report the fake IP as well
    wget https://dl-cdn.oray.com/hsk/linux/phddns_5.1.0_rapi_armhf.deb -O phddns_5.1.0_rapi_armhf.deb
    systemctl enable phtunnel.service
    phddns status
    curl http://ip.3322.net

    1, apply for a free domain on the site: https://dnsexit.com/
    2, create an api key: https://dnsexit.com/Direct.sv?cmd=userApiKey
    3, use the router to update the IP, or do it by hand:
       curl https://api.dnsexit.com/dns/ud/?apikey=API-Key -d host=your-free-domain

Actually there is no need to go to the trouble of hosts.deny; ufw alone can handle the IP control. The script below has one issue, though: for some proxied domains, dig $domain shows a CNAME, and in that case the CNAME should be used as the domain. The command 'ufw allow proto tcp from $IP to 0.0.0.0/0 port 7070' works in general, but for some reason it did not work for vx just now.

    # cat /etc/ssh_dyn_allow/renew_allowed_ip_list.sh
    #!/bin/bash
    # NOTE: if dig $DOMAIN has CNAME, DOMAIN should use CNAME
    SCRIPT=$(readlink -f $0)
    WORKDIR=$(dirname $SCRIPT)
    ALLOWFILE="$WORKDIR/hosts_sshd.allow"
    DOMAINFILE="$WORKDIR/allowed_domain.list"
    DOMAINS=$(cat $DOMAINFILE | grep -v "^#")
    echo "# automatic generated : $(date)" > $ALLOWFILE
    UFW_NUM=$(ufw status | wc -l)
    # Clean up old rules once the rule list has grown too long
    if [ $UFW_NUM -gt 1000 ]; then
        echo "# RUN: ufw delete ..."
        for num in $(ufw status numbered | grep -v tcp | grep ALLOW | awk -F"[][]" '{print $2}' | tr --delete '[:blank:]' | sort -rn); do yes | ufw delete $num; done
        rm -rf ~/lastip_*
    fi
    for DOMAIN in $DOMAINS
    do
        IP=$(dig $DOMAIN A | grep "^$DOMAIN" | sed -e 's/\s\s*/ /g' | cut -d ' ' -f5)
        echo "# $DOMAIN"
        if [ ! -f ~/lastip_$DOMAIN.txt ]; then touch ~/lastip_$DOMAIN.txt; fi
        LASTIP=$(cat ~/lastip_$DOMAIN.txt)
        echo "# for $DOMAIN , lastip is: $LASTIP , now IP is: $IP"
        # Only add rules when the resolved IP has changed
        if [ "$LASTIP" != "$IP" ]; then
            echo $IP > ~/lastip_$DOMAIN.txt
            echo "# RUN: ufw allow from $IP ..."
            echo "# UFW_NUM=$UFW_NUM"
            #do not know why the following rule does not work for vx proxy, but it works for others
            ufw allow proto tcp from $IP to 0.0.0.0/0 port 7070 >/dev/null 2>&1
            ufw allow proto tcp from $IP to 0.0.0.0/0 port 22 >/dev/null 2>&1
            # two backdoor for port 22
            ufw allow proto tcp from xx.xx.94.107 to 0.0.0.0/0 port 22 >/dev/null 2>&1
            ufw allow proto tcp from xx.xx.241.186 to 0.0.0.0/0 port 22 >/dev/null 2>&1
        fi
        echo $IP
    #done
    done >> $ALLOWFILE

Other settings

Saving traffic. Note that apt-cacher-ng and privoxy are error-prone when used together. The apt-cacher-ng option PassThroughPattern is there to support https (this other apt cache does not support https: https://soulteary.com/2022/11/20/linux-package-download-acceleration-tool-apt-proxy.html ):

    # privoxy doesn't work on openwrt, so use armbian instead
    sudo apt install privoxy -y
    echo 'listen-address 192.168.99.194:8118' | tee -a /var/etc/privoxy.conf
    echo 'forward-socks5 / 127.0.0.1:7071 .' | tee -a /var/etc/privoxy.conf
    systemctl enable privoxy
    systemctl restart privoxy

    sudo apt install apt-cacher-ng -y
    echo 'PassThroughPattern: .*' | sudo tee -a /etc/apt-cacher-ng/acng.conf
    #echo 'Proxy: http://127.0.0.1:8118' | sudo tee -a /etc/apt-cacher-ng/acng.conf
    sudo systemctl restart apt-cacher-ng.service
    sudo systemctl enable apt-cacher-ng.service

    #client side
    echo 'Acquire::http::Proxy "http://192.168.99.194:3142";' | sudo tee /etc/apt/apt.conf.d/01acng

When there is no public IP at home, natfrp.com is worth considering: its rate limit is 10M with 10G of traffic per month, while others such as cpolar are limited to 1M. The drawback is that it requires real-name registration.

    #https://www.natfrp.com/tunnel/download
    wget https://getfrp.sh/d/frpc_linux_armv7
    #wget https://getfrp.sh/d/frpc_linux_amd64
    #create tunnel in: https://www.natfrp.com/tunnel/
    #key is in: https://www.natfrp.com/user/profile
    ./frpc_linux_armv7
    cat ./frpc.ini
    cat << EOF | sudo tee /etc/systemd/system/frpc.service
    [Unit]
    Description=FRP Client Daemon
    After=network.target
    Wants=network.target
    [Service]
    Type=simple
    ExecStart=/root/frpc_linux_armv7 -c /root/frpc.ini
    Restart=always
    RestartSec=20s
    #User=nobody
    LimitNOFILE=infinity
    [Install]
    WantedBy=multi-user.target
    EOF
    systemctl restart frpc

20230506 - ddns

Using a free domain from dnsexit.com:

    #!/bin/sh
    # dig +tcp xx.dynv6.net @101.6.6.6 -p5353 AAAA
    API_KEY=xx
    hostname=xxy.dynv6.net
    token=xx
    # First global IPv6 address on eth0, with the prefix length stripped
    IP=$(ip -6 addr show dev eth0 | grep global | awk '{print $2}' | sed 's/\/64//' | head -n1 | sed 's/\/[^ ]*//')
    if [ ! -e /root/previous_ip.txt ]; then
        echo 'touch file'
        echo 0 > /root/previous_ip.txt
    fi
    if [ -s /root/previous_ip.txt ]; then
        PREVIOUS_IP=$(cat /root/previous_ip.txt)
        echo PREVIOUS_IP=$PREVIOUS_IP
        if [ "$IP" != "$PREVIOUS_IP" ]; then
            #for dnsexit.com, no double quotations here
            echo https://api.dnsexit.com/dns/ud/?apikey=${API_KEY} -d host=xx.publicvm.com -d ip=${IP}
            curl https://api.dnsexit.com/dns/ud/?apikey=${API_KEY} -d host=xx.publicvm.com -d ip=${IP}
            sleep 1
            #for dynv6.com, have double quotations here
            echo "https://dynv6.com/api/update?hostname=${hostname}&ipv6=${IP}&token=${token}"
            curl "https://dynv6.com/api/update?hostname=${hostname}&ipv6=${IP}&token=${token}"
            echo $IP > /root/previous_ip.txt
        fi
    else
        echo 'writing IP to previous_ip.txt'
        echo $IP > /root/previous_ip.txt
    fi

20240908

    # Block inbound ping
    iptables -A INPUT -p icmp --icmp-type echo-request -j DROP
    # Stop the server from replying with ICMP Port Unreachable and Host Unreachable messages
    iptables -A OUTPUT -p icmp --icmp-type port-unreachable -j DROP
    iptables -A OUTPUT -p icmp --icmp-type host-unreachable -j DROP
    # Stop the server from sending TCP Reset (flags RST,ACK) for ports nothing listens on
    iptables -A OUTPUT -p tcp --tcp-flags ALL RST,ACK -j DROP
    netfilter-persistent save
    #dpkg-reconfigure iptables-persistent

    net.ipv4.ip_forward = 1
    net.ipv4.conf.default.rp_filter=1
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.tcp_fin_timeout = 30
    net.ipv4.tcp_keepalive_time = 1200
    net.ipv4.tcp_mtu_probing = 1
    net.core.default_qdisc=fq
    net.ipv4.tcp_congestion_control=bbr
    net.ipv6.conf.all.forwarding = 1
    net.ipv4.tcp_timestamps=1

20240926 - tcp_timestamps

Both the server and the client need tcp_timestamps=1 to guard against RST packets. This is the default nowadays.

    sysctl -w net.ipv4.tcp_timestamps=1

20260420 - A strange ssh problem

Over the last two days I noticed a strange problem: on the y9000p, the window running 'ssh huaminipc' freezes after being resized (the cursor stops moving). It only recovers after opening another tab and running 'ssh huaminipc' again, or after waiting for a while. Observations:

1, Windows with ssh sessions to other machines do not have this problem; only the window with ssh to minipc does
2, After disabling byobu on minipc the problem disappears

So the problem only shows up with the specific combination minipc + byobu + ssh + resize. Running apt upgrade and then rebooting fixed it.

Reference
[1] https://vincent.bernat.ch/en/blog/2014-tcp-time-wait-state-linux
