docs/usecases/contiv/BUG_REPORTS.md

# Debugging and Reporting Bugs in Contiv-VPP

## Bug Report Structure

- [Deployment description](#describe-deployment):
Briefly describe the deployment in which the issue was spotted: the number
of k8s nodes and whether DHCP, STN, or TAP interfaces are used.

- [Logs](#collecting-the-logs):
Attach the corresponding logs, at least from the vswitch pods.

- [VPP config](#inspect-vpp-config):
Attach the output of the show commands.

- [Basic Collection Example](#basic-example)

### Describe Deployment
Since contiv-vpp can be used with different configurations, it is helpful
to attach the config that was applied. Either attach `values.yaml` passed to the helm chart,
or attach the [corresponding part](https://github.com/contiv/vpp/blob/42b3bfbe8735508667b1e7f1928109a65dfd5261/k8s/contiv-vpp.yaml#L24-L38) from the deployment yaml file.

```
  contiv.yaml: |-
    TCPstackDisabled: true
    UseTAPInterfaces: true
    TAPInterfaceVersion: 2
    NatExternalTraffic: true
    MTUSize: 1500
    IPAMConfig:
      PodSubnetCIDR: 10.1.0.0/16
      PodNetworkPrefixLen: 24
      PodIfIPCIDR: 10.2.1.0/24
      VPPHostSubnetCIDR: 172.30.0.0/16
      VPPHostNetworkPrefixLen: 24
      NodeInterconnectCIDR: 192.168.16.0/24
      VxlanCIDR: 192.168.30.0/24
      NodeInterconnectDHCP: False
```

Information that might be helpful:
 - Whether node IPs are statically assigned or assigned by DHCP
 - Whether STN is enabled
 - Version of TAP interfaces used
 - Output of `kubectl get pods -o wide --all-namespaces`
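
The cluster-level items in the list above can be captured in one file. The following is a minimal sketch, not part of Contiv-VPP: the function name `collect_deployment_info` and the default file name are made up for illustration, and `kubectl get nodes -o wide` is added on the assumption that node addresses are useful context.

```shell
# Sketch: gather basic deployment facts for a bug report.
# The function name and the default output file are illustrative only.
collect_deployment_info() {
    local out="${1:-deployment-info.txt}"
    {
        echo "### kubectl get nodes -o wide"
        kubectl get nodes -o wide
        echo "### kubectl get pods -o wide --all-namespaces"
        kubectl get pods -o wide --all-namespaces
    } > "$out" 2>&1
}
```

Run it on the master node, for example `collect_deployment_info report.txt`, and attach the resulting file.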


### Collecting the Logs

The most essential step when debugging and **reporting an issue**
in Contiv-VPP is **collecting the logs from the contiv-vpp vswitch containers**.

#### a) Collecting Vswitch Logs Using kubectl
To collect the logs from the individual vswitches in the cluster, connect to the master node
and find the pod names of the individual vswitch containers:

```
$ kubectl get pods --all-namespaces | grep vswitch
kube-system   contiv-vswitch-lqxfp               2/2       Running   0          1h
kube-system   contiv-vswitch-q6kwt               2/2       Running   0          1h
```

Then run the following command, with *pod name* replaced by the actual pod name:
```
$ kubectl logs <pod name> -n kube-system -c contiv-vswitch
```

Redirect the output to a file to save the logs, for example:

```
$ kubectl logs contiv-vswitch-lqxfp -n kube-system -c contiv-vswitch > logs-master.txt
```
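
When the cluster has many nodes, repeating the two steps above for each pod gets tedious. The sketch below loops over all pods whose name contains `vswitch` and saves one log file per pod. The function name `collect_vswitch_logs` and the `logs-<pod name>.txt` naming are assumptions made for this example.

```shell
# Sketch: save the logs of every vswitch pod into <outdir>/logs-<pod name>.txt.
# The function name and file naming scheme are illustrative only.
collect_vswitch_logs() {
    local outdir="${1:-.}" pod name
    # "-o name" prints entries such as "pod/contiv-vswitch-lqxfp"
    for pod in $(kubectl get pods -n kube-system -o name | grep vswitch); do
        name="${pod##*/}"    # strip the "pod/" prefix
        kubectl logs "$name" -n kube-system -c contiv-vswitch > "$outdir/logs-$name.txt"
    done
}
```

For example, `collect_vswitch_logs /tmp/contiv-logs` writes the files into an existing `/tmp/contiv-logs` directory.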

#### b) Collecting Vswitch Logs Using Docker
If option a) does not work, you can still collect the same logs using the plain docker
command. To do so, connect to each individual node in the k8s cluster and find the container ID of the vswitch container:

```
$ docker ps | grep contivvpp/vswitch
b682b5837e52        contivvpp/vswitch                                        "/usr/bin/supervisor…"   2 hours ago         Up 2 hours                              k8s_contiv-vswitch_contiv-vswitch-q6kwt_kube-system_d09b6210-2903-11e8-b6c9-08002723b076_0
```

Now use the ID from the first column to dump the logs into the `logs-master.txt` file:
```
$ docker logs b682b5837e52 > logs-master.txt
```

#### Reviewing the Vswitch Logs

To debug an issue, it is good to start by grepping the logs for the `level=error` string, for example:
```
$ cat logs-master.txt | grep level=error
```

VPP or contiv-agent may also crash because of a bug. To check whether a process crashed, grep for the string `exit`, for example:
```
$ cat logs-master.txt | grep exit
2018-03-20 06:03:45,948 INFO exited: vpp (terminated by SIGABRT (core dumped); not expected)
2018-03-20 06:03:48,948 WARN received SIGTERM indicating exit request
```
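
The two checks above can be combined into a quick first pass over a saved log file. This is a minimal sketch; the function name `summarize_vswitch_log` is made up for illustration, and the search strings are the same `level=error` and `exit` used above.

```shell
# Sketch: print the number of error-level lines in a saved log,
# followed by any lines that mention a process exit.
summarize_vswitch_log() {
    local log="$1"
    echo "error lines: $(grep -c 'level=error' "$log")"
    grep 'exit' "$log" || echo "no exit messages found"
}
```

For example, `summarize_vswitch_log logs-master.txt` gives a rough indication of whether the log deserves a closer look.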

#### Collecting the STN Daemon Logs
In STN (Steal The NIC) deployment scenarios, you often need to collect and review the logs
from the STN daemon. This needs to be done on each node:
```
$ docker logs contiv-stn > logs-stn-master.txt
```

#### Collecting Logs in Case of Crash Loop
If the vswitch is crashing in a loop (indicated by an increasing number in the `RESTARTS`
column of the `kubectl get pods --all-namespaces` output), `kubectl logs` or `docker logs`
gives us only the logs of the latest incarnation of the vswitch. These may not show the root cause
of the very first crash. To debug that, we need to disable the k8s health check probes so that
the vswitch is not restarted after the very first crash. This can be done by commenting out the `readinessProbe`
and `livenessProbe` in the contiv-vpp deployment YAML:

```diff
diff --git a/k8s/contiv-vpp.yaml b/k8s/contiv-vpp.yaml
index 3676047..ffa4473 100644
--- a/k8s/contiv-vpp.yaml
+++ b/k8s/contiv-vpp.yaml
@@ -224,18 +224,18 @@ spec:
           ports:
             # readiness + liveness probe
             - containerPort: 9999
-          readinessProbe:
-            httpGet:
-              path: /readiness
-              port: 9999
-            periodSeconds: 1
-            initialDelaySeconds: 15
-          livenessProbe:
-            httpGet:
-              path: /liveness
-              port: 9999
-            periodSeconds: 1
-            initialDelaySeconds: 60
+ #         readinessProbe:
+ #           httpGet:
+ #             path: /readiness
+ #             port: 9999
+ #           periodSeconds: 1
+ #           initialDelaySeconds: 15
+ #         livenessProbe:
+ #           httpGet:
+ #             path: /liveness
+ #             port: 9999
+ #           periodSeconds: 1
+ #           initialDelaySeconds: 60
           env:
             - name: MICROSERVICE_LABEL
               valueFrom:
```
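
Before editing the probes, it may be worth saving what the previous container instance logged. `kubectl logs` has a standard `--previous` flag for this, but it only returns the instance immediately before the current one, so it does not replace disabling the probes when the very first crash is the one of interest. The function name and output file name below are assumptions made for this sketch.

```shell
# Sketch: save the logs of the previous (crashed) incarnation of a vswitch pod.
# --previous covers only the instance right before the current one,
# not necessarily the very first crash.
collect_previous_vswitch_logs() {
    local pod="$1" outdir="${2:-.}"
    kubectl logs "$pod" -n kube-system -c contiv-vswitch --previous \
        > "$outdir/logs-$pod-previous.txt"
}
```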

If VPP is the crashing process, please follow the [CORE_FILES](CORE_FILES.html) guide and provide the coredump file.


### Inspect VPP Config
Inspect the following areas:
- Configured interfaces (issues related to basic node/pod connectivity):
```
vpp# sh int addr
GigabitEthernet0/9/0 (up):
  192.168.16.1/24
local0 (dn):
loop0 (up):
  l2 bridge bd_id 1 bvi shg 0
  192.168.30.1/24
tapcli-0 (up):
  172.30.1.1/24
```

- IP forwarding table:
```
vpp# sh ip fib
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] locks:[src:(nil):2, src:adjacency:3, src:default-route:1, ]
0.0.0.0/0
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:1 buckets:1 uRPF:0 to:[7:552]]
    [0] [@0]: dpo-drop ip4
0.0.0.0/32
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:2 buckets:1 uRPF:1 to:[0:0]]
    [0] [@0]: dpo-drop ip4

...
...

255.255.255.255/32
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:5 buckets:1 uRPF:4 to:[0:0]]
    [0] [@0]: dpo-drop ip4
```
- ARP Table:
```
vpp# sh ip arp
    Time           IP4       Flags      Ethernet              Interface
    728.6616  192.168.16.2     D    08:00:27:9c:0e:9f GigabitEthernet0/8/0
    542.7045  192.168.30.2     S    1a:2b:3c:4d:5e:02 loop0
      1.4241   172.30.1.2      D    86:41:d5:92:fd:24 tapcli-0
     15.2485    10.1.1.2      SN    00:00:00:00:00:02 tapcli-1
    739.2339    10.1.1.3      SN    00:00:00:00:00:02 tapcli-2
    739.4119    10.1.1.4      SN    00:00:00:00:00:02 tapcli-3
```
- NAT configuration (issues related to services):
```
DBGvpp# sh nat44 addresses
NAT44 pool addresses:
192.168.16.10
  tenant VRF independent
  0 busy udp ports
  0 busy tcp ports
  0 busy icmp ports
NAT44 twice-nat pool addresses:
```
```
vpp# sh nat44 static mappings
NAT44 static mappings:
 tcp local 192.168.42.1:6443 external 10.96.0.1:443 vrf 0  out2in-only
 tcp local 192.168.42.1:12379 external 192.168.42.2:32379 vrf 0  out2in-only
 tcp local 192.168.42.1:12379 external 192.168.16.2:32379 vrf 0  out2in-only
 tcp local 192.168.42.1:12379 external 192.168.42.1:32379 vrf 0  out2in-only
 tcp local 192.168.42.1:12379 external 192.168.16.1:32379 vrf 0  out2in-only
 tcp local 192.168.42.1:12379 external 10.109.143.39:12379 vrf 0  out2in-only
 udp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0  out2in-only
 tcp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0  out2in-only
```
```
vpp# sh nat44 interfaces
NAT44 interfaces:
 loop0 in out
 GigabitEthernet0/9/0 out
 tapcli-0 in out
```
```
vpp# sh nat44 sessions
NAT44 sessions:
  192.168.20.2: 0 dynamic translations, 3 static translations
  10.1.1.3: 0 dynamic translations, 0 static translations
  10.1.1.4: 0 dynamic translations, 0 static translations
  10.1.1.2: 0 dynamic translations, 6 static translations
  10.1.2.18: 0 dynamic translations, 2 static translations
```
- ACL config (issues related to policies):
```
vpp# sh acl-plugin acl
```
- "Steal the NIC (STN)" config (issues related to host connectivity when STN is active):
```
vpp# sh stn rules
- rule_index: 0
  address: 10.1.10.47
  iface: tapcli-0 (2)
  next_node: tapcli-0-output (410)
```
- Errors:
```
vpp# sh errors
```
- Vxlan tunnels:
```
vpp# sh vxlan tunnels
```
- Hardware interface information:
```
vpp# sh hardware-interfaces
```
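
The show commands listed in this section can be collected in one pass. The sketch below assumes that the `vppctl` CLI is available inside the `contiv-vswitch` container and can be reached with `kubectl exec`; if that does not hold in your deployment, run the commands from the VPP console instead. The function name `dump_vpp_state` and the output file name are illustrative only.

```shell
# Sketch: run the show commands from this section in one vswitch pod
# and save the combined output.
# Assumption: vppctl is available inside the contiv-vswitch container.
dump_vpp_state() {
    local pod="$1" out="${2:-vpp-state.txt}" cmd
    for cmd in "sh int addr" "sh ip fib" "sh ip arp" "sh nat44 addresses" \
               "sh nat44 static mappings" "sh nat44 interfaces" "sh nat44 sessions" \
               "sh acl-plugin acl" "sh stn rules" "sh errors" "sh vxlan tunnels" \
               "sh hardware-interfaces"; do
        echo "### vpp# $cmd"
        # $cmd is left unquoted on purpose so it splits into separate vppctl arguments
        kubectl exec "$pod" -n kube-system -c contiv-vswitch -- vppctl $cmd
    done > "$out" 2>&1
}
```

For example, `dump_vpp_state contiv-vswitch-lqxfp vpp-master.txt` writes the output of all commands for that pod into `vpp-master.txt`.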

### Basic Example

[contiv-vpp-bug-report.sh][1] is an example script that may be a useful starting point for gathering the above information using kubectl.

Limitations:
- The script does not include the STN daemon logs, nor does it handle the special
  case of a crash loop.

Prerequisites:
- The user specified in the script must have passwordless access to all nodes
  in the cluster; on each node in the cluster the user must have passwordless
  access to sudo.

#### Setting up Prerequisites
To enable logging into a node without a password, copy your public key to that
node:
```
ssh-copy-id <user-id>@<node-name-or-ip-address>
```

To enable running sudo without a password for a given user, enter:
```
$ sudo visudo
```

Append the following entry to run all commands without a password for a given
user:
```
<user-id> ALL=(ALL) NOPASSWD:ALL
```

Alternatively, add user `<user-id>` to group `sudo` and edit the `sudo`
entry as follows:

```
# Allow members of group sudo to execute any command
%sudo   ALL=(ALL:ALL) NOPASSWD:ALL
```

Add user `<user-id>` to group `<group-id>` as follows:
```
sudo adduser <user-id> <group-id>
```
or as follows:
```
sudo usermod -a -G <group-id> <user-id>
```
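
Once the prerequisites are set up, both of them can be verified from the machine that will run the script. This is a minimal sketch with a made-up function name; it relies on the standard `ssh -o BatchMode=yes` option and the `sudo -n` flag, both of which fail instead of prompting for a password.

```shell
# Sketch: succeed only if <user>@<node> accepts key-based SSH logins
# and can run sudo without a password prompt.
check_passwordless_access() {
    local user="$1" node="$2"
    # BatchMode=yes makes ssh fail rather than ask for a password;
    # sudo -n makes sudo fail rather than prompt.
    ssh -o BatchMode=yes "$user@$node" sudo -n true
}
```

For example, `check_passwordless_access <user-id> <node-name-or-ip-address> && echo OK` prints `OK` only when both prerequisites hold for that node.
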
#### Working with the Contiv-VPP Vagrant Test Bed
The script can be used to collect data from the [Contiv-VPP test bed created with Vagrant][2].
To collect debug information from this Contiv-VPP test bed, do the
following steps:
* In the directory where you created your vagrant test bed, do:
```
  vagrant ssh-config > vagrant-ssh.conf
```
* To collect the debug information do:
```
  ./contiv-vpp-bug-report.sh -u vagrant -m k8s-master -f <path-to-your-vagrant-ssh-config-file>/vagrant-ssh.conf
```

[1]: https://github.com/contiv/vpp/tree/master/scripts/contiv-vpp-bug-report.sh
[2]: https://github.com/contiv/vpp/blob/master/vagrant/README.md