Debugging and Reporting Bugs in Contiv-VPP
==========================================

Bug Report Structure
--------------------

-  `Deployment description <#describe-deployment>`__: Briefly describe
   the deployment in which the issue was spotted: the number of k8s
   nodes and whether DHCP, STN, or TAP interfaces are used.

-  `Logs <#collecting-the-logs>`__: Attach the corresponding logs, at
   least from the vswitch pods.

-  `VPP config <#inspect-vpp-config>`__: Attach the output of the show
   commands.

-  `Basic Collection Example <#basic-example>`__

Describe Deployment
~~~~~~~~~~~~~~~~~~~

Since contiv-vpp can be used with different configurations, it is
helpful to attach the config that was applied. Either attach
``values.yaml`` passed to the helm chart, or attach the `corresponding
part <https://github.com/contiv/vpp/blob/42b3bfbe8735508667b1e7f1928109a65dfd5261/k8s/contiv-vpp.yaml#L24-L38>`__
from the deployment yaml file.

.. code:: yaml

    contiv.yaml: |-
      TCPstackDisabled: true
      UseTAPInterfaces: true
      TAPInterfaceVersion: 2
      NatExternalTraffic: true
      MTUSize: 1500
      IPAMConfig:
        PodSubnetCIDR: 10.1.0.0/16
        PodNetworkPrefixLen: 24
        PodIfIPCIDR: 10.2.1.0/24
        VPPHostSubnetCIDR: 172.30.0.0/16
        VPPHostNetworkPrefixLen: 24
        NodeInterconnectCIDR: 192.168.16.0/24
        VxlanCIDR: 192.168.30.0/24
        NodeInterconnectDHCP: False

Information that might be helpful:

-  Whether node IPs are statically assigned, or if DHCP is used
-  Whether STN is enabled
-  Version of TAP interfaces used
-  Output of ``kubectl get pods -o wide --all-namespaces``

Collecting the Logs
~~~~~~~~~~~~~~~~~~~

The most important step when debugging and **reporting an issue** in
Contiv-VPP is **collecting the logs from the contiv-vpp vswitch
containers**.

a) Collecting Vswitch Logs Using kubectl
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To collect the logs from the individual vswitches in the cluster,
connect to the master node and find the pod names of the individual
vswitch containers:

::

    $ kubectl get pods --all-namespaces | grep vswitch
    kube-system   contiv-vswitch-lqxfp               2/2       Running   0          1h
    kube-system   contiv-vswitch-q6kwt               2/2       Running   0          1h

Then run the following command, with *pod name* replaced by the actual
pod name:

::

    $ kubectl logs <pod name> -n kube-system -c contiv-vswitch

Redirect the output to a file to save the logs, for example:

::

    $ kubectl logs contiv-vswitch-lqxfp -n kube-system -c contiv-vswitch > logs-master.txt
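
The per-pod steps above can be repeated for every vswitch in one go. The
following is a minimal sketch, not part of Contiv-VPP: the function names
``vswitch_pod_names`` and ``collect_vswitch_logs`` and the
``logs-<pod name>.txt`` file naming are assumptions made for illustration.

.. code:: shell

    # Print the names of the vswitch pods found in the output of
    # `kubectl get pods --all-namespaces` read from stdin
    # (the pod name is the second column).
    vswitch_pod_names() {
        awk '$2 ~ /^contiv-vswitch-/ { print $2 }'
    }

    # Save the logs of each vswitch pod to <out_dir>/logs-<pod name>.txt.
    collect_vswitch_logs() {
        local out_dir="${1:-.}" pod
        mkdir -p "$out_dir"
        kubectl get pods --all-namespaces | vswitch_pod_names | while read -r pod; do
            kubectl logs "$pod" -n kube-system -c contiv-vswitch > "$out_dir/logs-$pod.txt"
        done
    }

For example, ``collect_vswitch_logs ./bug-report`` would write one log file
per vswitch pod into the ``bug-report`` directory.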

b) Collecting Vswitch Logs Using Docker
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If option a) does not work, then you can still collect the same logs
using the plain docker command. For that, you need to connect to each
individual node in the k8s cluster, and find the container ID of the
vswitch container:

::

    $ docker ps | grep contivvpp/vswitch
    b682b5837e52        contivvpp/vswitch                                        "/usr/bin/supervisor…"   2 hours ago         Up 2 hours                              k8s_contiv-vswitch_contiv-vswitch-q6kwt_kube-system_d09b6210-2903-11e8-b6c9-08002723b076_0

Now use the ID from the first column to dump the logs into the
``logs-master.txt`` file:

::

    $ docker logs b682b5837e52 > logs-master.txt
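
The two docker steps above can be combined so that the container ID does not
have to be copied by hand. This is a hedged sketch: the function names are
hypothetical, and it assumes that the image name appears in the second column
of the ``docker ps`` output, as in the example above. Because ``docker logs``
replays the container's stderr stream on its own stderr, the sketch also
redirects stderr into the file.

.. code:: shell

    # Print the ID (first column) of each running vswitch container,
    # given `docker ps` output on stdin.
    vswitch_container_ids() {
        awk '$2 ~ /^contivvpp\/vswitch/ { print $1 }'
    }

    # Dump the logs of the vswitch container(s) on this node into one file.
    collect_docker_vswitch_logs() {
        local out_file="${1:-logs-master.txt}" id
        : > "$out_file"
        docker ps | vswitch_container_ids | while read -r id; do
            # Capture both stdout and stderr of `docker logs`.
            docker logs "$id" >> "$out_file" 2>&1
        done
    }

Run ``collect_docker_vswitch_logs logs-worker1.txt`` on each node, choosing a
different file name per node.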

Reviewing the Vswitch Logs
^^^^^^^^^^^^^^^^^^^^^^^^^^

To debug an issue, a good first step is to grep the logs for the
``level=error`` string, for example:

::

    $ grep level=error logs-master.txt

VPP or contiv-agent may also crash because of a bug. To check whether a
process crashed, grep for the string ``exit``, for example:

::

    $ grep exit logs-master.txt
    2018-03-20 06:03:45,948 INFO exited: vpp (terminated by SIGABRT (core dumped); not expected)
    2018-03-20 06:03:48,948 WARN received SIGTERM indicating exit request
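
Both checks can be folded into a quick summary of a saved log file. The
sketch below is illustrative only: ``summarize_vswitch_log`` is a hypothetical
helper, and it assumes that unexpected exits are reported in the
``exited: <process> (...; not expected)`` form shown in the example output
above.

.. code:: shell

    # Print the number of error lines and the names of crashed processes
    # found in a saved vswitch log file.
    summarize_vswitch_log() {
        local log_file="$1"
        echo "error lines: $(grep -c 'level=error' "$log_file")"
        echo "crashed processes:"
        # Keep only unexpected exits and extract the process name.
        grep 'exited: ' "$log_file" | grep 'not expected' \
            | sed -E 's/.*exited: ([^ ]+) .*/\1/' | sort -u
    }

For example, ``summarize_vswitch_log logs-master.txt`` lists ``vpp`` under
``crashed processes:`` for the log excerpt shown above.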

Collecting the STN Daemon Logs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In STN (Steal The NIC) deployment scenarios, you often need to collect
and review the logs from the STN daemon. This needs to be done on each
node:

::

    $ docker logs contiv-stn > logs-stn-master.txt

Collecting Logs in Case of Crash Loop
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If the vswitch is crashing in a loop (indicated by an increasing number
in the ``RESTARTS`` column of the ``kubectl get pods --all-namespaces``
output), ``kubectl logs`` or ``docker logs`` returns only the logs of
the latest incarnation of the vswitch. These might not show the root
cause of the very first crash. To debug that, disable the k8s health
check probes so that the vswitch is not restarted after the very first
crash. This can be done by commenting out the ``readinessProbe`` and
``livenessProbe`` in the contiv-vpp deployment YAML:

.. code:: diff

    diff --git a/k8s/contiv-vpp.yaml b/k8s/contiv-vpp.yaml
    index 3676047..ffa4473 100644
    --- a/k8s/contiv-vpp.yaml
    +++ b/k8s/contiv-vpp.yaml
    @@ -224,18 +224,18 @@ spec:
                ports:
                  # readiness + liveness probe
                  - containerPort: 9999
    -          readinessProbe:
    -            httpGet:
    -              path: /readiness
    -              port: 9999
    -            periodSeconds: 1
    -            initialDelaySeconds: 15
    -          livenessProbe:
    -            httpGet:
    -              path: /liveness
    -              port: 9999
    -            periodSeconds: 1
    -            initialDelaySeconds: 60
    + #         readinessProbe:
    + #           httpGet:
    + #             path: /readiness
    + #             port: 9999
    + #           periodSeconds: 1
    + #           initialDelaySeconds: 15
    + #         livenessProbe:
    + #           httpGet:
    + #             path: /liveness
    + #             port: 9999
    + #           periodSeconds: 1
    + #           initialDelaySeconds: 60
                env:
                  - name: MICROSERVICE_LABEL
                    valueFrom:

If VPP is the crashing process, please follow the
`CORE_FILES <CORE_FILES.html>`__ guide and provide the coredump file.
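
To spot a crash loop quickly, the ``RESTARTS`` column can be filtered with a
small helper. This is only a sketch: ``restarted_pods`` is a hypothetical
name, and it assumes the default column layout of
``kubectl get pods --all-namespaces`` (``NAMESPACE NAME READY STATUS RESTARTS
AGE``).

.. code:: shell

    # Print "<pod name> <restart count>" for every pod whose RESTARTS
    # column is non-zero, given `kubectl get pods --all-namespaces`
    # output on stdin. The header line is skipped because its fifth
    # column is not a number.
    restarted_pods() {
        awk '$5 ~ /^[0-9]+$/ && $5 + 0 > 0 { print $2, $5 }'
    }

Usage: ``kubectl get pods --all-namespaces | restarted_pods``. For a pod that
has restarted, ``kubectl logs <pod name> -n kube-system -c contiv-vswitch
--previous`` prints the logs of the previous container instance, which may
help before the probes are disabled, although that instance is not
necessarily the first one that crashed.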

Inspect VPP Config
~~~~~~~~~~~~~~~~~~

Inspect the following areas:

-  Configured interfaces (issues related to basic node/pod
   connectivity):

::

    vpp# sh int addr
    GigabitEthernet0/9/0 (up):
      192.168.16.1/24
    local0 (dn):
    loop0 (up):
      l2 bridge bd_id 1 bvi shg 0
      192.168.30.1/24
    tapcli-0 (up):
      172.30.1.1/24

-  IP forwarding table:

::

    vpp# sh ip fib
    ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] locks:[src:(nil):2, src:adjacency:3, src:default-route:1, ]
    0.0.0.0/0
      unicast-ip4-chain
      [@0]: dpo-load-balance: [proto:ip4 index:1 buckets:1 uRPF:0 to:[7:552]]
        [0] [@0]: dpo-drop ip4
    0.0.0.0/32
      unicast-ip4-chain
      [@0]: dpo-load-balance: [proto:ip4 index:2 buckets:1 uRPF:1 to:[0:0]]
        [0] [@0]: dpo-drop ip4

    ...
    ...

    255.255.255.255/32
      unicast-ip4-chain
      [@0]: dpo-load-balance: [proto:ip4 index:5 buckets:1 uRPF:4 to:[0:0]]
        [0] [@0]: dpo-drop ip4

-  ARP Table:

::

    vpp# sh ip arp
        Time           IP4       Flags      Ethernet              Interface
        728.6616  192.168.16.2     D    08:00:27:9c:0e:9f GigabitEthernet0/8/0
        542.7045  192.168.30.2     S    1a:2b:3c:4d:5e:02 loop0
          1.4241   172.30.1.2      D    86:41:d5:92:fd:24 tapcli-0
         15.2485    10.1.1.2      SN    00:00:00:00:00:02 tapcli-1
        739.2339    10.1.1.3      SN    00:00:00:00:00:02 tapcli-2
        739.4119    10.1.1.4      SN    00:00:00:00:00:02 tapcli-3

-  NAT configuration (issues related to services):

::

    DBGvpp# sh nat44 addresses
    NAT44 pool addresses:
    192.168.16.10
      tenant VRF independent
      0 busy udp ports
      0 busy tcp ports
      0 busy icmp ports
    NAT44 twice-nat pool addresses:

::

    vpp# sh nat44 static mappings
    NAT44 static mappings:
      tcp local 192.168.42.1:6443 external 10.96.0.1:443 vrf 0  out2in-only
      tcp local 192.168.42.1:12379 external 192.168.42.2:32379 vrf 0  out2in-only
      tcp local 192.168.42.1:12379 external 192.168.16.2:32379 vrf 0  out2in-only
      tcp local 192.168.42.1:12379 external 192.168.42.1:32379 vrf 0  out2in-only
      tcp local 192.168.42.1:12379 external 192.168.16.1:32379 vrf 0  out2in-only
      tcp local 192.168.42.1:12379 external 10.109.143.39:12379 vrf 0  out2in-only
      udp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0  out2in-only
      tcp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0  out2in-only

::

    vpp# sh nat44 interfaces
    NAT44 interfaces:
      loop0 in out
      GigabitEthernet0/9/0 out
      tapcli-0 in out

::

    vpp# sh nat44 sessions
    NAT44 sessions:
      192.168.20.2: 0 dynamic translations, 3 static translations
      10.1.1.3: 0 dynamic translations, 0 static translations
      10.1.1.4: 0 dynamic translations, 0 static translations
      10.1.1.2: 0 dynamic translations, 6 static translations
      10.1.2.18: 0 dynamic translations, 2 static translations

-  ACL config (issues related to policies):

::

    vpp# sh acl-plugin acl

-  “Steal the NIC (STN)” config (issues related to host connectivity
   when STN is active):

::

    vpp# sh stn rules
    - rule_index: 0
      address: 10.1.10.47
      iface: tapcli-0 (2)
      next_node: tapcli-0-output (410)

-  Errors:

::

    vpp# sh errors

-  Vxlan tunnels:

::

    vpp# sh vxlan tunnels

-  Hardware interface information:

::

    vpp# sh hardware-interfaces
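
The show commands listed above can be collected from a vswitch pod in one
pass. The following is a sketch under stated assumptions, not the method used
by Contiv-VPP itself: the function names are hypothetical, and it assumes
that the ``vppctl`` CLI is available inside the ``contiv-vswitch`` container
and that ``kubectl exec`` is permitted for the pod.

.. code:: shell

    # The show commands from this section, one per line.
    vpp_show_commands() {
        printf '%s\n' \
            "sh int addr" "sh ip fib" "sh ip arp" \
            "sh nat44 addresses" "sh nat44 static mappings" \
            "sh nat44 interfaces" "sh nat44 sessions" \
            "sh acl-plugin acl" "sh stn rules" "sh errors" \
            "sh vxlan tunnels" "sh hardware-interfaces"
    }

    # Run every show command in the vswitch container of the given pod
    # and write the combined output to the given file.
    collect_vpp_config() {
        local pod="$1" out_file="$2" cmd
        : > "$out_file"
        vpp_show_commands | while read -r cmd; do
            echo "vpp# $cmd" >> "$out_file"
            # $cmd is left unquoted on purpose so that vppctl receives one
            # word per argument; stdin is redirected so that the command
            # cannot consume the list being read by the loop.
            kubectl exec "$pod" -n kube-system -c contiv-vswitch -- \
                vppctl $cmd < /dev/null >> "$out_file" 2>&1
            echo >> "$out_file"
        done
    }

For example, ``collect_vpp_config contiv-vswitch-lqxfp vpp-config-master.txt``
would store the output of all show commands for that pod in a single file
that can be attached to the bug report.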

Basic Example
~~~~~~~~~~~~~

`contiv-vpp-bug-report.sh <https://github.com/contiv/vpp/tree/master/scripts/contiv-vpp-bug-report.sh>`__
is an example of a script that may be a useful starting point for
gathering the above information using kubectl.

Limitations:

-  The script does not include STN daemon logs, nor does it handle the
   special case of a crash loop.

Prerequisites:

-  The user specified in the script must have passwordless access to
   all nodes in the cluster; on each node in the cluster, the user must
   have passwordless access to sudo.

Setting up Prerequisites
^^^^^^^^^^^^^^^^^^^^^^^^

To enable logging into a node without a password, copy your public key
to that node:

::

    ssh-copy-id <user-id>@<node-name-or-ip-address>

To enable running sudo without a password for a given user, enter:

::

    $ sudo visudo

Append the following entry to run all commands without a password for a
given user:

::

    <user-id> ALL=(ALL) NOPASSWD:ALL

You can also add user ``<user-id>`` to group ``sudo`` and edit the
``sudo`` entry as follows:

::

    # Allow members of group sudo to execute any command
    %sudo   ALL=(ALL:ALL) NOPASSWD:ALL

Add user ``<user-id>`` to group ``<group-id>`` as follows:

::

    sudo adduser <user-id> <group-id>

or as follows:

::

    sudo usermod -a -G <group-id> <user-id>
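
Before running the script, it may be worth checking that both prerequisites
hold on every node. The helper below is a hypothetical sketch, not part of
the bug-report script. It relies on ``ssh -o BatchMode=yes``, which fails
instead of prompting for a password, and on ``sudo -n``, which fails instead
of prompting when a password would be required.

.. code:: shell

    # Return 0 if <user> can log in to <node> without a password and
    # run sudo there without a password; return non-zero otherwise.
    check_node_access() {
        local user="$1" node="$2"
        ssh -o BatchMode=yes "$user@$node" sudo -n true
    }

For example, ``check_node_access <user-id> <node-name-or-ip-address> || echo
"prerequisites not met"`` reports a node that still needs to be set up.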

Working with the Contiv-VPP Vagrant Test Bed
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The script can be used to collect data from the `Contiv-VPP test bed
created with
Vagrant <https://github.com/contiv/vpp/blob/master/vagrant/README.md>`__.
To collect debug information from this Contiv-VPP test bed, perform the
following steps:

-  In the directory where you created your vagrant test bed, run:

::

    vagrant ssh-config > vagrant-ssh.conf

-  To collect the debug information, run:

::

    ./contiv-vpp-bug-report.sh -u vagrant -m k8s-master -f <path-to-your-vagrant-ssh-config-file>/vagrant-ssh.conf