Debugging and Reporting Bugs in Contiv-VPP
==========================================

A bug report should include the following information:

-  `Deployment description <#describe-deployment>`__: Briefly describe
   the deployment where the issue was spotted: the number of k8s nodes,
   and whether DHCP, STN, or TAP interfaces are used.
-  `Logs <#collecting-the-logs>`__: Attach the corresponding logs, at
   least from the vswitch pods.
-  `VPP config <#inspect-vpp-config>`__: Attach the output of the VPP
   show commands listed below.
-  `Basic Collection Example <#basic-example>`__

Describe Deployment
--------------------

Since Contiv-VPP can be used with different configurations, it is helpful
to attach the configuration that was applied. Either attach the
``values.yaml`` file passed to the Helm chart, or attach the
`corresponding part
<https://github.com/contiv/vpp/blob/42b3bfbe8735508667b1e7f1928109a65dfd5261/k8s/contiv-vpp.yaml#L24-L38>`__
of the deployment YAML file, for example::

    TCPstackDisabled: true
    UseTAPInterfaces: true
    TAPInterfaceVersion: 2
    NatExternalTraffic: true
    PodSubnetCIDR: 10.1.0.0/16
    PodNetworkPrefixLen: 24
    PodIfIPCIDR: 10.2.1.0/24
    VPPHostSubnetCIDR: 172.30.0.0/16
    VPPHostNetworkPrefixLen: 24
    NodeInterconnectCIDR: 192.168.16.0/24
    VxlanCIDR: 192.168.30.0/24
    NodeInterconnectDHCP: False
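
If the original YAML is no longer at hand, the applied configuration can
usually be recovered from the running cluster as well. This is a minimal
sketch, assuming the default deployment where the agent configuration is
stored in the ``contiv-agent-cfg`` ConfigMap in the ``kube-system``
namespace::

    # Dump the Contiv agent configuration currently applied in the cluster.
    # (The ConfigMap name is the one assumed for the default deployment YAML.)
    kubectl get configmap contiv-agent-cfg -n kube-system -o yaml > contiv-agent-cfg.yaml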

Information that might be helpful:

-  whether the node IPs are statically assigned, or whether DHCP is used
-  whether STN is enabled
-  the version of the TAP interfaces used
-  the output of ``kubectl get pods -o wide --all-namespaces``
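
A minimal sketch for capturing this environment information in a single
file that can be attached to the report (only node and pod listings are
collected; extend as needed)::

    # Capture basic cluster state for the bug report.
    kubectl get nodes -o wide > cluster-state.txt
    kubectl get pods -o wide --all-namespaces >> cluster-state.txt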

Collecting the Logs
--------------------

The most essential step when debugging and **reporting an issue** in
Contiv-VPP is **collecting the logs from the contiv-vpp vswitch
containers**.

a) Collecting Vswitch Logs Using kubectl
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to collect the logs from the individual vswitches in the cluster,
connect to the master node and find the pod names of the individual
vswitch pods::

    $ kubectl get pods --all-namespaces | grep vswitch
    kube-system   contiv-vswitch-lqxfp   2/2       Running   0          1h
    kube-system   contiv-vswitch-q6kwt   2/2       Running   0          1h

Then run the following command, with ``<pod name>`` replaced by the actual
pod name from the output above::

    $ kubectl logs <pod name> -n kube-system -c contiv-vswitch

Redirect the output to a file to save the logs, for example::

    $ kubectl logs contiv-vswitch-lqxfp -n kube-system -c contiv-vswitch > logs-master.txt
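
To collect the logs from every vswitch pod in one pass, a small loop such
as the following sketch can be used (pod names are taken from the
``kubectl get pods`` output shown above)::

    # Save the logs of each contiv-vswitch pod into a separate file,
    # e.g. logs-contiv-vswitch-lqxfp.txt.
    for pod in $(kubectl get pods -n kube-system -o name | grep contiv-vswitch); do
        name=${pod#pod/}
        kubectl logs "$name" -n kube-system -c contiv-vswitch > "logs-${name}.txt"
    done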

b) Collecting Vswitch Logs Using Docker
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If option a) does not work, you can still collect the same logs using the
plain ``docker`` command. To do so, connect to each individual node in the
k8s cluster and find the container ID of the vswitch container::

    $ docker ps | grep contivvpp/vswitch
    b682b5837e52   contivvpp/vswitch   "/usr/bin/supervisor…"   2 hours ago   Up 2 hours   k8s_contiv-vswitch_contiv-vswitch-q6kwt_kube-system_d09b6210-2903-11e8-b6c9-08002723b076_0

Now use the ID from the first column to dump the logs into the
``logs-master.txt`` file::

    $ docker logs b682b5837e52 > logs-master.txt
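
The container lookup and the log dump can also be combined into one step.
This is a sketch that relies on Docker's ``name`` filter matching the
kubelet-generated container name::

    # Dump the logs of the vswitch container running on this node.
    docker logs $(docker ps -q --filter name=k8s_contiv-vswitch) > logs-master.txt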

Reviewing the Vswitch Logs
^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to debug an issue, it is good to start by grepping the logs for
the ``level=error`` string, for example::

    $ cat logs-master.txt | grep level=error

Also, some bugs can cause VPP or the contiv-agent to crash. To check
whether a process has crashed, grep for the string ``exit``, for example::

    $ cat logs-master.txt | grep exit
    2018-03-20 06:03:45,948 INFO exited: vpp (terminated by SIGABRT (core dumped); not expected)
    2018-03-20 06:03:48,948 WARN received SIGTERM indicating exit request
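
Both checks can be run over all of the collected log files at once; a
minimal sketch, assuming the files follow the ``logs-*.txt`` naming used
above::

    # Look for errors and unexpected process exits in all collected logs.
    grep -n "level=error" logs-*.txt
    grep -n "exit" logs-*.txt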

Collecting the STN Daemon Logs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In STN (Steal The NIC) deployment scenarios, it is often necessary to
collect and review the logs from the STN daemon. This needs to be done on
each node::

    $ docker logs contiv-stn > logs-stn-master.txt
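
Since the STN daemon runs as a plain Docker container on every node, the
collection has to be repeated per node. The sketch below assumes
passwordless SSH access to each node and uses placeholder hostnames::

    # Collect the STN daemon logs from every node over SSH.
    # (Prepend sudo to the docker command if the user is not in the docker group.)
    for node in k8s-master k8s-worker1 k8s-worker2; do
        ssh "$node" "docker logs contiv-stn" > "logs-stn-${node}.txt" 2>&1
    done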

Collecting Logs in Case of Crash Loop
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If the vswitch is crashing in a loop (indicated by an increasing number in
the ``RESTARTS`` column of the ``kubectl get pods --all-namespaces``
output), then ``kubectl logs`` or ``docker logs`` would only give us the
logs of the latest incarnation of the vswitch. That might not contain the
root cause of the very first crash, so in order to debug it, we need to
disable the k8s health-check probes so that the vswitch is not restarted
after the very first crash. This can be done by commenting out the
``readinessProbe`` and ``livenessProbe`` sections in the contiv-vpp
deployment YAML::

    diff --git a/k8s/contiv-vpp.yaml b/k8s/contiv-vpp.yaml
    index 3676047..ffa4473 100644
    --- a/k8s/contiv-vpp.yaml
    +++ b/k8s/contiv-vpp.yaml
    @@ -224,18 +224,18 @@ spec:
               # readiness + liveness probe
               - containerPort: 9999
    -            initialDelaySeconds: 15
    -            initialDelaySeconds: 60
    +          #  initialDelaySeconds: 15
    +          #  initialDelaySeconds: 60
               - name: MICROSERVICE_LABEL
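
Alternatively, if only the logs of the previous (crashed) incarnation are
needed and the restart count is still low, they can often be retrieved
without modifying the deployment at all, using the ``--previous`` flag of
``kubectl logs``::

    # Fetch the logs of the previously terminated vswitch container instance.
    kubectl logs <pod name> -n kube-system -c contiv-vswitch --previous > logs-previous.txt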

If VPP is the crashing process, please follow the `CORE_FILES
<CORE_FILES.html>`__ guide and provide the coredump file.

Inspect VPP Config
------------------

Inspect the following areas (a script-friendly way of collecting all of
these outputs is sketched after the list):

-  Configured interfaces (issues related to basic node/pod
   connectivity)::

      vpp# sh int addr
      GigabitEthernet0/9/0 (up):
      l2 bridge bd_id 1 bvi shg 0

-  IP forwarding table::

      vpp# sh ip fib
      ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] locks:[src:(nil):2, src:adjacency:3, src:default-route:1, ]
      [@0]: dpo-load-balance: [proto:ip4 index:1 buckets:1 uRPF:0 to:[7:552]]
      [0] [@0]: dpo-drop ip4
      [@0]: dpo-load-balance: [proto:ip4 index:2 buckets:1 uRPF:1 to:[0:0]]
      [0] [@0]: dpo-drop ip4
      [@0]: dpo-load-balance: [proto:ip4 index:5 buckets:1 uRPF:4 to:[0:0]]
      [0] [@0]: dpo-drop ip4

-  ARP tables::

      vpp# sh ip arp
          Time           IP4       Flags      Ethernet              Interface
      728.6616   192.168.16.2     D     08:00:27:9c:0e:9f  GigabitEthernet0/8/0
      542.7045   192.168.30.2     S     1a:2b:3c:4d:5e:02  loop0
        1.4241     172.30.1.2     D     86:41:d5:92:fd:24  tapcli-0
       15.2485       10.1.1.2     SN    00:00:00:00:00:02  tapcli-1
      739.2339       10.1.1.3     SN    00:00:00:00:00:02  tapcli-2
      739.4119       10.1.1.4     SN    00:00:00:00:00:02  tapcli-3

-  NAT configuration (issues related to services)::

      DBGvpp# sh nat44 addresses
      NAT44 pool addresses:
      tenant VRF independent
      NAT44 twice-nat pool addresses:

      vpp# sh nat44 static mappings
      NAT44 static mappings:
      tcp local 192.168.42.1:6443 external 10.96.0.1:443 vrf 0 out2in-only
      tcp local 192.168.42.1:12379 external 192.168.42.2:32379 vrf 0 out2in-only
      tcp local 192.168.42.1:12379 external 192.168.16.2:32379 vrf 0 out2in-only
      tcp local 192.168.42.1:12379 external 192.168.42.1:32379 vrf 0 out2in-only
      tcp local 192.168.42.1:12379 external 192.168.16.1:32379 vrf 0 out2in-only
      tcp local 192.168.42.1:12379 external 10.109.143.39:12379 vrf 0 out2in-only
      udp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0 out2in-only
      tcp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0 out2in-only

      vpp# sh nat44 interfaces
      GigabitEthernet0/9/0 out

      vpp# sh nat44 sessions
      192.168.20.2: 0 dynamic translations, 3 static translations
      10.1.1.3: 0 dynamic translations, 0 static translations
      10.1.1.4: 0 dynamic translations, 0 static translations
      10.1.1.2: 0 dynamic translations, 6 static translations
      10.1.2.18: 0 dynamic translations, 2 static translations

-  ACL config (issues related to policies)::

      vpp# sh acl-plugin acl

-  “Steal the NIC (STN)” config (issues related to host connectivity
   when STN is enabled)::

      vpp# sh stn rules
      next_node: tapcli-0-output (410)

-  VXLAN tunnels::

      vpp# sh vxlan tunnels

-  Hardware interface information::

      vpp# sh hardware-interfaces
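
These outputs can also be collected non-interactively. The following is a
rough sketch, assuming that ``vppctl`` is available inside the vswitch
container (as in the default contiv-vswitch image) and that ``<pod name>``
is one of the vswitch pods found earlier::

    # Dump the most relevant VPP state from one vswitch pod into a single file.
    # Replace <pod name> with an actual vswitch pod name.
    out=vpp-config-master.txt
    : > "$out"
    for cmd in "sh int addr" "sh ip fib" "sh ip arp" \
               "sh nat44 addresses" "sh nat44 static mappings" \
               "sh nat44 interfaces" "sh nat44 sessions" \
               "sh acl-plugin acl" "sh stn rules" \
               "sh vxlan tunnels" "sh hardware-interfaces"; do
        echo "=== vpp# ${cmd} ===" >> "$out"
        kubectl exec <pod name> -n kube-system -c contiv-vswitch -- vppctl "$cmd" >> "$out"
    done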

Basic Example
-------------

`contiv-vpp-bug-report.sh
<https://github.com/contiv/vpp/tree/master/scripts/contiv-vpp-bug-report.sh>`__
is an example of a script that may be a useful starting point for
gathering the above information using kubectl.

Limitations:

-  The script does not include the STN daemon logs, nor does it handle
   the special case of a crash loop.

Prerequisites:

-  The user specified in the script must have passwordless access to all
   nodes in the cluster; on each node in the cluster, the user must have
   passwordless access to sudo.

Setting up Prerequisites
^^^^^^^^^^^^^^^^^^^^^^^^

To enable logging into a node without a password, copy your public key to
the node::

    ssh-copy-id <user-id>@<node-name-or-ip-address>
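
To do this for every node of the cluster at once, a loop such as the
following sketch can be used (the node names are placeholders)::

    # Copy the public key to all nodes in the cluster.
    # Replace <user-id> and the node names with the actual values.
    for node in k8s-master k8s-worker1 k8s-worker2; do
        ssh-copy-id <user-id>@"${node}"
    done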

To enable running sudo without a password for a given user, edit the
sudoers configuration (for example, with ``visudo``) and append the
following entry to run ALL commands without a password for the given
user::

    <user-id> ALL=(ALL) NOPASSWD:ALL

You can also add user ``<user-id>`` to group ``sudo`` and edit the
``sudo`` entry as follows::

    # Allow members of group sudo to execute any command
    %sudo ALL=(ALL:ALL) NOPASSWD:ALL

Add user ``<user-id>`` to group ``<group-id>`` using either of the
following commands::

    sudo adduser <user-id> <group-id>

    usermod -a -G <group-id> <user-id>
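
Once the prerequisites are in place, they can be verified quickly from the
machine where the script will run; a minimal sketch, with placeholder node
names::

    # Verify passwordless SSH and passwordless sudo on every node.
    # Replace <user-id> and the node names with the actual values.
    for node in k8s-master k8s-worker1 k8s-worker2; do
        if ssh -o BatchMode=yes <user-id>@"${node}" "sudo -n true"; then
            echo "${node}: OK"
        else
            echo "${node}: prerequisites missing"
        fi
    done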

Working with the Contiv-VPP Vagrant Test Bed
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The script can be used to collect data from the `Contiv-VPP test bed
created with Vagrant
<https://github.com/contiv/vpp/blob/master/vagrant/README.md>`__. To
collect debug information from this test bed, perform the following steps:

-  In the directory where you created your Vagrant test bed, generate an
   SSH configuration file::

      vagrant ssh-config > vagrant-ssh.conf

-  To collect the debug information, run::

      ./contiv-vpp-bug-report.sh -u vagrant -m k8s-master -f <path-to-your-vagrant-ssh-config-file>/vagrant-ssh.conf
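
Before running the script, the generated ``vagrant-ssh.conf`` can be
sanity-checked by connecting to the master machine with it; a quick
sketch, assuming the Vagrant machine is named ``k8s-master``::

    # Verify that the generated SSH config can reach the master node.
    ssh -F vagrant-ssh.conf k8s-master hostname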