# Debugging and Reporting Bugs in Contiv-VPP
## Bug Report Structure
- [Deployment description](#describe-deployment):
  Briefly describe the deployment where the issue was spotted:
  the number of k8s nodes, and whether DHCP/STN/TAP is used.

- [Logs](#collecting-the-logs):
  Attach the corresponding logs, at least from the vswitch pods.

- [VPP config](#inspect-vpp-config):
  Attach the output of the relevant VPP `show` commands.

- [Basic Collection Example](#basic-example)
### Describe Deployment
Since contiv-vpp can be used with different configurations, it is helpful
to attach the config that was applied. Either attach the `values.yaml` passed to the helm chart,
or attach the [corresponding part](https://github.com/contiv/vpp/blob/42b3bfbe8735508667b1e7f1928109a65dfd5261/k8s/contiv-vpp.yaml#L24-L38) from the deployment yaml file:
```
TCPstackDisabled: true
UseTAPInterfaces: true
TAPInterfaceVersion: 2
NatExternalTraffic: true

PodSubnetCIDR: 10.1.0.0/16
PodNetworkPrefixLen: 24
PodIfIPCIDR: 10.2.1.0/24
VPPHostSubnetCIDR: 172.30.0.0/16
VPPHostNetworkPrefixLen: 24
NodeInterconnectCIDR: 192.168.16.0/24
VxlanCIDR: 192.168.30.0/24
NodeInterconnectDHCP: False
```
Information that might be helpful:
- Whether node IPs are statically assigned, or if DHCP is used
- Version of TAP interfaces used
- Output of `kubectl get pods -o wide --all-namespaces`
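The deployment facts listed above can be captured in one pass with a small helper script. This is only a sketch: the `section` and `describe_deployment` names and the output file are illustrative, not part of Contiv-VPP, and the `kubectl` calls require a live cluster.

```shell
#!/bin/sh
# Hypothetical sketch: gather the deployment facts listed above into one file.
# Function names and the output file name are assumptions, not Contiv-VPP APIs.

section() {
  # Print a labeled divider so each command's output is easy to find.
  printf '=== %s ===\n' "$1"
}

describe_deployment() {
  # Requires kubectl configured for the cluster.
  out="${1:-deployment-info.txt}"
  {
    section "kubectl get nodes -o wide"
    kubectl get nodes -o wide
    section "kubectl get pods -o wide --all-namespaces"
    kubectl get pods -o wide --all-namespaces
  } > "$out"
}
```

Attach the resulting file to the bug report alongside the deployment yaml.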
### Collecting the Logs
The most essential thing that needs to be done when debugging and **reporting an issue**
in Contiv-VPP is **collecting the logs from the contiv-vpp vswitch containers**.
#### a) Collecting Vswitch Logs Using kubectl
In order to collect the logs from the individual vswitches in the cluster, connect to the master node
and find the pod names of the individual vswitch containers:
```
$ kubectl get pods --all-namespaces | grep vswitch
kube-system   contiv-vswitch-lqxfp   2/2   Running   0   1h
kube-system   contiv-vswitch-q6kwt   2/2   Running   0   1h
```
Then run the following command, with *pod name* replaced by the actual pod name:
```
$ kubectl logs <pod name> -n kube-system -c contiv-vswitch
```
Redirect the output to a file to save the logs, for example:
```
kubectl logs contiv-vswitch-lqxfp -n kube-system -c contiv-vswitch > logs-master.txt
```
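To grab the logs from every vswitch pod in one go, a small loop can be used. This is a sketch under the assumption that `kubectl` is configured for the cluster; the `vswitch_log_cmd` and `collect_all_vswitch_logs` helper names are illustrative, not part of Contiv-VPP.

```shell
#!/bin/sh
# Hypothetical sketch: dump the logs of all contiv-vswitch pods, one file per pod.

vswitch_log_cmd() {
  # Build the kubectl command for one pod (mirrors the single-pod command above).
  echo "kubectl logs $1 -n kube-system -c contiv-vswitch > logs-$1.txt"
}

collect_all_vswitch_logs() {
  # Requires a live cluster; iterates over every vswitch pod name.
  kubectl get pods -n kube-system -o name | grep contiv-vswitch | cut -d/ -f2 |
  while read -r pod; do
    kubectl logs "$pod" -n kube-system -c contiv-vswitch > "logs-${pod}.txt"
  done
}
```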
#### b) Collecting Vswitch Logs Using Docker
If option a) does not work, you can still collect the same logs using the plain docker
command. For that, connect to each individual node in the k8s cluster and find the container ID of the vswitch container:
```
$ docker ps | grep contivvpp/vswitch
b682b5837e52   contivvpp/vswitch   "/usr/bin/supervisor…"   2 hours ago   Up 2 hours   k8s_contiv-vswitch_contiv-vswitch-q6kwt_kube-system_d09b6210-2903-11e8-b6c9-08002723b076_0
```
Now use the ID from the first column to dump the logs into the `logs-master.txt` file:
```
$ docker logs b682b5837e52 > logs-master.txt
```
#### Reviewing the Vswitch Logs
In order to debug an issue, it is good to start by grepping the logs for the `level=error` string, for example:
```
$ cat logs-master.txt | grep level=error
```
Also, VPP or the contiv-agent may crash. To check whether a process has crashed, grep for the string `exit`, for example:
```
$ cat logs-master.txt | grep exit
2018-03-20 06:03:45,948 INFO exited: vpp (terminated by SIGABRT (core dumped); not expected)
2018-03-20 06:03:48,948 WARN received SIGTERM indicating exit request
```
#### Collecting the STN Daemon Logs
In STN (Steal The NIC) deployment scenarios, it is often necessary to collect and review the logs
from the STN daemon. This needs to be done on each node:
```
$ docker logs contiv-stn > logs-stn-master.txt
```
#### Collecting Logs in Case of Crash Loop
If the vswitch is crashing in a loop (indicated by an increasing number in the `RESTARTS`
column of the `kubectl get pods --all-namespaces` output), `kubectl logs` or `docker logs` would
only give us the logs of the latest incarnation of the vswitch. That might not reveal the root cause
of the very first crash, so in order to debug it, we need to disable the k8s health check probes so
that the vswitch is not restarted after the very first crash. This can be done by commenting out the `readinessProbe`
and `livenessProbe` in the contiv-vpp deployment YAML:
```diff
diff --git a/k8s/contiv-vpp.yaml b/k8s/contiv-vpp.yaml
index 3676047..ffa4473 100644
--- a/k8s/contiv-vpp.yaml
+++ b/k8s/contiv-vpp.yaml
@@ -224,18 +224,18 @@ spec:
           # readiness + liveness probe
           - containerPort: 9999
-          initialDelaySeconds: 15
-          initialDelaySeconds: 60
+          # initialDelaySeconds: 15
+          # initialDelaySeconds: 60
           - name: MICROSERVICE_LABEL
```
If VPP is the crashing process, please follow the [CORE_FILES](CORE_FILES.html) guide and provide the coredump file.
### Inspect VPP Config
Inspect the following areas:
- Configured interfaces (issues related to basic node/pod connectivity):
```
GigabitEthernet0/9/0 (up):
l2 bridge bd_id 1 bvi shg 0
```
- IP forwarding table:
```
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] locks:[src:(nil):2, src:adjacency:3, src:default-route:1, ]
  [@0]: dpo-load-balance: [proto:ip4 index:1 buckets:1 uRPF:0 to:[7:552]]
    [0] [@0]: dpo-drop ip4
  [@0]: dpo-load-balance: [proto:ip4 index:2 buckets:1 uRPF:1 to:[0:0]]
    [0] [@0]: dpo-drop ip4
  [@0]: dpo-load-balance: [proto:ip4 index:5 buckets:1 uRPF:4 to:[0:0]]
    [0] [@0]: dpo-drop ip4
```
- ARP table:
```
    Time           IP4       Flags      Ethernet              Interface
    728.6616   192.168.16.2     D    08:00:27:9c:0e:9f GigabitEthernet0/8/0
    542.7045   192.168.30.2     S    1a:2b:3c:4d:5e:02 loop0
      1.4241     172.30.1.2     D    86:41:d5:92:fd:24 tapcli-0
     15.2485       10.1.1.2    SN    00:00:00:00:00:02 tapcli-1
    739.2339       10.1.1.3    SN    00:00:00:00:00:02 tapcli-2
    739.4119       10.1.1.4    SN    00:00:00:00:00:02 tapcli-3
```
- NAT configuration (issues related to services):
```
DBGvpp# sh nat44 addresses
NAT44 pool addresses:
tenant VRF independent
NAT44 twice-nat pool addresses:
```
```
vpp# sh nat44 static mappings
NAT44 static mappings:
 tcp local 192.168.42.1:6443 external 10.96.0.1:443 vrf 0  out2in-only
 tcp local 192.168.42.1:12379 external 192.168.42.2:32379 vrf 0  out2in-only
 tcp local 192.168.42.1:12379 external 192.168.16.2:32379 vrf 0  out2in-only
 tcp local 192.168.42.1:12379 external 192.168.42.1:32379 vrf 0  out2in-only
 tcp local 192.168.42.1:12379 external 192.168.16.1:32379 vrf 0  out2in-only
 tcp local 192.168.42.1:12379 external 10.109.143.39:12379 vrf 0  out2in-only
 udp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0  out2in-only
 tcp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0  out2in-only
```
```
vpp# sh nat44 interfaces
GigabitEthernet0/9/0 out
```
```
vpp# sh nat44 sessions
192.168.20.2: 0 dynamic translations, 3 static translations
10.1.1.3: 0 dynamic translations, 0 static translations
10.1.1.4: 0 dynamic translations, 0 static translations
10.1.1.2: 0 dynamic translations, 6 static translations
10.1.2.18: 0 dynamic translations, 2 static translations
```
- ACL config (issues related to policies):
```
vpp# sh acl-plugin acl
```
- "Steal the NIC (STN)" config (issues related to host connectivity when STN is active):
```
next_node: tapcli-0-output (410)
```
- VXLAN tunnel information:
```
vpp# sh vxlan tunnels
```
- Hardware interface information:
```
vpp# sh hardware-interfaces
```
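The show commands above can be collected on a node in one pass. The sketch below assumes `vppctl` on the node can reach the vswitch's CLI; the helper names are illustrative, and the interface and ARP commands are assumed to be `sh int addr` and `sh ip arp` respectively.

```shell
#!/bin/sh
# Hypothetical sketch: run the show commands from this section via vppctl and
# label each command's output. Requires vppctl access to the running VPP.

vpp_show_cmds() {
  # The list of show commands referenced in this section.
  cat <<'EOF'
sh int addr
sh ip fib
sh ip arp
sh nat44 addresses
sh nat44 static mappings
sh nat44 interfaces
sh nat44 sessions
sh acl-plugin acl
sh vxlan tunnels
sh hardware-interfaces
EOF
}

collect_vpp_config() {
  # Label and run each command; redirect the whole thing to a file to attach it.
  vpp_show_cmds | while read -r cmd; do
    echo "### vpp# $cmd"
    vppctl $cmd
  done
}
```

Run `collect_vpp_config > vpp-config-master.txt` on each node and attach the files to the bug report.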
### Basic Example
[contiv-vpp-bug-report.sh][1] is an example of a script that may be a useful starting point for gathering the above information using kubectl.

Limitations:
- The script does not include the STN daemon logs, nor does it handle the special
  case of a crash loop.
- The user specified in the script must have passwordless access to all nodes
  in the cluster; on each node in the cluster, the user must have passwordless
  sudo access.
#### Setting up Prerequisites
To enable logging into a node without a password, copy your public key to the
node as follows:
```
ssh-copy-id <user-id>@<node-name-or-ip-address>
```
To enable running sudo without a password for a given user, edit the sudoers
file (e.g. via `sudo visudo`) and append the following entry, which allows ALL
commands to run without a password for the given user:
```
<userid> ALL=(ALL) NOPASSWD:ALL
```
You can also add user `<user-id>` to group `sudo` and edit the `sudo`
group entry as follows:
```
# Allow members of group sudo to execute any command
%sudo ALL=(ALL:ALL) NOPASSWD:ALL
```
Add user `<user-id>` to group `<group-id>` as follows:
```
sudo adduser <user-id> <group-id>
```
or:
```
usermod -a -G <group-id> <user-id>
```
#### Working with the Contiv-VPP Vagrant Test Bed
The script can be used to collect data from the [Contiv-VPP test bed created with Vagrant][2].
To collect debug information from this Contiv-VPP test bed, do the following:

* In the directory where you created your vagrant test bed, do:
```
vagrant ssh-config > vagrant-ssh.conf
```
* To collect the debug information, do:
```
./contiv-vpp-bug-report.sh -u vagrant -m k8s-master -f <path-to-your-vagrant-ssh-config-file>/vagrant-ssh.conf
```
[1]: https://github.com/contiv/vpp/tree/master/scripts/contiv-vpp-bug-report.sh
[2]: https://github.com/contiv/vpp/blob/master/vagrant/README.md