Ethernet-Be-Gone with RT5350F-OLinuXino-EVB



We proudly announce that we have a new product: the LAN killer 🙂

We found this new feature during our normal development work. If you power up the RT5350F-OLinuXino-EVB while it is connected to the local network and you do not bring up the Ethernet interface within 5 minutes (for example because you are loading new firmware into the SPI Flash or debugging), the local LAN stops working! All attempts to hit sites outside return DNS error.

It’s weird, because if you bring up the Ethernet interface everything is fine. Also, this DNS block does not begin immediately, but about 5 minutes after you power the board while it is connected to the network.

Our internal LAN is a hairball for sure, but we are puzzled as to why this happens.

Can someone with an RT5350F-OLinuXino-EVB try this and confirm our findings?

We did another experiment with the USB-Ethernet-AX88772B: we plugged it into one of our Linux boards with a USB host, such as the A33-OLinuXino, did not bring up the interface, and the same problem was reproduced.

I’m sure this is probably down to our lame intranet settings, but it would be interesting if people with more experience could share what causes this problem.

In the meantime, if you want to piss off your boss or a colleague … you know what to try :D

 

26 Comments

  1. Nikolay Hamov
    Sep 01, 2016 @ 15:38:24

    ARP poisoning?


    • OLIMEX Ltd
      Sep 01, 2016 @ 15:40:39

      how to diagnose it?


      • Nikolay Hamov
        Sep 01, 2016 @ 15:48:24

        Attach a Linux computer to the network and start sniffing ARP packets:
        sudo tcpdump -i enp0s25 -n -vv -e arp
        See which device (MAC address) answers the ARP requests of the other devices. I’ve seen ARP poisoning due to the wrong configuration of a PIX firewall. I don’t know which setting was wrong, as it was configured by another department in a company I worked for in the past; probably something was wrong with the proxy ARP settings.
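A minimal sketch of how the output of that capture could be checked automatically, assuming the ARP replies appear in tcpdump's usual `Reply <ip> is-at <mac>` text form. The function names and all addresses in `SAMPLE` are made up for illustration; an IP address that is answered for by more than one MAC address is the thing to look for.

```python
import re
from collections import defaultdict

# Matches the "Reply <ip> is-at <mac>" part of a tcpdump ARP line (assumed format).
ARP_REPLY = re.compile(
    r"Reply (\d+\.\d+\.\d+\.\d+) is-at ((?:[0-9a-f]{2}:){5}[0-9a-f]{2})",
    re.IGNORECASE,
)


def macs_per_ip(lines):
    """Map each IP address to the set of MAC addresses that claimed it."""
    claims = defaultdict(set)
    for line in lines:
        match = ARP_REPLY.search(line)
        if match:
            claims[match.group(1)].add(match.group(2).lower())
    return dict(claims)


def conflicting_ips(lines):
    """Return only the IPs that more than one MAC address answered for."""
    return {
        ip: sorted(macs)
        for ip, macs in macs_per_ip(lines).items()
        if len(macs) > 1
    }


# Hypothetical capture excerpt; every address here is invented.
SAMPLE = [
    "15:50:01.000001 00:11:22:33:44:55 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), "
    "length 42: Request who-has 192.168.0.1 tell 192.168.0.10, length 28",
    "15:50:01.000200 aa:aa:aa:aa:aa:01 > 00:11:22:33:44:55, ethertype ARP (0x0806), "
    "length 60: Reply 192.168.0.1 is-at aa:aa:aa:aa:aa:01, length 46",
    "15:50:01.000350 bb:bb:bb:bb:bb:02 > 00:11:22:33:44:55, ethertype ARP (0x0806), "
    "length 60: Reply 192.168.0.1 is-at bb:bb:bb:bb:bb:02, length 46",
    "15:50:02.000100 cc:cc:cc:cc:cc:03 > 00:11:22:33:44:55, ethertype ARP (0x0806), "
    "length 60: Reply 192.168.0.20 is-at cc:cc:cc:cc:cc:03, length 46",
]

if __name__ == "__main__":
    print(conflicting_ips(SAMPLE))
```

In practice the lines would come from the saved tcpdump output rather than from `SAMPLE`. Two MACs for one IP is not always an attack (failover setups do it on purpose), so treat a hit as a place to start looking.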

      • Nikolay Hamov
        Sep 01, 2016 @ 15:58:18

        If your switch supports port mirroring, make a mirror of the port where the device is connected and sniff everything on that port. This would be the most useful approach.
        You can also sniff from a computer that is trying to connect to another computer.

        You can also see (again by sniffing the ARP requests) whether the device takes the IP address of the default gateway, but if so, you should have problems only with connections to the outside.

      • Nikolay Hamov
        Sep 01, 2016 @ 16:00:29

        Please read ‘ARP requests and responses’ instead of ‘ARP requests’.

  2. Drone
    Sep 01, 2016 @ 20:22:09

    Are you Serious OLIMEX? You don’t know how to test and diagnose an issue like this? Or maybe this post is Click-Bait (or hopefully, a Joke)?

    Are you shipping production computer and/or peripheral boards without fully testing and validating them under a reference environment? It seems so. You do not publish your validation policies and methods – as far as I can see…


    • 99guspuppet
      Sep 02, 2016 @ 02:20:36

      Drone …… Your comments are appreciated….. and they could be presented in a friendlier fashion. I doubt that Olimex has such deep pockets that they can do the testing that you allude to.


    • Thomas
      Sep 02, 2016 @ 08:10:32

      What are you talking about? They clearly wrote that they run into the same problem when connecting a USB-Ethernet-AX88772B to their network without ‘bringing it up’ (whatever that means).

      There’s something wrong in their network (and it might be interesting for Olimex to identify the device that causes the trouble, which is obviously neither the RT5350F-OLinuXino-EVB nor the USB-Ethernet-AX88772B but something else) and also in their methodology, since deducing ‘the local LAN stops working’ from ‘All attempts to hit sites outside return DNS error’ is the wrong conclusion.

      I would not even start by diagnosing what’s happening around the RT5350F-OLinuXino-EVB, but by looking at where the ‘DNS errors’ originate. If there is an internet access router in place that acts as both DHCP server and DDNS server and simply gets mad because a new MAC address appears on the LAN but no DHCP request happens and no DNS tables can be updated within 5 minutes, and then $internal-bug happens… I would look there first. And if this thing does not run OpenWRT, DD-WRT or something else with open firmware, then I would throw it out of the window and replace it with something better. 🙂

      I’ve seen stuff like that happening at a customer with an IDS (intrusion detection system) in place. A few unknown devices on the net caused the IDS to efficiently DOS the network down. But I doubt Olimex is running an IDS there.


    • Thomas
      Sep 02, 2016 @ 08:11:56

      Oops, DOS is ‘denial of service’ this time. Not the CP/M clone.


    • OLIMEX Ltd
      Sep 02, 2016 @ 08:15:06

      We do *very* extensive testing on every board we have. After each kernel build, every single peripheral is tested with a selected number of our other boards (WiFi dongles, converters, sensors), so we are sure that our boards work correctly with almost everything we have that can be used with this board.

      Is this enough?

      No, it is never enough. We often get questions like: I bought USB sound card XXX and when I connect it to your board it works at a 44.1 kHz sampling rate but not at 48 kHz. Without having the actual hardware in our hands we can’t test. So testing is never enough, but we do our best.

      In this weird case the board has to boot and not bring up the Ethernet interface for several minutes. This does not happen when the board works normally, and we had not thought that there would be a problem in such a corner case.

      Testing is our bottleneck: two developers do this all day long and it is still the slowest part of our development process. We have many boards in the pipeline, like the ESP8266 IoT platform, MT7620-OLinuXino, A64-OLinuXino, AM3359-SOM, RT5350-DIN, etc., whose hardware has been ready for a very long time; building and testing reliable software is the only thing holding us back from releasing them.


      • SK
        Sep 02, 2016 @ 09:33:04

        Maybe time to recruit some more Software Engineers, Linux & embedded gurus, software QAs, etc.🙂

    • bib
      Sep 02, 2016 @ 12:02:02

      Are you serious drone ?


    • LinuxUser
      Sep 06, 2016 @ 23:27:14

      Hey, Drone, everyone can run into a problem and there is nothing wrong with asking the ppl around for help. That’s what a community is for, btw. Pretending one knows everything is lame; there is always a chance someone knows a particular corner case better, and it is a really smart idea to use this option. That’s what Olimex did. Only stubborn fools do it the other way, eventually wasting plenty of time fighting something which happens to be known to others.


  3. Jerome Flesch
    Sep 01, 2016 @ 23:38:19

    It looks like the symptoms of an Ethernet loop.

    Maybe your Ethernet interface is sending back a copy of every Ethernet frame it receives? In that case, ARP broadcast packets (used for network autodetection and stuff) will keep looping between your switch and your Ethernet interface. Each time they hit the switch again, they will be sent to all the systems connected to the switch. These broadcast packets keep accumulating until the network is overloaded.

    You can check that quite easily: on any computer connected to the network, with no network program running (no web browser, for instance), start a ‘tcpdump -nei eth0’. Before reproducing the problem, you should see a bunch of packets from time to time, but not a lot. When reproducing the problem, the number of packets shown by tcpdump should increase progressively and visibly.
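The "increase progressively and visibly" check can also be done on numbers instead of by eye. The sketch below assumes you already have packet arrival times in seconds (for example taken from the timestamps tcpdump prints); the window size, the growth factor and both function names are arbitrary choices for illustration, not tuned values.

```python
from collections import Counter


def packets_per_window(timestamps, window_seconds=10.0):
    """Count packets in consecutive fixed-size windows, starting at the first packet."""
    if not timestamps:
        return []
    start = min(timestamps)
    counts = Counter(int((t - start) // window_seconds) for t in timestamps)
    # Fill in empty windows with 0 so the list has no gaps.
    return [counts.get(i, 0) for i in range(max(counts) + 1)]


def looks_like_storm(counts, factor=10):
    """Heuristic: the last window holds at least `factor` times the first window's packets."""
    if len(counts) < 2:
        return False
    baseline = max(counts[0], 1)  # avoid comparing against zero
    return counts[-1] >= factor * baseline


if __name__ == "__main__":
    # Made-up arrival times: 2 packets, then 5, then 40 in three 10-second windows.
    times = [0.0, 5.0] + [10 + i for i in range(5)] + [20 + i * 0.2 for i in range(40)]
    counts = packets_per_window(times)
    print(counts, looks_like_storm(counts))
```

A loop usually shows up as counts that keep climbing from window to window, while normal background traffic stays roughly flat.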


  4. Morgaine
    Sep 02, 2016 @ 03:00:58

    Are you sure that you’re in the right place, Drone? You appear to be lost.

    This is an OSHW manufacturer’s site, on which the hardware experts talk openly to the software experts and vice versa and continually help each other out, and on which full openness about designs and faults is valued immensely. Success in this community is achieved through open collaboration and helpful suggestions, not destructive criticism.

    That doesn’t appear to be what you’re looking for. Perhaps a closed and proprietary manufacturer who will not inform you of any problems they encounter would suit you better. I expect you’ll feel much happier left in the dark.


  5. Nikolay Hamov
    Sep 02, 2016 @ 09:43:13

    You can also check whether this board takes the IP address of your DNS server after 5 minutes with its interface down.

    I have to comment on the term ‘bring up’ [an interface]. While ‘to bring up’ means ‘to vomit’, this term is used as ‘to enable’ [an interface]. I was once criticised by the doc team for using the term ‘to bring up’. But yes, this term is documented 🙂 http://tldp.org/HOWTO/Linux+IPv6-HOWTO/x1021.html . When used for a device in a network context, it means to enable its network interface.


    • OLIMEX Ltd
      Sep 02, 2016 @ 09:51:08

      I guess it comes from “ifup” command🙂 i.e. interface up


    • Thomas
      Sep 02, 2016 @ 10:00:18

      I know what the term means in general. I just wanted to point out that starting from the opposite direction works faster. If the only issue is non-working DNS (not ‘LAN stopped working’), I would start here, on a machine that shows the symptoms, and not on or around the machine that is blamed for the problem just because adding it to the net is the only visible change.

      Take a Linux or OS X machine (sorry, no idea about Windows), look in /etc/resolv.conf for the IP address of the DNS server, do a ping to this address and a ping to www.google.com and then an ‘arp -a’. Repeat the same 6 minutes after the ‘malicious device’ has been added to the net without bringing up the interface there.

      If the MAC address that ARP reports for the DNS server differs, then it’s the ‘malicious device’ that has taken the address; if it does not differ and pinging www.google.com now fails, the DNS server has stopped working. And if DNS and DHCP are done by the same machine, I would look there first to see what happened.
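The before/after comparison can be sketched like this, assuming the contents of /etc/resolv.conf and two saved `arp -a` outputs are available as text. The `? (ip) at mac ...` line shape is what Linux and OS X commonly print, but treat it as an assumption; the function names and all addresses are invented for illustration.

```python
import re

# Matches "(<ip>) at <mac>"; entries shown as incomplete do not match.
ARP_ENTRY = re.compile(r"\((\d+\.\d+\.\d+\.\d+)\) at ([0-9A-Fa-f:]+)")


def nameservers(resolv_conf_text):
    """Return the addresses on 'nameserver' lines of a resolv.conf file."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def arp_table(arp_output):
    """Turn 'arp -a' text into an {ip: mac} dictionary."""
    return {ip: mac.lower() for ip, mac in ARP_ENTRY.findall(arp_output)}


def changed_macs(before, after):
    """Return {ip: (old_mac, new_mac)} for IPs whose MAC differs between two tables."""
    return {
        ip: (before[ip], after[ip])
        for ip in before
        if ip in after and before[ip] != after[ip]
    }


# Hypothetical snapshots taken before and ~6 minutes after plugging in the device.
RESOLV = "# generated file\nnameserver 192.168.0.1\nsearch lan\n"
BEFORE = (
    "? (192.168.0.1) at aa:aa:aa:aa:aa:01 [ether] on eth0\n"
    "? (192.168.0.20) at cc:cc:cc:cc:cc:03 [ether] on eth0\n"
)
AFTER = (
    "? (192.168.0.1) at bb:bb:bb:bb:bb:02 [ether] on eth0\n"
    "? (192.168.0.20) at cc:cc:cc:cc:cc:03 [ether] on eth0\n"
)

if __name__ == "__main__":
    changed = changed_macs(arp_table(BEFORE), arp_table(AFTER))
    for server in nameservers(RESOLV):
        print(server, "MAC changed" if server in changed else "MAC unchanged")
```

If the DNS server's entry is unchanged and name resolution still fails, that points at the DNS server itself rather than at an address takeover.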


  6. __BriKs__
    Sep 02, 2016 @ 23:38:13

    So did you find the root cause?


  7. Petr Moses
    Sep 04, 2016 @ 16:18:39

    Tsvetan: we (or our provider) recently had a big problem with hacked Mikrotik routers. The DNS on the router not only faked queries to paypal.com, but even rejected DNS queries sent to other DNS servers, e.g. Google DNS at IP 8.8.8.8.

    As for the testing bottleneck: after prototyping the first boards, you could outsource part of the testing to your customers. Your testers could prepare a manual on what to test, and the outsourcers could share their results on Bugzilla. Maybe there are customers who are willing to share the burden of developing new products; maybe they can’t pay more money, but they can share their time and expertise.

    Maybe you could share your early efforts with the linux-sunxi, openwrt and ddwrt communities.


  8. Poul-Henning Kamp
    Sep 05, 2016 @ 01:32:21

    This sounds a lot like the good old “flow-control-storm” issue: The unconfigured interface ends up blasting ethernet level flow-control packets until all traffic stalls.
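For anyone who wants to look for this in a raw capture: Ethernet flow control uses IEEE 802.3x PAUSE frames, which are MAC Control frames (EtherType 0x8808) with opcode 0x0001, normally sent to the reserved multicast address 01:80:c2:00:00:01. The sketch below only recognises such a frame from its raw bytes; the function names and the source address are made up, and many NICs consume PAUSE frames in hardware, so a sniffer may never see them.

```python
import struct

PAUSE_DEST = bytes.fromhex("0180c2000001")  # reserved multicast address for PAUSE
MAC_CONTROL_ETHERTYPE = 0x8808
PAUSE_OPCODE = 0x0001


def parse_pause_frame(frame):
    """Return the pause time in quanta if `frame` is a PAUSE frame, else None."""
    if len(frame) < 18:  # 6 dst + 6 src + 2 ethertype + 2 opcode + 2 pause time
        return None
    ethertype, opcode, quanta = struct.unpack("!HHH", frame[12:18])
    if ethertype != MAC_CONTROL_ETHERTYPE or opcode != PAUSE_OPCODE:
        return None
    return quanta


def build_pause_frame(source_mac, quanta):
    """Build a minimal 60-byte PAUSE frame (without FCS), used here for illustration only."""
    header = PAUSE_DEST + source_mac
    body = struct.pack("!HHH", MAC_CONTROL_ETHERTYPE, PAUSE_OPCODE, quanta)
    return header + body + bytes(42)  # zero padding up to the minimum frame size


if __name__ == "__main__":
    src = bytes.fromhex("aaaaaaaaaa01")  # invented source address
    print(parse_pause_frame(build_pause_frame(src, 0xFFFF)))
```

One pause quantum is 512 bit times, so a steady stream of frames carrying the maximum value 0xFFFF would keep the link partner from transmitting for as long as the stream lasts.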


  9. LinuxUser
    Sep 06, 2016 @ 23:23:44

    The obvious thing to try is to launch a sniffer on the other side (e.g. using another Linux computer) in promiscuous mode and take a look at which packets are going in and out. This may or may not work, though.

    Also, it could be useful to take a look at what mii-tool and ethtool report about what’s going on. This could give some ideas about low-level, link-level issues and so on, but these tools are somewhat picky: they only work on “real” Ethernet things and PHYs, and only if their drivers support the extensions needed to get such low-level details on the inner workings. Most drivers do, though. But I do not know how your 5350 implements Ethernet, unfortunately.
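If ethtool does work on the interface, its report can be reduced to the few fields that matter here. This is a rough sketch that assumes the usual `Key: value` layout of `ethtool <iface>` output; the function names and the sample text are illustrative and not taken from the board in question.

```python
def parse_ethtool(output):
    """Collect 'Key: value' lines from ethtool output into a dictionary."""
    settings = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():  # skip headings and continuation lines
            settings[key.strip()] = value.strip()
    return settings


def link_summary(output):
    """Pick out link state, speed and duplex from ethtool output."""
    settings = parse_ethtool(output)
    return {
        "link": settings.get("Link detected") == "yes",
        "speed": settings.get("Speed"),
        "duplex": settings.get("Duplex"),
    }


# Hypothetical output for an interface with a negotiated 100 Mb/s link.
SAMPLE_UP = """Settings for eth0:
\tSupported ports: [ TP MII ]
\tSpeed: 100Mb/s
\tDuplex: Full
\tAuto-negotiation: on
\tLink detected: yes
"""

if __name__ == "__main__":
    print(link_summary(SAMPLE_UP))
```

Comparing this summary on the peer side while the suspect interface is down and while it is up would show whether the link is really dropped in the down state.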


    • Thomas
      Sep 07, 2016 @ 08:57:03

      Why should it be interesting how ‘5350 implements Ethernet’ if it seems the same symptom can be triggered by a USB dongle that is known to work well everywhere else? Lesson N°1 I learned from doing paid consultancy on network issues is to never trust what the customer tells you ‘is the problem’ (‘local LAN stops working!’) but to try to figure out what’s really going on (‘All attempts to hit sites outside return DNS error’, which is something completely different).

      Without doing that you only waste your time, since normally you’ve been sent down the wrong track, which is nice of course when you’re paid by the hour (not the case here) but can also be frustrating.


      • LinuxUser
        Sep 14, 2016 @ 20:56:44

        Hey, Thomas, if your reply about 5350 & Ethernet was for me:

        1) mii-tool and ethtool are low-level things. They can show useful stuff about how the lowest level performs, down to the physical link state (e.g. whether it negotiated a link at all, features in use, …). This could give some clues. Say, some HW dislikes link power management and could go nuts. And so on. But the mentioned tools are picky about hardware and drivers. To make it more fun, “router” SoCs similar to the 5350 often come with a built-in on-chip switch, and if they implement the Eth port as a switch port rather than as CPU Ethernet, tools like mii-tool or ethtool are just not going to work and it’s rather up to stuff like swconfig to chew on the physical layer state (e.g. https://wiki.openwrt.org/doc/techref/swconfig). In this case mileage may vary, since it requires some support in software.

        2) TBH I do not have this board, so I do not know exactly how it implements Ethernet (i.e. whether it is just CPU Ethernet or a built-in switch port, etc). I guess the best solution is to look into the schematics, but since I do not even have this thing, my motivation to do so is quite weak and I have plenty of other stuff to chew on.

        From what I’ve observed, most “proper” Ethernet drivers ensure their HW actually goes down when the “interface is not configured” (i.e. DOWN in Linux terms). So there is not even a carrier on the wire and the link is supposed to be completely down. So it is not supposed to cause more issues than an unplugged wire. But that’s in the notion of Linux and Ethernet.

        And uhm, you’re perfectly right that one does not have to trust what the customer tells and so on. I’ve just “advertised” a few tools which could shed some light on what’s going on at the lowest levels. After all, Olimex ppl aren’t noobs, so they could try it. Either way, you’re 100% right that the DNS error is unconvincing and could mean anything.

        Say, if the network completely failed, DNS could be the first thing which a device attempts and which fails. This tells nothing about how badly the network is broken. Say, do packets sent by IP go over the LAN? Or further, into the Internet? Sometimes it could be even more obscure.

  10. Freek
    Sep 26, 2016 @ 15:41:28

    Hi guys,
    I am about to file a bug report about a problem with the Ethernet connection on my A20-OLinuXIno-LIME2 board that was introduced with Jessie, and after reading this conversation I suspect it may be related; see also https://www.olimex.com/forum/index.php?topic=5503.new;topicseen#new.

    I will file the bug report as soon as I have figured out where I can do so.

    To be continued,
    Freek

